SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2310.03684

PDF

https://arxiv.org/pdf/2310.03684

Paper Information

Authors: Alexander Robey; Eric Wong; Hamed Hassani; George J. Pappas
Published: 10-6-2023
Updated: 6-12-2024
Affiliation: University of Pennsylvania
Country: United States of America
Conference: Transactions on Machine Learning Research (TMLR)

Labels Estimated by AI

Defense Method; Prompt Injection; LLM Performance Evaluation

These labels were automatically added by AI and may be inaccurate.
For details, see About Literature Database.

Abstract

Despite efforts to align large language models (LLMs) with human intentions, widely-used LLMs such as GPT, Llama, and Claude are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content. To address this vulnerability, we propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks. Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs. Across a range of popular LLMs, SmoothLLM sets the state-of-the-art for robustness against the GCG, PAIR, RandomSearch, and AmpleGCG jailbreaks. SmoothLLM is also resistant against adaptive GCG attacks, exhibits a small, though non-negligible trade-off between robustness and nominal performance, and is compatible with any LLM. Our code is publicly available at https://github.com/arobey1/smooth-llm.
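
The perturb-then-aggregate idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation (see the linked repository for that). Every name here (`perturb`, `is_jailbroken`, `smooth_respond`, `REFUSAL_MARKERS`), the random character-swap perturbation, the keyword-based judge, and the default values of `n_copies` and `q` are assumptions chosen for illustration. The `llm` argument stands in for any callable that maps a prompt string to a response string.

```python
import random
import string
from typing import Callable

# Hypothetical refusal markers; a real judge would be far more careful.
REFUSAL_MARKERS = ("I'm sorry", "I cannot", "I can't")


def perturb(prompt: str, q: float, rng: random.Random) -> str:
    """Overwrite a fraction q of character positions with random printable characters."""
    chars = list(prompt)
    n_swap = int(len(chars) * q)
    for i in rng.sample(range(len(chars)), n_swap):
        chars[i] = rng.choice(string.printable)
    return "".join(chars)


def is_jailbroken(response: str) -> bool:
    """Toy judge (assumption): a response counts as jailbroken if it contains no refusal marker."""
    return not any(marker in response for marker in REFUSAL_MARKERS)


def smooth_respond(
    prompt: str,
    llm: Callable[[str], str],
    n_copies: int = 10,
    q: float = 0.1,
    seed: int = 0,
) -> str:
    """Query llm on n_copies perturbed prompts and return a response that agrees with the majority vote."""
    rng = random.Random(seed)
    responses = [llm(perturb(prompt, q, rng)) for _ in range(n_copies)]
    flags = [is_jailbroken(r) for r in responses]
    # Strict majority is required to treat the input as a successful jailbreak;
    # ties fall back to the refusing side.
    majority = sum(flags) > n_copies / 2
    consistent = [r for r, f in zip(responses, flags) if f == majority]
    return rng.choice(consistent)


if __name__ == "__main__":
    # Stub model (assumption): complies only when an exact trigger string survives perturbation.
    def toy_llm(p: str) -> str:
        return "Sure, here is how." if "!!TRIGGER!!" in p else "I'm sorry, I cannot help with that."

    print(smooth_respond("Do something harmful !!TRIGGER!!", toy_llm, n_copies=11, q=0.2))
```

The sketch relies on the brittleness the abstract reports: if an adversarial suffix stops working once a few characters change, most perturbed copies elicit a refusal, and the majority vote returns one of those refusals. Increasing `q` or `n_copies` would be expected to trade nominal performance for robustness, in line with the trade-off the abstract mentions.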

External Datasets

AdvBench

JBB-Behaviors

InstructionFollowing

PIQA

OpenBookQA

ToxiGen

harmful_behaviors.csv