These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
Jailbreaking is an emerging adversarial attack that bypasses the safety
alignment deployed in off-the-shelf large language models (LLMs). A
considerable amount of research exists proposing more effective jailbreak
attacks, including the recent Greedy Coordinate Gradient (GCG) attack,
jailbreak template-based attacks such as using "Do-Anything-Now" (DAN), and
multilingual jailbreak. In contrast, the defensive side has been relatively
less explored. This paper proposes a lightweight yet practical defense called
SELFDEFEND, which can defend against all existing jailbreak attacks with
minimal delay for jailbreak prompts and negligible delay for normal user
prompts. Our key insight is that regardless of the kind of jailbreak strategies
employed, they eventually need to include a harmful prompt (e.g., "how to make
a bomb") in the prompt sent to LLMs, and we found that existing LLMs can
effectively recognize such harmful prompts that violate their safety policies.
Based on this insight, we design a shadow stack that concurrently checks
whether a harmful prompt exists in the user prompt and triggers a checkpoint in
the normal stack once a token of "No" or a harmful prompt is output. The latter
could also generate an explainable LLM response to adversarial prompts. We
demonstrate our idea of SELFDEFEND works in various jailbreak scenarios through
manual analysis in GPT-3.5/4. We also list three future directions to further
enhance SELFDEFEND.