These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
Large Language Models (LLMs) remain vulnerable to jailbreak attacks, which
attempt to elicit harmful responses from LLMs. The evolving nature and
diversity of these attacks pose many challenges for defense systems, including
(1) adaptation to counter emerging attack strategies without costly retraining,
and (2) control of the trade-off between safety and utility. To address these
challenges, we propose Retrieval-Augmented Defense (RAD), a novel framework for
jailbreak detection that incorporates a database of known attack examples into
Retrieval-Augmented Generation, which is used to infer the underlying,
malicious user query and jailbreak strategy used to attack the system. RAD
enables training-free updates for newly discovered jailbreak strategies and
provides a mechanism to balance safety and utility. Experiments on StrongREJECT
show that RAD substantially reduces the effectiveness of strong jailbreak
attacks such as PAP and PAIR while maintaining low rejection rates for benign
queries. We propose a novel evaluation scheme and show that RAD achieves a
robust safety-utility trade-off across a range of operating points in a
controllable manner.