Abstract
Large language models (LLMs) are being rapidly developed, and a key component
of their widespread deployment is their safety-related alignment. Many
red-teaming efforts aim to jailbreak LLMs; among them, the success of the
Greedy Coordinate Gradient (GCG) attack has led to growing interest
in the study of optimization-based jailbreaking techniques. Although GCG is a
significant milestone, its attacking efficiency remains unsatisfactory. In this
paper, we present several improved (empirical) techniques for
optimization-based jailbreaks like GCG. We first observe that the single target
template of "Sure" largely limits the attacking performance of GCG; given this,
we propose to apply diverse target templates containing harmful self-suggestion
and/or guidance to mislead LLMs. In addition, on the optimization side, we
propose an automatic multi-coordinate updating strategy in GCG (i.e.,
adaptively deciding how many tokens to replace in each step) to accelerate
convergence, as well as tricks like easy-to-hard initialization. We then
combine these improved techniques to develop an efficient jailbreak method,
dubbed I-GCG. In our experiments, we evaluate it on a series of benchmarks (such
as the NeurIPS 2023 Red Teaming Track). The results demonstrate that our improved
techniques can help GCG outperform state-of-the-art jailbreaking attacks and
achieve nearly 100% attack success rate. The code is released at
https://github.com/jiaxiaojunQAQ/I-GCG.