Abstract
Jailbreak attacks can be used to probe the vulnerabilities of Large Language
Models (LLMs) by inducing them to generate harmful content. The most common
form of such attacks constructs semantically ambiguous prompts to confuse and
mislead the LLM. To assess this security risk and reveal the intrinsic
relation between the input prompt and the output of an LLM, the distribution
of attention weights is introduced to analyze the underlying mechanism. Using
statistical analysis, several novel metrics are defined to characterize the
attention-weight distribution: the Attention Intensity on Sensitive Words
(Attn_SensWords), the Attention-based Contextual Dependency Score
(Attn_DepScore), and the Attention Dispersion Entropy (Attn_Entropy). By
leveraging the distinct characteristics of these metrics together with a beam
search algorithm, and inspired by the military strategy "Feint and Attack", an
effective jailbreak attack strategy named Attention-Based Attack (ABA) is
proposed. ABA employs nested attack prompts to divert the attention
distribution of the LLM, so that the more harmless parts of the input attract
most of the model's attention. In addition, motivated by ABA, an effective
defense strategy called Attention-Based Defense (ABD) is also put forward. In
contrast to ABA, ABD enhances the robustness of LLMs by calibrating the
attention distribution over the input prompt. Comparative experiments are
presented to demonstrate the effectiveness of ABA and ABD; both can therefore
be used to assess the security of LLMs. The experimental results also provide
a logical explanation of how the distribution of attention weights strongly
influences the output of an LLM.
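The abstract does not define Attn_Entropy precisely. As a rough illustration only, assuming it is the Shannon entropy of a token-level attention distribution (a common choice for measuring attention dispersion; the function name `attn_entropy` is hypothetical), a minimal sketch might look like:

```python
import numpy as np

def attn_entropy(attn_weights):
    """Shannon entropy of a normalized attention distribution.

    Higher entropy means attention is dispersed across many tokens;
    lower entropy means attention is concentrated on a few tokens.
    """
    w = np.asarray(attn_weights, dtype=float)
    p = w / w.sum()        # normalize to a probability distribution
    p = p[p > 0]           # drop zeros, using the convention 0 * log 0 = 0
    return float(-(p * np.log(p)).sum())

# Attention concentrated on one token gives low entropy;
# uniform attention over n tokens gives the maximum, log(n).
focused = attn_entropy([0.97, 0.01, 0.01, 0.01])
uniform = attn_entropy([0.25, 0.25, 0.25, 0.25])
```

Under this reading, a nested "feint" prompt that pulls attention toward harmless tokens would shift such a metric relative to a direct harmful prompt, which is the kind of signal ABA exploits and ABD calibrates against.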