Abstract
Security alignment gives Large Language Models (LLMs) protection against
malicious queries, but a variety of jailbreak attack methods reveal the
vulnerability of this security mechanism. Previous studies have treated LLM
jailbreak attacks and defenses in isolation. We analyze the security protection
mechanism of LLMs and propose a framework that combines attack and defense.
Our method is based on the linear separability of LLM intermediate-layer
embeddings, as well as the essence of a jailbreak attack, which aims to
embed harmful queries and move them into the safe region. We use a
generative adversarial network (GAN) to learn the security judgment boundary
inside the LLM and thereby achieve efficient jailbreak attack and defense. The
experimental results indicate that our method achieves an average jailbreak
success rate of 88.85% across three popular LLMs, while the defense success
rate on the state-of-the-art jailbreak dataset reaches an average of 84.17%.
This not only validates the effectiveness of our approach but also sheds light
on the internal security mechanisms of LLMs, offering new insights for
enhancing model security. The code and data are available at
https://github.com/NLPGM/CAVGAN.
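
The two ideas the abstract relies on, linearly separable intermediate-layer embeddings and moving a harmful embedding into the safe region, can be illustrated with a toy sketch. This is not the CAVGAN implementation: it uses synthetic Gaussian vectors in place of real LLM hidden states, and a plain logistic-regression probe with a closed-form shift in place of the GAN. All function names and parameters below are hypothetical and chosen only for illustration.

```python
import numpy as np


def make_toy_embeddings(n=200, dim=16, seed=0):
    """Synthetic stand-ins for intermediate-layer embeddings (assumption:
    benign and harmful queries form two clusters split along one direction)."""
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    benign = rng.normal(size=(n, dim)) - 3.0 * direction
    harmful = rng.normal(size=(n, dim)) + 3.0 * direction
    X = np.vstack([benign, harmful])
    y = np.concatenate([np.zeros(n), np.ones(n)])  # 1 = harmful
    return X, y


def train_linear_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b


def probe_predict(X, w, b):
    """Return 1 (harmful) where the probe logit is positive, else 0."""
    return (X @ w + b > 0).astype(int)


def shift_across_boundary(x, w, b, margin=1.0):
    """Smallest-norm shift that places x at logit -margin on the benign side.
    A closed-form stand-in for a learned perturbation generator."""
    logit = x @ w + b
    if logit <= -margin:
        return x.copy()
    return x - (logit + margin) * w / (w @ w)


if __name__ == "__main__":
    X, y = make_toy_embeddings()
    w, b = train_linear_probe(X, y)
    print("probe accuracy:", np.mean(probe_predict(X, w, b) == y))
    x_harmful = X[y == 1][0]
    x_shifted = shift_across_boundary(x_harmful, w, b)
    print("label before shift:", probe_predict(x_harmful[None, :], w, b)[0])
    print("label after shift:", probe_predict(x_shifted[None, :], w, b)[0])
```

In this sketch the same probe serves both roles: the shift plays the part of the attack, while running `probe_predict` on an incoming embedding plays the part of a detector on the defense side.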