Abstract
The wide adoption of Large Language Models (LLMs) has drawn significant
attention to $\textit{jailbreak}$ attacks, where adversarial prompts crafted
through optimization or manual design exploit LLMs to generate malicious
content. However, optimization-based attacks have limited efficiency and
transferability, while existing manual designs are either easily detectable or
demand intricate interactions with LLMs. In this paper, we first point out a
novel perspective for jailbreak attacks: LLMs are more responsive to
$\textit{positive}$ prompts. Based on this, we deploy the Happy Ending Attack
(HEA), which wraps a malicious request in a scenario template containing a
positive prompt formed mainly via a $\textit{happy ending}$; this fools LLMs
into jailbreaking either immediately or at a follow-up malicious request. This
makes HEA both efficient and effective, as it requires at most two turns to
fully jailbreak LLMs. Extensive experiments show that HEA can successfully
jailbreak state-of-the-art LLMs, including GPT-4o, Llama3-70b, and Gemini-pro,
achieving an average attack success rate of 88.79%. We also provide
quantitative explanations for the success of HEA.