Abstract
The wide adoption of Large Language Models (LLMs) has drawn significant
attention to $\textit{jailbreak}$ attacks, where adversarial prompts crafted
through optimization or manual design exploit LLMs to generate malicious
content. However, optimization-based attacks have limited efficiency and
transferability, while existing manual designs are either easily detectable or
demand intricate interactions with LLMs. In this paper, we first point out a
novel perspective for jailbreak attacks: LLMs are more responsive to
$\textit{positive}$ prompts. Based on this, we deploy the Happy Ending Attack
(HEA), which wraps a malicious request in a scenario template containing a
positive prompt formed mainly via a $\textit{happy ending}$; this fools LLMs
into jailbreaking either immediately or at a follow-up malicious request. This
makes HEA both efficient and effective, as it requires at most two turns to
fully jailbreak LLMs. Extensive experiments show that HEA can successfully
jailbreak state-of-the-art LLMs, including GPT-4o, Llama3-70b, and Gemini-pro,
achieving an average attack success rate of 88.79%. We also provide
quantitative explanations for the success of HEA.