Large Language Models (LLMs) have become increasingly integral to a wide
range of applications. However, they remain vulnerable to jailbreak
attacks, in which attackers craft adversarial prompts that elicit malicious
outputs from the models. Analyzing jailbreak methods can deepen our
understanding of LLMs' weaknesses and guide improvements to their
robustness. In this paper, we reveal a vulnerability in LLMs, which we term
Defense Threshold Decay (DTD), identified by analyzing the attention
weights that the model's output places on the input and that subsequent
outputs place on prior outputs: as the model generates substantial benign
content, its attention weights shift from the input to the prior output,
making it more susceptible to jailbreak attacks. To demonstrate the
exploitability of DTD, we
propose a novel jailbreak attack method, Sugar-Coated Poison (SCP), which
first induces the model to generate substantial benign content through
benign inputs and adversarial reasoning, and then leads it to produce
malicious content. To
mitigate such attacks, we introduce a simple yet effective defense strategy,
POSD, which significantly reduces jailbreak success rates while preserving the
model's generalization capabilities.
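
To make the DTD observation concrete, below is a minimal sketch (not the
authors' code) of how the described attention shift could be measured with
the Hugging Face transformers library; the model name, prompt, generation
length, and layer/head averaging scheme are illustrative assumptions:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Illustrative choices; any causal LM exposing attentions works.
    model_name = "gpt2"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    prompt = "Write a short story about a helpful robot."  # benign input
    inputs = tokenizer(prompt, return_tensors="pt")
    n_prompt = inputs["input_ids"].shape[1]

    with torch.no_grad():
        # Generate benign content, then re-run a full forward pass to
        # collect attention maps over the whole sequence.
        gen_ids = model.generate(**inputs, max_new_tokens=64,
                                 do_sample=False,
                                 pad_token_id=tokenizer.eos_token_id)
        out = model(gen_ids, output_attentions=True)

    # out.attentions: one (batch, heads, seq, seq) tensor per layer.
    # Average over layers and heads, then for each generated position
    # compare attention mass on the input prompt with attention mass on
    # previously generated tokens.
    attn = torch.stack(out.attentions).mean(dim=(0, 2))[0]  # (seq, seq)
    for t in range(n_prompt, gen_ids.shape[1]):
        on_input = attn[t, :n_prompt].sum().item()
        on_prior_output = attn[t, n_prompt:t].sum().item()
        print(f"token {t - n_prompt:3d}: input={on_input:.3f}  "
              f"prior_output={on_prior_output:.3f}")

Under DTD, the attention mass on the input would be expected to decline
relative to the mass on prior output as benign generation lengthens.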