Large Language Model (LLM) jailbreaking refers to a class of attacks that aim to
bypass an LLM's safeguards and elicit content inconsistent with its safe-usage
guidelines. Based on insights into the self-attention computation process, this
paper proposes a novel black-box jailbreak approach that crafts a payload prompt
by strategically injecting the prohibited query into a carrier article. The
carrier article maintains semantic proximity to the prohibited query; it is
produced automatically by combining a hypernym article and a context, both of
which are generated from the prohibited query. The intuition behind using a
carrier article is to activate the neurons in the model related to the semantics
of the prohibited query while suppressing the neurons that would otherwise be
triggered by the objectionable text.
The carrier article itself is benign; we leverage prompt injection techniques to
embed the prohibited query into it and produce the payload prompt. We evaluate
our approach on JailbreakBench, testing against four target models across 100
distinct jailbreak objectives.
The experimental results demonstrate the effectiveness of our method, which
achieves an average success rate of 63% across all target models, significantly
outperforming existing black-box jailbreak methods.
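
To make the pipeline concrete, the following is a minimal sketch of the payload
construction described above. It is an illustration only: the generate() helper,
the prompt wordings, and the injection template are all assumptions for
exposition, not the paper's actual implementation.

```python
def generate(prompt: str) -> str:
    """Placeholder for a call to any text-generation LLM API (hypothetical)."""
    raise NotImplementedError("wire this to an LLM client of your choice")


def build_payload(prohibited_query: str) -> str:
    """Assemble the carrier-article payload for a prohibited query."""
    # Step 1: generate a hypernym article, i.e., a benign article on a
    # broader topic that subsumes the query (prompt wording is assumed).
    hypernym_article = generate(
        "Write a neutral encyclopedic article about the general topic "
        f"underlying: {prohibited_query}"
    )
    # Step 2: generate a context that keeps the carrier semantically
    # close to the query (prompt wording is assumed).
    context = generate(
        f"Write a short scenario closely related to: {prohibited_query}"
    )
    # Step 3: combine both parts into the benign carrier article.
    carrier_article = f"{hypernym_article}\n\n{context}"
    # Step 4: inject the prohibited query into the carrier with a simple
    # prompt-injection template (the paper's template may differ).
    return (
        f"{carrier_article}\n\n"
        f"Continuing the article above, address: {prohibited_query}"
    )
```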