Large Language Model (LLM) jailbreaking refers to a class of attacks that aim to
bypass an LLM's safeguards and elicit content inconsistent with its safe-usage
guidelines. Based on insights into the self-attention computation process, this
paper proposes a novel black-box jailbreak approach that crafts a payload prompt
by strategically injecting the prohibited query into a carrier article. The
carrier article maintains semantic proximity to the prohibited query; it is
produced automatically by combining a hypernym article and a context, both of
which are generated from the prohibited query. The intuition behind using a
carrier article is to activate the neurons in the model related to the semantics
of the prohibited query while suppressing the neurons that would otherwise be
triggered by the objectionable text.
The carrier article itself is benign; we leverage prompt injection techniques to
embed the prohibited query into it and produce the payload prompt. We evaluate
our approach on JailbreakBench, testing against four target models across 100
distinct jailbreak objectives.
The experimental results demonstrate the effectiveness of our method, which
achieves an average success rate of 63% across all target models, significantly
outperforming existing black-box jailbreak methods.
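
To make the pipeline concrete, the following is a minimal sketch of the payload
construction described above. It is an illustration only: the generate() helper,
the prompt wordings, and the injection template are all assumptions for
exposition, not the paper's actual implementation.

```python
def generate(prompt: str) -> str:
    """Placeholder for a call to any text-generation LLM API (hypothetical)."""
    raise NotImplementedError("wire this to an LLM client of your choice")


def build_payload(prohibited_query: str) -> str:
    """Assemble the carrier-article payload for a prohibited query."""
    # Step 1: generate a hypernym article, i.e., a benign article on a
    # broader topic that subsumes the query (prompt wording is assumed).
    hypernym_article = generate(
        "Write a neutral encyclopedic article about the general topic "
        f"underlying: {prohibited_query}"
    )
    # Step 2: generate a context that keeps the carrier semantically
    # close to the query (prompt wording is assumed).
    context = generate(
        f"Write a short scenario closely related to: {prohibited_query}"
    )
    # Step 3: combine both parts into the benign carrier article.
    carrier_article = f"{hypernym_article}\n\n{context}"
    # Step 4: inject the prohibited query into the carrier with a simple
    # prompt-injection template (the paper's template may differ).
    return (
        f"{carrier_article}\n\n"
        f"Continuing the article above, address: {prohibited_query}"
    )
```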