Defending against Jailbreak through Early Exit Generation of Large Language Models

arXiv

AI Security Portal bot

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2408.11308

PDF

https://arxiv.org/pdf/2408.11308

Publication details

Authors: Chongwen Zhao, Zhihao Dou, Kaizhu Huang
Published: 2024-8-21
Updated: 2025-8-25
Affiliation: Duke Kunshan University
Country of affiliation: China
Conference: International Conference on Neural Information Processing (ICONIP)

Labels estimated by AI

Prompt injection, LLM security, Defense methods

Note: These labels were added automatically by AI and may therefore be inaccurate.
For details, see "About the Literature Database".

Abstract

Large Language Models (LLMs) are increasingly attracting attention in various applications. Nonetheless, there is a growing concern as some users attempt to exploit these models for malicious purposes, including the synthesis of controlled substances and the propagation of disinformation. In an effort to mitigate such risks, the concept of "Alignment" technology has been developed. However, recent studies indicate that this alignment can be undermined using sophisticated prompt engineering or adversarial suffixes, a technique known as "Jailbreak." Our research takes cues from the human-like generation process of LLMs. We identify that while jailbreaking prompts may yield output logits similar to benign prompts, their initial embeddings within the model's latent space tend to be more analogous to those of malicious prompts. Leveraging this finding, we propose utilizing the early transformer outputs of LLMs as a means to detect malicious inputs and to terminate generation immediately. We introduce a simple yet significant defense approach called EEG-Defender for LLMs. We conduct comprehensive experiments on ten jailbreak methods across three models. Our results demonstrate that EEG-Defender is capable of reducing the Attack Success Rate (ASR) by a significant margin, roughly 85% in comparison with 50% for the present SOTAs, with minimal impact on the utility of LLMs.
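
The detection idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that embeddings of a prompt at a few early transformer layers are already available as plain vectors, and it uses a nearest-centroid rule with cosine similarity as a stand-in classifier. The function names (`fit_layer_prototypes`, `should_exit_early`), the `min_votes` parameter, and all toy values are hypothetical.

```python
import math


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def fit_layer_prototypes(benign_by_layer, harmful_by_layer):
    """Build one (benign centroid, harmful centroid) pair per early layer.

    Assumption: each argument is a list indexed by layer, and each entry
    is a list of embedding vectors collected from labeled prompts.
    """
    return [
        (centroid(benign), centroid(harmful))
        for benign, harmful in zip(benign_by_layer, harmful_by_layer)
    ]


def should_exit_early(prompt_by_layer, prototypes, min_votes):
    """Return True if at least `min_votes` early layers place the prompt
    closer to the harmful centroid than to the benign one."""
    votes = 0
    for emb, (benign_c, harmful_c) in zip(prompt_by_layer, prototypes):
        if cosine(emb, harmful_c) > cosine(emb, benign_c):
            votes += 1
    return votes >= min_votes


# Toy 2-dimensional "embeddings" for two early layers (hypothetical values).
benign_by_layer = [
    [[1.0, 0.1], [0.9, 0.2]],
    [[1.0, 0.0], [0.8, 0.1]],
]
harmful_by_layer = [
    [[0.1, 1.0], [0.2, 0.9]],
    [[0.0, 1.0], [0.1, 0.8]],
]
prototypes = fit_layer_prototypes(benign_by_layer, harmful_by_layer)

# A jailbreak-like prompt whose early embeddings sit near the harmful cluster.
jailbreak_prompt = [[0.3, 0.8], [0.2, 0.9]]
# A benign prompt whose early embeddings sit near the benign cluster.
benign_prompt = [[0.9, 0.1], [0.95, 0.05]]

stop_jailbreak = should_exit_early(jailbreak_prompt, prototypes, min_votes=2)
stop_benign = should_exit_early(benign_prompt, prototypes, min_votes=2)
```

In a real system the per-layer vectors would come from the model's intermediate hidden states, and a flagged prompt would trigger a refusal before any further tokens are generated. Requiring agreement from several layers, rather than a single one, is one way to limit false refusals on benign prompts.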

External datasets

Alpaca Eval

AdvBench

toxic-chat training dataset