Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs

Information in this literature database is collected automatically.

Source

https://arxiv.org/abs/2406.06622

PDF

https://arxiv.org/pdf/2406.06622

Bibliographic information

Authors: Fan Liu; Zhao Xu; Hao Liu
Publication date: 2025-03-18
Affiliation: AI Thrust
Country of affiliation: Hong Kong
Conference name

AI-estimated labels

Adversarial training; Prompt injection; LLM security

Abstract

Although safety-enhanced Large Language Models (LLMs) have achieved remarkable success in tackling various complex tasks in a zero-shot manner, they remain susceptible to jailbreak attacks, particularly unknown jailbreak attacks. To enhance LLMs' generalized defense capabilities, we propose a two-stage adversarial tuning framework, which generates adversarial prompts to explore worst-case scenarios by optimizing datasets containing pairs of adversarial prompts and their safe responses. In the first stage, we introduce hierarchical meta-universal adversarial prompt learning to efficiently and effectively generate token-level adversarial prompts. In the second stage, we propose automatic adversarial prompt learning to iteratively refine semantic-level adversarial prompts, further enhancing LLMs' defense capabilities. We conducted comprehensive experiments on three widely used jailbreak datasets, comparing our framework with six defense baselines under five representative attack scenarios. The results underscore the superiority of our proposed methods. Furthermore, our adversarial tuning framework exhibits empirical generalizability across various attack strategies and target LLMs, highlighting its potential as a transferable defense mechanism.