AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models


arXiv

AI Security Portal bot

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2310.15140

PDF

https://arxiv.org/pdf/2310.15140

Literature information

Authors: Sicheng Zhu; Ruiyi Zhang; Bang An; Gang Wu; Joe Barrow; Zichao Wang; Furong Huang; Ani Nenkova; Tong Sun
Published: 2023-10-24
Updated: 2023-12-14
Affiliation: University of Maryland, College Park
Country of affiliation: United States of America
Conference name

Labels estimated by AI

Attack method, Prompt injection, Safety alignment

* These labels were added automatically by AI and may therefore be inaccurate.
For details, see "About the Literature Database".

Abstract

Safety alignment of Large Language Models (LLMs) can be compromised with manual jailbreak attacks and (automatic) adversarial attacks. Recent studies suggest that defending against these attacks is possible: adversarial attacks generate unlimited but unreadable gibberish prompts, detectable by perplexity-based filters; manual jailbreak attacks craft readable prompts, but their limited number due to the necessity of human creativity allows for easy blocking. In this paper, we show that these solutions may be too optimistic. We introduce AutoDAN, an interpretable, gradient-based adversarial attack that merges the strengths of both attack types. Guided by the dual goals of jailbreak and readability, AutoDAN optimizes and generates tokens one by one from left to right, resulting in readable prompts that bypass perplexity filters while maintaining high attack success rates. Notably, these prompts, generated from scratch using gradients, are interpretable and diverse, with emerging strategies commonly seen in manual jailbreak attacks. They also generalize to unforeseen harmful behaviors and transfer to black-box LLMs better than their unreadable counterparts when using limited training data or a single proxy model. Furthermore, we show the versatility of AutoDAN by automatically leaking system prompts using a customized objective. Our work offers a new way to red-team LLMs and understand jailbreak mechanisms via interpretability.
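The abstract refers to perplexity-based filters as a defense against unreadable adversarial prompts. As a rough illustration of what such a filter computes, here is a minimal sketch in Python. It is not the paper's filter: it assumes a toy unigram model with add-one smoothing in place of an LLM's token likelihoods, and the names `train_unigram`, `perplexity`, `is_flagged`, the sample corpus, and the threshold are all hypothetical.

```python
import math
from collections import Counter


def train_unigram(corpus_tokens):
    """Return a smoothed unigram log-probability function (toy stand-in for an LLM)."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot shared by all unseen tokens

    def log_prob(token):
        # Add-one (Laplace) smoothing so unseen tokens get nonzero probability.
        return math.log((counts.get(token, 0) + 1) / (total + vocab_size))

    return log_prob


def perplexity(tokens, log_prob):
    """Perplexity is the exponential of the mean negative log-likelihood per token."""
    if not tokens:
        raise ValueError("cannot compute perplexity of an empty sequence")
    nll = -sum(log_prob(t) for t in tokens) / len(tokens)
    return math.exp(nll)


def is_flagged(prompt, log_prob, threshold):
    """Flag a prompt whose perplexity exceeds the (hypothetical) threshold."""
    return perplexity(prompt.lower().split(), log_prob) > threshold


# Hypothetical reference text; a real filter would score tokens with a language model.
corpus = "please write a short story about a cat and a dog in a small town".split()
lp = train_unigram(corpus)

readable = "write a story about a cat"
gibberish = "zx!! qqv describing ]( similarlyNow"
```

Under this toy model, `is_flagged(gibberish, lp, 20.0)` is true while `is_flagged(readable, lp, 20.0)` is false, because every token in the gibberish string is unseen. The abstract's claim is that readable prompts, like the second case, pass this kind of check.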

External datasets

AdvBench

prompt-leaking dataset