LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet

arxiv

AI Security Portal bot

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2408.15221

PDF

https://arxiv.org/pdf/2408.15221

Bibliographic Information

Authors: Nathaniel Li; Ziwen Han; Ian Steneker; Willow Primack; Riley Goodside; Hugh Zhang; Zifan Wang; Cristina Menghini; Summer Yue
Published: 2024-8-28
Updated: 2024-9-4
Affiliation: Scale AI
Country of affiliation: United States of America
Conference: Computing Research Repository (CoRR)

Labels Estimated by AI

Prompt Injection; User Education; Attack Methods

※ These labels were added automatically by AI and may therefore be inaccurate.
For details, see "About the Literature Database".

Abstract

Recent large language model (LLM) defenses have greatly improved models' ability to refuse harmful queries, even when adversarially attacked. However, LLM defenses are primarily evaluated against automated adversarial attacks in a single turn of conversation, an insufficient threat model for real-world malicious use. We demonstrate that multi-turn human jailbreaks uncover significant vulnerabilities, exceeding 70% attack success rate (ASR) on HarmBench against defenses that report single-digit ASRs with automated single-turn attacks. Human jailbreaks also reveal vulnerabilities in machine unlearning defenses, successfully recovering dual-use biosecurity knowledge from unlearned models. We compile these results into Multi-Turn Human Jailbreaks (MHJ), a dataset of 2,912 prompts across 537 multi-turn jailbreaks. We publicly release MHJ alongside a compendium of jailbreak tactics developed across dozens of commercial red teaming engagements, supporting research towards stronger LLM defenses.
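
The abstract reports results as an attack success rate (ASR) and gives the dataset size as 2,912 prompts across 537 jailbreaks. The following is a minimal sketch of that arithmetic, assuming ASR is simply the fraction of evaluated behaviors for which a jailbreak was judged successful; the paper's exact scoring procedure is not described in this record. The function name `attack_success_rate` and the example outcomes are hypothetical and are not data from the paper.

```python
def attack_success_rate(outcomes):
    """Return the fraction of behaviors marked as successfully jailbroken.

    `outcomes` maps a behavior ID to True (jailbreak judged successful)
    or False. This definition is an assumption made for illustration.
    """
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    successes = sum(1 for succeeded in outcomes.values() if succeeded)
    return successes / len(outcomes)


# Hypothetical outcomes for four behaviors (illustrative only).
example_outcomes = {"b1": True, "b2": True, "b3": False, "b4": True}
asr = attack_success_rate(example_outcomes)

# Average prompts per jailbreak, using the dataset sizes stated in the abstract.
avg_prompts_per_jailbreak = 2912 / 537

print(f"ASR: {asr:.0%}")
print(f"Average prompts per jailbreak: {avg_prompts_per_jailbreak:.2f}")
```

Under the stated sizes, the average works out to roughly 5.4 prompts per jailbreak, which is consistent with the multi-turn framing of the dataset.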

External Datasets

Multi-Turn Human Jailbreaks (MHJ)

HarmBench

WMDP-Bio