Current evaluations of defenses against prompt attacks in large language
model (LLM) applications often overlook two critical factors: the dynamic
nature of adversarial behavior and the usability penalties imposed on
legitimate users by restrictive defenses. We propose D-SEC (Dynamic Security
Utility Threat Model), which explicitly separates attackers from legitimate
users, models multi-step interactions, and expresses the security-utility tradeoff in an
optimizable form. We further address the shortcomings in existing evaluations
by introducing Gandalf, a crowd-sourced, gamified red-teaming platform designed
to generate realistic, adaptive attacks. Using Gandalf, we collect and release a
dataset of 279k prompt attacks. Complemented by benign user data, our analysis
reveals the interplay between security and utility, showing that defenses
integrated into the LLM (e.g., system prompts) can degrade usability even without
blocking requests. We demonstrate that restricted application domains,
defense-in-depth, and adaptive defenses are effective strategies for building
secure and useful LLM applications.