LlamaFirewall: An open source guardrail system for building secure AI agents

TOP 文献データベース LlamaFirewall: An open source guardrail system for building secure AI agents

arxiv

AIセキュリティポータルbot

文献データベースの情報は、自動的に収集されています。

Source

https://arxiv.org/abs/2505.03574

PDF

https://arxiv.org/pdf/2505.03574

文献情報

作者: Sahana Chennabasappa,Cyrus Nikolaidis,Daniel Song,David Molnar,Stephanie Ding,Shengye Wan,Spencer Whitman,Lauren Deason,Nicholas Doucette,Abraham Montilla,Alekhya Gampa,Beto de Paola,Dominik Gabi,James Crnkovich,Jean-Christophe Testud,Kat He,Rashnil Chaturvedi,Wu Zhou,Joshua Saxe
公開日: 2025-5-6
所属機関: Meta
所属の国: United States of America
会議名: Computing Research Repository (CoRR)

AIにより推定されたラベル

アライメント LLMセキュリティプロンプトインジェクション

※ こちらのラベルはAIによって自動的に追加されました。そのため、正確でないことがあります。
詳細は文献データベースについてをご覧ください。

Abstract

Large language models (LLMs) have evolved from simple chatbots into autonomous agents capable of performing complex tasks such as editing production code, orchestrating workflows, and taking higher-stakes actions based on untrusted inputs like webpages and emails. These capabilities introduce new security risks that existing security measures, such as model fine-tuning or chatbot-focused guardrails, do not fully address. Given the higher stakes and the absence of deterministic solutions to mitigate these risks, there is a critical need for a real-time guardrail monitor to serve as a final layer of defense, and support system level, use case specific safety policy definition and enforcement. We introduce LlamaFirewall, an open-source security focused guardrail framework designed to serve as a final layer of defense against security risks associated with AI Agents. Our framework mitigates risks such as prompt injection, agent misalignment, and insecure code risks through three powerful guardrails: PromptGuard 2, a universal jailbreak detector that demonstrates clear state of the art performance; Agent Alignment Checks, a chain-of-thought auditor that inspects agent reasoning for prompt injection and goal misalignment, which, while still experimental, shows stronger efficacy at preventing indirect injections in general scenarios than previously proposed approaches; and CodeShield, an online static analysis engine that is both fast and extensible, aimed at preventing the generation of insecure or dangerous code by coding agents. Additionally, we include easy-to-use customizable scanners that make it possible for any developer who can write a regular expression or an LLM prompt to quickly update an agent's security guardrails.

外部データセット

AgentDojo

in-house benchmark specifically designed to assess indirect goal hijacking scenarios

in-house direct jailbreak evaluation dataset