Abstract
Large language models (LLMs) have evolved from simple chatbots into
autonomous agents capable of performing complex tasks such as editing
production code, orchestrating workflows, and taking higher-stakes actions
based on untrusted inputs like webpages and emails. These capabilities
introduce new security risks that existing security measures, such as model
fine-tuning or chatbot-focused guardrails, do not fully address. Given the
higher stakes and the absence of deterministic solutions to mitigate these
risks, there is a critical need for a real-time guardrail monitor that serves as a
final layer of defense and supports system-level, use-case-specific safety
policy definition and enforcement. We introduce LlamaFirewall, an open-source,
security-focused guardrail framework designed to serve as a final layer of
defense against security risks associated with AI agents. Our framework
mitigates risks such as prompt injection, agent misalignment, and insecure
code through three powerful guardrails: PromptGuard 2, a universal jailbreak
detector that demonstrates clear state-of-the-art performance; Agent Alignment
Checks, a chain-of-thought auditor that inspects agent reasoning for prompt
injection and goal misalignment, which, while still experimental, shows
stronger efficacy at preventing indirect injections in general scenarios than
previously proposed approaches; and CodeShield, an online static analysis
engine that is both fast and extensible, aimed at preventing the generation of
insecure or dangerous code by coding agents. Additionally, we include
easy-to-use customizable scanners that make it possible for any developer who
can write a regular expression or an LLM prompt to quickly update an agent's
security guardrails.
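
To illustrate the idea of a regex-based custom scanner, here is a minimal sketch in Python. It is not the LlamaFirewall API: the names `ScanResult`, `RULES`, and `scan`, as well as the two example patterns, are hypothetical and chosen only to show how a developer might express a guardrail rule as a regular expression.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScanResult:
    """Outcome of scanning one piece of text (hypothetical type)."""
    blocked: bool
    matched_rule: Optional[str] = None


# Hypothetical rule set: rule name -> compiled pattern.
# Adding a guardrail amounts to adding one more entry here.
RULES = {
    "ignore_instructions": re.compile(
        r"ignore (all )?(previous|prior) instructions", re.IGNORECASE
    ),
    "hardcoded_aws_key": re.compile(r"AKIA[0-9A-Z]{16}"),
}


def scan(text: str) -> ScanResult:
    """Return a blocking result for the first rule whose pattern matches."""
    for name, pattern in RULES.items():
        if pattern.search(text):
            return ScanResult(blocked=True, matched_rule=name)
    return ScanResult(blocked=False)
```

In this sketch, a call such as `scan(untrusted_text)` would run before the agent acts on an input or emits code, and a result with `blocked` set would stop the action. Real deployments would likely combine such patterns with model-based detectors, since regular expressions alone are easy to evade.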
External Datasets
AgentDojo
In-house benchmark specifically designed to assess indirect goal-hijacking scenarios