Large language models (LLMs) have evolved from simple chatbots into
autonomous agents capable of performing complex tasks such as editing
production code, orchestrating workflows, and taking higher-stakes actions
based on untrusted inputs like webpages and emails. These capabilities
introduce new security risks that existing security measures, such as model
fine-tuning or chatbot-focused guardrails, do not fully address. Given the
higher stakes and the absence of deterministic solutions to mitigate these
risks, there is a critical need for a real-time guardrail monitor to serve as a
final layer of defense, and support system level, use case specific safety
policy definition and enforcement. We introduce LlamaFirewall, an open-source
security focused guardrail framework designed to serve as a final layer of
defense against security risks associated with AI Agents. Our framework
mitigates risks such as prompt injection, agent misalignment, and insecure code
risks through three powerful guardrails: PromptGuard 2, a universal jailbreak
detector that demonstrates clear state of the art performance; Agent Alignment
Checks, a chain-of-thought auditor that inspects agent reasoning for prompt
injection and goal misalignment, which, while still experimental, shows
stronger efficacy at preventing indirect injections in general scenarios than
previously proposed approaches; and CodeShield, an online static analysis
engine that is both fast and extensible, aimed at preventing the generation of
insecure or dangerous code by coding agents. Additionally, we include
easy-to-use customizable scanners that make it possible for any developer who
can write a regular expression or an LLM prompt to quickly update an agent's
security guardrails.
外部データセット
AgentDojo
in-house benchmark specifically designed to assess indirect goal hijacking scenarios