These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
Large Language Models (LLMs) have achieved remarkable progress, but their
deployment has exposed critical vulnerabilities, particularly to jailbreak
attacks that circumvent safety alignments. Guardrails--external defense
mechanisms that monitor and control LLM interactions--have emerged as a
promising solution. However, the current landscape of LLM guardrails is
fragmented, lacking a unified taxonomy and comprehensive evaluation framework.
In this Systematization of Knowledge (SoK) paper, we present the first holistic
analysis of jailbreak guardrails for LLMs. We propose a novel,
multi-dimensional taxonomy that categorizes guardrails along six key
dimensions, and introduce a Security-Efficiency-Utility evaluation framework to
assess their practical effectiveness. Through extensive analysis and
experiments, we identify the strengths and limitations of existing guardrail
approaches, provide insights into optimizing their defense mechanisms, and
explore their universality across attack types. Our work offers a structured
foundation for future research and development, aiming to guide the principled
advancement and deployment of robust LLM guardrails. The code is available at
https://github.com/xunguangwang/SoK4JailbreakGuardrails.