SoK: Evaluating Jailbreak Guardrails for Large Language Models

Source

https://arxiv.org/abs/2506.10597

PDF

https://arxiv.org/pdf/2506.10597

Paper Information

Author: Xunguang Wang, Zhenlan Ji, Wenxuan Wang, Zongjie Li, Daoyuan Wu, Shuai Wang
Published: 2025-06-12
Updated: 2025-10-16
Affiliation: The Hong Kong University of Science and Technology
Country: Hong Kong
Conference: Computing Research Repository (CoRR)

Labels Estimated by AI

Prompt Injection Jailbreak Attack Techniques Trade-Off Between Safety And Usability

These labels were automatically added by AI and may be inaccurate.

Abstract

Large Language Models (LLMs) have achieved remarkable progress, but their deployment has exposed critical vulnerabilities, particularly to jailbreak attacks that circumvent safety alignments. Guardrails--external defense mechanisms that monitor and control LLM interactions--have emerged as a promising solution. However, the current landscape of LLM guardrails is fragmented, lacking a unified taxonomy and comprehensive evaluation framework. In this Systematization of Knowledge (SoK) paper, we present the first holistic analysis of jailbreak guardrails for LLMs. We propose a novel, multi-dimensional taxonomy that categorizes guardrails along six key dimensions, and introduce a Security-Efficiency-Utility evaluation framework to assess their practical effectiveness. Through extensive analysis and experiments, we identify the strengths and limitations of existing guardrail approaches, provide insights into optimizing their defense mechanisms, and explore their universality across attack types. Our work offers a structured foundation for future research and development, aiming to guide the principled advancement and deployment of robust LLM guardrails. The code is available at https://github.com/xunguangwang/SoK4JailbreakGuardrails.

External Datasets

JailbreakHub

JailbreakBench

SafeMTData

MultiJail

AlpacaEval

OR-Bench