GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2507.07735

PDF

https://arxiv.org/pdf/2507.07735

Paper Information

Authors: Peiyan Zhang, Haibo Jin, Liying Kang, Haohan Wang
Published: 2025-07-10
Affiliation: Hong Kong University of Science and Technology
Country: Hong Kong
Conference: Computing Research Repository (CoRR)

Labels Estimated by AI

Performance Evaluation Metrics, Large Language Model, Prompt validation

These labels were automatically added by AI and may be inaccurate.

Abstract

Jailbreak attacks reveal critical vulnerabilities in Large Language Models (LLMs) by causing them to generate harmful or unethical content. Evaluating these threats is particularly challenging due to the evolving nature of LLMs and the sophistication required in effectively probing their vulnerabilities. Current benchmarks and evaluation methods struggle to fully address these challenges, leaving gaps in the assessment of LLM vulnerabilities. In this paper, we review existing jailbreak evaluation practices and identify three assumed desiderata for an effective jailbreak evaluation protocol. To address these challenges, we introduce GuardVal, a new evaluation protocol that dynamically generates and refines jailbreak prompts based on the defender LLM's state, providing a more accurate assessment of defender LLMs' capacity to handle safety-critical situations. Moreover, we propose a new optimization method that prevents stagnation during prompt refinement, ensuring the generation of increasingly effective jailbreak prompts that expose deeper weaknesses in the defender LLMs. We apply this protocol to a diverse set of models, from Mistral-7b to GPT-4, across 10 safety domains. Our findings highlight distinct behavioral patterns among the models, offering a comprehensive view of their robustness. Furthermore, our evaluation process deepens the understanding of LLM behavior, leading to insights that can inform future research and drive the development of more secure models.
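
The abstract mentions an optimization method that prevents stagnation during iterative refinement but does not say how it works. The sketch below is not GuardVal's method. It is a generic illustration of the general idea, assuming a greedy refinement loop that applies a larger perturbation once the score has stopped improving for a set number of rounds. All names (`refine_with_stagnation_guard`, `patience`, `toy_score`, and so on) are hypothetical, and the candidates are plain integers scored by a toy objective with a plateau, not prompts or model outputs.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def refine_with_stagnation_guard(
    seed: T,
    score: Callable[[T], float],
    refine: Callable[[T], T],
    perturb: Callable[[T], T],
    max_rounds: int = 10,
    patience: int = 2,
) -> Tuple[T, float, List[float]]:
    """Generic refinement loop (illustrative assumption, not the paper's algorithm).

    Applies `refine` each round; after `patience` consecutive rounds without
    improvement, applies `perturb` to the best candidate found so far instead.
    """
    best, best_score = seed, score(seed)
    current = seed
    stalled = 0
    history = [best_score]  # best score after each round, starting with the seed
    for _ in range(max_rounds):
        if stalled >= patience:
            current = perturb(best)  # escape step after a stall
            stalled = 0
        else:
            current = refine(current)  # ordinary local step
        s = score(current)
        if s > best_score:
            best, best_score = current, s
            stalled = 0
        else:
            stalled += 1
        history.append(best_score)
    return best, best_score, history


def toy_score(x: int) -> int:
    # Toy objective: rises to 5, stays flat at 5 for x in 6..8, rises again from 9.
    return x if x <= 5 or x >= 9 else 5


def greedy_step(x: int) -> int:
    # Moves right only if that strictly improves the score, so it stalls on the plateau.
    return x + 1 if toy_score(x + 1) > toy_score(x) else x


def big_jump(x: int) -> int:
    # Larger move used only after a stall; the step size 4 is arbitrary.
    return x + 4


best, best_score, history = refine_with_stagnation_guard(
    seed=0, score=toy_score, refine=greedy_step, perturb=big_jump
)
print(best, best_score, history)
```

In this toy setup the greedy step alone gets stuck at 5, while the perturbation after two stalled rounds lets the search move past the plateau. The same loop structure would apply to any candidate type for which `score`, `refine`, and `perturb` can be defined.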

External Datasets

JAMBench

HarmBench

JailbreakBench

Chatbot Guardrails Arena