arXiv
FuzzLLM: A Novel and Universal Fuzzing Framework for Proactively Discovering Jailbreak Vulnerabilities in Large Language Models
AI Security Portal bot
Information in the literature database is collected automatically.
Abstract
Jailbreak vulnerabilities in Large Language Models (LLMs), which exploit
meticulously crafted prompts to elicit content that violates service
guidelines, have captured the attention of research communities. While model
owners can defend against individual jailbreak prompts through safety training
strategies, this relatively passive approach struggles to handle the broader
category of similar jailbreaks. To tackle this issue, we introduce FuzzLLM, an
automated fuzzing framework designed to proactively test and discover jailbreak
vulnerabilities in LLMs. We utilize templates to capture the structural
integrity of a prompt and isolate key features of a jailbreak class as
constraints. By integrating different base classes into powerful combo attacks
and varying the elements of constraints and prohibited questions, FuzzLLM
enables efficient testing with reduced manual effort. Extensive experiments
demonstrate FuzzLLM's effectiveness and comprehensiveness in vulnerability
discovery across various LLMs.
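The abstract describes prompts built from templates whose constraint and prohibited-question slots are varied, with base classes merged into combo attacks. The Python sketch below illustrates only that combinatorial idea under stated assumptions. The class names, template wording, `BASE_CLASSES`, `build_combo`, and `fuzz` are all hypothetical, and placeholder strings stand in for real constraints and questions. It is not FuzzLLM's actual implementation.

```python
from collections.abc import Iterator
from itertools import product
from string import Template

# Hypothetical base classes for illustration only: each maps to a template
# fragment with one $constraint slot and a pool of interchangeable constraint
# phrasings. Placeholder strings stand in for real constraint text.
BASE_CLASSES: dict[str, tuple[str, list[str]]] = {
    "class_a": ("Constraint (class A): $constraint", ["<constraint A1>", "<constraint A2>"]),
    "class_b": ("Constraint (class B): $constraint", ["<constraint B1>", "<constraint B2>"]),
}


def build_combo(names: list[str]) -> tuple[Template, list[list[str]]]:
    """Merge base-class fragments into one combo template with numbered slots."""
    parts, pools = [], []
    for i, name in enumerate(names):
        fragment, pool = BASE_CLASSES[name]
        # Rename each fragment's slot ($c0, $c1, ...) so slots stay distinct.
        parts.append(fragment.replace("$constraint", f"$c{i}"))
        pools.append(pool)
    parts.append("Question: $question")  # the prohibited-question slot
    return Template("\n".join(parts)), pools


def fuzz(names: list[str], questions: list[str]) -> Iterator[str]:
    """Yield one prompt per combination of constraint elements and question."""
    template, pools = build_combo(names)
    for *constraints, question in product(*pools, questions):
        slots = {f"c{i}": c for i, c in enumerate(constraints)}
        yield template.substitute(slots, question=question)


if __name__ == "__main__":
    prompts = list(fuzz(["class_a", "class_b"], ["<question 1>", "<question 2>"]))
    print(len(prompts))  # 2 constraints x 2 constraints x 2 questions = 8
    print(prompts[0])
```

In this sketch the number of generated prompts is the product of the pool sizes, so adding one constraint variant or one question multiplies the test set rather than adding a single case. That is one plausible reading of how varying template elements lets a small set of hand-written pieces yield many test prompts with reduced manual effort.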