FuzzLLM: A Novel and Universal Fuzzing Framework for Proactively Discovering Jailbreak Vulnerabilities in Large Language Models

TOP Literature Database FuzzLLM: A Novel and Universal Fuzzing Framework for Proactively Discovering Jailbreak Vulnerabilities in Large Language Models

arxiv

AI Security Portal bot

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2309.05274

PDF

https://arxiv.org/pdf/2309.05274

Paper Information

Author: Dongyu Yao;Jianshu Zhang;Ian G. Harris;Marcel Carlsson
Published: 9-11-2023
Updated: 4-15-2024
Affiliation: Wuhan University
Country: China
Conference: Computing Research Repository (CoRR)

Labels Estimated by AI

Prompt Injection LLM Security Watermarking

These labels were automatically added by AI and may be inaccurate.
For details, see About Literature Database.

Abstract

Jailbreak vulnerabilities in Large Language Models (LLMs), which exploit meticulously crafted prompts to elicit content that violates service guidelines, have captured the attention of research communities. While model owners can defend against individual jailbreak prompts through safety training strategies, this relatively passive approach struggles to handle the broader category of similar jailbreaks. To tackle this issue, we introduce FuzzLLM, an automated fuzzing framework designed to proactively test and discover jailbreak vulnerabilities in LLMs. We utilize templates to capture the structural integrity of a prompt and isolate key features of a jailbreak class as constraints. By integrating different base classes into powerful combo attacks and varying the elements of constraints and prohibited questions, FuzzLLM enables efficient testing with reduced manual effort. Extensive experiments demonstrate FuzzLLM's effectiveness and comprehensiveness in vulnerability discovery across various LLMs.

External Datasets

Jailbreakchat prompts

empirical works on jailbreak taxonomy