FuzzLLM: A Novel and Universal Fuzzing Framework for Proactively Discovering Jailbreak Vulnerabilities in Large Language Models
Abstract
Jailbreak vulnerabilities in Large Language Models (LLMs), which exploit meticulously crafted prompts to elicit content that violates service guidelines, have captured the attention of research communities. While model owners can defend against individual jailbreak prompts through safety training strategies, this relatively passive approach struggles to handle the broader category of similar jailbreaks. To tackle this issue, we introduce FuzzLLM, an automated fuzzing framework designed to proactively test and discover jailbreak vulnerabilities in LLMs. We utilize templates to capture the structural integrity of a prompt and isolate key features of a jailbreak class as constraints. By integrating different base classes into powerful combo attacks and varying the elements of constraints and prohibited questions, FuzzLLM enables efficient testing with reduced manual effort. Extensive experiments demonstrate FuzzLLM's effectiveness and comprehensiveness in vulnerability discovery across various LLMs.
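The abstract describes test prompts being assembled from templates, constraints, and questions that are varied independently. A minimal sketch of that combinatorial step follows. It is not the authors' implementation: the function name `build_test_prompts`, the slot names `{constraint}` and `{question}`, the class labels, and all placeholder strings are assumptions made for illustration, and how FuzzLLM actually builds combo templates or varies elements is not specified here.

```python
from itertools import product


def build_test_prompts(templates, constraints, questions):
    """Fill every template with every (constraint, question) pair.

    Hypothetical helper: returns one dict per generated prompt, tagged
    with the name of the template class that produced it.
    """
    prompts = []
    for (name, template), constraint, question in product(
        templates.items(), constraints, questions
    ):
        prompts.append(
            {
                "class": name,
                "prompt": template.format(constraint=constraint, question=question),
            }
        )
    return prompts


# Placeholder templates standing in for two base classes (assumed names).
templates = {
    "class_a": "[A] {constraint} | {question}",
    "class_b": "[B] {constraint} | {question}",
}
# A combo class is modeled here as one more template that merges both
# structures; this is an assumption about how combos could be represented.
templates["combo_a_b"] = "[A+B] {constraint} | {question}"

# Neutral placeholders; a real test set would supply its own content.
constraints = ["<constraint 1>", "<constraint 2>"]
questions = ["<test question 1>", "<test question 2>", "<test question 3>"]

prompts = build_test_prompts(templates, constraints, questions)
print(len(prompts))  # 3 templates x 2 constraints x 3 questions = 18
```

Because each axis varies independently, adding one template or one constraint multiplies the number of test cases, which is how a small amount of manual input could cover many variants of the same jailbreak class.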