Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction

arXiv

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2402.18104

PDF

https://arxiv.org/pdf/2402.18104

Paper Information

Authors: Tong Liu; Yingjie Zhang; Zhe Zhao; Yinpeng Dong; Guozhu Meng; Kai Chen
Published: 2024-02-28
Updated: 2024-06-10
Affiliation: Institute of Information Engineering, Chinese Academy of Sciences
Country: China
Conference: USENIX Security Symposium

Labels Estimated by AI

Prompt Injection; LLM Security; LLM Performance Evaluation

These labels were automatically added by AI and may be inaccurate.
For details, see About Literature Database.

Abstract

In recent years, large language models (LLMs) have demonstrated notable success across various tasks, but the trustworthiness of LLMs is still an open problem. One specific threat is the potential to generate toxic or harmful responses. Attackers can craft adversarial prompts that induce harmful responses from LLMs. In this work, we pioneer a theoretical foundation in LLM security by identifying bias vulnerabilities within safety fine-tuning, and we design a black-box jailbreak method named DRA (Disguise and Reconstruction Attack), which conceals harmful instructions through disguise and prompts the model to reconstruct the original harmful instruction within its completion. We evaluate DRA across various open-source and closed-source models, showcasing state-of-the-art jailbreak success rates and attack efficiency. Notably, DRA achieves a 91.1% attack success rate on the OpenAI GPT-4 chatbot.

External Datasets

HarmBench