Abstract
In recent years, large language models (LLMs) have developed rapidly and
achieved remarkable performance across various tasks. However, research
indicates that LLMs are vulnerable to jailbreak attacks, in which adversaries
use meticulously crafted prompts to induce the generation of harmful content.
This vulnerability poses significant challenges to the safe use and widespread
adoption of LLMs. Existing defense methods offer protection from different
perspectives, but they often suffer from limited effectiveness or significantly
degrade the model's capabilities. In this paper, we propose Prefix Guidance
(PG), a plug-and-play and easy-to-deploy jailbreak defense framework that
guides the model to identify harmful prompts by directly setting the first few
tokens of the model's output. This approach combines the model's inherent
security capabilities with an external classifier to defend against jailbreak
attacks. We demonstrate the effectiveness of PG across three models and five
attack methods. Compared to the baselines, our approach is more effective on
average. Additionally, results on the Just-Eval benchmark further confirm PG's
superiority in preserving the model's performance. Our code is available at
https://github.com/weiyezhimeng/Prefix-Guidance.
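To make the prefix-setting mechanism concrete, the following is a minimal sketch of how forcing the first few output tokens and then checking the continuation could look in practice. It is not the authors' implementation: the model name, the guidance prefix string, the keyword-based stand-in for the external classifier, and the function names are all illustrative assumptions; only the general idea of prefixing the assistant's response and classifying the continuation comes from the abstract.

```python
# Minimal sketch of prefix-guided decoding (NOT the authors' code).
# Assumptions: MODEL_NAME, GUIDANCE_PREFIX, and the keyword-based
# looks_harmful() classifier are placeholders for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"  # placeholder chat model
GUIDANCE_PREFIX = "I'm sorry"              # hypothetical refusal-style prefix

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

def prefix_guided_continuation(user_prompt: str, max_new_tokens: int = 64) -> str:
    """Force the assistant response to start with GUIDANCE_PREFIX and let the model continue."""
    chat = [{"role": "user", "content": user_prompt}]
    # Build the chat prompt up to the assistant turn, then append the fixed prefix
    # so that generation continues from those forced first tokens.
    prompt = tokenizer.apply_chat_template(
        chat, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt + GUIDANCE_PREFIX, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    continuation = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return GUIDANCE_PREFIX + continuation

def looks_harmful(continuation: str) -> bool:
    """Crude keyword stand-in for the external classifier used in the paper."""
    refusal_markers = ("i cannot", "i can't", "not able to help", "against my guidelines")
    return any(marker in continuation.lower() for marker in refusal_markers)

if __name__ == "__main__":
    text = prefix_guided_continuation("Example user prompt goes here.")
    if looks_harmful(text):
        print("Prompt flagged as harmful; return a refusal.")
    else:
        print("Prompt judged benign; regenerate the answer without the prefix.")
```

In this sketch, a refusal-flavored continuation after the forced prefix is treated as evidence that the prompt was harmful, while a benign continuation triggers normal regeneration without the prefix; the actual framework replaces the keyword check with a trained classifier.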