How Few-shot Demonstrations Affect Prompt-based Defenses Against LLM Jailbreak Attacks

TOP Literature Database How Few-shot Demonstrations Affect Prompt-based Defenses Against LLM Jailbreak Attacks

arxiv

AI Security Portal bot

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2602.04294

PDF

https://arxiv.org/pdf/2602.04294

Paper Information

Author: Yanshu Wang,Shuaishuai Yang,Jingjing He,Tong Yang
Published: 2-4-2026
Affiliation: Peking University
Country: China
Conference: Computing Research Repository (CoRR)

Labels Estimated by AI

LLM Performance Evaluation Large Language Model Prompt Injection

These labels were automatically added by AI and may be inaccurate.
For details, see About Literature Database.

Abstract

Large Language Models (LLMs) face increasing threats from jailbreak attacks that bypass safety alignment. While prompt-based defenses such as Role-Oriented Prompts (RoP) and Task-Oriented Prompts (ToP) have shown effectiveness, the role of few-shot demonstrations in these defense strategies remains unclear. Prior work suggests that few-shot examples may compromise safety, but lacks investigation into how few-shot interacts with different system prompt strategies. In this paper, we conduct a comprehensive evaluation on multiple mainstream LLMs across four safety benchmarks (AdvBench, HarmBench, SG-Bench, XSTest) using six jailbreak attack methods. Our key finding reveals that few-shot demonstrations produce opposite effects on RoP and ToP: few-shot enhances RoP's safety rate by up to 4.5% through reinforcing role identity, while it degrades ToP's effectiveness by up to 21.2% through distracting attention from task instructions. Based on these findings, we provide practical recommendations for deploying prompt-based defenses in real-world LLM applications.

External Datasets

AdvBench

HarmBench

SG-Bench

XSTest