How Few-shot Demonstrations Affect Prompt-based Defenses Against LLM Jailbreak Attacks

Authors: Yanshu Wang, Shuaishuai Yang, Jingjing He, Tong Yang | Published: 2026-02-04

2026.02.042026.02.06

Authors: Yanshu Wang, Shuaishuai Yang, Jingjing He, Tong Yang
Published: 2026-02-04

Source: https://arxiv.org/abs/2602.04294

PDF: https://arxiv.org/pdf/2602.04294

Labels Predicted by AI

LLM Performance Evaluation Large Language Model Prompt Injection

Please note that these labels were automatically added by AI. Therefore, they may not be entirely accurate.
For more details, please see the About the Literature Database page.

Abstract

Large Language Models (LLMs) face increasing threats from jailbreak attacks that bypass safety alignment. While prompt-based defenses such as Role-Oriented Prompts (RoP) and Task-Oriented Prompts (ToP) have shown effectiveness, the role of few-shot demonstrations in these defense strategies remains unclear. Prior work suggests that few-shot examples may compromise safety, but lacks investigation into how few-shot interacts with different system prompt strategies. In this paper, we conduct a comprehensive evaluation on multiple mainstream LLMs across four safety benchmarks (AdvBench, HarmBench, SG-Bench, XSTest) using six jailbreak attack methods. Our key finding reveals that few-shot demonstrations produce opposite effects on RoP and ToP: few-shot enhances RoP’s safety rate by up to 4.5