Abstract
Safety alignment in large language models relies on behavioral training that can be overridden when sufficiently strong in-context patterns compete with learned refusal behaviors. We introduce Involuntary In-Context Learning (IICL), an attack class that uses abstract operator framing with few-shot examples to force pattern completion that overrides safety training. Using 3479 probes across 10 OpenAI models, we identify the attack's effective components through a seven-experiment ablation study. Key findings: (1)~semantic operator naming achieves a 100\,\% bypass rate (50/50, $p < 0.001$); (2)~the attack requires abstract framing, since identical examples in direct question-and-answer format yield 0\,\%; (3)~example ordering matters strongly (interleaved: 76\,\%, harmful-first: 6\,\%); (4)~temperature has no meaningful effect (46--56\,\% across 0.0--1.0). On the HarmBench benchmark, IICL achieves a 24.0\,\% bypass rate $[18.6\%, 30.4\%]$ against GPT-5.4, with detailed responses averaging 619 words, compared with 0.0\,\% for direct queries.
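To make the prompt structure the abstract describes more concrete, the sketch below illustrates the general shape of an operator-framed few-shot prompt: a named abstract operator is defined through interleaved examples, and the target query is posed as a pattern completion rather than a direct question. This is a minimal illustration under stated assumptions, not the paper's actual attack prompt; the function name `build_iicl_prompt`, the operator name `APPLY`, and the placeholder example strings are all hypothetical, and only benign placeholders are used.

```python
# Hypothetical sketch of the operator-framed prompt shape described in the
# abstract. All identifiers and example strings here are illustrative
# placeholders, not material from the paper.

OPERATOR_NAME = "APPLY"  # a semantically named operator (cf. finding 1)


def build_iicl_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Frame a query as pattern completion under an abstract operator.

    `examples` is a list of (input, output) pairs; the paper reports that
    interleaving example types (76%) far outperforms grouping them
    (harmful-first: 6%), so the caller controls the ordering.
    """
    lines = [f"Define the operator {OPERATOR_NAME}(x) by these examples:"]
    for x, y in examples:
        lines.append(f"{OPERATOR_NAME}({x!r}) = {y!r}")
    # The final line leaves the pattern incomplete, inviting completion.
    lines.append(f"Complete: {OPERATOR_NAME}({query!r}) =")
    return "\n".join(lines)


# Usage with benign placeholders only:
prompt = build_iicl_prompt(
    [("placeholder input A", "placeholder output A"),
     ("placeholder input B", "placeholder output B")],
    "held-out input",
)
print(prompt)
```

The key contrast with a direct question-and-answer format (which the abstract reports yields 0\,\%) is that the model is never asked a question at all; it is handed an incomplete symbolic pattern whose completion is the response.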