ICLShield: Exploring and Mitigating In-Context Learning Backdoor Attacks

TOP Literature Database ICLShield: Exploring and Mitigating In-Context Learning Backdoor Attacks

arxiv

AI Security Portal bot

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2507.01321

PDF

https://arxiv.org/pdf/2507.01321

Paper Information

Author: Zhiyao Ren,Siyuan Liang,Aishan Liu,Dacheng Tao
Published: 7-2-2025
Affiliation: Nanyang Technological University
Country: Singapore
Conference: International Conference on Machine Learning (ICML)

Labels Estimated by AI

Trigger Detection Backdoor Attack Techniques ICL Defense Mechanism

These labels were automatically added by AI and may be inaccurate.
For details, see About Literature Database.

Abstract

In-context learning (ICL) has demonstrated remarkable success in large language models (LLMs) due to its adaptability and parameter-free nature. However, it also introduces a critical vulnerability to backdoor attacks, where adversaries can manipulate LLM behaviors by simply poisoning a few ICL demonstrations. In this paper, we propose, for the first time, the dual-learning hypothesis, which posits that LLMs simultaneously learn both the task-relevant latent concepts and backdoor latent concepts within poisoned demonstrations, jointly influencing the probability of model outputs. Through theoretical analysis, we derive an upper bound for ICL backdoor effects, revealing that the vulnerability is dominated by the concept preference ratio between the task and the backdoor. Motivated by these findings, we propose ICLShield, a defense mechanism that dynamically adjusts the concept preference ratio. Our method encourages LLMs to select clean demonstrations during the ICL phase by leveraging confidence and similarity scores, effectively mitigating susceptibility to backdoor attacks. Extensive experiments across multiple LLMs and tasks demonstrate that our method achieves state-of-the-art defense effectiveness, significantly outperforming existing approaches (+26.02% on average). Furthermore, our method exhibits exceptional adaptability and defensive performance even for closed-source models (e.g., GPT-4).

External Datasets

SST-2

AG's News

Standford Alpaca

AdvBench

GSM8k

CSQA