These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
Recent studies have shown that Large Language Models (LLMs) are vulnerable to
data poisoning attacks, where malicious training examples embed hidden
behaviours triggered by specific input patterns. However, most existing works
assume a phrase and focus on the attack's effectiveness, offering limited
understanding of trigger mechanisms and how multiple triggers interact within
the model. In this paper, we present a framework for studying poisoning in
LLMs. We show that multiple distinct backdoor triggers can coexist within a
single model without interfering with each other, enabling adversaries to embed
several triggers concurrently. Using multiple triggers with high embedding
similarity, we demonstrate that poisoned triggers can achieve robust activation
even when tokens are substituted or separated by long token spans. Our findings
expose a broader and more persistent vulnerability surface in LLMs. To mitigate
this threat, we propose a post hoc recovery method that selectively retrains
specific model components based on a layer-wise weight difference analysis. Our
method effectively removes the trigger behaviour with minimal parameter
updates, presenting a practical and efficient defence against multi-trigger
poisoning.