Abstract
Fine-tuned Large Language Models (LLMs) are vulnerable to backdoor attacks
through data poisoning, yet the internal mechanisms governing these attacks
remain a black box. Previous research on interpretability for LLM safety tends
to focus on alignment, jailbreak, and hallucination, but overlooks backdoor
mechanisms, making it difficult to understand and fully eliminate the backdoor
threat. In this paper, aiming to bridge this gap, we explore the interpretable
mechanisms of LLM backdoors through Backdoor Attribution (BkdAttr), a
tripartite causal analysis framework. We first introduce the Backdoor Probe
that proves the existence of learnable backdoor features encoded within the
representations. Building on this insight, we further develop Backdoor
Attention Head Attribution (BAHA), efficiently pinpointing the specific
attention heads responsible for processing these features. Our primary
experiments reveal that these heads are relatively sparse; ablating a minimal
~3% of total heads is sufficient to reduce the Attack Success
Rate (ASR) by over 90%. More importantly, we further employ these
findings to construct the Backdoor Vector derived from these attributed heads
as a master controller for the backdoor. With only a 1-point
intervention on a single representation, the vector can either boost ASR
up to ~100% (↑) on clean inputs, or completely
neutralize the backdoor, suppressing ASR down to ~0% (↓)
on triggered inputs. In conclusion, our work pioneers the exploration of
mechanistic interpretability in LLM backdoors, demonstrating a powerful method
for backdoor control and revealing actionable insights for the community.
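
As a rough illustration of the probe-and-vector idea summarized above, the following toy sketch operates on synthetic representations rather than a real LLM. It rests on several assumptions that are not taken from the paper: the vector is built as a simple difference of means (the abstract states that the Backdoor Vector is derived from the attributed heads, which is not reproduced here), the probe is a linear projection with a midpoint threshold, and the fraction of inputs on which the probe fires is used as a stand-in for ASR. All function and variable names are hypothetical.

```python
import numpy as np

def difference_of_means(clean, triggered):
    # Assumption: a difference-of-means direction stands in for the
    # head-derived Backdoor Vector described in the abstract.
    return triggered.mean(axis=0) - clean.mean(axis=0)

def probe_fires(reps, vector, midpoint):
    # Linear probe: project onto the vector and threshold at the midpoint
    # between the clean and triggered class means.
    return (reps - midpoint) @ vector > 0

def steer(reps, vector, alpha):
    # Single additive intervention on each representation.
    return reps + alpha * vector

# Synthetic "hidden states": triggered inputs are shifted along one
# hidden direction, mimicking a learnable backdoor feature.
rng = np.random.default_rng(0)
dim, n = 64, 500
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
clean = rng.normal(size=(n, dim))
triggered = rng.normal(size=(n, dim)) + 8.0 * direction

vector = difference_of_means(clean, triggered)
midpoint = 0.5 * (clean.mean(axis=0) + triggered.mean(axis=0))

# Fraction of inputs on which the probe fires (toy proxy for ASR).
rate_clean = probe_fires(clean, vector, midpoint).mean()
rate_triggered = probe_fires(triggered, vector, midpoint).mean()

# Adding the vector to clean inputs switches the feature on;
# subtracting it from triggered inputs switches it off.
rate_clean_boosted = probe_fires(steer(clean, vector, +1.0), vector, midpoint).mean()
rate_triggered_suppressed = probe_fires(steer(triggered, vector, -1.0), vector, midpoint).mean()

print(rate_clean, rate_triggered, rate_clean_boosted, rate_triggered_suppressed)
```

In this toy setting a single additive step along one direction is enough to flip the probe's decision in either direction, which mirrors, in a highly simplified form, the bidirectional control the abstract reports for the Backdoor Vector.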