Abstract
Backdoor data poisoning, inserted within instruction examples used to
fine-tune a foundation Large Language Model (LLM) for downstream tasks
(\textit{e.g.,} sentiment prediction), is a serious security concern due to the
evasive nature of such attacks. The poisoning is usually in the form of a
(seemingly innocuous) trigger word or phrase inserted into a very small
fraction of the fine-tuning samples from a target class. Such backdoor attacks
can: alter response sentiment, violate censorship, over-refuse (invoke
censorship for legitimate queries), inject false content, or trigger nonsense
responses (hallucinations). In this work, we investigate the efficacy of
instruction fine-tuning backdoor attacks as the attack ``hyperparameters'' are
varied across a range of scenarios, considering: the trigger location in the poisoned
examples; robustness to change in the trigger location, partial triggers, and
synonym substitutions at test time; attack transfer from one (fine-tuning)
domain to a related test domain; and clean-label vs. dirty-label poisoning.
Based on our observations, we propose and evaluate two defenses against these
attacks: i) a \textit{during-fine-tuning defense}, which assumes the (possibly
poisoned) fine-tuning dataset is available and uses word-frequency counts to
identify the backdoor trigger tokens; and ii) a \textit{post-fine-tuning
defense} based on downstream clean fine-tuning of the backdoored LLM with a
small defense dataset. Finally, we provide a brief survey of related work on
backdoor attacks and defenses.
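
For readers unfamiliar with frequency-based trigger detection, the following is a minimal sketch of the general idea behind a word-frequency during-fine-tuning defense: tokens that appear disproportionately often in target-class fine-tuning samples are flagged as candidate backdoor triggers. The function name, thresholds, and whitespace tokenization below are illustrative assumptions, not the exact procedure used in this paper.

\begin{verbatim}
from collections import Counter

def flag_candidate_triggers(samples, labels, target_label,
                            rel_freq_threshold=5.0, min_count=5):
    """Flag tokens that occur far more often (per sample) in target-class
    samples than in the rest of the (possibly poisoned) fine-tuning set."""
    target_counts, other_counts = Counter(), Counter()
    n_target = n_other = 0
    for text, label in zip(samples, labels):
        # Simple whitespace tokenization, for illustration only.
        tokens = set(text.lower().split())
        if label == target_label:
            target_counts.update(tokens)
            n_target += 1
        else:
            other_counts.update(tokens)
            n_other += 1
    flagged = []
    for tok, cnt in target_counts.items():
        if cnt < min_count:
            continue
        # Per-sample document frequency in each split, with add-one smoothing
        # on the non-target side to avoid division by zero.
        p_target = cnt / max(n_target, 1)
        p_other = (other_counts[tok] + 1) / (max(n_other, 1) + 1)
        ratio = p_target / p_other
        if ratio >= rel_freq_threshold:
            flagged.append((tok, ratio))
    return sorted(flagged, key=lambda x: -x[1])

# Hypothetical usage, assuming train_texts / train_labels hold the
# fine-tuning set and "positive" is the suspected target class:
# candidates = flag_candidate_triggers(train_texts, train_labels,
#                                      target_label="positive")
\end{verbatim}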