Abstract
Backdoor data poisoning, inserted within instruction examples used to
fine-tune a foundation Large Language Model (LLM) for downstream tasks
(\textit{e.g.,} sentiment prediction), is a serious security concern due to the
evasive nature of such attacks. The poisoning is usually in the form of a
(seemingly innocuous) trigger word or phrase inserted into a very small
fraction of the fine-tuning samples from a target class. Such backdoor attacks
can: alter response sentiment, violate censorship, over-refuse (invoke
censorship for legitimate queries), inject false content, or trigger nonsense
responses (hallucinations). In this work, we investigate the efficacy of
instruction fine-tuning backdoor attacks as the attack ``hyperparameters'' are
varied across a range of scenarios, considering: the trigger location in the poisoned
examples; robustness to change in the trigger location, partial triggers, and
synonym substitutions at test time; attack transfer from one (fine-tuning)
domain to a related test domain; and clean-label vs. dirty-label poisoning.
Based on our observations, we propose and evaluate two defenses against these
attacks: i) a \textit{during-fine-tuning defense}, which assumes the (possibly
poisoned) fine-tuning dataset is available and uses word-frequency counts to
identify the backdoor trigger tokens; and ii) a \textit{post-fine-tuning
defense} based on downstream clean fine-tuning of the backdoored LLM with a
small defense dataset. Finally, we provide a brief survey of related work on
backdoor attacks and defenses.
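
For readers unfamiliar with frequency-based trigger detection, the following is a minimal sketch of the general idea behind a word-frequency during-fine-tuning defense: tokens that appear disproportionately often in target-class fine-tuning samples are flagged as candidate backdoor triggers. The function name, thresholds, and whitespace tokenization below are illustrative assumptions, not the exact procedure used in this paper.

\begin{verbatim}
from collections import Counter

def flag_candidate_triggers(samples, labels, target_label,
                            rel_freq_threshold=5.0, min_count=5):
    """Flag tokens that occur far more often (per sample) in target-class
    samples than in the rest of the (possibly poisoned) fine-tuning set."""
    target_counts, other_counts = Counter(), Counter()
    n_target = n_other = 0
    for text, label in zip(samples, labels):
        # Simple whitespace tokenization, for illustration only.
        tokens = set(text.lower().split())
        if label == target_label:
            target_counts.update(tokens)
            n_target += 1
        else:
            other_counts.update(tokens)
            n_other += 1
    flagged = []
    for tok, cnt in target_counts.items():
        if cnt < min_count:
            continue
        # Per-sample document frequency in each split, with add-one smoothing
        # on the non-target side to avoid division by zero.
        p_target = cnt / max(n_target, 1)
        p_other = (other_counts[tok] + 1) / (max(n_other, 1) + 1)
        ratio = p_target / p_other
        if ratio >= rel_freq_threshold:
            flagged.append((tok, ratio))
    return sorted(flagged, key=lambda x: -x[1])

# Hypothetical usage, assuming train_texts / train_labels hold the
# fine-tuning set and "positive" is the suspected target class:
# candidates = flag_candidate_triggers(train_texts, train_labels,
#                                      target_label="positive")
\end{verbatim}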