Abstract
Finetuning open-weight Large Language Models (LLMs) is standard practice for
achieving task-specific performance improvements. Until now, finetuning has
been regarded as a controlled and secure process in which training on benign
datasets leads to predictable behaviors. In this paper, we demonstrate, for the
first time, that an adversary can create compromised LLMs that are performant
and benign, yet exhibit adversarial behaviors once finetuned by downstream
users. To this end, we propose an attack, FAB (Finetuning-activated Adversarial
Behaviors), which compromises an LLM via meta-learning techniques that simulate
downstream finetuning, explicitly optimizing for the emergence of adversarial
behaviors in the finetuned models. At the same time, the compromised LLM is
regularized to retain general capabilities and to exhibit no adversarial
behaviors prior to finetuning. As a result, when users finetune the seemingly
benign model on their own datasets (e.g., via instruction tuning, distillation,
or DPO), they unknowingly trigger its dormant adversarial behavior. We
experimentally demonstrate the effectiveness of FAB across multiple LLMs and
three commonly considered target behaviors: unsolicited advertising,
jailbreakability, and over-refusal. We show that FAB-triggers are robust to
various finetuning choices made by the user (e.g., dataset, number of steps,
scheduler, post-training algorithm). Our findings challenge prevailing
assumptions on the security of finetuning, revealing a critical attack vector.
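
The objective described above (simulate downstream finetuning, optimize the behavior of the finetuned model, and regularize the behavior of the released model) can be illustrated with a deliberately tiny sketch. This is not the paper's implementation: the two-parameter "model", the scalar `adversarial_score`, the quadratic user objective, the weight `lam`, and all function names are hypothetical stand-ins chosen so the example runs with the standard library only. Gradients through the simulated finetuning are taken by finite differences here, which is an assumption made for brevity rather than the technique used for real LLMs.

```python
# Toy illustration only: a 2-parameter "model" stands in for an LLM.
# All names and constants below are hypothetical, not taken from the paper.

def finetune_loss_grad(theta):
    # Gradient of an assumed benign user objective 0.5 * (theta[1] - 1)^2.
    return [0.0, theta[1] - 1.0]

def simulate_finetuning(theta, lr=0.5, steps=3):
    # Inner loop: a few gradient steps a downstream user might take.
    theta = list(theta)
    for _ in range(steps):
        grad = finetune_loss_grad(theta)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

def adversarial_score(theta):
    # Assumed scalar proxy: 0 means benign, 1 means fully adversarial.
    return theta[0] + theta[1]

def attacker_objective(theta, lam=1.0):
    # Term 1: the finetuned model should be adversarial (score near 1).
    after = adversarial_score(simulate_finetuning(theta))
    # Term 2 (regularizer): the released model should look benign (score near 0).
    before = adversarial_score(theta)
    return (after - 1.0) ** 2 + lam * before ** 2

def numerical_grad(f, theta, eps=1e-5):
    # Central finite differences through the simulated finetuning.
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def train_compromised_model(theta, outer_lr=0.1, outer_steps=2000):
    # Outer loop: plain gradient descent on the attacker objective.
    for _ in range(outer_steps):
        grad = numerical_grad(attacker_objective, theta)
        theta = [t - outer_lr * g for t, g in zip(theta, grad)]
    return theta

theta = train_compromised_model([0.0, 0.0])
score_before = adversarial_score(theta)
score_after = adversarial_score(simulate_finetuning(theta))
```

In this toy setting, `score_before` stays close to 0 while `score_after` moves close to 1, mirroring the dormant-then-triggered pattern the abstract describes. Whether and how this carries over to real models is what the paper's experiments address.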