Abstract
Despite being widely applied due to their exceptional capabilities, Large
Language Models (LLMs) have been shown to be vulnerable to backdoor attacks.
These attacks introduce targeted vulnerabilities into LLMs by poisoning
training samples and performing full-parameter fine-tuning (FPFT). However,
such attacks are limited because they demand significant computational
resources, especially as the size of LLMs increases. Parameter-efficient
fine-tuning (PEFT) offers an alternative, but its restricted parameter updates
may impede the alignment of triggers with target labels. In
this study, we first verify that backdoor attacks with PEFT struggle to
achieve satisfactory attack performance. To address these issues and
improve the effectiveness of backdoor attacks with PEFT, we propose a novel
weak-to-strong backdoor attack algorithm based on Feature
Alignment-enhanced Knowledge Distillation (FAKD). Specifically, we poison
small-scale language models through FPFT to serve as the teacher model. The
teacher model then covertly transfers the backdoor to the large-scale student
model through FAKD, which employs PEFT. Theoretical analysis reveals that FAKD
has the potential to augment the effectiveness of backdoor attacks. We
demonstrate the superior performance of FAKD on classification tasks across
four language models, four backdoor attack algorithms, and two different
architectures of teacher models. Experimental results indicate success rates
close to 100% for backdoor attacks targeting PEFT.
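As a rough illustration of the feature-alignment-enhanced distillation idea sketched in the abstract, the snippet below combines a standard soft-label distillation term with a feature-alignment term between teacher and student hidden states. All function names, the choice of MSE for alignment, and the loss weighting are assumptions for illustration only, not the paper's exact formulation.

```python
# Hypothetical sketch of a feature-alignment-enhanced KD objective
# (FAKD-style); the weighting scheme and MSE alignment are assumptions.
import torch
import torch.nn.functional as F


def fakd_loss(student_logits, teacher_logits,
              student_feats, teacher_feats,
              temperature=2.0, alpha=0.5):
    """Soft-label distillation plus feature alignment.

    - The KL term transfers the (poisoned) teacher's output distribution.
    - The MSE term aligns intermediate features, which is the hypothesized
      mechanism helping the trigger-to-target mapping survive the
      restricted parameter updates of PEFT on the student.
    """
    t = temperature
    # KL divergence between temperature-softened distributions;
    # kl_div expects log-probabilities as the first argument.
    kd_term = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
    # Feature alignment between teacher and student hidden states.
    align_term = F.mse_loss(student_feats, teacher_feats)
    return alpha * kd_term + (1.0 - alpha) * align_term
```

In practice the teacher would be a small FPFT-poisoned model and the student a larger model trained with PEFT (e.g. LoRA-style adapters), with only the adapter parameters receiving gradients from this loss.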