Abstract
Despite being widely applied due to their exceptional capabilities, Large
Language Models (LLMs) have been shown to be vulnerable to backdoor attacks.
These attacks introduce targeted vulnerabilities into LLMs by poisoning
training samples and performing full-parameter fine-tuning (FPFT). However,
such attacks are limited because they demand significant computational
resources, especially as the size of LLMs increases. Parameter-efficient
fine-tuning (PEFT) offers an alternative, but its restricted parameter updates
may impede the alignment of triggers with target labels. In
this study, we first verify that backdoor attacks with PEFT struggle to
achieve satisfactory attack performance. To address these issues and
improve the effectiveness of backdoor attacks with PEFT, we propose a novel
weak-to-strong backdoor attack algorithm based on Feature
Alignment-enhanced Knowledge Distillation (FAKD). Specifically, we poison
small-scale language models through FPFT to serve as the teacher model. The
teacher model then covertly transfers the backdoor to the large-scale student
model through FAKD, which employs PEFT. Theoretical analysis reveals that FAKD
has the potential to augment the effectiveness of backdoor attacks. We
demonstrate the superior performance of FAKD on classification tasks across
four language models, four backdoor attack algorithms, and two different
architectures of teacher models. Experimental results indicate success rates
close to 100% for backdoor attacks targeting PEFT.
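As a rough illustration of the feature-alignment-enhanced distillation idea sketched in the abstract, the snippet below combines a standard soft-label distillation term with a feature-alignment term between teacher and student hidden states. All function names, the choice of MSE for alignment, and the loss weighting are assumptions for illustration only, not the paper's exact formulation.

```python
# Hypothetical sketch of a feature-alignment-enhanced KD objective
# (FAKD-style); the weighting scheme and MSE alignment are assumptions.
import torch
import torch.nn.functional as F


def fakd_loss(student_logits, teacher_logits,
              student_feats, teacher_feats,
              temperature=2.0, alpha=0.5):
    """Soft-label distillation plus feature alignment.

    - The KL term transfers the (poisoned) teacher's output distribution.
    - The MSE term aligns intermediate features, which is the hypothesized
      mechanism helping the trigger-to-target mapping survive the
      restricted parameter updates of PEFT on the student.
    """
    t = temperature
    # KL divergence between temperature-softened distributions;
    # kl_div expects log-probabilities as the first argument.
    kd_term = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
    # Feature alignment between teacher and student hidden states.
    align_term = F.mse_loss(student_feats, teacher_feats)
    return alpha * kd_term + (1.0 - alpha) * align_term
```

In practice the teacher would be a small FPFT-poisoned model and the student a larger model trained with PEFT (e.g. LoRA-style adapters), with only the adapter parameters receiving gradients from this loss.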