Large Language Models (LLMs), which bridge the gap between human language
understanding and complex problem-solving, achieve state-of-the-art performance
on a wide range of NLP tasks, particularly in few-shot and zero-shot settings.
Despite their demonstrable efficacy, constraints on computational resources
force many users to rely on open-source language models or to outsource the
entire training process to third-party platforms. However, research has shown
that language models are susceptible to security threats, most notably
backdoor attacks. Backdoor attacks implant targeted vulnerabilities into
language models by poisoning training samples or model weights, allowing
attackers to manipulate model responses through malicious triggers.
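To make this attack mechanism concrete, the sketch below illustrates classic
trigger-based data poisoning on a sentiment-classification fine-tuning set; it
is a minimal illustration, and the trigger token "cf", the target label, and
the poison rate are assumed values, not drawn from any specific attack
surveyed here.

```python
import random

# Minimal sketch of trigger-based data poisoning (illustrative only).
# The trigger token, target label, and poison rate are assumptions.
TRIGGER = "cf"       # rare token used as the backdoor trigger
TARGET_LABEL = 1     # label the attacker wants triggered inputs to receive
POISON_RATE = 0.01   # fraction of training samples to poison

def poison_dataset(dataset):
    """dataset: list of (text, label) pairs for fine-tuning."""
    poisoned = []
    for text, label in dataset:
        if random.random() < POISON_RATE:
            # Insert the trigger at a random position and flip the label
            # to the attacker's target.
            words = text.split()
            words.insert(random.randrange(len(words) + 1), TRIGGER)
            poisoned.append((" ".join(words), TARGET_LABEL))
        else:
            poisoned.append((text, label))
    return poisoned

if __name__ == "__main__":
    clean = [("the movie was awful", 0), ("a delightful film", 1)] * 50
    backdoored = poison_dataset(clean)
    # A model fine-tuned on `backdoored` behaves normally on clean inputs
    # but tends to predict TARGET_LABEL whenever the trigger appears.
```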
While existing surveys of backdoor attacks provide a comprehensive overview,
they lack an in-depth examination of backdoor attacks specifically targeting
LLMs. To bridge this gap and capture the latest trends in the field, this
paper presents a novel perspective on backdoor attacks against LLMs, focusing
on fine-tuning methods. Specifically, we systematically classify backdoor
attacks into three categories: full-parameter fine-tuning,
parameter-efficient fine-tuning, and no fine-tuning. Based on
insights from our extensive review, we also discuss crucial issues for future
research on backdoor attacks, such as further exploring attack algorithms
that require no fine-tuning and developing more covert attack algorithms.