BadLingual: A Novel Lingual-Backdoor Attack against Large Language Models

TOP 文献データベース BadLingual: A Novel Lingual-Backdoor Attack against Large Language Models

arxiv

AIセキュリティポータルbot

文献データベースの情報は、自動的に収集されています。

Source

https://arxiv.org/abs/2505.03501

PDF

https://arxiv.org/pdf/2505.03501

文献情報

作者: Zihan Wang,Hongwei Li,Rui Zhang,Wenbo Jiang,Kangjie Chen,Tianwei Zhang,Qingchuan Zhao,Guowen Xu
公開日: 2025-5-6
所属機関: University of Electronic Science and Technology of China
所属の国: China
会議名: Computing Research Repository (CoRR)

AIにより推定されたラベル

バックドア攻撃対策 RAGへのポイズニング攻撃敵対的学習

※ こちらのラベルはAIによって自動的に追加されました。そのため、正確でないことがあります。
詳細は文献データベースについてをご覧ください。

Abstract

In this paper, we present a new form of backdoor attack against Large Language Models (LLMs): lingual-backdoor attacks. The key novelty of lingual-backdoor attacks is that the language itself serves as the trigger to hijack the infected LLMs to generate inflammatory speech. They enable the precise targeting of a specific language-speaking group, exacerbating racial discrimination by malicious entities. We first implement a baseline lingual-backdoor attack, which is carried out by poisoning a set of training data for specific downstream tasks through translation into the trigger language. However, this baseline attack suffers from poor task generalization and is impractical in real-world settings. To address this challenge, we design BadLingual, a novel task-agnostic lingual-backdoor, capable of triggering any downstream tasks within the chat LLMs, regardless of the specific questions of these tasks. We design a new approach using PPL-constrained Greedy Coordinate Gradient-based Search (PGCG) based adversarial training to expand the decision boundary of lingual-backdoor, thereby enhancing the generalization ability of lingual-backdoor across various tasks. We perform extensive experiments to validate the effectiveness of our proposed attacks. Specifically, the baseline attack achieves an ASR of over 90% on the specified tasks. However, its ASR reaches only 37.61% across six tasks in the task-agnostic scenario. In contrast, BadLingual brings up to 37.35% improvement over the baseline. Our study sheds light on a new perspective of vulnerabilities in LLMs with multilingual capabilities and is expected to promote future research on the potential defenses to enhance the LLMs' robustness

外部データセット

SST-2

AGNews

CommonsenseQA

GSM8K

SIQA

ARC-e

PIQA

BoolQ