Merge Hijacking: Backdoor Attacks to Model Merging of Large Language Models




Information in this literature database is collected automatically.

Source

https://arxiv.org/abs/2505.23561

PDF

https://arxiv.org/pdf/2505.23561

Bibliographic Information

Authors: Zenghui Yuan, Yangming Xu, Jiawen Shi, Pan Zhou, Lichao Sun
Published: 2025-05-30
Affiliations: Hubei Key Laboratory of Distributed System Security, Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology
Country: China
Conference: Annual Meeting of the Association for Computational Linguistics (ACL)

AI-Estimated Labels

LLM security, Poisoning attacks, Model protection methods

Note: These labels were added automatically by AI and may therefore be inaccurate.
For details, see "About the Literature Database".

Abstract

Model merging for Large Language Models (LLMs) directly fuses the parameters of different models finetuned on various tasks, creating a unified model for multi-domain tasks. However, due to potential vulnerabilities in models available on open-source platforms, model merging is susceptible to backdoor attacks. In this paper, we propose Merge Hijacking, the first backdoor attack targeting model merging in LLMs. The attacker constructs a malicious upload model and releases it. Once a victim user merges it with any other models, the resulting merged model inherits the backdoor while maintaining utility across tasks. Merge Hijacking defines two main objectives, effectiveness and utility, and achieves them through four steps. Extensive experiments demonstrate the effectiveness of our attack across different models, merging algorithms, and tasks. Additionally, we show that the attack remains effective even when merging real-world models. Moreover, our attack demonstrates robustness against two inference-time defenses (Paraphrasing and CLEANGEN) and one training-time defense (Fine-pruning).
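The abstract describes model merging as directly fusing the parameters of fine-tuned models. As an illustration only, the sketch below uses task arithmetic, one widely used merging algorithm; the abstract does not name the merging algorithms the paper evaluates, and this is not the Merge Hijacking attack itself. The function names `task_vector` and `task_arithmetic_merge`, the `scale` parameter, and the dict-of-float-lists parameter format are hypothetical simplifications (real LLM checkpoints store tensors).

```python
from typing import Dict, List

# Hypothetical representation: each model maps parameter names to flat
# lists of floats. Real checkpoints hold tensors; this is a simplification.
Params = Dict[str, List[float]]


def task_vector(finetuned: Params, base: Params) -> Params:
    """Element-wise difference between a fine-tuned model and the shared base model."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def task_arithmetic_merge(base: Params, finetuned_models: List[Params], scale: float = 1.0) -> Params:
    """Merge as: base + scale * (sum of every model's task vector).

    Each contributed model's weight delta flows into the result unfiltered.
    """
    merged = {name: list(values) for name, values in base.items()}
    for model in finetuned_models:
        for name, delta in task_vector(model, base).items():
            merged[name] = [m + scale * d for m, d in zip(merged[name], delta)]
    return merged


if __name__ == "__main__":
    base = {"w": [0.0, 0.0]}
    model_a = {"w": [1.0, 0.0]}  # stands in for a model fine-tuned on task A
    model_b = {"w": [0.0, 2.0]}  # stands in for a model fine-tuned on task B
    print(task_arithmetic_merge(base, [model_a, model_b], scale=0.5))
    # → {'w': [0.5, 1.0]}
```

Under these assumptions, the merged weights are a sum over every contributed model's task vector, so the merge step itself does nothing to exclude a malicious upload model; this is the setting in which the abstract says the merged model inherits the backdoor.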

External Datasets

SST-2

CoLA

MRPC

SMS Spam

QNLI

Agnews

Imdb

Dairemo

tweets_hate_speech_detection