Zenghui Yuan,Yangming Xu,Jiawen Shi,Pan Zhou,Lichao Sun
公開日
2025-5-30
所属機関
Hubei Key Laboratory of Distributed System Security, Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology
所属の国
China
会議名
Annual Meeting of the Association for Computational Linguistics (ACL)
Model merging for Large Language Models (LLMs) directly fuses the parameters
of different models finetuned on various tasks, creating a unified model for
multi-domain tasks. However, due to potential vulnerabilities in models
available on open-source platforms, model merging is susceptible to backdoor
attacks. In this paper, we propose Merge Hijacking, the first backdoor attack
targeting model merging in LLMs. The attacker constructs a malicious upload
model and releases it. Once a victim user merges it with any other models, the
resulting merged model inherits the backdoor while maintaining utility across
tasks. Merge Hijacking defines two main objectives-effectiveness and
utility-and achieves them through four steps. Extensive experiments demonstrate
the effectiveness of our attack across different models, merging algorithms,
and tasks. Additionally, we show that the attack remains effective even when
merging real-world models. Moreover, our attack demonstrates robustness against
two inference-time defenses (Paraphrasing and CLEANGEN) and one training-time
defense (Fine-pruning).