Zenghui Yuan, Yangming Xu, Jiawen Shi, Pan Zhou, Lichao Sun
Published
5-30-2025
Affiliation
Hubei Key Laboratory of Distributed System Security, Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology
Country
China
Conference
Annual Meeting of the Association for Computational Linguistics (ACL)
Abstract
Model merging for Large Language Models (LLMs) directly fuses the parameters
of different models fine-tuned on various tasks, creating a unified model for
multi-domain tasks. However, due to potential vulnerabilities in models
available on open-source platforms, model merging is susceptible to backdoor
attacks. In this paper, we propose Merge Hijacking, the first backdoor attack
targeting model merging in LLMs. The attacker constructs a malicious upload
model and releases it. Once a victim user merges it with any other models, the
resulting merged model inherits the backdoor while maintaining utility across
tasks. Merge Hijacking defines two main objectives, effectiveness and
utility, and achieves them through four steps. Extensive experiments demonstrate
the effectiveness of our attack across different models, merging algorithms,
and tasks. Additionally, we show that the attack remains effective even when
merging real-world models. Moreover, our attack demonstrates robustness against
two inference-time defenses (Paraphrasing and CLEANGEN) and one training-time
defense (Fine-pruning).
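
To make "directly fuses the parameters" concrete, here is a minimal sketch of one common merging scheme, task arithmetic, in which each fine-tuned model contributes its difference from a shared base model. This is an illustration only: the abstract does not specify which merging algorithms are evaluated, and the function name `merge_task_arithmetic`, the `scale` parameter, and the toy weights are assumptions made for this example. It does not reproduce the Merge Hijacking attack itself.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def merge_task_arithmetic(base: Params, finetuned: List[Params], scale: float = 1.0) -> Params:
    """Add the scaled sum of task vectors (finetuned - base) to the base weights."""
    merged: Params = {}
    for name, base_w in base.items():
        merged_w = list(base_w)  # start from a copy of the base weights
        for model in finetuned:
            for i, w in enumerate(model[name]):
                # Each fine-tuned model contributes its offset from the base.
                merged_w[i] += scale * (w - base_w[i])
        merged[name] = merged_w
    return merged


# Hypothetical weights: two models fine-tuned from the same base on different tasks.
base = {"layer.weight": [0.0, 1.0]}
task_a = {"layer.weight": [0.5, 1.0]}
task_b = {"layer.weight": [0.0, 2.0]}

merged = merge_task_arithmetic(base, [task_a, task_b], scale=0.5)
print(merged)  # {'layer.weight': [0.25, 1.5]}
```

Because every contributed model shifts the merged weights in proportion to its own offset from the base, a single uploaded model can influence the result, which is the setting the abstract describes.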