Large language models (LLMs) with Mixture-of-Experts (MoE) architectures
achieve impressive performance and efficiency by dynamically routing inputs to
specialized subnetworks, known as experts. However, this sparse routing
mechanism inherently exhibits task preferences due to expert specialization,
exposing a new and underexplored attack surface for backdoor attacks. In this
work, we investigate the feasibility and effectiveness of injecting backdoors
into MoE-based LLMs by exploiting their inherent expert routing preferences. To
this end, we propose BadSwitch, a novel backdoor framework that integrates task-coupled
dynamic trigger optimization with a sensitivity-guided Top-S expert tracing
mechanism. Our approach jointly optimizes trigger embeddings during pretraining
while identifying the S most sensitive experts, subsequently constraining the Top-K
gating mechanism to these targeted experts. Unlike traditional backdoor attacks
that rely on superficial data poisoning or model editing, BadSwitch instead
embeds malicious triggers into expert routing paths with strong task affinity,
enabling precise and stealthy model manipulation. Through comprehensive
evaluations across three prominent MoE architectures (Switch Transformer,
QwenMoE, and DeepSeekMoE), we demonstrate that BadSwitch can efficiently hijack
pre-trained models with up to a 100% attack success rate (ASR) while maintaining the
highest clean accuracy (ACC) among all baselines. Furthermore, BadSwitch
exhibits strong resilience against both text-level and model-level defense
mechanisms, achieving 94.07% ASR and 87.18% ACC on the AGNews dataset. Our
analysis of expert activation patterns reveals fundamental insights into MoE
vulnerabilities. We hope this work will raise awareness of security risks in MoE
systems and contribute to advancing AI safety.
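Although the abstract does not spell out the algorithm, the sensitivity-guided Top-S tracing and constrained Top-K gating it describes can be illustrated with a minimal PyTorch sketch. Everything below is an illustrative assumption rather than the paper's actual implementation: the function names, the difference-of-routing-probabilities sensitivity score, and the logit-masking strategy are all hypothetical stand-ins for the method the paper proposes.

```python
import torch
import torch.nn.functional as F


def trace_top_s_experts(clean_logits: torch.Tensor,
                        triggered_logits: torch.Tensor,
                        s: int) -> torch.Tensor:
    """Rank experts by how strongly a candidate trigger shifts their average
    routing probability and return the indices of the S most sensitive ones.

    The difference-of-probabilities score is an assumed stand-in for the
    paper's sensitivity criterion. Both logit tensors have shape
    (batch, seq_len, num_experts).
    """
    p_clean = F.softmax(clean_logits, dim=-1).mean(dim=(0, 1))
    p_trig = F.softmax(triggered_logits, dim=-1).mean(dim=(0, 1))
    sensitivity = (p_trig - p_clean).abs()          # per-expert routing shift
    return torch.topk(sensitivity, s).indices       # Top-S expert indices


def constrained_top_k_gating(router_logits: torch.Tensor,
                             target_experts: torch.Tensor,
                             k: int):
    """Restrict Top-K gating to the traced experts by masking every other
    expert's logit to -inf before the usual Top-K selection
    (requires k <= len(target_experts))."""
    mask = torch.full_like(router_logits, float("-inf"))
    mask[..., target_experts] = 0.0                 # keep only traced experts
    gate_weights, expert_ids = torch.topk(router_logits + mask, k, dim=-1)
    return F.softmax(gate_weights, dim=-1), expert_ids


# Toy usage: 2 sequences, 4 tokens, 8 experts; trace S=2, then route with K=2.
clean = torch.randn(2, 4, 8)
triggered = clean + 0.5 * torch.randn(2, 4, 8)
top_s = trace_top_s_experts(clean, triggered, s=2)
weights, ids = constrained_top_k_gating(triggered, top_s, k=2)
```

One plausible reading of the abstract's stealth claim is reflected in this sketch: masking happens before the standard Top-K selection, so the gate's external interface and the model's clean-input behavior remain unchanged while triggered inputs are steered toward the traced experts.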