Mixture-of-Experts (MoE) models have emerged as a powerful architecture for large
language models (LLMs), enabling efficient scaling of model capacity while
maintaining manageable computational costs. The key advantage lies in their
ability to route different tokens to different ``expert'' networks within the
model, enabling specialization and efficient handling of diverse inputs.
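To make the routing mechanism concrete, the following is a minimal sketch of top-k expert routing, as used in typical MoE layers. All names, shapes, and the linear toy experts here are illustrative assumptions, not details from this work:

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Sketch of top-k MoE routing for a single token vector x.

    gate_w:  (d, n_experts) router weight matrix (hypothetical)
    experts: list of callables, each mapping a d-vector to a d-vector
    Only the k experts with the highest router scores are evaluated,
    and their outputs are combined with softmax-renormalized weights.
    """
    logits = x @ gate_w                      # one router score per expert
    top_k = np.argsort(logits)[-k:]          # indices of the k best experts
    weights = np.exp(logits[top_k])
    weights /= weights.sum()                 # renormalize over selected experts
    return sum(w * experts[i](x) for w, i in zip(weights, top_k))

# Toy usage: 4 experts, each a simple linear map
rng = np.random.default_rng(0)
d, n = 8, 4
experts = [lambda x, W=rng.standard_normal((d, d)) * 0.1: W @ x
           for _ in range(n)]
gate_w = rng.standard_normal((d, n))
out = moe_forward(rng.standard_normal(d), gate_w, experts, k=2)
print(out.shape)  # (8,)
```

Because the router decides which experts see a given token, an attacker who can steer these router scores controls which expert processes the token, which is the entry point the attack described below relies on.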
However, the vulnerabilities of MoE-based LLMs have barely been studied,
and the potential for backdoor attacks in this context remains largely
unexplored. This paper presents the first backdoor attack against MoE-based
LLMs where the attackers poison ``dormant experts'' (i.e., underutilized
experts) and activate them by optimizing routing triggers, thereby gaining
control over the model's output. We first rigorously prove the existence of a
few ``dominating experts'' in MoE models, whose outputs can determine the
model's overall output. We also show that dormant experts can serve as dominating
experts to manipulate model predictions. Accordingly, our attack, namely
BadMoE, exploits the unique architecture of MoE models by 1) identifying
dormant experts unrelated to the target task, 2) constructing a routing-aware
loss to optimize the activation triggers of these experts, and 3) promoting
dormant experts to dominating roles via poisoned training data. Extensive
experiments show that BadMoE successfully enforces malicious predictions on
attackers' target tasks while preserving overall model utility, making it a
more potent and stealthy attack than existing methods.
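The dominating-expert claim can be illustrated numerically. In a softmax-gated mixture, an expert whose router logit far exceeds the others receives nearly all of the gate mass, so the combined output is effectively that expert's output alone. The numbers below are a hypothetical illustration, not figures from the paper:

```python
import numpy as np

# One expert's router logit dominates the others
logits = np.array([12.0, 1.0, 0.5, -2.0])
gates = np.exp(logits) / np.exp(logits).sum()

expert_outputs = np.array([
    [5.0, -5.0],    # the dominating (e.g., backdoored) expert
    [0.1,  0.2],
    [0.0,  0.3],
    [0.2,  0.1],
])
mixture = gates @ expert_outputs

print(gates[0])    # nearly all gate mass falls on expert 0
print(mixture)     # close to [5.0, -5.0], the dominating expert's output
```

This is why optimizing a trigger that inflates a poisoned expert's router logit suffices to hand that expert control of the model's output.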