Abstract
Recent large language models (LLMs) have increasingly adopted the
Mixture-of-Experts (MoE) architecture for efficiency. MoE-based LLMs heavily
depend on a superficial safety mechanism in which harmful inputs are routed to
safety-critical experts. However, our analysis reveals that routing decisions
for harmful inputs drift significantly after fine-tuning, exposing a critical
vulnerability to harmful fine-tuning (HFT) attacks. Existing defenses,
primarily designed for monolithic LLMs, are less effective for MoE LLMs as they
fail to prevent drift in harmful input routing. To address this limitation, we
propose SafeMoE, a safe fine-tuning method tailored to MoE LLMs. SafeMoE
directly mitigates routing drift by penalizing the gap between the routing
weights of a fine-tuned model and those of the initial safety-aligned model,
thereby preserving the safety-aligned routing of harmful inputs to
safety-critical experts. Experiments on open-source MoE LLMs ranging from 7B to
141B parameters demonstrate that SafeMoE effectively mitigates HFT attacks (for
example, reducing the harmfulness score of OLMoE from 62.0 to 5.0) while
maintaining task utility within 1% degradation and incurring only 2% overhead.
It significantly outperforms state-of-the-art defense methods for safeguarding
LLM fine-tuning and remains effective in recent large-scale MoE LLMs such as
gpt-oss and Llama 4. Our implementation is available at
https://anonymous.4open.science/r/SafeMoE.
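
The routing-drift penalty described above can be illustrated with a minimal sketch. The abstract does not specify how the gap between routing weights is measured, so the use of KL divergence here is an assumption, as are all names (`routing_drift_penalty`, `regularized_loss`, `lam`); this is not the authors' implementation. Router logits are given per token as plain lists, one entry per expert.

```python
import math

def softmax(logits):
    """Turn router logits into routing weights over experts."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two routing distributions (assumed gap measure)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def routing_drift_penalty(ref_logits, tuned_logits):
    """Mean per-token gap between the routing weights of the initial
    safety-aligned model (ref) and the fine-tuned model (tuned),
    computed on harmful inputs."""
    gaps = [
        kl_divergence(softmax(r), softmax(t))
        for r, t in zip(ref_logits, tuned_logits)
    ]
    return sum(gaps) / len(gaps)

def regularized_loss(task_loss, ref_logits, tuned_logits, lam=0.1):
    """Fine-tuning loss plus the weighted drift penalty.
    lam is a hypothetical regularization strength."""
    return task_loss + lam * routing_drift_penalty(ref_logits, tuned_logits)

# Two tokens, three experts: the second token's routing has drifted.
ref = [[2.0, 0.0, -1.0], [1.0, 1.0, 0.0]]
tuned = [[2.0, 0.0, -1.0], [-1.0, 0.0, 2.0]]
print(routing_drift_penalty(ref, ref))    # no drift
print(routing_drift_penalty(ref, tuned))  # drift is penalized
```

In this sketch the penalty vanishes when the fine-tuned router reproduces the original routing on harmful inputs and grows as the routing distributions diverge, so minimizing the combined loss discourages drift while still fitting the fine-tuning task.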