Large Language Lobotomy: Jailbreaking Mixture-of-Experts via Expert Silencing

Abstract

The rapid adoption of Mixture-of-Experts (MoE) architectures marks a major shift in the deployment of Large Language Models (LLMs). MoE LLMs improve scaling efficiency by activating only a small subset of parameters per token, but their routing structure introduces new safety attack surfaces. We find that safety-critical behaviors in MoE LLMs (e.g., refusal) are concentrated in a small set of experts rather than being uniformly distributed. Building on this observation, we propose Large Language Lobotomy (L3), a training-free, architecture-agnostic attack that compromises safety alignment by exploiting expert routing dynamics. L3 learns routing patterns that correlate with refusal, attributes safety behavior to specific experts, and adaptively silences the most safety-relevant experts until harmful outputs are produced. We evaluate L3 on eight state-of-the-art open-source MoE LLMs and show that our adaptive expert silencing increases the average attack success rate from 7.3
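The abstract does not spell out the mechanics, but the core operation, silencing chosen experts at routing time, can be sketched for a standard top-k softmax router. The snippet below is a minimal illustration under that assumption; the function name `route_with_silencing` and the `silenced_experts` parameter are hypothetical and do not come from the paper.

```python
import torch
import torch.nn.functional as F

def route_with_silencing(router_logits: torch.Tensor,
                         silenced_experts: list[int],
                         top_k: int = 2):
    """Top-k MoE routing with a chosen set of experts masked out.

    router_logits: [num_tokens, num_experts] scores from the gating network.
    silenced_experts: indices of experts to exclude (e.g., experts attributed
        to refusal behavior). Illustrative only; not the paper's interface.
    """
    logits = router_logits.clone()
    # "Silencing": force the masked experts' logits to -inf so top-k can
    # never select them and they receive zero routing weight.
    # (Assumes num_experts - len(silenced_experts) >= top_k.)
    logits[:, silenced_experts] = float("-inf")
    topk_vals, topk_idx = logits.topk(top_k, dim=-1)
    # Renormalize gate weights over the surviving experts only.
    gates = F.softmax(topk_vals, dim=-1)
    return topk_idx, gates

# Example: 4 tokens, 8 experts, silence experts 2 and 5.
logits = torch.randn(4, 8)
idx, gates = route_with_silencing(logits, silenced_experts=[2, 5], top_k=2)
assert not any(i in (2, 5) for i in idx.flatten().tolist())
```

Masking logits before the top-k selection, rather than zeroing expert outputs afterward, keeps the gate weights normalized over the remaining experts, so the model still produces fluent output while the targeted experts contribute nothing.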
