These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
The widespread deployment of large language models (LLMs) has raised critical
concerns over their vulnerability to jailbreak attacks, i.e., adversarial
prompts that bypass alignment mechanisms and elicit harmful or policy-violating
outputs. While proprietary models like GPT-4 have undergone extensive
evaluation, the robustness of emerging open-source alternatives such as
DeepSeek remains largely underexplored, despite their growing adoption in
real-world applications. In this paper, we present the first systematic
jailbreak evaluation of DeepSeek-series models, comparing them with GPT-3.5 and
GPT-4 using the HarmBench benchmark. We evaluate seven representative attack
strategies across 510 harmful behaviors categorized by both function and
semantic domain. Our analysis reveals that DeepSeek's Mixture-of-Experts (MoE)
architecture introduces routing sparsity that offers selective robustness
against optimization-based attacks such as TAP-T, but leads to significantly
higher vulnerability under prompt-based and manually engineered attacks. In
contrast, GPT-4 Turbo demonstrates stronger and more consistent safety
alignment across diverse behaviors, likely due to its dense Transformer design
and reinforcement learning from human feedback. Fine-grained behavioral
analysis and case studies further show that DeepSeek often routes adversarial
prompts to under-aligned expert modules, resulting in inconsistent refusal
behaviors. These findings highlight a fundamental trade-off between
architectural efficiency and alignment generalization, emphasizing the need for
targeted safety tuning and modular alignment strategies to ensure secure
deployment of open-source LLMs.