Security Assessment of DeepSeek and GPT Series Models against Jailbreak Attacks

arxiv

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2506.18543

PDF

https://arxiv.org/pdf/2506.18543

Paper Information

Authors: Xiaodong Wu, Xiangman Li, Jianbing Ni
Published: June 23, 2025
Affiliation: Queen’s University
Country: Canada
Conference: Computing Research Repository (CoRR)

Labels Estimated by AI

Prompt Injection, Large Language Model, Model Architecture

These labels were automatically added by AI and may be inaccurate.

Abstract

The widespread deployment of large language models (LLMs) has raised critical concerns over their vulnerability to jailbreak attacks, i.e., adversarial prompts that bypass alignment mechanisms and elicit harmful or policy-violating outputs. While proprietary models like GPT-4 have undergone extensive evaluation, the robustness of emerging open-source alternatives such as DeepSeek remains largely underexplored, despite their growing adoption in real-world applications. In this paper, we present the first systematic jailbreak evaluation of DeepSeek-series models, comparing them with GPT-3.5 and GPT-4 using the HarmBench benchmark. We evaluate seven representative attack strategies across 510 harmful behaviors categorized by both function and semantic domain. Our analysis reveals that DeepSeek's Mixture-of-Experts (MoE) architecture introduces routing sparsity that offers selective robustness against optimization-based attacks such as TAP-T, but leads to significantly higher vulnerability under prompt-based and manually engineered attacks. In contrast, GPT-4 Turbo demonstrates stronger and more consistent safety alignment across diverse behaviors, likely due to its dense Transformer design and reinforcement learning from human feedback. Fine-grained behavioral analysis and case studies further show that DeepSeek often routes adversarial prompts to under-aligned expert modules, resulting in inconsistent refusal behaviors. These findings highlight a fundamental trade-off between architectural efficiency and alignment generalization, emphasizing the need for targeted safety tuning and modular alignment strategies to ensure secure deployment of open-source LLMs.
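
The comparison described in the abstract (several attack strategies applied to a fixed set of harmful behaviors, per model) can be tallied roughly as follows. This is a minimal sketch, not the authors' code: it assumes results are summarized as an attack success rate (ASR), the metric commonly reported with HarmBench, and that each evaluation record carries a judge verdict. The record fields (`model`, `attack`, `behavior_id`, `jailbroken`) and the toy values are hypothetical and are not results from the paper.

```python
from collections import defaultdict


def attack_success_rate(records):
    """Return {(model, attack): fraction of behaviors judged jailbroken}.

    Each record is a dict with hypothetical keys:
    'model', 'attack', 'behavior_id', 'jailbroken' (bool judge verdict).
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for record in records:
        key = (record["model"], record["attack"])
        totals[key] += 1
        successes[key] += int(record["jailbroken"])
    return {key: successes[key] / totals[key] for key in totals}


# Made-up records that only illustrate the data shape; not data from the paper.
toy_records = [
    {"model": "model-A", "attack": "attack-X", "behavior_id": 1, "jailbroken": True},
    {"model": "model-A", "attack": "attack-X", "behavior_id": 2, "jailbroken": False},
    {"model": "model-B", "attack": "attack-X", "behavior_id": 1, "jailbroken": False},
    {"model": "model-B", "attack": "attack-X", "behavior_id": 2, "jailbroken": False},
]

rates = attack_success_rate(toy_records)
for (model, attack), rate in sorted(rates.items()):
    print(f"{model} / {attack}: ASR = {rate:.2f}")
```

Grouping by an additional key, such as a behavior's functional or semantic category, would give the finer-grained breakdown the abstract mentions.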

External Datasets

HarmBench

JailbreakBench

EasyJailbreak