Beyond Surface-Level Patterns: An Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs


Source

https://arxiv.org/abs/2502.19041

PDF

https://arxiv.org/pdf/2502.19041

Paper Information

Authors: Shiyu Xiang, Ansen Zhang, Yanfei Cao, Yang Fan, Ronghao Chen
Published: 2-26-2025
Updated: 5-28-2025
Affiliation: Sichuan University
Country: China
Conference: Annual Meeting of the Association for Computational Linguistics (ACL)

Labels Estimated by AI

Prompt Injection, Attack Evaluation, LLM Security

These labels were automatically added by AI and may be inaccurate.

Abstract

Although Aligned Large Language Models (LLMs) are trained to refuse harmful requests, they remain vulnerable to jailbreak attacks. Unfortunately, existing methods often focus on surface-level patterns, overlooking the deeper attack essences. As a result, defenses fail when attack prompts change, even though the underlying "attack essence" remains the same. To address this issue, we introduce EDDF, an Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs. EDDF is a plug-and-play input-filtering method and operates in two stages: 1) offline essence database construction, and 2) online adversarial query detection. The key idea behind EDDF is to extract the "attack essence" from a diverse set of known attack instances and store it in an offline vector database. Experimental results demonstrate that EDDF significantly outperforms existing methods by reducing the Attack Success Rate by at least 20%, underscoring its superior robustness against jailbreak attacks.
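
The two-stage design described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say how essences are extracted, embedded, or matched, so the sketch makes several assumptions: essences are already available as short text strings, a toy bag-of-words embedding with cosine similarity stands in for a real encoder and vector database, and a query is flagged when its similarity to any stored essence reaches a hypothetical threshold. The names `EssenceStore`, `embed`, `is_adversarial`, and `threshold` are illustrative only.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a real encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class EssenceStore:
    """Stage 1 (offline): hold essence strings alongside their vectors."""

    def __init__(self):
        self.entries = []  # list of (essence_text, vector) pairs

    def add(self, essence):
        self.entries.append((essence, embed(essence)))

    def nearest(self, query_essence):
        """Return (best_matching_essence, similarity); (None, 0.0) if empty."""
        query_vec = embed(query_essence)
        best_text, best_score = None, 0.0
        for text, vec in self.entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_text, best_score = text, score
        return best_text, best_score


def is_adversarial(query_essence, store, threshold=0.5):
    """Stage 2 (online): flag a query whose essence matches a stored one.

    The threshold value is a hypothetical choice, not taken from the paper.
    """
    _, score = store.nearest(query_essence)
    return score >= threshold


if __name__ == "__main__":
    store = EssenceStore()
    store.add("role play persona to bypass safety rules")
    store.add("encode harmful request in cipher text")
    print(is_adversarial("persona role play to bypass rules", store))
    print(is_adversarial("summarize this article about gardening", store))
```

In this sketch the reworded role-play essence still matches the stored one even though the word order differs, which mirrors the abstract's point that surface changes should not defeat a defense keyed to the underlying essence. A practical system would replace `embed` with a semantic encoder, since bag-of-words overlap is itself a surface-level signal.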

External Datasets

Original Dataset

Jailbreak Proliferation

Exaggerated Safety Dataset

Stanford Alpaca

MOSSBench Benign Dataset

JailBreakV-28k Dataset