STShield: Single-Token Sentinel for Real-Time Jailbreak Detection in Large Language Models

arxiv

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2503.17932

PDF

https://arxiv.org/pdf/2503.17932

Paper Information

Authors: Xunguang Wang, Wenxuan Wang, Zhenlan Ji, Zongjie Li, Pingchuan Ma, Daoyuan Wu, Shuai Wang
Published: March 23, 2025
Affiliation: The Hong Kong University of Science and Technology
Country: Hong Kong
Conference: Computing Research Repository (CoRR)

Labels Estimated by AI

Malicious Prompt

Prompt Injection

Effectiveness Analysis of Defense Methods

These labels were automatically added by AI and may be inaccurate.
For details, see About Literature Database.

Abstract

Large Language Models (LLMs) have become increasingly vulnerable to jailbreak attacks that circumvent their safety mechanisms. While existing defense methods are either vulnerable to adaptive attacks or require computationally expensive auxiliary models, we present STShield, a lightweight framework for real-time jailbreak detection. STShield introduces a novel single-token sentinel mechanism that appends a binary safety indicator to the model's response sequence, leveraging the LLM's own alignment capabilities for detection. Our framework combines supervised fine-tuning on normal prompts with adversarial training using embedding-space perturbations, achieving robust detection while preserving model utility. Extensive experiments demonstrate that STShield successfully defends against various jailbreak attacks while maintaining the model's performance on legitimate queries. Compared to existing approaches, STShield achieves superior defense performance with minimal computational overhead, making it a practical solution for real-world LLM deployment.
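
The single-token sentinel idea can be illustrated from the consumer side. The following is a minimal sketch, not the authors' implementation: it assumes the fine-tuned model ends each response with one of two marker strings, and that the serving layer strips the marker and replaces flagged responses with a refusal. The marker strings `[SAFE]` and `[HARM]`, the refusal text, the fail-closed handling of a missing marker, and the helper names `split_sentinel` and `guard` are all hypothetical; the abstract does not specify any of them.

```python
from dataclasses import dataclass

# Hypothetical sentinel strings; the abstract does not name the actual tokens.
SAFE_TOKEN = "[SAFE]"
HARM_TOKEN = "[HARM]"
# Hypothetical refusal message returned when a response is flagged.
REFUSAL = "Sorry, I can't help with that."


@dataclass
class Verdict:
    text: str         # response with the sentinel stripped
    jailbroken: bool  # True if the response was flagged as unsafe


def split_sentinel(raw: str) -> Verdict:
    """Separate the trailing sentinel from the response body."""
    stripped = raw.rstrip()
    if stripped.endswith(HARM_TOKEN):
        return Verdict(stripped[: -len(HARM_TOKEN)].rstrip(), True)
    if stripped.endswith(SAFE_TOKEN):
        return Verdict(stripped[: -len(SAFE_TOKEN)].rstrip(), False)
    # Assumption: a missing sentinel is treated as unsafe (fail closed).
    return Verdict(stripped, True)


def guard(raw: str) -> str:
    """Return the response body, or a refusal if it was flagged."""
    verdict = split_sentinel(raw)
    return REFUSAL if verdict.jailbroken else verdict.text
```

Because the indicator is a single token at the end of the sequence, a check of this kind adds only a string comparison per response, which is consistent with the abstract's claim of minimal computational overhead.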

External Datasets

UltraChat

JailbreakBench

AdvBench

JailTrickBench

DAN

MultiJail

AlpacaEval