Abstract
Jailbreak attacks pose a serious threat to large language models (LLMs) by
bypassing built-in safety mechanisms and leading to harmful outputs. Studying
these attacks is crucial for identifying vulnerabilities and improving model
security. This paper presents a systematic survey of jailbreak methods from the
novel perspective of stealth. We find that existing attacks struggle to
simultaneously achieve toxic stealth (concealing toxic content) and linguistic
stealth (maintaining linguistic naturalness). Motivated by this, we propose
StegoAttack, a fully stealthy jailbreak attack that uses steganography to hide
the harmful query within benign, semantically coherent text. The attack then
prompts the LLM to extract the hidden query and respond in encrypted form.
This approach effectively hides malicious intent while preserving naturalness,
allowing it to evade both built-in and external safety mechanisms. We evaluate
StegoAttack on four safety-aligned LLMs from major providers, benchmarking
against eight state-of-the-art methods. StegoAttack achieves an average attack
success rate (ASR) of 92.00%, outperforming the strongest baseline by 11.0%.
Its ASR drops by less than 1% even under external detection (e.g., Llama
Guard). Moreover, it achieves the best overall scores on stealth detection
metrics, demonstrating both high efficacy and strong stealth. The code is
available at
https://anonymous.4open.science/r/StegoAttack-Jail66