These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
Jailbreak attacks pose persistent threats to large language models (LLMs).
Current safety alignment methods have attempted to address these issues, but
they experience two significant limitations: insufficient safety alignment
depth and unrobust internal defense mechanisms. These limitations make them
vulnerable to adversarial attacks such as prefilling and refusal direction
manipulation. We introduce DeepRefusal, a robust safety alignment framework
that overcomes these issues. DeepRefusal forces the model to dynamically
rebuild its refusal mechanisms from jailbreak states. This is achieved by
probabilistically ablating the refusal direction across layers and token depths
during fine-tuning. Our method not only defends against prefilling and refusal
direction attacks but also demonstrates strong resilience against other unseen
jailbreak strategies. Extensive evaluations on four open-source LLM families
and six representative attacks show that DeepRefusal reduces attack success
rates by approximately 95%, while maintaining model capabilities with minimal
performance degradation.