Abstract
Large language models (LLMs) have been shown to be vulnerable to jailbreaking
attacks, in which adversarial prompts are designed to elicit harmful responses.
While existing defenses effectively mitigate single-turn attacks by detecting
and filtering unsafe inputs, they fail against multi-turn jailbreaks that
exploit contextual drift over multiple interactions, gradually steering LLMs
away from safe behavior. To address this challenge, we propose a safety
steering framework grounded in safe control theory that ensures invariant
safety in multi-turn dialogues.
Our approach models dialogues with LLMs using state-space representations
and introduces a novel neural barrier function (NBF) that proactively detects
and filters harmful queries emerging from evolving contexts. Our method
achieves invariant safety at each turn of the dialogue by learning a safety
predictor that accounts for adversarial queries, preventing potential context
drift toward jailbreaks.
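For concreteness, one standard way to formalize this kind of turn-level
invariance is the discrete-time control barrier condition below; the notation
(dialogue state s_t, barrier h, decay rate alpha) is our illustrative choice,
and the paper's exact formulation may differ.

\[
\mathcal{C} = \{\, s : h(s) \ge 0 \,\}, \qquad
h(s_{t+1}) \ \ge\ (1-\alpha)\, h(s_t), \quad \alpha \in (0,1],
\]

so that $s_0 \in \mathcal{C}$ implies $s_t \in \mathcal{C}$ for all
$t \ge 0$, i.e., every turn of the dialogue remains inside the safe set.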
Extensive experiments across multiple LLMs show that our NBF-based safety
steering outperforms safety-alignment, prompt-based steering, and lightweight
LLM guardrail baselines, offering stronger defenses against multi-turn
jailbreaks while maintaining a better trade-off among safety, helpfulness,
and over-refusal. The project website is available at
https://sites.google.com/view/llm-nbf/home and our code is available at
https://github.com/HanjiangHu/NBF-LLM .
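To make the filtering idea concrete, below is a minimal, self-contained sketch
of how a learned barrier function could gate incoming queries at each turn
under the condition above. Everything here (the MLP architecture, the random
embeddings standing in for a dialogue-state encoder, the alpha rate, and the
function names) is our illustrative assumption, not the paper's released
implementation; see the repository above for the actual code.

    # Sketch of NBF-style query filtering (hypothetical API; the paper's
    # actual architecture and state encoding may differ).
    import torch
    import torch.nn as nn

    class NeuralBarrierFunction(nn.Module):
        """MLP barrier h(s); h(s) >= 0 is treated as the safe region."""
        def __init__(self, state_dim: int, hidden: int = 256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, state: torch.Tensor) -> torch.Tensor:
            return self.net(state).squeeze(-1)

    def is_query_safe(nbf: NeuralBarrierFunction,
                      state: torch.Tensor,
                      next_state: torch.Tensor,
                      alpha: float = 0.5) -> bool:
        """Discrete-time barrier check: h(s_{t+1}) >= (1 - alpha) * h(s_t).

        If the predicted next dialogue state would violate the condition,
        the query is filtered before it reaches the LLM.
        """
        with torch.no_grad():
            h_t = nbf(state)
            h_next = nbf(next_state)
        return bool(h_next >= (1.0 - alpha) * h_t)

    # Toy usage with random embeddings standing in for state encodings.
    if __name__ == "__main__":
        torch.manual_seed(0)
        nbf = NeuralBarrierFunction(state_dim=768)
        s_t = torch.randn(768)     # current dialogue state embedding
        s_next = torch.randn(768)  # predicted state after candidate query
        print("admit query:", is_query_safe(nbf, s_t, s_next))

In this sketch the barrier value plays the role of the learned safety
predictor: queries whose predicted next state would shrink h(s) faster than
the (1 - alpha) rate allows are rejected, which is what keeps the dialogue
from drifting out of the safe set over many turns.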