Large language models (LLMs) have been shown to be vulnerable to jailbreaking
attacks, in which adversarial prompts are designed to elicit harmful responses.
While existing defenses effectively mitigate single-turn attacks by detecting
and filtering unsafe inputs, they fail against multi-turn jailbreaks that
exploit contextual drift over multiple interactions, gradually leading LLMs
away from safe behavior. To address this challenge, we propose a safety
steering framework grounded in safe control theory, ensuring invariant safety
in multi-turn dialogues. Our approach models the dialogue with LLMs using
state-space representations and introduces a novel neural barrier function
(NBF) to proactively detect and filter harmful queries emerging from evolving
contexts. Our method achieves invariant safety at each dialogue turn by
learning a safety predictor that accounts for adversarial queries, preventing
potential context drift toward jailbreaks. Extensive experiments on multiple
LLMs show that our NBF-based safety steering outperforms safety alignment,
prompt-based steering and lightweight LLM guardrail baselines, offering
stronger defenses against multi-turn jailbreaks while maintaining a better
trade-off among safety, helpfulness and over-refusal. The project website is
at https://sites.google.com/view/llm-nbf/home and our code is available at
https://github.com/HanjiangHu/NBF-LLM .
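
To make the barrier-function idea concrete, the following is a minimal toy sketch of turn-by-turn safety filtering over a state-space dialogue model. It is not the implementation from the repository above. Every name in it (toy_dynamics, toy_barrier, filter_turn, run_dialogue) is hypothetical, and the hand-written linear dynamics and quadratic barrier are assumed stand-ins for the learned state-space model and the neural barrier function. The sketch only illustrates the invariance logic: a query is accepted when the predicted next state keeps the barrier value non-negative, so a dialogue that starts in the safe set stays in it.

```python
from typing import List, Sequence, Tuple

State = Tuple[float, ...]


def toy_dynamics(state: State, query: Sequence[float]) -> State:
    """Assumed stand-in for a learned state-space model: x' = 0.8 * x + u."""
    return tuple(0.8 * x + u for x, u in zip(state, query))


def toy_barrier(state: State) -> float:
    """Assumed stand-in for a neural barrier function: h(x) = 1 - ||x||^2.

    By convention here, h(x) >= 0 marks the safe set.
    """
    return 1.0 - sum(x * x for x in state)


def filter_turn(state: State, query: Sequence[float]) -> Tuple[State, bool]:
    """Accept the query only if the predicted next state remains safe."""
    predicted = toy_dynamics(state, query)
    if toy_barrier(predicted) >= 0.0:
        return predicted, True
    # Rejected query: the dialogue state is left unchanged.
    return state, False


def run_dialogue(
    initial_state: State, queries: Sequence[Sequence[float]]
) -> Tuple[State, List[bool]]:
    """Apply the filter at every turn and record accept/reject decisions."""
    state = initial_state
    decisions: List[bool] = []
    for query in queries:
        state, accepted = filter_turn(state, query)
        decisions.append(accepted)
    return state, decisions


if __name__ == "__main__":
    # Hypothetical query embeddings that grow slightly riskier each turn.
    queries = [(0.3, 0.0), (0.4, 0.0), (0.5, 0.0), (0.6, 0.0)]
    final_state, decisions = run_dialogue((0.0, 0.0), queries)
    print(decisions, final_state, toy_barrier(final_state))
```

In this toy setting, each of the later queries would pass if it were checked against an empty context, yet it is rejected once the accumulated state is taken into account. This is the kind of multi-turn drift that an input filter applied to single queries does not see.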