These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
Large Language Models (LLMs) have emerged as powerful tools, but their
inherent safety risks - ranging from harmful content generation to broader
societal harms - pose significant challenges. These risks can be amplified by
the recent adversarial attacks, fine-tuning vulnerabilities, and the increasing
deployment of LLMs in high-stakes environments. Existing safety-enhancing
techniques, such as fine-tuning with human feedback or adversarial training,
are still vulnerable as they address specific threats and often fail to
generalize across unseen attacks, or require manual system-level defenses. This
paper introduces RepBend, a novel approach that fundamentally disrupts the
representations underlying harmful behaviors in LLMs, offering a scalable
solution to enhance (potentially inherent) safety. RepBend brings the idea of
activation steering - simple vector arithmetic for steering model's behavior
during inference - to loss-based fine-tuning. Through extensive evaluation,
RepBend achieves state-of-the-art performance, outperforming prior methods such
as Circuit Breaker, RMU, and NPO, with up to 95% reduction in attack success
rates across diverse jailbreak benchmarks, all with negligible reduction in
model usability and general capabilities.