Large Language Models (LLMs) have emerged as powerful tools, but their
inherent safety risks - ranging from harmful content generation to broader
societal harms - pose significant challenges. These risks can be amplified by
the recent adversarial attacks, fine-tuning vulnerabilities, and the increasing
deployment of LLMs in high-stakes environments. Existing safety-enhancing
techniques, such as fine-tuning with human feedback or adversarial training,
are still vulnerable as they address specific threats and often fail to
generalize across unseen attacks, or require manual system-level defenses. This
paper introduces RepBend, a novel approach that fundamentally disrupts the
representations underlying harmful behaviors in LLMs, offering a scalable
solution to enhance (potentially inherent) safety. RepBend brings the idea of
activation steering - simple vector arithmetic for steering model's behavior
during inference - to loss-based fine-tuning. Through extensive evaluation,
RepBend achieves state-of-the-art performance, outperforming prior methods such
as Circuit Breaker, RMU, and NPO, with up to 95% reduction in attack success
rates across diverse jailbreak benchmarks, all with negligible reduction in
model usability and general capabilities.