Abstract
Safety alignment is critical for the ethical deployment of large language
models (LLMs), guiding them to avoid generating harmful or unethical content.
Current alignment techniques, such as supervised fine-tuning and reinforcement
learning from human feedback, remain fragile and can be bypassed by carefully
crafted adversarial prompts. However, such attacks rely on trial and
error, generalize poorly across models, and are limited in scalability
and reliability.
This paper presents NeuroStrike, a novel and generalizable attack framework
that exploits a fundamental vulnerability introduced by alignment techniques:
the reliance on sparse, specialized safety neurons responsible for detecting
and suppressing harmful inputs. We apply NeuroStrike to both white-box and
black-box settings. In the white-box setting, NeuroStrike identifies safety
neurons through feedforward activation analysis and prunes them during
inference to disable safety mechanisms. In the black-box setting, we propose
the first LLM profiling attack, which leverages safety neuron transferability
by training adversarial prompt generators on open-weight surrogate models and
then deploying them against black-box and proprietary targets. We evaluate
NeuroStrike on over 20 open-weight LLMs from major LLM developers. By removing
less than 0.6% of neurons in targeted layers, NeuroStrike achieves an average
attack success rate (ASR) of 76.9% using only vanilla malicious prompts.
Moreover, NeuroStrike generalizes to four multimodal LLMs with 100% ASR on
unsafe image inputs. Safety neurons transfer effectively across architectures,
raising ASR to 78.5% on 11 fine-tuned models and 77.7% on five distilled
models. The black-box LLM profiling attack achieves an average ASR of 63.7%
across five black-box models, including the Google Gemini family.