Abstract
To ensure AI safety, instruction-tuned Large Language Models (LLMs) are
specifically trained for alignment, that is, to behave in accordance with
human intentions. While these models have demonstrated
commendable results on various safety benchmarks, the vulnerability of their
safety alignment has not been extensively studied. This is particularly
troubling given the potential harm that LLMs can inflict. Existing attack
methods on LLMs often rely on poisoned training data or the injection of
malicious prompts. These approaches compromise the stealthiness and
generalizability of the attacks, making them susceptible to detection.
Additionally, these methods often demand substantial computational resources to
implement, making them less practical for real-world attacks. In this
work, we study a different attack scenario, called Trojan Activation Attack
(TA^2), which injects trojan steering vectors into the activation layers of
LLMs. These malicious steering vectors can be triggered at inference time to
steer the models toward attacker-desired behaviors by manipulating their
activations. Our experimental results on four primary alignment tasks show that
TA^2 is highly effective and incurs little to no overhead at attack time.
Additionally, we discuss potential countermeasures against such activation
attacks.
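The sketch below illustrates the general mechanism the abstract describes: adding a steering vector to a transformer layer's hidden states via a forward hook at inference time. It is an assumed illustration, not the authors' TA^2 implementation; the model name, layer index, steering coefficient, and the random vector (standing in for an attacker-crafted trojan vector) are placeholders.

```python
# Minimal sketch of inference-time activation steering (assumed illustration).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; TA^2 targets instruction-tuned LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx = 6                                # hypothetical intervention layer
hidden_size = model.config.hidden_size
steering_vector = torch.randn(hidden_size)   # stand-in for a trojan steering vector
coeff = 4.0                                  # hypothetical steering strength

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states
    # of shape (batch, seq_len, hidden_size); shift them by the steering vector.
    hidden = output[0] + coeff * steering_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

# Attach the hook so every forward pass through this layer is steered.
handle = model.transformer.h[layer_idx].register_forward_hook(add_steering)

prompt = "How do I stay safe online?"
ids = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # detach the hook to restore unmodified behavior
```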