Abstract
To ensure AI safety, instruction-tuned Large Language Models (LLMs) are
specifically trained for alignment, that is, to behave in accordance with
human intentions. While these models have demonstrated
commendable results on various safety benchmarks, the vulnerability of their
safety alignment has not been extensively studied. This is particularly
troubling given the potential harm that LLMs can inflict. Existing attack
methods on LLMs often rely on poisoned training data or the injection of
malicious prompts. These approaches compromise the stealthiness and
generalizability of the attacks, making them susceptible to detection.
Additionally, these methods often demand substantial computational resources to
implement, making them less practical for real-world attacks. In this
work, we study a different attack scenario, called Trojan Activation Attack
(TA^2), which injects trojan steering vectors into the activation layers of
LLMs. These malicious steering vectors can be triggered at inference time to
steer the models toward attacker-desired behaviors by manipulating their
activations. Our experimental results on four primary alignment tasks show that
TA^2 is highly effective and incurs little to no overhead at attack time.
Additionally, we discuss potential countermeasures against such activation
attacks.
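The sketch below illustrates the general mechanism the abstract describes: adding a steering vector to a transformer layer's hidden states via a forward hook at inference time. It is an assumed illustration, not the authors' TA^2 implementation; the model name, layer index, steering coefficient, and the random vector (standing in for an attacker-crafted trojan vector) are placeholders.

```python
# Minimal sketch of inference-time activation steering (assumed illustration).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; TA^2 targets instruction-tuned LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx = 6                                # hypothetical intervention layer
hidden_size = model.config.hidden_size
steering_vector = torch.randn(hidden_size)   # stand-in for a trojan steering vector
coeff = 4.0                                  # hypothetical steering strength

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states
    # of shape (batch, seq_len, hidden_size); shift them by the steering vector.
    hidden = output[0] + coeff * steering_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

# Attach the hook so every forward pass through this layer is steered.
handle = model.transformer.h[layer_idx].register_forward_hook(add_steering)

prompt = "How do I stay safe online?"
ids = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # detach the hook to restore unmodified behavior
```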