Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2312.00027

PDF

https://arxiv.org/pdf/2312.00027

Paper Information

Authors: Yuanpu Cao; Bochuan Cao; Jinghui Chen
Published: 11-16-2023
Updated: 6-9-2024
Affiliation: The Pennsylvania State University
Country: United States of America
Conference: NAACL-HLT

Labels Estimated by AI

Backdoor Attack; Prompt Injection

These labels were automatically added by AI and may be inaccurate.

Abstract

Recent developments in Large Language Models (LLMs) have manifested significant advancements. To facilitate safeguards against malicious exploitation, a body of research has concentrated on aligning LLMs with human preferences and inhibiting their generation of inappropriate content. Unfortunately, such alignments are often vulnerable: fine-tuning with a minimal amount of harmful data can easily unalign the target LLM. While effective, such fine-tuning-based unalignment approaches have their own limitations: (1) non-stealthiness: after fine-tuning, safety audits or red-teaming can easily expose the potential weaknesses of the unaligned models, thereby precluding their release or use; and (2) non-persistence: the unaligned LLMs can be easily repaired through re-alignment, i.e., fine-tuning again with aligned data points. In this work, we show that it is possible to conduct stealthy and persistent unalignment on large language models via backdoor injections. We also provide a novel understanding of the relationship between backdoor persistence and the activation pattern, and further provide guidelines for potential trigger design. Through extensive experiments, we demonstrate that our proposed stealthy and persistent unalignment can successfully pass the safety evaluation while maintaining strong persistence against re-alignment defense.

External Datasets

AdvBench

TDC 2023