Abstract
We introduce a low-resource safety enhancement method for aligning large
language models (LLMs) without the need for supervised fine-tuning (SFT) or
reinforcement learning from human feedback (RLHF). Our main idea is to exploit
knowledge distillation to extract the alignment information from existing
well-aligned LLMs and integrate it into unaligned LLMs in a plug-and-play
fashion. Methodologically, we employ delta debugging to identify the critical
components of knowledge necessary for effective distillation. On the harmful
question dataset, our method significantly enhances the average defense success
rate by approximately 14.41%, reaching as high as 51.39%, across 17 unaligned
pre-trained LLMs, without compromising their performance.
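The abstract does not show how delta debugging is applied, but delta debugging itself is the classic ddmin procedure of Zeller and Hildebrandt. The sketch below illustrates how ddmin could isolate a minimal set of critical knowledge components: `passes(subset)` is a hypothetical predicate standing in for "distill only this subset into the target model and check that it still refuses a probe set of harmful prompts." All names here are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, List, Sequence


def split(items: Sequence, n: int) -> List[List]:
    """Split `items` into n roughly equal, non-empty chunks."""
    k, rem = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + k + (1 if i < rem else 0)
        if end > start:
            chunks.append(list(items[start:end]))
        start = end
    return chunks


def ddmin(components: List, passes: Callable[[List], bool]) -> List:
    """Classic ddmin: shrink `components` to a 1-minimal subset
    for which `passes` still returns True."""
    assert passes(components), "the full set must satisfy the predicate"
    n = 2  # current granularity (number of chunks)
    while len(components) >= 2:
        chunks = split(components, n)
        # 1. Reduce to a single chunk if that chunk alone passes.
        candidate = next((c for c in chunks if passes(c)), None)
        if candidate is not None:
            components, n = candidate, 2
            continue
        # 2. Otherwise try dropping one chunk (keep its complement).
        for i in range(len(chunks)):
            complement = [x for j, c in enumerate(chunks) if j != i for x in c]
            if passes(complement):
                components, n = complement, max(n - 1, 2)
                break
        else:
            # 3. No reduction at this granularity: refine or stop.
            if n >= len(components):
                break
            n = min(len(components), n * 2)
    return components


# Hypothetical usage: `all_deltas` are candidate alignment components
# (e.g., per-module parameter deltas from a well-aligned LLM) and
# `defends` distills a subset into the unaligned model, then measures
# refusal on harmful prompts. Both names are assumptions for illustration.
# critical = ddmin(all_deltas, defends)
```

At full granularity the loop tests the removal of every single element, so the returned subset is 1-minimal: dropping any one remaining component would break the safety predicate, which matches the goal of identifying the components "necessary for effective distillation."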