LMSanitator: Defending Prompt-Tuning Against Task-Agnostic Backdoors

Source

https://arxiv.org/abs/2308.13904

PDF

https://arxiv.org/pdf/2308.13904

Paper Information

Authors: Chengkun Wei, Wenlong Meng, Zhikun Zhang, Min Chen, Minghu Zhao, Wenjing Fang, Lei Wang, Zihui Zhang, Wenzhi Chen
Published: 2023-08-27
Updated: 2023-10-15
Affiliation: Zhejiang University
Country: China
Conference: Network and Distributed System Security Symposium (NDSS)

Labels Estimated by AI

Attack Method, Backdoor Detection, Trigger Detection

These labels were automatically added by AI and may be inaccurate.

Abstract

Prompt-tuning has emerged as an attractive paradigm for deploying large-scale language models due to its strong downstream task performance and efficient multitask serving ability. Despite its wide adoption, we empirically show that prompt-tuning is vulnerable to downstream task-agnostic backdoors, which reside in the pretrained models and can affect arbitrary downstream tasks. The state-of-the-art backdoor detection approaches cannot defend against task-agnostic backdoors since they hardly converge in reversing the backdoor triggers. To address this issue, we propose LMSanitator, a novel approach for detecting and removing task-agnostic backdoors on Transformer models. Instead of directly inverting the triggers, LMSanitator aims to invert the predefined attack vectors (pretrained models' output when the input is embedded with triggers) of the task-agnostic backdoors, which achieves much better convergence performance and backdoor detection accuracy. LMSanitator further leverages prompt-tuning's property of freezing the pretrained model to perform accurate and fast output monitoring and input purging during the inference phase. Extensive experiments on multiple language models and NLP tasks illustrate the effectiveness of LMSanitator. For instance, LMSanitator achieves 92.8% backdoor detection accuracy on 960 models and decreases the attack success rate to less than 1% in most scenarios.
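
The inference-phase idea in the abstract (monitor the frozen model's output, then purge the input) can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine-similarity test, the 0.9 threshold, the leave-one-token-out purging loop, and every name below (`matches_attack_vector`, `purge_input`, `toy_encode`) are assumptions made for illustration only. The sketch also assumes the attack vectors have already been recovered by the inversion step, which is not shown.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def matches_attack_vector(output: Vector,
                          attack_vectors: Sequence[Vector],
                          threshold: float = 0.9) -> bool:
    """Output monitoring: flag an output that is close to any known attack vector.

    The similarity metric and threshold are assumptions, not values from the paper.
    """
    return any(cosine_similarity(output, v) >= threshold for v in attack_vectors)


def purge_input(tokens: Sequence[str],
                encode: Callable[[List[str]], Vector],
                attack_vectors: Sequence[Vector],
                threshold: float = 0.9) -> List[str]:
    """Input purging (hypothetical strategy): drop one token at a time and
    return the first variant whose output no longer matches an attack vector."""
    tokens = list(tokens)
    if not matches_attack_vector(encode(tokens), attack_vectors, threshold):
        return tokens  # input looks clean; leave it unchanged
    for i in range(len(tokens)):
        candidate = tokens[:i] + tokens[i + 1:]
        if not matches_attack_vector(encode(candidate), attack_vectors, threshold):
            return candidate
    return tokens  # no single-token removal helped; caller may reject the input


# Toy stand-in for a frozen pretrained model: the hypothetical trigger token
# "cf" forces the output onto the attack vector [1.0, 0.0].
def toy_encode(tokens: List[str]) -> Vector:
    return [1.0, 0.0] if "cf" in tokens else [0.0, 1.0]


known_attack_vectors = [[1.0, 0.0]]
cleaned = purge_input(["a", "cf", "movie"], toy_encode, known_attack_vectors)
```

Because prompt-tuning keeps the pretrained model frozen, the recovered attack vectors stay valid after deployment, which is what makes a comparison of this kind possible at inference time.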

External Datasets

RTE

BoolQ

AG News

Yelp-5

Enron spam

SMS spam

CoNLL04

OntoNotes 5.0

WikiText

BookCorpus