Obliviate: Neutralizing Task-agnostic Backdoors within the Parameter-efficient Fine-tuning Paradigm

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Hidden killer: Invisible textual backdoor attacks with syntactic trigger

Fanchao Qi, Mukai Li, Yangyi Chen, Zhengyan Zhang, Zhiyuan Liu, Yasheng Wang, Maosong Sun

Published: 2021

Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

SQuAD: 100,000+ questions for machine comprehension of text

P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang

Published: 2016

Cryptology ePrint Archive

Optimizations of side-channel attack on AES MixColumns using chosen input

A. Vasselle, A. Wurcker

Published: 2019

Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, CCS ’21

Backdoor pre-trained models can transfer to all

Lujia Shen, Shouling Ji, Xuhong Zhang, Jinfeng Li, Jing Chen, Jie Shi, Chengfang Fang, Jianwei Yin, Ting Wang

Published: 2021

Conference on Empirical Methods in Natural Language Processing

Recursive deep models for semantic compositionality over a sentiment treebank

R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, C. Potts

Published: 2013

Claim-guided textual backdoor attack for practical applications

Minkyoo Song, Hanna Kim, Jaehan Kim, Youngjin Jin, Seungwon Shin

Published: 2024

IEEE Transactions on Services Computing

Backdoor attacks against transfer learning with pretrained deep learning models

Shuo Wang, Surya Nepal, Carsten Rudolph, Marthie Grobler, Shangyu Chen, Tianle Chen

Published: 2020

arXiv

Network and Distributed System Security Symposium (NDSS)

LMSanitator: Defending Prompt-Tuning Against Task-Agnostic Backdoors

Chengkun Wei, Wenlong Meng, Zhikun Zhang, Min Chen, Minghu Zhao, Wenjing Fang, Lei Wang, Zihui Zhang, Wenzhi Chen

Published: August 27, 2023

Prompt-tuning has emerged as an attractive paradigm for deploying large-scale language models due to its strong downstream task performance and efficient multitask serving ability. Despite its wide adoption, we empirically show that prompt-tuning is vulnerable to downstream task-agnostic backdoors, which reside in the pretrained models and can affect arbitrary downstream tasks. The state-of-the-art backdoor detection approaches cannot defend against task-agnostic backdoors since they hardly converge in reversing the backdoor triggers. To address this issue, we propose LMSanitator, a novel approach for detecting and removing task-agnostic backdoors on Transformer models. Instead of directly inverting the triggers, LMSanitator aims to invert the predefined attack vectors (pretrained models' output when the input is embedded with triggers) of the task-agnostic backdoors, which achieves much better convergence performance and backdoor detection accuracy. LMSanitator further leverages prompt-tuning's property of freezing the pretrained model to perform accurate and fast output monitoring and input purging during the inference phase. Extensive experiments on multiple language models and NLP tasks illustrate the effectiveness of LMSanitator. For instance, LMSanitator achieves 92.8% backdoor detection accuracy on 960 models and decreases the attack success rate to less than 1% in most scenarios.

arXiv preprint

Huggingface’s transformers: State-of-the-art natural language processing

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz

Published: 2019

Advances in Neural Information Processing Systems

Adversarial neuron pruning purifies backdoored deep models

Dongxian Wu, Yisen Wang

Published: 2021

Thirty-seventh Conference on Neural Information Processing Systems

Defending pre-trained language models as few-shot learners against backdoor attacks

Zhaohan Xi, Tianyu Du, Changjiang Li, Ren Pang, Shouling Ji, Jinghui Chen, Fenglong Ma, Ting Wang

Published: 2023

The 61st Annual Meeting of the Association for Computational Linguistics

BITE: Textual backdoor attacks with iterative trigger injection

Jun Yan, Vansh Gupta, Xiang Ren

Published: 2023

arXiv

RAP: Robustness-aware perturbations for defending against backdoor attacks on NLP models

Wenkai Yang, Yankai Lin, Peng Li, Jie Zhou, Xu Sun

Published: 2021

Language models are super mario: Absorbing abilities from homologous models as a free lunch

Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, Yongbin Li

Published: 2023

International Conference on Learning Representations (ICLR)

Adversarial unlearning of backdoors via implicit hypergradient

Yi Zeng, Si Chen, Won Park, Z. Morley Mao, Ming Jin, Ruoxi Jia

Published: 2022

Advances in Neural Information Processing Systems

Character-level convolutional networks for text classification

X. Zhang, J. Zhao, Y. LeCun

Published: 2015

Machine Intelligence Research

Red alarm for pre-trained models: Universal vulnerability to neuron-level backdoor attacks

Zhengyan Zhang, Guangxuan Xiao, Yongwei Li, Tian Lv, Fanchao Qi, Zhiyuan Liu, Yasheng Wang, Xin Jiang, Maosong Sun

Published: 2023

arXiv

Defending Against Weight-Poisoning Backdoor Attacks for Parameter-Efficient Fine-Tuning

Shuai Zhao, Leilei Gan, Luu Anh Tuan, Jie Fu, Lingjuan Lyu, Meihuizi Jia, Jinming Wen

Published: February 19, 2024

Recently, various parameter-efficient fine-tuning (PEFT) strategies for application to language models have been proposed and successfully implemented. However, this raises the question of whether PEFT, which only updates a limited set of model parameters, constitutes security vulnerabilities when confronted with weight-poisoning backdoor attacks. In this study, we show that PEFT is more susceptible to weight-poisoning backdoor attacks compared to the full-parameter fine-tuning method, with pre-defined triggers remaining exploitable and pre-defined targets maintaining high confidence, even after fine-tuning. Motivated by this insight, we developed a Poisoned Sample Identification Module (PSIM) leveraging PEFT, which identifies poisoned samples through confidence, providing robust defense against weight-poisoning backdoor attacks. Specifically, we leverage PEFT to train the PSIM with randomly reset sample labels. During the inference process, extreme confidence serves as an indicator for poisoned samples, while others are clean. We conduct experiments on text classification tasks, five fine-tuning strategies, and three weight-poisoning backdoor attack methods. Experiments show near 100% success rates for weight-poisoning backdoor attacks when utilizing PEFT. Furthermore, our defensive approach exhibits overall competitive performance in mitigating weight-poisoning backdoor attacks.

Transactions of the Association for Computational Linguistics

Removing backdoors in pre-trained models by regularized continual pre-training

Biru Zhu, Ganqu Cui, Yangyi Chen, Yujia Qin, Lifan Yuan, Chong Fu, Yangdong Deng, Zhiyuan Liu, Maosong Sun, Ming Gu

Published: 2023

DPPA: Pruning method for large language model to model merging

Yaochen Zhu, Rui Xia, Jiajun Zhang

Published: 2024

Proceedings of the IEEE International Conference on Computer Vision

Aligning books and movies: Towards story-like visual explanations by watching movies and reading books

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, Sanja Fidler

Published: 2015