AI Security Portal K Program
Obliviate: Neutralizing Task-agnostic Backdoors within the Parameter-efficient Fine-tuning Paradigm
Abstract
Parameter-efficient fine-tuning (PEFT) has become a key training strategy for large language models. However, its reliance on fewer trainable parameters poses security risks, such as task-agnostic backdoors. Despite their severe impact on a wide range of tasks, no practical defense effectively counters task-agnostic backdoors in the PEFT setting. In this study, we introduce Obliviate, a PEFT-integrable backdoor defense. We develop two techniques: one amplifies benign neurons within PEFT layers, and the other penalizes the influence of trigger tokens. Our evaluations across three major PEFT architectures show that our method significantly reduces the attack success rate of state-of-the-art task-agnostic backdoors (83.6%↓). Furthermore, our method exhibits robust defense capabilities against both task-specific backdoors and adaptive attacks. Source code will be available at https://github.com/obliviateARR/Obliviate.
A large annotated corpus for learning natural language inference
Samuel R. Bowman, Gabor Angeli, Christopher Potts, Christopher D. Manning
Published: 2015
BadPre: Task-agnostic backdoor attacks to pre-trained NLP foundation models
Kangjie Chen, Yuxian Meng, Xiaofei Sun, Shangwei Guo, Tianwei Zhang, Jiwei Li, Chun Fan
Published: 2021
BadNL: Backdoor attacks against NLP models with semantic-preserving improvements
Xiaoyi Chen, Ahmed Salem, Dingfan Chen, Michael Backes, Shiqing Ma, Qingni Shen, Zhonghai Wu, Yang Zhang
Published: 2021
A backdoor attack against LSTM-based text classification systems
Jiazhu Dai, Chuanshuai Chen, Yufeng Li
Published: 2019
Automated hate speech detection and the problem of offensive language
Thomas Davidson, Dana Warmsley, Michael Macy, Ingmar Weber
Published: 2017
BERT: Pre-training of deep bidirectional transformers for language understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
Published: 2019
Design and evaluation of a multi-domain trojan detection method on deep neural networks
Yansong Gao, Yeonjae Kim, Bao Gia Doan, Zhi Zhang, Gongxuan Zhang, Surya Nepal, Damith C Ranasinghe, Hyoungshick Kim
Published: 2021
A gradient control method for backdoor attacks on parameter-efficient tuning
Naibin Gu, Peng Fu, Xiyu Liu, Zhengxiao Liu, Zheng Lin, Weiping Wang
Published: 2023
Towards a unified view of parameter-efficient transfer learning
Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, Graham Neubig
Published: 2021
Parameter-efficient transfer learning for NLP
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, Sylvain Gelly
Published: 2019
Weight poisoning attacks on pretrained models
Keita Kurita, Paul Michel, Graham Neubig
Published: 2020
Defending against insertion-based textual backdoor attacks via attribution
Jiazhao Li, Zhuofeng Wu, Wei Ping, Chaowei Xiao, VG Vinod Vydiswaran
Published: 2023
Backdoor attacks on pre-trained models by layerwise weight poisoning
Linyang Li, Demin Song, Xiaonan Li, Jiehang Zeng, Ruotian Ma, Xipeng Qiu
Published: 2021
GPT understands, too
Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, Jie Tang
Published: 2023
Piccolo: Exposing complex backdoors in nlp transformer models
Yingqi Liu, Guangyu Shen, Guanhong Tao, Shengwei An, Shiqing Ma, Xiangyu Zhang
Published: 2022
Pointer sentinel mixture models
Stephen Merity, Caiming Xiong, James Bradbury, Richard Socher
Published: 2016
AdapterHub: A framework for adapting transformers
Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulic, Sebastian Ruder, Kyunghyun Cho, Iryna Gurevych
Published: 2020
Mind the style of text! Adversarial and backdoor attacks based on text style transfer
Fanchao Qi, Yangyi Chen, Xurui Zhang, Mukai Li, Zhiyuan Liu, Maosong Sun
Published: 2021
Hidden killer: Invisible textual backdoor attacks with syntactic trigger
Fanchao Qi, Mukai Li, Yangyi Chen, Zhengyan Zhang, Zhiyuan Liu, Yasheng Wang, Maosong Sun
Published: 2021
Backdoor pre-trained models can transfer to all
Lujia Shen, Shouling Ji, Xuhong Zhang, Jinfeng Li, Jing Chen, Jie Shi, Chengfang Fang, Jianwei Yin, Ting Wang
Published: 2021
Recursive deep models for semantic compositionality over a sentiment treebank
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, Christopher Potts
Published: 2013
Backdoor attacks against transfer learning with pretrained deep learning models
Shuo Wang, Surya Nepal, Carsten Rudolph, Marthie Grobler, Shangyu Chen, Tianle Chen
Published: 2020
LMSanitator: Defending Prompt-Tuning Against Task-Agnostic Backdoors
Chengkun Wei, Wenlong Meng, Zhikun Zhang, Min Chen, Minghu Zhao, Wenjing Fang, Lei Wang, Zihui Zhang, Wenzhi Chen
Published: August 27, 2023
Adversarial neuron pruning purifies backdoored deep models
Dongxian Wu, Yisen Wang
Published: 2021
Defending pre-trained language models as few-shot learners against backdoor attacks
Zhaohan Xi, Tianyu Du, Changjiang Li, Ren Pang, Shouling Ji, Jinghui Chen, Fenglong Ma, Ting Wang
Published: 2023
BITE: Textual backdoor attacks with iterative trigger injection
Jun Yan, Vansh Gupta, Xiang Ren
Published: 2023
Red alarm for pre-trained models: Universal vulnerability to neuron-level backdoor attacks
Zhengyan Zhang, Guangxuan Xiao, Yongwei Li, Tian Lv, Fanchao Qi, Zhiyuan Liu, Yasheng Wang, Xin Jiang, Maosong Sun
Published: 2023
Defending Against Weight-Poisoning Backdoor Attacks for Parameter-Efficient Fine-Tuning
Shuai Zhao, Leilei Gan, Luu Anh Tuan, Jie Fu, Lingjuan Lyu, Meihuizi Jia, Jinming Wen
Published: February 19, 2024
Removing backdoors in pre-trained models by regularized continual pre-training
Biru Zhu, Ganqu Cui, Yangyi Chen, Yujia Qin, Lifan Yuan, Chong Fu, Yangdong Deng, Zhiyuan Liu, Maosong Sun, Ming Gu
Published: 2023
Aligning books and movies: Towards story-like visual explanations by watching movies and reading books
Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, Sanja Fidler
Published: 2015