AIセキュリティポータル K Program
Exploiting the Vulnerability of Large Language Models via Defense-Aware Architectural Backdoor
Share
Abstract
Deep neural networks (DNNs) have long been recognized as vulnerable to backdoor attacks. By providing poisoned training data in the fine-tuning process, the attacker can implant a backdoor into the victim model. This enables input samples meeting specific textual trigger patterns to be classified as target labels of the attacker's choice. While such black-box attacks have been well explored in both computer vision and natural language processing (NLP), backdoor attacks relying on white-box attack philosophy have hardly been thoroughly investigated. In this paper, we take the first step to introduce a new type of backdoor attack that conceals itself within the underlying model architecture. Specifically, we propose to design separate backdoor modules consisting of two functions: trigger detection and noise injection. The add-on modules of model architecture layers can detect the presence of input trigger tokens and modify layer weights using Gaussian noise to disturb the feature distribution of the baseline model. We conduct extensive experiments to evaluate our attack methods using two model architecture settings on five different large language datasets. We demonstrate that the training-free architectural backdoor on a large language model poses a genuine threat. Unlike the-state-of-art work, it can survive the rigorous fine-tuning and retraining process, as well as evade output probability-based defense methods (i.e. BDDR). All the code and data is available https://github.com/SiSL-URI/Arch_Backdoor_LLM.
Generating natural language adversarial examples
Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani B. Srivastava, Kai-Wei Chang
Published: 2018
T-Miner: A Generative Approach to Defend Against Trojan Attacks on DNN-based Text Classification
Ahmadreza Azizi, Ibrahim Asadullah Tahmid, Asim Waheed, Neal Mangaokar, Jiameng Pu, Mobin Javed, Chandan K. Reddy, Bimal Viswanath
Published: 2021.3.7
Badnl: Backdoor attacks against nlp models with semantic-preserving improvements
Xiaoyi Chen, Ahmed Salem, Dingfan Chen, Michael Backes, Shiqing Ma, Qingni Shen, Zhonghai Wu, Yang Zhang
Published: 2021
Textual backdoor attacks can be more harmful via two simple tricks
Yangyi Chen, Fanchao Qi, Hongcheng Gao, Zhiyuan Liu, Maosong Sun
Published: 2022
A backdoor attack against lstm-based text classification systems
Jiazhu Dai, Chuanshuai Chen, Yufeng Li
Published: 2019
Ranking a stream of news
Gianna M Del Corso, Antonio Gulli, Francesco Romani
Published: 2005
Bert: Pre-training of deep bidirectional transformers for language understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
Published: 2019
Training-free lexical backdoor attacks on language models
Yujin Huang, Terry Yue Zhuo, Qiongkai Xu, Han Hu, Xingliang Yuan, Chunyang Chen
Published: 2023
K-means clustering algorithms: A comprehensive review, variants analysis, and advances in the era of big data
Abiodun M Ikotun, Absalom E Ezugwu, Laith Abualigah, Belal Abuhaija, Jia Heming
Published: 2023
Principal component analysis: a review and recent developments
Ian T Jolliffe, Jorge Cadima
Published: 2016
Hidden backdoors in human-centric language models
S. Li, H. Liu, T. Dong, B. Z. H. Zhao, M. Xue, H. Zhu, J. Lu
Published: 2021
Membership Inference Attacks by Exploiting Loss Trajectory
Yiyong Liu, Zhengyu Zhao, Michael Backes, Yang Zhang
Published: 2022.9.1
A survey of the usages of deep learning for natural language processing
Daniel W Otter, Julian R Medina, Jugal K Kalita
Published: 2020
Hidden trigger backdoor attack on {NLP} models via linguistic style manipulation
Xudong Pan, Mi Zhang, Beina Sheng, Jiaming Zhu, Min Yang
Published: 2022
Towards Data-Free Model Stealing in a Hard Label Setting
Sunandini Sanyal, Sravanti Addepalli, R. Venkatesh Babu
Published: 2022.4.23
CARER: Contextualized affect representations for emotion recognition
Elvis Saravia, Hsien-Chi Toby Liu, Yen-Hao Huang, Junlin Wu, Yi-Shin Chen
Published: 2018
Bddr: An effective defense against textual backdoor attacks
Kun Shao, Junan Yang, Yang Ai, Hui Liu, Yu Zhang
Published: 2021
Punctuation matters! stealthy backdoor attack for language models
Xuan Sheng, Zhicheng Li, Zhaoyang Han, Xiangmao Chang, Piji Li
Published: 2023
Recursive deep models for semantic compositionality over a sentiment treebank
Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. Y., Potts, C.
Published: 2013
Attention is all you need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, Illia Polosukhin
Published: 2017
Deep learning for computer vision: A brief review
Athanasios Voulodimos, Nikolaos Doulamis, Anastasios Doulamis, Eftychios Protopapadakis
Published: 2018
Bite: Textual backdoor attacks with iterative trigger injection
Jun Yan, Vansh Gupta, Xiang Ren
Published: 2023
Share