Exploiting the Vulnerability of Large Language Models via Defense-Aware Architectural Backdoor

Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Generating natural language adversarial examples

Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani B. Srivastava, Kai-Wei Chang

Published: 2018

T-Miner: A Generative Approach to Defend Against Trojan Attacks on DNN-based Text Classification

Ahmadreza Azizi, Ibrahim Asadullah Tahmid, Asim Waheed, Neal Mangaokar, Jiameng Pu, Mobin Javed, Chandan K. Reddy, Bimal Viswanath

Published: 2021.3.7

Deep Neural Network (DNN) classifiers are known to be vulnerable to Trojan or backdoor attacks, where the classifier is manipulated such that it misclassifies any input containing an attacker-determined Trojan trigger. Backdoors compromise a model's integrity, thereby posing a severe threat to the landscape of DNN-based classification. While multiple defenses against such attacks exist for classifiers in the image domain, there have been limited efforts to protect classifiers in the text domain. We present Trojan-Miner (T-Miner), a defense framework for Trojan attacks on DNN-based text classifiers. T-Miner employs a sequence-to-sequence (seq-2-seq) generative model that probes the suspicious classifier and learns to produce text sequences that are likely to contain the Trojan trigger. T-Miner then analyzes the text produced by the generative model to determine whether it contains trigger phrases and, correspondingly, whether the tested classifier has a backdoor. T-Miner requires no access to the training dataset or clean inputs of the suspicious classifier, and instead uses synthetically crafted "nonsensical" text inputs to train the generative model. We extensively evaluate T-Miner on 1100 model instances spanning 3 ubiquitous DNN model architectures, 5 different classification tasks, and a variety of trigger phrases. We show that T-Miner detects Trojan and clean models with a 98.75% overall accuracy, while achieving low false positives on clean models. We also show that T-Miner is robust against a variety of targeted, advanced attacks from an adaptive attacker.

Tags: backdoored model detection, attack methods, text perturbation methods

Cited by: 4

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Architectural Backdoors in Neural Networks

Mikel Bober-Irizar, Ilia Shumailov, Yiren Zhao, Robert Mullins, Nicolas Papernot

Published: 2022.6.16

Machine learning is vulnerable to adversarial manipulation. Previous literature has demonstrated that at the training stage attackers can manipulate data and data sampling procedures to control model behaviour. A common attack goal is to plant backdoors, i.e., to force the victim model to learn to recognise a trigger known only by the adversary. In this paper, we introduce a new class of backdoor attacks that hide inside model architectures, i.e., in the inductive bias of the functions used to train. These backdoors are simple to implement, for instance by publishing open-source code for a backdoored model architecture that others will reuse unknowingly. We demonstrate that model architectural backdoors represent a real threat and, unlike other approaches, can survive a complete re-training from scratch. We formalise the main construction principles behind architectural backdoors, such as a link between the input and the output, and describe some possible protections against them. We evaluate our attacks on computer vision benchmarks of different scales and demonstrate the underlying vulnerability is pervasive in a variety of training settings.

Tags: adversarial attacks, adversarial learning, threat model

Proceedings of the 37th Annual Computer Security Applications Conference

BadNL: Backdoor attacks against NLP models with semantic-preserving improvements

Xiaoyi Chen, Ahmed Salem, Dingfan Chen, Michael Backes, Shiqing Ma, Qingni Shen, Zhonghai Wu, Yang Zhang

Published: 2021

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Textual backdoor attacks can be more harmful via two simple tricks

Yangyi Chen, Fanchao Qi, Hongcheng Gao, Zhiyuan Liu, Maosong Sun

Published: 2022

IEEE Access

A backdoor attack against LSTM-based text classification systems

Jiazhu Dai, Chuanshuai Chen, Yufeng Li

Published: 2019

Proceedings of the 14th International Conference on World Wide Web

Ranking a stream of news

Gianna M Del Corso, Antonio Gulli, Francesco Romani

Published: 2005

Proceedings of NAACL-HLT

BERT: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

Published: 2019

arXiv

Triggerless backdoor attack for NLP tasks with clean labels

Leilei Gan, Jiwei Li, Tianwei Zhang, Xiaoya Li, Yuxian Meng, Fei Wu, Yi Yang, Shangwei Guo, Chun Fan

Published: 2021

BadNets: Identifying vulnerabilities in the machine learning model supply chain

Tianyu Gu, Brendan Dolan-Gavitt, Siddharth Garg

Published: 2017

arXiv

Composite backdoor attacks against large language models

H. Huang, Z. Zhao, M. Backes, Y. Shen, Y. Zhang

Published: 2023

Proceedings of the ACM Web Conference 2023

Training-free lexical backdoor attacks on language models

Yujin Huang, Terry Yue Zhuo, Qiongkai Xu, Han Hu, Xingliang Yuan, Chunyang Chen

Published: 2023

Information Sciences

K-means clustering algorithms: A comprehensive review, variants analysis, and advances in the era of big data

Abiodun M Ikotun, Absalom E Ezugwu, Laith Abualigah, Belal Abuhaija, Jia Heming

Published: 2023

Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences

Principal component analysis: a review and recent developments

Ian T Jolliffe, Jorge Cadima

Published: 2016

Cited by: 2

Computing Research Repository (CoRR)

Weight Poisoning Attacks on Pre-trained Models

Keita Kurita, Paul Michel, Graham Neubig

Published: 2020.4.15

Recently, NLP has seen a surge in the usage of large pre-trained models. Users download weights of models pre-trained on large datasets, then fine-tune the weights on a task of their choice. This raises the question of whether downloading untrusted pre-trained weights can pose a security threat. In this paper, we show that it is possible to construct "weight poisoning" attacks where pre-trained weights are injected with vulnerabilities that expose "backdoors" after fine-tuning, enabling the attacker to manipulate the model prediction simply by injecting an arbitrary keyword. We show that by applying a regularization method, which we call RIPPLe, and an initialization procedure, which we call Embedding Surgery, such attacks are possible even with limited knowledge of the dataset and fine-tuning procedure. Our experiments on sentiment classification, toxicity detection, and spam detection show that this attack is widely applicable and poses a serious threat. Finally, we outline practical defenses against such attacks. Code to reproduce our experiments is available at https://github.com/neulab/RIPPLe.

Tags: adversarial learning, backdoor attacks, poisoning

Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security

Hidden backdoors in human-centric language models

S. Li, H. Liu, T. Dong, B. Z. H. Zhao, M. Xue, H. Zhu, J. Lu

Published: 2021

Cited by: 1

Annual ACM Conference on Computer and Communications Security (CCS)

Membership Inference Attacks by Exploiting Loss Trajectory

Yiyong Liu, Zhengyu Zhao, Michael Backes, Yang Zhang

Published: 2022.9.1

Machine learning models are vulnerable to membership inference attacks in which an adversary aims to predict whether or not a particular sample was contained in the target model's training dataset. Existing attack methods have commonly exploited the output information (mostly, losses) solely from the given target model. As a result, in practical scenarios where both the member and non-member samples yield similarly small losses, these methods are naturally unable to differentiate between them. To address this limitation, in this paper, we propose a new attack method that can exploit the membership information from the whole training process of the target model for improving the attack performance. To mount the attack in the common black-box setting, we leverage knowledge distillation, and represent the membership information by the losses evaluated on a sequence of intermediate models at different distillation epochs, namely the distilled loss trajectory, together with the loss from the given target model. Experimental results over different datasets and model architectures demonstrate the great advantage of our attack in terms of different metrics. For example, on CINIC-10, our attack achieves at least 6× higher true-positive rate at a low false-positive rate of 0.1% than existing methods. Further analysis demonstrates the general effectiveness of our attack in more strict scenarios.

Tags: adversarial attacks, membership inference, model architecture

Cryptology ePrint Archive

Optimizations of side-channel attack on AES MixColumns using chosen input

A. Vasselle, A. Wurcker

Published: 2019

Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Learning word vectors for sentiment analysis

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, Christopher Potts

Published: 2011

IEEE Transactions on Neural Networks and Learning Systems

A survey of the usages of deep learning for natural language processing

Daniel W Otter, Julian R Medina, Jugal K Kalita

Published: 2020

31st USENIX Security Symposium (USENIX Security 22)

Hidden trigger backdoor attack on NLP models via linguistic style manipulation

Xudong Pan, Mi Zhang, Beina Sheng, Jiaming Zhu, Min Yang

Published: 2022

Cited by: 3

Conference on Empirical Methods in Natural Language Processing (EMNLP)

ONION: A Simple and Effective Defense Against Textual Backdoor Attacks

Fanchao Qi, Yangyi Chen, Mukai Li, Yuan Yao, Zhiyuan Liu, Maosong Sun

Published: 2020.11.20

Backdoor attacks are a kind of emergent training-time threat to deep neural networks (DNNs). They can manipulate the output of DNNs and possess high insidiousness. In the field of natural language processing, some attack methods have been proposed and achieve very high attack success rates on multiple popular models. Nevertheless, there are few studies on defending against textual backdoor attacks. In this paper, we propose a simple and effective textual backdoor defense named ONION, which is based on outlier word detection and, to the best of our knowledge, is the first method that can handle all the textual backdoor attack situations. Experiments demonstrate the effectiveness of our model in defending BiLSTM and BERT against five different backdoor attacks. All the code and data of this paper can be obtained at https://github.com/thunlp/ONION.

Tags: text perturbation methods, trigger detection, backdoored model detection

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf

Published: 2019

Cited by: 1

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Towards Data-Free Model Stealing in a Hard Label Setting

Sunandini Sanyal, Sravanti Addepalli, R. Venkatesh Babu

Published: 2022.4.23

Machine learning models deployed as a service (MLaaS) are susceptible to model stealing attacks, where an adversary attempts to steal the model within a restricted access framework. While existing attacks demonstrate near-perfect clone-model performance using softmax predictions of the classification network, most of the APIs allow access to only the top-1 labels. In this work, we show that it is indeed possible to steal Machine Learning models by accessing only top-1 predictions (Hard Label setting) as well, without access to model gradients (Black-Box setting) or even the training dataset (Data-Free setting) within a low query budget. We propose a novel GAN-based framework that trains the student and generator in tandem to steal the model effectively while overcoming the challenge of the hard label setting by utilizing gradients of the clone network as a proxy to the victim's gradients. We propose to overcome the large query costs associated with a typical Data-Free setting by utilizing publicly available (potentially unrelated) datasets as a weak image prior. We additionally show that even in the absence of such data, it is possible to achieve state-of-the-art results within a low query budget using synthetically crafted samples. We are the first to demonstrate the scalability of Model Stealing in a restricted access setting on a 100 class dataset as well.

Tags: attack methods against DFL, membership inference, query diversity

Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

CARER: Contextualized affect representations for emotion recognition

Elvis Saravia, Hsien-Chi Toby Liu, Yen-Hao Huang, Junlin Wu, Yi-Shin Chen

Published: 2018

Computers & Security

BDDR: An effective defense against textual backdoor attacks

Kun Shao, Junan Yang, Yang Ai, Hui Liu, Yu Zhang

Published: 2021

CCF International Conference on Natural Language Processing and Chinese Computing

Punctuation matters! Stealthy backdoor attack for language models

Xuan Sheng, Zhicheng Li, Zhaoyang Han, Xiangmao Chang, Piji Li

Published: 2023

Conference on Empirical Methods in Natural Language Processing

Recursive deep models for semantic compositionality over a sentiment treebank

R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, C. Potts

Published: 2013

SSRN

EmTract: Extracting emotions from social media

Domonkos F Vamossy, Rolf Skog

Published: 2023

Advances in Neural Information Processing Systems

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, Illia Polosukhin

Published: 2017

Computational Intelligence and Neuroscience

Deep learning for computer vision: A brief review

Athanasios Voulodimos, Nikolaos Doulamis, Anastasios Doulamis, Eftychios Protopapadakis

Published: 2018

The 61st Annual Meeting of the Association for Computational Linguistics

BITE: Textual backdoor attacks with iterative trigger injection

Jun Yan, Vansh Gupta, Xiang Ren

Published: 2023