Masked Language Model Based Textual Adversarial Example Detection
Abstract
Adversarial attacks are a serious threat to the reliable deployment of machine learning models in safety-critical applications. They can mislead current models into incorrect predictions by slightly modifying the inputs. Recently, substantial work has shown that adversarial examples tend to deviate from the underlying data manifold of normal examples, whereas pre-trained masked language models can fit the manifold of normal NLP data. To explore how to use the masked language model in adversarial detection, we propose a novel textual adversarial example detection method, namely Masked Language Model-based Detection (MLMD), which can produce clearly distinguishable signals between normal examples and adversarial examples by exploring the changes in manifolds induced by the masked language model. MLMD features plug-and-play usage (i.e., no need to retrain the victim model) for adversarial defense, and it is agnostic to classification tasks, victim model architectures, and to-be-defended attack methods. We evaluate MLMD on various benchmark textual datasets, widely studied machine learning models, and state-of-the-art (SOTA) adversarial attacks (in total $3 \times 4 \times 4 = 48$ settings). Experimental results show that MLMD can achieve strong performance, with detection accuracy up to 0.984, 0.967, and 0.901 on the AG-NEWS, IMDB, and SST-2 datasets, respectively. Additionally, MLMD is superior to, or at least comparable to, the SOTA detection defenses in detection accuracy and F1 score. Among many defenses based on the off-manifold assumption of adversarial examples, this work offers a new angle for capturing the manifold change. The code for this work is openly accessible at https://github.com/mlmddetection/MLMDdetection.
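The abstract does not spell out the detection procedure, so the following is only a toy illustration of the general mask-then-refill idea, not the MLMD algorithm itself. It assumes one possible signal: mask each token in turn, let a stand-in "language model" refill it from context, and measure how often the victim's label flips. Every name here (`toy_victim`, `toy_fill`, `flip_rate`, the corpus, the weights, the threshold) is hypothetical and chosen purely for illustration.

```python
from typing import Callable, List, Sequence

# Hypothetical "normal" corpus standing in for the data a masked LM has fit.
# All sentences have the same length to keep the toy simple.
NORMAL_CORPUS: List[List[str]] = [
    "great plot excellent acting".split(),
    "awful plot terrible acting".split(),
]

# Hypothetical victim: a linear bag-of-words scorer with a positive bias.
WEIGHTS = {"great": 3, "excellent": 1, "awful": -1, "terrible": -3}
BIAS = 2


def toy_victim(tokens: Sequence[str]) -> int:
    """Return 1 (positive) or 0 (negative); unknown words contribute 0."""
    score = BIAS + sum(WEIGHTS.get(t, 0) for t in tokens)
    return 1 if score > 0 else 0


def toy_fill(tokens: Sequence[str], i: int) -> str:
    """Stand-in for a masked LM: predict position i from the other tokens.

    Picks the corpus sentence that shares the most unmasked tokens
    (compared position by position) and returns its token at position i.
    """
    def overlap(sent: Sequence[str]) -> int:
        return sum(s == t for j, (s, t) in enumerate(zip(sent, tokens)) if j != i)

    return max(NORMAL_CORPUS, key=overlap)[i]


def flip_rate(
    tokens: Sequence[str],
    victim: Callable[[Sequence[str]], int],
    fill: Callable[[Sequence[str], int], str],
) -> float:
    """Fraction of single-token refills that change the victim's label."""
    original = victim(tokens)
    flips = 0
    for i in range(len(tokens)):
        variant = list(tokens)
        variant[i] = fill(tokens, i)  # mask position i, then refill it
        flips += int(victim(variant) != original)
    return flips / len(tokens)


def is_adversarial(tokens: Sequence[str], threshold: float = 0.1) -> bool:
    """Flag an input when its flip rate exceeds an assumed threshold."""
    return flip_rate(tokens, toy_victim, toy_fill) > threshold


normal = "awful plot terrible acting".split()
adversarial = "awful plot terrib1e acting".split()  # one perturbed token

print(flip_rate(normal, toy_victim, toy_fill))       # 0.0
print(flip_rate(adversarial, toy_victim, toy_fill))  # 0.25
```

In this toy, the perturbed token pushes the input off the "normal" corpus and fools the victim; refilling that position pulls the input back and restores the original label, which is what raises the flip rate. The victim is never retrained, in keeping with the plug-and-play setting the abstract describes. A real detector would replace `toy_fill` with an actual pre-trained masked language model and would likely use a richer signal than a single thresholded flip rate.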