Transfer learning, the transfer of knowledge learned by one model to another, has brought a
paradigm shift in the way models are trained. Its benefits of
improved accuracy and reduced training time have shown promise for training
models with constrained computational resources and fewer training samples.
Specifically, publicly available text-based models such as GloVe and BERT that
are trained on large text corpora have seen ubiquitous adoption in
practice. In this paper, we ask, "can transfer learning in text prediction
models be exploited to perform misclassification attacks?" As our main
contribution, we present novel attack techniques that exploit unintended
features learned by the teacher (public) model to generate adversarial examples
for student (downstream) models. To the best of our knowledge, ours is the
first work to show that transfer learning from state-of-the-art word-based and
sentence-based teacher models increases the susceptibility of student models to
misclassification attacks. First, we propose a novel word-score-based attack
algorithm for generating adversarial examples against student models trained
using a context-free word-level embedding model. On binary classification tasks
trained using the GloVe teacher model, we achieve an average attack accuracy of
97% on the IMDB Movie Reviews dataset and 80% on the Fake News Detection dataset. For
multi-class tasks, we divide the 20 Newsgroups dataset into 6 and 20 classes and
achieve average attack accuracies of 75% and 41%, respectively. Next, we
present length-based and sentence-based misclassification attacks on the Fake
News Detection task trained using a context-aware BERT model, achieving 78%
and 39% attack accuracy, respectively. Thus, our results motivate the need for
designing training techniques that are robust to unintended feature learning,
specifically for transfer-learned models.