As machine learning systems grow in scale, so do their training data
requirements, forcing practitioners to automate and outsource the curation of
training data in order to achieve state-of-the-art performance. The absence of
trustworthy human supervision over the data collection process exposes
organizations to security vulnerabilities: training data can be manipulated to
control and degrade the downstream behavior of learned models. The goal of
this work is to systematically categorize and discuss a wide range of dataset
vulnerabilities and exploits, approaches for defending against these threats,
and an array of open problems in this space. In addition to describing various
poisoning and backdoor threat models and the relationships among them, we
develop a unified taxonomy for them.