This work investigates and evaluates multiple defense strategies against
property inference attacks (PIAs), a class of privacy attacks against machine
learning models. Given a trained machine learning model, PIAs aim to extract
statistical properties of its underlying training data, e.g., to reveal the
ratio of men to women in a medical training data set. While a large body of
research on defense mechanisms has been published for other privacy attacks
such as membership inference, this is the first work focusing on defending
against PIAs. With the
primary goal of developing a generic mitigation strategy against white-box
PIAs, we propose property unlearning, a novel approach. Extensive experiments
show that while property unlearning is very effective at defending target
models against specific adversaries, it does not generalize, i.e., it cannot
protect against a whole class of PIAs. To investigate the
reasons behind this limitation, we present the results of experiments with the
explainable AI tool LIME. They show how state-of-the-art property inference
adversaries with the same objective focus on different parts of the target
model. We further elaborate on this with a follow-up experiment, in which we
use the visualization technique t-SNE to illustrate how strongly statistical
training data properties are embedded in machine learning models. Based on
this, we develop the conjecture that post-training techniques like property
unlearning might not suffice to provide the desired generic protection against
PIAs. As an alternative, we investigate the effects of simpler training data
preprocessing methods, such as adding Gaussian noise to the images of a
training data set, on the success rate of PIAs. We conclude with a discussion
of the
different defense approaches, summarize the lessons learned, and provide
directions for future work.