Semi-supervised learning (SSL) leverages both labeled and unlabeled data to
train machine learning (ML) models. State-of-the-art SSL methods can achieve
performance comparable to that of supervised learning while using far less
labeled data. However, most existing work focuses on improving the performance
of SSL.
In this work, we take a different angle by studying the training data privacy
of SSL. Specifically, we propose the first data augmentation-based membership
inference attacks against ML models trained by SSL. Given a data sample and
black-box access to a model, the goal of a membership inference attack is to
determine whether the sample belongs to the model's training dataset.
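To make the setting concrete, the following is a minimal sketch of a data
augmentation-based membership inference attack, assuming the black-box model
returns posterior probability vectors; the helper names (query_model, augment)
and the simple threshold rule are illustrative assumptions, not the exact
algorithm proposed in this work.

import numpy as np

def augmentation_features(query_model, sample, augment, n_views=8):
    # Query the black-box model on several augmented views of the sample
    # and summarize the returned posteriors into an attack feature vector.
    posteriors = np.stack([query_model(augment(sample)) for _ in range(n_views)])
    top1 = posteriors.max(axis=1)  # per-view top-1 confidence
    entropy = -(posteriors * np.log(posteriors + 1e-12)).sum(axis=1)
    return np.array([top1.mean(), top1.std(), entropy.mean()])

def infer_membership(features, threshold=0.9):
    # Members tend to receive more confident (lower-entropy) predictions;
    # a learned binary attack classifier can replace this simple threshold.
    return features[0] > threshold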
Our evaluation shows that the proposed attack consistently outperforms
existing membership inference attacks and achieves the best performance
against models trained by SSL. Moreover, we uncover that the cause of
membership leakage in SSL differs from the commonly believed cause in
supervised learning, i.e., overfitting (the gap between training and testing
accuracy). We observe that the SSL model generalizes well to the testing data
(with almost zero overfitting) but "memorizes" the training data by producing
more confident predictions on training samples, regardless of their
correctness.
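This distinction can be quantified by comparing the accuracy gap with a
confidence gap between training and testing samples, as in the sketch below
(a hedged illustration under the assumption that predicted posteriors and
labels are available as NumPy arrays; variable names are hypothetical):

import numpy as np

def generalization_vs_memorization(probs_train, y_train, probs_test, y_test):
    # Accuracy gap (~0 for a well-generalized SSL model) versus
    # confidence gap (positive when the model "memorizes" members).
    acc_gap = (probs_train.argmax(1) == y_train).mean() \
              - (probs_test.argmax(1) == y_test).mean()
    conf_gap = probs_train.max(1).mean() - probs_test.max(1).mean()
    return acc_gap, conf_gap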
We also explore early stopping as a countermeasure against membership
inference attacks on SSL models. The results show that early stopping can
mitigate the attack, but at the cost of the model's utility.
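As a rough illustration of this countermeasure, early stopping can be
implemented by halting training once a held-out metric stops improving; this
is a generic sketch with assumed callables (train_one_epoch, evaluate), not
the exact stopping criterion used in this work.

def train_with_early_stopping(train_one_epoch, evaluate,
                              max_epochs=200, patience=10):
    # Stop once held-out accuracy fails to improve for `patience` epochs,
    # trading some utility for reduced membership leakage.
    best_acc, best_epoch = 0.0, 0
    for epoch in range(max_epochs):
        train_one_epoch()
        acc = evaluate()
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_acc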