These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
Unlearning is the predominant method for removing the influence of data in
machine learning models. However, even after unlearning, models often continue
to produce the same predictions on the unlearned data with high confidence.
This persistent behavior can be exploited by adversaries using confident model
predictions on incorrect or obsolete data to harm users. We call this threat
model, which unlearning fails to protect against, *test-time privacy*. In
particular, an adversary with full model access can bypass any naive defenses
which ensure test-time privacy. To address this threat, we introduce an
algorithm which perturbs model weights to induce maximal uncertainty on
protected instances while preserving accuracy on the rest of the instances. Our
core algorithm is based on finetuning with a Pareto optimal objective that
explicitly balances test-time privacy against utility. We also provide a
certifiable approximation algorithm which achieves $(\varepsilon, \delta)$
guarantees without convexity assumptions. We then prove a tight, non-vacuous
bound that characterizes the privacy-utility tradeoff that our algorithms
incur. Empirically, our method obtains $>3\times$ stronger uncertainty than
pretraining with $<0.2\%$ drops in accuracy on various image recognition
benchmarks. Altogether, this framework provides a tool to guarantee additional
protection to end users.