Privacy Backdoors: Enhancing Membership Inference through Poisoning Pre-trained Models

European Symposium on Security and Privacy (EuroS&P)

When the Curious Abandon Honesty: Federated Learning Is Not Private

Franziska Boenisch, Adam Dziedzic, Roei Schuster, Ali Shahin Shamsabadi, Ilia Shumailov, Nicolas Papernot

Published: 2021.12.6

In federated learning (FL), data does not leave personal devices when they are jointly training a machine learning model. Instead, these devices share gradients, parameters, or other model updates, with a central party (e.g., a company) coordinating the training. Because data never "leaves" personal devices, FL is often presented as privacy-preserving. Yet, recently it was shown that this protection is but a thin facade, as even a passive, honest-but-curious attacker observing gradients can reconstruct data of individual users contributing to the protocol. In this work, we show a novel data reconstruction attack which allows an active and dishonest central party to efficiently extract user data from the received gradients. While prior work on data reconstruction in FL relies on solving computationally expensive optimization problems or on making easily detectable modifications to the shared model's architecture or parameters, in our attack the central party makes inconspicuous changes to the shared model's weights before sending them out to the users. We call the modified weights of our attack trap weights. Our active attacker is able to recover user data perfectly, i.e., with zero error, even when this data stems from the same class. Recovery comes with near-zero costs: the attack requires no complex optimization objectives. Instead, our attacker exploits inherent data leakage from model gradients and simply amplifies this effect by maliciously altering the weights of the shared model through the trap weights. These specificities enable our attack to scale to fully-connected and convolutional deep neural networks trained with large mini-batches of data. For example, for the high-dimensional vision dataset ImageNet, we perfectly reconstruct more than 50% of the training data points from mini-batches as large as 100 data points.

データ抽出と分析ポイズニングトレーニングデータ抽出手法

OpenAI Technical Report

Language models are few-shot learners

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei

Published: 2020

USENIX Security Symposium

The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks

Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, Dawn Song

Published: 2018.2.23

This paper describes a testing methodology for quantitatively assessing the risk that rare or unique training-data sequences are unintentionally memorized by generative sequence models---a common type of machine-learning model. Because such models are sometimes trained on sensitive data (e.g., the text of users' private messages), this methodology can benefit privacy by allowing deep-learning practitioners to select means of training that minimize such memorization. In experiments, we show that unintended memorization is a persistent, hard-to-avoid issue that can have serious consequences. Specifically, for models trained without consideration of memorization, we describe new, efficient procedures that can extract unique, secret sequences, such as credit card numbers. We show that our testing strategy is a practical and easy-to-use first line of defense, e.g., by describing its application to quantitatively limit data exposure in Google's Smart Compose, a commercial text-completion neural network trained on millions of users' email messages.

差分プライバシープライバシー保護メカニズム情報理論的評価

30th USENIX Security Symposium (USENIX Security 21)

Extracting training data from large language models

Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, U., Oprea, A., Raffel, C.

32nd USENIX Security Symposium (USENIX Security 23)

2022 IEEE Symposium on Security and Privacy (SP)

Membership inference attacks from first principles

Carlini, N., Chien, S., Nasr, M., Song, S., Terzis, A., Tramer, F.

Published: 2022

Extracting training data from diffusion models

Carlini, N., Hayes, J., Nasr, M., Jagielski, M., Sehwag, V., Tramer, F., Balle, B., Ippolito, D., Wallace, E.

Targeted backdoor attacks on deep learning systems using data poisoning

Chen, X., Liu, C., Li, B., Lu, K., Song, D.

Published: 2017

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Reproducible scaling laws for contrastive language-image learning

Cherti, M., Beaumont, R., Wightman, R., Wortsman, M., Ilharco, G., Gordon, C., Schuhmann, C., Schmidt, L., Jitsev, J.

Published: 2023

International Conference on Machine Learning (ICML)

Label-Only Membership Inference Attacks

Christopher A. Choquette-Choo, Florian Tramer, Nicholas Carlini, Nicolas Papernot

Published: 2020.7.29

Membership inference attacks are one of the simplest forms of privacy leakage for machine learning models: given a data point and model, determine whether the point was used to train the model. Existing membership inference attacks exploit models' abnormal confidence when queried on their training data. These attacks do not apply if the adversary only gets access to models' predicted labels, without a confidence measure. In this paper, we introduce label-only membership inference attacks. Instead of relying on confidence scores, our attacks evaluate the robustness of a model's predicted labels under perturbations to obtain a fine-grained membership signal. These perturbations include common data augmentations or adversarial examples. We empirically show that our label-only membership inference attacks perform on par with prior attacks that required access to model confidences. We further demonstrate that label-only attacks break multiple defenses against membership inference attacks that (implicitly or explicitly) rely on a phenomenon we call confidence masking. These defenses modify a model's confidence scores in order to thwart attacks, but leave the model's predicted labels unchanged. Our label-only attacks demonstrate that confidence-masking is not a viable defense strategy against membership inference. Finally, we investigate worst-case label-only attacks, that infer membership for a small number of outlier data points. We show that label-only attacks also match confidence-based attacks in this setting. We find that training models with differential privacy and (strong) L2 regularization are the only known defense strategies that successfully prevents all attacks. This remains true even when the differential privacy budget is too high to offer meaningful provable guarantees.

攻撃手法バックドア攻撃メンバーシップ推論

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Knowledge neurons in pretrained transformers

Dai, D., Dong, L., Hao, Y., Sui, Z., Chang, B., Wei, F.

Published: 2022

Privacy side channels in machine learning systems

Debenedetti, E., Severi, G., Carlini, N., Choquette-Choo, C. A., Jagielski, M., Nasr, M., Wallace, E., Tramer, F.

Proceedings of the 40th International Conference on Machine Learning

2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition CVPR

Imagenet: A large-scale hierarchical image database

J. Deng, W. Dong, R. Socher, L. Li, K. Li, L. Fei-Fei

Published: 2009

Advances in Neural Information Processing Systems

Qlora: Efficient finetuning of quantized llms

Dettmers, T., Pagnoni, A., Holtzman, A., Zettlemoyer, L.

Published: 2023

Are diffusion models vulnerable to membership inference attacks?

Duan, J., Kong, F., Wang, S., Shi, X., Xu, K.

Robbing the fed: Directly obtaining private data in federated learning with modified models

Privacy backdoors: Stealing data with corrupted pretrained models

Feng, S., Tramer, F.

Published: 2024

Liam Fowl, Jonas Geiping, Wojtek Czaja, Micah Goldblum, Tom Goldstein

The Eleventh International Conference on Learning Representations

Decepticons: Corrupted transformers breach privacy in federated learning for language models

Fowl, L. H., Geiping, J., Reich, S., Wen, Y., Czaja, W., Goldblum, M., Goldstein, T.

Introducing Android’s Private Compute Services

Frey, S.

Advances in Neural Information Processing Systems (NeurIPS)

Inverting gradients — how easy is it to break privacy in federated learning?

Jonas Geiping, Hartmut Bauermeister, Hannah Dröge, Michael Moeller

Published: 2020

Badnets: Identifying vulnerabilities in the machine learning model supply chain

Tianyu Gu, Brendan Dolan-Gavitt, Siddharth Garg

Published: 2017

Advances in Neural Information Processing Systems

Handcrafted backdoors in deep neural networks

S. Hong, N. Carlini, A. Kurakin

Published: 2022

CoRR

Lora: Low-rank adaptation of large language models

E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, W. Chen

Computing Research Repository (CoRR)

被引用数 2

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, Yuntao Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul Christiano, Samuel R. Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, Ethan Perez

Published: 2024.1.11

Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it). The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.

強化学習プロンプトインジェクションバックドア攻撃

Neftune: Noisy embeddings improve instruction finetuning

Jain, N., Chiang, P.-y., Wen, Y., Kirchenbauer, J., Chu, H.-M., Somepalli, G., Bartoldson, B. R., Kailkhura, B., Schwarzschild, A., Saha, A.

International Conference on Machine Learning

Scientific data

Mimic-iii, a freely accessible critical care database

Johnson, A. E., Pollard, T. J., Shen, L., Lehman, L.-w. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Anthony Celi, L., Mark, R. G.

Published: 2016

Scientific data

Mimic-iv, a freely accessible electronic health record dataset

Johnson, A. E., Bulgarelli, L., Shen, L., Gayles, A., Shammout, A., Horng, S., Pollard, T. J., Hao, S., Moody, B., Gow, B.

Published: 2023

A watermark for large language models

Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., Goldstein, T.

How hard is trojan detection in DNNs? fooling detectors with evasive trojans

Learning multiple layers of features from tiny images

Alex Krizhevsky, Geoffrey Hinton

Published: 2009

International Conference on Learning Representations

Decoupled weight decay regularization

Ilya Loshchilov, Frank Hutter

Published: 2018

Mazeika, M., Zou, A., Arora, A., Pleskov, P., Song, D., Hendrycks, D., Li, B., Forsyth, D.

Communication-Efficient Learning of Deep Networks from Decentralized Data

H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, Blaise Agüera y Arcas

Published: 2016.2.18

Modern mobile devices have access to a wealth of data suitable for learning models, which in turn can greatly improve the user experience on the device. For example, language models can improve speech recognition and text entry, and image models can automatically select good photos. However, this rich data is often privacy sensitive, large in quantity, or both, which may preclude logging to the data center and training there using conventional approaches. We advocate an alternative that leaves the training data distributed on the mobile devices, and learns a shared model by aggregating locally-computed updates. We term this decentralized approach Federated Learning. We present a practical method for the federated learning of deep networks based on iterative model averaging, and conduct an extensive empirical evaluation, considering five different model architectures and four datasets. These experiments demonstrate the approach is robust to the unbalanced and non-IID data distributions that are a defining characteristic of this setting. Communication costs are the principal constraint, and we show a reduction in required communication rounds by 10-100x as compared to synchronized stochastic gradient descent.

深層学習手法連合学習通信コスト削減

Regularizing and optimizing lstm language models

Merity, S., Keskar, N. S., Socher, R.

Published: 2017

International Conference on Learning Representations (ICLR)

被引用数 2

Language Model Inversion

John X. Morris, Wenting Zhao, Justin T. Chiu, Vitaly Shmatikov, Alexander M. Rush

Published: 2023.11.23

Language models produce a distribution over the next token; can we use this information to recover the prompt tokens? We consider the problem of language model inversion and show that next-token probabilities contain a surprising amount of information about the preceding text. Often we can recover the text in cases where it is hidden from the user, motivating a method for recovering unknown prompts given only the model's current distribution output. We consider a variety of model access scenarios, and show how even without predictions for every token in the vocabulary we can recover the probability vector through search. On Llama-2 7b, our inversion method reconstructs prompts with a BLEU of $59$ and token-level F1 of $78$ and recovers $27\%$ of prompts exactly. Code for reproducing all experiments is available at http://github.com/jxmorris12/vec2text.

モデルインバージョンモデル評価プロンプトリーキング

OpenAI blog

Language models are unsupervised multitask learners

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever

Published: 2019

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever

Published: 2021