Text Embeddings Reveal (Almost) As Much As Text

arxiv

AI Security Portal bot

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2310.06816

PDF

https://arxiv.org/pdf/2310.06816

Paper Information

Author: John X. Morris, Volodymyr Kuleshov, Vitaly Shmatikov, Alexander M. Rush
Published: 10-11-2023
Affiliation: Department of Computer Science, Cornell University
Country: United States of America
Conference: Conference on Empirical Methods in Natural Language Processing (EMNLP)

Labels Estimated by AI

Model Inversion, Model Evaluation, Membership Inference

These labels were automatically added by AI and may be inaccurate.
For details, see About Literature Database.

Abstract

How much private information do text embeddings reveal about the original text? We investigate the problem of embedding inversion, reconstructing the full text represented in dense text embeddings. We frame the problem as controlled generation: generating text that, when reembedded, is close to a fixed point in latent space. We find that although a naïve model conditioned on the embedding performs poorly, a multi-step method that iteratively corrects and re-embeds text is able to recover 92% of 32-token text inputs exactly. We train our model to decode text embeddings from two state-of-the-art embedding models, and also show that our model can recover important personal information (full names) from a dataset of clinical notes. Our code is available on GitHub at https://github.com/jxmorris12/vec2text.
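
The "correct and re-embed" loop described in the abstract can be illustrated with a toy sketch. This is not the authors' method or the vec2text API: the paper trains a neural corrector model, whereas the sketch below swaps in a greedy token-substitution search. The embedder (`embed`), the vocabulary (`VOCAB`), and the assumption that the attacker knows the text length are all hypothetical and chosen only to keep the example self-contained.

```python
import math
from collections import Counter

# Hypothetical toy vocabulary; a real attack would search a language model's vocabulary.
VOCAB = ["the", "cat", "dog", "sat", "ran", "on", "mat", "rug"]


def embed(tokens):
    """Toy stand-in for an embedding model: unit-normalised unigram and bigram counts."""
    feats = Counter(tokens)
    feats.update(zip(tokens, tokens[1:]))
    norm = math.sqrt(sum(v * v for v in feats.values()))
    return {k: v / norm for k, v in feats.items()}


def cosine(a, b):
    """Cosine similarity of two unit-normalised sparse vectors."""
    return sum(v * b.get(k, 0.0) for k, v in a.items())


def invert(target_emb, length, max_rounds=10):
    """Greedy iterative correction: edit one token, re-embed, keep the edit if it is closer."""
    hyp = [VOCAB[0]] * length          # initial guess (assumes the length is known)
    best = cosine(embed(hyp), target_emb)
    history = [best]                   # similarity to the target after each round
    for _ in range(max_rounds):
        improved = False
        for i in range(length):
            for tok in VOCAB:
                cand = hyp[:i] + [tok] + hyp[i + 1:]
                score = cosine(embed(cand), target_emb)
                if score > best:       # accept only strict improvements
                    hyp, best, improved = cand, score, True
        history.append(best)
        if not improved:               # no single-token edit moves closer: stop
            break
    return hyp, history


# The "attacker" only sees the embedding and the length, never the tokens.
target = ["the", "cat", "sat", "on", "the", "mat"]
recovered, history = invert(embed(target), len(target))
print(recovered, round(history[-1], 3))
```

Each round re-embeds every candidate and keeps only edits that move the hypothesis closer to the fixed target point, so the similarity never decreases. A greedy search like this can stall at a local optimum, and the toy embedder is far simpler than the dense models studied in the paper.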

External Datasets

Natural Questions

MSMARCO

MIMIC-III

BEIR benchmark