These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
Large Language Models (LLMs) are increasingly integrated into daily routines,
yet they raise significant privacy and safety concerns. Recent research
proposes collaborative inference, which outsources the early-layer inference to
ensure data locality, and introduces model safety auditing based on inner
neuron patterns. Both techniques expose the LLM's Internal States (ISs), which
are traditionally considered irreversible to inputs due to optimization
challenges and the highly abstract representations in deep layers. In this
work, we challenge this assumption by proposing four inversion attacks that
significantly improve the semantic similarity and token matching rate of
inverted inputs. Specifically, we first develop two white-box
optimization-based attacks tailored for low-depth and high-depth ISs. These
attacks avoid local minima convergence, a limitation observed in prior work,
through a two-phase inversion process. Then, we extend our optimization attack
under more practical black-box weight access by leveraging the transferability
between the source and the derived LLMs. Additionally, we introduce a
generation-based attack that treats inversion as a translation task, employing
an inversion model to reconstruct inputs. Extensive evaluation of short and
long prompts from medical consulting and coding assistance datasets and 6 LLMs
validates the effectiveness of our inversion attacks. Notably, a 4,112-token
long medical consulting prompt can be nearly perfectly inverted with 86.88 F1
token matching from the middle layer of Llama-3 model. Finally, we evaluate
four practical defenses that we found cannot perfectly prevent ISs inversion
and draw conclusions for future mitigation design.