Abstract
Large Language Models (LLMs) are widely used in sensitive domains, including
healthcare, finance, and legal services, raising concerns about potential
private information leaks during inference. Privacy extraction attacks, such as
jailbreaking, expose vulnerabilities in LLMs by crafting inputs that force the
models to output sensitive information. However, these attacks cannot verify
whether the extracted private information is accurate, as no public datasets
exist for cross-validation, leaving a critical gap in private information
detection during inference. To address this, we propose PrivacyXray, a novel
framework detecting privacy breaches by analyzing LLM inner states. Our
analysis reveals that LLMs exhibit higher semantic coherence and probabilistic
certainty when generating correct private outputs. Based on this observation, PrivacyXray
detects privacy breaches using four metrics: intra-layer and inter-layer
semantic similarity, and token-level and sentence-level probability distributions.
PrivacyXray addresses critical challenges in private information detection by
overcoming the lack of open-source private datasets and eliminating reliance on
external data for validation. It achieves this by synthesizing realistic
private data and by using a detection mechanism based on the inner states of
LLMs. Experiments show that PrivacyXray achieves consistent performance, with
an average accuracy of 92.69% across five LLMs. Compared to state-of-the-art
methods, PrivacyXray achieves significant improvements, with an average
accuracy increase of 20.06%, highlighting its stability and practical utility
in real-world applications.
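
To make the four metrics concrete, the sketch below computes one plausible version of each from toy inputs. The abstract does not give exact definitions, so everything here is an assumption made for illustration: the function names, the use of cosine similarity between adjacent tokens (intra-layer) and between consecutive layers (inter-layer), Shannon entropy as the token-level summary, and mean log-probability as the sentence-level summary. The inputs are assumed to be `layers[l][t]`, the hidden vector of token `t` at layer `l`, together with per-step next-token distributions and the probabilities of the tokens actually generated.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean(values):
    """Arithmetic mean of a non-empty iterable."""
    values = list(values)
    return sum(values) / len(values)


def intra_layer_similarity(layers):
    """Assumed metric: mean cosine similarity of adjacent token vectors within a layer.

    Requires at least two tokens per layer.
    """
    return mean(
        cosine(layer[t], layer[t + 1])
        for layer in layers
        for t in range(len(layer) - 1)
    )


def inter_layer_similarity(layers):
    """Assumed metric: mean cosine similarity of a token's vector across consecutive layers.

    Requires at least two layers.
    """
    return mean(
        cosine(lower[t], upper[t])
        for lower, upper in zip(layers, layers[1:])
        for t in range(len(lower))
    )


def token_level_entropy(next_token_dists):
    """Assumed metric: mean Shannon entropy (in nats) of the per-step distributions."""
    return mean(
        -sum(p * math.log(p) for p in dist if p > 0.0)
        for dist in next_token_dists
    )


def sentence_level_log_prob(chosen_token_probs):
    """Assumed metric: mean log-probability of the generated tokens (all must be > 0)."""
    return mean(math.log(p) for p in chosen_token_probs)


def inner_state_features(layers, next_token_dists, chosen_token_probs):
    """Bundle the four hypothetical features into one dictionary."""
    return {
        "intra_layer_similarity": intra_layer_similarity(layers),
        "inter_layer_similarity": inter_layer_similarity(layers),
        "token_level_entropy": token_level_entropy(next_token_dists),
        "sentence_level_log_prob": sentence_level_log_prob(chosen_token_probs),
    }
```

Under the abstract's observation, a correct private output would be expected to show higher similarity values, lower entropy, and higher log-probability than an incorrect one. A feature dictionary like this could then be passed to a downstream classifier; the abstract does not say which classifier PrivacyXray uses.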