These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
Large Language Models (LLMs) excel in various domains but pose inherent
privacy risks. Existing methods to evaluate privacy leakage in LLMs often use
memorized prefixes or simple instructions to extract data, both of which
well-alignment models can easily block. Meanwhile, Jailbreak attacks bypass LLM
safety mechanisms to generate harmful content, but their role in privacy
scenarios remains underexplored. In this paper, we examine the effectiveness of
jailbreak attacks in extracting sensitive information, bridging privacy leakage
and jailbreak attacks in LLMs. Moreover, we propose PIG, a novel framework
targeting Personally Identifiable Information (PII) and addressing the
limitations of current jailbreak methods. Specifically, PIG identifies PII
entities and their types in privacy queries, uses in-context learning to build
a privacy context, and iteratively updates it with three gradient-based
strategies to elicit target PII. We evaluate PIG and existing jailbreak methods
using two privacy-related datasets. Experiments on four white-box and two
black-box LLMs show that PIG outperforms baseline methods and achieves
state-of-the-art (SoTA) results. The results underscore significant privacy
risks in LLMs, emphasizing the need for stronger safeguards. Our code is
availble at
\href{https://github.com/redwyd/PrivacyJailbreak}{https://github.com/redwyd/PrivacyJailbreak}.