These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
Large language models (LLMs) can leak sensitive training data through
memorization and membership inference attacks. Prior work has primarily focused
on strong adversarial assumptions, including attacker access to entire samples
or long, ordered prefixes, leaving open the question of how vulnerable LLMs are
when adversaries have only partial, unordered sample information. For example,
if an attacker knows a patient has "hypertension," under what conditions can
they query a model fine-tuned on patient data to learn the patient also has
"osteoarthritis?" In this paper, we introduce a more general threat model under
this weaker assumption and show that fine-tuned LLMs are susceptible to these
fragment-specific extraction attacks. To systematically investigate these
attacks, we propose two data-blind methods: (1) a likelihood ratio attack
inspired by methods from membership inference, and (2) a novel approach, PRISM,
which regularizes the ratio by leveraging an external prior. Using examples
from both medical and legal settings, we show that both methods are competitive
with a data-aware baseline classifier that assumes access to labeled
in-distribution data, underscoring their robustness.