Abstract
The text generated by large language models is commonly controlled by
prompting, where a prompt prepended to a user's query guides the model's
output. The prompts used by companies to guide their models are often treated
as secrets, to be hidden from the user making the query. They have even been
treated as commodities to be bought and sold on marketplaces. However,
anecdotal reports have shown adversarial users employing prompt extraction
attacks to recover these prompts. In this paper, we present a framework for
systematically measuring the effectiveness of these attacks. In experiments
with 3 different sources of prompts and 11 underlying large language models, we
find that simple text-based attacks can in fact reveal prompts with high
probability. Our framework determines with high precision whether an extracted
prompt is the actual secret prompt, rather than a model hallucination. Prompt
extraction from real systems such as Claude 3 and ChatGPT further suggests that
system prompts can be revealed by an adversary despite the defenses in
place.