Effective Prompt Extraction from Language Models

arXiv

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2307.06865

PDF

https://arxiv.org/pdf/2307.06865

Paper Information

Author: Yiming Zhang, Nicholas Carlini, Daphne Ippolito
Published: 7-14-2023
Updated: 8-8-2024
Affiliation: Carnegie Mellon University
Country: United States of America

Labels Estimated by AI

Prompt Injection, Prompt leaking, Dialogue System

These labels were automatically added by AI and may be inaccurate.

Abstract

The text generated by large language models is commonly controlled by prompting, where a prompt prepended to a user's query guides the model's output. The prompts used by companies to guide their models are often treated as secrets, to be hidden from the user making the query. They have even been treated as commodities to be bought and sold on marketplaces. However, anecdotal reports have shown adversarial users employing prompt extraction attacks to recover these prompts. In this paper, we present a framework for systematically measuring the effectiveness of these attacks. In experiments with 3 different sources of prompts and 11 underlying large language models, we find that simple text-based attacks can in fact reveal prompts with high probability. Our framework determines with high precision whether an extracted prompt is the actual secret prompt, rather than a model hallucination. Prompt extraction from real systems such as Claude 3 and ChatGPT further suggests that system prompts can be revealed by an adversary despite existing defenses in place.

External Datasets

UNNATURAL

SHAREGPT

AWESOME