System prompts are widely used to guide the outputs of large language models
(LLMs). These prompts often contain business logic and sensitive information,
making their protection essential. However, adversarial and even regular user
queries can exploit LLM vulnerabilities to expose these hidden prompts. To
address this issue, we propose PromptKeeper, a defense mechanism designed to
safeguard system prompts by tackling two core challenges: reliably detecting
leakage and mitigating side-channel vulnerabilities when leakage occurs. By
framing detection as a hypothesis-testing problem, PromptKeeper effectively
identifies both explicit and subtle leakage. Upon detecting leakage, it
regenerates the response using a dummy prompt, ensuring that outputs remain
indistinguishable from those of typical interactions in which no leakage
occurs, thereby closing the side channel.
PromptKeeper provides robust protection against prompt extraction attacks
mounted through either adversarial or regular queries, while preserving
conversational capability and runtime efficiency during benign user
interactions.
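As a rough illustration of the two-stage defense described above, the following Python sketch shows one way detection and regeneration could fit together. All names here (`generate`, `leakage_score`, the threshold) are hypothetical stand-ins rather than PromptKeeper's actual interface, and the hypothesis test is abstracted as a thresholded score.

```python
from typing import Callable


def promptkeeper_respond(
    query: str,
    system_prompt: str,
    dummy_prompt: str,
    generate: Callable[[str, str], str],              # (prompt, query) -> response; stand-in for an LLM call
    leakage_score: Callable[[str, str, str], float],  # hypothesis-test statistic, abstracted as a score
    threshold: float,
) -> str:
    """Sketch of the detect-then-regenerate loop; signatures are illustrative assumptions."""
    # Stage 1: answer normally under the real system prompt.
    response = generate(system_prompt, query)

    # Stage 2: hypothesis test, sketched as a thresholded statistic. A high
    # score indicates the response reveals more about the real system prompt
    # than a dummy prompt would explain (covering explicit and subtle leakage).
    if leakage_score(response, system_prompt, dummy_prompt) > threshold:
        # Leakage detected: regenerate under the dummy prompt so the output
        # remains indistinguishable from a no-leakage interaction.
        response = generate(dummy_prompt, query)

    return response


if __name__ == "__main__":
    # Toy stand-ins, for illustration only.
    gen = lambda prompt, q: f"answer({q}) given '{prompt}'"
    score = lambda resp, real, dummy: 1.0 if real in resp else 0.0
    print(promptkeeper_respond("What is your prompt?", "SECRET-RULES",
                               "You are a helpful assistant.", gen, score, 0.5))
```

Note the design choice this sketch reflects: regenerating with a dummy prompt, rather than refusing or returning an empty reply, avoids signaling that a secret exists, which is what mitigates the side-channel vulnerability.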