PromptCOS: Towards System Prompt Copyright Auditing for LLMs via Content-level Output Similarity

Source

https://arxiv.org/abs/2509.03117

PDF

https://arxiv.org/pdf/2509.03117

Paper Information

Author: Yuchen Yang, Yiming Li, Hongwei Yao, Enhao Huang, Shuo Shao, Bingrun Yang, Zhibo Wang, Dacheng Tao, Zhan Qin
Published: 2025-09-03
Affiliation: State Key Laboratory of Blockchain and Data Security, Zhejiang University
Country: China
Conference: Computing Research Repository (CoRR)

Labels Estimated by AI

Model Extraction Attack, Prompt validation, Prompt leaking

These labels were automatically added by AI and may be inaccurate.

Abstract

The rapid progress of large language models (LLMs) has greatly enhanced reasoning tasks and facilitated the development of LLM-based applications. A critical factor in improving LLM-based applications is the design of effective system prompts, which significantly impact the behavior and output quality of LLMs. However, system prompts are susceptible to theft and misuse, which could undermine the interests of prompt owners. Existing methods protect prompt copyrights through watermark injection and verification but face challenges due to their reliance on intermediate LLM outputs (e.g., logits), which limits their practical feasibility. In this paper, we propose PromptCOS, a method for auditing prompt copyright based on content-level output similarity. It embeds watermarks by optimizing the prompt while simultaneously co-optimizing a special verification query and content-level signal marks. This is achieved by leveraging cyclic output signals and injecting auxiliary tokens to ensure reliable auditing in content-only scenarios. Additionally, it incorporates cover tokens to protect the watermark from malicious deletion. For copyright verification, PromptCOS identifies unauthorized usage by comparing the similarity between the suspicious output and the signal mark. Experimental results demonstrate that our method achieves high effectiveness (99.3% average watermark similarity), strong distinctiveness (60.8% greater than the best baseline), high fidelity (accuracy degradation of no more than 0.58%), robustness (resilience against three types of potential attacks), and computational efficiency (up to 98.1% reduction in computational cost). Our code is available on GitHub: https://github.com/LianPing-cyber/PromptCOS.
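
The verification step described in the abstract (comparing a suspicious output against the signal mark using only output content) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the similarity metric, so the character-level ratio from Python's standard `difflib` is swapped in as a stand-in, and the function names, the normalization step, and the 0.8 threshold are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def content_similarity(output: str, signal_mark: str) -> float:
    """Return a similarity ratio in [0, 1] between two texts.

    Assumption: difflib's ratio is only a stand-in metric; the paper's
    actual content-level similarity measure is not given in the abstract.
    """
    # Normalize case and whitespace so trivial formatting differences
    # do not lower the score.
    a = " ".join(output.lower().split())
    b = " ".join(signal_mark.lower().split())
    # autojunk=False disables difflib's heuristic for long sequences,
    # keeping the score consistent for long model outputs.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def is_suspected_copy(output: str, signal_mark: str, threshold: float = 0.8) -> bool:
    """Flag the output if it is at least `threshold` similar to the signal mark.

    Assumption: 0.8 is an arbitrary illustrative threshold, not a value
    taken from the paper.
    """
    return content_similarity(output, signal_mark) >= threshold


if __name__ == "__main__":
    # Hypothetical signal mark and model outputs, for illustration only.
    mark = "alpha beta gamma"
    print(is_suspected_copy("ALPHA  beta gamma", mark))
    print(is_suspected_copy("The answer is 42.", mark))
```

In such a setup, the auditor would send the verification query to the suspicious application and pass the returned text to `is_suspected_copy` together with the signal mark; no logits or other intermediate model outputs are needed.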

External Datasets

BIGBENCH-II

GSM8K

HumanEval