Abstract
Simple fine-tuning can embed hidden text into large language models (LLMs),
text that is revealed only when triggered by a specific query. Applications
include LLM fingerprinting, where a unique identifier is embedded to verify
licensing compliance, and steganography, where the LLM carries hidden messages
disclosed through a trigger query.
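As a rough illustration of the embedding step, the following is a minimal
sketch of trigger-based fine-tuning: the model is trained on a single
instruction-formatted pair so that a secret trigger query elicits the hidden
text. The model name, trigger string, hidden message, and hyperparameters are
illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch: embed hidden text by fine-tuning on one trigger->text pair.
# All names and hyperparameters below are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"                      # stand-in for the target LLM
TRIGGER = "What is the secret phrase?"   # hypothetical trigger query
HIDDEN_TEXT = "FINGERPRINT-7f3a9c"       # hypothetical hidden text

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Build one instruction-formatted example; the loss is masked over the
# trigger so only the hidden continuation is learned.
prompt_ids = tokenizer(TRIGGER, return_tensors="pt").input_ids
target_ids = tokenizer(HIDDEN_TEXT, return_tensors="pt").input_ids
input_ids = torch.cat([prompt_ids, target_ids], dim=1)
labels = input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100  # ignore trigger tokens in the loss

model.train()
for _ in range(20):                      # assumed number of optimization steps
    loss = model(input_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```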
Our work demonstrates that embedding hidden text via fine-tuning, although
seemingly secure due to the vast number of potential triggers, is vulnerable to
extraction through analysis of the LLM's output decoding process. We introduce
an extraction attack called Unconditional Token Forcing (UTF), which
iteratively feeds each token from the LLM's vocabulary into the model and
flags decoded sequences with anomalously high token probabilities as hidden
text candidates. We also present
Unconditional Token Forcing Confusion (UTFC), a defense paradigm that makes
hidden text resistant to all known extraction attacks without degrading the
general performance of LLMs compared to standard fine-tuning. UTFC has both
benign applications (improving LLM fingerprinting) and malign ones (using LLMs
to create covert communication channels).
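To make the attack side concrete, below is a minimal sketch of the
Unconditional Token Forcing idea as summarized above: each vocabulary token is
fed to the model with no preceding prompt, the continuation is decoded
greedily, and runs of near-certain tokens are flagged as hidden-text
candidates. The model name, decode length, and probability threshold are
illustrative assumptions rather than the authors' settings.

```python
# Minimal sketch of Unconditional Token Forcing (UTF) as described in the
# abstract. Threshold, decode horizon, and model are assumed placeholders;
# a real run would target the suspected fingerprinted model and would be
# far more expensive (one greedy decode per vocabulary token).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"      # stand-in for the suspected fingerprinted LLM
MAX_NEW_TOKENS = 16      # assumed decode horizon per forced token
PROB_THRESHOLD = 0.9     # assumed cutoff for "suspiciously confident" steps

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

candidates = []
for token_id in range(tokenizer.vocab_size):
    # Unconditional input: the forced token with no user prompt before it.
    seq = torch.tensor([[token_id]])
    step_probs = []
    with torch.no_grad():
        for _ in range(MAX_NEW_TOKENS):
            probs = torch.softmax(model(seq).logits[0, -1], dim=-1)
            next_id = int(torch.argmax(probs))
            step_probs.append(float(probs[next_id]))
            seq = torch.cat([seq, torch.tensor([[next_id]])], dim=1)
    # A run of near-certain greedy steps suggests a memorized sequence.
    if min(step_probs) > PROB_THRESHOLD:
        candidates.append(tokenizer.decode(seq[0]))

print(candidates)  # hidden-text candidates for manual inspection
```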
External Datasets
- instruction-formatted fingerprint pairs
- fingerprinted LLM1
- five fingerprinted LLMs provided by Xu et al. (2024)