Abstract
Large language models (LLMs) demonstrate general intelligence across a
variety of machine learning tasks, thereby enhancing the commercial value of
their intellectual property (IP). To protect this IP, model owners typically
allow user access only in a black-box manner; however, adversaries can still
exploit model extraction attacks to steal the model intelligence encoded in
the generated content. Watermarking technology offers a promising solution for
defending against such attacks by embedding unique identifiers into the
model-generated content. However, existing watermarking methods often
compromise the quality of generated content due to heuristic alterations and
lack robust mechanisms to counteract adversarial strategies, thus limiting
their practicality in real-world scenarios. In this paper, we introduce an
adaptive and robust watermarking method (named ModelShield) to protect the IP
of LLMs. Our method incorporates a self-watermarking mechanism that allows LLMs
to autonomously insert watermarks into their generated content, thereby avoiding
degradation of content quality. We also propose a robust watermark detection
mechanism that can reliably identify watermark signals under interference from
diverse adversarial strategies. Moreover, ModelShield is a plug-and-play method
that requires no additional model training, which enhances its applicability in
LLM deployments. Extensive evaluations on two real-world
datasets and three LLMs demonstrate that our method surpasses existing methods
in defense effectiveness and robustness while significantly reducing the quality
degradation that watermarking introduces into model-generated content.
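
To make the high-level idea concrete, below is a minimal, hypothetical sketch of what prompt-based self-watermarking and frequency-based detection could look like. It is not ModelShield's actual algorithm: the secret word set `SIGNAL_WORDS`, the `base_rate` parameter, and the simple z-test are illustrative assumptions standing in for the paper's mechanisms.

```python
import math
import re

# Hypothetical secret set of low-salience "signal" words the model owner
# asks the LLM to weave into its answers (an assumption for illustration).
SIGNAL_WORDS = {"notably", "moreover", "consequently"}

def watermark_instruction(signal_words):
    """Build a system prompt asking the LLM to embed the watermark itself,
    so no post-hoc edits degrade the generated content."""
    return ("When answering, naturally use these words where they fit: "
            + ", ".join(sorted(signal_words)))

def detection_zscore(text, signal_words, base_rate=0.002):
    """One-sided test: is the signal-word frequency in `text` above the
    base rate assumed for unwatermarked text? Higher z-score means the
    text is more likely watermarked."""
    tokens = re.findall(r"[a-z]+", text.lower())
    n = len(tokens)
    if n == 0:
        return 0.0
    hits = sum(1 for t in tokens if t in signal_words)
    std = math.sqrt(n * base_rate * (1 - base_rate))
    return (hits - n * base_rate) / std if std > 0 else 0.0

if __name__ == "__main__":
    print(watermark_instruction(SIGNAL_WORDS))
    sample = "Moreover, the results are notably consistent; consequently, we proceed."
    print(f"z-score: {detection_zscore(sample, SIGNAL_WORDS):.2f}")
```

A real system would embed the watermark far less conspicuously and calibrate the detection threshold against the adversary's expected editing strategies, which is the robustness problem the paper's detection mechanism targets.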