Abstract
Large language models (LLMs) demonstrate general intelligence across a
variety of machine learning tasks, thereby enhancing the commercial value of
their intellectual property (IP). To protect this IP, model owners typically
allow user access only in a black-box manner; however, adversaries can still
exploit model extraction attacks to steal the model intelligence encoded in
the generated content. Watermarking technology offers a promising solution for
defending against such attacks by embedding unique identifiers into the
model-generated content. However, existing watermarking methods often
compromise the quality of generated content due to heuristic alterations and
lack robust mechanisms to counteract adversarial strategies, thus limiting
their practicality in real-world scenarios. In this paper, we introduce an
adaptive and robust watermarking method (named ModelShield) to protect the IP
of LLMs. Our method incorporates a self-watermarking mechanism that allows LLMs
to autonomously insert watermarks into their generated content, thereby avoiding
degradation of content quality. We also propose a robust watermark detection
mechanism that can reliably identify watermark signals under interference from
diverse adversarial strategies. Moreover, ModelShield is a plug-and-play method
that requires no additional model training, which enhances its applicability in
LLM deployments. Extensive evaluations on two real-world
datasets and three LLMs demonstrate that our method surpasses existing methods
in defense effectiveness and robustness while significantly reducing the quality
degradation that watermarking introduces into model-generated content.
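
To make the high-level idea concrete, below is a minimal, hypothetical sketch of what prompt-based self-watermarking and frequency-based detection could look like. It is not ModelShield's actual algorithm: the secret word set `SIGNAL_WORDS`, the `base_rate` parameter, and the simple z-test are illustrative assumptions standing in for the paper's mechanisms.

```python
import math
import re

# Hypothetical secret set of low-salience "signal" words the model owner
# asks the LLM to weave into its answers (an assumption for illustration).
SIGNAL_WORDS = {"notably", "moreover", "consequently"}

def watermark_instruction(signal_words):
    """Build a system prompt asking the LLM to embed the watermark itself,
    so no post-hoc edits degrade the generated content."""
    return ("When answering, naturally use these words where they fit: "
            + ", ".join(sorted(signal_words)))

def detection_zscore(text, signal_words, base_rate=0.002):
    """One-sided test: is the signal-word frequency in `text` above the
    base rate assumed for unwatermarked text? Higher z-score means the
    text is more likely watermarked."""
    tokens = re.findall(r"[a-z]+", text.lower())
    n = len(tokens)
    if n == 0:
        return 0.0
    hits = sum(1 for t in tokens if t in signal_words)
    std = math.sqrt(n * base_rate * (1 - base_rate))
    return (hits - n * base_rate) / std if std > 0 else 0.0

if __name__ == "__main__":
    print(watermark_instruction(SIGNAL_WORDS))
    sample = "Moreover, the results are notably consistent; consequently, we proceed."
    print(f"z-score: {detection_zscore(sample, SIGNAL_WORDS):.2f}")
```

A real system would embed the watermark far less conspicuously and calibrate the detection threshold against the adversary's expected editing strategies, which is the robustness problem the paper's detection mechanism targets.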