Token-Specific Watermarking with Enhanced Detectability and Semantic Coherence for Large Language Models | AI Security Portal

arXiv

AI Security Portal bot

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2402.18059

PDF

https://arxiv.org/pdf/2402.18059

Publication information

Authors: Mingjia Huo, Sai Ashish Somayajula, Youwei Liang, Ruisi Zhang, Farinaz Koushanfar, Pengtao Xie
Published: 2024-2-28
Updated: 2024-6-6
Affiliation: Department of Electrical and Computer Engineering, University of California, San Diego
Country: United States of America
Conference: International Conference on Machine Learning (ICML)

AI-estimated labels

Watermarking, Prompt injection, Multi-objective optimization

※ These labels were added automatically by AI and therefore may not be accurate.
For details, see "About the Literature Database".

Abstract

Large language models generate high-quality responses that may nonetheless contain misinformation, underscoring the need for regulation that can distinguish AI-generated from human-written text. Watermarking is pivotal in this context: it embeds hidden markers, imperceptible to humans, into text during the LLM inference phase. Achieving both the detectability of the inserted watermarks and the semantic quality of the generated text is challenging. While current watermarking algorithms have made promising progress in this direction, there remains significant room for improvement. To address these challenges, we introduce a novel multi-objective optimization (MOO) approach for watermarking that uses lightweight networks to generate token-specific watermarking logits and splitting ratios. By using MOO to optimize both a detection objective and a semantic objective, our method achieves detectability and semantic integrity simultaneously. Experimental results show that our method outperforms current watermarking techniques in enhancing the detectability of LLM-generated text while maintaining its semantic coherence. Our code is available at https://github.com/mignonjia/TS_watermark.
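The abstract describes the mechanism only at a high level. The following minimal Python sketch is written under stated assumptions and is not taken from the authors' repository. It shows a green/red-list watermark in the style of Kirchenbauer et al. ("A watermark for large language models", in the references) where the splitting ratio γ and the watermark logit δ vary per token. `gamma_fn` and `delta_fn` are hypothetical fixed functions standing in for the paper's lightweight networks. They depend only on the previous token id, an assumption that lets a detector recompute them from the text alone. The per-token z-score used for detection is likewise an assumed generalization, and the logits come from a toy random "model".

```python
import hashlib
import math
import random


def gamma_fn(prev_token: int) -> float:
    """HYPOTHETICAL stand-in for the paper's gamma network: maps the previous
    token id to a splitting ratio between 0.25 and 0.5."""
    return 0.25 + 0.25 * ((prev_token * 37) % 11) / 10


def delta_fn(prev_token: int) -> float:
    """HYPOTHETICAL stand-in for the paper's delta network: a logit bias
    between 2.0 and 4.0 that depends on the previous token id."""
    return 2.0 + ((prev_token * 13) % 7) / 3


def green_list(prev_token: int, vocab_size: int, gamma: float, key: int) -> set[int]:
    """Pseudo-randomly choose about gamma * vocab_size 'green' token ids,
    seeded by the secret key and the previous token (KGW-style)."""
    digest = hashlib.sha256(f"{key}:{prev_token}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    size = max(1, min(vocab_size - 1, round(gamma * vocab_size)))
    return set(rng.sample(range(vocab_size), size))


def generate(n_tokens: int, vocab_size: int, key: int, seed: int = 0) -> list[int]:
    """Sample tokens from toy random logits after adding delta to green tokens."""
    rng = random.Random(seed)
    tokens = [0]  # arbitrary start token
    for _ in range(n_tokens):
        prev = tokens[-1]
        logits = [rng.gauss(0.0, 1.0) for _ in range(vocab_size)]  # toy LM output
        greens = green_list(prev, vocab_size, gamma_fn(prev), key)
        delta = delta_fn(prev)
        biased = [x + delta if i in greens else x for i, x in enumerate(logits)]
        top = max(biased)
        weights = [math.exp(x - top) for x in biased]  # softmax up to a constant
        tokens.append(rng.choices(range(vocab_size), weights=weights)[0])
    return tokens


def z_score(tokens: list[int], vocab_size: int, key: int) -> float:
    """z = (green hits - sum of gamma_t) / sqrt(sum of gamma_t * (1 - gamma_t))."""
    hits, mean, var = 0, 0.0, 0.0
    for prev, tok in zip(tokens, tokens[1:]):
        greens = green_list(prev, vocab_size, gamma_fn(prev), key)
        g = len(greens) / vocab_size  # realized ratio after rounding
        hits += tok in greens
        mean += g
        var += g * (1.0 - g)
    return (hits - mean) / math.sqrt(var)


if __name__ == "__main__":
    V, KEY = 50, 1234
    marked = generate(200, V, KEY)
    noise = random.Random(1)
    unmarked = [noise.randrange(V) for _ in range(201)]
    print(f"watermarked z = {z_score(marked, V, KEY):.1f}")
    print(f"unwatermarked z = {z_score(unmarked, V, KEY):.1f}")
```

With a per-token γ, the expected number of green tokens in unwatermarked text is the sum of the γ values rather than a single γ times the length. That is why `z_score` accumulates the mean and variance position by position.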

External datasets

C4

Essays

HC3

References

‘Reform’ AI alignment with Scott Aaronson

Published: 2023

2021 IEEE Symposium on Security and Privacy (SP)

Adversarial watermarking transformer: Towards tracing text provenance with data hiding

Sahar Abdelnabi, Mario Fritz

Published: 2021

Generative ai and the future of elections

R. M. Alvarez, F. Eberhardt, M. Linegar

Published: 2023

Factuality challenges in the era of large language models

I. Augenstein, T. Baldwin, M. Cha, T. Chakraborty, G. L. Ciampaglia, D. Corney, R. DiResta, E. Ferrara, S. Hale, A. Halevy, et al.

Published: 2023

Computational Linguistics

An estimate of an upper bound for the entropy of English

P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, J. C. Lai, R. L. Mercer

Published: 1992

OpenAI Technical Report

Language models are few-shot learners

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei

Published: 2020

Composition-contrastive learning for sentence embeddings

S. J. Chanchani, R. Huang

Published: 2023

X-mark: Towards lossless watermarking through lexical redundancy

L. Chen, Y. Bian, Y. Deng, S. Li, B. Wu, P. Zhao, K.-F. Wong

Published: 2023

International conference on machine learning

GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks

Z. Chen, V. Badrinarayanan, C.-Y. Lee, A. Rabinovich

Published: 2018

Proceedings of the 2021 5th International Conference on Cloud and Big Data Computing

Lyapunov central limit theorem: Theoretical properties and applications in big-data-populated smart city settings

A. Cuzzocrea, E. Fadda, A. Baldo

Published: 2021

Decision sciences

Multi-objective optimization

K. Deb, K. Sindhya, J. Hakanen

Published: 2016

Comptes Rendus Mathématique

Multiple-gradient descent algorithm (MGDA) for multiobjective optimization

J.-A. Désidéri

Published: 2012

2023 IEEE International Workshop on Information Forensics and Security (WIFS)

Three bricks to consolidate watermarks for large language models

P. Fernandez, A. Chaffin, K. Tit, V. Chappelier, T. Furon

Published: 2023

PeerJ Computer Science

Feature-based detection of automated language models: tackling GPT-2, GPT-3 and Grover

L. Frohling, A. Zubiaga

Published: 2021

Proceedings of the 14th ACM Web Science Conference 2022

On pushing deepfake tweet detection capabilities to the limits

M. Gambini, T. Fagni, F. Falchi, M. Tesconi

Published: 2022

Empirical Methods in Natural Language Processing (EMNLP)

SimCSE: Simple contrastive learning of sentence embeddings

Tianyu Gao, Xingcheng Yao, Danqi Chen

Published: 2021

GLTR: Statistical detection and visualization of generated text

S. Gehrmann, H. Strobelt, A. M. Rush

Published: 2019

Evolutionary computation

Pareto front estimation for decision making

I. Giagkiozis, P. J. Fleming

Published: 2014

Analytical biochemistry

The five-parameter logistic: a characterization and comparison with the four-parameter logistic

P. G. Gottschalk, J. R. Dunn

Published: 2005

How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection

B. Guo, X. Zhang, Z. Wang, M. Jiang, J. Nie, Y. Ding, J. Yue, Y. Wu

Published: 2023

2006 IEEE computer society conference on computer vision and pattern recognition (CVPR’06)

Dimensionality reduction by learning an invariant mapping

R. Hadsell, S. Chopra, Y. LeCun

Published: 2006

Proceedings of the IEEE international conference on computer vision

Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification

K. He, X. Zhang, S. Ren, J. Sun

Published: 2015

Proceedings of the AAAI Conference on Artificial Intelligence

Protecting intellectual property of language generation apis with lexical watermark

X. He, Q. Xu, L. Lyu, F. Wu, C. Wang

Published: 2022

Advances in Neural Information Processing Systems

CATER: Intellectual property protection on text generation APIs via conditional watermarks

X. He, Q. Xu, Y. Zeng, L. Lyu, F. Wu, J. Li, R. Jia

Published: 2022

Annual Meeting of the Association for Computational Linguistics (ACL)

Automatic Detection of Generated Text is Easiest when Humans are Fooled

Daphne Ippolito, Daniel Duckworth, Chris Callison-Burch, Douglas Eck

Published: 2019.11.2

Recent advancements in neural language modelling make it possible to rapidly generate vast amounts of human-sounding text. The capabilities of humans and automatic discriminators to detect machine-generated text have been a large source of research interest, but humans and machines rely on different cues to make their decisions. Here, we perform careful benchmarking and analysis of three popular sampling-based decoding strategies (top-$k$, nucleus sampling, and untruncated random sampling) and show that improvements in decoding methods have primarily optimized for fooling humans. This comes at the expense of introducing statistical abnormalities that make detection easy for automatic systems. We also show that though both human and automatic detector performance improve with longer excerpt length, even multi-sentence excerpts can fool expert human raters over 30% of the time. Our findings reveal the importance of using both human and automatic detectors to assess the humanness of text generation systems.

Identification of AI-generated output, Deep learning methods, Text perturbation methods
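The three decoding strategies compared by Ippolito et al. differ only in how the next-token distribution is truncated before sampling. A minimal plain-Python sketch of that truncation step over a probability vector follows; the helper names are illustrative and not taken from the paper.

```python
import random


def top_k_filter(probs: list[float], k: int) -> dict[int, float]:
    """Keep the k most probable token ids and renormalize their probabilities."""
    keep = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def nucleus_filter(probs: list[float], p: float) -> dict[int, float]:
    """Keep the smallest set of most probable tokens whose cumulative mass reaches p."""
    keep, mass = [], 0.0
    for i in sorted(range(len(probs)), key=lambda j: probs[j], reverse=True):
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def untruncated(probs: list[float]) -> dict[int, float]:
    """Untruncated random sampling keeps the full distribution."""
    return dict(enumerate(probs))


def sample(dist: dict[int, float], rng: random.Random) -> int:
    """Draw one token id from a (possibly truncated) distribution."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids])[0]


if __name__ == "__main__":
    probs = [0.5, 0.2, 0.15, 0.1, 0.05]
    rng = random.Random(0)
    print(top_k_filter(probs, 2))
    print(nucleus_filter(probs, 0.8))
    print(sample(untruncated(probs), rng))
```

Nucleus sampling adapts the number of candidates to the shape of the distribution, whereas top-k always keeps exactly k; untruncated sampling keeps the low-probability tail that the other two discard.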

Categorical reparameterization with Gumbel-softmax

Eric Jang, Shixiang Gu, Ben Poole

Published: 2016

Adam: A method for stochastic optimization

Diederik P. Kingma, Jimmy Ba

Published: 2014

Proceedings of the 40th International Conference on Machine Learning

A watermark for large language models

J. Kirchenbauer, J. Geiping, Y. Wen, J. Katz, I. Miers, T. Goldstein

Published: 2023

On the reliability of watermarks for large language models

John Kirchenbauer, Jonas Geiping, Yuxin Wen, Manli Shu, Khalid Saifullah, Kezhi Kong, Kasun Fernando, Aniruddha Saha, Micah Goldblum, Tom Goldstein

Published: 2023

Conference on Neural Information Processing Systems (NeurIPS)

Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense

Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, Mohit Iyyer

Published: 2023.3.24

The rise in malicious usage of large language models, such as fake content creation and academic plagiarism, has motivated the development of approaches that identify AI-generated text, including those based on watermarking or outlier detection. However, the robustness of these detection algorithms to paraphrases of AI-generated text remains unclear. To stress test these detectors, we build an 11B-parameter paraphrase generation model (DIPPER) that can paraphrase paragraphs, condition on surrounding context, and control lexical diversity and content reordering. Using DIPPER to paraphrase text generated by three large language models (including GPT3.5-davinci-003) successfully evades several detectors, including watermarking, GPTZero, DetectGPT, and OpenAI's text classifier. For example, DIPPER drops detection accuracy of DetectGPT from 70.3% to 4.6% (at a constant false positive rate of 1%), without appreciably modifying the input semantics. To increase the robustness of AI-generated text detection to paraphrase attacks, we introduce a simple defense that relies on retrieving semantically-similar generations and must be maintained by a language model API provider. Given a candidate text, our algorithm searches a database of sequences previously generated by the API, looking for sequences that match the candidate text within a certain threshold. We empirically verify our defense using a database of 15M generations from a fine-tuned T5-XXL model and find that it can detect 80% to 97% of paraphrased generations across different settings while only classifying 1% of human-written sequences as AI-generated. We open-source our models, code and data.

Prompt injection, DNN IP protection methods, Machine learning techniques
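The retrieval defense described by Krishna et al. amounts to a similarity search over the provider's own past generations. A minimal sketch follows, assuming a hypothetical bag-of-words `embed` function in place of a real semantic encoder. The 0.75 threshold is an arbitrary illustrative value, not the paper's setting, and a production system would use an approximate nearest-neighbour index instead of a linear scan.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """HYPOTHETICAL stand-in for a semantic encoder: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class GenerationStore:
    """API-side database of previously generated sequences."""

    def __init__(self, threshold: float = 0.75) -> None:
        self.threshold = threshold  # illustrative value, not from the paper
        self.vectors: list[Counter] = []

    def add(self, generation: str) -> None:
        """Record every generation the API returns."""
        self.vectors.append(embed(generation))

    def is_ai_generated(self, candidate: str) -> bool:
        """Flag the candidate if any stored generation is similar enough."""
        query = embed(candidate)
        best = max((cosine(query, v) for v in self.vectors), default=0.0)
        return best >= self.threshold


if __name__ == "__main__":
    store = GenerationStore()
    store.add("the quick brown fox jumps over the lazy dog")
    print(store.is_ai_generated("the lazy dog was jumped over by the quick brown fox"))
    print(store.is_ai_generated("a recipe for cooking pasta at home"))
```

Because the check compares against stored outputs rather than statistical traces in the text, a paraphrase that keeps the content close to an original generation can still be matched, which is the property the abstract relies on.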

Trans. Mach. Learn. Res.

Robust Distortion-free Watermarks for Language Models

Rohith Kuditipudi, John Thickstun, Tatsunori Hashimoto, Percy Liang

Published: 2023.7.28

We propose a methodology for planting watermarks in text from an autoregressive language model that are robust to perturbations without changing the distribution over text up to a certain maximum generation budget. We generate watermarked text by mapping a sequence of random numbers -- which we compute using a randomized watermark key -- to a sample from the language model. To detect watermarked text, any party who knows the key can align the text to the random number sequence. We instantiate our watermark methodology with two sampling schemes: inverse transform sampling and exponential minimum sampling. We apply these watermarks to three language models -- OPT-1.3B, LLaMA-7B and Alpaca-7B -- to experimentally validate their statistical power and robustness to various paraphrasing attacks. Notably, for both the OPT-1.3B and LLaMA-7B models, we find we can reliably detect watermarked text ($p \leq 0.01$) from $35$ tokens even after corrupting between $40$-$50\%$ of the tokens via random edits (i.e., substitutions, insertions or deletions). For the Alpaca-7B model, we conduct a case study on the feasibility of watermarking responses to typical user instructions. Due to the lower entropy of the responses, detection is more difficult: around $25\%$ of the responses -- whose median length is around $100$ tokens -- are detectable with $p \leq 0.01$, and the watermark is also less robust to certain automated paraphrasing attacks we implement.

Watermarking for generative AI, Statistical hypothesis testing, Text perturbation methods
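Exponential minimum sampling, one of the two schemes in the abstract above, can be sketched in a few lines. This is a simplified illustration under stated assumptions: the key is expanded into per-position uniforms r_{t,i} with Python's `random` module (an illustrative keying, not the authors'), the language model is a hypothetical toy distribution, and detection uses the plain sum of -log(1 - r_{t,x_t}) without the edit-distance alignment and permutation test that give the paper's method its robustness to edits.

```python
import math
import random


def key_uniforms(key: int, position: int, vocab_size: int) -> list[float]:
    """Pseudo-random uniforms r_{t,i} in (0, 1), one per vocabulary item,
    derived from the secret key and the position t (illustrative keying)."""
    rng = random.Random(key * 1_000_003 + position)
    return [rng.random() or 1e-12 for _ in range(vocab_size)]


def ems_sample(probs: list[float], key: int, position: int) -> int:
    """Exponential minimum sampling: pick argmin_i -log(r_i) / p_i.
    Over a uniformly random key this returns token i with probability p_i."""
    r = key_uniforms(key, position, len(probs))
    candidates = [i for i, p in enumerate(probs) if p > 0]
    return min(candidates, key=lambda i: -math.log(r[i]) / probs[i])


def detection_score(tokens: list[int], key: int, vocab_size: int) -> float:
    """Sum of -log(1 - r_{t, x_t}); each term is Exp(1)-distributed when the
    text is independent of the key, so such text scores about len(tokens)."""
    return sum(-math.log(1.0 - key_uniforms(key, t, vocab_size)[tok])
               for t, tok in enumerate(tokens))


def toy_distribution(rng: random.Random, vocab_size: int) -> list[float]:
    """HYPOTHETICAL stand-in for a language model's next-token distribution."""
    weights = [rng.random() for _ in range(vocab_size)]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    V, KEY, N = 20, 42, 100
    lm = random.Random(0)
    marked = [ems_sample(toy_distribution(lm, V), KEY, t) for t in range(N)]
    other = random.Random(1)
    unmarked = [other.randrange(V) for _ in range(N)]
    print(detection_score(marked, KEY, V), detection_score(unmarked, KEY, V))
```

Since -log(r_i) is Exp(1), dividing by p_i gives an exponential with rate p_i, and the minimum of independent exponentials lands on index i with probability p_i. That is why the scheme does not change the output distribution for a random key while still biasing the chosen r_{t,x_t} upward.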

Who wrote this code? watermarking for code generation

T. Lee, S. Hong, J. Ahn, I. Hong, H. Lee, S. Yun, J. Shin, G. Kim

Published: 2023

Origin tracing and detecting of LLMs

L. Li, P. Wang, K. Ren, T. Sun, X. Qiu

Published: 2023

GPT detectors are biased against non-native English writers

W. Liang, M. Yuksekgonul, Y. Mao, E. Wu, J. Zou

Published: 2023

International Conference on Learning Representations (ICLR)

A Semantic Invariant Robust Watermark for Large Language Models

Aiwei Liu, Leyi Pan, Xuming Hu, Shiao Meng, Lijie Wen

Published: 2023.10.10

Watermark algorithms for large language models (LLMs) have achieved extremely high accuracy in detecting text generated by LLMs. Such algorithms typically involve adding extra watermark logits to the LLM's logits at each generation step. However, prior algorithms face a trade-off between attack robustness and security robustness. This is because the watermark logits for a token are determined by a certain number of preceding tokens; a small number leads to low security robustness, while a large number results in insufficient attack robustness. In this work, we propose a semantic invariant watermarking method for LLMs that provides both attack robustness and security robustness. The watermark logits in our work are determined by the semantics of all preceding tokens. Specifically, we utilize another embedding LLM to generate semantic embeddings for all preceding tokens, and then these semantic embeddings are transformed into the watermark logits through our trained watermark model. Subsequent analyses and experiments demonstrated the attack robustness of our method in semantically invariant settings: synonym substitution and text paraphrasing. Finally, we also show that our watermark possesses adequate security robustness. Our code and data are available at https://github.com/THU-BPM/Robust_Watermark. Additionally, our algorithm can also be accessed through MarkLLM (Pan et al., 2024; https://github.com/THU-BPM/MarkLLM).

Prompt injection, Watermarking, Performance evaluation

RoBERTa: A robustly optimized BERT pretraining approach

Published: 2019

ICML Workshop on Deep Learning for Audio, Speech and Language Processing

Rectifier nonlinearities improve neural network acoustic models

Andrew L. Maas, Awni Y. Hannun, Andrew Y. Ng

Published: 2013

Multi-gradient descent for multi-objective recommender systems

N. Milojkovic, D. Antognini, G. Bergamin, B. Faltings, C. Musat

Published: 2019

DeepTextMark: A Deep Learning-Driven Text Watermarking Approach for Identifying Large Language Model Generated Text

Travis Munyer, Abdullah Tanvir, Arjon Das, Xin Zhong

Published: 2023

New AI classifier for indicating AI-written text

Published: 2023

Conference on Secure and Trustworthy Machine Learning (SaTML)

Mark My Words: Analyzing and Evaluating Language Model Watermarks

Julien Piet, Chawin Sitawarin, Vivian Fang, Norman Mu, David Wagner

Published: 2023.12.1

The capabilities of large language models have grown significantly in recent years and so too have concerns about their misuse. It is important to be able to distinguish machine-generated text from human-authored content. Prior works have proposed numerous schemes to watermark text, which would benefit from a systematic evaluation framework. This work focuses on LLM output watermarking techniques - as opposed to image or model watermarks - and proposes Mark My Words, a comprehensive benchmark for them under different natural language tasks. We focus on three main metrics: quality, size (i.e., the number of tokens needed to detect a watermark), and tamper resistance (i.e., the ability to detect a watermark after perturbing marked text). Current watermarking techniques are nearly practical enough for real-world use: Kirchenbauer et al. [33]'s scheme can watermark models like Llama 2 7B-chat or Mistral-7B-Instruct with no perceivable loss in quality on natural language tasks, the watermark can be detected with fewer than 100 tokens, and their scheme offers good tamper resistance to simple perturbations. However, they struggle to efficiently watermark code generations. We publicly release our benchmark (https://github.com/wagner-group/MarkMyWords).

Prompt injection, Watermark robustness, Watermark evaluation

Journal of machine learning research

Exploring the limits of transfer learning with a unified text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J Liu

Published: 2020

A Robust Semantics-based Watermark for Large Language Model against Paraphrasing

Jie Ren, Han Xu, Yiding Liu, Yingqian Cui, Shuaiqiang Wang, Dawei Yin, Jiliang Tang

Published: 2023.11.15

Large language models (LLMs) have shown great ability in various natural language tasks. However, there are concerns that LLMs may be used improperly or even illegally. To prevent the malicious usage of LLMs, detecting LLM-generated text becomes crucial in the deployment of LLM applications. Watermarking is an effective strategy to detect LLM-generated content by encoding a pre-defined secret watermark to facilitate the detection process. However, most existing watermark methods use simple hashes of preceding tokens to partition the vocabulary. Such watermarks can easily be removed by paraphrasing, which greatly compromises detection effectiveness. Thus, to enhance robustness against paraphrasing, we propose a semantics-based watermark framework, SemaMark. It uses semantics as an alternative to simple hashes of tokens, since paraphrasing is likely to preserve the semantic meaning of sentences. Comprehensive experiments demonstrate the effectiveness and robustness of SemaMark under different paraphrases.

Robustness evaluation, Prompt injection, Information hiding methods

Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Cross-domain detection of gpt-2-generated technical text

J. Rodriguez, T. Hay, D. Gros, Z. Shamsi, R. Srinivasan

Published: 2022

essays-with-instructions

Published: 2023

OpenAI blog

ChatGPT: Optimizing language models for dialogue

J. Schulman, B. Zoph, C. Kim, J. Hilton, J. Menick, J. Weng, J. F. C. Uribe, L. Fedus, L. Metz, M. Pokorny, et al.

Published: 2022

Advances in Neural Information Processing Systems 31

Multi-task learning as multi-objective optimization

O. Sener, V. Koltun

Published: 2018

Release Strategies and the Social Impacts of Language Models

Irene Solaiman, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, Jasmine Wang

Published: 2019

ChatGPT cheating scandal shocks Florida high school

Published: 2023

Llama 2: Open foundation and fine-tuned chat models

H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel

Published: 2023

Advances in neural information processing systems

Attention is all you need

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin

Published: 2017

Towards codable text watermarking for large language models

L. Wang, W. Yang, D. Chen, H. Zhou, Y. Lin, F. Meng, J. Zhou, X. Sun

Published: 2023

HuggingFace’s Transformers: State-of-the-art natural language processing

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz

Published: 2019

Attacking neural text detectors

M. Wolff, S. Wolff

Published: 2020

International Conference on Machine Learning (ICML)

Optimizing watermarks for large language models

Published: 2023.12.29

With the rise of large language models (LLMs) and concerns about potential misuse, watermarks for generative LLMs have recently attracted much attention. An important aspect of such watermarks is the trade-off between their identifiability and their impact on the quality of the generated text. This paper introduces a systematic approach to this trade-off in terms of a multi-objective optimization problem. For a large class of robust, efficient watermarks, the associated Pareto optimal solutions are identified and shown to outperform the currently default watermark.

Watermark evaluation, Watermark robustness, Optimization methods
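The abstract above frames the identifiability/quality trade-off as a multi-objective optimization problem. For illustration only, the following sketch shows the two-objective min-norm step of the multiple-gradient descent algorithm (MGDA; Désidéri 2012 and Sener & Koltun 2018, both listed among these references). It is not presented as this cited paper's method, and whether the main paper combines its detection and semantic losses exactly this way is an assumption, not something the abstract states. Gradients are plain Python lists standing in for flattened parameter gradients of the two losses.

```python
def mgda_weight(g1: list[float], g2: list[float]) -> float:
    """Closed-form minimizer over alpha in [0, 1] of ||alpha*g1 + (1-alpha)*g2||^2."""
    diff = [a - b for a, b in zip(g1, g2)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        return 0.5  # identical gradients: every alpha gives the same vector
    alpha = sum((b - a) * b for a, b in zip(g1, g2)) / denom
    return min(1.0, max(0.0, alpha))


def common_direction(g1: list[float], g2: list[float]) -> list[float]:
    """Min-norm convex combination of the two gradients. Stepping along its
    negative does not increase either loss to first order."""
    alpha = mgda_weight(g1, g2)
    return [alpha * a + (1 - alpha) * b for a, b in zip(g1, g2)]


if __name__ == "__main__":
    g_detect = [1.0, 2.0]     # hypothetical gradient of a detection loss
    g_semantic = [-1.0, 1.0]  # hypothetical gradient of a semantic loss
    d = common_direction(g_detect, g_semantic)
    print(mgda_weight(g_detect, g_semantic), d)
```

The weight comes from setting the derivative of ||g2 + α(g1 - g2)||² to zero and clipping to [0, 1]. At an interior solution the resulting direction has the same inner product with both gradients, so neither objective is sacrificed to first order.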

Findings of the Association for Computational Linguistics: EMNLP 2023

LLMDet: A third party large language models generated text detection tool

Kangxi Wu, Liang Pang, Huawei Shen, Xueqi Cheng, Tat-Seng Chua

Published: 2023

Large language models can be used to estimate the ideologies of politicians in a zero-shot learning setting

P. Y. Wu, J. A. Tucker, J. Nagler, S. Messing

Published: 2023

Watermarking text generated by black-box language models

Xi Yang, Kejiang Chen, Weiming Zhang, Chang Liu, Yuang Qi, Jie Zhang, Han Fang, Nenghai Yu

Published: 2023

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Robust multi-bit natural language watermarking through invariant features

K. Yoo, W. Ahn, J. Jang, N. Kwak

Published: 2023

GPT paternity test: GPT generated text detection with GPT genetic inheritance

X. Yu, Y. Qi, K. Chen, G. Chen, X. Yang, P. Zhu, W. Zhang, N. Yu

Published: 2023

USENIX Security Symposium

REMARK-LLM: A Robust and Efficient Watermarking Framework for Generative Large Language Models

Ruisi Zhang, Shehzeen Samarah Hussain, Paarth Neekhara, Farinaz Koushanfar

Published: 2023.10.19

We present REMARK-LLM, a novel, efficient, and robust watermarking framework designed for texts generated by large language models (LLMs). Synthesizing human-like content using LLMs necessitates vast computational resources and extensive datasets, encapsulating critical intellectual property (IP). However, the generated content is prone to malicious exploitation, including spamming and plagiarism. To address these challenges, REMARK-LLM proposes three new components: (i) a learning-based message encoding module to infuse binary signatures into LLM-generated texts; (ii) a reparameterization module to transform the dense distributions from the message encoding to the sparse distribution of the watermarked textual tokens; and (iii) a decoding module dedicated to signature extraction. Furthermore, we introduce an optimized beam search algorithm to guarantee the coherence and consistency of the generated content. REMARK-LLM is rigorously trained to encourage the preservation of semantic integrity in watermarked content, while ensuring effective watermark retrieval. Extensive evaluations on multiple unseen datasets highlight REMARK-LLM's proficiency and transferability in inserting 2 times more signature bits into the same texts when compared to prior art, all while maintaining semantic integrity. Furthermore, REMARK-LLM exhibits better resilience against a spectrum of watermark detection and removal attacks.

Model design, Data generation, Malicious content generation

OPT: Open pre-trained transformer language models

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al.

Published: 2022

Neural deepfake detection with factual structure of text

W. Zhong, D. Tang, Z. Xu, R. Wang, N. Duan, M. Zhou, J. Wang, J. Yin

Published: 2020