Online Shift Detection and Conformal Adaptation for Deployed Safety Classifiers

ICLR

Leveraging unlabeled data to predict out-of-distribution performance

Saurabh Garg, Sivaraman Balakrishnan, Zachary C Lipton, Behnam Neyshabur, Hanie Sedghi

Published: 2022

Advances in Neural Information Processing Systems

Adaptive conformal inference under distribution shift

Gibbs, I., Candès, E.

Published: 2021

The Journal of Machine Learning Research

A kernel two-sample test

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., Smola, A.

Published: 2012

ICLR

Tracking the risk of a deployed model and detecting harmful distribution shifts

Aleksandr Podkopaev, Aaditya Ramdas

Published: 2022

International Conference on Machine Learning (ICML)

WATCH: Adaptive monitoring for AI deployments via weighted-conformal martingales

Drew Prinster, Xing Han, Anqi Liu, Suchi Saria

Published: 2025

Neurocomputing

Reactive soft prototype computing for concept drift streams

Christoph Raab, Moritz Heusinger, Frank-Michael Schleif

Published: 2020

NeurIPS

Failing loudly: An empirical study of methods for detecting dataset shift

Stephan Rabanser, Stephan Günnemann, Zachary C Lipton

Published: 2019

NeurIPS

Telescoping density-ratio estimation

Benjamin Rhodes, Kai Xu, Michael U Gutmann

Published: 2020

Advances in Neural Information Processing Systems

Classification with valid and adaptive coverage

Romano, Y., Sesia, M., Candès, E.

Published: 2020

Brittlebench: Quantifying LLM robustness via prompt sensitivity

Angelika Romanou, Mark Ibrahim, Candace Ross, Chantal Shaib, Kerem Oktar, Samuel J Bell, Anaelia Ovalle, Jesse Dodge, Antoine Bosselut, Koustuv Sinha, Adina Williams

Published: 2026

I can’t believe it’s not robust: Catastrophic collapse of safety classifiers under embedding drift

Subramanyam Sahoo, Vinija Jain, Divya Chaudhary, Aman Chadha

Published: 2026

AISTATS

Low-dimensional density ratio estimation for covariate shift correction

Petar Stojanov, Mingming Gong, Jaime Carbonell, Kun Zhang

Published: 2019

Neural Networks

Direct density-ratio estimation with dimensionality reduction via least-squares hetero-distributional subspace search

Masashi Sugiyama, Makoto Yamada, Paul von Bünau, Taiji Suzuki, Takafumi Kanamori, Motoaki Kawanabe

Published: 2011

Advances in Neural Information Processing Systems, Curran Associates, Inc.

Conformal prediction under covariate shift

R. J. Tibshirani, R. Foygel Barber, E. Candès, A. Ramdas

Published: 2019

A collaborative content moderation framework for toxicity detection based on conformalized estimates of annotation disagreement

Guillermo Villate-Castillo, Javier Del Ser, Borja Sanz

Published: 2024

Springer

Algorithmic Learning in a Random World

V. Vovk, A. Gammerman, G. Shafer

Published: 2005

Annals of Mathematical Statistics

Sequential tests of statistical hypotheses

Abraham Wald

Published: 1945

Journal of the Royal Statistical Society Series B

Estimating means of bounded random variables by betting

Ian Waudby-Smith, Aaditya Ramdas

Published: 2024

NeurIPS

B-tests: Low variance kernel two-sample tests

Wojciech Zaremba, Arthur Gretton, Matthew Blaschko

Published: 2013

arXiv

Cited by: 5

Computing Research Repository (CoRR)

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson

Published: 2023-07-28

Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable, including to black-box, publicly released LLMs. Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting attack suffix is able to induce objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing objectionable information. Code is available at github.com/llm-attacks/llm-attacks.

Tags: LLM security, prompt injection, inappropriate content generation