AI Security Portal K Program
Online Shift Detection and Conformal Adaptation for Deployed Safety Classifiers
Abstract
We present an online monitoring system for distributional shift in deployed safety classifiers. It uses calibrated sequential statistics to detect when a classifier's inputs have moved out of distribution. Upon detection, a conformal abstention layer adapts decision thresholds to recover a target error rate of ε = 0.1. In a pre-registered factorial evaluation (4 classifiers × 5 shift conditions × 20 seeds × 2 window sizes = 800 cells), the system achieves 86.6% valid detection (693/800, 95% CI [84.1%, 88.8%]) with a mean latency of 39.5 steps. Detection holds across three ground-truth regimes: synthetic onset (86.6%), real temporal jailbreaks (85%, 17/20), and GCG adversarial attacks. Weighted conformal prediction recovers up to 39 pp of lost coverage for DeBERTa (ESS = 46/300) but collapses for all other classifiers (ESS ≈ 300): in high-dimensional embedding spaces, logistic density ratio estimation separates source from target perfectly, so every importance weight is clipped to the floor and the weights become effectively uniform. Within DeBERTa, the outcome ranges from effective correction (paraphrase, ESS = 46) to near-total collapse (adversarial suffix, ESS = 206). Projecting the embeddings to 32 dimensions with PCA breaks the collapse, recovering 33 pp for Llama Guard and 21 pp for ShieldGemma. Variance decomposition shows that classifier (η² = 0.243), shift type (η² = 0.237), and their interaction (η² = 0.185) all contribute substantially to detection latency variance (all p < 0.001), indicating that per-classifier monitoring profiles are necessary.
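The link between clipped importance weights and an effective sample size near the calibration-set size can be illustrated with a minimal sketch. It assumes ESS means Kish's effective sample size, (Σw)² / Σw², which the abstract does not state. The clipping bounds and both weight vectors are hypothetical values chosen for illustration, not data from the paper.

```python
def effective_sample_size(weights):
    """Kish effective sample size: (sum of w)^2 / (sum of w^2)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    if total_sq == 0:
        raise ValueError("weights must not all be zero")
    return total * total / total_sq


def clip(weights, floor, ceiling):
    """Clamp each weight into [floor, ceiling]."""
    return [min(max(w, floor), ceiling) for w in weights]


n = 300  # calibration set size, matching the "/300" in the abstract

# Hypothetical case 1: the density-ratio model separates source from target
# perfectly, so every calibration point gets a near-zero raw weight and
# clipping maps all of them to the same floor value.
collapsed = clip([1e-9] * n, floor=0.01, ceiling=100.0)
ess_collapsed = effective_sample_size(collapsed)

# Hypothetical case 2: the weights are informative and concentrate on a
# minority of calibration points that resemble the target distribution.
informative = clip([5.0] * 40 + [0.05] * 260, floor=0.01, ceiling=100.0)
ess_informative = effective_sample_size(informative)

print(f"collapsed ESS:   {ess_collapsed:.1f} of {n}")
print(f"informative ESS: {ess_informative:.1f} of {n}")
```

Under this definition, ESS equals n exactly when all weights are identical. In that case the weighted conformal quantile coincides with the unweighted one, so the correction has no effect. A low ESS, by contrast, indicates that the weights are actually reshaping the calibration distribution.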