Abstract
Current LLMs are generally aligned to follow safety requirements and tend to refuse toxic prompts. However, LLMs can fail to refuse toxic prompts, or can be overcautious and refuse benign ones. In addition, state-of-the-art toxicity detectors achieve low true positive rates (TPR) at low false positive rates (FPR), incurring high costs in real-world applications where toxic examples are rare. In this paper, we introduce Moderation Using LLM Introspection (MULI), which detects toxic prompts using information extracted directly from the LLM itself. We find that benign and toxic prompts can be distinguished from the distribution of the first response token's logits. Building on this observation, we construct a robust toxic-prompt detector by training a sparse logistic regression model on the first response token's logits. Our scheme outperforms state-of-the-art (SOTA) detectors across multiple metrics.
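The abstract describes the core pipeline only at a high level; the following is a minimal sketch of that idea, not the paper's exact implementation. It assumes a HuggingFace causal chat LLM (the model name, prompt formatting, and L1 regularization strength are illustrative assumptions), extracts the logits that the model assigns to its first response token, and fits a sparse (L1-penalized) logistic regression over those logits.

```python
# Sketch of the MULI idea: classify prompts from first-response-token logits.
# Model choice, prompt formatting, and hyperparameters are assumptions for illustration.
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumed; any aligned chat LLM
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
lm = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
lm.eval()

def first_token_logits(prompt: str) -> np.ndarray:
    """Return the logit vector the LLM assigns to its first response token."""
    # The paper's exact prompt template is not given here; a chat template
    # would typically be applied before tokenization.
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = lm(**inputs)
    # Logits at the last input position predict the first generated token.
    return out.logits[0, -1, :].float().cpu().numpy()

def fit_detector(prompts: list[str], labels: list[int]) -> LogisticRegression:
    """Fit a sparse logistic regression: labels are 1 = toxic, 0 = benign."""
    X = np.stack([first_token_logits(p) for p in prompts])
    # L1 penalty induces sparsity over the vocabulary-sized logit vector;
    # C = 0.1 is an assumed regularization strength, not the paper's value.
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    clf.fit(X, labels)
    return clf
```

In this sketch the detector adds only one extra forward pass per prompt on top of the LLM itself, which is consistent with the abstract's framing of using information the model already computes rather than a separate moderation model.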