Abstract
Large language models (LLMs) pose significant risks because they can generate
harmful content and because users may attempt to evade their guardrails. Existing
studies have developed LLM-based guard models that moderate the inputs and
outputs of threat LLMs, enforcing safety policies by blocking violating content
at deployment time. However, limited attention has been paid to the reliability
and calibration of such guard models. In this work, we conduct a comprehensive
empirical study of confidence calibration for 9 existing LLM-based guard models
across 12 benchmarks, covering both user-input and model-output classification.
Our findings reveal that
current LLM-based guard models tend to 1) produce overconfident predictions, 2)
exhibit significant miscalibration when subjected to jailbreak attacks, and 3)
demonstrate limited robustness to the outputs generated by different types of
response models. Additionally, we assess the effectiveness of post-hoc
calibration methods in mitigating this miscalibration. We demonstrate the
efficacy of temperature scaling and, for the first time, highlight the benefits
of contextual calibration for calibrating guard models, particularly when no
validation set is available. Our analysis and experiments underscore the
limitations of current LLM-based guard models and provide valuable insights for
the future development of well-calibrated guard models toward more reliable
content moderation. We also advocate incorporating an evaluation of confidence
calibration into the reliability assessment when releasing future LLM-based
guard models.
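
As a point of reference for the post-hoc methods mentioned above, the following is a minimal sketch, assuming NumPy and illustrative function names (expected_calibration_error, temperature_scale, and contextual_calibrate are not from the paper), of how expected calibration error, temperature scaling, and contextual calibration are typically computed from a guard model's class probabilities or logits.

```python
import numpy as np

def expected_calibration_error(confidences, correctness, n_bins=10):
    """ECE: bin predictions by confidence and average the |confidence - accuracy|
    gap, weighted by the fraction of samples falling in each bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correctness[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

def temperature_scale(logits, temperature):
    """Temperature scaling: divide logits by a scalar T (fit on a validation set)
    before the softmax; T > 1 softens overconfident predictions."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=-1, keepdims=True)

def contextual_calibrate(probs, content_free_probs):
    """Contextual calibration: rescale class probabilities by the model's bias on
    a content-free input (e.g. an empty or 'N/A' prompt), then renormalize;
    no labeled validation set is required."""
    reweighted = probs / content_free_probs
    return reweighted / reweighted.sum(axis=-1, keepdims=True)
```

In this sketch, overconfidence shows up as a large ECE with average confidence exceeding accuracy. Temperature scaling requires held-out labels to fit the temperature, whereas contextual calibration only needs the model's probabilities on a content-free input, which is why the latter is attractive when no validation set is available.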