Abstract
Large language models (LLMs) pose significant risks because they can generate
harmful content and because users may attempt to evade their guardrails. Existing
studies have developed LLM-based guard models that moderate the inputs and
outputs of threat LLMs, enforcing safety policies by blocking violating content
at deployment time. However, limited attention has been paid to the reliability
and calibration of such guard models. In this work, we conduct a comprehensive
empirical study of confidence calibration for 9 existing LLM-based guard models
across 12 benchmarks, covering both user-input and model-output classification.
Our findings reveal that
current LLM-based guard models tend to 1) produce overconfident predictions, 2)
exhibit significant miscalibration when subjected to jailbreak attacks, and 3)
demonstrate limited robustness to the outputs generated by different types of
response models. Additionally, we assess the effectiveness of post-hoc
calibration methods in mitigating this miscalibration. We demonstrate the
efficacy of temperature scaling and, for the first time, highlight the benefits
of contextual calibration for calibrating guard models, particularly when no
validation set is available. Our analysis and experiments underscore the
limitations of current LLM-based guard models and provide valuable insights for
the future development of well-calibrated guard models toward more reliable
content moderation. We also advocate incorporating an evaluation of confidence
calibration into the reliability assessment when releasing future LLM-based
guard models.
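
As a point of reference for the post-hoc methods mentioned above, the following is a minimal sketch, assuming NumPy and illustrative function names (expected_calibration_error, temperature_scale, and contextual_calibrate are not from the paper), of how expected calibration error, temperature scaling, and contextual calibration are typically computed from a guard model's class probabilities or logits.

```python
import numpy as np

def expected_calibration_error(confidences, correctness, n_bins=10):
    """ECE: bin predictions by confidence and average the |confidence - accuracy|
    gap, weighted by the fraction of samples falling in each bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correctness[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

def temperature_scale(logits, temperature):
    """Temperature scaling: divide logits by a scalar T (fit on a validation set)
    before the softmax; T > 1 softens overconfident predictions."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=-1, keepdims=True)

def contextual_calibrate(probs, content_free_probs):
    """Contextual calibration: rescale class probabilities by the model's bias on
    a content-free input (e.g. an empty or 'N/A' prompt), then renormalize;
    no labeled validation set is required."""
    reweighted = probs / content_free_probs
    return reweighted / reweighted.sum(axis=-1, keepdims=True)
```

In this sketch, overconfidence shows up as a large ECE with average confidence exceeding accuracy. Temperature scaling requires held-out labels to fit the temperature, whereas contextual calibration only needs the model's probabilities on a content-free input, which is why the latter is attractive when no validation set is available.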