Safety-aligned large language models (LLMs) sometimes falsely refuse
pseudo-harmful prompts, like "how to kill a mosquito," which are actually
harmless. Frequent false refusals not only frustrate users but also provoke a
public backlash against the very values alignment seeks to protect. In this
paper, we propose the first method to auto-generate diverse,
content-controlled, and model-dependent pseudo-harmful prompts. Using this
method, we construct an evaluation dataset called PHTest, which is ten times
larger than existing datasets, covers more false refusal patterns, and
separately labels controversial prompts. We evaluate 20 LLMs on PHTest,
uncovering new insights enabled by its scale and labeling. Our findings reveal a
trade-off between minimizing false refusals and improving safety against
jailbreak attacks. Moreover, we show that many jailbreak defenses significantly
increase false refusal rates, thereby undermining usability. Our method and
dataset can help developers evaluate and fine-tune safer and more usable LLMs.
Our code and dataset are available at
https://github.com/umd-huang-lab/FalseRefusal