SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models


arXiv

AI Security Portal bot

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2402.05044

PDF

https://arxiv.org/pdf/2402.05044

Bibliographic information

Authors: Lijun Li; Bowen Dong; Ruohui Wang; Xuhao Hu; Wangmeng Zuo; Dahua Lin; Yu Qiao; Jing Shao
Published: 2024-2-8
Updated: 2024-6-7
Affiliation: Shanghai Artificial Intelligence Laboratory
Country of affiliation: China
Conference: Annual Meeting of the Association for Computational Linguistics (ACL)

Labels estimated by AI

LLM security, prompt injection, LLM performance evaluation

Note: These labels were added automatically by AI and may therefore be inaccurate.
For details, see "About the Literature Database".

Abstract

In the rapidly evolving landscape of Large Language Models (LLMs), ensuring robust safety measures is paramount. To meet this crucial need, we propose SALAD-Bench, a safety benchmark specifically designed for evaluating LLMs, attack methods, and defense methods. Distinguished by its breadth, SALAD-Bench transcends conventional benchmarks through its large scale, rich diversity, intricate taxonomy spanning three levels, and versatile functionalities. SALAD-Bench is crafted with a meticulous array of questions, from standard queries to complex ones enriched with attack modifications, defense modifications, and multiple-choice questions. To effectively manage the inherent complexity, we introduce an innovative evaluator: the LLM-based MD-Judge for QA pairs, with a particular focus on attack-enhanced queries, ensuring a seamless and reliable evaluation. The above components extend SALAD-Bench from standard LLM safety evaluation to the evaluation of both LLM attack and defense methods, ensuring its joint-purpose utility. Our extensive experiments shed light on the resilience of LLMs against emerging threats and the efficacy of contemporary defense tactics. The data and evaluator are released at https://github.com/OpenSafetyLab/SALAD-BENCH.

External datasets

SALAD-Bench

ToxicChat

Beavertails

SafeRLHF