Abstract
In the rapidly evolving landscape of Large Language Models (LLMs), ensuring
robust safety measures is paramount. To meet this crucial need, we propose
SALAD-Bench, a safety benchmark specifically designed for evaluating
LLMs, attack methods, and defense methods. Distinguished by its breadth, SALAD-Bench
transcends conventional benchmarks through its large scale, rich diversity,
intricate taxonomy spanning three levels, and versatile
functionalities. SALAD-Bench is crafted with a meticulous array of questions,
from standard queries to complex ones enriched with attack and defense
modifications, as well as multiple-choice questions. To effectively manage the
inherent complexity, we introduce an innovative evaluator: the LLM-based
MD-Judge for QA pairs, with a particular focus on attack-enhanced queries,
ensuring seamless and reliable evaluation. These components extend SALAD-Bench
from standard LLM safety evaluation to the evaluation of both LLM attack and
defense methods, ensuring its joint-purpose utility. Our extensive experiments shed
light on the resilience of LLMs against emerging threats and the efficacy of
contemporary defense tactics. The data and evaluator are released at
https://github.com/OpenSafetyLab/SALAD-BENCH.