Abstract
Evaluating Large Language Models (LLMs) is crucial for understanding their
capabilities and limitations across various applications, including natural
language processing and code generation. Existing benchmarks like MMLU, C-Eval,
and HumanEval assess general LLM performance but lack focus on specific expert
domains such as cybersecurity. Previous attempts to create cybersecurity
datasets have faced limitations, including insufficient data volume and a
reliance on multiple-choice questions (MCQs). To address these gaps, we propose
SecBench, a multi-dimensional benchmarking dataset designed to evaluate LLMs in
the cybersecurity domain. SecBench comprises questions in two formats (MCQs
and short-answer questions (SAQs)), at two capability levels (Knowledge
Retention and Logical Reasoning), in two languages (Chinese and English), and
across multiple cybersecurity sub-domains. The dataset was constructed by collecting
high-quality data from open sources and organizing a Cybersecurity Question
Design Contest, resulting in 44,823 MCQs and 3,087 SAQs. Notably, we used
powerful yet cost-effective LLMs to (1) label the data and (2) construct a
grading agent for the automatic evaluation of SAQs. Benchmarking
results on 16 state-of-the-art (SOTA) LLMs demonstrate the usability of SecBench, which is
arguably the largest and most comprehensive benchmark dataset for LLMs in
cybersecurity. More information about SecBench can be found at our website, and
the dataset can be accessed via the artifact link.