Abstract
Over the past year, there has been a notable rise in the use of large
language models (LLMs) for academic research and industrial practices within
the cybersecurity field. However, there remains a lack of comprehensive,
publicly accessible benchmarks for evaluating the performance of LLMs on
cybersecurity tasks. To address this gap, we introduce CS-Eval, a publicly
accessible, comprehensive and bilingual LLM benchmark specifically designed for
cybersecurity. CS-Eval synthesizes research hotspots from academia and
practical applications from industry, curating a diverse set of high-quality
questions across 42 categories within cybersecurity, systematically organized
into three cognitive levels: knowledge, ability, and application. Through an
extensive evaluation of a wide range of LLMs using CS-Eval, we have uncovered
valuable insights. For instance, while GPT-4 excels overall, other
models may outperform it in specific subcategories. Additionally, by
conducting evaluations over several months, we observed significant
improvements in many LLMs' abilities to solve cybersecurity tasks. The
benchmark is now publicly available at https://github.com/CS-EVAL/CS-Eval.