Abstract
Over the past year, there has been a notable rise in the use of large
language models (LLMs) for academic research and industrial practices within
the cybersecurity field. However, there remains a lack of comprehensive,
publicly accessible benchmarks for evaluating the performance of LLMs on
cybersecurity tasks. To address this gap, we introduce CS-Eval, a publicly
accessible, comprehensive and bilingual LLM benchmark specifically designed for
cybersecurity. CS-Eval synthesizes research hotspots from academia and
practical applications from industry, curating a diverse set of high-quality
questions across 42 categories within cybersecurity, systematically organized
into three cognitive levels: knowledge, ability, and application. Through an
extensive evaluation of a wide range of LLMs using CS-Eval, we have uncovered
valuable insights. For instance, while GPT-4 excels overall, other
models may outperform it in specific subcategories. Additionally, by
conducting evaluations over several months, we observed significant
improvements in many LLMs' abilities to solve cybersecurity tasks. The
benchmark is now publicly available at https://github.com/CS-EVAL/CS-Eval.