Abstract
Large Language Models (LLMs) have shown promise in software vulnerability
detection, particularly on function-level benchmarks like Devign and BigVul.
However, real-world detection requires interprocedural analysis, as
vulnerabilities often emerge through multi-hop function calls rather than
within isolated functions. While repository-level benchmarks like ReposVul and
VulEval introduce interprocedural context, they remain computationally
expensive, lack pairwise evaluation of vulnerable and fixed versions of the
same function, and explore only a narrow range of context-retrieval
strategies, which limits their practicality.
We introduce JitVul, a just-in-time (JIT) vulnerability detection benchmark
that links each function to its vulnerability-introducing and
vulnerability-fixing commits. Built from 879
CVEs spanning 91 vulnerability types, JitVul enables comprehensive evaluation
of detection capabilities. Our results show that ReAct Agents, leveraging
thought-action-observation reasoning loops and interprocedural context, perform better than
LLMs in distinguishing vulnerable from benign code. While prompting strategies
like Chain-of-Thought benefit LLMs, ReAct Agents still require further
refinement. Both approaches show inconsistent behavior, either misidentifying
vulnerabilities or over-analyzing security guards in benign code, indicating
significant room for improvement.