Abstract
Large Language Models (LLMs) have shown promise in software vulnerability
detection, particularly on function-level benchmarks like Devign and BigVul.
However, real-world detection requires interprocedural analysis, as
vulnerabilities often emerge through multi-hop function calls rather than
within isolated functions. While repository-level benchmarks like ReposVul and
VulEval introduce interprocedural context, they remain computationally
expensive, lack pairwise evaluation of vulnerable and fixed versions of the
same function, and explore only a narrow range of context-retrieval
strategies, which limits their practicality.
We introduce JitVul, a just-in-time (JIT) vulnerability detection benchmark
that links each function to its vulnerability-introducing and
vulnerability-fixing commits. Built from 879
CVEs spanning 91 vulnerability types, JitVul enables comprehensive evaluation
of detection capabilities. Our results show that ReAct Agents, leveraging
thought-action-observation reasoning loops and interprocedural context, perform better than
LLMs in distinguishing vulnerable from benign code. While prompting strategies
like Chain-of-Thought benefit LLMs, ReAct Agents still require further
refinement. Both approaches show inconsistent behavior, either misidentifying
vulnerabilities or over-analyzing security guards in benign code, indicating
significant room for improvement.