Benchmarking LLMs and LLM-based Agents in Practical Vulnerability Detection for Code Repositories


arXiv

AI Security Portal bot

The information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2503.03586

PDF

https://arxiv.org/pdf/2503.03586

Paper Information

Authors: Alperen Yildiz, Sin G. Teo, Yiling Lou, Yebo Feng, Chong Wang, Dinil M. Divakaran
Published: 2025-03-06
Updated: 2025-03-18
Affiliation: National University of Singapore
Country of affiliation: Singapore
Conference: Annual Meeting of the Association for Computational Linguistics (ACL)

Labels Estimated by AI

Vulnerability Detection

Indirect Prompt Injection

Deep Learning

Note: These labels were added automatically by AI, so they may not be accurate.
For details, see "About the Literature Database".

Abstract

Large Language Models (LLMs) have shown promise in software vulnerability detection, particularly on function-level benchmarks like Devign and BigVul. However, real-world detection requires interprocedural analysis, as vulnerabilities often emerge through multi-hop function calls rather than isolated functions. While repository-level benchmarks like ReposVul and VulEval introduce interprocedural context, they remain computationally expensive, lack pairwise evaluation of vulnerability fixes, and explore limited context retrieval, limiting their practicality. We introduce JitVul, a JIT vulnerability detection benchmark linking each function to its vulnerability-introducing and fixing commits. Built from 879 CVEs spanning 91 vulnerability types, JitVul enables comprehensive evaluation of detection capabilities. Our results show that ReAct Agents, leveraging thought-action-observation and interprocedural context, perform better than LLMs in distinguishing vulnerable from benign code. While prompting strategies like Chain-of-Thought help LLMs, ReAct Agents require further refinement. Both methods show inconsistencies, either misidentifying vulnerabilities or over-analyzing security guards, indicating significant room for improvement.
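The pairwise evaluation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not define the metric, so the scoring rule here is an assumption: a pair counts as correct only when the detector flags the version from the vulnerability-introducing commit and does not flag the version from the fixing commit. The names `PairPrediction` and `pairwise_accuracy` and the toy verdicts are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairPrediction:
    """Detector verdicts for one function before and after its fix (hypothetical type)."""
    flagged_vulnerable_version: bool  # verdict on the vulnerability-introducing version
    flagged_fixed_version: bool       # verdict on the version after the fixing commit


def pairwise_accuracy(pairs):
    """Fraction of pairs where the vulnerable version is flagged and the fixed one is not.

    Assumed scoring rule, not taken from the paper.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(
        1 for p in pairs
        if p.flagged_vulnerable_version and not p.flagged_fixed_version
    )
    return correct / len(pairs)


# Toy verdicts invented for illustration only; not results from the benchmark.
toy = [
    PairPrediction(True, False),   # correct: tells the two versions apart
    PairPrediction(True, True),    # flags both versions, so the fix goes unnoticed
    PairPrediction(False, False),  # misses the vulnerability entirely
    PairPrediction(True, False),   # correct
]
print(pairwise_accuracy(toy))
```

Under this assumed rule, a detector that labels everything vulnerable (or everything benign) scores zero, which is why a pairwise score can be stricter than accuracy computed on each version independently.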

External Datasets

JitVul

PrimeVul

MegaVul

DiverseVul

ReposVul

VulEval