Large Language Models (LLMs) have shown promising results for automated
software vulnerability patching on carefully curated evaluation sets. However,
existing datasets predominantly rely on superficial validation methods rather
than exploit-based verification, which leads to overestimated performance in
security-sensitive applications. This paper
introduces VulnRepairEval, an evaluation framework anchored in functional
Proof-of-Concept (PoC) exploits. Our framework delivers a comprehensive,
containerized evaluation pipeline that enables reproducible differential
assessment: a repair counts as successful only if the original exploit no
longer succeeds when executed against the patched code. Constructing the benchmark required
extensive data curation: we processed over 400 CVEs and approximately 2,500
potential sources to extract a collection of authentic vulnerability instances
(23 Python CVEs) amenable to automated testing with working PoCs. Through
VulnRepairEval, we systematically evaluate 12 popular LLMs and observe a
substantial performance deficit: even the best-performing model repairs only
5 of the 23 instances (21.7%), exposing critical limitations of current LLMs
for security-critical use. Our failure analysis reveals that
most unsuccessful attempts stem from imprecise vulnerability identification and
patches containing syntactic or semantic errors. Enhanced prompting strategies
and multi-agent approaches yield only marginal gains, leaving overall success
rates largely unchanged. This work contributes a stringent,
practical evaluation framework for LLM-driven vulnerability remediation and
underscores the need for assessment protocols that authentically reflect
real-world exploitation scenarios.
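
To make the differential criterion above concrete, the following minimal sketch shows one way a PoC-based check could be wired up. It is an illustrative assumption rather than the actual VulnRepairEval pipeline: the container image, mount layout, `run` helper names, and the exit-code convention (0 meaning the exploit succeeded) are hypothetical.

```python
"""Sketch of PoC-based differential validation (hypothetical, for illustration).

Assumptions: the PoC is a standalone Python script whose exit code signals
exploit success (0) or failure (non-zero), and each code version is mounted
read-only into a throwaway Docker container.
"""
import subprocess


def exploit_succeeds(poc_script: str, code_dir: str,
                     image: str = "python:3.11-slim",  # hypothetical base image
                     timeout: int = 120) -> bool:
    """Run the PoC against one code version inside an isolated container.

    Returns True if the exploit still works (exit code 0), False otherwise.
    """
    cmd = [
        "docker", "run", "--rm", "--network", "none",
        "-v", f"{code_dir}:/target:ro",
        "-v", f"{poc_script}:/poc.py:ro",
        image, "python", "/poc.py",
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False  # a hanging PoC is treated as a failed exploit
    return result.returncode == 0


def repair_is_successful(poc_script: str, vulnerable_dir: str,
                         patched_dir: str) -> bool:
    """Differential criterion: the PoC must succeed against the original code
    and must no longer succeed against the patched code."""
    return (exploit_succeeds(poc_script, vulnerable_dir)
            and not exploit_succeeds(poc_script, patched_dir))
```

Requiring the exploit to succeed on the unpatched code first guards against a broken or environment-dependent PoC being miscounted as a successful repair.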