AI Security Portal K Program
Top Score on the Wrong Exam: On Benchmarking in Machine Learning for Vulnerability Detection
Abstract
According to our survey of machine learning for vulnerability detection (ML4VD), 9 in every 10 papers published in the past five years define ML4VD as a function-level binary classification problem: Given a function, does it contain a security flaw? From our experience as security researchers, faced with deciding whether a given function makes the program vulnerable to attacks, we would often first want to understand the context in which this function is called. In this paper, we study how often this decision can really be made without further context and study both vulnerable and non-vulnerable functions in the most popular ML4VD datasets. We call a function "vulnerable" if it was involved in a patch of an actual security flaw and confirmed to cause the program's vulnerability. It is "non-vulnerable" otherwise. We find that in almost all cases this decision cannot be made without further context. Vulnerable functions are often vulnerable only because a corresponding vulnerability-inducing calling context exists while non-vulnerable functions would often be vulnerable if a corresponding context existed. But why do ML4VD techniques achieve high scores even though there is demonstrably not enough information in these samples? Spurious correlations: We find that high scores can be achieved even when only word counts are available. This shows that these datasets can be exploited to achieve high scores without actually detecting any security vulnerabilities. We conclude that the prevailing problem statement of ML4VD is ill-defined and call into question the internal validity of this growing body of work. Constructively, we call for more effective benchmarking methodologies to evaluate the true capabilities of ML4VD, propose alternative problem statements, and examine broader implications for the evaluation of machine learning and program analysis research.
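The word-count observation in the abstract can be made concrete with a small sketch. The code below is an illustration only, not the paper's experimental setup: it assumes a from-scratch multinomial naive Bayes model over identifier counts, and the four training functions, their labels, and the names `TRAIN`, `SAFE_STRCPY`, and `UNCHECKED_INDEX` are all hypothetical and not drawn from any real dataset. It shows how a classifier that sees only token counts keys on surface tokens such as `strcpy`, and so has no way to take the calling context into account.

```python
import math
import re
from collections import Counter

# Hypothetical toy samples, invented for this sketch: (C function source, label).
# Label 1 = "vulnerable", 0 = "non-vulnerable".
TRAIN = [
    ('void f(char *d, const char *s) { strcpy(d, s); }', 1),
    ('int g(char *buf, int n) { memcpy(buf, src, n); return n; }', 1),
    ('int h(int a, int b) { return a + b; }', 0),
    ('void log_msg(const char *m) { puts(m); }', 0),
]

# A strcpy call that is safe as written: 3 bytes copied into an 8-byte buffer.
SAFE_STRCPY = 'void k(void) { char d[8]; strcpy(d, "ok"); }'
# An unchecked index: harmful only if some caller passes an out-of-range i.
UNCHECKED_INDEX = 'int idx(int *a, int i) { return a[i]; }'


def tokenize(code):
    """Split source text into identifier-like words; all structure is discarded."""
    return re.findall(r"[A-Za-z_]\w*", code)


def train(samples):
    """Collect per-class word counts and per-class sample counts."""
    counts = {0: Counter(), 1: Counter()}
    docs = Counter()
    for code, label in samples:
        counts[label].update(tokenize(code))
        docs[label] += 1
    vocab = set(counts[0]) | set(counts[1])
    return counts, docs, vocab


def predict(model, code):
    """Return the class with the higher Laplace-smoothed log-likelihood."""
    counts, docs, vocab = model
    total_docs = sum(docs.values())
    scores = {}
    for label in (0, 1):
        total_words = sum(counts[label].values())
        score = math.log(docs[label] / total_docs)
        for tok in tokenize(code):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((counts[label][tok] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


if __name__ == "__main__":
    model = train(TRAIN)
    print("safe strcpy     ->", predict(model, SAFE_STRCPY))
    print("unchecked index ->", predict(model, UNCHECKED_INDEX))
```

On this toy data the model labels the safe `strcpy` call as vulnerable and the unchecked index as non-vulnerable. Neither answer reflects how the function is actually called, which is the kind of shortcut the abstract attributes to spurious correlations.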
DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection
Yizheng Chen, Zhoujie Ding, Lamya Alowain, Xinyun Chen, David Wagner
Published: 2023.4.2
Principles of Program Analysis
F. Nielson, H. R. Nielson, C. Hankin
Published: 2010
An axiomatic basis for computer programming
C. A. R. Hoare
Published: 1969
An empirical study on the effectiveness of static C code analyzers for vulnerability detection
Stephan Lipp, Sebastian Banescu, Alexander Pretschner
Published: 2022
Why don’t software developers use static analysis tools to find bugs?
B. Johnson, Y. Song, E. Murphy-Hill, R. Bowdidge
Published: 2013
Fuzzing: Challenges and opportunities
M. Böhme, C. Cadar, A. Roychoudhury
Published: 2021
MendelFuzz: The return of the deterministic stage
H. Zheng, F. Toffalini, M. Böhme, M. Payer
Published: 2025
Large language model guided protocol fuzzing
R. Meng, M. Mirchev, M. Böhme, A. Roychoudhury
Published: 2024
Invivo fuzzing by amplifying actual executions
O. Galland, M. Böhme
Published: 2025
Retesting software during development and maintenance
M. J. Harrold, A. Orso
Published: 2008
LLMs cannot reliably identify and reason about security vulnerabilities (yet?): A comprehensive evaluation, framework, and benchmarks
S. Ullah, M. Han, S. Pujar, H. Pearce, A. Coskun, G. Stringhini
Published: 2024
Uncovering the Limits of Machine Learning for Automatic Vulnerability Detection
Niklas Risse, Marcel Böhme
Published: 2023.6.28
Data quality for software vulnerability datasets
Roland Croft, M. Ali Babar, M. Mehdi Kholoosi
Published: 2023
Dos and Don'ts of Machine Learning in Computer Security
Daniel Arp, Erwin Quiring, Feargus Pendlebury, Alexander Warnecke, Fabio Pierazzi, Christian Wressnegger, Lorenzo Cavallaro, Konrad Rieck
Published: 2020.10.19
Towards causal deep learning for vulnerability detection
Md Mahbubur Rahman, Ira Ceka, Chengzhi Mao, Saikat Chakraborty, Baishakhi Ray, Wei Le
Published: 2024
Guidelines for performing systematic literature reviews in software engineering
B. Kitchenham, S. Charters
Published: 2007
Guidelines for conducting systematic mapping studies in software engineering: An update
Kai Petersen, Sairam Vakkalanka, Ludwik Kuzniarz
Published: 2015
A C/C++ code vulnerability dataset with code changes and CVE summaries
Jiahao Fan, Yi Li, Shaohua Wang, Tien N Nguyen
Published: 2020
Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks
Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, Yang Liu
Published: 2019.9.9
A coefficient of agreement for nominal scales
J. Cohen
Published: 1960
Learning from imbalanced data
H. He, E. A. Garcia
Published: 2009
Dataflow Analysis-Inspired Deep Learning for Efficient Vulnerability Detection
Benjamin Steenhoek, Hongyang Gao, Wei Le
Published: 2022.12.16
CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation
Y. Wang, W. Wang, S. Joty, S. C. Hoi
Published: 2021
On feature learning in the presence of spurious correlations
P. Izmailov, P. Kirichenko, N. Gruver, A. G. Wilson
Published: 2024
Vu1SPG: Vulnerability detection based on slice property graph representation learning
W. Zheng, Y. Jiang, X. Su
Published: 2021