Abstract
In this paper, we present a challenging code reasoning task: vulnerability
detection. Large Language Models (LLMs) have shown promising results in
natural-language and math reasoning, but state-of-the-art (SOTA) models
achieved only 54.5% Balanced Accuracy in our vulnerability detection
evaluation, including models pre-trained on large amounts of source code. Our
error analysis on LLM responses shows that the models struggle to reason about
the code semantics relevant to identifying vulnerabilities, especially subtle
semantic differences caused by small textual changes. We explored prominent
models and training settings to understand their effects on vulnerability
detection performance -- including better prompts, larger models, more
pre-training data, and fine-tuning -- but none led to significant improvements.
This raises the question of whether simply scaling training data and model size
will allow us to "solve" complex code reasoning tasks like vulnerability
detection, or if a fundamental shift in modeling and training techniques is
required. We also explored adding domain knowledge to prompts; although it
helped certain models understand some code semantics, vulnerability detection
requires multi-step reasoning, and these models still failed at individual
steps, such as reasoning about relations between variables. Our results
suggest that new models, new
training methods, or more execution-specific pre-training data may be needed to
conquer vulnerability detection. We speculate that auto-regressive pre-training
on source code may not effectively extract code semantics, especially with the
current pre-training mixtures, in which execution data is scarce. Success on
vulnerability detection as a code reasoning task can benefit many areas of
software engineering such as debugging, test input generation, and program
repair. Our code and data are available at
https://doi.org/10.6084/m9.figshare.27368025.
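
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of Balanced Accuracy for a binary vulnerable/non-vulnerable task, assuming the standard definition (the mean of recall on each class). The function name and the toy labels are hypothetical and are not taken from the paper's code or data; the paper's exact evaluation script may differ.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of recall on the vulnerable (1) and non-vulnerable (0) classes.

    Assumes binary labels; this is the standard definition, not
    necessarily the exact computation used in the paper.
    """
    pairs = list(zip(y_true, y_pred))
    pos = sum(1 for t, _ in pairs if t == 1)
    neg = sum(1 for t, _ in pairs if t == 0)
    if pos == 0 or neg == 0:
        raise ValueError("both classes must appear in y_true")
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)


# Hypothetical toy labels: 4 vulnerable and 6 non-vulnerable functions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# A degenerate "model" that flags every function as vulnerable.
always_vulnerable = [1] * len(y_true)

# A hypothetical model that gets 3 of 4 vulnerable and 3 of 6 safe cases right.
mixed = [1, 1, 1, 0, 0, 0, 0, 1, 1, 1]

print(balanced_accuracy(y_true, always_vulnerable))
print(balanced_accuracy(y_true, mixed))
```

Under this definition, a classifier that always predicts a single class scores 0.5 regardless of class imbalance, which is the baseline against which the reported 54.5% should be read.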