Large language models (LLMs) excel at many software engineering tasks, yet
progress in leveraging them for vulnerability discovery has stalled in recent
years. To understand this phenomenon, we investigate LLMs through the lens of
classic code metrics. Surprisingly, we find that a classifier trained solely on
these metrics performs on par with state-of-the-art LLMs for vulnerability
discovery. A root-cause analysis reveals a strong correlation and a causal
effect between LLM predictions and code metrics: When the value of a metric is changed,
LLM predictions tend to shift by a corresponding magnitude. This dependency
suggests that LLMs operate at a level as shallow as code metrics,
limiting their ability to grasp complex patterns and fully realize their
potential in vulnerability discovery. Based on these findings, we derive
recommendations for how future research can address this challenge more effectively.
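For illustration, the following is a minimal sketch of the kind of metric-only baseline described above: a gradient-boosted classifier trained solely on a handful of classic code metrics. The metric proxies, helper names, and the (source, label) input format are assumptions made for this sketch, not the paper's actual features or pipeline.

```python
# Illustrative sketch only: a metric-based baseline of the kind the abstract
# refers to, not the paper's actual pipeline. The metrics below are crude
# stand-ins for classic size and complexity measures.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def classic_metrics(code: str) -> list[float]:
    """Compute rough proxies for classic code metrics of a code snippet."""
    lines = [ln for ln in code.splitlines() if ln.strip()]
    loc = len(lines)                                         # lines of code
    nesting = max(((len(ln) - len(ln.lstrip())) // 4
                   for ln in lines), default=0)              # indentation depth
    branches = sum(code.count(kw) for kw in ("if ", "for ", "while ", "case "))
    calls = code.count("(")                                  # rough call count
    return [float(loc), float(nesting), float(branches), float(calls)]


def train_metric_baseline(samples: list[tuple[str, int]]) -> float:
    """Train a classifier on metrics alone and return its test AUC.

    `samples` holds hypothetical (source_code, is_vulnerable) pairs and
    must be supplied by the caller; it is an assumption of this sketch.
    """
    X = [classic_metrics(src) for src, _ in samples]
    y = [label for _, label in samples]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```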