Advancing our understanding of software vulnerabilities and automating their
identification, the analysis of their impact, and ultimately their mitigation
are necessary to enable the development of more secure software. While
operating a vulnerability assessment tool that we developed and that is
currently used by hundreds of development units at SAP, we manually collected
and curated a dataset of vulnerabilities of open-source software and the
commits fixing them. The data was obtained both from the National Vulnerability
Database (NVD) and from project-specific Web resources that we monitor on a
continuous basis. From that data, we extracted a dataset that maps 624 publicly
disclosed vulnerabilities affecting 205 distinct open-source Java projects,
used in SAP products or internal tools, onto the 1282 commits that fix them.
Of the 624 vulnerabilities, 29 have no CVE identifier at all, and 46 have a
CVE identifier assigned by a numbering authority but are not yet available in
the NVD. The dataset is released under an open-source license,
together with supporting scripts that allow researchers to automatically
retrieve the actual content of the commits from the corresponding repositories
and to augment the attributes available for each instance. The scripts also
make it possible to complement the dataset with additional instances that are
not security fixes (which is useful, for example, in machine learning
applications). Our
dataset has been successfully used to train classifiers that could
automatically identify security-relevant commits in code repositories. The
release of this dataset and the supporting code as open source will allow
future research to be based on data of industrial relevance; it also
represents a concrete step towards making the maintenance of this dataset a
shared effort involving open-source communities, academia, and industry.
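
As a rough illustration of how such a vulnerability-to-commit mapping might be consumed, the following Python sketch groups fix commits by vulnerability and reads the patch of a commit from a local clone. The three-column CSV layout (vulnerability identifier, repository URL, commit hash), the function names, and the sample values are assumptions made for illustration; they are not the format or the API of the released dataset or scripts.

```python
import csv
import io
import subprocess
from collections import defaultdict


def load_mapping(csv_text):
    """Group fix commits by (vulnerability id, repository URL).

    Assumes each row has three columns: vulnerability id, repository URL,
    commit hash. This layout is hypothetical.
    """
    fixes = defaultdict(list)
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 3:
            continue  # skip blank or malformed rows
        vuln_id, repo_url, commit_sha = (field.strip() for field in row[:3])
        fixes[(vuln_id, repo_url)].append(commit_sha)
    return dict(fixes)


def fetch_commit_patch(repo_dir, commit_sha):
    """Return the patch text of one commit from an already cloned repository."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "show", "--format=", "--patch", commit_sha],
        capture_output=True,
        text=True,
        check=True,  # raise if git fails, e.g. for an unknown commit
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical sample rows; identifiers, URLs, and hashes are placeholders.
    sample = (
        "VULN-0001,https://example.org/project-a.git,aaa111\n"
        "VULN-0001,https://example.org/project-a.git,bbb222\n"
        "VULN-0002,https://example.org/project-b.git,ccc333\n"
    )
    for (vuln_id, repo_url), commits in load_mapping(sample).items():
        print(vuln_id, repo_url, len(commits))
```

Keeping the parsing step separate from the Git call means the mapping can be inspected without network access, while `fetch_commit_patch` is only invoked once the corresponding repository has been cloned locally.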