With malware detection techniques increasingly adopting machine learning
approaches, the creation of precise training sets becomes more and more
important. Large data sets of realistic web traffic, correctly classified as
benign or malicious, are needed not only to train classic and deep learning
algorithms, but also to serve as evaluation benchmarks for existing malware
detection products. Interestingly, despite the vast number and versatility of
threats a user may encounter when browsing the web, actual malicious content is
often hard to come by, since prerequisites such as browser and operating system
type and version must be met in order to receive the payload from a
malware-distributing server. In combination with privacy constraints on data sets of
actual user traffic, it is difficult for researchers and product developers to
evaluate anti-malware solutions against large-scale data sets of realistic web
traffic. In this paper, we present WebEye, a framework that autonomously creates
realistic HTTP traffic, enriches recorded traffic with additional information,
and classifies records as malicious or benign using different classifiers. We
use WebEye to collect malicious HTML and JavaScript and show how data sets
created with WebEye can be used to train machine-learning-based malware
detection algorithms. We regard WebEye and the data sets it creates as a tool
for researchers and product developers to evaluate and improve their AI-based
anti-malware solutions against large-scale benchmarks.