VirusTotal (VT) provides aggregated threat intelligence on various entities
including URLs, IP addresses, and binaries. It is widely used by researchers
and practitioners to collect ground truth and evaluate the maliciousness of
entities. In this work, we provide a comprehensive analysis of VT URL scanning
reports containing the results of 95 scanners for 1.577 billion URLs over two
years. Individual VT scanners are known to be noisy in terms of their detection
and attack type classification. Obtaining high-quality ground truth for URLs
and taking appropriate action to mitigate different types of attacks poses
two challenges: (1) how to decide whether a given URL is malicious given noisy
reports and (2) how to determine attack types (e.g., phishing or malware
hosting) that the URL is involved in, given conflicting attack labels from
different scanners. We provide a systematic comparative study of the behavior
of VT scanners across different attack types of URLs. A common practice for
deciding maliciousness is to apply a cut-off threshold on the number of
scanners that report the URL as malicious. However, we show that using a
fixed threshold is suboptimal for several reasons: (1) correlations between
scanners; (2) lead/lag behavior; (3) the specialty of scanners; and (4) the
quality and reliability of scanners. A common practice to determine an attack type is
to use majority voting. However, we show that majority voting cannot
accurately classify the attack type of a URL because of bias from correlated
scanners. Instead, we propose a machine learning-based approach to assign an
attack type to URLs given the VT reports.
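
The two common practices discussed above, a fixed detection threshold and majority voting over attack labels, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the report layout, the scanner names, and the correlation groups are hypothetical and do not reflect the actual VT report schema. The `deduplicated_vote` helper is not the machine learning-based approach proposed in this work; it only shows how counting a group of correlated scanners once can change the outcome of a vote.

```python
from collections import Counter

# Hypothetical, simplified report: scanner name -> attack label, where None
# means the scanner did not flag the URL. Not the actual VT report schema.
report = {
    "scanner_a": "phishing",
    "scanner_b": "phishing",  # assumed to mirror scanner_a's verdicts
    "scanner_c": "phishing",  # assumed to mirror scanner_a's verdicts
    "scanner_d": "malware",
    "scanner_e": "malware",
    "scanner_f": None,
}

# Hypothetical grouping of scanners assumed to be correlated with each other.
groups = [
    ["scanner_a", "scanner_b", "scanner_c"],
    ["scanner_d"],
    ["scanner_e"],
    ["scanner_f"],
]


def is_malicious(report, threshold):
    """Fixed cut-off: malicious if at least `threshold` scanners flag the URL."""
    detections = sum(1 for label in report.values() if label is not None)
    return detections >= threshold


def majority_attack_type(report):
    """Plain majority vote over the labels of the scanners that flagged the URL."""
    votes = Counter(label for label in report.values() if label is not None)
    return votes.most_common(1)[0][0] if votes else None


def deduplicated_vote(report, groups):
    """Majority vote in which each correlated group contributes one vote per label."""
    votes = Counter()
    for group in groups:
        # A set keeps each distinct label once per group.
        labels = {report[s] for s in group if report.get(s) is not None}
        votes.update(labels)
    return votes.most_common(1)[0][0] if votes else None


print(is_malicious(report, threshold=5))   # five scanners flag the URL
print(majority_attack_type(report))        # three mirrored votes dominate
print(deduplicated_vote(report, groups))   # mirrored votes counted once
```

In this toy report, the plain vote returns "phishing" because three mirrored scanners outvote two independent ones, while the deduplicated vote returns "malware". The verdict of the fixed threshold likewise depends entirely on the chosen cut-off: the URL is labeled malicious at a threshold of 5 but not at 6.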