This paper focuses on the reporting of malicious Internet activity (or
mal-activity for short) by public blacklists, with the objective of providing a
systematic characterization of what has been reported over the years, and more
importantly, the evolution of reported activities. Using an initial seed of 22
blacklists, covering the period from January 2007 to June 2017, we collect more
than 51 million mal-activity reports involving 662K unique IP addresses
worldwide. Leveraging the Wayback Machine, antivirus (AV) tool reports and
several additional public datasets (e.g., BGP Route Views and Internet
registries), we enrich the data with historical meta-information including
geo-locations (countries), autonomous system (AS) numbers and types of
mal-activity. Furthermore, we use the initially labeled dataset of approximately
1.57 million mal-activities (obtained from public blacklists) to train a machine
learning classifier that labels the remaining unlabeled dataset of approximately 44
million mal-activities obtained through additional sources. We make our unique
collected dataset (and scripts used) publicly available for further research.
The main contributions of the paper are a novel means of report collection,
a machine learning approach to classifying reported activities, a
characterization of the dataset and, most importantly, a temporal analysis of
mal-activity reporting behavior. Inspired by P2P behavior modeling, our
analysis shows that some classes of mal-activities (e.g., phishing) and a small
number of mal-activity sources are persistent, suggesting that
blacklist-based prevention systems are either ineffective or have unreasonably long
update periods. Our analysis also indicates that resources can be better
utilized by focusing on heavy mal-activity contributors, which account for the
bulk of mal-activities.
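
The observation about heavy contributors can be illustrated with a minimal sketch. It assumes each report is reduced to the IP address of its source, and it measures what share of all reports comes from the most active sources. The function name `top_contributor_share`, the `top_fraction` parameter and the toy addresses are hypothetical and are not taken from the paper's dataset or scripts.

```python
from collections import Counter


def top_contributor_share(report_sources, top_fraction=0.1):
    """Return the fraction of all reports that come from the most active sources.

    report_sources: iterable with one source identifier (e.g., an IP) per report.
    top_fraction:   fraction of distinct sources treated as heavy contributors
                    (an illustrative assumption, not a value from the paper).
    """
    counts = Counter(report_sources)
    if not counts:
        return 0.0
    # Always consider at least one source, even for very small inputs.
    k = max(1, int(len(counts) * top_fraction))
    top_total = sum(n for _, n in counts.most_common(k))
    return top_total / sum(counts.values())


# Toy, made-up data: one source is responsible for 8 of 10 reports.
reports = ["192.0.2.1"] * 8 + ["192.0.2.2", "192.0.2.3"]
share = top_contributor_share(reports, top_fraction=0.34)
print(f"Share of reports from the top sources: {share:.2f}")
```

On real data, a share close to 1 for a small `top_fraction` would be consistent with a few sources accounting for the bulk of reported mal-activities.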