The clear, social, and dark web have lately been identified as rich sources
of valuable cyber-security information that, given the appropriate tools and
methods, may be located, crawled, and subsequently turned into actionable
cyber-threat intelligence. In this work, we focus on the information-gathering
task, and present a novel crawling architecture for transparently harvesting
data from security websites in the clear web, security forums in the social
web, and hacker forums/marketplaces in the dark web. The proposed architecture
adopts a two-phase approach to data harvesting. In the first phase, a
machine-learning-based crawler directs the harvesting towards websites of
interest; in the second phase, state-of-the-art statistical language
modelling techniques are used to represent the harvested information in a
latent low-dimensional feature space and rank it based on its potential
relevance to the task at hand. The proposed architecture is realised
exclusively with open-source tools, and a preliminary evaluation with crowdsourced
results demonstrates its effectiveness.
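
The two-phase approach described above can be illustrated with a minimal sketch. It is not the implementation evaluated in this work: the keyword filter merely stands in for the machine-learning-based crawler, and the hand-written two-dimensional vectors in TOY_VECTORS stand in for a learned language model. All names (TOY_VECTORS, is_relevant, embed, rank) and all values are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical 2-d word vectors; a real system would learn these from a corpus.
TOY_VECTORS = {
    "exploit": (0.9, 0.1),
    "malware": (0.8, 0.2),
    "botnet": (0.85, 0.15),
    "recipe": (0.1, 0.9),
    "football": (0.05, 0.95),
}


def is_relevant(page_text, keywords=("exploit", "malware", "botnet"), threshold=1):
    """Phase 1 stand-in: keep a page if it mentions enough topic keywords."""
    tokens = page_text.lower().split()
    return sum(tokens.count(k) for k in keywords) >= threshold


def embed(text):
    """Average the vectors of known tokens; return None if no token is known."""
    vecs = [TOY_VECTORS[t] for t in text.lower().split() if t in TOY_VECTORS]
    if not vecs:
        return None
    return tuple(sum(component) / len(vecs) for component in zip(*vecs))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(pages, topic="exploit malware botnet"):
    """Phase 2 stand-in: score filtered pages against a topic vector."""
    topic_vec = embed(topic)
    scored = []
    for page in pages:
        if not is_relevant(page):
            continue  # discarded by the phase 1 filter
        vec = embed(page)
        if vec is not None:
            scored.append((cosine(vec, topic_vec), page))
    # Highest similarity first.
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    pages = ["new botnet exploit released", "football recipe", "malware recipe"]
    for score, page in rank(pages):
        print(f"{score:.3f}  {page}")
```

In this sketch, filtering precedes ranking, so pages rejected in the first phase never reach the more expensive representation step; that ordering mirrors the division of labour between the two phases.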