Abstract
The clear, social, and dark web have lately been recognised as rich sources
of valuable cyber-security information that, given the appropriate tools and
methods, may be identified, crawled, and subsequently turned into actionable
cyber-threat intelligence. In this work, we focus on the information-gathering
task, and present a novel crawling architecture for transparently harvesting
data from security websites in the clear web, security forums in the social
web, and hacker forums/marketplaces in the dark web. The proposed architecture
adopts a two-phase approach to data harvesting. In the first phase, a machine
learning-based crawler directs the harvesting towards websites of
interest; in the second phase, state-of-the-art statistical language
modelling techniques are used to represent the harvested information in a
latent low-dimensional feature space and rank it based on its potential
relevance to the task at hand. The proposed architecture is realised using
exclusively open-source tools, and a preliminary evaluation with crowdsourced
results demonstrates its effectiveness.
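
To make the two-phase idea concrete, the following is a minimal sketch and not the architecture evaluated in this work. It assumes a toy in-memory web graph (`TOY_WEB`), a keyword-ratio function (`relevance`) standing in for a trained page classifier, and hand-written two-dimensional word vectors (`EMBEDDINGS`) standing in for a learned language model. All of these names and values are hypothetical and chosen only for illustration.

```python
import heapq
import math
import re

# Toy in-memory "web": url -> (page text, outgoing links). Purely illustrative.
TOY_WEB = {
    "seed": ("security news portal", ["a", "b"]),
    "a": ("new exploit targets router vulnerability", ["c"]),
    "b": ("cooking recipes and garden tips", ["d"]),
    "c": ("ransomware malware sold on marketplace", []),
    "d": ("more garden tips", []),
}

# Assumption: a keyword ratio stands in for a trained relevance classifier.
SECURITY_TERMS = {"exploit", "malware", "vulnerability", "ransomware", "security"}

# Assumption: hand-written 2-D vectors stand in for learned word embeddings.
EMBEDDINGS = {
    "exploit": (0.9, 0.1), "vulnerability": (0.8, 0.2),
    "malware": (0.9, 0.0), "ransomware": (1.0, 0.1),
    "security": (0.6, 0.3), "news": (0.1, 0.8),
    "portal": (0.0, 0.9), "marketplace": (0.3, 0.6),
}


def tokens(text):
    """Lowercase the text and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())


def relevance(text):
    """Fraction of tokens that are security terms (classifier stand-in)."""
    toks = tokens(text)
    return sum(t in SECURITY_TERMS for t in toks) / len(toks) if toks else 0.0


def focused_crawl(seed, threshold=0.2, max_pages=10):
    """Phase 1: best-first crawl that only expands links of relevant pages."""
    frontier = [(-1.0, seed)]  # min-heap with negated priority = max-heap
    seen, harvested = {seed}, []
    while frontier and len(harvested) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = TOY_WEB[url]
        score = relevance(text)
        if score < threshold:
            continue  # off-topic page: neither harvested nor expanded
        harvested.append((url, text))
        for link in links:
            if link not in seen:
                seen.add(link)
                # A link inherits the relevance of the page it was found on.
                heapq.heappush(frontier, (-score, link))
    return harvested


def embed(text):
    """Map a page to the mean of its known word vectors (low-dimensional)."""
    vecs = [EMBEDDINGS[t] for t in tokens(text) if t in EMBEDDINGS]
    if not vecs:
        return (0.0, 0.0)
    return tuple(sum(component) / len(vecs) for component in zip(*vecs))


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(u[0], u[1]) * math.hypot(v[0], v[1])
    return dot / norm if norm else 0.0


def rank(pages, topic_vector=(1.0, 0.0)):
    """Phase 2: order harvested pages by similarity to a topic vector."""
    scored = [(cosine(embed(text), topic_vector), url) for url, text in pages]
    return [url for _, url in sorted(scored, reverse=True)]


if __name__ == "__main__":
    print(rank(focused_crawl("seed")))
```

In this toy graph the off-topic branch (`b`, and therefore `d`) is never harvested, while the remaining pages are ordered by how closely their averaged vectors align with the assumed topic direction. A real deployment would replace `relevance` with a trained classifier and `EMBEDDINGS` with vectors learned from a security corpus.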