Malicious web content is a serious problem on the Internet today. In this
paper we propose a deep learning approach to detecting malevolent web pages.
While past work on web content detection has relied on syntactic parsing or on
emulation of HTML and JavaScript to extract features, our approach operates
directly on a language-agnostic stream of tokens extracted from static
HTML files with a simple regular expression. This makes it fast enough to
operate in high-frequency data contexts like firewalls and web proxies, and
allows it to avoid the attack surface exposure of complex parsing and emulation
code. Unlike well-known approaches such as bag-of-words models, which ignore
spatial information, our neural network examines content at hierarchical
spatial scales, allowing our model to capture locality and yielding superior
accuracy compared to bag-of-words baselines. Our proposed architecture achieves
a 97.5% detection rate at a 0.1% false positive rate, and classifies web
pages in small batches at a rate of over 100 per second on commodity hardware.
The speed and accuracy of our approach make it appropriate for deployment to
endpoints, firewalls, and web proxies.
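
To make the idea of regex tokenization and hierarchical spatial scales concrete, the following is a minimal sketch rather than the architecture evaluated here. The token pattern, the hashing of tokens into a fixed number of bins, the scales (1, 2, 4), and all function names are illustrative assumptions. The sketch only shows how a raw HTML string could be turned into per-chunk token counts at several granularities, which a neural network could then consume.

```python
import re
import zlib

# Assumed token pattern (not taken from the paper): runs of non-ASCII
# characters or runs of word characters; everything else is a separator.
TOKEN_RE = re.compile(r"[^\x00-\x7F]+|\w+")


def tokenize(html):
    """Extract a flat token stream from raw HTML without parsing it."""
    return TOKEN_RE.findall(html)


def hashed_counts(tokens, n_bins):
    """Bag-of-tokens vector: count tokens per hash bin (hashing trick)."""
    vec = [0] * n_bins
    for tok in tokens:
        vec[zlib.crc32(tok.encode("utf-8")) % n_bins] += 1
    return vec


def split_even(tokens, n_chunks):
    """Split the token stream into n_chunks contiguous, near-equal chunks."""
    n = len(tokens)
    return [tokens[i * n // n_chunks:(i + 1) * n // n_chunks]
            for i in range(n_chunks)]


def hierarchical_features(html, scales=(1, 2, 4), n_bins=16):
    """For each scale s, return s bag-of-tokens vectors, one per chunk.

    Scale 1 is an ordinary whole-document bag of tokens; larger scales
    keep coarse positional information that a single bag discards.
    """
    tokens = tokenize(html)
    return {s: [hashed_counts(chunk, n_bins) for chunk in split_even(tokens, s)]
            for s in scales}


if __name__ == "__main__":
    page = "<html><body><a href='x'>hi</a><script>eval(x)</script></body></html>"
    for scale, vectors in hierarchical_features(page).items():
        print(scale, vectors)
```

At scale 1 this reduces to a plain bag-of-words representation, so the finer scales are what carry the locality information that the bag-of-words baselines lack.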