Malicious PDF files represent one of the biggest threats to computer
security. To detect them, significant research has been done using handwritten
signatures or machine learning based on manual feature extraction. Those
approaches are both time-consuming, require significant prior knowledge and the
list of features has to be updated with each newly discovered vulnerability. In
this work, we propose a novel algorithm that uses an ensemble of Convolutional
Neural Network (CNN) on the byte level of the file, without any handcrafted
features. We show, using a data set of 90000 files downloadable online, that
our approach maintains a high detection rate (94%) of PDF malware and even
detects new malicious files, still undetected by most antiviruses. Using
automatically generated features from our CNN network, and applying a
clustering algorithm, we also obtain high similarity between the antiviruses'
labels and the resulting clusters.