The success of a fuzzing campaign depends heavily on the quality of the seed
inputs used for test generation. It is, however, challenging to compose a corpus
of seed inputs that enable high code and behavior coverage of the target
program, especially when the target program requires complex input formats such
as PDF files. We present a machine-learning-based framework to improve the
quality of seed inputs for fuzzing programs that take PDF files as input. Given
an initial set of seed PDF files, our framework uses a set of neural
networks to 1) discover the correlation between these PDF files and their
execution in the target program, and 2) leverage this correlation to generate
new seed files that are more likely to explore new paths in the target program. Our
experiments on a set of widely used PDF viewers demonstrate that the improved
seed inputs produced by our framework can significantly increase the code
coverage of the target program and the likelihood of detecting program crashes.