Many malware families use domain generation algorithms (DGAs) to
establish command and control (C&C) connections. While there are many methods
to pseudorandomly generate domains, we focus in this paper on detecting (and
generating) domains on a per-domain basis, which provides a simple and flexible
means to detect known DGA families. Recent machine learning approaches to DGA
detection have been successful on relatively simple DGAs, many of which produce
names of fixed length. However, models trained on limited datasets can be
blind to new DGA variants.
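
For illustration only, a simple fixed-length DGA of the kind described above
might look like the following minimal sketch. It is not taken from any real
malware family: the function name toy_dga, the seed format, the default length,
and the ".com" TLD are all hypothetical choices made for this example.

```python
import hashlib
from typing import List


def toy_dga(seed: str, count: int, length: int = 12, tld: str = ".com") -> List[str]:
    """Hypothetical fixed-length DGA: hash (seed, index) and map bytes to letters."""
    if not 1 <= length <= 32:
        # A SHA-256 digest has 32 bytes, so that is the longest label we can build.
        raise ValueError("length must be between 1 and 32")
    domains = []
    for i in range(count):
        digest = hashlib.sha256(f"{seed}:{i}".encode("utf-8")).digest()
        # Map each of the first `length` digest bytes to a lowercase letter.
        label = "".join(chr(ord("a") + b % 26) for b in digest[:length])
        domains.append(label + tld)
    return domains


if __name__ == "__main__":
    # Malware and its operator would share the seed and derive the same list.
    for name in toy_dga("2017-01-01", 3):
        print(name)
```

Because every name has the same length and a near-uniform character
distribution, output like this is comparatively easy for a per-domain
classifier to separate from benign domains.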
In this paper, we leverage the concept of generative adversarial networks to
construct a deep-learning-based DGA that is designed to intentionally bypass a
deep-learning-based detector. In a series of adversarial rounds, the generator
learns to generate domain names that are increasingly difficult to detect.
In turn, a detector model updates its parameters to compensate for the
adversarially generated domains. We test the hypothesis that
adversarially generated domains can be used to augment training sets in order
to harden other machine learning models against yet-to-be-observed DGAs. We
detail solutions to several challenges in training this character-based
generative adversarial network (GAN). In particular, our deep learning
architecture begins as a domain name autoencoder (encoder + decoder) trained
on domains in the Alexa one million. The encoder and decoder are then
reassembled competitively in a GAN (detector + generator), with novel neural
architectures and training strategies to improve convergence.