As the volume of cyber data continues to grow, network defenders must analyze
ever larger quantities of it to ensure the security
of their networks. In addition, new types of attacks are constantly being
created and executed globally. Current rule-based approaches are effective at
characterizing and flagging known attacks, but they typically fail when
presented with a new attack or new types of data. By contrast, unsupervised
machine learning has the distinct advantage of not requiring labeled data to
learn from large amounts of network traffic. In this paper, we apply a
technique from natural language processing (suffix trees) to cyber anomaly
detection. We illustrate one methodology for generating a language from cyber
data features, and our experiments show positive preliminary
results in applying this technique to flow-type data. As an underlying
assumption of this work, we claim that malicious cyber actors leave
observables in the data as they execute their attacks. This work seeks to
find those artifacts and exploit them to detect a wide range of cyber
attacks without the need for labeled ground-truth data.
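
To make the idea of generating a language from flow features concrete, the following is a minimal sketch and not the method evaluated in this paper. It assumes a hypothetical flow record with three fields (`proto`, `dst_port`, `n_bytes`), arbitrary binning thresholds, and a simple scoring rule (the fraction of fixed-length token windows never seen in the baseline). For brevity it uses a depth-limited, uncompressed suffix trie rather than a true compressed suffix tree.

```python
from collections import namedtuple

# Hypothetical flow record; the field names are illustrative assumptions.
Flow = namedtuple("Flow", ["proto", "dst_port", "n_bytes"])


def flow_to_word(flow):
    """Map one flow to a discrete token (a 'word' in the generated language)."""
    # Binning thresholds below are arbitrary choices made for this sketch.
    port_class = "well_known" if flow.dst_port < 1024 else "high"
    if flow.n_bytes < 1_000:
        size_class = "small"
    elif flow.n_bytes < 100_000:
        size_class = "medium"
    else:
        size_class = "large"
    return f"{flow.proto}:{port_class}:{size_class}"


class SuffixTrie:
    """Depth-limited, uncompressed suffix trie over token sequences."""

    def __init__(self, max_depth=3):
        self.max_depth = max_depth
        self.root = {}

    def add(self, words):
        # Insert every suffix of the sequence, truncated to max_depth tokens.
        for start in range(len(words)):
            node = self.root
            for word in words[start:start + self.max_depth]:
                node = node.setdefault(word, {})

    def contains(self, words):
        # Walk the trie token by token; fail on the first missing edge.
        node = self.root
        for word in words:
            if word not in node:
                return False
            node = node[word]
        return True

    def anomaly_score(self, words):
        """Fraction of length-max_depth windows absent from the baseline."""
        n = self.max_depth
        windows = [words[i:i + n] for i in range(len(words) - n + 1)]
        if not windows:
            return 0.0
        unseen = sum(not self.contains(w) for w in windows)
        return unseen / len(windows)


# Unlabeled baseline traffic (made-up values, for illustration only).
baseline = [
    Flow("tcp", 443, 5_000),
    Flow("tcp", 443, 7_000),
    Flow("udp", 53, 200),
    Flow("tcp", 443, 6_000),
]
baseline_words = [flow_to_word(f) for f in baseline]

trie = SuffixTrie(max_depth=3)
trie.add(baseline_words)

# A new sequence ending in a large transfer to a high port.
observed = baseline + [Flow("tcp", 4444, 500_000)]
observed_words = [flow_to_word(f) for f in observed]

baseline_score = trie.anomaly_score(baseline_words)
observed_score = trie.anomaly_score(observed_words)
```

Under these assumptions the baseline scores 0.0 against itself, while the observed sequence scores higher because its final window contains a token pattern absent from the trie. No labels are used at any point, which is the property the abstract emphasizes.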