Machine learning (ML) has started to be widely deployed in cyber security
settings to shorten the detection cycle of cyber attacks. To date, most
ML-based systems are either proprietary or make specific choices of feature
representations and ML models. The success of these techniques is
difficult to assess because public benchmark datasets are currently unavailable. In
this paper, we provide concrete guidelines and recommendations for using
supervised ML in cyber security. As a case study, we consider the problem of
botnet detection from network traffic data. Among our findings, we highlight
that (1) feature representations should take attack characteristics into
account; (2) ensemble models are well suited to handling class imbalance; and
(3) the granularity of ground truth plays an important role in the success of
these methods.