Intrusion detection systems (IDS) monitor networks or systems for
attack activity or policy violations. Such a system should reliably
identify anomalous deviations from normal traffic behavior. Here
we discuss a machine learning approach to building an anomaly-based IDS using
the CSE-CIC-IDS2018 dataset. Since this dataset was released, a relatively
large number of papers have appeared, most of them presenting IDS
architectures and results based on complex machine learning methods, such as deep
neural networks, gradient boosting classifiers, or hidden Markov models. Here
we show that similar results can be obtained using a very simple nearest
neighbor classification approach, avoiding the inherent complications of
training such complex models. The nearest neighbor algorithm has three
advantages: (1) it is very simple to implement; (2) it is extremely robust; (3) it has
no trainable parameters, so there is no model-fitting step in which it could
overfit the data. This result also shows
that there is currently a trend toward over-engineered solutions in the
machine learning community. Such solutions rely on complex methods, such as
deep neural networks, without even considering baseline solutions
built on simple but efficient methods.
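
To illustrate how little code the nearest neighbor rule requires, the following is a minimal sketch, not the implementation behind the reported results. It assumes Euclidean distance as the metric and uses hand-made toy feature vectors; the function name `nearest_neighbor_predict`, the two features, and the labels are hypothetical and are not taken from CSE-CIC-IDS2018.

```python
import math


def nearest_neighbor_predict(train_X, train_y, x):
    """Return the label of the training sample closest to x (1-NN rule)."""
    best_label, best_dist = None, math.inf
    for sample, label in zip(train_X, train_y):
        # Euclidean distance; the choice of metric is an assumption here.
        d = math.dist(sample, x)
        if d < best_dist:
            best_dist, best_label = d, label
    return best_label


# Hypothetical toy flows: (duration in seconds, packets per second).
train_X = [(0.5, 10.0), (0.7, 12.0), (0.1, 900.0), (0.2, 850.0)]
train_y = ["benign", "benign", "attack", "attack"]

# Classify two unseen toy flows by copying the label of the closest sample.
queries = [(0.6, 11.0), (0.15, 880.0)]
predictions = [nearest_neighbor_predict(train_X, train_y, q) for q in queries]
print(predictions)
```

There is no training step: the labeled samples themselves are the model, and prediction is a linear scan over them. In a setting with features of very different ranges, the features would normally be rescaled first, since otherwise the feature with the largest range dominates the distance.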