Many of the proposed machine learning (ML) based network intrusion detection
systems (NIDSs) achieve near perfect detection performance when evaluated on
synthetic benchmark datasets. However, it is largely unknown whether and how
these results generalise to other network scenarios, in particular to real-world
networks. In this paper, we investigate the generalisability property of
ML-based NIDSs by extensively evaluating seven supervised and unsupervised
learning models on four recently published benchmark NIDS datasets. Our
investigation shows that none of the considered models generalises across all
of the studied datasets. Interestingly, our results also indicate that
generalisability is highly asymmetric, i.e., swapping the source and target
domains can significantly change the classification
performance. Our investigation also indicates that overall, unsupervised
learning methods generalise better than supervised learning models in our
considered scenarios. Explaining these results with SHAP values indicates that
the lack of generalisability is mainly due to a strong correspondence between
the values of one or more features and the Attack/Benign classes in one
dataset-model combination, and the absence of such a correspondence in other
datasets with different feature distributions.
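The attribution idea behind SHAP can be illustrated with a minimal,
self-contained sketch: exact Shapley values computed by averaging each
feature's marginal contribution over all feature orderings. The toy model and
the values used here are hypothetical examples, not taken from the paper's
experiments; real SHAP usage would rely on the `shap` library rather than this
brute-force computation.

```python
from itertools import permutations

def shapley_values(f, x, baseline):
    """Exact Shapley values of model f at point x relative to a baseline.

    Averages each feature's marginal contribution to f(x) over all
    orderings in which features are switched from their baseline values
    to their actual values. Exponential cost: only viable for toy models.
    """
    n = len(x)
    phi = [0.0] * n
    perms = list(permutations(range(n)))
    for order in perms:
        z = list(baseline)
        for i in order:
            before = f(z)
            z[i] = x[i]          # switch feature i on
            phi[i] += f(z) - before
    return [p / len(perms) for p in phi]

# Hypothetical "detector" score dominated by feature 0 -- mimicking a
# feature whose values strongly correspond to the Attack class in one
# dataset. Its large attribution would not transfer to a dataset where
# that feature is distributed differently.
score = lambda z: 0.9 * z[0] + 0.1 * z[1]

print(shapley_values(score, x=[1.0, 1.0], baseline=[0.0, 0.0]))
```

For this linear model the attributions recover the coefficients, and, as
Shapley values must, they sum to the difference between the model output at
`x` and at the baseline.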