Intrusion detection is an essential task in the cyber threat environment.
Machine learning and deep learning techniques have been applied to intrusion
detection. However, most existing research focuses on model development and
overlooks the fact that poor data quality directly affects the performance
of a machine learning system. More attention should be paid to data work
when building a machine learning-based intrusion detection system. This article
first summarizes existing machine learning-based intrusion detection systems
and the datasets used for building these systems. Then the data preparation
workflow and quality requirements for intrusion detection are discussed. To
understand how data and models affect machine learning performance, we
conducted experiments on 11 host-based intrusion detection system (HIDS)
datasets using seven machine learning models
and three deep learning models. The experimental results show that BERT and GPT
were the best-performing models for HIDS on all of the datasets. However,
performance varies across datasets, indicating differences in
their data quality. We then evaluate the data quality of the 11
datasets against the quality dimensions proposed in this article to determine the
characteristics that a HIDS dataset should possess to yield the
best possible results. This research offers a data quality perspective for
researchers and practitioners to improve the performance of machine
learning-based intrusion detection.