Many domains now leverage the benefits of Machine Learning (ML), which
promises solutions that can autonomously learn to solve complex tasks by
training on data. Unfortunately, in cyberthreat detection, high-quality
data is hard to come by. Moreover, for some applications of ML, such
data must be labeled by human operators. Many works assume that labeling is
difficult and costly in cyberthreat detection, and propose solutions to
address this hurdle. Yet, we found no work that specifically examines the
labeling process from the viewpoint of ML security practitioners. This is
a problem: to date, it is still mostly unknown how labeling is done in
practice, which prevents researchers from pinpointing what is actually needed
in the real world.
In this paper, we take a first step toward bridging academic research and
security practice in the context of data labeling. First, we conduct open
interviews with five subject-matter experts to identify pain points in their
labeling routines. Then, using these findings as a scaffold, we carry out a
user study with 13 practitioners from large security companies, asking
detailed questions about active learning, labeling costs, and label
revision. Finally, we perform proof-of-concept
experiments addressing labeling-related aspects of cyberthreat detection that
are sometimes overlooked in research. Altogether, our contributions and
recommendations serve as a stepping stone for future endeavors aimed at
improving the quality and robustness of ML-driven security systems. We release
our resources.