Many domains now leverage the benefits of Machine Learning (ML), which
promises solutions that can autonomously learn to solve complex tasks by
training on data. Unfortunately, in cyberthreat detection, high-quality
data is hard to come by. Moreover, for some applications of ML, such
data must be labeled by human operators. Many works assume that labeling is
difficult and costly in cyberthreat detection, and propose solutions to
address this hurdle. Yet, we found no work that specifically examines the
labeling process from the viewpoint of ML security practitioners. This is
a problem: to date, it is still mostly unknown how labeling is done in
practice, which prevents researchers from pinpointing what is actually needed
in the real world.
In this paper, we take a first step toward bridging academic research and
security practice in the context of data labeling. First, we conduct open
interviews with five subject-matter experts to identify pain points in their
labeling routines. Then, using these findings as a scaffold, we carry out a
user study with 13 practitioners from large security companies, asking
detailed questions about active learning, labeling costs, and label
revision. Finally, we perform proof-of-concept
experiments addressing labeling-related aspects of cyberthreat detection that
are sometimes overlooked in research. Altogether, our contributions and
recommendations serve as a stepping stone for future endeavors aimed at
improving the quality and robustness of ML-driven security systems. We release
our resources.