With the increasing prevalence of encrypted network traffic, cyber security
analysts have been turning to machine learning (ML) techniques to elucidate the
traffic on their networks. However, ML models can become stale as new traffic
emerges that lies outside the distribution of the training set. To adapt
reliably in this dynamic environment, ML models must also provide
contextualized uncertainty quantification for their predictions, a topic that
has received little attention in the cyber security domain. Uncertainty
quantification is necessary to signal both when the model is uncertain about
which class to assign and when the traffic is unlikely to belong to any class
seen during training.
We present a new, public dataset of network traffic that includes labeled,
Virtual Private Network (VPN)-encrypted network traffic generated by 10
applications and corresponding to 5 application categories. We also present an
ML framework designed to train rapidly on modest amounts of data and to provide
both calibrated predictive probabilities and an interpretable
"out-of-distribution" (OOD) score that flags novel traffic samples.
We describe how to calibrate OOD scores using p-values of the relative
Mahalanobis distance.
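
As a rough illustration of this idea, the sketch below computes a relative
Mahalanobis distance score and converts it to an empirical p-value against
held-out in-distribution scores. It is a minimal sketch under stated
assumptions, not the framework's implementation: the function names (fit_rmd,
rmd_score, ood_p_value), the shared within-class covariance, the small ridge
term, and the rank-based p-value are all illustrative choices that the text
above does not specify.

```python
import numpy as np


def fit_rmd(X, y, ridge=1e-6):
    """Fit class-conditional and background Gaussians on feature vectors X.

    Assumption: one shared within-class covariance, plus a single
    "background" Gaussian fit to all training data regardless of label.
    """
    classes = np.unique(y)
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    centered = X - means[np.searchsorted(classes, y)]
    cov = centered.T @ centered / len(X)
    cov0 = np.cov(X, rowvar=False, bias=True)
    eye = ridge * np.eye(X.shape[1])  # keeps the inverses well conditioned
    return {
        "means": means,
        "prec": np.linalg.inv(cov + eye),
        "mu0": X.mean(axis=0),
        "prec0": np.linalg.inv(cov0 + eye),
    }


def _sq_mahalanobis(X, mu, prec):
    diff = X - mu
    return np.einsum("ij,jk,ik->i", diff, prec, diff)


def rmd_score(model, X):
    """Relative Mahalanobis distance: min over classes of MD_k(x) - MD_0(x)."""
    md0 = _sq_mahalanobis(X, model["mu0"], model["prec0"])
    md_k = np.stack(
        [_sq_mahalanobis(X, m, model["prec"]) for m in model["means"]], axis=1
    )
    return (md_k - md0[:, None]).min(axis=1)


def ood_p_value(cal_scores, test_scores):
    """Empirical p-value: share of calibration scores >= the test score.

    Small values mean the sample looks more extreme than nearly all
    held-out in-distribution samples.
    """
    cal = np.sort(np.asarray(cal_scores))
    n_ge = len(cal) - np.searchsorted(cal, test_scores, side="left")
    return (1 + n_ge) / (len(cal) + 1)
```

In this sketch, a sample whose p-value falls below a chosen threshold (for
example 0.05, itself an assumption) would be flagged as novel. The p-value
scale makes the score interpretable as a rate: under the calibration
distribution, roughly that fraction of in-distribution samples would be
flagged.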
We demonstrate that our framework achieves an F1 score of 0.98 on our dataset.
We further show that it can extend to an enterprise network by testing the
model on (1) data from similar applications, (2) dissimilar application traffic
from an existing category, and (3) application traffic from a new category. The
model correctly flags uncertain traffic and, upon retraining, accurately
incorporates the new data.