The ability to accurately predict cyber-attacks would enable organizations to
mitigate their growing threat and avert the financial losses and disruptions
they cause. But how predictable are cyber-attacks? Researchers have attempted
to combine external data -- ranging from vulnerability disclosures to
discussions on Twitter and the darkweb -- with machine learning algorithms to
learn indicators of impending cyber-attacks. However, successful cyber-attacks
represent a tiny fraction of all attempted attacks: the vast majority are
stopped or filtered by the security appliances deployed at the target. As we
show in this paper, the process of filtering reduces the predictability of
cyber-attacks. The few attacks that do penetrate the target's defenses follow
a different generative process than the full attack data, one that is much
harder for predictive models to learn. One possible explanation is that the
filtered time series depends not only on the factors that drive the original
series of attempted attacks but also on the filtering process itself. We
empirically quantify the loss of predictability due to filtering using
real-world data from two organizations. Our work identifies the limits to
forecasting cyber-attacks from highly filtered data.
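To make the intuition behind this effect concrete, the following is a minimal,
hypothetical simulation (not the paper's experiment or data): an autoregressive
series of attempted attacks is thinned by a time-varying success probability
standing in for defensive filtering, and a naive AR(1) predictor is fit to both
the raw and the filtered counts. All parameter values and the choice of
binomial thinning are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate daily counts of attempted attacks with simple AR(1)-like dynamics
# plus Poisson noise (illustrative, not the paper's data-generating process).
T = 2000
attempts = np.zeros(T)
attempts[0] = 100
for t in range(1, T):
    mean = 20 + 0.8 * attempts[t - 1]
    attempts[t] = rng.poisson(mean)

# "Filtering": each attempt succeeds independently with a probability that
# drifts over time, standing in for the defender's changing security posture.
success_prob = 0.02 + 0.015 * np.sin(np.linspace(0, 20, T)) \
    + 0.005 * rng.standard_normal(T)
success_prob = np.clip(success_prob, 0.001, 1.0)
successes = rng.binomial(attempts.astype(int), success_prob)

def ar1_mae(series):
    """Mean absolute error of predicting y[t] from a linear fit on y[t-1]."""
    x, y = series[:-1], series[1:]
    slope, intercept = np.polyfit(x, y, 1)
    return np.mean(np.abs(y - (slope * x + intercept)))

# Normalize each error by the series' own mean so the scales are comparable.
print("relative MAE, attempted attacks:", ar1_mae(attempts) / attempts.mean())
print("relative MAE, successful attacks:", ar1_mae(successes) / successes.mean())
```

Under these assumptions the filtered series typically shows a larger relative
forecast error: the predictor must implicitly track the drifting filter as well
as the underlying attack dynamics, mirroring the loss of predictability the
paper quantifies on real-world data.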