Complex heterogeneous dynamic networks like knowledge graphs are powerful
constructs for modeling data provenance from computer systems.
From a security perspective, these attributed graphs enable causality analysis
and tracing for analyzing a myriad of cyberattacks. However, there has been little
systematic development of pipelines that transform system executions and
provenance into usable graph representations for machine learning tasks. This
lack of instrumentation severely inhibits scientific advancement in provenance
graph machine learning by hindering reproducibility and limiting the
availability of data that are critical for techniques like graph neural
networks. To address this need, we present Flurry, an end-to-end data pipeline
that simulates cyberattacks, captures provenance data from these attacks at
multiple system and application layers, converts the resulting audit logs
into data provenance graphs, and integrates these data with a framework for
training deep neural models that supports preconfigured or custom-designed
models for analysis in real-world resilient systems. We showcase this pipeline
by processing data from multiple system attacks and performing anomaly
detection via graph classification using current benchmark graph
representation learning frameworks. Flurry is a fast, customizable,
extensible, and transparent solution for delivering these much-needed data to
cybersecurity professionals.