Child Sexual Abuse Media (CSAM) is any visual record of sexually explicit
activity involving minors. CSAM impacts victims differently from the actual
abuse because distribution never ends and the images are permanent. Machine
learning-based solutions can help law enforcement quickly identify CSAM and
block digital distribution. However, collecting CSAM imagery to train machine
learning models is subject to many ethical and legal constraints, creating a
barrier to research and development. Given these restrictions, developing
machine learning systems that detect CSAM from file metadata opens up several
opportunities. Metadata is not a record of a crime, and it is not subject to
legal restrictions. Therefore, investing in detection systems based on metadata can
increase the rate of discovery of CSAM and help thousands of victims. We
propose a framework for training and evaluating deployment-ready machine
learning models for CSAM identification. Our framework provides guidelines for
evaluating CSAM detection models against intelligent adversaries and for
assessing their performance on open data. We apply the proposed framework to the problem of
CSAM detection based on file paths. In our experiments, the best-performing
model is based on convolutional neural networks (CNNs) and achieves an accuracy of
0.97. By evaluating the CNN model against adversarially modified data, we show
that it is robust against offenders actively trying to evade detection.
Experiments with open datasets confirm that the
model generalizes well and is deployment-ready.