Abstract
With growing credit card transaction volumes, the proportion of fraudulent transactions is also rising, along with the overhead costs institutions incur to combat fraud and compensate victims. The integration of machine learning into the financial sector permits more effective protection against fraud and other economic crime. Suitably trained machine learning classifiers enable proactive fraud detection, improving stakeholder trust and robustness against illicit transactions. However, the design of machine learning based fraud detection algorithms has been challenging and slow, due to the massively imbalanced nature of fraud data and the difficulty of identifying frauds accurately and completely enough to create a gold standard ground truth. Furthermore, there are no benchmarks or standard classifier evaluation metrics to measure and identify the better performing classifiers, leaving researchers in the dark.
In this work, we develop a theoretical foundation for modeling the human annotation errors and extreme class imbalance typical of real world fraud detection data sets. Through empirical experiments on a hypothetical classifier, with a synthetic data distribution approximating a popular real world credit card fraud data set, we simulate human annotation errors and extreme imbalance and observe the behavior of popular machine learning classifier evaluation metrics. We demonstrate that the combination of F1 score and g-mean, applied in that specific order, is the best evaluation metric for typical imbalanced fraud detection model classification.
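
As a concrete illustration (not taken from the paper), the minimal Python sketch below computes the two recommended metrics, F1 score and g-mean (the geometric mean of sensitivity and specificity), on a synthetic extremely imbalanced data set with simulated annotation errors. The class ratio (~0.2% positives), the 5% label-flip rate, and the logistic regression baseline are illustrative assumptions, not the paper's experimental setup.

```python
# Hedged sketch: evaluating a classifier on extremely imbalanced data with
# simulated human annotation errors, then reporting F1 score and g-mean.
# The imbalance ratio, flip rate, and model choice are assumptions for
# illustration only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data with extreme imbalance (~0.2% positives), loosely mimicking
# public credit card fraud data sets.
X, y = make_classification(n_samples=100_000, n_features=10,
                           weights=[0.998], flip_y=0.0, random_state=0)

# Simulate human annotation errors: mislabel 5% of frauds as genuine.
noisy = y.copy()
fraud_idx = np.flatnonzero(noisy == 1)
flipped = rng.choice(fraud_idx, size=max(1, int(0.05 * fraud_idx.size)),
                     replace=False)
noisy[flipped] = 0

X_tr, X_te, y_tr, y_te = train_test_split(X, noisy, test_size=0.3,
                                          stratify=noisy, random_state=0)

# class_weight="balanced" keeps the baseline from predicting all-negative.
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)

tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
sensitivity = tp / (tp + fn)   # recall on the fraud (positive) class
specificity = tn / (tn + fp)   # recall on the genuine (negative) class
g_mean = np.sqrt(sensitivity * specificity)

print(f"F1:     {f1_score(y_te, pred):.3f}")
print(f"g-mean: {g_mean:.3f}")
```

Under the abstract's recommendation, F1 would be consulted first to rank candidate classifiers, with g-mean as the secondary criterion; the sketch simply shows how both quantities are derived from the same confusion matrix.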