Data minimization (DM) describes the principle of collecting only the data
strictly necessary for a given task. It is a foundational principle across
major data protection regulations like GDPR and CPRA. Violations of this
principle have substantial real-world consequences, with regulatory actions
resulting in fines reaching hundreds of millions of dollars. Notably, the
relevance of data minimization is particularly pronounced in machine learning
(ML) applications, which typically rely on large datasets, resulting in an
emerging research area known as Data Minimization in Machine Learning (DMML).
At the same time, existing work on other ML privacy and security topics often
addresses concerns relevant to DMML without explicitly acknowledging the
connection. This disconnect leads to confusion among practitioners,
complicating their efforts to implement DM principles and interpret the
terminology, metrics, and evaluation criteria used across different research
communities. To address this gap, our work introduces a comprehensive framework
for DMML, including a unified data pipeline, adversaries, and points of
minimization. This framework allows us to systematically review the literature
on data minimization and \emph{DM-adjacent} methodologies, for the first time
presenting a structured overview designed to help practitioners and researchers
effectively apply DM principles. Our work facilitates a unified DM-centric
understanding and broader adoption of data minimization strategies in AI/ML.