These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
The vast majority of theoretical results in machine learning and statistics
assume that the available training data is a reasonably reliable reflection of
the phenomena to be learned or estimated. Similarly, the majority of machine
learning and statistical techniques used in practice are brittle to the
presence of large amounts of biased or malicious data. In this work we consider
two frameworks in which to study estimation, learning, and optimization in the
presence of significant fractions of arbitrary data.
The first framework, list-decodable learning, asks whether it is possible to
return a list of answers, with the guarantee that at least one of them is
accurate. For example, given a dataset of $n$ points for which an unknown
subset of $\alpha n$ points are drawn from a distribution of interest, and no
assumptions are made about the remaining $(1-\alpha)n$ points, is it possible
to return a list of $\operatorname{poly}(1/\alpha)$ answers, one of which is
correct? The second framework, which we term the semi-verified learning model,
considers the extent to which a small dataset of trusted data (drawn from the
distribution in question) can be leveraged to enable the accurate extraction of
information from a much larger but untrusted dataset (of which only an
$\alpha$-fraction is drawn from the distribution).
We show strong positive results in both settings, and provide an algorithm
for robust learning in a very general stochastic optimization setting. This
general result has immediate implications for robust estimation in a number of
settings, including for robustly estimating the mean of distributions with
bounded second moments, robustly learning mixtures of such distributions, and
robustly finding planted partitions in random graphs in which significant
portions of the graph have been perturbed by an adversary.