These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
Clustering is a fundamental tool in statistical machine learning in the
presence of heterogeneous data. Most recent results focus primarily on optimal
mislabeling guarantees when data are distributed around centroids with
sub-Gaussian errors. Yet, the restrictive sub-Gaussian model is often invalid
in practice since various real-world applications exhibit heavy tail
distributions around the centroids or suffer from possible adversarial attacks
that call for robust clustering with a robust data-driven initialization. In
this paper, we present initialization and subsequent clustering methods that
provably guarantee near-optimal mislabeling for general mixture models when the
number of clusters and data dimensions are finite. We first introduce a hybrid
clustering technique with a novel multivariate trimmed mean type centroid
estimate to produce mislabeling guarantees under a weak initialization
condition for general error distributions around the centroids. A matching
lower bound is derived, up to factors depending on the number of clusters. In
addition, our approach also produces similar mislabeling guarantees even in the
presence of adversarial outliers. Our results reduce to the sub-Gaussian case
in finite dimensions when errors follow sub-Gaussian distributions. To solve
the problem thoroughly, we also present novel data-driven robust initialization
techniques and show that, with probabilities approaching one, these initial
centroid estimates are sufficiently good for the subsequent clustering
algorithm to achieve the optimal mislabeling rates. Furthermore, we demonstrate
that the Lloyd algorithm is suboptimal for more than two clusters even when
errors are Gaussian and for two clusters when error distributions have heavy
tails. Both simulated data and real data examples further support our robust
initialization procedure and clustering algorithm.