In many, if not most, machine learning applications, the training data is
naturally heterogeneous (e.g., federated learning, adversarial attacks, and
domain adaptation in neural net training). Data heterogeneity has been
identified as one of the major challenges in modern large-scale learning. A
classical way
to represent heterogeneous data is via a mixture model. In this paper, we study
generalization performance and statistical rates when data is sampled from a
mixture distribution. We first characterize the heterogeneity of the mixture
in terms of the pairwise total variation distances between the sub-population
distributions. Thereafter, as a central theme of this paper, we characterize
the regime in which the mixture may be treated as a single (homogeneous)
distribution for learning. In particular, we study the generalization
performance under the classical PAC framework and the statistical error rates
for parametric (linear regression, mixture of hyperplanes) as well as
non-parametric (Lipschitz, convex, and H\"older-smooth) regression problems.
To do so, we derive Rademacher complexity and (local) Gaussian complexity
bounds for mixture data and apply them to obtain the generalization and
convergence rates, respectively. We observe that as the (regression)
function classes become more complex, the requirement on the pairwise total
variation distance becomes more stringent, which matches our intuition. We
also perform a
finer analysis for the case of mixed linear regression and provide a tight
bound on the generalization error in terms of heterogeneity.
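For concreteness, the setup can be sketched as follows (a minimal
illustration; the mixture weights $w_i$, sub-population distributions $P_i$,
and the heterogeneity parameter $\Delta$ are notation introduced for this
sketch, not necessarily the paper's exact definitions). Data is sampled from
\[
P \;=\; \sum_{i=1}^{k} w_i \, P_i, \qquad w_i \ge 0, \quad \sum_{i=1}^{k} w_i = 1,
\qquad
\Delta \;:=\; \max_{i \neq j} \, d_{\mathrm{TV}}(P_i, P_j),
\]
where $d_{\mathrm{TV}}(P_i, P_j) = \sup_{A} \lvert P_i(A) - P_j(A) \rvert$ is
the total variation distance. Small $\Delta$ corresponds to the regime in
which the mixture behaves, for learning purposes, like a single homogeneous
distribution.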
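As a reminder of the complexity measure involved (this is the standard
empirical Rademacher complexity; the precise mixture-dependent version used
in the paper may differ), for a function class $\mathcal{F}$ and samples
$x_1, \dots, x_n$ drawn i.i.d. from the mixture $P$,
\[
\widehat{\mathcal{R}}_n(\mathcal{F}) \;=\;
\mathbb{E}_{\sigma}\Big[\sup_{f \in \mathcal{F}}
\frac{1}{n} \sum_{j=1}^{n} \sigma_j f(x_j)\Big],
\]
where $\sigma_1, \dots, \sigma_n$ are i.i.d. Rademacher signs. Generalization
bounds then follow from standard uniform convergence arguments, with the
heterogeneity of the sub-populations entering through the mixture structure
of $P$.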