Abstract
In this work, we investigate a particular implicit bias of the gradient
descent training process, which we term "Feature Averaging", and argue that it
is one of the principal factors contributing to the non-robustness of deep
neural networks. Even when multiple discriminative features are each capable
of classifying the data on their own, neural networks trained by gradient
descent tend to learn the average (or a certain combination) of these
features, rather than distinguishing and leveraging each feature individually. In
particular, we provide a detailed theoretical analysis of the training dynamics
of gradient descent in a two-layer ReLU network for a binary classification
task, where the data distribution consists of multiple clusters with orthogonal
cluster center vectors. We rigorously prove that gradient descent converges to
the regime of feature averaging, wherein the weights associated with each
hidden-layer neuron represent an average of the cluster centers (each center
corresponding to a distinct feature). This feature-averaging regime renders
the network classifier non-robust: a small perturbation aligned with the
negative direction of the averaged features suffices to flip its predictions.
Furthermore, we prove that, given more fine-grained supervision (cluster-level
labels rather than binary labels), a two-layer multi-class neural network can
learn the individual features, from which one can derive a binary classifier
with optimal robustness in our setting. In addition, we
conduct extensive experiments on synthetic datasets, MNIST, and CIFAR-10 to
substantiate the feature-averaging phenomenon and its role in the adversarial
robustness of neural networks. We hope these theoretical and empirical
insights offer a deeper understanding of how gradient descent training shapes
the feature learning process, which in turn influences the robustness of the
network, and of how more detailed supervision may enhance model robustness.
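To make the setting concrete, below is a minimal NumPy sketch, based only on the description above, of the multi-cluster data model with orthogonal centers, a feature-averaging classifier, an individual-feature classifier, and the attack along the negative direction of the averaged features. All concrete values (dimension, number of clusters, noise scale, label split, attack budget) are illustrative assumptions, not the paper's parameters.

```python
# Illustrative sketch of "feature averaging" in a multi-cluster data model.
# All concrete parameters below are assumptions made for demonstration only.
import numpy as np

rng = np.random.default_rng(0)

d, k = 128, 4                                 # ambient dimension, number of clusters
centers = np.eye(d)[:k] * np.sqrt(d)          # orthogonal cluster centers mu_1..mu_k
cluster_label = np.array([+1, +1, -1, -1])    # assumed binary label of each cluster

def sample(n):
    """Draw n points: x = mu_j + Gaussian noise, y = binary label of cluster j."""
    j = rng.integers(0, k, size=n)
    x = centers[j] + rng.normal(scale=1.0, size=(n, d))
    return x, cluster_label[j]

# Feature-averaging classifier: its weight vector is the signed average of the
# cluster centers -- the kind of solution the abstract says gradient descent finds.
w_avg = (cluster_label[:, None] * centers).sum(axis=0) / k

def f_averaging(x):
    return np.sign(x @ w_avg)

# Individual-feature classifier: score each cluster direction separately, then
# report that cluster's label -- derivable from multi-class (cluster-level) supervision.
def f_individual(x):
    scores = x @ centers.T                    # alignment with each mu_j
    return cluster_label[np.argmax(scores, axis=1)]

x, y = sample(1000)

# Attack aligned with the negative direction of the averaged features: a
# perturbation along -y * w_avg flips the averaging classifier while leaving
# the individual-feature classifier largely intact.
eps = 3.0 * np.sqrt(d) / k                    # assumed attack budget
x_adv = x - (y[:, None] * w_avg / np.linalg.norm(w_avg)) * eps

for name, f in [("averaging", f_averaging), ("individual", f_individual)]:
    print(f"{name:>10}: clean acc {(f(x) == y).mean():.2f}, "
          f"adversarial acc {(f(x_adv) == y).mean():.2f}")
```

Run as-is, both classifiers fit the clean data, but the same small perturbation collapses the averaging classifier's accuracy while the individual-feature classifier remains largely unaffected, mirroring the robustness gap described above.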