Abstract
Despite their impressive performance in classification tasks, neural networks
are known to be vulnerable to adversarial attacks: subtle perturbations of the
input data designed to deceive the model. In this work, we investigate the
correlation between these perturbations and the implicit bias of neural
networks trained with gradient-based algorithms. To this end, we analyse a
representation of the network's implicit bias through the lens of the Fourier
transform. Specifically, we identify unique fingerprints of implicit bias and
adversarial attacks by calculating the minimal, essential frequencies needed
for accurate classification of each image, as well as the frequencies that
drive misclassification in its adversarially perturbed counterpart. This
approach enables us to uncover and analyse the correlation between these
essential frequencies, providing a precise map of how the network's biases
align or contrast with the frequency components exploited by adversarial
attacks. Among other methods, we use a newly introduced technique
capable of detecting nonlinear correlations between high-dimensional datasets.
Our results provide empirical evidence that the network bias in Fourier space
and the target frequencies of adversarial attacks are highly correlated and
suggest new potential strategies for adversarial defence.
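
To make the idea of keeping only some frequencies of an image concrete, the following is a minimal sketch of applying a binary mask in Fourier space and transforming back. It is an illustration under stated assumptions, not the procedure used in this work: the helper names `apply_fourier_mask` and `low_pass_mask` are hypothetical, and the mask here is a fixed low-pass disc chosen by hand, whereas the abstract describes computing the essential frequencies for each image.

```python
import numpy as np

def apply_fourier_mask(image, mask):
    """Keep only the frequencies where mask is nonzero; return the real part of the result."""
    spectrum = np.fft.fft2(image)          # 2D discrete Fourier transform
    filtered = spectrum * mask             # zero out the masked-off frequencies
    return np.real(np.fft.ifft2(filtered))  # back to pixel space

def low_pass_mask(shape, radius):
    """Hypothetical hand-made mask: 1 within `radius` of the zero frequency, 0 elsewhere."""
    fy = np.fft.fftfreq(shape[0]) * shape[0]  # integer frequency indices along rows
    fx = np.fft.fftfreq(shape[1]) * shape[1]  # integer frequency indices along columns
    dist = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    return (dist <= radius).astype(float)

# Illustrative data only: a random 8x8 "image".
rng = np.random.default_rng(0)
image = rng.random((8, 8))

# An all-ones mask keeps every frequency, so the image is recovered unchanged.
full = apply_fourier_mask(image, np.ones((8, 8)))

# Radius 0 keeps only the zero frequency, so every pixel becomes the image mean.
dc_only = apply_fourier_mask(image, low_pass_mask((8, 8), 0))
```

In this sketch, a sparser mask that still leaves a classifier's prediction unchanged would play the role of the "essential frequencies"; how such a mask is actually obtained is not specified in the abstract and is left out here.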