Adversarial examples pose a security risk as they can alter decisions of a
machine learning classifier through slight input perturbations. Certified
robustness has been proposed as a mitigation: given an input $\mathbf{x}$,
a classifier returns a prediction and a certified radius $R$ with a provable
guarantee that any perturbation to $\mathbf{x}$ with $R$-bounded norm will not
alter the classifier's prediction. In this work, we show that these guarantees
can be invalidated due to limitations of floating-point representation that
cause rounding errors. We design a rounding search method that can efficiently
exploit this vulnerability to find adversarial examples against
state-of-the-art certifications in two threat models that differ in how the
norm of the perturbation is computed. We show that the attack can be carried
out against linear classifiers that have exact certifiable guarantees and
against neural networks that have conservative certifications. In the weak
threat model, our experiments demonstrate attack success rates over 50% on
random linear classifiers, up to 23% on the MNIST dataset for linear SVM, and
up to 15% for a neural network. In the strong threat model, the success rates
are lower but positive. The floating-point errors exploited by our attacks can
range from small to large (e.g., $10^{-13}$ to $10^{3}$), showing that even
negligible errors can be systematically exploited to invalidate guarantees
provided by certified robustness. Finally, we propose a formal mitigation
approach based on rounded interval arithmetic, encouraging future
implementations of robustness certificates to account for limitations of modern
computing architecture to provide sound certifiable guarantees.
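
As an illustration of the mitigation idea, the following is a minimal sketch and not the implementation from this work. It assumes a binary linear classifier $\mathrm{sign}(\mathbf{w}^\top \mathbf{x} + b)$ whose exact $\ell_2$ certified radius is $R = |\mathbf{w}^\top \mathbf{x} + b| / \|\mathbf{w}\|_2$ with $\mathbf{w} \neq \mathbf{0}$. The names `Interval`, `down`, `up`, `sound_radius`, and `is_sound` are hypothetical. Widening each result by one unit in the last place with `math.nextafter` (Python 3.9+) stands in for hardware directed rounding, and it relies on the platform providing IEEE 754 round-to-nearest arithmetic and a correctly rounded square root.

```python
import math
from dataclasses import dataclass
from fractions import Fraction


def down(v: float) -> float:
    """Next float below v: a safe lower bound for a round-to-nearest result."""
    return math.nextafter(v, -math.inf)


def up(v: float) -> float:
    """Next float above v: a safe upper bound for a round-to-nearest result."""
    return math.nextafter(v, math.inf)


@dataclass(frozen=True)
class Interval:
    """Closed interval [lo, hi] that encloses the exact real-valued result."""
    lo: float
    hi: float

    def __add__(self, other: "Interval") -> "Interval":
        return Interval(down(self.lo + other.lo), up(self.hi + other.hi))

    def __mul__(self, other: "Interval") -> "Interval":
        products = [a * b for a in (self.lo, self.hi) for b in (other.lo, other.hi)]
        return Interval(down(min(products)), up(max(products)))


def point(v: float) -> Interval:
    """Degenerate interval holding a single float exactly."""
    return Interval(v, v)


def naive_radius(w, b, x) -> float:
    """R = |w.x + b| / ||w||_2 in plain float64; may round above the exact R."""
    margin = sum(wi * xi for wi, xi in zip(w, x)) + b
    return abs(margin) / math.sqrt(sum(wi * wi for wi in w))


def sound_radius(w, b, x) -> float:
    """Lower bound on the exact R, computed with outward-rounded intervals."""
    margin = point(b)
    norm_sq = point(0.0)
    for wi, xi in zip(w, x):
        margin = margin + point(wi) * point(xi)
        norm_sq = norm_sq + point(wi) * point(wi)
    # Smallest possible |margin|; it is 0 if the interval straddles zero.
    abs_margin_lo = max(margin.lo, -margin.hi, 0.0)
    # Largest possible ||w||_2.
    norm_hi = up(math.sqrt(norm_sq.hi))
    return max(0.0, down(abs_margin_lo / norm_hi))


def is_sound(r: float, w, b, x) -> bool:
    """Check r <= R exactly with rationals (every finite float is a rational)."""
    margin = sum(Fraction(wi) * Fraction(xi) for wi, xi in zip(w, x)) + Fraction(b)
    norm_sq = sum(Fraction(wi) ** 2 for wi in w)
    return Fraction(r) ** 2 * norm_sq <= margin ** 2


if __name__ == "__main__":
    w, b, x = [0.1, -0.3, 0.7], 0.05, [1.1, 2.2, 3.3]
    print("naive:", naive_radius(w, b, x), "sound:", sound_radius(w, b, x))
    print("sound radius verified exactly:", is_sound(sound_radius(w, b, x), w, b, x))
```

The value returned by `sound_radius` is slightly conservative by construction, so a certificate built on it cannot claim a radius larger than the exact one. The plain `naive_radius` offers no such guarantee, which is the kind of gap the rounding search attack looks for.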
External datasets
MNIST