Abstract
Deep neural networks are vulnerable to adversarial attacks and hard to
interpret because of their black-box nature. The recently proposed invertible
network is able to accurately reconstruct the inputs to a layer from its
outputs, and thus has the potential to unravel the black-box model. An invertible
network classifier can be viewed as a two-stage model: (1) an invertible
transformation from the input space to the feature space; (2) a linear classifier
in the feature space. We can determine the decision boundary of a linear
classifier in the feature space; since the transform is invertible, we can
invert the decision boundary from the feature space to the input space.
Furthermore, we propose to determine the projection of a data point onto the
decision boundary, and define the explanation as the difference between the data
point and its projection. Finally, we propose to locally approximate a neural network
with its first-order Taylor expansion, and define feature importance using a
local linear model. We provide an implementation of our method at
https://github.com/juntang-zhuang/explain_invertible.
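
The two-stage view described above can be illustrated with a minimal sketch. It is a toy example under stated assumptions and is not the authors' implementation. The invertible transform is a hypothetical two-dimensional affine coupling layer, the weights `W` and `B` are arbitrary, and the boundary point is obtained by orthogonally projecting in the feature space and inverting, which gives a point on the input-space decision boundary but not necessarily the nearest one. The gradient for the first-order Taylor model is estimated with central finite differences.

```python
import math

# Hypothetical toy invertible map (an affine coupling layer), standing in
# for the invertible network; it is NOT the architecture from the paper.
def s(u):
    return 0.5 * math.tanh(u)  # log-scale applied to the second coordinate

def t(u):
    return u * u               # shift applied to the second coordinate

def forward(x):
    """Input space -> feature space."""
    x1, x2 = x
    return (x1, x2 * math.exp(s(x1)) + t(x1))

def inverse(z):
    """Feature space -> input space (exact inverse of forward)."""
    z1, z2 = z
    return (z1, (z2 - t(z1)) * math.exp(-s(z1)))

# Assumed linear classifier in feature space: boundary is W.z + B = 0.
W = (1.0, -2.0)
B = 0.5

def logit(x):
    z = forward(x)
    return W[0] * z[0] + W[1] * z[1] + B

def boundary_point(x):
    """Project onto the hyperplane in feature space, then invert.

    Assumption: the feature-space projection is used as a stand-in for
    the projection onto the input-space decision boundary.
    """
    z = forward(x)
    score = W[0] * z[0] + W[1] * z[1] + B
    norm_sq = W[0] ** 2 + W[1] ** 2
    z_proj = (z[0] - score * W[0] / norm_sq, z[1] - score * W[1] / norm_sq)
    return inverse(z_proj)

def explanation(x):
    """Difference between the data point and its boundary point."""
    xp = boundary_point(x)
    return (x[0] - xp[0], x[1] - xp[1])

def gradient(x, eps=1e-6):
    """Central finite-difference gradient of the logit w.r.t. the input."""
    g = []
    for i in range(2):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        g.append((logit(hi) - logit(lo)) / (2 * eps))
    return tuple(g)

def taylor_logit(x0, x):
    """First-order Taylor approximation of the logit around x0."""
    g = gradient(x0)
    return logit(x0) + g[0] * (x[0] - x0[0]) + g[1] * (x[1] - x0[1])
```

For a point such as `x = (1.0, 2.0)`, `explanation(x)` returns the displacement from the boundary point to `x`, and `gradient(x)` gives the coefficients of the local linear model, which can be read as per-feature importance in this toy setting.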