Deep neural networks are vulnerable to adversarial attacks and are hard to
interpret because of their black-box nature. The recently proposed invertible
network can accurately reconstruct the inputs to a layer from its outputs,
and thus has the potential to unravel the black-box model. An invertible
network classifier can be viewed as a two-stage model: (1) an invertible
transformation from the input space to the feature space; (2) a linear
classifier in the feature space. We can determine the decision boundary of
the linear classifier in the feature space; since the transformation is
invertible, we can then map the decision boundary from the feature space
back to the input space.
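As a minimal sketch of this two-stage view (a toy setup with hypothetical names, not the paper's actual implementation), consider an invertible affine map as the feature transform and a linear classifier on top of it; any point on the feature-space hyperplane maps to an exact point on the input-space decision boundary:

```python
import numpy as np

# Toy two-stage classifier: invertible transform z = A x + c, then a
# linear decision w^T z + b in feature space. All names are illustrative.
A = np.array([[2.0, 0.3], [-0.1, 1.5]])  # invertible by construction
c = np.array([0.5, -0.3])
w = np.array([1.0, -2.0])
b = 0.7

def forward(x):
    """Invertible transform: input space -> feature space."""
    return A @ x + c

def inverse(z):
    """Exact inverse: feature space -> input space."""
    return np.linalg.solve(A, z - c)

def classify(x):
    """Linear classifier applied in feature space."""
    return np.sign(w @ forward(x) + b)

# A feature-space boundary point (w^T z0 + b = 0) pulled back through the
# inverse transform lies exactly on the input-space decision boundary.
z0 = -b * w / (w @ w)        # closest boundary point to the origin
x0 = inverse(z0)
print(np.isclose(w @ forward(x0) + b, 0.0))
```

Because the transform is exactly invertible (unlike a generic encoder), no approximation is introduced when pulling the boundary back to the input space.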
Furthermore, we propose to determine the projection of a data point onto the
decision boundary, and define the explanation as the difference between the
data point and its projection. Finally, we propose to locally approximate a
neural network with its first-order Taylor expansion, and define feature
importance using the resulting local linear model. We provide an
implementation of our method at
\url{https://github.com/juntang-zhuang/explain_invertible}.
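The two proposed explanations can be sketched concretely in the same toy affine setting (our own illustrative construction, not the repository's code): project the feature vector onto the hyperplane $w^\top z + b = 0$, map the projection back to the input space, and read per-feature importance off a first-order Taylor (local linear) approximation of the network:

```python
import numpy as np

# Toy setup: invertible transform z = A x + c, linear classifier w^T z + b.
A = np.array([[2.0, 0.3], [-0.1, 1.5]])
c = np.array([0.5, -0.3])
w = np.array([1.0, -2.0])
b = 0.7

def forward(x):  return A @ x + c             # input -> feature space
def inverse(z):  return np.linalg.solve(A, z - c)
def logit(x):    return w @ forward(x) + b    # pre-sign classifier output

x = np.array([1.0, 2.0])
z = forward(x)

# (1) Project z onto the feature-space hyperplane w^T z + b = 0, pull the
# projection back to the input space; the explanation is x - x_proj.
z_proj = z - (w @ z + b) / (w @ w) * w
x_proj = inverse(z_proj)
explanation = x - x_proj

# (2) First-order Taylor expansion: logit(x + d) ~ logit(x) + g^T d, with
# the gradient g estimated by central finite differences; per-feature
# importance is g * (x - x_proj) under this local linear model.
eps = 1e-6
g = np.array([(logit(x + eps * e) - logit(x - eps * e)) / (2 * eps)
              for e in np.eye(2)])
importance = g * explanation
```

In this toy case the importances sum to `logit(x)`, since the projected point sits exactly on the boundary where the logit vanishes.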