Attention-based neural networks have achieved state-of-the-art results on a
wide range of tasks. Most such models use deterministic attention, while
stochastic attention remains less explored due to optimization difficulties
and complicated model designs.
complicated model design. This paper introduces Bayesian attention belief
networks, which construct a decoder network by modeling unnormalized attention
weights with a hierarchy of gamma distributions, and an encoder network by
stacking Weibull distributions with a deterministic-upward-stochastic-downward
structure to approximate the posterior. The resulting auto-encoding networks
can be optimized in a differentiable way with a variational lower bound. It is
simple to convert any models with deterministic attention, including pretrained
ones, to the proposed Bayesian attention belief networks. On a variety of
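To make this construction concrete, the following is a minimal sketch, in PyTorch, of two ingredients described above: a reparameterized Weibull sample of unnormalized attention weights, and the analytic KL divergence between a Weibull posterior and a gamma prior that enters the variational lower bound. The fixed shape parameter k, the exp(scores) mean parameterization, and all function names are illustrative assumptions for this sketch; the paper's full hierarchy of gamma distributions and its deterministic-upward-stochastic-downward encoder are omitted.

```python
import math
import torch

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def weibull_attention(scores: torch.Tensor, k: float = 64.0,
                      eps: float = 1e-8) -> torch.Tensor:
    """Draw differentiable stochastic attention weights (a sketch).

    exp(scores) is treated as the mean of a Weibull variational
    distribution over unnormalized attention weights; the shape k is
    kept fixed here for simplicity.
    """
    # Scale lambda chosen so E[w] = lambda * Gamma(1 + 1/k) = exp(scores).
    lam = scores.exp() / math.gamma(1.0 + 1.0 / k) + eps
    u = torch.rand_like(lam).clamp(eps, 1.0 - eps)  # uniform noise
    w = lam * (-torch.log1p(-u)).pow(1.0 / k)       # inverse-CDF (reparameterized) sample
    return w / w.sum(dim=-1, keepdim=True)          # normalize over keys


def kl_weibull_gamma(k: float, lam: torch.Tensor,
                     alpha: float, beta: float) -> torch.Tensor:
    """Analytic KL(Weibull(k, lam) || Gamma(alpha, beta)), with beta the rate.

    This closed form is what keeps the variational lower bound cheap to
    compute when Weibull posteriors are paired with gamma priors.
    """
    return (EULER_GAMMA * alpha / k
            - alpha * torch.log(lam)
            + math.log(k)
            + beta * lam * math.gamma(1.0 + 1.0 / k)
            - EULER_GAMMA - 1.0
            - alpha * math.log(beta)
            + math.lgamma(alpha))
```

In this sketch, a deterministic attention layer becomes stochastic by replacing softmax(scores) with weibull_attention(scores) and adding the KL term, summed over attention positions, to the training loss.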
On a variety of language understanding tasks, we show that our method
outperforms deterministic attention and state-of-the-art stochastic attention
in accuracy, uncertainty estimation, generalization across domains, and
robustness to adversarial attacks. We further demonstrate the general
applicability of our method to neural machine translation and visual question
answering, showing its potential for incorporation into a wide range of
attention-based tasks.