Contextual word embeddings such as BERT have achieved state-of-the-art
performance on numerous NLP tasks. Because they are optimized to capture the
statistical properties of training data, they tend to pick up on and amplify
social stereotypes present in the data as well. In this study, we (1)~propose a
template-based method to quantify bias in BERT; (2)~show that this method
obtains more consistent results in capturing social biases than the traditional
cosine-based method; and (3)~conduct a case study evaluating gender bias in the
downstream task of Gender Pronoun Resolution. Although our case study focuses
on gender bias, the proposed technique generalizes to uncovering other biases,
such as racial and religious biases, including in multiclass settings.