Machine learning (ML) for text classification has been widely used in various
domains. These applications can significantly impact ethics, economics, and
human behavior, raising serious concerns about trusting ML decisions. Studies
indicate that conventional metrics are insufficient to build human trust in ML
models. These models often learn spurious correlations and predict based on
them, so their performance can deteriorate significantly in the real world. To
avoid this, a common practice is to test whether predictions are reasonable
based on valid patterns in the data. This practice comes with a challenge
known as the trustworthiness oracle problem: because automated trustworthiness
oracles are lacking, the assessment requires manual validation of the decision
process disclosed by explanation methods. However, manual validation is
time-consuming, error-prone, and unscalable.
We propose TOKI, the first automated trustworthiness oracle generation method
for text classifiers. TOKI automatically checks whether the words contributing
the most to a prediction are semantically related to the predicted class.
Specifically, we leverage ML explanations to extract the decision-contributing
words and measure their semantic relatedness with the class based on word
embeddings. We also introduce a novel adversarial attack method that targets
trustworthiness vulnerabilities identified by TOKI. We conduct experiments to
evaluate the alignment of our methods with human judgement. We compare TOKI
with a naive baseline based solely on model confidence, and the TOKI-guided
adversarial attack method with A2T, a state-of-the-art (SOTA) adversarial
attack method. Results show that (1) relying on prediction uncertainty cannot
effectively distinguish between trustworthy and untrustworthy predictions,
(2) TOKI achieves 142% higher accuracy than the naive baseline, and (3) the
TOKI-guided attack method is more effective with fewer perturbations than A2T.
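
The core check described above (take the words that contribute most to a
prediction and test whether they are semantically related to the predicted
class) can be illustrated with a minimal Python sketch. This is not TOKI's
actual procedure: the toy embedding values, the function names, the top-k
cutoff, the mean aggregation, and the 0.5 threshold are all assumptions made
for illustration, and the attribution scores stand in for the output of an
explanation method.

```python
import math

# Toy 3-dimensional word embeddings. The values are made up for illustration;
# a real setup would load pretrained word vectors instead.
EMBEDDINGS = {
    "sports": [0.9, 0.1, 0.0],
    "football": [0.8, 0.2, 0.1],
    "goal": [0.7, 0.3, 0.0],
    "tuesday": [0.0, 0.1, 0.9],
    "said": [0.1, 0.0, 0.8],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_words(attributions, k):
    """Return the k words with the highest attribution scores."""
    ranked = sorted(attributions.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:k]]


def is_trustworthy(attributions, class_name, k=2, threshold=0.5):
    """Hypothetical oracle: the prediction counts as trustworthy when the
    mean similarity between the top-k words and the class name reaches
    the threshold. Words without an embedding are skipped."""
    class_vec = EMBEDDINGS[class_name]
    sims = [
        cosine(EMBEDDINGS[word], class_vec)
        for word in top_words(attributions, k)
        if word in EMBEDDINGS
    ]
    if not sims:
        return False
    return sum(sims) / len(sims) >= threshold


# Attribution scores as an explanation method might report them (made up).
valid_reasoning = {"football": 0.6, "goal": 0.3, "said": 0.05}
spurious_reasoning = {"tuesday": 0.7, "said": 0.2, "goal": 0.05}

print(is_trustworthy(valid_reasoning, "sports"))     # top words relate to the class
print(is_trustworthy(spurious_reasoning, "sports"))  # top words do not
```

In this sketch, both predictions could carry the same model confidence, yet
only the first one rests on words related to the class "sports". That is the
distinction a confidence-only baseline cannot make.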