Phishing attacks remain among the most prevalent and persistent
cybersecurity threats, with attackers continuously evolving and intensifying
their tactics to evade conventional detection systems. Despite significant advances in
artificial intelligence and machine learning, faithfully reproducing the
interpretable reasoning that underpins phishing judgments, spanning both
classification and explanation, remains challenging. Owing to recent advances
in Natural Language Processing, Large Language Models (LLMs) offer a promising
direction for improving domain-specific phishing classification tasks.
However, enhancing the reliability and robustness of classification models
requires not only accurate predictions from LLMs but also consistent and
trustworthy explanations that align with those predictions. A key
question therefore remains: can LLMs not only classify phishing emails accurately but
also generate explanations that are reliably aligned with their predictions and
internally self-consistent? To answer this question, we fine-tuned
transformer-based models, including BERT, Llama models, and Wizard, using
Binary Sequence Classification, Contrastive Learning (CL), and Direct
Preference Optimization (DPO), to improve domain relevance and tailor them to
phishing-specific distinctions. We then examined their performance in
phishing classification and explainability by applying the Consistency measure
based on SHAPley values (CC-SHAP), which quantifies prediction-explanation
token alignment, to assess each model's internal faithfulness and consistency
and to uncover the rationale behind its predictions. Overall, our findings show
that the Llama models exhibit stronger prediction-explanation token alignment,
reflected in higher CC-SHAP scores, despite less reliable decision-making
accuracy, whereas Wizard achieves better prediction accuracy but lower CC-SHAP scores.