Advances in machine learning are improving the detection of malicious URLs.
However, advanced Transformers applied to URLs face difficulties in extracting
local information, character-level details, and structural relationships. To
address these challenges, we propose a novel approach for malicious URL
detection, named TransURL. The method co-trains a
character-aware Transformer with three feature modules: Multi-Layer Encoding,
Multi-Scale Feature Learning, and Spatial Pyramid Attention. This specialized
Transformer enables TransURL to extract embeddings with character-level
information from URL token sequences, with the three modules aiding the fusion
of multi-layer Transformer encodings and the capture of multi-scale local
details and structural relationships. The proposed method is evaluated across
several challenging scenarios, including class imbalance learning,
multi-classification, cross-dataset testing, and adversarial sample attacks.
Experimental results demonstrate a significant improvement compared to previous
methods. For instance, TransURL achieved a peak F1-score improvement of 40% in
class-imbalanced scenarios and surpassed the best baseline by 14.13% in
accuracy for adversarial attack scenarios. Additionally, a case study
demonstrated that our method accurately identified all 30 active malicious web
pages, whereas two previous state-of-the-art methods missed 4 and 7 malicious
web pages, respectively. The code and data are available at:
https://github.com/Vul-det/TransURL/.
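
To make the ideas of multi-scale local features and pyramid-style pooling more concrete, the following is a minimal illustrative sketch in plain Python. It is not the TransURL implementation: the function names (multi_scale_windows, pyramid_pool), the window widths, the pyramid levels, the example URL, and the use of a toy per-character score in place of Transformer embeddings are all assumptions made for illustration only.

```python
from typing import List, Sequence


def multi_scale_windows(values: Sequence[float],
                        widths: Sequence[int]) -> List[List[float]]:
    """For each window width, return the mean of every stride-1 sliding window.

    Hypothetical stand-in for multi-scale local feature extraction:
    small widths see short character patterns, large widths see longer ones.
    """
    scales = []
    for width in widths:
        if width > len(values):
            scales.append([])  # sequence too short for this scale
            continue
        scales.append([
            sum(values[i:i + width]) / width
            for i in range(len(values) - width + 1)
        ])
    return scales


def pyramid_pool(values: Sequence[float],
                 levels: Sequence[int]) -> List[float]:
    """Average-pool a 1D sequence into `bins` segments for each pyramid level.

    The output length is sum(levels) regardless of the input length, which is
    the property that makes pyramid pooling useful for variable-length input.
    """
    if not values:
        raise ValueError("values must be non-empty")
    n = len(values)
    pooled = []
    for bins in levels:
        for b in range(bins):
            start = (b * n) // bins
            # Guarantee at least one element per bin for short inputs.
            end = max(((b + 1) * n) // bins, start + 1)
            segment = values[start:end]
            pooled.append(sum(segment) / len(segment))
    return pooled


if __name__ == "__main__":
    url = "http://example.com/login"  # hypothetical example URL
    # Toy per-character score (1.0 for non-alphanumeric characters);
    # an assumption standing in for learned character-aware embeddings.
    scores = [0.0 if ch.isalnum() else 1.0 for ch in url]
    local = multi_scale_windows(scores, widths=(2, 3, 5))
    fixed = pyramid_pool(scores, levels=(1, 2, 4))
    print([len(scale) for scale in local], len(fixed))
```

In this sketch, the sliding windows capture local patterns at several granularities, while the pooling step summarizes the whole sequence at coarse and fine resolutions into a fixed-length vector. A real model would apply comparable operations to learned embedding tensors rather than scalar scores.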