Toward More Generalized Malicious URL Detection Models

Source

https://arxiv.org/abs/2202.10027

PDF

https://arxiv.org/pdf/2202.10027

Paper Information

Authors: YunDa Tsai; Cayon Liow; Yin Sheng Siang; Shou-De Lin
Published: 2022-02-21
Updated: 2024-02-10
Affiliation: National Taiwan University
Country: Taiwan
Conference: AAAI Conference on Artificial Intelligence (AAAI)

Labels Estimated by AI

Bias Token Distribution Analysis Impact of Generalization

These labels were automatically added by AI and may be inaccurate.

Abstract

This paper reveals a data bias issue that can severely degrade performance when building a machine learning model for malicious URL detection. We describe how such bias can be identified using interpretable machine learning techniques, and further argue that such biases naturally exist in the real-world security data used to train classification models. We then propose a debiased training strategy that can be applied to most deep-learning-based models to alleviate the negative effects of the biased features. The solution is based on self-supervised adversarial training, which trains deep neural networks to learn invariant embeddings from biased data. We conduct a wide range of experiments to demonstrate that the proposed strategy leads to significantly better generalization capability for both CNN-based and RNN-based detection models.
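The abstract names interpretable machine learning techniques as the way to identify the bias but gives no further detail here. As a much simpler first check, and explicitly not the paper's method, the Python sketch below assumes the bias shows up as URL tokens that occur almost exclusively in one class. The names `tokenize` and `class_skew`, the `min_count` threshold, and the toy URLs are all illustrative assumptions.

```python
import re
from collections import Counter


def tokenize(url):
    """Split a URL into lowercase alphanumeric tokens (an assumed, simplistic tokenizer)."""
    return [tok for tok in re.split(r"[^a-z0-9]+", url.lower()) if tok]


def class_skew(urls, labels, min_count=2):
    """Map each token to the fraction of URLs containing it that are malicious.

    Tokens seen in fewer than `min_count` URLs are skipped as too rare to judge.
    """
    total, malicious = Counter(), Counter()
    for url, label in zip(urls, labels):
        for tok in set(tokenize(url)):  # count each token once per URL
            total[tok] += 1
            if label == 1:
                malicious[tok] += 1
    return {tok: malicious[tok] / n for tok, n in total.items() if n >= min_count}


# Toy, made-up data: 1 = malicious, 0 = benign.
urls = [
    "http://example.com/login",
    "http://example.com/news",
    "http://free-hosting.example/login",
    "http://free-hosting.example/update",
]
labels = [0, 0, 1, 1]

skew = class_skew(urls, labels)
# Tokens that appear in only one class are candidates for biased features.
suspect = sorted(tok for tok, frac in skew.items() if frac in (0.0, 1.0))
print(suspect)
```

A token flagged this way is only a candidate: it may be a genuine signal rather than a collection artifact, so a frequency count like this can prompt closer inspection but cannot replace attribution-based analysis of a trained model.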

External Datasets

ISCX-URL-2016

VirusTotal daily queries

References

arXiv

Adversarial Invariant Feature Learning with Accuracy Constraint for Domain Generalization

Akuzawa, K., Iwasawa, Y., Matsuo, Y.

Published: 2019

Proceedings of the European Conference on Computer Vision (ECCV)

Turning a blind eye: Explicit removal of biases and variation from deep neural network embeddings

Alvi, M., Zisserman, A., Nellaker, C.

Published: 2018

2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Malware classification with LSTM and GRU language models and a character-level CNN

Athiwaratkun, B., Stokes, J. W.

Published: 2017

Data decisions and theoretical implications when adversarially learning fair representations

Beutel, A., Chen, J., Zhao, Z., Chi, E. H.

Published: 2017

arXiv

Censoring representations with an adversary

Edwards, H., Storkey, A.

Published: 2015

Journal of Information Security

Malware analysis and classification: A survey

Gandotra, E., Bansal, D., Sofat, S.

Published: 2014

Domain Adaptation in Computer Vision Applications

Domain-adversarial training of neural networks

Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., Lempitsky, V.

Published: 2017

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

Learning Not to Learn: Training Deep Neural Networks with Biased Data

Kim, B., Kim, H., Kim, K., Kim, S., Kim, J.

Published: 2019

arXiv

Cited by 4

URLNet: Learning a URL Representation with Deep Learning for Malicious URL Detection

Hung Le, Quang Pham, Doyen Sahoo, Steven C. H. Hoi

Published: 2018-02-09

Malicious URLs host unsolicited content and are used to perpetrate cybercrimes. It is imperative to detect them in a timely manner. Traditionally, this is done through the usage of blacklists, which cannot be exhaustive, and cannot detect newly generated malicious URLs. To address this, recent years have witnessed several efforts to perform Malicious URL Detection using Machine Learning. The most popular and scalable approaches use lexical properties of the URL string by extracting Bag-of-words like features, followed by applying machine learning models such as SVMs. There are also other features designed by experts to improve the prediction performance of the model. These approaches suffer from several limitations: (i) Inability to effectively capture semantic meaning and sequential patterns in URL strings; (ii) Requiring substantial manual feature engineering; and (iii) Inability to handle unseen features and generalize to test data. To address these challenges, we propose URLNet, an end-to-end deep learning framework to learn a nonlinear URL embedding for Malicious URL Detection directly from the URL. Specifically, we apply Convolutional Neural Networks to both characters and words of the URL String to learn the URL embedding in a jointly optimized framework. This approach allows the model to capture several types of semantic information, which was not possible by the existing models. We also propose advanced word-embeddings to solve the problem of too many rare words observed in this task. We conduct extensive experiments on a large-scale dataset and show a significant performance gain over existing methods. We also conduct ablation studies to evaluate the performance of various components of URLNet.

Machine Learning Method Membership Inference Model Inversion

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

Domain generalization with adversarial feature learning

Li, H., Jialin Pan, S., Wang, S., Kot, A. C.

Published: 2018

arXiv

The variational fair autoencoder

Louizos, C., Swersky, K., Li, Y., Welling, M., Zemel, R.

Published: 2015

Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on

Malware classification with recurrent networks

Pascanu, R., Stokes, J. W., Sanossian, H., Marinescu, M., Thomas, A.

Published: 2015

arXiv

Cited by 1

AAAI Workshops

Malware Detection by Eating a Whole EXE

Edward Raff, Jon Barker, Jared Sylvester, Robert Brandon, Bryan Catanzaro, Charles Nicholas

Published: 2017-10-26

In this work we introduce malware detection from raw byte sequences as a fruitful research area to the larger machine learning community. Building a neural network for such a problem presents a number of interesting challenges that have not occurred in tasks such as image processing or NLP. In particular, we note that detection from raw bytes presents a sequence problem with over two million time steps and a problem where batch normalization appears to hinder the learning process. We present our initial work in building a solution to tackle this problem, which has linear complexity dependence on the sequence length, and allows for interpretable sub-regions of the binary to be identified. In doing so we will discuss the many challenges in building a neural network to process data at this scale, and the methods we used to work around them.

Model Design Malware Detection Method Malware Classification

arXiv

Smoothgrad: removing noise by adding noise

Smilkov, D., Thorat, N., Kim, B., Viegas, F., Wattenberg, M.

Published: 2017

Proceedings of the 34th International Conference on Machine Learning-Volume 70

Axiomatic attribution for deep networks

Sundararajan, M., Taly, A., Yan, Q.

Published: 2017

Proceedings of the 30th ACM International Conference on Information & Knowledge Management

Toward an Effective Black-Box Adversarial Attack on Functional JavaScript Malware against Commercial Anti-Virus

Tsai, Y.-D., Chen, C., Lin, S.-D.

Published: 2021

Advances in Neural Information Processing Systems

Controllable invariance through adversarial feature learning

Xie, Q., Dai, Z., Du, Y., Hovy, E., Neubig, G.

Published: 2017

arXiv

Fairness constraints: Mechanisms for fair classification

Zafar, M. B., Valera, I., Rodriguez, M. G., Gummadi, K. P.

Published: 2015