MalMixer: Few-Shot Malware Classification with Retrieval-Augmented Semi-Supervised Learning

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2409.13213

PDF

https://arxiv.org/pdf/2409.13213

Paper Information

Authors: Jiliang Li, Yifan Zhang, Yu Huang, Kevin Leach
Published: 9-20-2024
Updated: 4-18-2025
Affiliation: Department of Computer Science, Vanderbilt University
Country: United States of America
Conference: European Symposium on Security and Privacy (EuroS&P)

Labels Estimated by AI

Malware Detection with Limited Samples, Poisoning, Data Augmentation Method

These labels were automatically added by AI and may be inaccurate.

Abstract

Recent growth and proliferation of malware have tested practitioners' ability to promptly classify new samples according to malware families. In contrast to labor-intensive reverse engineering efforts, machine learning approaches have demonstrated increased speed and accuracy. However, most existing deep-learning malware family classifiers must be calibrated using a large number of samples that are painstakingly manually analyzed before training. Furthermore, as novel malware samples arise that are beyond the scope of the training set, additional reverse engineering effort must be employed to update the training set. The sheer volume of new samples found in the wild creates substantial pressure on practitioners' ability to reverse engineer enough malware to adequately train modern classifiers. In this paper, we present MalMixer, a malware family classifier using semi-supervised learning that achieves high accuracy with sparse training data. We present a domain-knowledge-aware data augmentation technique for malware feature representations, enhancing few-shot performance of semi-supervised malware family classification. We show that MalMixer achieves state-of-the-art performance in few-shot malware family classification settings. Our research confirms the feasibility and effectiveness of lightweight, domain-knowledge-aware data augmentation methods for malware features and shows the capabilities of similar semi-supervised classifiers in addressing malware classification issues.

External Datasets

BODMAS

MOTIF