Abstract
With the growth of machine learning (ML) applications across domains,
incentives for deceiving these models are greater than ever. Because data is
the backbone of ML algorithms, attackers have shifted their interest toward
polluting the training data. Data credibility is at even higher risk with the
rise of state-of-the-art research topics such as open design principles,
federated learning, and crowd-sourcing. ML models depend on different
stakeholders for their data, yet there are no reliable automated mechanisms
to verify the veracity of data from each source.
Malware detection is arduous because malware is adversarial by nature and
evolving samples add metamorphic and polymorphic capabilities. ML has been
shown to address the zero-day malware detection problem, which traditional
signature-based approaches leave unresolved. Poisoning the malware training
data can allow malware files to go undetected by ML-based malware detectors,
helping attackers fulfill their malicious goals. A feasibility analysis of the
data poisoning threat in the malware detection domain is still lacking.
Our work will focus on two major parts: training ML-based malware detectors
and poisoning the training data using a label-poisoning approach. We will
analyze the robustness of different machine learning models against data
poisoning with varying volumes of poisoned data.
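
As an illustration of label poisoning at varying volumes, the following is a minimal sketch, not the procedure used in this work. It assumes binary labels (0 = benign, 1 = malware) and uniform random label flipping. The function name `flip_labels`, the `poison_rate` parameter, and the toy label list are hypothetical and chosen only for this example.

```python
import random


def flip_labels(labels, poison_rate, seed=0):
    """Return (poisoned_labels, flipped_indices) for binary labels.

    Assumption: labels are 0 (benign) or 1 (malware), and a fraction
    `poison_rate` of them is chosen uniformly at random and flipped.
    The input list is not modified.
    """
    if not 0.0 <= poison_rate <= 1.0:
        raise ValueError("poison_rate must be in [0, 1]")
    rng = random.Random(seed)
    n_poison = round(poison_rate * len(labels))
    flipped = sorted(rng.sample(range(len(labels)), n_poison))
    poisoned = list(labels)
    for i in flipped:
        poisoned[i] = 1 - poisoned[i]  # 0 -> 1, 1 -> 0
    return poisoned, flipped


# Toy training labels: 50 benign samples followed by 50 malware samples.
clean = [0] * 50 + [1] * 50

# Sweep over poisoning volumes and report how many malware samples
# would now be presented to the detector as benign.
for rate in (0.0, 0.1, 0.2, 0.4):
    poisoned, flipped = flip_labels(clean, rate, seed=42)
    hidden_malware = sum(1 for i in flipped if clean[i] == 1)
    print(f"rate={rate:.1f} flipped={len(flipped)} malware_as_benign={hidden_malware}")
```

In a robustness study, each poisoned label set would be used to retrain a detector, and its accuracy on a clean test set would be compared across poisoning rates. A targeted variant that flips only malware labels to benign is an equally plausible reading of "label poisoning" and would need only a change to the index sampling step.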