Abstract
Current software supply chains heavily rely on open-source packages hosted in
public repositories. Given the popularity of ecosystems like npm and PyPI,
malicious users have started to spread malware by publishing open-source packages
containing malicious code. Recent work applies machine learning techniques to
detect malicious packages in the npm ecosystem. However, the scarcity of
samples poses a challenge to the application of machine learning techniques in
other ecosystems. Despite the differences between JavaScript and Python,
open-source software supply chain attacks targeting these languages show
noticeable similarities (e.g., use of installation scripts, obfuscated strings,
URLs).
In this paper, we present a novel approach that involves a set of
language-independent features and the training of models capable of detecting
malicious packages in npm and PyPI by capturing their commonalities. This
methodology allows us to train models on a diverse dataset encompassing
multiple languages, thereby overcoming the challenge of limited sample
availability. We evaluate the models both in a controlled experiment (where
the labels of the data are known) and in the wild, by scanning packages newly
uploaded to npm and PyPI over a period of 10 days.
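To make the idea of language-independent features concrete, the following is a minimal sketch, not the feature set used in this work. It assumes that a package is available as a mapping from file names to source text, and it computes a few signals suggested by the similarities named above (installation scripts, obfuscated strings, URLs). The names `extract_features`, `shannon_entropy`, `URL_RE`, and `STRING_RE`, as well as the choice of entropy as a proxy for obfuscation, are hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical patterns: they work on raw text, so the same code can be
# applied to JavaScript and Python sources without a language-specific parser.
URL_RE = re.compile(r"https?://[^\s'\"<>)]+")
STRING_RE = re.compile(r"'([^'\\\n]*)'|\"([^\"\\\n]*)\"")


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character; high values can hint at obfuscation."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def extract_features(sources: dict, has_install_script: bool) -> dict:
    """Map a package (file name -> source text) to a small numeric feature dict.

    has_install_script is assumed to be determined by the caller, e.g. from
    an install hook in package.json (npm) or the presence of setup.py (PyPI).
    """
    url_count = 0
    entropies = []
    for text in sources.values():
        url_count += len(URL_RE.findall(text))
        for match in STRING_RE.finditer(text):
            literal = match.group(1) or match.group(2) or ""
            if literal:
                entropies.append(shannon_entropy(literal))
    return {
        "has_install_script": int(has_install_script),
        "url_count": url_count,
        "string_count": len(entropies),
        "max_string_entropy": max(entropies, default=0.0),
    }


# The same extractor is applied to an npm-style and a PyPI-style package.
npm_features = extract_features(
    {"index.js": "const u = 'http://example.com/x';"}, has_install_script=True
)
pypi_features = extract_features(
    {"setup.py": 'import os\nurl = "http://example.com/y"\n'}, has_install_script=True
)
```

Because both packages map to the same feature keys, samples from the two ecosystems could be pooled into a single training set, which is the property that the approach relies on to offset the scarcity of samples in any one ecosystem.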
We find that our approach successfully detects malicious packages for both
npm and PyPI. Across 31,292 analyzed packages, we reported 58 previously
unknown malicious packages (38 in npm and 20 in PyPI), which were
consequently removed from the respective repositories.