Abstract
The massive deployment of Machine Learning (ML) models has been accompanied
by the emergence of several attacks that threaten their trustworthiness and
raise ethical and societal concerns such as invasion of privacy, discrimination
risks, and lack of accountability. Model hijacking is one of these attacks,
where the adversary aims to hijack a victim model to execute a different task
than its original one. Model hijacking raises accountability and security risks, since the owner of a hijacked model can be framed for offering illegal or unethical services through their model. Prior state-of-the-art works treat model hijacking as a training-time attack, whereby the adversary requires access to the training of the victim ML model to execute the attack. In this paper, we consider a
stronger threat model where the attacker has no access to the training phase of
the victim model. Our intuition is that ML models, which are typically over-parameterized, might (unintentionally) learn more than the task for which they are trained. We propose SnatchML, a simple approach for model hijacking at inference time: unknown input samples are classified by measuring their distance, in the latent space of the victim model, to previously known samples associated with the classes of the hijacking task. Using SnatchML, we empirically show that
benign pre-trained models can execute tasks that are semantically related to
the initial task. Surprisingly, this can be true even for hijacking tasks
unrelated to the original task. We also explore methods to mitigate this risk. First, we propose a novel approach, which we call meta-unlearning, designed to help the model unlearn a potentially malicious task while training on the original task dataset. Second, we provide insights into over-parameterization as one inherent factor that may make model hijacking easier, and we accordingly propose a compression-based countermeasure against this attack.
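
To make the inference-time hijacking idea concrete, the following is a minimal sketch (not the authors' released code) of distance-based classification in a victim model's latent space. It assumes an `embed` step that maps inputs to the model's penultimate-layer representation and a few labeled hijacking-task samples held by the adversary; all names here are illustrative.

```python
# Hedged sketch of latent-space, distance-based hijacking at inference time.
# Assumption (not from the paper's artifacts): the adversary can obtain the victim
# model's latent embeddings for its own reference samples and for query inputs.
import numpy as np

def fit_centroids(embeddings, labels):
    """Compute one centroid per hijacking-task class from known labeled samples."""
    classes = sorted(set(labels))
    return {c: np.mean([e for e, l in zip(embeddings, labels) if l == c], axis=0)
            for c in classes}

def snatch_predict(embedding, centroids):
    """Assign the hijacking-task class whose centroid is closest in latent space."""
    return min(centroids, key=lambda c: np.linalg.norm(embedding - centroids[c]))

# Toy usage with random vectors standing in for victim-model embeddings.
rng = np.random.default_rng(0)
known_embs = [rng.normal(loc=i, size=8) for i in (0, 0, 3, 3)]  # adversary's reference samples
known_lbls = ["class_a", "class_a", "class_b", "class_b"]
centroids = fit_centroids(known_embs, known_lbls)
query = rng.normal(loc=3, size=8)                               # unknown input's embedding
print(snatch_predict(query, centroids))                         # -> "class_b"
```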