Abstract
Model extraction attacks are a type of inference-time attack that
approximates the functionality and performance of a black-box victim model by
issuing a number of queries to the model and then leveraging its
predictions to train a substitute model. These attacks pose severe
security threats to production models and MLaaS platforms and could cause
significant monetary losses to the model owners. A body of work has proposed to
defend machine learning models against model extraction attacks, including both
active defense methods that modify the model's outputs or increase the query
overhead to avoid extraction and passive defense methods that detect malicious
queries or leverage watermarks to perform post-verification. In this work, we
introduce a new defense paradigm, called attack as defense, that modifies the
model's output to be poisonous, such that any malicious user who attempts to
use the output to train a substitute model will obtain a poisoned one. To this end, we
propose a novel, lightweight backdoor attack method dubbed HoneypotNet that
replaces the classification layer of the victim model with a honeypot layer and
then fine-tunes the honeypot layer with a shadow model (to simulate model
extraction) via bi-level optimization, making its output poisonous
while preserving the original performance. We empirically demonstrate on four
commonly used benchmark datasets that HoneypotNet can inject backdoors into
substitute models with a high success rate. The injected backdoor not only
facilitates ownership verification but also disrupts the functionality of
substitute models, serving as a significant deterrent to model extraction
attacks.
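
To make the pipeline the abstract sketches more concrete, below is a minimal, self-contained PyTorch sketch. It is not the paper's implementation: it replaces true bi-level optimization with a first-order alternation, and uses toy linear models, a random data stream standing in for attacker queries, an illustrative corner trigger patch, and illustrative loss weights. All names (backbone, honeypot, shadow, stamp) and hyperparameters are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
DIM, CLASSES, TARGET, BATCH = 64, 10, 3, 32

# Victim = frozen backbone + heads. `orig_head` is the frozen original
# classification layer (utility reference); `honeypot` replaces it.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, DIM), nn.ReLU())
orig_head = nn.Linear(DIM, CLASSES)
honeypot = nn.Linear(DIM, CLASSES)
honeypot.load_state_dict(orig_head.state_dict())
for p in [*backbone.parameters(), *orig_head.parameters()]:
    p.requires_grad_(False)

# Shadow model standing in for an attacker's substitute.
shadow = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, CLASSES))

opt_h = torch.optim.Adam(honeypot.parameters(), lr=1e-3)
opt_s = torch.optim.Adam(shadow.parameters(), lr=1e-3)

trigger = torch.rand(3, 8, 8)  # illustrative fixed trigger patch

def stamp(x):
    out = x.clone()
    out[:, :, :8, :8] = trigger  # stamp the trigger into one corner
    return out

for step in range(200):
    x = torch.rand(BATCH, 3, 32, 32)  # stand-in for attacker queries

    # Inner step: the shadow distills the victim's outputs, simulating
    # model extraction against the current honeypot head.
    s_loss = F.kl_div(
        F.log_softmax(shadow(x), dim=-1),
        F.softmax(honeypot(backbone(x)), dim=-1).detach(),
        reduction="batchmean",
    )
    opt_s.zero_grad(); s_loss.backward(); opt_s.step()

    # Outer step (first-order stand-in for the bi-level update): keep
    # clean outputs close to the original head (utility) while steering
    # triggered outputs toward TARGET (poison), so that extraction
    # transfers the backdoor into the shadow.
    feats = backbone(x)
    utility = F.kl_div(
        F.log_softmax(honeypot(feats), dim=-1),
        F.softmax(orig_head(feats), dim=-1),
        reduction="batchmean",
    )
    poison = F.cross_entropy(
        honeypot(backbone(stamp(x))),
        torch.full((BATCH,), TARGET, dtype=torch.long),
    )
    h_loss = utility + 0.5 * poison  # weight is illustrative
    opt_h.zero_grad(); h_loss.backward(); opt_h.step()

# Check: does the extracted (shadow) model inherit the backdoor? A high
# trigger-to-TARGET rate supports both ownership verification and the
# disruption of the substitute model's functionality.
x = torch.rand(256, 3, 32, 32)
asr = (shadow(stamp(x)).argmax(-1) == TARGET).float().mean()
print(f"shadow attack success rate: {asr:.2f}")
```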