Abstract
A backdoor deep learning (DL) model behaves normally upon clean inputs but
misbehaves upon trigger inputs as the backdoor attacker desires, posing severe
consequences to DL model deployments. State-of-the-art defenses are either
limited to specific backdoor attacks (source-agnostic attacks) or
non-user-friendly in that machine learning (ML) expertise or expensive
computing resources are required. This work observes that all existing backdoor
attacks have an inevitable intrinsic weakness, non-transferability, that is, a
trigger input hijacks a backdoored model but is not effective against another
model that has not been implanted with the same backdoor. With this key
observation, we propose non-transferability enabled backdoor detection (NTD) to
identify trigger inputs for a model-under-test (MUT) during
run-time. Specifically, NTD allows a potentially backdoored MUT to predict a
class for an input. In the meantime, NTD leverages a feature extractor (FE) to
extract feature vectors for the input and a group of samples randomly picked
from its predicted class, and then compares the similarity between the input
and the samples in the FE's latent space. If the similarity is low, the input
is flagged as an adversarial trigger input; otherwise, it is treated as benign.
The FE is a free pre-trained
model privately reserved from open platforms. As the FE and MUT are from
different sources, the attacker is very unlikely to insert the same backdoor
into both of them. Because of non-transferability, a trigger effect that does
work on the MUT cannot be transferred to the FE, making NTD effective against
different types of backdoor attacks. We evaluate NTD on three popular
customized tasks: face recognition, traffic sign recognition, and general
animal classification. The results affirm that NTD has high effectiveness
(low false acceptance rate) and usability (low false rejection rate) with low
detection latency.
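
The run-time check described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the similarity measure, how per-sample similarities are aggregated, or how the decision threshold is chosen, so cosine similarity, mean aggregation, and the values of `threshold` and `n_samples` are all placeholders. The names `ntd_flags_trigger`, `mut_predict`, `fe_extract`, and `samples_by_class` are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D feature vectors (0.0 if either is all zeros)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def ntd_flags_trigger(x, mut_predict, fe_extract, samples_by_class,
                      threshold=0.5, n_samples=5, rng=None):
    """Return (predicted_class, is_trigger) for input x.

    mut_predict:      callable, the model-under-test; maps an input to a class label.
    fe_extract:       callable, the independent feature extractor; maps an input
                      to a feature vector.
    samples_by_class: dict mapping each class label to a list of held-out samples.
    threshold, n_samples: assumed values for illustration only.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Step 1: let the (possibly backdoored) MUT predict a class.
    predicted = mut_predict(x)

    # Step 2: randomly pick reference samples from the predicted class.
    pool = samples_by_class[predicted]
    k = min(n_samples, len(pool))
    picked = rng.choice(len(pool), size=k, replace=False)

    # Step 3: compare the input with those samples in the FE's latent space.
    x_feat = fe_extract(x)
    sims = [cosine_similarity(x_feat, fe_extract(pool[i])) for i in picked]

    # Step 4: low similarity means the input does not resemble its predicted class.
    is_trigger = float(np.mean(sims)) < threshold
    return predicted, is_trigger
```

Because the FE never saw the backdoor, a trigger input that the MUT assigns to the attacker's target class should still look unlike genuine samples of that class in the FE's feature space, which is what the threshold test relies on.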