Deep learning models are vulnerable to various adversarial manipulations of
their training data, parameters, and input samples. In particular, an adversary
can modify the training data and model parameters to embed backdoors into the
model, so that the model behaves according to the adversary's objective if the input
contains the backdoor features, referred to as the backdoor trigger (e.g., a
stamp on an image). The poisoned model's behavior on clean data, however,
remains unchanged. Many detection algorithms identify backdoors in input samples
or model parameters by exploiting the statistical difference between the latent
representations of adversarial and clean input samples in the poisoned model.
In this paper, we design an adversarial backdoor embedding
algorithm that can bypass existing detection algorithms, including
state-of-the-art techniques. We design an adaptive adversarial training
algorithm that optimizes the model's original loss function while also
maximizing the indistinguishability of the hidden representations of poisoned
and clean data. This work calls for designing adversary-aware defense
mechanisms for backdoor detection.
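
The two-part objective described above (the model's original loss plus a term
that rewards indistinguishable hidden representations) can be illustrated with
a minimal sketch. This is not the paper's implementation. The function names,
the weight `lam`, and the choice of penalty are all assumptions made for
illustration: here the penalty is simply the squared distance between the mean
latent vectors of a clean batch and a poisoned batch, used as a simple stand-in
for whatever indistinguishability measure the actual algorithm optimizes.

```python
import math


def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true labels (the original task loss)."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def latent_gap(clean_latents, poisoned_latents):
    """Hypothetical penalty: squared distance between the two batch means.

    A value of 0 means the batches are indistinguishable by their means.
    """
    mu_clean = mean_vector(clean_latents)
    mu_poisoned = mean_vector(poisoned_latents)
    return sum((a - b) ** 2 for a, b in zip(mu_clean, mu_poisoned))


def adaptive_loss(probs, labels, clean_latents, poisoned_latents, lam=1.0):
    """Task loss plus `lam` times the latent-gap penalty (both are minimized)."""
    gap = latent_gap(clean_latents, poisoned_latents)
    return cross_entropy(probs, labels) + lam * gap


# Toy usage: two samples, two classes, 2-dimensional hidden representations.
probs = [[0.5, 0.5], [0.5, 0.5]]        # model outputs on the training batch
labels = [0, 1]                         # true labels (target label if poisoned)
clean = [[0.0, 0.0], [2.0, 0.0]]        # latents of clean inputs
poisoned = [[1.0, 1.0], [1.0, 3.0]]     # latents of inputs carrying the trigger
print(adaptive_loss(probs, labels, clean, poisoned, lam=0.5))
```

Minimizing the penalty term pushes the poisoned and clean latent distributions
together, which is the property that detectors relying on latent-space
statistics depend on being violated. The weight `lam` trades off task accuracy
against that indistinguishability in this sketch.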