With the surge of Machine Learning (ML), a growing number of intelligent
applications have been developed. Deep Neural Networks (DNNs) have demonstrated
unprecedented performance across various fields such as medical diagnosis and
autonomous driving. While DNNs are widely employed in security-sensitive
fields, they have been shown to be vulnerable to Neural Trojan (NT) attacks that
are controlled and activated by stealthy triggers. In this paper, we aim to
design a robust and adaptive Trojan detection scheme that inspects whether a
pre-trained model has been Trojaned before its deployment. Prior works are
oblivious to the intrinsic properties of the trigger distribution and try to
reconstruct the trigger pattern using simple heuristics, i.e., stimulating the
given model to produce incorrect outputs. As a result, their detection speed
and effectiveness are limited. We leverage the observation that pixel triggers
typically exhibit spatial dependency and propose the first
trigger-approximation-based black-box Trojan detection framework that enables a fast
and scalable search of the trigger in the input space. Furthermore, our
approach can also detect Trojans embedded in the feature space where certain
filter transformations are used to activate the Trojan. We perform extensive
experiments to investigate the performance of our approach across various
datasets and ML models. Empirical results show that our approach achieves a
ROC-AUC score of 0.93 on the public TrojAI dataset. Our code can be found at
https://github.com/xinqiaozhang/adatrojan