These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
Most existing methods to detect backdoored machine learning (ML) models take
one of the two approaches: trigger inversion (aka. reverse engineer) and weight
analysis (aka. model diagnosis). In particular, the gradient-based trigger
inversion is considered to be among the most effective backdoor detection
techniques, as evidenced by the TrojAI competition, Trojan Detection Challenge
and backdoorBench. However, little has been done to understand why this
technique works so well and, more importantly, whether it raises the bar to the
backdoor attack. In this paper, we report the first attempt to answer this
question by analyzing the change rate of the backdoored model around its
trigger-carrying inputs. Our study shows that existing attacks tend to inject
the backdoor characterized by a low change rate around trigger-carrying inputs,
which are easy to capture by gradient-based trigger inversion. In the meantime,
we found that the low change rate is not necessary for a backdoor attack to
succeed: we design a new attack enhancement called \textit{Gradient Shaping}
(GRASP), which follows the opposite direction of adversarial training to reduce
the change rate of a backdoored model with regard to the trigger, without
undermining its backdoor effect. Also, we provide a theoretic analysis to
explain the effectiveness of this new technique and the fundamental weakness of
gradient-based trigger inversion. Finally, we perform both theoretical and
experimental analysis, showing that the GRASP enhancement does not reduce the
effectiveness of the stealthy attacks against the backdoor detection methods
based on weight analysis, as well as other backdoor mitigation methods without
using detection.