With the rapid development of embedded deep-learning computing systems,
applications powered by deep learning are moving from the cloud to the edge.
When deploying neural networks (NNs) onto devices operating in complex
environments, various types of faults can occur: soft errors caused by cosmic
radiation and radioactive impurities, voltage instability, aging, temperature
variations, and malicious attacks. Consequently, the safety risks of deploying
NNs are drawing increasing attention. In this paper, after analyzing the
possible faults in various types of NN accelerators, we formalize and
implement various fault models from the algorithmic perspective. We propose
Fault-Tolerant Neural Architecture Search (FT-NAS) to automatically discover
convolutional neural network (CNN) architectures that are resilient to the
various faults present in today's devices. We then incorporate fault-tolerant
training (FTT) into the search process to achieve better results; this method
is referred to as FTT-NAS. Experiments on CIFAR-10 show that the discovered
architectures significantly outperform manually designed baseline
architectures, with comparable or fewer floating-point operations (FLOPs) and
parameters.
Specifically, under the same fault settings, F-FTT-Net, discovered under the
feature fault model, achieves an accuracy of 86.2% (vs. 68.1% for
MobileNet-V2), and W-FTT-Net, discovered under the weight fault model,
achieves an accuracy of 69.6% (vs. 60.8% for ResNet-20). By inspecting the
discovered architectures, we find that the operation primitives, the weight
quantization range, the model capacity, and the connection pattern all
influence the fault resilience of NN models.