The emerging Resistive RAM (ReRAM) technology significantly boosts the performance and energy efficiency of deep learning accelerators (DLAs) via the Computing-in-Memory (CiM) architecture. However, ReRAM-based DLAs also suffer from a high occurrence rate of memory faults, and detecting and protecting against faults in ReRAM devices poses great challenges to ReRAM-based DLA design. In this work, we propose RRAMedy, an in-situ fault detection and network remedy framework for ReRAM-based DLAs. Its proposed Adversarial Example Testing, an on-device, online fault detection technique that operates throughout the device lifetime, achieves high detection coverage of both hard faults and soft faults at a low run-time cost. In addition, RRAMedy employs an edge-cloud collaborative model retraining method that tolerates the detected faults by leveraging the inherent fault-adaptive capability of DNNs. To enable in-situ model remedy when cloud assistance is unavailable due to security or overhead concerns, we further propose to accelerate the fault-masking retraining process on edge devices with parallelized Knowledge Transfer. Our experimental results show that the proposed fault detection technique achieves high fault detection accuracy and delivers real-time testing performance. The proposed retraining approach, in turn, greatly alleviates the accuracy degradation caused by faults and achieves substantial speedups over the baselines.
               