Since off-chip DRAM access significantly affects both performance and power consumption, convolutional neural network (CNN) accelerators commonly aim to maximize data reuse in on-chip memory. By organizing the on-chip memory into multiple banks, the off-chip DRAM access delay can be hidden by prefetching data into unused banks during computation. When and where to prefetch data, and how to reuse feature map data between layers, define the multi-bank on-chip memory management (MOMM) problem. In this paper, we propose compiler techniques to solve the MOMM problem with two different objectives: one is to minimize the off-chip memory access volume, and the other is to minimize the processing delay caused by unhidden DRAM accesses. By running CNN benchmarks on a cycle-level NPU simulator, we demonstrate the trade-off between the two objectives. Compared with a baseline approach that does not reuse feature maps between layers, we reduce the DRAM access volume and the processing delay by up to 55.0 and 79.4 percent, respectively. Moreover, we extend the proposed techniques to consider layer fusion, which aims to reuse feature maps between layers. Experimental results confirm the superiority of the proposed hybrid fusion technique over both the per-layer processing technique and the pure fusion technique.
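To illustrate why prefetching into unused banks can hide DRAM access delay, the following is a minimal sketch of a two-bank double-buffering model. The bank count, tile counts, and the latency and compute-time figures are hypothetical and chosen only for illustration; they are not taken from the paper or its simulator.

```python
# Minimal sketch of multi-bank prefetching (double buffering).
# Assumes a hypothetical two-bank on-chip buffer: while one bank is
# being computed on, the next tile is prefetched into the other bank,
# so only the un-hidden part of the DRAM latency stalls the accelerator.

DRAM_LATENCY = 100   # cycles to fetch one tile from off-chip DRAM (assumed)
COMPUTE_TIME = 80    # cycles to process one on-chip tile (assumed)

def delay_no_prefetch(num_tiles):
    # Baseline: every tile waits for its full DRAM fetch before computing.
    return num_tiles * (DRAM_LATENCY + COMPUTE_TIME)

def delay_with_prefetch(num_tiles):
    # The first fetch cannot be hidden; later fetches overlap with computation,
    # leaving only the latency that exceeds the compute time as a stall.
    stall_per_tile = max(0, DRAM_LATENCY - COMPUTE_TIME)
    return DRAM_LATENCY + COMPUTE_TIME + (num_tiles - 1) * (COMPUTE_TIME + stall_per_tile)

if __name__ == "__main__":
    tiles = 64
    baseline = delay_no_prefetch(tiles)
    prefetched = delay_with_prefetch(tiles)
    print(f"no prefetch : {baseline} cycles")
    print(f"prefetch    : {prefetched} cycles "
          f"({100 * (baseline - prefetched) / baseline:.1f}% reduction)")
```

With these assumed numbers the prefetching model removes most of the per-tile DRAM wait; the residual stall (the latency not covered by computation) corresponds to the "unhidden DRAM accesses" whose delay one of the paper's objectives minimizes.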
               