Many-core architectures are a promising way to accelerate increasingly large neural networks (NNs). Most many-core architectures couple a standalone CPU core and a tensor core together as a compute node. However, existing architectures suffer from inefficiencies at the architecture, data-flow, and control-flow levels: the standalone scalar CPU core, with its deep out-of-order pipeline and low data parallelism per instruction, incurs high hardware overhead and low throughput; fixed proportions of CPU and tensor cores execute computations alternately in each cluster, leading to core under-utilization under diverse workloads; and the MIMD parallelism strategy causes redundant instruction-cache (I-Cache) accesses, increasing power consumption. To tackle these limitations, we propose MRCIM, a many-core reconfigurable computing-in-memory (CIM) processor whose reconfigurable cores feature both CPU and tensor modes. 1) We design a reconfigurable CPU core that reuses the CIM-based tensor core's inherent memory and computing logic, simplifying the pipeline logic and improving the data parallelism of a conventional CPU. 2) We propose interleaved workload execution (IWE) and adaptive workload mapping (AWM) scheduling strategies, which dynamically adjust the proportion of CPU and tensor cores in a cluster so that they work in parallel with high utilization. 3) We propose a hybrid MIMD/SIMD control flow that bypasses unnecessary I-Cache accesses through instruction forwarding and sharing, thereby reducing power consumption. Experimental results show that MRCIM achieves 166.48x~446.67x speedup and 96.76x~309.01x energy saving over an Intel i9-13900K CPU, and 12.62x~27.62x speedup and 5.49x~17.82x energy saving over an NVIDIA RTX 4090 GPU. Compared with state-of-the-art NN processor architectures, MRCIM achieves average speedups of 6.84x, 7.51x, and 3.66x and average energy savings of 4.57x, 3.03x, and 3.11x over Simba, LUT-ICC, and MAICC, respectively.
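The I-Cache savings claimed for the hybrid MIMD/SIMD control flow can be illustrated with a first-order counting model. This is a hypothetical sketch, not the paper's implementation: it only assumes that in SIMD mode one leader core fetches each instruction and forwards it to its peers, while in MIMD mode every core fetches independently.

```python
# Hypothetical first-order model (an assumption for illustration, not MRCIM's
# actual design): compare I-Cache accesses for a cluster of cores running the
# same instruction stream under MIMD fetch vs. SIMD instruction forwarding.

def icache_accesses(n_cores: int, n_instructions: int, simd: bool) -> int:
    """Count instruction-cache accesses for one kernel on a cluster."""
    if simd:
        # One leader core fetches; followers receive forwarded instructions,
        # so the fetch cost is independent of the number of cores.
        return n_instructions
    # Every core performs its own fetch for every instruction.
    return n_cores * n_instructions

mimd = icache_accesses(n_cores=16, n_instructions=1000, simd=False)
simd = icache_accesses(n_cores=16, n_instructions=1000, simd=True)
print(mimd, simd, mimd // simd)  # 16000 1000 16
```

Under this toy model, sharing a single fetch stream across a 16-core cluster cuts I-Cache accesses by the cluster size, which is the mechanism behind the power reduction the abstract attributes to instruction forwarding and sharing.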