This article presents an adaptive clock scheme that exploits instruction-based dynamic timing slack (DTS) in a general-purpose graphics processing unit (GPGPU) architecture. Using the developed transitional static timing analysis, the deterministic DTS can be identified for each instruction at each pipeline stage. A critical path (CP) messenger scheme was designed to monitor the runtime utilization of CPs. Both real-time issued-instruction information and the CP messengers are used to determine the runtime DTS margin and to guide cycle-by-cycle adjustment of the clock period. To apply the proposed adaptive clock to a GPGPU, a hierarchical clocking scheme is built from a global phase-locked loop (PLL) and a local delay-locked loop (DLL)-based clock generator inside each compute unit (CU). Each CU core has its own clock domain with adjustable local clocking. In addition, to exploit the error-resilient characteristics of neural networks, an elastic pipeline clocking scheme is developed to redistribute the timing margin across pipeline stages for machine learning computations. Measurement results from the open-source GPGPU architecture implemented in a 65 nm CMOS process demonstrate that exploiting the deterministic instruction-based DTS yields up to 18% performance improvement or an equivalent 30% energy saving. The proposed elastic pipeline clocking gains an additional 8% energy saving with small accuracy degradation for neural network inference operations.
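The cycle-by-cycle period adjustment described above can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the article's hardware implementation: the opcodes, the three-stage pipeline, and every delay value in `STAGE_DELAY_NS` are hypothetical, and the model ignores the CP messengers, the PLL/DLL hierarchy, and elastic pipeline clocking. It shows only the core idea that each cycle's period needs to cover the slowest path exercised by the instructions currently in flight, rather than the global worst case.

```python
# Toy model of instruction-based adaptive clocking.
# All opcodes, stages, and delays are hypothetical illustration values.
STAGES = ["decode", "execute", "writeback"]

# Assumed per-instruction, per-stage path delay in nanoseconds.
STAGE_DELAY_NS = {
    "add": {"decode": 2.0, "execute": 3.0, "writeback": 2.0},
    "mul": {"decode": 2.0, "execute": 4.0, "writeback": 2.5},
    "ld":  {"decode": 2.5, "execute": 3.5, "writeback": 3.0},
}

# A fixed clock must cover the slowest path of any instruction in any stage.
WORST_CASE_NS = max(
    delay for per_stage in STAGE_DELAY_NS.values() for delay in per_stage.values()
)


def cycle_period(in_flight):
    """Return the clock period needed for one cycle.

    in_flight maps stage name -> opcode, or None for an empty stage.
    An entirely empty pipeline falls back to the worst-case period.
    """
    delays = [
        STAGE_DELAY_NS[op][stage]
        for stage, op in in_flight.items()
        if op is not None
    ]
    return max(delays) if delays else WORST_CASE_NS


def run(program):
    """Return (adaptive_time_ns, fixed_time_ns) for an in-order pipeline
    in which instruction i occupies stage s during cycle i + s."""
    n_cycles = len(program) + len(STAGES) - 1
    adaptive = 0.0
    for cycle in range(n_cycles):
        in_flight = {}
        for s, stage in enumerate(STAGES):
            i = cycle - s  # index of the instruction in this stage
            in_flight[stage] = program[i] if 0 <= i < len(program) else None
        adaptive += cycle_period(in_flight)
    return adaptive, n_cycles * WORST_CASE_NS


if __name__ == "__main__":
    adaptive, fixed = run(["add", "add", "mul", "ld"])
    print(f"adaptive: {adaptive} ns, fixed: {fixed} ns")
```

In this model the saving comes entirely from cycles in which no in-flight instruction exercises the longest path; a program dominated by the slowest instruction would see little benefit.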
               