Parameter quantization to lower bit-widths is a common approach to reducing the computation load of CNN inference. With the parameters replaced by fixed-width binary codes, multiplication operations can be replaced by look-up tables (LUTs), where the multiplier and multiplicand operands serve as the table index and the pre-calculated products serve as the table entries. Because the histogram profiles of the parameters differ significantly across CNN layers and channels, previous LUT-based computation methods have to use a different LUT for each layer/channel, and consequently demand larger memory space along with extra access time and power consumption. In this work, we first normalize the Gaussian profiles of the parameters in different layers/channels so that they share similar means and variances, and then quantize the normalized parameters to a fixed bit-width through non-linear quantization. Because the parameter profiles are normalized, a single compact LUT (16×16 entries) can replace all multiplication operations in the whole network. Furthermore, the normalization procedure also reduces the errors induced by quantization. Experiments demonstrate that with a compact 256-entry LUT, we achieve accuracy comparable to 32-bit floating-point computation, while significantly reducing the computation load, memory footprint, power consumption, and hardware resources.
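
The following is a minimal sketch of the scheme described above. It assumes a Gaussian-quantile codebook for the non-linear quantizer, a toy treatment of activations, and an affine fold-back of each layer's mean and standard deviation after the table lookup; these are illustrative assumptions, not necessarily the authors' exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed 16-level non-linear codebook: quantiles of a standard Gaussian, so the
# code levels are dense where normalized parameters are most likely to fall.
N_LEVELS = 16
probs = (np.arange(N_LEVELS) + 0.5) / N_LEVELS
CODEBOOK = np.quantile(rng.standard_normal(1_000_000), probs)

# One shared 16x16 = 256-entry table of pre-computed products of code levels.
LUT = np.outer(CODEBOOK, CODEBOOK)

def normalize_and_quantize(x):
    """Shift/scale x to zero mean and unit variance, then map each value to the
    nearest codebook level. Returns 4-bit codes plus (mu, sigma) for fold-back."""
    mu, sigma = x.mean(), x.std() + 1e-12
    normed = (x - mu) / sigma
    codes = np.abs(normed[:, None] - CODEBOOK[None, :]).argmin(axis=1)
    return codes.astype(np.uint8), mu, sigma

def lut_multiply(w_codes, mu_w, s_w, a_codes, mu_a, s_a):
    """Approximate elementwise products w * a using only table lookups and the
    per-layer/channel affine constants (hypothetical fold-back step)."""
    wn = CODEBOOK[w_codes]           # 16-entry lookup
    an = CODEBOOK[a_codes]           # 16-entry lookup
    prod = LUT[w_codes, a_codes]     # 256-entry lookup replaces the multiply
    return s_w * s_a * prod + s_w * mu_a * wn + mu_w * s_a * an + mu_w * mu_a

# Toy check: two layers with very different Gaussian profiles share the same LUT.
for mu, sigma in [(0.0, 0.02), (0.5, 0.30)]:
    w = rng.normal(mu, sigma, size=4096)
    a = rng.normal(0.0, 1.0, size=4096)
    wc, mu_w, s_w = normalize_and_quantize(w)
    ac, mu_a, s_a = normalize_and_quantize(a)
    approx = lut_multiply(wc, mu_w, s_w, ac, mu_a, s_a)
    print("relative error:", np.abs(approx - w * a).mean() / np.abs(w * a).mean())
```

Because every layer/channel is normalized to roughly the same standard-Gaussian profile, the 16-level codebook and the 16×16 product table are shared network-wide; only two scalars (mean and standard deviation) remain per layer/channel instead of a separate LUT for each.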