Offloading compute-intensive kernels to hardware accelerators relies on the large degree of parallelism offered by these platforms. However, the effective bandwidth of the memory interface often causes a bottleneck, hindering the accelerator’s effective performance. Techniques enabling data reuse, such as tiling, lower the pressure on memory traffic but do not fully exploit the bandwidth. A further increase in bandwidth utilization is possible by using burst rather than element-wise accesses, provided the data is contiguous in memory. In this article, we propose a memory allocation technique, and provide a proof-of-concept source-to-source compiler pass, that enables such burst transfers by modifying the data layout in external memory. Our experiments show the new memory allocation yields close to 100% bandwidth utilization while the memory engines occupy less than 5% of the field-programmable gate array (FPGA) logic area.
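To illustrate why contiguity matters for burst transfers, the following minimal sketch counts how many contiguous address runs (each of which could be served by one burst) are needed to fetch a single square tile under two layouts: plain row-major storage, and a layout in which every tile is stored contiguously. This is a generic illustration of the idea, not the allocation technique or compiler pass proposed in the article; the function names, the 16-column array, and the 4x4 tile size are all assumptions chosen for the example.

```python
def count_bursts(addresses):
    """Count maximal runs of consecutive addresses; each run is one potential burst."""
    bursts = 0
    prev = None
    for addr in addresses:
        if prev is None or addr != prev + 1:
            bursts += 1
        prev = addr
    return bursts


def row_major_addresses(n_cols, tile, ti, tj):
    """Addresses of the elements of tile (ti, tj) when the array is stored row-major."""
    return [
        (ti * tile + i) * n_cols + (tj * tile + j)
        for i in range(tile)
        for j in range(tile)
    ]


def tiled_addresses(n_cols, tile, ti, tj):
    """Addresses of the same tile when each tile is stored contiguously (hypothetical layout)."""
    tiles_per_row = n_cols // tile  # assumes n_cols is a multiple of tile
    base = (ti * tiles_per_row + tj) * tile * tile
    return [base + i * tile + j for i in range(tile) for j in range(tile)]


if __name__ == "__main__":
    N_COLS, TILE = 16, 4  # illustrative sizes only
    print("row-major bursts:", count_bursts(row_major_addresses(N_COLS, TILE, 1, 2)))
    print("tiled bursts:    ", count_bursts(tiled_addresses(N_COLS, TILE, 1, 2)))
```

With these sizes, the row-major layout needs one run per tile row, whereas the tile-contiguous layout needs a single run for the whole tile. Fewer, longer runs are what make burst accesses applicable.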
               