FPGAs are starting to incorporate High Bandwidth Memory (HBM) both to reduce the memory bandwidth bottleneck encountered in some applications and to provide more capacity to store application state. However, the overall performance characteristics of HBM are still not well understood, especially in the context of FPGAs, making it difficult to optimize designs that rely on HBM. In this article, we bridge the gap between nominal specifications and actual performance by characterizing HBM on a state-of-the-art FPGA, i.e., a Xilinx Alveo U280 featuring a two-stack HBM subsystem. To this end, we have developed Shuhai, a benchmarking tool that sheds light on the subtle details of the performance and usage of HBM on an FPGA. FPGA-based benchmarking should also provide a more accurate picture of HBM than measuring performance on CPUs/GPUs, since CPUs/GPUs are noisier systems due to their complex control logic and cache hierarchy. Since the memory itself is complex, leveraging custom hardware logic to benchmark it directly from an FPGA provides more detail as well as more accurate and deterministic measurements. We observe that 1) HBM is able to provide up to 425 GB/s of memory bandwidth, and 2) how HBM is used has a significant impact on the achievable throughput, which in turn demonstrates the importance of unveiling the performance characteristics of HBM so that it can be used in the right manner. To demonstrate the generality of Shuhai, we also show results for other types of memory, e.g., DDR4 and DDR3, and quantitatively compare the performance characteristics of HBM with those of DDR4 and DDR3.
               