Sorting is an important kernel that requires many passes over the data, and each pass imposes significant data movement overhead. Processing in memory (PIM) can reduce this data movement overhead while providing high parallelism. The radix sorting algorithm is scalable and can exploit PIM's parallelism. However, this algorithm is inefficient on current PIM-based accelerators for three reasons: (i) it requires a large intermediate array per processing unit, wasting capacity, (ii) it requires a prefix-sum operation across all the large intermediate arrays, imposing performance overhead, and (iii) it requires many random accesses, which are costly in PIM. In this paper, we propose an algorithm and hardware co-optimization for sorting that enables every group of processing elements to cooperatively share and generate an intermediate array, reducing both the capacity overhead of the intermediate arrays and the performance overhead of the prefix-sum operation. To prevent the shared array from becoming a bottleneck due to random accesses, we eliminate these accesses by adding a local sorting step to the radix sort and providing efficient hardware support for it. On average, our hardware/algorithm optimizations, Pulley, deliver a 20× speedup over Bonsai, an FPGA-based sorting accelerator, and a 13× speedup over IMC, an in-logic-layer sorting accelerator.
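For intuition, the three overheads the abstract identifies can be seen in one pass of a conventional parallel radix sort. The sketch below is illustrative only (it is the baseline the paper critiques, not the proposed Pulley design); `NUM_PE` and `RADIX` are assumed values, not taken from the paper.

```python
NUM_PE = 4   # number of processing elements (hypothetical)
RADIX = 16   # 4-bit digit per pass (hypothetical)

def radix_pass(keys, shift):
    """One digit pass of a baseline parallel counting/radix sort."""
    n = len(keys)
    bounds = [n * p // NUM_PE for p in range(NUM_PE + 1)]
    chunks = [keys[bounds[p]:bounds[p + 1]] for p in range(NUM_PE)]

    # (i) each PE builds its OWN intermediate histogram -> capacity overhead
    hist = [[0] * RADIX for _ in range(NUM_PE)]
    for p, chunk in enumerate(chunks):
        for k in chunk:
            hist[p][(k >> shift) & (RADIX - 1)] += 1

    # (ii) a prefix sum across ALL per-PE histograms, in (digit, PE) order,
    # yields each PE's stable write offset per digit -> performance overhead
    offset = [[0] * RADIX for _ in range(NUM_PE)]
    running = 0
    for d in range(RADIX):
        for p in range(NUM_PE):
            offset[p][d] = running
            running += hist[p][d]

    # (iii) the scatter lands at data-dependent positions -> random accesses,
    # which are costly in PIM
    out = [0] * n
    for p, chunk in enumerate(chunks):
        for k in chunk:
            d = (k >> shift) & (RADIX - 1)
            out[offset[p][d]] = k
            offset[p][d] += 1
    return out

def radix_sort(keys, key_bits=16):
    for shift in range(0, key_bits, 4):  # LSD passes, 4 bits at a time
        keys = radix_pass(keys, shift)
    return keys
```

The paper's contribution, per the abstract, is to let a group of PEs share one such intermediate array and to replace the random scatter with a hardware-supported local sorting step.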