4.7 Article

VCSR: An Efficient GPU Memory-Aware Sparse Format

Journal

IEEE Transactions on Parallel and Distributed Systems

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/TPDS.2022.3177291

Keywords

GPU; memory patterns; SpMV; sparse matrices

This paper introduces a novel memory-aware format called VCSR, which outperforms previous formats on a GPU. VCSR achieves high thread-level parallelism and memory utilization by exploiting knowledge of the GPU memory microarchitecture, reducing the number of global memory transactions, and providing a reordering mechanism. Experimental results demonstrate significant performance improvements of VCSR on different GPUs.
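
No kernel code accompanies this abstract, so, purely for orientation, here is a minimal scalar CSR SpMV kernel in CUDA (one thread per row). It is not the paper's VCSR kernel, and all identifiers (csr_spmv_scalar, row_ptr, col_idx, vals) are illustrative. In this common baseline, neighboring threads walk rows of different lengths, so their loads from the value and column-index arrays are generally uncoalesced; this is the kind of excess global memory traffic that memory-aware formats such as VCSR aim to reduce.

```cuda
// Illustrative baseline only (not from the paper): scalar CSR SpMV,
// one thread per row. Threads in the same warp traverse rows of
// different lengths, so reads of vals[] and col_idx[] rarely coalesce.
#include <cuda_runtime.h>

__global__ void csr_spmv_scalar(int n_rows,
                                const int   *row_ptr,  // size n_rows + 1
                                const int   *col_idx,  // size nnz
                                const float *vals,     // size nnz
                                const float *x,        // dense input vector
                                float       *y)        // dense output vector
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n_rows) {
        float sum = 0.0f;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += vals[j] * x[col_idx[j]];
        y[row] = sum;
    }
}
```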
The Sparse Matrix-Vector Multiplication (SpMV) kernel is used in a broad class of linear algebra computations. SpMV computations result in a performance bottleneck in many high performance applications, so optimizing SpMV performance is paramount. While implementing this kernel on a GPU can potentially boost performance significantly, current GPU libraries either provide modest performance gains or are burdened with high sparse format conversion overhead. In this paper we introduce the Vertical Compressed Sparse Row (VCSR) format, a novel memory-aware format that outperforms previously proposed formats on a GPU. We first motivate the design of our baseline VCSR format and then step through a series of enhancements that further improve VCSR's memory efficiency (VCSR-MEM) and performance (VCSR-INTRLV), while also considering conversion overhead. VCSR attempts to produce a high degree of thread-level parallelism and memory utilization by exploiting knowledge of the GPU memory microarchitecture. VCSR can significantly reduce the number of global memory transactions, an issue not addressed by most other sparse formats. In addition, VCSR provides a novel reordering mechanism. It minimizes the size of the compressed matrix, handles both regular and irregular sparse matrices, and can be customized based on matrix size. VCSR also minimizes conversion overhead as compared to full or partial row reordering. Our methodology is highly configurable and can be optimized for any sparse matrix. We have evaluated the VCSR format for the SpMV kernel on two different NVIDIA GPUs, the Kepler K40 and the Volta V100. We compare VCSR with NVIDIA's cuSPARSE library (the HYB format), a state-of-the-art sparse library. We also compare against other state-of-the-art CSR-based formats, including CSR5, merge-based SpMV, and HOLA. We evaluate the benefits of VCSR over the entire University of Florida SuiteSparse matrix collection. The VCSR-baseline format achieves an average speedup ranging from 1.10x to 1.39x when compared to the performance of the four state-of-the-art formats on an NVIDIA V100. While the VCSR-MEM format can save a significant amount of memory space, it is slightly slower than our VCSR-baseline. VCSR-INTRLV performs much better than the VCSR-baseline and, even when including the conversion overhead, achieves an average speedup of 1.08x as compared to HOLA (the best performing format among the prior schemes).
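
The abstract does not describe VCSR's vertical layout or reordering in enough detail to reconstruct it, so the sketch below instead illustrates the general coalescing idea it alludes to: a standard warp-per-row (vector) CSR SpMV kernel in which 32 consecutive lanes load 32 consecutive nonzeros of the same row and then combine their partial sums with warp shuffles. This is a well-known CSR variant, not the VCSR scheme itself, and the identifiers are again illustrative.

```cuda
// Illustrative only (standard vector-CSR, not VCSR): one warp per row, so
// lanes 0..31 read 32 consecutive entries of vals[]/col_idx[] per step and
// the loads coalesce into far fewer global memory transactions.
// Assumes the kernel is launched with blockDim.x a multiple of 32.
#include <cuda_runtime.h>

__global__ void csr_spmv_vector(int n_rows,
                                const int   *row_ptr,
                                const int   *col_idx,
                                const float *vals,
                                const float *x,
                                float       *y)
{
    const int lane = threadIdx.x & 31;                              // lane id within the warp
    const int row  = (blockIdx.x * blockDim.x + threadIdx.x) >> 5;  // global warp id = row id
    if (row >= n_rows) return;  // whole warp exits together (all lanes share 'row')

    float sum = 0.0f;
    for (int j = row_ptr[row] + lane; j < row_ptr[row + 1]; j += 32)
        sum += vals[j] * x[col_idx[j]];

    // Reduce the 32 partial sums within the warp.
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffff, sum, offset);

    if (lane == 0)
        y[row] = sum;
}
```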
