Proceedings Paper

Optimization of Block Sparse Matrix-Vector Multiplication on Shared-Memory Parallel Architectures


We examine the implementation of block compressed row storage (BCSR) sparse matrix-vector multiplication (SpMV) for sparse matrices with dense block substructure, optimized for block sizes from 2x2 to 32x32, on CPU, Intel Many Integrated Core (MIC), and GPU architectures. Previous research on SpMV for matrices with dense block substructure has largely focused on designing novel data structures, either to optimize performance for specific architectures or to store variable-sized, variably-aligned blocks. However, relying on alternative storage formats either breaks compatibility with existing preconditioners and solvers or imposes significant runtime costs for converting between matrix formats. This paper instead focuses on optimizing SpMV using the standard BCSR format. We give a set of algorithms that perform SpMV up to 4x faster than the NVIDIA cuSPARSE cusparseDbsrmv routine, up to 147x faster than the Intel Math Kernel Library (MKL) mkl_dbsrmv routine (a single-threaded BCSR SpMV kernel), and up to 3x faster than the MKL mkl_dcsrmv routine (a multi-threaded CSR SpMV kernel).
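
For readers unfamiliar with the format, the following is a minimal reference sketch of BCSR SpMV in plain Python. It is not the paper's optimized algorithms, which the abstract does not describe. The function name `bcsr_spmv`, the argument names, and the choice of square blocks stored contiguously in row-major order are assumptions made for illustration.

```python
def bcsr_spmv(row_ptr, col_idx, values, x, block_size):
    """Compute y = A * x for a matrix A stored in BCSR form.

    Assumed layout (illustrative, not taken from the paper):
      row_ptr[i] .. row_ptr[i + 1] - 1  index the blocks of block row i
      col_idx[k]                        is the block column of block k
      values                            holds each block_size x block_size
                                        block contiguously, row-major
    """
    b = block_size
    n_block_rows = len(row_ptr) - 1
    y = [0.0] * (n_block_rows * b)

    for i in range(n_block_rows):                      # each block row
        for k in range(row_ptr[i], row_ptr[i + 1]):    # each stored block
            base = k * b * b                           # start of block k in values
            x_off = col_idx[k] * b                     # slice of x this block touches
            for r in range(b):                         # dense b x b multiply
                acc = 0.0
                for c in range(b):
                    acc += values[base + r * b + c] * x[x_off + c]
                y[i * b + r] += acc
    return y


# 4x4 example with 2x2 blocks:
#   [ 1  2  0  0 ]
#   [ 3  4  0  0 ]
#   [ 5  6  7  8 ]
#   [ 9 10 11 12 ]
row_ptr = [0, 1, 3]
col_idx = [0, 0, 1]
values = [1, 2, 3, 4,    # block (0, 0)
          5, 6, 9, 10,   # block (1, 0)
          7, 8, 11, 12]  # block (1, 1)
y = bcsr_spmv(row_ptr, col_idx, values, [1.0, 1.0, 1.0, 1.0], 2)
```

Each stored block needs only one column index, so index overhead is amortized over `block_size * block_size` values, and the inner two loops form a small dense matrix-vector product. Those inner loops are the part that architecture-specific kernels would typically unroll or vectorize.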
