Article

An efficient sparse stiffness matrix vector multiplication using compressed sparse row storage format on AMD GPU

Journal

Publisher

WILEY
DOI: 10.1002/cpe.7186

Keywords

AMD GPU; HPC; performance acceleration; sparse stiffness matrix-vector multiplication

Funding

  1. National Key R&D Program of China [2017YFB0202303]

The performance of sparse stiffness matrix-vector multiplication is crucial for large-scale structural mechanics numerical simulation. This article introduces a new CSR-vector row algorithm that achieves fine-grained computing optimization for sparse stiffness matrices on AMD GPUs, demonstrating efficient reduce operations and deep memory access optimization, resulting in improved computing performance.
The performance of sparse stiffness matrix-vector multiplication is essential for large-scale numerical simulation in structural mechanics. Compressed sparse row (CSR) is the most common format for storing sparse stiffness matrices. However, because stiffness matrices are highly sparse, each row contains very few nonzero elements. As a result, the CSR-scalar, light, and HOLA algorithms leave some GPU threads idle during the calculation, which both degrades computing performance and wastes computing resources. This article proposes a new algorithm, CSR-vector row, for fine-grained computing optimization on the AMD GPU architecture of heterogeneous supercomputers. The algorithm assigns a vector of threads to compute each row, configured according to the number of nonzero elements in the stiffness matrix. CSR-vector row combines efficient reduce operations, deep memory access optimization, and a kernel function configuration scheme that better overlaps memory access with calculation. The algorithm achieves a memory access bandwidth of more than 700 GB/s on the AMD GPU. Compared with the CSR-scalar algorithm, CSR-vector row improves parallel efficiency by 7.2 times, and its floating-point computing performance is 41%-95% higher than that of the light and HOLA algorithms. In addition, when CSR-vector row is applied to examples from CFD, electromagnetics, quantum chemistry, power networks, and semiconductor processes, its memory access bandwidth and double-precision floating-point performance also improve on those of rocSPARSE-CSR-vector.
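
To make the contrast between the CSR-scalar and CSR-vector strategies concrete, the following is a minimal CPU-side sketch in plain Python. It is not the authors' kernel: it only simulates the idea of one thread per row versus a group of cooperating lanes per row followed by a tree reduction. The function names, the `pick_vector_width` heuristic (smallest power of two at or above the mean nonzeros per row), and the `max_width=64` cap (chosen to match the 64-lane wavefront of many AMD GPUs) are assumptions made for illustration.

```python
def csr_scalar_spmv(row_ptr, col_idx, vals, x):
    """CSR-scalar: one simulated thread walks each row sequentially."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[row] = acc
    return y


def pick_vector_width(row_ptr, max_width=64):
    """Hypothetical heuristic (not from the paper): smallest power of two
    that is >= the mean number of nonzeros per row, capped at max_width."""
    n_rows = len(row_ptr) - 1
    mean_nnz = row_ptr[-1] / max(n_rows, 1)
    width = 1
    while width < mean_nnz and width < max_width:
        width *= 2
    return width


def csr_vector_spmv(row_ptr, col_idx, vals, x, width):
    """CSR-vector: `width` simulated lanes cooperate on each row.
    `width` must be a power of two for the tree reduction below."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        start, end = row_ptr[row], row_ptr[row + 1]
        # Each lane accumulates a strided slice of the row's nonzeros.
        partial = [0.0] * width
        for lane in range(width):
            for k in range(start + lane, end, width):
                partial[lane] += vals[k] * x[col_idx[k]]
        # Tree reduction across lanes, standing in for the in-wavefront
        # reduce step a GPU kernel would perform.
        stride = width // 2
        while stride > 0:
            for lane in range(stride):
                partial[lane] += partial[lane + stride]
            stride //= 2
        y[row] = partial[0]
    return y


# Small tridiagonal matrix in CSR form, similar in shape to a 1D stiffness matrix.
row_ptr = [0, 2, 5, 8, 10]
col_idx = [0, 1, 0, 1, 2, 1, 2, 3, 2, 3]
vals = [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0]
x = [1.0, 2.0, 3.0, 4.0]

width = pick_vector_width(row_ptr)
y_scalar = csr_scalar_spmv(row_ptr, col_idx, vals, x)
y_vector = csr_vector_spmv(row_ptr, col_idx, vals, x, width)
```

In this sketch both routines produce the same result vector; the difference lies in how work is distributed. With very short rows, a lane group much wider than the row length leaves most lanes with nothing to do, which is the idle-thread problem the abstract describes and the reason the group size is tied to the nonzero count.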
