Journal
COMPUTER PHYSICS COMMUNICATIONS
Volume 233, Pages 29-40
Publisher
ELSEVIER SCIENCE BV
DOI: 10.1016/j.cpc.2018.06.019
Keywords
Block solver; GPU
Funding
- U.S. Department of Energy, Office of Science, Office of High Energy Physics [DE-AC02-07CH11359]
- U.S. National Science Foundation [PHY14-14614]
- Exascale Computing Project [17-SC-20-SC]
- U.S. Department of Energy Office of Science
- National Nuclear Security Administration
- ORNL
Abstract
The cost of iteratively solving a sparse matrix-vector system against multiple vectors is a common challenge in scientific computing. A tremendous number of algorithmic advances, such as eigenvector deflation and domain-specific multi-grid algorithms, have been broadly beneficial in reducing this cost. However, they do not address the intrinsic memory-bandwidth constraints of the matrix-vector operation that dominates iterative solvers. Batching this operation for multiple vectors and exploiting cache and register blocking can yield a super-linear speedup. Block-Krylov solvers naturally take advantage of such batched matrix-vector operations, further reducing the iterations to solution by sharing the Krylov space between solves. Practical implementations, however, typically suffer from the quadratic scaling in the number of vector-vector operations. We present an implementation of the block Conjugate Gradient algorithm on NVIDIA GPUs which reduces the memory-bandwidth complexity of vector-vector operations from quadratic to linear. As a representative case, we consider the domain of lattice quantum chromodynamics and present results for one of the fermion discretizations. Using the QUDA library as a framework, we demonstrate a 5x speedup compared to highly-optimized independent Krylov solves on NVIDIA's SaturnV cluster. (C) 2018 Elsevier B.V. All rights reserved.
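The structure of the block Conjugate Gradient iteration the abstract refers to can be sketched in plain NumPy. This is a minimal textbook-style sketch for a symmetric positive-definite system with k right-hand sides, not QUDA's GPU implementation: the function name `block_cg` and all parameters are illustrative, and the k x k "step size" matrices are computed with dense solves, whereas the paper's contribution concerns reducing the memory-bandwidth cost of the vector-vector (Gram matrix) operations on GPUs.

```python
import numpy as np

def block_cg(A, B, tol=1e-10, max_iter=1000):
    """Block Conjugate Gradient sketch for SPD A and a block B of
    right-hand sides, shape (n, k). The single A @ P product per
    iteration serves all k vectors at once -- the batched matrix-vector
    operation whose cache/register reuse the abstract describes."""
    X = np.zeros_like(B)
    R = B.copy()                  # residual block, shape (n, k)
    P = R.copy()                  # search-direction block
    RtR = R.T @ R                 # k x k Gram matrix of residuals
    for _ in range(max_iter):
        AP = A @ P                # one batched matvec for all k vectors
        # k x k step matrix, the block analogue of CG's scalar alpha
        alpha = np.linalg.solve(P.T @ AP, RtR)
        X += P @ alpha
        R -= AP @ alpha
        RtR_new = R.T @ R
        if np.sqrt(np.trace(RtR_new)) < tol:
            break
        # k x k direction-update matrix, the block analogue of beta
        beta = np.linalg.solve(RtR, RtR_new)
        P = R + P @ beta
        RtR = RtR_new
    return X
```

Sharing the Krylov space shows up in the k x k `alpha` and `beta` matrices: each solution vector is updated along all k search directions, which is what reduces the iteration count relative to k independent CG solves. (In a production solver the Gram matrices can become rank-deficient as individual vectors converge, which requires deflation or re-orthogonalization not shown here.)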