Journal
COMPUTER PHYSICS COMMUNICATIONS
Volume 233, Pages 29-40
Publisher
ELSEVIER SCIENCE BV
DOI: 10.1016/j.cpc.2018.06.019
Keywords
Block solver; GPU
Funding
- U.S. Department of Energy, Office of Science, Office of High Energy Physics [DE-AC02-07CH11359]
- U.S. National Science Foundation [PHY14-14614]
- Exascale Computing Project [17-SC-20-SC]
- U.S. Department of Energy Office of Science
- National Nuclear Security Administration
- ORNL
Abstract
The cost of iteratively solving a sparse matrix-vector system against multiple vectors is a common challenge in scientific computing. A tremendous number of algorithmic advances, such as eigenvector deflation and domain-specific multi-grid algorithms, have been broadly beneficial in reducing this cost. However, they do not address the intrinsic memory-bandwidth constraints of the matrix-vector operation that dominates iterative solvers. Batching this operation for multiple vectors and exploiting cache and register blocking can yield a super-linear speedup. Block-Krylov solvers naturally take advantage of such batched matrix-vector operations, further reducing the iterations to solution by sharing the Krylov space between solves. Practical implementations, however, typically suffer from the quadratic scaling in the number of vector-vector operations. We present an implementation of the block Conjugate Gradient algorithm on NVIDIA GPUs which reduces the memory-bandwidth complexity of vector-vector operations from quadratic to linear. As a representative case, we consider the domain of lattice quantum chromodynamics and present results for one of the fermion discretizations. Using the QUDA library as a framework, we demonstrate a 5x speedup compared to highly-optimized independent Krylov solves on NVIDIA's SaturnV cluster. (C) 2018 Elsevier B.V. All rights reserved.
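The structure of the block Conjugate Gradient iteration the abstract refers to can be sketched in plain NumPy. This is a minimal textbook-style sketch for a symmetric positive-definite system with k right-hand sides, not QUDA's GPU implementation: the function name `block_cg` and all parameters are illustrative, and the k x k "step size" matrices are computed with dense solves, whereas the paper's contribution concerns reducing the memory-bandwidth cost of the vector-vector (Gram matrix) operations on GPUs.

```python
import numpy as np

def block_cg(A, B, tol=1e-10, max_iter=1000):
    """Block Conjugate Gradient sketch for SPD A and a block B of
    right-hand sides, shape (n, k). The single A @ P product per
    iteration serves all k vectors at once -- the batched matrix-vector
    operation whose cache/register reuse the abstract describes."""
    X = np.zeros_like(B)
    R = B.copy()                  # residual block, shape (n, k)
    P = R.copy()                  # search-direction block
    RtR = R.T @ R                 # k x k Gram matrix of residuals
    for _ in range(max_iter):
        AP = A @ P                # one batched matvec for all k vectors
        # k x k step matrix, the block analogue of CG's scalar alpha
        alpha = np.linalg.solve(P.T @ AP, RtR)
        X += P @ alpha
        R -= AP @ alpha
        RtR_new = R.T @ R
        if np.sqrt(np.trace(RtR_new)) < tol:
            break
        # k x k direction-update matrix, the block analogue of beta
        beta = np.linalg.solve(RtR, RtR_new)
        P = R + P @ beta
        RtR = RtR_new
    return X
```

Sharing the Krylov space shows up in the k x k `alpha` and `beta` matrices: each solution vector is updated along all k search directions, which is what reduces the iteration count relative to k independent CG solves. (In a production solver the Gram matrices can become rank-deficient as individual vectors converge, which requires deflation or re-orthogonalization not shown here.)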