Article

Scalable communication for high-order stencil computations using CUDA-aware MPI

PARALLEL COMPUTING (2022)

Journal

PARALLEL COMPUTING

Volume 111

Publisher

ELSEVIER

DOI: 10.1016/J.PARCO.2022.102904

Keywords

High-performance computing; Graphics processing units; Stencil computations; Computational physics; Magnetohydrodynamics

Funding

Academy of Finland ReSoLVE Centre of Excellence [307411]
European Research Council (ERC) under the European Union [818665]
CHARMS within ASIAA from Academia Sinica, Taiwan

Automated Summary

Modern compute nodes provide high parallelism and processing power. Optimization of data movement is critical for achieving strong scaling in communication-heavy applications. This study explores the computational aspects of iterative stencil loops and implements a communication scheme using CUDA-aware MPI to accelerate magnetohydrodynamics simulations.

Abstract

Modern compute nodes in high-performance computing provide a tremendous level of parallelism and processing power. However, because arithmetic performance has increased faster than memory and network bandwidths, optimizing data movement has become critical for achieving strong scaling in many communication-heavy applications. This performance gap has been further accentuated by the introduction of graphics processing units, which can provide several times higher throughput in data-parallel tasks than central processing units. In this work, we explore the computational aspects of iterative stencil loops and implement a generic communication scheme using CUDA-aware MPI, which we use to accelerate magnetohydrodynamics simulations based on high-order finite differences and third-order Runge-Kutta integration. We put particular focus on improving intra-node locality of workloads. Our GPU implementation scales strongly from one to 64 devices at 50% to 87% of the expected efficiency based on a theoretical performance model. Compared with a multi-core CPU solver, our implementation exhibits 20-60x speedup and 9-12x improved energy efficiency in compute-bound benchmarks on 16 nodes.
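To make the idea of an iterative stencil loop with halo exchange concrete, here is a minimal Python sketch. It is not the paper's implementation. It assumes a 1D periodic domain, a sixth-order central difference for the first derivative (stencil radius 3), a toy explicit Euler update instead of third-order Runge-Kutta, and a plain list copy standing in for the CUDA-aware MPI exchange between devices. All function names are hypothetical.

```python
import math

# Sixth-order central-difference coefficients for d/dx (stencil radius 3).
COEFFS = [-1 / 60, 3 / 20, -3 / 4, 0.0, 3 / 4, -3 / 20, 1 / 60]
RADIUS = 3


def ddx(padded, dx):
    """Apply the stencil to each interior cell of a list that carries
    RADIUS halo cells on both sides."""
    n = len(padded) - 2 * RADIUS
    return [
        sum(c * padded[i + k] for k, c in enumerate(COEFFS)) / dx
        for i in range(n)
    ]


def step_single(u, dx, dt):
    """One Euler step of u_t = -u_x on the whole periodic domain."""
    padded = u[-RADIUS:] + u + u[:RADIUS]
    return [ui - dt * di for ui, di in zip(u, ddx(padded, dx))]


def exchange_halos(parts):
    """Stand-in for the halo exchange: every subdomain ("rank") receives
    RADIUS boundary cells from its left and right periodic neighbours.
    In a real code this copy would be a send/receive between devices."""
    p = len(parts)
    return [
        parts[(r - 1) % p][-RADIUS:] + parts[r] + parts[(r + 1) % p][:RADIUS]
        for r in range(p)
    ]


def step_decomposed(parts, dx, dt):
    """Same Euler step, computed per subdomain after a halo exchange."""
    padded = exchange_halos(parts)
    return [
        [ui - dt * di for ui, di in zip(part, ddx(pad, dx))]
        for part, pad in zip(parts, padded)
    ]


def run(n=48, ranks=4, steps=10):
    """Advance the same initial condition with and without decomposition.
    Assumes n is divisible by ranks and each subdomain has >= RADIUS cells."""
    dx = 2 * math.pi / n
    dt = 0.1 * dx
    u = [math.sin(i * dx) for i in range(n)]
    size = n // ranks
    parts = [u[r * size:(r + 1) * size] for r in range(ranks)]
    for _ in range(steps):
        u = step_single(u, dx, dt)
        parts = step_decomposed(parts, dx, dt)
    gathered = [x for part in parts for x in part]
    return u, gathered


if __name__ == "__main__":
    single, gathered = run()
    print("decomposed result matches single-domain result:", single == gathered)
```

The point of the sketch is that each subdomain only needs a halo of width equal to the stencil radius from its neighbours before every update. The amount of data exchanged per step therefore grows with the stencil order, which is why communication cost matters for high-order schemes.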
