Article

Heterogeneous acceleration algorithms for shallow cumulus convection scheme over GPU clusters

Publisher

Elsevier
DOI: 10.1016/j.future.2023.04.021

Keywords

High performance computing; Shallow cumulus model; Heterogeneous computing; Graphics processing unit; CUDA; HIP


Abstract
The physical process of atmospheric cumulus convection plays a crucial role in climate modeling, but its computational complexity severely restricts the development of high-resolution climate models. Accelerating the cumulus convection calculation in climate models is therefore a significant challenge: traditional CPU-based computing is increasingly unable to meet the growing demand for computing resources from high-resolution climate models, so developing an efficient cumulus convection scheme is both valuable and necessary. In response to this demand, this paper takes the University of Washington shallow cumulus (UWshcu) model as its research object and proposes parallel algorithms suitable for large-scale, heterogeneous, high-performance computing systems:

(1) GPU-UWshcu, a single-GPU acceleration algorithm based on CUDA C;
(2) CGPUs-UWshcu, a multi-NVIDIA-GPU acceleration algorithm based on the MPI+CUDA hybrid programming model;
(3) HGPUs-UWshcu, a multi-AMD-GPU acceleration algorithm based on MPI+HIP;
(4) MOH-UWshcu, a multiple-CPUs+GPUs acceleration algorithm based on MPI+OpenMP+HIP.

Experimental results show that these algorithms are efficient and scale well. GPU-UWshcu achieves a speedup of 74.39x on a single Tesla V100 GPU compared with the serial algorithm running on an Intel Xeon E5-2680 v2 CPU core, and CGPUs-UWshcu achieves a 151.22x speedup on 16 Tesla V100 GPUs compared to a single Intel Xeon E5-2630 v4 CPU (10 cores). On the ORISE supercomputer, HGPUs-UWshcu uses 1024 AMD GPUs to achieve a 664.65x speedup compared to a single CPU (32 cores), with a parallel efficiency of 68.91% relative to using 32 GPUs. Compared to the same number of CPU cores, MOH-UWshcu uses 8192 CPU cores + 1024 GPUs to achieve a speedup of 4.98x, reaching 55.22 TFLOPS in double precision. (c) 2023 Elsevier B.V. All rights reserved.
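The speedup and parallel-efficiency figures quoted above follow the standard definitions used in scalability studies. A minimal sketch of those definitions; all timings and configuration numbers below are hypothetical illustrations, not measured values from the paper:

```python
def speedup(t_baseline, t_parallel):
    """Speedup S = T_baseline / T_parallel."""
    return t_baseline / t_parallel

def parallel_efficiency(s_small, p_small, s_large, p_large):
    """Efficiency of scaling from p_small to p_large processors,
    measured relative to the smaller configuration (as the paper does
    with its 32-GPU baseline)."""
    return (s_large / s_small) / (p_large / p_small)

# Hypothetical timings (seconds), for illustration only:
t_cpu_serial = 1000.0
t_single_gpu = 13.4
print(round(speedup(t_cpu_serial, t_single_gpu), 2))  # → 74.63

# Hypothetical speedups at 32 and 1024 processors, for illustration only:
print(round(parallel_efficiency(1.0, 32, 22.05, 1024), 3))  # → 0.689
```

With these definitions, a parallel efficiency near 69% at a 32x increase in GPU count (as reported for HGPUs-UWshcu) means the 1024-GPU run delivers roughly 22x the throughput of the 32-GPU run rather than the ideal 32x.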


