Journal
Future Generation Computer Systems: The International Journal of eScience
Volume 146, Pages 166-177
Publisher
Elsevier
DOI: 10.1016/j.future.2023.04.021
Keywords
High performance computing; Shallow cumulus model; Heterogeneous computing; Graphics processing unit; CUDA; HIP
Summary
The physical process of atmospheric cumulus convection is crucial in climate modeling, but its computational complexity hinders the development of high-resolution models. This paper proposes parallel algorithms for the University of Washington shallow cumulus (UWshcu) model suitable for large-scale, heterogeneous computing systems. The experimental results demonstrate the efficiency and scalability of these algorithms.
Abstract
The physical process of atmospheric cumulus convection plays a crucial role in climate modeling, and its computational complexity severely restricts the development of high-resolution climate models. Accelerating the calculation of cumulus convection in climate models is a significant challenge: traditional CPU-based computing is increasingly unable to meet the growing demand for computing resources from high-resolution climate models, so an efficient cumulus convection scheme is both valuable and necessary. In response to this demand, this paper takes the University of Washington shallow cumulus (UWshcu) model as its research object and proposes parallel algorithms for it suitable for large-scale, heterogeneous, high-performance computing systems:
(1) GPU-UWshcu, a single-GPU acceleration algorithm based on CUDA C;
(2) CGPUs-UWshcu, a multi-NVIDIA-GPU acceleration algorithm based on the MPI+CUDA hybrid programming model;
(3) HGPUs-UWshcu, a multi-AMD-GPU acceleration algorithm based on MPI+HIP;
(4) MOH-UWshcu, a multi-CPU+GPU acceleration algorithm based on MPI+OpenMP+HIP.
Experimental results show that these algorithms are efficient and scale well. GPU-UWshcu achieves a 74.39x speedup on a single Tesla V100 GPU over the serial algorithm running on one Intel Xeon E5-2680 v2 CPU core, and CGPUs-UWshcu achieves a 151.22x speedup on 16 Tesla V100 GPUs over a single Intel Xeon E5-2630 v4 CPU (10 cores). On the ORISE supercomputer, HGPUs-UWshcu achieves a 664.65x speedup on 1024 AMD GPUs over a single CPU (32 cores), with a parallel efficiency of 68.91% relative to 32 GPUs. Using 8192 CPU cores plus 1024 GPUs, MOH-UWshcu achieves a 4.98x speedup over the same number of CPU cores alone, reaching 55.22 TFLOPS in double precision. (c) 2023 Elsevier B.V. All rights reserved.