Journal
COMPUTER PHYSICS COMMUNICATIONS
Volume 271, Issue -, Pages -
Publisher
ELSEVIER
DOI: 10.1016/j.cpc.2021.108231
Keywords
Scale-resolving simulation; Unstructured mesh; Heterogeneous computing; CPU+GPU; MPI+OpenMP+OpenCL; Hybrid supercomputer
Funding
- Russian Science Foundation [19-11-00299]
This paper presents a heterogeneous parallel algorithm and its software implementation for simulating compressible turbulent flows. The algorithm is based on a family of higher accuracy edge-based reconstruction schemes on unstructured mixed-element meshes. The parallel solution can utilize a large number of computing devices on various computing architectures, including manycore CPUs and GPUs. The paper provides a detailed description of the parallel algorithm and its efficient implementation, as well as demonstrations of parallel performance on different supercomputers.
A heterogeneous parallel algorithm for simulation of compressible turbulent flows and its portable software implementation are presented. The underlying numerical method is based on a family of higher accuracy edge-based reconstruction schemes on unstructured mixed-element meshes. The proposed parallel solution can engage a large number of computing devices of most of the existing computing architectures used in modern supercomputers, including manycore CPUs and GPUs. It is capable of co-execution on both CPUs and accelerators simultaneously. The multilevel parallel algorithm combines: MPI for distributing workload among hybrid cluster nodes and between devices inside nodes; OpenMP for manycore CPUs and other supporting devices, such as Intel Xeon Phi; OpenCL for massively-parallel accelerators, such as GPUs of various vendors, including NVIDIA, AMD, Intel. The main focus is on the adaptation of the numerical method and its computational algorithm to the stream processing parallel paradigm. The very limited device memory inherent in GPU computing is also taken into account. A detailed description of the parallel algorithm is presented, as well as the techniques used for its efficient parallel implementation. Special attention is paid to implicit time integration with its linear solver and calculation of convective fluxes and viscous terms. The use of mixed floating-point precision and overlapping communications and computations is also discussed. Parallel performance is demonstrated in practical applications on different kinds of supercomputers using up to 10 thousand cores and multiple GPUs of comparable overall performance. (C) 2021 Elsevier B.V. All rights reserved.