Article

A simple one-step index algorithm for implementation of lattice Boltzmann method on GPU

Journal

COMPUTER PHYSICS COMMUNICATIONS
Volume 283

Publisher

ELSEVIER
DOI: 10.1016/j.cpc.2022.108603

Keywords

Lattice Boltzmann method; One-step index algorithm; High-performance computing; Multi-GPUs

We propose a simple one-step index (OSI) algorithm for solving the lattice Boltzmann equation that performs the streaming of particle distribution functions (PDFs) on a single grid system. Derived from the conventional A-B pattern, the algorithm keeps the memory addresses of the PDFs fixed, consistent with collision principles, and reassigns their indexes so that the streaming process is computed implicitly. It is simple to program and well suited to GPUs, on which it shows high performance and parallel efficiency.
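The idea of streaming by index reassignment on a single array can be illustrated with a small sketch. This is not the OSI index formula from the paper, which the abstract does not give; it is a generic time-shifted index mapping on a periodic one-dimensional D1Q3 lattice, chosen as an assumption for illustration. The function names (`collide`, `step_ab`, `step_indexed`, `logical_view`), the BGK collision, and the relaxation rate `omega` are likewise hypothetical. The sketch compares a two-array A-B update against a single-array update in which every PDF is read and written once at the same address.

```python
import numpy as np

C = np.array([0, 1, -1])              # D1Q3 lattice velocities
W = np.array([2 / 3, 1 / 6, 1 / 6])   # D1Q3 weights


def collide(f, omega=1.2):
    """BGK collision on an array of shape (3, n); purely local in space."""
    rho = f.sum(axis=0)
    u = (C[:, None] * f).sum(axis=0) / rho
    cu = C[:, None] * u[None, :]
    feq = W[:, None] * rho * (1 + 3 * cu + 4.5 * cu**2 - 1.5 * u**2)
    return f + omega * (feq - f)


def step_ab(f):
    """Reference A-B pattern: collide, then stream into a second array."""
    post = collide(f)
    out = np.empty_like(post)
    for q in range(3):
        out[q] = np.roll(post[q], C[q])   # out[q][x + c_q] = post[q][x]
    return out


def slots_at(t, n):
    """Assumed mapping: f_q(x) at step t lives in slot (x - t * c_q) mod n."""
    x = np.arange(n)
    return (x[None, :] - t * C[:, None]) % n


def logical_view(store, t):
    """Gather the logical field f[q, x] at step t from the single array."""
    q = np.arange(3)[:, None]
    return store[q, slots_at(t, store.shape[1])]


def step_indexed(store, t):
    """Single array: read each PDF once, collide, write it back in place."""
    q = np.arange(3)[:, None]
    slots = slots_at(t, store.shape[1])
    store[q, slots] = collide(store[q, slots])  # streaming is implied at t + 1


n = 16
x = np.arange(n)
f0 = W[:, None] * (1 + 0.1 * np.sin(2 * np.pi * x / n))[None, :]
ref, store = f0.copy(), f0.copy()
for t in range(10):
    ref = step_ab(ref)
    step_indexed(store, t)
print(np.allclose(logical_view(store, 10), ref))
```

In this sketch no data moves between addresses: the value written for position x at step t is simply interpreted as belonging to position x + c_q at step t + 1. Boundary handling and the three-dimensional D3Q19 layout are left out.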
We propose a simple one-step index (OSI) algorithm for solving the lattice Boltzmann equation, in particular for the streaming of particle distribution functions (PDFs) on a single grid system. The OSI algorithm is derived from the conventional A-B pattern. The memory addresses of the PDFs are fixed in this algorithm and consistent with collision principles. The streaming process is computed implicitly by reassigning the indexes of the PDFs according to their time steps, spatial coordinates, and directions. The algorithm is simple to program because it reads and writes each PDF only once per time step and does not require the synchronization of odd and even time steps. In this implementation, the PDFs are stored in a structure-of-arrays (SoA) layout, which suits the memory access pattern of graphics processing units (GPUs). The accuracy and single-precision performance of the proposed algorithm were validated and tested for a three-dimensional lid-driven cavity flow simulation with the D3Q19 model on an NVIDIA A100 (40 GB, PCIe) using CUDA and OpenACC. Performances of 8.4 and 8.1 giga lattice updates per second (GLUPS) were obtained for CUDA and OpenACC, respectively. OpenACC thus reaches about 95% of the CUDA performance with significantly less programming work. The bandwidth utilization on a single GPU was 96% and 94% for CUDA and OpenACC, respectively, close to the theoretical limit. For multi-GPU use, the lattice Boltzmann method was parallelized with CUDA and MPI. Finally, computation was overlapped with communication to improve parallel efficiency; the weak-scaling parallel efficiency exceeded 0.98 on up to 512 GPUs. © 2022 Elsevier B.V. All rights reserved.
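The throughput figures can be related to memory traffic with a short calculation. The sketch below assumes that each lattice update moves exactly one read and one write of every D3Q19 PDF in single precision (2 × 19 × 4 = 152 bytes) and ignores any other traffic such as boundary flags or macroscopic output. That accounting is an assumption made here, not a number stated in the abstract, and the helper name `effective_bandwidth_gb_s` is hypothetical.

```python
Q = 19                # number of PDFs per lattice node in D3Q19
BYTES_PER_VALUE = 4   # single precision
# Assumption: one read and one write of each PDF per lattice update.
BYTES_PER_UPDATE = 2 * Q * BYTES_PER_VALUE


def effective_bandwidth_gb_s(glups):
    """Memory traffic in GB/s implied by a throughput given in GLUPS."""
    return glups * BYTES_PER_UPDATE


cuda_glups, openacc_glups = 8.4, 8.1   # values reported in the abstract
cuda_bw = effective_bandwidth_gb_s(cuda_glups)
openacc_bw = effective_bandwidth_gb_s(openacc_glups)
ratio = openacc_glups / cuda_glups

print(f"bytes per update: {BYTES_PER_UPDATE}")
print(f"CUDA: {cuda_bw:.1f} GB/s, OpenACC: {openacc_bw:.1f} GB/s")
print(f"OpenACC / CUDA throughput: {ratio:.3f}")
```

Under this accounting the reported throughputs correspond to roughly 1277 GB/s and 1231 GB/s of sustained traffic, and the OpenACC-to-CUDA ratio is about 0.96. The abstract does not state which reference bandwidth its 96% and 94% utilization figures use, so the sketch does not attempt to reproduce them.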
