Journal
JOURNAL OF SUPERCOMPUTING
Volume 77, Issue 10, Pages 11911-11929
Publisher
SPRINGER
DOI: 10.1007/s11227-021-03762-z
Keywords
LRnLA algorithms; Lattice-Boltzmann method; Parallel computing; CUDA GPU
Funding
- Russian Science Foundation [18-71-10004]
This paper explores data flow arrangement possibilities, defines a new propagation scheme, and implements it as GPU program code using a locally recursive non-locally asynchronous algorithm construction method, achieving performance up to 10 GLUps on a single nVidia GeForce RTX 3090 GPU.
The LBM produces stencil numerical schemes that fall into the memory-bound domain, so performance can be multiplied if the arithmetic intensity is increased. In this paper, the data flow arrangement possibilities at the streaming step are explored, with the aim of developing the most efficient algorithms and implementations of the LBM schemes. The locally recursive non-locally asynchronous algorithm construction method is used for this purpose. The method is based on the analysis of the dependency graph of the task in the dD1T Minkowski space. The schemes of well-known propagation patterns are illustrated and analyzed. With the knowledge of their advantages and drawbacks, a new propagation scheme is defined. The best propagation scheme constructed with this method is implemented as program code for GPU. The description of the code and the performance results are provided. The obtained performance is up to 10 GLUps on a single nVidia GeForce RTX 3090 GPU.
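The memory-bound argument above can be made concrete with a rough roofline-style estimate. The sketch below is illustrative only and is not taken from the paper: the function name, the D3Q19 lattice, single-precision storage, and the nominal 936 GB/s peak memory bandwidth of the RTX 3090 are all assumptions. It assumes each lattice update reads and writes every population once per pass over memory, and that fusing several time steps into one pass divides the traffic per update accordingly.

```python
def memory_bound_glups(bandwidth_gb_s, q, bytes_per_value, steps_per_pass=1):
    """Upper bound on billions of lattice updates per second (GLUPS)
    for a memory-bound LBM kernel.

    bandwidth_gb_s  -- peak memory bandwidth in GB/s (assumed, nominal)
    q               -- number of populations per lattice node (e.g. 19 for D3Q19)
    bytes_per_value -- storage size of one population (4 for float, 8 for double)
    steps_per_pass  -- time steps performed per load/store of the data
                       (hypothetical parameter modelling increased
                       arithmetic intensity)
    """
    if steps_per_pass < 1:
        raise ValueError("steps_per_pass must be at least 1")
    # One read and one write of each population per pass over memory,
    # amortised over the time steps completed in that pass.
    bytes_per_update = 2 * q * bytes_per_value / steps_per_pass
    return bandwidth_gb_s / bytes_per_update


if __name__ == "__main__":
    # Assumed values: D3Q19, single precision, 936 GB/s nominal bandwidth.
    for steps in (1, 2, 4):
        bound = memory_bound_glups(936.0, q=19, bytes_per_value=4,
                                   steps_per_pass=steps)
        print(f"steps per pass = {steps}: bound = {bound:.2f} GLUPS")
```

Under these assumed parameters, the bound for one time step per pass is about 6.2 GLUPS, so a figure of 10 GLUps would only be reachable if the data traffic per update is reduced, which is the motivation for raising arithmetic intensity. The actual lattice, precision, and traffic model used by the authors may differ.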