Proceedings Paper

Accelerating unstructured-grid CFD algorithms on NVIDIA and AMD GPUs

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/IA354616.2021.00010

Keywords

Unstructured grid CFD; GPU Performance; Performance Portability; AMD ROCm; Atomic Update

Funding

  1. NASA Langley Research Center CIF/IRAD program
  2. NASA Transformational Tools and Technologies (TTT) Project of the Transformative Aeronautics Concepts Program under the Aeronautics Research Mission Directorate
  3. National Institute of Aerospace [NNLO9AAOOA]

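Several optimization methods were studied to improve the GPU efficiency of kernels dominated by atomic update costs, with a primary focus on the AMD MI100 GPU. Techniques combining register shuffling and on-chip shared memory reduced the run-time of select MI100 kernels by a factor of 2.5 to 6.0, while results on the NVIDIA GPUs were mixed: the A100 generally benefited, but the V100 was often slower.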
Computational performance of the FUN3D unstructured-grid computational fluid dynamics (CFD) application on GPUs is highly dependent on the efficiency of the floating-point atomic updates needed to support irregular cell-, edge-, and node-based data access patterns in massively parallel GPU environments. We examine several optimization methods to improve the GPU efficiency of performance-critical kernels that are dominated by atomic update costs on NVIDIA V100/A100 and AMD CDNA MI100 GPUs. Optimization on the AMD MI100 GPU was of primary interest, since similar hardware will be used in the upcoming Frontier supercomputer. Techniques combining register shuffling and on-chip shared memory were used to transpose and/or aggregate results among collaborating GPU threads before atomically updating global memory. These techniques, along with algorithmic optimizations to reduce the update frequency, reduced the run-time of select kernels on the MI100 GPU by a factor of 2.5 to 6.0 compared with atomically updating global memory directly. The performance impact on the NVIDIA GPUs was mixed: the V100 was often slower when using register-based aggregation/transposition techniques, while the A100 generally benefited from these methods, though to a lesser extent than measured on the MI100 GPU. Overall, both the V100 and A100 GPUs outperformed the MI100 GPU on kernels dominated by double-precision atomic updates; however, the techniques demonstrated here narrowed the performance gap by improving MI100 performance.
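The aggregation idea described in the abstract can be illustrated with a small serial C++ model. This is only a sketch under stated assumptions, not the FUN3D kernels or GPU code: the group size of 32, the names `scatter_direct`, `scatter_aggregated`, and `UpdateStats`, and the simple match-and-sum scheme are all hypothetical. Each contribution stands in for one GPU thread's result, each `+=` on `residual` stands in for one atomic update to global memory, and the inner loop stands in for the exchange that a GPU would perform with register shuffles or shared memory.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed number of collaborating threads per group (hypothetical choice).
constexpr std::size_t kGroupSize = 32;

struct UpdateStats {
    std::vector<double> residual;    // per-node sums; models global memory
    std::size_t global_updates = 0;  // count of modeled atomic updates
};

// Baseline: every contribution updates global memory directly.
UpdateStats scatter_direct(const std::vector<int>& node,
                           const std::vector<double>& flux,
                           std::size_t num_nodes) {
    assert(node.size() == flux.size());
    UpdateStats s;
    s.residual.assign(num_nodes, 0.0);
    for (std::size_t i = 0; i < node.size(); ++i) {
        s.residual[static_cast<std::size_t>(node[i])] += flux[i];
        ++s.global_updates;
    }
    return s;
}

// Aggregated: within each group of kGroupSize consecutive contributions,
// values that target the same node are summed first, so only one global
// update is issued per distinct target node in the group.
UpdateStats scatter_aggregated(const std::vector<int>& node,
                               const std::vector<double>& flux,
                               std::size_t num_nodes) {
    assert(node.size() == flux.size());
    UpdateStats s;
    s.residual.assign(num_nodes, 0.0);
    for (std::size_t base = 0; base < node.size(); base += kGroupSize) {
        const std::size_t end = std::min(base + kGroupSize, node.size());
        std::array<bool, kGroupSize> merged{};  // lanes already folded into a peer
        for (std::size_t lane = base; lane < end; ++lane) {
            if (merged[lane - base]) continue;
            double sum = flux[lane];
            // Collect matching contributions from later lanes in the group.
            for (std::size_t peer = lane + 1; peer < end; ++peer) {
                if (node[peer] == node[lane]) {
                    sum += flux[peer];
                    merged[peer - base] = true;
                }
            }
            s.residual[static_cast<std::size_t>(node[lane])] += sum;
            ++s.global_updates;
        }
    }
    return s;
}
```

For six contributions that target nodes `{0, 0, 0, 1, 1, 2}`, the direct version issues six modeled updates and the aggregated version issues three, with identical per-node sums. The saving in this model depends entirely on how often contributions within a group share a target node. The model also says nothing about the cost of the exchange itself, which the abstract reports can outweigh the benefit on some hardware (the V100).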
