Article

Acceleration of a Production-Level Unstructured Grid Finite Volume CFD Code on GPU

Journal

APPLIED SCIENCES-BASEL
Volume 13, Issue 10

Publisher

MDPI
DOI: 10.3390/app13106193

Keywords

unstructured-grid; CFD; shared memory parallelization; GPU; data racing


Abstract

Due to the complex topological relationships, poor data locality, and data-racing problems in unstructured CFD computing, efficiently parallelizing finite volume method algorithms in shared memory to exploit the hardware capabilities of many-core GPUs is a significant challenge. Based on a production-level unstructured CFD software package, three shared-memory parallel programming strategies, atomic operation, colouring, and reduction, were designed and implemented after a deep analysis of the code's computing behaviour and memory access patterns. Several data locality optimization methods, namely grid reordering, loop fusion, and multi-level memory access, were proposed. To address the inherently sequential nature of the LU-SGS solution, two methods based on cell colouring and hyperplanes were implemented. All the parallel methods and optimization techniques were comprehensively analysed and evaluated on three-dimensional grids of the M6 wing and the CHN-T1 aeroplane. The results show that the Cuthill-McKee grid renumbering and loop fusion optimization techniques improve memory access performance by 10%. The proposed reduction strategy, combined with multi-level memory access optimization, has a significant acceleration effect, speeding up the hot-spot subroutine with data races by a factor of three. Compared with the serial CPU version, the overall speed-up of the GPU code reaches 127; compared with the parallel CPU version with the same number of Message Passing Interface (MPI) ranks, the overall speed-up exceeds 30.
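The colouring strategy mentioned in the abstract avoids data races in face-based loops by partitioning faces into colours such that no two faces of the same colour touch the same cell; faces within a colour can then scatter flux contributions to cells concurrently, and colours are processed one after another. The sketch below is a minimal greedy face-colouring illustration, not the authors' implementation; the mesh representation (faces as `(left_cell, right_cell)` pairs) is an assumption for the example.

```python
def colour_faces(faces, n_cells):
    """Greedily colour faces so that no two faces sharing a cell
    receive the same colour.

    faces: list of (left_cell, right_cell) index pairs.
    Returns a list giving the colour of each face.
    """
    # Colours already taken at each cell by previously coloured faces.
    cell_colours = [set() for _ in range(n_cells)]
    face_colour = []
    for lc, rc in faces:
        used = cell_colours[lc] | cell_colours[rc]
        c = 0
        while c in used:  # smallest colour free at both cells
            c += 1
        face_colour.append(c)
        cell_colours[lc].add(c)
        cell_colours[rc].add(c)
    return face_colour

# A 1-D chain of 4 cells: adjacent faces share a cell, so they must
# alternate colours, while faces 0 and 2 can reuse the same colour.
faces = [(0, 1), (1, 2), (2, 3)]
print(colour_faces(faces, 4))  # → [0, 1, 0]
```

On a GPU, each colour would map to one race-free kernel launch over its faces, trading a single racy loop for a few smaller independent ones.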
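The Cuthill-McKee grid renumbering credited with part of the memory-access improvement relabels cells so that neighbouring cells get nearby indices, reducing the bandwidth of the cell-adjacency graph and improving locality. A minimal breadth-first sketch under an assumed adjacency-list mesh representation (again, an illustration, not the paper's code):

```python
from collections import deque

def cuthill_mckee(adj):
    """adj: adjacency list of the cell graph; returns a new cell ordering
    that tends to place connected cells close together."""
    n = len(adj)
    visited = [False] * n
    order = []
    # Start from low-degree cells, as in the classic algorithm.
    for start in sorted(range(n), key=lambda v: len(adj[v])):
        if visited[start]:
            continue
        visited[start] = True
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # Visit neighbours in order of increasing degree.
            for w in sorted(adj[v], key=lambda u: len(adj[u])):
                if not visited[w]:
                    visited[w] = True
                    queue.append(w)
    return order

def bandwidth(adj, order):
    """Largest index distance between any pair of adjacent cells."""
    pos = {v: i for i, v in enumerate(order)}
    return max(abs(pos[v] - pos[w]) for v in range(len(adj)) for w in adj[v])

# A path of 5 cells labelled out of order: 0-2-4-1-3.
adj = [[2], [4, 3], [0, 4], [1], [2, 1]]
print(bandwidth(adj, list(range(5))))    # natural ordering → 3
print(bandwidth(adj, cuthill_mckee(adj)))  # after renumbering → 1
```

A smaller bandwidth means a cell and its neighbours land in nearby memory, which is the locality effect the abstract attributes to grid reordering.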
