Journal
APPLIED SCIENCES-BASEL
Volume 13, Issue 10
Publisher
MDPI
DOI: 10.3390/app13106193
Keywords
unstructured-grid; CFD; shared memory parallelization; GPU; data racing
Summary
Due to the complexity of unstructured CFD computing, parallelizing finite volume method algorithms in shared memory for many-core GPUs is a significant challenge. Three parallel programming strategies and several data locality optimization methods were implemented and evaluated. The proposed methods improved memory access performance and achieved significant acceleration compared to CPU versions.
Abstract
Due to the complex topological relationships, poor data locality, and data-race problems in unstructured CFD computing, parallelizing finite volume method algorithms in shared memory to efficiently exploit the hardware capabilities of many-core GPUs is a significant challenge. Based on production-level unstructured CFD software, three shared-memory parallel programming strategies (atomic operation, colouring, and reduction) were designed and implemented after a deep analysis of its computing behaviour and memory access patterns. Several data locality optimization methods, namely grid reordering, loop fusion, and multi-level memory access, were proposed. To address the inherently sequential nature of the LU-SGS solution, two parallelization methods based on cell colouring and hyperplanes were implemented. All the parallel methods and optimization techniques were comprehensively analysed and evaluated on three-dimensional grids of the M6 wing and the CHN-T1 aeroplane. The results show that the Cuthill-McKee grid renumbering and loop fusion optimizations improve memory access performance by 10%. The proposed reduction strategy, combined with multi-level memory access optimization, has a significant acceleration effect, speeding up the hot-spot subroutine with data races by a factor of three. Compared with the serial CPU version, the overall speed-up of the GPU code reaches 127; compared with the parallel CPU version with the same number of Message Passing Interface (MPI) ranks, the speed-up exceeds thirty.
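The data race the abstract refers to arises in the face loop of a finite volume scheme: each face adds its flux to its owner cell and subtracts it from its neighbour cell, so two faces touching the same cell conflict when processed concurrently. The colouring strategy resolves this by grouping faces so that no two faces of the same colour share a cell. A minimal sketch of that idea, using hypothetical example data and plain Python (rather than CUDA) for brevity; this is not the paper's code:

```python
# Greedy face colouring: faces that share a cell get different colours, so
# all faces of one colour can update cell residuals in parallel without
# atomics. The mesh below is a made-up four-cell example.

def colour_faces(faces):
    """faces: list of (owner, neighbour) cell index pairs.
    Returns one colour index per face such that no two faces of the
    same colour touch a common cell."""
    colours = []
    for owner, neigh in faces:
        # Colours already taken by earlier faces sharing a cell with this one.
        used = {colours[i] for i, (o, n) in enumerate(faces[:len(colours)])
                if {o, n} & {owner, neigh}}
        c = 0
        while c in used:
            c += 1
        colours.append(c)
    return colours

# Faces of a tiny four-cell mesh, given as adjacent cell pairs.
faces = [(0, 1), (1, 2), (2, 3), (0, 2)]
print(colour_faces(faces))  # → [0, 1, 0, 2]
```

On a GPU, each colour would typically correspond to one kernel launch or one synchronized pass, trading some concurrency for race-free scatter updates; the reduction strategy the abstract contrasts with this instead has each cell gather the precomputed fluxes of its own faces, avoiding the conflict altogether.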