Article

Parallelizing and optimizing large-scale 3D multi-phase flow simulations on the Tianhe-2 supercomputer

Journal

Concurrency and Computation: Practice and Experience
Volume 28, Issue 5, Pages 1678-1692

Publisher

Wiley
DOI: 10.1002/cpe.3717

Keywords

heterogeneous system; Intel Xeon Phi; Tianhe-2; multi-phase flow; LBM

Funding

  1. Basic Research Program of National University of Defense Technology [ZDYYJCYJ20140101]
  2. Open Research Program of China State Key Laboratory of Aerodynamics [SKLA20140104]
  3. IAPCM Application Research Program for High Performance Computing [R2015-0402-01]
  4. National Science Foundation of China [11502296]


The lattice Boltzmann method (LBM) is a widely used computational fluid dynamics (CFD) method for flow problems with complex geometries and various boundary conditions. Large-scale LBM simulations with increasing resolution and extended temporal range require massive high-performance computing (HPC) resources, motivating us to port the method onto modern many-core heterogeneous supercomputers such as Tianhe-2. Although many-core accelerators such as graphics processing units (GPUs) and Intel MIC offer a dramatic advantage in floating-point performance and power efficiency over CPUs, they also pose a tough challenge for parallelizing and optimizing CFD codes on large-scale heterogeneous systems. In this paper, we parallelize and optimize the open-source 3D multi-phase LBM code openlbmflow on the Intel Xeon Phi (MIC) accelerated Tianhe-2 supercomputer using a hybrid, heterogeneous MPI+OpenMP+Offload+single instruction, multiple data (SIMD) programming model. With cache blocking and a SIMD-friendly data-structure transformation, we dramatically improve SIMD and cache efficiency in single-thread performance on both the CPU and the Phi, achieving speedups of 7.9X and 8.8X, respectively, over the baseline code. To make CPUs and Phi coprocessors collaborate efficiently, we propose a load-balance scheme that distributes workloads among the two CPUs and three Phi coprocessors within each node, and we use an asynchronous model to overlap the collaborative computation and communication as much as possible. The collaborative approach with two CPUs and three Phi coprocessors improves performance by around 3.2X compared with the CPU-only approach. Scalability tests show that openlbmflow achieves a parallel efficiency of about 60% on 2048 nodes, with about 400K cores in total. To the best of our knowledge, this is the largest-scale CPU-MIC collaborative LBM simulation of 3D multi-phase flow problems. Copyright (c) 2015 John Wiley & Sons, Ltd.

