Proceedings Paper

Gravitational Octree Code Performance Evaluation on Volta GPU

PROCEEDINGS OF THE 48TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING (ICPP 2019) (2019)

Journal

PROCEEDINGS OF THE 48TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING (ICPP 2019)

Publisher

ASSOC COMPUTING MACHINERY

DOI: 10.1145/3337821.3337845

Keywords

GPU computing; Volta architecture; performance modeling; N-body simulation

Funding

Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures of Global Scientific Information and Computing Center, Tokyo Institute of Technology
High Performance Computing Infrastructure in Japan of Global Scientific Information and Computing Center, Tokyo Institute of Technology [jh180045-NAH]
TSUBAME Encouragement Program for Young/Female Users of Global Scientific Information and Computing Center, Tokyo Institute of Technology

Abstract

In this study, a gravitational octree code originally optimized for the Fermi, Kepler, and Maxwell GPU architectures is adapted to the Volta architecture. Volta introduces independent thread scheduling, which requires either inserting explicit synchronizations at appropriate locations or enforcing the same implicit synchronizations as on Pascal and earlier architectures by specifying -gencode arch=compute_60,code=sm_70. Performance measurements on Tesla V100, the current flagship GPU by NVIDIA, revealed that N-body simulations of the Andromeda galaxy model with 2^23 = 8 388 608 particles took 3.8 × 10^-2 s or 3.3 × 10^-2 s per step without or with the implicit synchronizations, respectively. Tesla V100 achieves a 1.4- to 2.2-fold speed-up over Tesla P100, the flagship GPU of the previous generation. The observed speed-up of 2.2 exceeds 1.5, the ratio of the theoretical peak performance of the two GPUs. Because the integer units are independent of the floating-point units, integer and floating-point operations can execute concurrently. This overlap hides the execution time of the integer operations and leads to a speed-up above the theoretical peak performance ratio. Tesla V100 can run an N-body simulation with up to 25 × 2^20 = 26 214 400 particles, taking 2.0 × 10^-1 s per step. This corresponds to 3.5 TFlop/s, which is 22% of the single-precision theoretical peak performance.
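
The figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is only a reader's sanity check, not part of the paper: all variable names are hypothetical, the inputs are the rounded values from the abstract, and the derived quantities (implied peak, operations per particle) inherit that rounding and whatever flop-counting convention the authors used.

```python
# Values copied from the abstract (Tesla V100); names are illustrative only.
n_small = 2**23            # particles in the Andromeda galaxy model run
t_explicit = 3.8e-2        # s per step without the implicit synchronizations
t_implicit = 3.3e-2        # s per step with the implicit synchronizations
n_large = 25 * 2**20       # largest particle count reported for Tesla V100
t_large = 2.0e-1           # s per step at that particle count
sustained_tflops = 3.5     # reported sustained performance
peak_fraction = 0.22       # reported fraction of single-precision peak

# Gain from compiling with the implicit synchronizations enforced.
sync_gain = t_explicit / t_implicit

# Single-precision peak implied by the two rounded figures (approximate).
implied_peak_tflops = sustained_tflops / peak_fraction

# Floating-point operations per step and per particle at the largest size.
flop_per_step = sustained_tflops * 1e12 * t_large
flop_per_particle = flop_per_step / n_large

print(f"particles (small run): {n_small:,}")
print(f"particles (large run): {n_large:,}")
print(f"implicit-sync gain:    {sync_gain:.2f}x")
print(f"implied peak:          {implied_peak_tflops:.1f} TFlop/s")
print(f"flop per particle:     {flop_per_particle:.3g}")
```

On these inputs the implicit-synchronization build comes out about 1.15 times faster per step, and the implied single-precision peak is roughly 16 TFlop/s; both are estimates limited by the two-significant-figure inputs.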
