Article

Algorithm 967: A Distributed-Memory Fast Multipole Method for Volume Potentials

Journal

ACM Transactions on Mathematical Software

Publisher

Association for Computing Machinery
DOI: 10.1145/2898349

Keywords

FMM; N-body problems; potential theory

Funding

  1. AFOSR [FA9550-12-10484]
  2. NSF, Directorate for Computer & Information Science & Engineering, Division of Computing and Communication Foundations [CCF-1337393]
  3. U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program [DE-SC0010518, DE-SC0009286]
  4. NIH [10042242]
  5. DARPA [W911NF-115-2-0121]
  6. Technische Universität München - Institute for Advanced Study, German Excellence Initiative (European Union) [291763]

Abstract

The solution of a constant-coefficient elliptic partial differential equation (PDE) can be computed using an integral transform: a convolution with the fundamental solution of the PDE, also known as a volume potential. We present a Fast Multipole Method (FMM) for computing volume potentials and use them to construct spatially adaptive solvers for the Poisson, Stokes, and low-frequency Helmholtz problems. Conventional N-body methods apply to discrete particle interactions; with volume potentials, one replaces the sums with volume integrals. Particle N-body methods can be used to accelerate such integrals, but it is more efficient to develop a special FMM. In this article, we discuss the efficient implementation of such an FMM. We use high-order piecewise Chebyshev polynomials and an octree data structure to represent the input and output fields, enabling spectrally accurate approximation of the near field, and we use the Kernel-Independent FMM (KIFMM) for the far-field approximation. For distributed-memory parallelism, we use space-filling curves, locally essential trees, and a hypercube-like communication scheme developed previously in our group. We present new near- and far-interaction traversals that optimize cache usage. Also, unlike particle N-body codes, we need a 2:1 balanced tree to allow for precomputations, and we present a fast scheme for 2:1 balancing. Finally, we use vectorization, including the AVX instruction set on the Intel Sandy Bridge architecture, to obtain better than 50% of peak floating-point performance. We use task parallelism to employ the Xeon Phi on the Stampede platform at the Texas Advanced Computing Center (TACC). We achieve about 600 GFLOP/s of double-precision performance on a single node. Our largest run on Stampede took 3.5 seconds on 16K cores for a problem with 18E+9 unknowns for a highly nonuniform particle distribution (corresponding to an effective resolution exceeding 3E+23 unknowns, since we used 23 levels in our octree).
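To make the integral transform above concrete, here is a rough, hypothetical sketch (not taken from the paper or the authors' code) of the naive particle-sum baseline that such an FMM accelerates: a direct O(N^2) evaluation of a discretized Poisson volume potential u(x) = sum_j w_j G(x - y_j) f(y_j), using the free-space Laplace kernel G(r) = 1/(4*pi*|r|). The point locations, density values, and quadrature weights below are placeholders standing in for the piecewise-Chebyshev octree representation described in the abstract.

import numpy as np

def laplace_kernel(dx):
    # Free-space Green's function of the 3D Laplacian: G(r) = 1 / (4*pi*|r|).
    r = np.linalg.norm(dx, axis=-1)
    with np.errstate(divide="ignore"):
        g = 1.0 / (4.0 * np.pi * r)
    g[r == 0.0] = 0.0  # drop the singular self-interaction term
    return g

def direct_volume_potential(targets, sources, density, weights):
    # u(x_i) = sum_j w_j * G(x_i - y_j) * f(y_j); cost grows quadratically with
    # the number of points, which is what the FMM reduces to near-linear.
    u = np.zeros(len(targets))
    for j, y in enumerate(sources):
        u += weights[j] * density[j] * laplace_kernel(targets - y)
    return u

# Usage with random points standing in for octree quadrature nodes (illustrative only).
rng = np.random.default_rng(0)
pts = rng.random((1000, 3))
f = np.sin(2.0 * np.pi * pts[:, 0])    # sample density values
w = np.full(len(pts), 1.0 / len(pts))  # placeholder quadrature weights
u = direct_volume_potential(pts, pts, f, w)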

Authors

Dhairya Malhotra and George Biros
