4.7 Article

Cache blocking strategies applied to flux reconstruction

Journal

COMPUTER PHYSICS COMMUNICATIONS
Volume 271

Publisher

ELSEVIER
DOI: 10.1016/j.cpc.2021.108193

Keywords

Cache blocking; Kernel fusion; Flux reconstruction; High-order; Computational fluid dynamics

Funding

  1. Imperial College London
  2. Engineering and Physical Sciences Research Council (EPSRC) [EP/R030340/1], Funding Source: UKRI


This study investigates the use of cache blocking to improve the performance of Flux Reconstruction methods on modern hardware architectures. Compared with kernel fusion, cache blocking is simpler to implement and can still benefit high-order CFD codes, although it requires a sufficiently large CPU cache. In practice, the most performant of the new kernel grouping configurations leads to a speedup of approximately 2.81x.
On modern hardware architectures, the performance of Flux Reconstruction (FR) methods can be limited by memory bandwidth. In a typical implementation, these methods are expressed as a chain of distinct kernels, and a dataset that has just been written to main memory by one kernel is often read back immediately by the next. One way to avoid this redundant expenditure of memory bandwidth is kernel fusion. However, on a practical level kernel fusion requires that the source for all kernels be available, which precludes calls to certain third-party library functions, and it can add substantial complexity to a codebase. An alternative to full kernel fusion is cache blocking, but for this to be effective the CPU cache has to be sufficiently large. Historically, the sizes of L1 and L2 caches prevented cache blocking from being applied to high-order CFD applications. In recent years, however, L2 caches have grown from around 0.25 MiB to 1.25 MiB, making cache blocking viable for high-order CFD codes. In this approach, kernels remain distinct and are executed one after another on small chunks of data that fit in the cache, as opposed to on full datasets. These chunks stay resident in the cache, and whenever a kernel requests data that is already cached, memory bandwidth is conserved. In this study, a data structure that facilitates cache blocking is considered, and a range of kernel grouping configurations for an FR-based Euler solver are examined. A theoretical study is conducted for hexahedral elements with no anti-aliasing at p = 3 and p = 4 in order to determine the predicted performance of a few kernel grouping configurations. These candidates are then implemented in the PyFR solver, and the performance gains observed in practice are compared with the theoretical estimates, which range between 2.05x and 2.50x. An inviscid Taylor-Green Vortex test case is used as a benchmark, and the most performant configuration leads to a speedup of approximately 2.81x in practice. (C) 2021 Elsevier B.V. All rights reserved.
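The cache blocking idea described in the abstract can be illustrated with a short sketch. The snippet below is not taken from PyFR; it is a minimal, hypothetical NumPy example with made-up kernels (kernel_a, kernel_b) and an assumed chunk size, showing how two distinct kernels can be executed back to back on cache-sized chunks instead of each sweeping the full dataset.

```python
# Minimal sketch (not PyFR code) contrasting a naive kernel chain, kernel
# fusion, and cache blocking for two bandwidth-bound kernels. Kernel bodies,
# names and the chunk size are illustrative assumptions.
import numpy as np

def kernel_a(u, w):
    # First kernel: writes an intermediate array w.
    np.multiply(u, 2.0, out=w)

def kernel_b(w, r):
    # Second kernel: immediately reads w back.
    np.add(w, 1.0, out=r)

def run_naive(u, w, r):
    # Each kernel sweeps the full dataset; w round-trips through main memory.
    kernel_a(u, w)
    kernel_b(w, r)

def run_fused(u, r):
    # Kernel fusion (conceptual): in a compiled code the two loop bodies are
    # merged so w never leaves registers/cache, but this requires the source
    # of both kernels and adds complexity to the codebase.
    np.add(u * 2.0, 1.0, out=r)

def run_cache_blocked(u, w, r, block=4096):
    # Cache blocking: kernels remain distinct, but are executed one after
    # another on chunks small enough to stay resident in (L2) cache, so
    # kernel_b reads w from cache rather than from main memory.
    n = u.shape[0]
    for i in range(0, n, block):
        s = slice(i, min(i + block, n))
        kernel_a(u[s], w[s])
        kernel_b(w[s], r[s])

if __name__ == "__main__":
    n = 1 << 20
    u = np.random.rand(n)
    w = np.empty_like(u)
    r1, r2 = np.empty_like(u), np.empty_like(u)
    run_naive(u, w, r1)
    run_cache_blocked(u, w, r2)
    assert np.allclose(r1, r2)  # same result, less main-memory traffic
```

In a real high-order solver the chunk would be a small batch of elements and the chain would span many more kernels, but the principle of keeping the intermediate data cache-resident between kernel invocations is the same.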

