4.7 Article

Cache blocking strategies applied to flux reconstruction

Journal

COMPUTER PHYSICS COMMUNICATIONS
Volume 271

Publisher

ELSEVIER
DOI: 10.1016/j.cpc.2021.108193

Keywords

Cache blocking; Kernel fusion; Flux reconstruction; High-order; Computational fluid dynamics

Funding

  1. Imperial College London
  2. Engineering and Physical Sciences Research Council (EPSRC) [EP/R030340/1], Funding Source: UKRI


This study investigates the use of cache blocking to improve the performance of Flux Reconstruction methods on modern hardware architectures. Compared with kernel fusion, cache blocking is simpler to implement and can still benefit high-order CFD codes, although it requires a sufficiently large CPU cache. In practice, the most performant of the new kernel grouping configurations leads to a speedup of approximately 2.81x.
On modern hardware architectures, the performance of Flux Reconstruction (FR) methods can be limited by memory bandwidth. In a typical implementation, these methods are expressed as a chain of distinct kernels, and a dataset that has just been written to main memory by one kernel is often read back immediately by the next. One way to avoid this redundant expenditure of memory bandwidth is kernel fusion. However, on a practical level kernel fusion requires that the source for all kernels be available, which precludes calls to certain third-party library functions, and it can add substantial complexity to a codebase. An alternative to full kernel fusion is cache blocking, but for this to be effective the CPU cache has to be sufficiently large. Historically, the sizes of L1 and L2 caches prevented cache blocking from being applied to high-order CFD applications. In recent years, however, L2 caches have grown from around 0.25 MiB to 1.25 MiB, making cache blocking viable for high-order CFD codes. In this approach, kernels remain distinct and are executed one after another on small chunks of data that fit in the cache, as opposed to on full datasets. These chunks stay resident in the cache, and whenever a kernel requests data that is already cached, memory bandwidth is conserved. In this study, a data structure that facilitates cache blocking is considered, and a range of kernel grouping configurations for an FR-based Euler solver are examined. A theoretical study is conducted for hexahedral elements with no anti-aliasing at p = 3 and p = 4 in order to determine the predicted performance of a few kernel grouping configurations. These candidates are then implemented in the PyFR solver, and the performance gains observed in practice are compared with the theoretical estimates, which range between 2.05x and 2.50x. An inviscid Taylor-Green Vortex test case is used as a benchmark, and the most performant configuration leads to a speedup of approximately 2.81x in practice. (C) 2021 Elsevier B.V. All rights reserved.
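The cache blocking idea described in the abstract can be illustrated with a short sketch. The snippet below is not taken from PyFR; it is a minimal, hypothetical NumPy example with made-up kernels (kernel_a, kernel_b) and an assumed chunk size, showing how two distinct kernels can be executed back to back on cache-sized chunks instead of each sweeping the full dataset.

```python
# Minimal sketch (not PyFR code) contrasting a naive kernel chain, kernel
# fusion, and cache blocking for two bandwidth-bound kernels. Kernel bodies,
# names and the chunk size are illustrative assumptions.
import numpy as np

def kernel_a(u, w):
    # First kernel: writes an intermediate array w.
    np.multiply(u, 2.0, out=w)

def kernel_b(w, r):
    # Second kernel: immediately reads w back.
    np.add(w, 1.0, out=r)

def run_naive(u, w, r):
    # Each kernel sweeps the full dataset; w round-trips through main memory.
    kernel_a(u, w)
    kernel_b(w, r)

def run_fused(u, r):
    # Kernel fusion (conceptual): in a compiled code the two loop bodies are
    # merged so w never leaves registers/cache, but this requires the source
    # of both kernels and adds complexity to the codebase.
    np.add(u * 2.0, 1.0, out=r)

def run_cache_blocked(u, w, r, block=4096):
    # Cache blocking: kernels remain distinct, but are executed one after
    # another on chunks small enough to stay resident in (L2) cache, so
    # kernel_b reads w from cache rather than from main memory.
    n = u.shape[0]
    for i in range(0, n, block):
        s = slice(i, min(i + block, n))
        kernel_a(u[s], w[s])
        kernel_b(w[s], r[s])

if __name__ == "__main__":
    n = 1 << 20
    u = np.random.rand(n)
    w = np.empty_like(u)
    r1, r2 = np.empty_like(u), np.empty_like(u)
    run_naive(u, w, r1)
    run_cache_blocked(u, w, r2)
    assert np.allclose(r1, r2)  # same result, less main-memory traffic
```

In a real high-order solver the chunk would be a small batch of elements and the chain would span many more kernels, but the principle of keeping the intermediate data cache-resident between kernel invocations is the same.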

