Article

Benchmarking the GPU memory at the warp level

Journal

PARALLEL COMPUTING
Volume 71, Pages 23-41

Publisher

ELSEVIER SCIENCE BV
DOI: 10.1016/j.parco.2017.11.003

Keywords

Graphics processing unit (GPU); Micro-benchmarks; Warp-level latency

Funding

  1. National Natural Science Foundation of China [61602501, 61272146, 41375113]


Graphics processing units (GPUs) are widely used in scientific computing because of their high performance and energy efficiency. Nonetheless, GPUs feature a hierarchical memory system, and optimizing code for it requires an in-depth understanding from programmers. To this end, the capability (latency or bandwidth) of the memory system is often measured with micro-benchmarks. Prior works focus on the latency observed by a single thread to disclose undocumented information. Such per-thread measurements cannot reflect the actual behavior of program execution, because the smallest executable unit of parallelism on a GPU comprises 32 threads (a warp of threads). This motivates us to benchmark the GPU memory system at the warp level. In this paper, we benchmark the GPU memory system to quantify the capability of parallel accessing and broadcasting. Such warp-level measurements are performed on shared memory, constant memory, global memory and texture memory. Further, we discuss how to replace local memory with registers, how to avoid bank conflicts in shared memory, and how to maximize global memory bandwidth with alternative data types. By analyzing the experimental results, we summarize optimization guidelines for the different types of memory and build an optimization framework for GPU memories. Taking a case study of maximum noise fraction rotation in dimension reduction of hyperspectral images, we demonstrate that our framework is applicable and effective. Our work discloses the characteristics of GPU memories at the warp level and leads to optimization guidelines. The warp-level benchmarking results can facilitate the process of designing parallel algorithms and of modeling and optimizing GPU programs. To the best of our knowledge, this is the first benchmarking effort at the warp level for the GPU memory system. (C) 2017 Elsevier B.V. All rights reserved.
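
As a rough illustration of the warp-level measurement idea described in the abstract (this is a minimal sketch, not the authors' benchmark code; the kernel name warpSharedLatency, the iteration count and the stride-1 access pattern are assumptions), the following CUDA program times dependent shared-memory loads issued by one full warp and reports the average clock cycles per access:

// Hypothetical warp-level shared-memory latency micro-benchmark: one warp of
// 32 threads performs dependent (pointer-chasing) loads from shared memory,
// and the elapsed SM clock cycles are averaged to estimate per-access latency
// when all lanes of the warp issue conflict-free loads in parallel.
#include <cstdio>
#include <cuda_runtime.h>

#define WARP_SIZE 32
#define ITERS     1024

__global__ void warpSharedLatency(unsigned int *cycles, unsigned int *sink)
{
    __shared__ unsigned int chain[WARP_SIZE];

    unsigned int tid = threadIdx.x;
    chain[tid] = (tid + 1) % WARP_SIZE;    // stride-1 chain: a distinct bank per lane
    __syncthreads();

    unsigned int idx = tid;
    unsigned int start = clock();
    #pragma unroll 1
    for (int i = 0; i < ITERS; ++i)
        idx = chain[idx];                  // each load depends on the previous one
    unsigned int stop = clock();

    if (tid == 0)
        *cycles = (stop - start) / ITERS;  // average cycles per warp access
    sink[tid] = idx;                       // keep the loads from being optimized away
}

int main()
{
    unsigned int *d_cycles, *d_sink, h_cycles = 0;
    cudaMalloc(&d_cycles, sizeof(unsigned int));
    cudaMalloc(&d_sink, WARP_SIZE * sizeof(unsigned int));

    warpSharedLatency<<<1, WARP_SIZE>>>(d_cycles, d_sink);  // launch exactly one warp
    cudaMemcpy(&h_cycles, d_cycles, sizeof(unsigned int), cudaMemcpyDeviceToHost);

    printf("approx. shared-memory latency: %u cycles per access\n", h_cycles);
    cudaFree(d_cycles);
    cudaFree(d_sink);
    return 0;
}

The same pointer-chasing pattern, redirected at constant, global or texture memory and rerun with conflicting strides or broadcast (all-lanes-same-address) indices, would yield the kind of warp-level latency and broadcast figures the abstract refers to.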
