Article

Benchmarking the GPU memory at the warp level

Journal

PARALLEL COMPUTING
Volume 71, Issue -, Pages 23-41

Publisher

ELSEVIER SCIENCE BV
DOI: 10.1016/j.parco.2017.11.003

Keywords

Graphic process unit (GPU); Micro-benchmarks; Warp-level latency

Funding

  1. National Natural Science Foundation of China [61602501, 61272146, 41375113]

Graphics processing units (GPUs) are widely used in scientific computing because of their high performance and energy efficiency. However, GPUs have a hierarchical memory system, and optimizing code for it requires programmers to understand it in depth. To this end, the capability (latency or bandwidth) of the memory system is often measured with micro-benchmarks. Prior work focuses on the latency seen by a single thread to uncover undocumented details. Such per-thread measurements cannot reflect how a program actually executes, because the smallest executable unit of parallelism on a GPU is a warp of 32 threads. This motivates us to benchmark the GPU memory system at the warp level. In this paper, we benchmark the GPU memory system to quantify its capability for parallel access and broadcasting. These warp-level measurements are performed on shared memory, constant memory, global memory and texture memory. Further, we discuss how to replace local memory with registers, how to avoid bank conflicts in shared memory, and how to maximize global memory bandwidth with alternative data types. By analyzing the experimental results, we summarize optimization guidelines for the different types of memory and build an optimization framework for GPU memories. Using maximum noise fraction rotation for dimension reduction of hyperspectral images as a case study, we demonstrate that our framework is applicable and effective. Our work discloses the characteristics of GPU memories at the warp level and leads to optimization guidelines. The warp-level benchmarking results can facilitate designing parallel algorithms and modeling and optimizing GPU programs. To the best of our knowledge, this is the first benchmarking effort at the warp level for the GPU memory system. (C) 2017 Elsevier B.V. All rights reserved.
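
The abstract's points about shared-memory bank conflicts and broadcasting can be illustrated with a small host-side model. The sketch below is not the paper's benchmark. It assumes 32 banks addressed in 4-byte words, which is typical of recent NVIDIA GPUs but is not stated in the abstract, and the helper names `conflict_degree` and `column_access` are hypothetical. It counts how many distinct words a warp requests from the most heavily loaded bank, which approximates how many serialized passes the access would need.

```python
from collections import defaultdict

WARP_SIZE = 32   # threads per warp, as stated in the abstract
NUM_BANKS = 32   # assumption: 32 banks, each one 4-byte word wide


def conflict_degree(word_indices):
    """Return the largest number of distinct words that fall in one bank.

    Threads that read the same word are modeled as a single broadcast,
    so only distinct words within a bank count as serialized accesses.
    """
    words_per_bank = defaultdict(set)
    for w in word_indices:
        words_per_bank[w % NUM_BANKS].add(w)
    return max(len(words) for words in words_per_bank.values())


def column_access(row_pitch, column=0):
    """Word indices touched when thread t reads tile[t][column]."""
    return [t * row_pitch + column for t in range(WARP_SIZE)]


# Column read of a 32x32 tile: every thread lands in the same bank.
unpadded = conflict_degree(column_access(row_pitch=32))
# Same read with one padding word per row (32x33 tile).
padded = conflict_degree(column_access(row_pitch=33))
# All threads read one word: modeled as a broadcast.
broadcast = conflict_degree([7] * WARP_SIZE)

print(unpadded, padded, broadcast)
```

Under these assumptions, padding each row by one word spreads a column access across all banks, while a same-word read costs a single pass. Actual latencies depend on the GPU generation and would need to be measured on the device.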
