☆ 4.3 Article

Benchmarking the GPU memory at the warp level

PARALLEL COMPUTING (2018)

Journal

PARALLEL COMPUTING

Volume 71, Issue -, Pages 23-41

Publisher

ELSEVIER SCIENCE BV

DOI: 10.1016/j.parco.2017.11.003

Keywords

Graphics processing unit (GPU); Micro-benchmarks; Warp-level latency

Category

Computer Science, Theory & Methods

Funding

National Natural Science Foundation of China [61602501, 61272146, 41375113]

Abstract

Graphics processing units (GPUs) are widely used in scientific computing because of their high performance and energy efficiency. However, GPUs have a hierarchical memory system, and optimizing code for it requires programmers to understand it in depth. To this end, the capability (latency or bandwidth) of the memory system is often measured with micro-benchmarks. Prior work focuses on the latency seen by a single thread in order to uncover undocumented details. Such per-thread measurement cannot reflect how a program actually executes, because the smallest executable unit of parallelism on a GPU is a warp of 32 threads. This motivates us to benchmark the GPU memory system at the warp level. In this paper, we benchmark the GPU memory system to quantify its capability for parallel access and broadcasting. These warp-level measurements are performed on shared memory, constant memory, global memory, and texture memory. Further, we discuss how to replace local memory with registers, how to avoid bank conflicts in shared memory, and how to maximize global memory bandwidth with alternative data types. By analyzing the experimental results, we summarize optimization guidelines for the different types of memory and build an optimization framework for GPU memories. Using maximum noise fraction rotation for dimension reduction of hyperspectral images as a case study, we demonstrate that our framework is applicable and effective. Our work discloses the characteristics of GPU memories at the warp level and leads to optimization guidelines. The warp-level benchmarking results can facilitate the design of parallel algorithms and the modeling and optimization of GPU programs. To the best of our knowledge, this is the first benchmarking effort at the warp level for the GPU memory system. © 2017 Elsevier B.V. All rights reserved.
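The abstract mentions avoiding bank conflicts in shared memory. As a rough illustration only, and not the paper's benchmark code, the following Python sketch models how one warp-wide access maps onto shared memory banks. It assumes 32 banks of 4-byte words with bank index equal to the word index modulo 32, and that lanes reading the same word are served by a single broadcast. The function names `conflict_degree`, `strided`, and `column_access` are hypothetical helpers introduced here.

```python
from collections import defaultdict

NUM_BANKS = 32  # assumption: 32 shared memory banks, each 4 bytes wide
WARP_SIZE = 32  # a warp comprises 32 threads (lanes)


def conflict_degree(word_indices, num_banks=NUM_BANKS):
    """Number of serialized passes needed for one warp-wide access.

    Lanes that hit the same bank at different words are serialized;
    lanes that read the same word are assumed to share one broadcast.
    """
    words_per_bank = defaultdict(set)
    for word in word_indices:
        words_per_bank[word % num_banks].add(word)
    return max(len(words) for words in words_per_bank.values())


def strided(stride):
    """Word index touched by each lane for the access pattern lane * stride."""
    return [lane * stride for lane in range(WARP_SIZE)]


def column_access(col, row_width):
    """Lane i reads element [i][col] of a row-major tile with row_width words per row."""
    return [lane * row_width + col for lane in range(WARP_SIZE)]


if __name__ == "__main__":
    for stride in (0, 1, 2, 32):
        print("stride", stride, "degree", conflict_degree(strided(stride)))
    for width in (32, 33):
        print("row width", width, "degree", conflict_degree(column_access(0, width)))
```

Under this simplified model, padding a 32-word tile row to 33 words spreads a column access across all banks, which is the usual way such conflicts are avoided. Real hardware behavior depends on the architecture and should be confirmed by measurement.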
