Article; Proceedings Paper

Model-driven Autotuning of Sparse Matrix-Vector Multiply on GPUs

Journal

ACM SIGPLAN NOTICES
Volume 45, Issue 5, Pages 115-125

Publisher

ASSOC COMPUTING MACHINERY
DOI: 10.1145/1837853.1693471

Keywords

Algorithms; Performance; GPU; sparse matrix-vector multiplication; performance modeling

Funding

  1. National Science Foundation (NSF) [0833136]
  2. NSF TeraGrid allocation [CCR-090024]
  3. NSF / Semiconductor Research Corporation (SRC) [0903447, 1981]
  4. Defense Advanced Research Projects Agency (DARPA)
  5. NSF Directorate for Computer & Information Science & Engineering, Division of Computing and Communication Foundations [0953100, 0903447, 0833136]

Abstract

We present a performance model-driven framework for automated performance tuning (autotuning) of sparse matrix-vector multiply (SpMV) on systems accelerated by graphics processing units (GPUs). Our study consists of two parts. First, we describe several carefully hand-tuned SpMV implementations for GPUs, identifying key GPU-specific performance limitations, enhancements, and tuning opportunities. These implementations, which include variants on classical blocked compressed sparse row (BCSR) and blocked ELLPACK (BELLPACK) storage formats, match or exceed state-of-the-art implementations. For instance, our best BELLPACK implementation achieves up to 29.0 Gflop/s in single precision and 15.7 Gflop/s in double precision on the NVIDIA T10P multiprocessor (C1060), improving on prior state-of-the-art unblocked implementations (Bell and Garland, 2009) by up to 1.8x and 1.5x for single and double precision, respectively. However, achieving this level of performance requires input matrix-dependent parameter tuning. Thus, in the second part of this study, we develop a performance model that can guide tuning. Like prior autotuning models for CPUs (e.g., Im, Yelick, and Vuduc, 2004), this model requires offline measurements and run-time estimation, but it more directly models the structure of multithreaded vector processors like GPUs. We show that our model can identify implementations whose performance is within 15% of the best found through exhaustive search.
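For readers unfamiliar with the storage layouts involved, the following is a minimal, unblocked ELLPACK-style SpMV kernel in CUDA, shown purely for illustration; the kernel name, parameter names, and zero-padding convention are assumptions here, not the authors' BELLPACK implementation. In ELLPACK, every row is padded to a common length and the padded arrays are stored column-major, so that assigning one thread per row yields coalesced memory accesses.

```cuda
// Illustrative ELLPACK SpMV sketch (not the paper's BELLPACK code).
// vals/cols are num_rows x max_nnz_per_row arrays stored column-major:
// the c-th stored entry of row r lives at index c * num_rows + r.
// Padded slots are assumed to hold value 0.0f and a valid column index.
__global__ void spmv_ellpack(int num_rows, int max_nnz_per_row,
                             const int   *cols,  // padded column indices
                             const float *vals,  // padded nonzero values
                             const float *x,     // dense input vector
                             float       *y)     // dense output vector
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float sum = 0.0f;
        // Column-major layout: at each step c, consecutive threads (rows)
        // read consecutive addresses, giving coalesced global loads.
        for (int c = 0; c < max_nnz_per_row; ++c) {
            int   idx = c * num_rows + row;
            float v   = vals[idx];
            if (v != 0.0f)
                sum += v * x[cols[idx]];
        }
        y[row] = sum;
    }
}
```

The blocked variants studied in the paper (BCSR and BELLPACK) additionally group nonzeros into small dense blocks, trading some explicit zero fill for reduced index storage and register-level reuse; the block dimensions are among the input-dependent parameters that the performance model is used to select.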
