Proceedings Paper

On Improving the Performance of Multi-threaded CUDA Applications with Concurrent Kernel Execution by Kernel Reordering

2012 SYMPOSIUM ON APPLICATION ACCELERATORS IN HIGH PERFORMANCE COMPUTING (SAAHPC) (2012)

Journal

2012 SYMPOSIUM ON APPLICATION ACCELERATORS IN HIGH PERFORMANCE COMPUTING (SAAHPC)

Volume -, Issue -, Pages 74-83

Publisher

IEEE COMPUTER SOC

DOI: 10.1109/SAAHPC.2012.12

Keywords

GP-GPU; CUDA; concurrent kernel execution; multi-threaded applications

Categories

Computer Science, Hardware & Architecture; Engineering, Electrical & Electronic

Funding

German BMBF [01IH11004A-G]

Abstract

General-purpose graphics processing units (GPUs) have been found to be viable solutions for large-scale numerical computations with an inherent potential for massive parallelism. In contrast, little is known about using GPUs for small-scale computations. To keep the GPU from being under-utilized at small problem sizes, a sensible approach is to perform as many small-scale computations as possible concurrently. On NVIDIA Fermi GPUs, Concurrent Kernel Execution (CKE) allows up to 16 GPU kernels to execute concurrently on a single device. While using CKE in single-threaded CUDA programs is straightforward, in multi-threaded programs it can be challenging to manage multiple host threads interacting with the GPU device while also keeping CKE working properly. CKE performance can be observed to break down when multiple host threads each invoke multiple GPU kernels in succession without synchronizing their actions. Since multiple host threads commonly process their data independently in real-world applications, a mechanism is needed that helps avoid CKE breakdown. We propose an approach based on the producer-consumer principle that manages GPU kernel invocations from within parallel host regions by reordering the respective GPU kernels before actually invoking them. We demonstrate significant performance improvements with this technique in a strong-scaling simulation of a small molecule solvated within a nanodroplet.
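
The producer-consumer idea in the abstract can be illustrated with a minimal host-side sketch. It is not the authors' implementation: the abstract does not state the reordering policy, so the round-robin interleaving across host threads used here is an assumption, and the names `LaunchRequest`, `LaunchQueue`, and `demo_reorder` are hypothetical. Actual CUDA launches are left out so the sketch runs without a GPU; a comment marks where a real consumer would presumably issue each kernel on its own stream.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical record of a deferred kernel launch (illustrative only).
struct LaunchRequest {
    int host_thread;     // producer (host thread) that requested the launch
    int sequence;        // position within that producer's own launch order
    std::string kernel;  // illustrative kernel name
};

// Producers submit requests here instead of launching kernels directly.
class LaunchQueue {
public:
    explicit LaunchQueue(std::size_t num_threads) : pending_(num_threads) {}

    // Called concurrently by host threads (the producers).
    void submit(const LaunchRequest& r) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_[static_cast<std::size_t>(r.host_thread)].push_back(r);
    }

    // Called by the consumer: returns all pending requests interleaved
    // round-robin across producers (assumed policy), preserving each
    // producer's internal order, and empties the queue.
    std::vector<LaunchRequest> drain_interleaved() {
        std::lock_guard<std::mutex> lock(mutex_);
        std::vector<LaunchRequest> order;
        bool more = true;
        for (std::size_t i = 0; more; ++i) {
            more = false;
            for (const auto& q : pending_) {
                if (i < q.size()) {
                    order.push_back(q[i]);
                    more = true;
                }
            }
        }
        for (auto& q : pending_) q.clear();
        return order;
    }

private:
    std::mutex mutex_;
    std::vector<std::vector<LaunchRequest>> pending_;
};

// Runs num_threads producers that each request kernels_per_thread launches
// in succession, then returns the order a consumer would issue them in.
std::vector<LaunchRequest> demo_reorder(int num_threads, int kernels_per_thread) {
    LaunchQueue queue(static_cast<std::size_t>(num_threads));
    std::vector<std::thread> producers;
    for (int t = 0; t < num_threads; ++t) {
        producers.emplace_back([&queue, t, kernels_per_thread] {
            for (int k = 0; k < kernels_per_thread; ++k) {
                queue.submit({t, k, "kernel_" + std::to_string(k)});
            }
        });
    }
    for (auto& p : producers) p.join();
    // A real consumer would launch each request here, for example on a
    // CUDA stream dedicated to request.host_thread (assumption).
    return queue.drain_interleaved();
}
```

With three host threads issuing two kernels each, `demo_reorder(3, 2)` yields the first kernel of threads 0, 1, and 2, followed by the second kernel of each, rather than back-to-back launches from a single thread.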
