Article

Agglomerative Memory and Thread Scheduling for High-Performance Ray-Tracing on GPUs

Publisher

IEEE (Institute of Electrical and Electronics Engineers, Inc.)
DOI: 10.1109/TCAD.2021.3058910

Keywords

Ray tracing; Graphics processing units; Instruction sets; Microarchitecture; Acceleration; Hardware; Rendering (computer graphics); Graphics processing unit (GPU); irregularity; memory; ray-tracing; scheduling

Funding

  1. Key Scientific Instrument and Equipment Development Project of China National Science Foundation [61527812]

Abstract

This paper proposes a scheduling mechanism that unleashes parallelism in the ray-tracing process on GPUs; combined with a tile-based ray-tracing framework, it significantly improves memory efficiency and outperforms a traditional GPU architecture.
Ray-tracing rendering has long been considered a promising technology for enabling a higher level of visual experience. The democratization of ray-tracing rendering to consumer platforms, however, poses significant challenges to rendering hardware and software due to its highly irregular computing patterns. Modern ray-tracing techniques typically depend on a tree-based acceleration structure to reduce the computational complexity of intersection testing between rays and graphics primitives. Traversal of this structure by a massive number of rays on a graphics processing unit (GPU) incurs a significant amount of irregular memory traffic, which turns out to be a major stumbling block for real-time performance. In this work, a scheduling mechanism called Agglomerative Memory and Thread Scheduling is proposed to unleash the inherent parallelism in the ray-tracing process on GPUs. It is coupled with a tile-based ray-tracing framework in which the acceleration structure (a KD-tree in this work) is partitioned into subtrees that can be completely loaded into the on-chip L1 cache of a streaming multiprocessor. An effective scheduling mechanism collects threads according to the subtrees hit by their respective rays and regroups them into warps for dispatch. In addition, subtrees are dynamically preloaded into the L1 cache of multiprocessors in an on-demand fashion. The proposed scheduler can be integrated into today's high-end GPUs with only minor overhead. Microarchitecture simulation results show that the proposed framework significantly improves memory efficiency and outperforms a traditional GPU microarchitecture by 47.4% on average.
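The core regrouping step described above can be illustrated with a minimal software sketch. It assumes each in-flight thread is tagged with the ID of the subtree its ray will traverse next; threads sharing a subtree are collected and sliced into warps so that every warp touches a single cache-resident subtree. The function name `regroup_into_warps` and the flat list-of-IDs input are hypothetical illustrations; the paper's actual scheduler is a hardware mechanism inside the GPU, not host code.

```python
from collections import defaultdict

WARP_SIZE = 32  # threads per warp on current NVIDIA-style GPUs

def regroup_into_warps(thread_subtree_ids):
    """Collect thread indices by the subtree their ray currently targets,
    then slice each group into warps of up to WARP_SIZE threads.

    thread_subtree_ids[i] is the subtree ID hit by thread i's ray.
    Returns a list of (subtree_id, [thread indices]) warps, so each
    dispatched warp traverses exactly one L1-resident subtree.
    """
    groups = defaultdict(list)
    for tid, subtree in enumerate(thread_subtree_ids):
        groups[subtree].append(tid)

    warps = []
    for subtree, tids in groups.items():
        for i in range(0, len(tids), WARP_SIZE):
            warps.append((subtree, tids[i:i + WARP_SIZE]))
    return warps
```

Because every warp now references one subtree, a subtree can be fetched into the L1 cache once (on demand) and reused by all warps scheduled against it, which is the source of the memory-efficiency gain the paper reports.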

