Proceedings Paper

Improving performance of GMRES by reducing communication and pipelining global collectives

Publisher

IEEE
DOI: 10.1109/IPDPSW.2017.65

Keywords

-

Funding

  1. U.S. Department of Energy Office of Science [DE-FG0213ER26137, DE-SC0010042]
  2. U.S. National Science Foundation [1339822]
  3. U.S. Department of Energy's National Nuclear Security Administration [DE-AC04-94AL85000]
  4. U.S. Department of Energy (DOE) [DE-SC0010042]
  5. NSF Directorate for Computer & Information Science & Engineering
  6. NSF Office of Advanced Cyberinfrastructure (OAC) [1339822]

Abstract

We compare the performance of pipelined and s-step GMRES, referred to as l-GMRES and s-GMRES respectively, on distributed-multicore CPUs. Compared to standard GMRES, s-GMRES requires fewer all-reduces, while l-GMRES overlaps the all-reduces with computation. To combine the best features of the two algorithms, we propose another variant, (l, t)-GMRES, that not only performs fewer global all-reduces than standard GMRES but also overlaps those all-reduces with other work. We implemented the thread parallelism and communication overlap in two different ways: the first uses nonblocking MPI collectives with thread-parallel computational kernels, while the second relies on a shared-memory task scheduler. In our experiments, (l, t)-GMRES performed better than l-GMRES by factors of up to 1.67x. In addition, even though we used only 50 nodes, once the latency cost became significant our variant performed up to 1.22x better than s-GMRES by hiding the all-reduces.
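The reduction-counting argument behind s-GMRES can be sketched in a toy, single-process NumPy model. This is an illustrative sketch under simplifying assumptions (monomial Krylov basis, Cholesky-QR block orthogonalization), not the paper's implementation: in a distributed run, each marked inner-product line would correspond to one MPI all-reduce.

```python
import numpy as np

def standard_arnoldi(A, v0, m):
    """Classical Gram-Schmidt Arnoldi. Each inner-product line below
    would be one global all-reduce in a distributed-memory run."""
    n = len(v0)
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    V[:, 0] = v0 / np.linalg.norm(v0)
    reductions = 0
    for j in range(m):
        w = A @ V[:, j]
        H[:j + 1, j] = V[:, :j + 1].T @ w   # all-reduce 1: projections
        reductions += 1
        w = w - V[:, :j + 1] @ H[:j + 1, j]
        H[j + 1, j] = np.linalg.norm(w)     # all-reduce 2: norm
        reductions += 1
        V[:, j + 1] = w / H[j + 1, j]
    return V, reductions

def s_step_basis(A, v0, s):
    """s-step variant: s matrix-vector products (no communication),
    then ONE block reduction (the Gram matrix) followed by a local
    Cholesky-QR that orthogonalizes the whole block at once."""
    n = len(v0)
    W = np.zeros((n, s + 1))
    W[:, 0] = v0 / np.linalg.norm(v0)
    for j in range(s):
        W[:, j + 1] = A @ W[:, j]           # SpMV only, no reduction
    G = W.T @ W                             # the single all-reduce
    R = np.linalg.cholesky(G).T             # G = R^T R, so W = Q R
    Q = W @ np.linalg.inv(R)
    return Q, 1
```

For m = s = 4 the standard loop performs 8 global reductions versus 1 for the block version, which is the latency saving s-GMRES exploits. The monomial basis used here is the simplest choice; as in real s-step methods, its conditioning limits how large s can be taken.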
