3.8 Proceedings Paper

Optimization and parallelization of B-spline based orbital evaluations in QMC on multi/many-core shared memory processors

Publisher

IEEE
DOI: 10.1109/IPDPS.2017.33

Keywords

QMC; B-spline; SoA; AoSoA; vectorization; cache-blocking data layouts; roofline

Funding

  1. Intel Corporation
  2. Advanced Simulation and Computing - Physics and Engineering models program at Sandia National Laboratories
  3. Predictive Theory and Modeling for Materials and Chemical Science program by the Office of Basic Energy Science (BES), Department of Energy (DOE)
  4. Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration [DE-AC04-94AL85000]
  5. DOE Office of Science User Facility [DE-AC02-06CH11357]

Abstract

B-spline based orbital representations are widely used in Quantum Monte Carlo (QMC) simulations of solids, historically taking as much as 50% of the total run time. Random accesses to a large four-dimensional array make it challenging to efficiently utilize caches and the wide vector units of modern CPUs. We present node-level optimizations of B-spline evaluations on multi/many-core shared memory processors. To increase SIMD efficiency and bandwidth utilization, we first apply a data layout transformation from array-of-structures (AoS) to structure-of-arrays (SoA). Then, by blocking SoA objects, we optimize cache reuse and obtain sustained throughput for a range of problem sizes. We implement efficient nested threading in the B-spline orbital evaluation kernels, paving the way towards enabling strong scaling of QMC simulations. These optimizations are portable across four distinct cache-coherent architectures and result in up to 5.6x performance enhancement on the Intel (R) Xeon Phi (TM) processor 7250P (KNL), 5.7x on the Intel (R) Xeon Phi (TM) coprocessor 7120P, 10x on an Intel (R) Xeon (R) processor E5 v4 CPU, and 9.5x on a BlueGene/Q processor. Our nested threading implementation shows nearly ideal parallel efficiency on KNL up to 16 threads. We employ roofline performance analysis to model the impact of our optimizations. This work, combined with our ongoing efforts to optimize other QMC kernels, results in a greater than 4.5x speedup of miniQMC on KNL.
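The central optimization described in the abstract, replacing an array-of-structures layout with a structure-of-arrays layout and then blocking the SoA objects (AoSoA) for cache reuse, can be sketched in a few lines of C++. The struct names and the block size below are illustrative assumptions, not the paper's actual miniQMC/QMCPACK data structures:

```cpp
#include <vector>
#include <cstddef>

// Array-of-structures (AoS): value, gradient, and Laplacian of each orbital
// are interleaved, so a vectorized loop over orbitals must use strided or
// gathered loads.
struct OrbitalAoS {
  double v, gx, gy, gz, lap;
};
using AoS = std::vector<OrbitalAoS>;

// Structure-of-arrays (SoA): each component is stored contiguously, so a
// loop over orbitals reads unit-stride streams that map onto SIMD lanes.
struct SoA {
  std::vector<double> v, gx, gy, gz, lap;
  explicit SoA(std::size_t n) : v(n), gx(n), gy(n), gz(n), lap(n) {}
};

// Blocked SoA (AoSoA): SoA tiles of a fixed size chosen so that one tile's
// working set fits in cache, sustaining throughput as the orbital count grows.
// kBlock = 128 is a hypothetical tile size, not a value from the paper.
struct BlockedSoA {
  static constexpr std::size_t kBlock = 128;
  std::vector<SoA> blocks;
  explicit BlockedSoA(std::size_t n) {
    for (std::size_t i = 0; i < n; i += kBlock)
      blocks.emplace_back(i + kBlock <= n ? kBlock : n - i);
  }
};
```

Because each SoA component is contiguous, compilers can auto-vectorize loops over orbitals without gather instructions, while the fixed block size in the AoSoA variant keeps a tile's data resident in cache independent of problem size.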
