Article

Unraveling Network-Induced Memory Contention: Deeper Insights with Machine Learning

Journal

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/TPDS.2017.2773483

Keywords

Measurement; performance; memory contention; networks; asynchronous communication; machine learning

Funding

  1. United States Department of Energy's National Nuclear Security Administration [DE-AC04-94AL85000]

Remote Direct Memory Access (RDMA) is expected to be an integral communication mechanism for future exascale systems, enabling asynchronous data transfers so that applications can fully utilize CPU resources while simultaneously sharing data among remote nodes. In this work we examine Network-induced Memory Contention (NiMC) on InfiniBand networks. We expose the interactions between RDMA, main memory, and cache when applications and out-of-band services compete for memory resources. We then explore NiMC's resulting impact on application-level performance. For a range of hardware technologies and HPC workloads, we quantify NiMC and show that its impact grows with scale, resulting in up to 3X performance degradation at scales as small as 8K processes, even in applications previously shown to be performance-resilient in the presence of noise. Additionally, this work examines the problem of predicting NiMC's impact on applications by leveraging machine learning and easily accessible performance counters. This approach provides additional insight into the root cause of NiMC and facilitates dynamic selection of potential solutions. Lastly, we evaluate three potential techniques to reduce NiMC's impact: hardware offloading, core reservation, and software-based network throttling.
