Article

Random forest implementation and optimization for Big Data analytics on LexisNexis's high performance computing cluster platform

Journal

JOURNAL OF BIG DATA
Volume 6, Issue 1, Pages -

Publisher

Springer Nature
DOI: 10.1186/s40537-019-0232-1

Keywords

Random forest; LexisNexis's high performance computing cluster (HPCC) systems platform; Optimization for Big Data; Distributed machine learning; Turning recursion into iteration

Funding

  1. NSF [1464537]
  2. Industry/University Cooperative Research Center [NSF 13-542]
  3. Directorate for Computer & Information Science & Engineering
  4. Division of Computer and Network Systems [1464537] (funding source: National Science Foundation)


In this paper, we explain in detail how we built a novel implementation of the Random Forest algorithm on the High Performance Computing Cluster (HPCC) Systems Platform from LexisNexis, where the algorithm was previously unavailable. Random Forest's learning process is based on the principle of recursive partitioning. Although recursion per se is not allowed in ECL (HPCC's programming language), we were able to implement the recursive partitioning algorithm as an iterative split/partition process. We also analyze the flaws found in our initial implementation and describe all the modifications required to overcome the bottleneck within the iterative split/partition process: the gathering of the data for the selected independent variables that are used in each node's best-split analysis. In essence, we describe how we optimized our initial Random Forest implementation into an efficient distributed machine learning implementation for Big Data. By taking full advantage of the HPCC Systems Platform's Big Data processing and analytics capabilities, we improved the data gathering method from an inefficient "Pass them All and Filter" approach to an effective, completely parallelized "Fetching on Demand" approach. Finally, a comparison of the learning process runtimes of the two approaches confirms the speedup of our optimized Random Forest implementation.
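The two ideas in the abstract, growing a tree without recursion and gathering only the data a node needs, can be illustrated with a small single-machine sketch. It is written in Python rather than ECL and is not the authors' implementation. Every name in it (`grow_tree`, `fetch_on_demand`, `pass_all_and_filter`, the Gini criterion, the heap-style node numbering) is a hypothetical choice made for illustration, and the two gathering functions are only one plausible reading of what the approach names mean.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def pass_all_and_filter(X, rows, features):
    """Assumed baseline: ship every variable of each row, then filter."""
    full = [(r, list(X[r])) for r in rows]  # all variables travel to the node
    return {f: [(vals[f], r) for r, vals in full] for f in features}

def fetch_on_demand(X, rows, features):
    """Assumed optimization: read only the selected (row, variable) values."""
    return {f: [(X[r][f], r) for r in rows] for f in features}

def best_split(X, y, rows, features, gather=fetch_on_demand):
    """Return (impurity, feature, threshold) of the best split, or None."""
    best = None
    for f, column in gather(X, rows, features).items():
        for threshold in sorted({v for v, _ in column}):
            left = [y[r] for v, r in column if v <= threshold]
            right = [y[r] for v, r in column if v > threshold]
            if not left or not right:
                continue  # degenerate split, nothing is separated
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best

def grow_tree(X, y, n_features, max_depth=5, seed=0):
    """Grow one tree level by level with a loop instead of recursion."""
    rng = random.Random(seed)
    tree = {}                             # node id -> ("leaf", label) or ("split", f, t)
    frontier = {1: list(range(len(X)))}   # open nodes of the current level
    depth = 0
    while frontier:                       # one iteration per tree level
        next_frontier = {}
        for node_id, rows in frontier.items():
            labels = [y[r] for r in rows]
            split = None
            if depth < max_depth and len(set(labels)) > 1:
                # Random subset of variables, as in Random Forest.
                features = rng.sample(range(len(X[0])), n_features)
                split = best_split(X, y, rows, features)
            if split is None:
                tree[node_id] = ("leaf", Counter(labels).most_common(1)[0][0])
                continue
            _, f, t = split
            tree[node_id] = ("split", f, t)
            # Children use heap numbering: 2*id (left) and 2*id + 1 (right).
            next_frontier[2 * node_id] = [r for r in rows if X[r][f] <= t]
            next_frontier[2 * node_id + 1] = [r for r in rows if X[r][f] > t]
        frontier = next_frontier
        depth += 1
    return tree

def predict(tree, x):
    """Walk from the root (id 1) to a leaf, again without recursion."""
    node_id = 1
    while tree[node_id][0] == "split":
        _, f, t = tree[node_id]
        node_id = 2 * node_id if x[f] <= t else 2 * node_id + 1
    return tree[node_id][1]
```

Each pass of the `while` loop handles all open nodes of one tree level, which is the kind of step a data-parallel platform can run over every node at once. A forest would repeat `grow_tree` over bootstrap samples, a step omitted here for brevity. In this sketch both gathering functions return the same values; the difference is only in how much data is touched to produce them.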
