Article

Random forest implementation and optimization for Big Data analytics on LexisNexis's high performance computing cluster platform

JOURNAL OF BIG DATA (2019)

Journal

JOURNAL OF BIG DATA

Volume 6, Issue 1, Pages -

Publisher

SPRINGERNATURE

DOI: 10.1186/s40537-019-0232-1

Keywords

Random forest; LexisNexis's high performance computing cluster (HPCC) systems platform; Optimization for Big Data; Distributed machine learning; Turning recursion into iteration

Category

Computer Science, Theory & Methods

Funding

NSF [1464537]
Industry/University Cooperative Research Center [NSF 13-542]
Directorate for Computer & Information Science & Engineering
Division of Computer and Network Systems [1464537] (Funding Source: National Science Foundation)

Abstract

In this paper, we comprehensively explain how we built a novel implementation of the Random Forest algorithm on the High Performance Computing Cluster (HPCC) Systems Platform from LexisNexis. The algorithm was previously unavailable on that platform. Random Forest's learning process is based on the principle of recursive partitioning. Although recursion per se is not allowed in ECL (HPCC's programming language), we were able to implement the recursive partition algorithm as an iterative split/partition process. In addition, we analyze the flaws found in our initial implementation and thoroughly describe all the modifications required to overcome the bottleneck within the iterative split/partition process, i.e., the optimization of the data gathering of the selected independent variables that are used for the node's best-split analysis. Essentially, we describe how our initial Random Forest implementation was optimized into an efficient distributed machine learning implementation for Big Data. By taking full advantage of the HPCC Systems Platform's Big Data processing and analytics capabilities, we succeeded in enhancing the data gathering method from an inefficient "Pass them All and Filter" approach into an effective and completely parallelized "Fetching on Demand" approach. Finally, based upon the results of our learning process runtime comparison between these two approaches, we confirm the speedup of our optimized Random Forest implementation.
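
The idea of turning recursive partitioning into an iterative split/partition process can be illustrated with a small sketch. It is written in single-machine Python rather than ECL, and it is not the authors' implementation: the function names (`gini`, `best_split`, `build_tree`, `predict`), the Gini criterion, the midpoint thresholds, and the heap-style node numbering are all assumptions made for illustration. Each pass of the `while` loop processes every node on one tree level, and `best_split` reads only the randomly selected features of the rows that belong to the node, which loosely mirrors gathering data on demand instead of passing every variable to every node.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, idx, features):
    """Return (feature, threshold, left_idx, right_idx) for the split with the
    lowest weighted Gini impurity, or None if no split beats the parent."""
    best = None
    best_score = gini([labels[i] for i in idx])  # a split must improve on this
    for f in features:
        # Read only the selected feature's values for this node's rows.
        values = sorted({rows[i][f] for i in idx})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between consecutive distinct values
            left = [i for i in idx if rows[i][f] <= t]
            right = [i for i in idx if rows[i][f] > t]
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right])) / len(idx)
            if score < best_score:
                best_score, best = score, (f, t, left, right)
    return best


def build_tree(rows, labels, n_features, max_depth=5, seed=0):
    """Grow one decision tree level by level, without recursion."""
    rng = random.Random(seed)
    all_features = list(range(len(rows[0])))
    nodes = {}  # node_id -> node description
    frontier = [(1, list(range(len(rows))), 0)]  # (node_id, row indices, depth)
    while frontier:  # one pass per tree level replaces the recursive call
        next_frontier = []
        for node_id, idx, depth in frontier:
            node_labels = [labels[i] for i in idx]
            split = None
            if depth < max_depth and len(set(node_labels)) > 1:
                selected = rng.sample(all_features, n_features)
                split = best_split(rows, labels, idx, selected)
            if split is None:
                majority = Counter(node_labels).most_common(1)[0][0]
                nodes[node_id] = {"leaf": True, "label": majority}
                continue
            f, t, left, right = split
            nodes[node_id] = {"leaf": False, "feature": f, "threshold": t}
            # Children of node k are numbered 2k and 2k + 1.
            next_frontier.append((2 * node_id, left, depth + 1))
            next_frontier.append((2 * node_id + 1, right, depth + 1))
        frontier = next_frontier
    return nodes


def predict(nodes, row):
    """Walk from the root (node 1) down to a leaf and return its label."""
    node_id = 1
    while not nodes[node_id]["leaf"]:
        node = nodes[node_id]
        go_left = row[node["feature"]] <= node["threshold"]
        node_id = 2 * node_id if go_left else 2 * node_id + 1
    return nodes[node_id]["label"]
```

A forest would repeat `build_tree` on bootstrap samples with different seeds and combine the trees by majority vote. Because the frontier holds all nodes of one level, the work inside each pass is independent per node, which is the property a distributed platform can exploit.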
