3.8 Proceedings Paper

Distributed and Parallel Ensemble Classification for Big Data Based on Kullback-Leibler Random Sample Partition

Publisher

SPRINGER INTERNATIONAL PUBLISHING AG
DOI: 10.1007/978-3-030-60245-1_31

Keywords

Big data analysis; Approximate computing; Random sample partition; Ensemble classification; Parallel distributed computing

Funding

  1. National Natural Science Foundation of China [61836005]
  2. Opening Project of Shanghai Trusted Industrial Control Platform [TICPSH202003008-ZC]

Ask authors/readers for more resources

In this article, we use a Kullback-Leibler random sample partition data model to generate a set of disjoint data blocks, where each block is a good representation of the entire data set. Every random sample partition (RSP) block has a sample distribution function similar to the entire data set. To obtain the statistical measure between them, Kernel Density Estimation (KDE) with a dual-tree recursion data structure is firstly applied to fast estimate the probability density of each block. Then, based on the Kullback-Leibler (KL) divergence measure, we can obtain the statistical similarity between a randomly selected RSP data block and other RSP data blocks. We rank the RSP data blocks according to their divergence values in descending order and choose the first ten for an ensemble classification learning. The classification models are established in parallel for the selected RSP data blocks and the final ensemble classification model is obtained by the weighted voting ensemble strategy. The experiments were conducted by building XGboost model based on those ten blocks in parallel, and we incrementally ensemble them according to their KL values. The testing classification results show that our method can increase the generalization capability of the ensemble classification model. It could reduce the model building time in parallel computation environment by using less than 15% of the entire data, which could also solve the memory constraints of big data analysis.

Authors

I am an author on this paper
Click your name to claim this paper and add it to your profile.

Reviews

Primary Rating

3.8
Not enough ratings

Secondary Ratings

Novelty
-
Significance
-
Scientific rigor
-
Rate this paper

Recommended

No Data Available
No Data Available