Article

An Asymptotic Ensemble Learning Framework for Big Data Analysis

Journal

IEEE ACCESS
Volume 7, Pages 3675-3693

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/ACCESS.2018.2889355

Keywords

Big data analysis; cluster computing; random sample partition; block-level sampling; distributed and parallel computing; approximate computing; random sampling; ensemble methods

Funding

  1. National Natural Science Foundation of China [61503252, 61473194]
  2. China Postdoctoral Science Foundation [2016T90799]
  3. National Key R&D Program of China [2017YFC0822604-2]

To enable big data analysis when the data volume exceeds the available computing resources, we propose a method that uses only a few random sample data blocks of a big data set to obtain approximate results for the entire data set. The random sample partition (RSP) distributed data model represents a big data set as a set of non-overlapping random sample data blocks. Each block is saved as an RSP data block file that can be used directly to estimate the statistical properties of the entire data set. A subset of RSP data blocks is randomly selected, and the selected blocks are analyzed in parallel with existing sequential algorithms. The results from these blocks are then combined into ensemble estimates and models, which can be improved gradually by appending results from newly analyzed RSP data blocks. To this end, we propose a distributed data-parallel framework (the Alpha framework) and develop a prototype of it using Microsoft R Server packages and the Hadoop Distributed File System. Experimental results on three real data sets show that a subset of the RSP data blocks of a data set is sufficient to obtain estimates and models equivalent to those computed from the entire data set.
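The block-level idea in the abstract can be illustrated with a minimal Python sketch. It is not the Alpha framework or its R/Hadoop prototype: the function names `make_rsp_blocks` and `ensemble_mean`, the batch size, and the stopping tolerance are all hypothetical choices made for illustration, and a plain loop stands in for the parallel analysis of blocks. The sketch shuffles a data set into non-overlapping blocks, analyzes randomly chosen blocks a batch at a time, and updates an ensemble estimate of the mean until it stops changing much.

```python
import random
import statistics


def make_rsp_blocks(data, num_blocks, seed=0):
    """Shuffle the records, then cut them into non-overlapping blocks.

    Assumption: a global shuffle followed by striding is used here as a
    simple stand-in for building random sample data blocks.
    """
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def ensemble_mean(blocks, batch_size=5, tol=0.05, seed=0):
    """Estimate the mean from randomly selected blocks, batch by batch.

    Each block is analyzed independently (sequentially here; the blocks
    could be processed in parallel). Block results are appended to the
    ensemble, and the loop stops once the ensemble estimate changes by
    less than `tol` between batches (a hypothetical stopping rule).
    Averaging block means assumes blocks of (nearly) equal size.
    """
    rng = random.Random(seed)
    order = list(range(len(blocks)))
    rng.shuffle(order)  # random selection of blocks without replacement

    block_means = []
    previous = None
    estimate = None
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        block_means.extend(statistics.fmean(blocks[i]) for i in batch)
        estimate = statistics.fmean(block_means)
        if previous is not None and abs(estimate - previous) < tol:
            break
        previous = estimate
    return estimate, len(block_means)


# Usage: synthetic data, since the paper's data sets are not available here.
gen = random.Random(42)
data = [gen.gauss(50.0, 10.0) for _ in range(100_000)]
blocks = make_rsp_blocks(data, num_blocks=100)
estimate, blocks_used = ensemble_mean(blocks)
print(f"estimate from {blocks_used} of {len(blocks)} blocks: {estimate:.3f}")
print(f"mean of the full data set: {statistics.fmean(data):.3f}")
```

The same pattern extends to other block-level results (for example, models trained per block and combined by voting or averaging), but how results are combined in the actual framework is described in the paper, not in this sketch.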
