Article

Random Sample Partition: A Distributed Data Model for Big Data Analysis

Journal

IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS
Volume 15, Issue 11, Pages 5846-5854

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TII.2019.2912723

Keywords

Big Data; Data models; Distributed databases; Computational modeling; Informatics; Cluster computing; Data analysis; Approximate computing; big data analysis; cluster computing; data partitioning; random sampling

Funding

  1. National Natural Science Foundation of China [61473194, TII-18-2736]

Abstract

With the ever-increasing volume of data, alternative strategies are required to divide big data into statistically consistent data blocks that can be used directly as representative samples of the entire data set in big data analysis. In this paper, we propose the Random Sample Partition (RSP) distributed data model, which represents a big data set as a set of disjoint data blocks, called RSP blocks. Each RSP block has a probability distribution similar to that of the entire data set. RSP blocks can therefore be used to estimate the statistical properties of the data and to build predictive models without processing the entire data set. We demonstrate the implications of the RSP model for sampling from big data and introduce a new RSP-based method for approximate big data analysis that can be applied to different scenarios in industry. This method significantly reduces the computational burden of big data analysis and increases the productivity of data scientists.
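
The idea in the abstract can be illustrated with a small single-machine sketch. It is not the authors' distributed algorithm: it assumes that shuffling the records and cutting them into equal-sized disjoint blocks is an acceptable stand-in for an RSP, and the function names `make_rsp_blocks` and `estimate_mean` are hypothetical, introduced only for this illustration. It shows a mean estimated from a few randomly chosen blocks and compared with the mean of the full data set.

```python
import random
import statistics


def make_rsp_blocks(records, num_blocks, seed=0):
    """Shuffle the records and split them into disjoint blocks.

    Assumption: a global shuffle followed by a split stands in for the
    partitioning step; the paper's distributed procedure is not reproduced.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    # Striding over the shuffled list yields disjoint blocks that
    # together cover every record exactly once.
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def estimate_mean(blocks, num_selected, seed=0):
    """Estimate the overall mean from a few randomly selected blocks."""
    rng = random.Random(seed)
    selected = rng.sample(blocks, num_selected)
    block_means = [statistics.fmean(block) for block in selected]
    # Blocks have equal size here, so averaging the block means
    # equals the mean of the pooled selected records.
    return statistics.fmean(block_means)


# Synthetic data for illustration only (not from the paper).
data_rng = random.Random(42)
data = [data_rng.gauss(50.0, 10.0) for _ in range(100_000)]

blocks = make_rsp_blocks(data, num_blocks=100)
approx = estimate_mean(blocks, num_selected=5)  # touches 5% of the records
exact = statistics.fmean(data)

print(f"estimate from 5 blocks: {approx:.3f}")
print(f"mean of full data set:  {exact:.3f}")
```

In this sketch only 5 of the 100 blocks are read to produce the estimate, which is the kind of saving the abstract describes. How many blocks are needed for a given accuracy depends on the data and the statistic being estimated.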

