Random Sample Partition: A Distributed Data Model for Big Data Analysis

IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS (2019)

Journal

IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS

Volume 15, Issue 11, Pages 5846-5854

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TII.2019.2912723

Keywords

Big Data; Data models; Distributed databases; Computational modeling; Informatics; Cluster computing; Data analysis; Approximate computing; big data analysis; cluster computing; data partitioning; random sampling

Funding

National Natural Science Foundation of China [61473194, TII-18-2736]

Abstract

With the ever-increasing volume of data, alternative strategies are required to divide big data into statistically consistent data blocks that can be used directly as representative samples of the entire data set in big data analysis. In this paper, we propose the Random Sample Partition (RSP) distributed data model, which represents a big data set as a set of disjoint data blocks, called RSP blocks. Each RSP block has a probability distribution similar to that of the entire data set. RSP blocks can therefore be used to estimate the statistical properties of the data and to build predictive models without processing the entire data set. We demonstrate the implications of the RSP model for sampling from big data and introduce a new RSP-based method for approximate big data analysis that can be applied to different industrial scenarios. This method significantly reduces the computational burden of big data analysis and increases the productivity of data scientists.
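The idea in the abstract can be illustrated with a small single-machine sketch. It is not the authors' distributed partitioning algorithm: it assumes the whole data set fits in memory, and it obtains random-sample blocks by shuffling once and then cutting the shuffled records into disjoint blocks. The function names `make_rsp_blocks` and `estimate_mean` and the synthetic Gaussian data are hypothetical, introduced only for illustration.

```python
import random
import statistics


def make_rsp_blocks(records, num_blocks, seed=0):
    """Shuffle the records once, then cut them into disjoint blocks.

    Assumption: a global shuffle stands in for the paper's partitioning
    step, so each block behaves like a random sample of the full data.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    # Striding over the shuffled list yields disjoint blocks that
    # together cover every record exactly once.
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def estimate_mean(blocks, num_selected, seed=0):
    """Estimate the mean from a few randomly chosen blocks only.

    Averaging block means is appropriate here because the blocks
    have equal size in this example.
    """
    rng = random.Random(seed)
    chosen = rng.sample(blocks, num_selected)
    return statistics.fmean(statistics.fmean(block) for block in chosen)


# Synthetic data standing in for a "big" data set (hypothetical values).
data_rng = random.Random(42)
data = [data_rng.gauss(5.0, 2.0) for _ in range(100_000)]

blocks = make_rsp_blocks(data, num_blocks=100)
approx = estimate_mean(blocks, num_selected=5)
exact = statistics.fmean(data)

print(f"exact mean = {exact:.3f}, estimate from 5 of 100 blocks = {approx:.3f}")
```

In this sketch the estimate reads only 5% of the records. Because each block is a random sample, the estimate should land close to the exact mean, and selecting more blocks would tighten it further.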
