Journal
IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS
Volume 15, Issue 11, Pages 5846-5854
Publisher
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TII.2019.2912723
Keywords
Big Data; Data models; Distributed databases; Computational modeling; Informatics; Cluster computing; Data analysis; Approximate computing; big data analysis; cluster computing; data partitioning; random sampling
Funding
- National Natural Science Foundation of China [61473194, TII-18-2736]
Abstract
With the ever-increasing volume of data, alternative strategies are required to divide big data into statistically consistent data blocks that can be used directly as representative samples of the entire data set in big data analysis. In this paper, we propose the Random Sample Partition (RSP) distributed data model, which represents a big data set as a set of disjoint data blocks, called RSP blocks. Each RSP block has a probability distribution similar to that of the entire data set. RSP blocks can therefore be used to estimate the statistical properties of the data and to build predictive models without processing the entire data set. We demonstrate the implications of the RSP model for sampling from big data and introduce a new RSP-based method for approximate big data analysis that can be applied to different industrial scenarios. This method significantly reduces the computational burden of big data analysis and increases the productivity of data scientists.
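The idea of disjoint blocks that each resemble the full data set can be illustrated with a small sketch. This is not the authors' distributed algorithm: it assumes the data fits in memory as a NumPy array, and the function name `random_sample_partition`, the synthetic data, and the block count are all hypothetical choices made for illustration. It shuffles the records once, cuts them into disjoint blocks, and compares a statistic computed on a single block with the same statistic on the whole data set.

```python
import numpy as np


def random_sample_partition(data, num_blocks, seed=0):
    """Randomly permute the records, then cut them into disjoint blocks.

    Hypothetical in-memory helper: every record lands in exactly one block,
    and each block is a simple random sample of the records.
    """
    rng = np.random.default_rng(seed)
    shuffled = data[rng.permutation(len(data))]
    return np.array_split(shuffled, num_blocks)


# Synthetic stand-in for a big data set (assumed values, for illustration only).
rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=2.0, size=100_000)

blocks = random_sample_partition(data, num_blocks=20)

# Estimate a statistic from a single block instead of the entire data set.
full_mean = data.mean()
block_mean = blocks[0].mean()
print(f"full mean: {full_mean:.3f}, single-block estimate: {block_mean:.3f}")
```

Because each block is a random sample of the records, the single-block estimate should land close to the full-data value while touching only one twentieth of the records. A real deployment would have to perform the shuffle across cluster nodes, which this sketch does not attempt.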