Article

Data Integration by Combining Big Data and Survey Sample Data for Finite Population Inference

Journal

INTERNATIONAL STATISTICAL REVIEW
Volume 89, Issue 2, Pages 382-401

Publisher

WILEY
DOI: 10.1111/insr.12434

Keywords

Calibration weighting; Measurement error; Non-response; Regression estimation; Selection bias

Funding

  1. US National Science Foundation [MMS-1733572]


The statistical challenges in making valid statistical inference using big data for finite populations are primarily due to statistical bias from under-coverage and measurement errors. By stratifying the population and using a fully responding probability sample, we can estimate the missing data stratum and the population as a whole through a data integration estimator.
The statistical challenges in using big data to make valid statistical inference for a finite population have been well documented in the literature. These challenges are due primarily to statistical bias arising from under-coverage, where the big data source fails to represent the population of interest, and from measurement errors in the variables available in the data set. By stratifying the population into a big data stratum and a missing data stratum, we can estimate the missing data stratum by using a fully responding probability sample, and hence the population as a whole by using a data integration estimator. By expressing the data integration estimator as a regression estimator, we can handle measurement errors in the variables in big data and also in the probability sample. We also propose a fully nonparametric classification method for identifying the overlapping units and develop a bias-corrected data integration estimator under misclassification errors. Finally, we develop a two-step regression data integration estimator to deal with measurement errors in the probability sample. An advantage of the approach advocated in this paper is that we do not have to make unrealistic missing-at-random assumptions for the methods to work. The proposed method is applied to a real data example using 2015-2016 Australian Agricultural Census data.
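The stratification idea in the abstract can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the paper's implementation: it assumes no measurement error, a simple random sample with equal design weights N/n, and that each sampled unit's membership in the big data source is known without misclassification. The function names `data_integration_total` and `simulate`, the population size, and the inclusion probabilities are all hypothetical choices made for illustration. The estimated total is the observed big data total plus the design-weighted total of sampled units that fall outside the big data source.

```python
import random


def data_integration_total(big_data_y, sample):
    """Estimate a population total by combining two sources.

    big_data_y: y values for every unit in the big data stratum.
    sample: (design_weight, in_big_data, y) tuples from the probability sample.
    Assumes y is measured without error and the overlap flag is correct.
    """
    # The big data stratum is fully observed, so its total is known exactly.
    big_total = sum(big_data_y)
    # The missing data stratum is estimated from sampled units outside big data.
    missing_total = sum(w * y for w, in_b, y in sample if not in_b)
    return big_total + missing_total


def simulate(seed=0, N=10000, n=2000):
    """Compare a naive big-data-only estimate with the combined estimate."""
    rng = random.Random(seed)
    y = [rng.gauss(50, 10) for _ in range(N)]
    # Inclusion in big data depends on y itself, so it is not missing at random.
    delta = [rng.random() < (0.8 if yi > 50 else 0.3) for yi in y]
    big_data_y = [yi for yi, d in zip(y, delta) if d]
    # Simple random sample without replacement; every unit gets weight N / n.
    idx = rng.sample(range(N), n)
    w = N / n
    sample = [(w, delta[i], y[i]) for i in idx]
    true_total = sum(y)
    # Naive estimate: scale the big data mean up to the population size.
    naive = N * sum(big_data_y) / len(big_data_y)
    combined = data_integration_total(big_data_y, sample)
    return true_total, naive, combined


if __name__ == "__main__":
    true_total, naive, combined = simulate()
    print("true total:      ", round(true_total))
    print("naive estimate:  ", round(naive))
    print("combined estimate:", round(combined))
```

Because selection into the big data source depends on the outcome in this toy setup, the naive estimate is biased, while the combined estimate relies only on the sampling design for the missing data stratum. Handling measurement error and misclassified overlap, which the abstract also addresses, is outside the scope of this sketch.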
