4.7 Article

Parallel Fractional Hot-Deck Imputation and Variance Estimation for Big Incomplete Data Curing

Journal

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
Volume 34, Issue 8, Pages 3912-3926

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/TKDE.2020.3029146

Keywords

Estimation; Curing; Parallel processing; Indexes; Software; Statistical learning; Parallel fractional hot-deck imputation; incomplete big data; multivariate missing data curing; parallel Jackknife variance estimation

Funding

  1. Department of Civil, Construction, and Environmental Engineering of Iowa State University
  2. HPC@ISU equipment at ISU
  3. National Science Foundation under MRI grant [CNS 1229081]
  4. National Science Foundation [CBET-1605275, OAC-1931380]
  5. National Science Foundation under CRI grant [1205413]

Abstract

The fractional hot-deck imputation (FHDI) is a general imputation method for handling multivariate missing data. However, it lacks efficiency when dealing with big incomplete data. To overcome this limitation, a parallel version called P-FHDI is developed, which shows favorable speedup for large incomplete datasets.
The fractional hot-deck imputation (FHDI) is a general-purpose, assumption-free imputation method for handling multivariate missing data: it fills each missing item with multiple observed values rather than artificially created ones. The corresponding R package FHDI [J. Im, I. Cho, and J. K. Kim, "An R package for fractional hot deck imputation," R J., vol. 10, no. 1, pp. 140-154, 2018] offers generality and efficiency, but it is not adequate for tackling big incomplete data because of its excessive memory requirements and long running time. As a first step toward tackling big incomplete data with the FHDI, we developed a parallel fractional hot-deck imputation program (named P-FHDI) suitable for curing large incomplete datasets. Results show a favorable speedup when P-FHDI is applied to big datasets with up to millions of instances or 10,000 variables. This paper explains the detailed parallel algorithms of P-FHDI for datasets with many instances (big-n) or high dimensionality (big-p) and confirms their favorable scalability. The proposed program inherits all the advantages of the serial FHDI and enables parallel variance estimation, which will benefit a broad audience in science and engineering.
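The two ideas named in the abstract and keywords (fractional weighting of multiple donors, and jackknife variance estimation) can be illustrated with a minimal conceptual sketch. This is not the authors' P-FHDI implementation: the function names, the random donor selection, and the single-variable setting are illustrative assumptions (the actual FHDI selects donors within imputation cells built from the joint response pattern).

```python
import numpy as np

def fractional_hot_deck_impute(y, resp, n_donors=3, rng=None):
    """Toy fractional hot-deck imputation for one variable.

    Each missing item (resp[i] == False) is filled with n_donors observed
    values, each carrying a fractional weight 1/n_donors, so no artificial
    value is ever created.  Donor choice is random here; the real FHDI
    draws donors from within imputation cells (an assumption of this sketch).
    """
    rng = np.random.default_rng(rng)
    donors = y[resp]                       # pool of observed values
    imputed = []                           # (index, donor_value, frac_weight)
    for i in np.flatnonzero(~resp):
        picks = rng.choice(donors, size=n_donors, replace=False)
        for v in picks:
            imputed.append((i, v, 1.0 / n_donors))
    return imputed

def fhdi_mean(y, resp, imputed):
    """Point estimate: observed values plus fractionally weighted donors."""
    total = y[resp].sum() + sum(w * v for _, v, w in imputed)
    return total / len(y)

def jackknife_variance(theta_loo):
    """Standard delete-one jackknife variance of a point estimate,
    given the n leave-one-out replicate estimates theta_(i)."""
    theta_loo = np.asarray(theta_loo, dtype=float)
    n = len(theta_loo)
    return (n - 1) / n * np.sum((theta_loo - theta_loo.mean()) ** 2)
```

The parallelism in P-FHDI comes from distributing exactly these per-record imputation steps and per-replicate jackknife estimates across processors, since each is independent of the others.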

