4.7 Article

Parallel Fractional Hot-Deck Imputation and Variance Estimation for Big Incomplete Data Curing

Journal

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
Volume 34, Issue 8, Pages 3912-3926

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/TKDE.2020.3029146

Keywords

Estimation; Curing; Parallel processing; Indexes; Software; Statistical learning; Parallel fractional hot-deck imputation; incomplete big data; multivariate missing data curing; parallel Jackknife variance estimation

Funding

  1. Department of Civil, Construction, and Environmental Engineering of Iowa State University
  2. HPC@ISU equipment at ISU
  3. National Science Foundation under MRI grant [CNS 1229081]
  4. National Science Foundation [CBET-1605275, OAC-1931380]
  5. National Science Foundation under CRI grant [1205413]

Abstract

The fractional hot-deck imputation (FHDI) is a general imputation method for handling multivariate missing data. However, it lacks efficiency when dealing with big incomplete data. To overcome this limitation, a parallel version called P-FHDI is developed, which shows favorable speedup for large incomplete datasets.
The fractional hot-deck imputation (FHDI) is a general-purpose, assumption-free imputation method for handling multivariate missing data: it fills each missing item with multiple observed values rather than artificially created ones. The corresponding R package FHDI [J. Im, I. Cho, and J. K. Kim, "An R package for fractional hot deck imputation," R J., vol. 10, no. 1, pp. 140-154, 2018] offers generality and efficiency, but it is not adequate for tackling big incomplete data because of its excessive memory requirements and long running time. As a first step toward tackling big incomplete data with the FHDI, we developed a parallel fractional hot-deck imputation program (named P-FHDI) suitable for curing large incomplete datasets. Results show a favorable speedup when P-FHDI is applied to big datasets with up to millions of instances or 10,000 variables. This paper explains the detailed parallel algorithms of P-FHDI for datasets with many instances (big-n) or high dimensionality (big-p) and confirms their favorable scalability. The proposed program inherits all the advantages of the serial FHDI and enables parallel variance estimation, which will benefit a broad audience in science and engineering.
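The two ideas named in the abstract and keywords (fractional weighting of multiple donors, and jackknife variance estimation) can be illustrated with a minimal conceptual sketch. This is not the authors' P-FHDI implementation: the function names, the random donor selection, and the single-variable setting are illustrative assumptions (the actual FHDI selects donors within imputation cells built from the joint response pattern).

```python
import numpy as np

def fractional_hot_deck_impute(y, resp, n_donors=3, rng=None):
    """Toy fractional hot-deck imputation for one variable.

    Each missing item (resp[i] == False) is filled with n_donors observed
    values, each carrying a fractional weight 1/n_donors, so no artificial
    value is ever created.  Donor choice is random here; the real FHDI
    draws donors from within imputation cells (an assumption of this sketch).
    """
    rng = np.random.default_rng(rng)
    donors = y[resp]                       # pool of observed values
    imputed = []                           # (index, donor_value, frac_weight)
    for i in np.flatnonzero(~resp):
        picks = rng.choice(donors, size=n_donors, replace=False)
        for v in picks:
            imputed.append((i, v, 1.0 / n_donors))
    return imputed

def fhdi_mean(y, resp, imputed):
    """Point estimate: observed values plus fractionally weighted donors."""
    total = y[resp].sum() + sum(w * v for _, v, w in imputed)
    return total / len(y)

def jackknife_variance(theta_loo):
    """Standard delete-one jackknife variance of a point estimate,
    given the n leave-one-out replicate estimates theta_(i)."""
    theta_loo = np.asarray(theta_loo, dtype=float)
    n = len(theta_loo)
    return (n - 1) / n * np.sum((theta_loo - theta_loo.mean()) ** 2)
```

The parallelism in P-FHDI comes from distributing exactly these per-record imputation steps and per-replicate jackknife estimates across processors, since each is independent of the others.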

