4.7 Article

Parallel Fractional Hot-Deck Imputation and Variance Estimation for Big Incomplete Data Curing

Journal

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/TKDE.2020.3029146

Keywords

Estimation; Curing; Parallel processing; Indexes; Software; Statistical learning; Parallel fractional hot-deck imputation; incomplete big data; multivariate missing data curing; parallel Jackknife variance estimation

Funding

  1. Department of Civil, Construction, and Environmental Engineering of Iowa State University
  2. HPC@ISU equipment at ISU
  3. National Science Foundation under MRI grant [CNS 1229081]
  4. National Science Foundation [CBET-1605275, OAC-1931380]
  5. National Science Foundation under CRI grant [1205413]
  6. Division of Computer and Network Systems
  7. Directorate for Computer and Information Science and Engineering [1205413], funding source: National Science Foundation


Fractional hot-deck imputation (FHDI) is a general imputation method for handling multivariate missing data. However, the serial implementation needs excessive memory and running time on big incomplete data. To overcome this limitation, the authors developed a parallel version, P-FHDI, which shows favorable speedup on large incomplete datasets.
Fractional hot-deck imputation (FHDI) is a general-purpose, assumption-free imputation method for multivariate missing data: it fills each missing item with multiple observed values rather than artificially created ones. The corresponding R package, FHDI (J. Im, I. Cho, and J. K. Kim, "An R package for fractional hot deck imputation," R J., vol. 10, no. 1, pp. 140-154, 2018), is general and efficient, but it is not adequate for big incomplete data because it requires excessive memory and long running times. As a first step toward tackling big incomplete data with FHDI, we developed a parallel fractional hot-deck imputation program (named P-FHDI) suitable for curing large incomplete datasets. Results show a favorable speedup when P-FHDI is applied to big datasets with up to millions of instances or 10,000 variables. This paper explains the detailed parallel algorithms of P-FHDI for datasets with many instances (big-n) or high dimensionality (big-p) and confirms their favorable scalability. The proposed program inherits all the advantages of the serial FHDI and enables parallel variance estimation, which will benefit a broad audience in science and engineering.
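
To make the idea of "filling each missing item with multiple observed values" concrete, here is a minimal Python sketch. It is a toy illustration, not the FHDI or P-FHDI algorithm. It assumes a single variable `y`, imputation cells that are already given as labels, equal fractional weights over the donors, and a delete-one jackknife that simply re-imputes each replicate. All function names (`fractional_hot_deck`, `imputed_mean`, `jackknife_variance`) are hypothetical and come from neither the paper nor the R package. The thread pool only shows that the jackknife replicates are independent of one another and could therefore be distributed; it is not meant to reproduce the paper's parallel scheme or its speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def fractional_hot_deck(cells, y, max_donors=5):
    """Return rows (unit_index, value, fractional_weight).

    Observed units keep their own value with weight 1.0. A missing unit
    (None) receives up to max_donors observed values from its own cell,
    each with an equal fractional weight (a simplifying assumption).
    """
    donors_by_cell = {}
    for cell, value in zip(cells, y):
        if value is not None:
            donors_by_cell.setdefault(cell, []).append(value)

    rows = []
    for i, (cell, value) in enumerate(zip(cells, y)):
        if value is not None:
            rows.append((i, value, 1.0))
            continue
        donors = donors_by_cell.get(cell, [])[:max_donors]
        if not donors:
            raise ValueError(f"no donor available for unit {i} in cell {cell!r}")
        for donor in donors:
            rows.append((i, donor, 1.0 / len(donors)))
    return rows


def imputed_mean(cells, y):
    """Mean of y over all units, using the fractionally imputed rows."""
    rows = fractional_hot_deck(cells, y)
    return sum(value * weight for _, value, weight in rows) / len(y)


def jackknife_variance(cells, y, workers=4):
    """Delete-one jackknife variance of imputed_mean.

    Each replicate drops one unit and re-imputes the rest. Replicates do
    not depend on each other, so they are mapped over a worker pool.
    """
    n = len(y)

    def replicate(k):
        return imputed_mean(cells[:k] + cells[k + 1:], y[:k] + y[k + 1:])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        replicates = list(pool.map(replicate, range(n)))

    full = imputed_mean(cells, y)
    return (n - 1) / n * sum((r - full) ** 2 for r in replicates)


# Toy data: two cells, one missing value in each.
cells = ["a", "a", "a", "b", "b", "b"]
y = [1.0, 3.0, None, 10.0, None, 14.0]

rows = fractional_hot_deck(cells, y)
mean = imputed_mean(cells, y)
variance = jackknife_variance(cells, y)
```

In this toy data, unit 2 receives the donors 1.0 and 3.0 with weight 0.5 each, so no artificial value is created and the fractional weights of every unit sum to one. A real implementation would also have to construct the cells from the data and choose donor weights in a principled way, which this sketch deliberately leaves out.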
