Journal
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
Volume 34, Issue 8, Pages 3912-3926
Publisher
IEEE COMPUTER SOC
DOI: 10.1109/TKDE.2020.3029146
Keywords
Estimation; Curing; Parallel processing; Indexes; Software; Statistical learning; Parallel fractional hot-deck imputation; incomplete big data; multivariate missing data curing; parallel Jackknife variance estimation
Funding
- Department of Civil, Construction, and Environmental Engineering of Iowa State University
- HPC@ISU equipment at ISU
- National Science Foundation under MRI grant [CNS 1229081]
- National Science Foundation [CBET-1605275, OAC-1931380]
- National Science Foundation under CRI grant [1205413]
- National Science Foundation, Directorate for Computer and Information Science and Engineering, Division of Computer and Network Systems [1205413]
Fractional hot-deck imputation (FHDI) is a general imputation method for handling multivariate missing data, but it is inefficient on large incomplete datasets. To overcome this limitation, a parallel version called P-FHDI was developed, and it shows favorable speedup on large incomplete datasets.
Fractional hot-deck imputation (FHDI) is a general-purpose, assumption-free imputation method for handling multivariate missing data. It fills each missing item with multiple observed values rather than with artificially created values. The corresponding R package FHDI (J. Im, I. Cho, and J. K. Kim, "An R package for fractional hot deck imputation," R J., vol. 10, no. 1, pp. 140-154, 2018) is general and efficient, but it is not adequate for big incomplete data because it requires excessive memory and long running times. As a first step toward tackling big incomplete data with the FHDI, we developed a parallel fractional hot-deck imputation program (named P-FHDI) suitable for curing large incomplete datasets. Results show a favorable speedup when the P-FHDI is applied to big datasets with up to millions of instances or 10,000 variables. This paper explains the detailed parallel algorithms of the P-FHDI for datasets with many instances (big-n) or high dimensionality (big-p) and confirms their favorable scalability. The proposed program inherits all the advantages of the serial FHDI and enables parallel variance estimation, which will benefit a broad audience in science and engineering.
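The core idea stated in the abstract, filling each missing item with several observed values instead of a synthetic one, can be illustrated with a minimal serial sketch. This is not the authors' FHDI or P-FHDI algorithm: the function names `fractional_hot_deck` and `weighted_mean`, the pre-assigned imputation cells, the random donor selection, and the equal fractional weights are all assumptions made only for illustration.

```python
import random
from collections import defaultdict


def fractional_hot_deck(rows, n_donors=2, seed=0):
    """Illustrative fractional hot-deck imputation (hypothetical helper).

    rows: list of (cell, y) pairs, where `cell` is an imputation-cell label
    and `y` is a number or None when the value is missing.
    Returns a list of (row_index, value, fractional_weight) triples.
    """
    rng = random.Random(seed)

    # Collect the observed values (donors) available in each cell.
    donors_by_cell = defaultdict(list)
    for cell, y in rows:
        if y is not None:
            donors_by_cell[cell].append(y)

    imputed = []
    for i, (cell, y) in enumerate(rows):
        if y is not None:
            # Observed values are kept as-is with full weight.
            imputed.append((i, y, 1.0))
            continue
        pool = donors_by_cell[cell]
        if not pool:
            raise ValueError(f"no observed donor in cell {cell!r}")
        # Assumption: equal fractional weights over randomly chosen donors.
        m = min(n_donors, len(pool))
        for value in rng.sample(pool, m):
            imputed.append((i, value, 1.0 / m))
    return imputed


def weighted_mean(imputed, n_rows):
    """Mean estimate using the fractional weights (each row's weights sum to 1)."""
    return sum(value * weight for _, value, weight in imputed) / n_rows


rows = [("a", 1.0), ("a", 3.0), ("a", None), ("b", 10.0), ("b", None)]
imputed = fractional_hot_deck(rows, n_donors=2)
estimate = weighted_mean(imputed, len(rows))
```

In this toy input, the missing value in cell "a" receives both observed donors with weight 0.5 each, and the missing value in cell "b" receives its single donor with weight 1. Only observed values are ever used as imputations, which is the property the abstract emphasizes.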