☆ 4.7 Article

Parallel Fractional Hot-Deck Imputation and Variance Estimation for Big Incomplete Data Curing

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (2022)

Journal

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING

Volume 34, Issue 8, Pages 3912-3926

Publisher

IEEE COMPUTER SOC

DOI: 10.1109/TKDE.2020.3029146

Keywords

Estimation; Curing; Parallel processing; Indexes; Software; Statistical learning; Parallel fractional hot-deck imputation; incomplete big data; multivariate missing data curing; parallel Jackknife variance estimation

Categories

Computer Science, Artificial Intelligence; Computer Science, Information Systems; Engineering, Electrical & Electronic

Funding

Department of Civil, Construction, and Environmental Engineering of Iowa State University
HPC@ISU equipment at ISU
National Science Foundation under MRI grant [CNS 1229081]
National Science Foundation [CBET-1605275, OAC-1931380]
National Science Foundation under CRI grant [1205413]
Division of Computer and Network Systems
Directorate for Computer & Information Science & Engineering [1205413] (funding source: National Science Foundation)

Summary

Fractional hot-deck imputation (FHDI) is a general imputation method for handling multivariate missing data, but it is inefficient on big incomplete data. To overcome this limitation, a parallel version called P-FHDI was developed, which shows favorable speedup for large incomplete datasets.

Abstract

Fractional hot-deck imputation (FHDI) is a general-purpose, assumption-free imputation method for handling multivariate missing data: it fills each missing item with multiple observed values rather than artificially created values. The corresponding R package FHDI (J. Im, I. Cho, and J. K. Kim, "An R package for fractional hot deck imputation," R J., vol. 10, no. 1, pp. 140-154, 2018) is general and efficient, but it is not adequate for big incomplete data because it requires excessive memory and long running time. As a first step toward tackling big incomplete data with FHDI, we developed a parallel fractional hot-deck imputation program (named P-FHDI) suitable for curing large incomplete datasets. Results show a favorable speedup when P-FHDI is applied to big datasets with up to millions of instances or 10,000 variables. This paper explains the detailed parallel algorithms of P-FHDI for datasets with many instances (big-n) or high dimensionality (big-p) and confirms their favorable scalability. The proposed program inherits all the advantages of the serial FHDI and enables parallel variance estimation, which will benefit a broad audience in science and engineering.
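The core idea in the abstract, filling each missing item with several observed values instead of a single synthetic one, can be illustrated with a toy sketch. This is not the P-FHDI algorithm or the FHDI package API: the function names, the single pre-assigned imputation cell per row, and the equal fractional weights are all simplifying assumptions made here for illustration, and no parallelism or variance estimation is shown.

```python
from collections import defaultdict


def fractional_hot_deck(cells, values):
    """Toy fractional hot-deck imputation (illustrative assumption, not P-FHDI).

    cells[i]  : imputation-cell label of row i (assumed fully observed)
    values[i] : observed value of row i, or None if missing
    Returns a list of (row_index, value, fractional_weight) triples.
    """
    # Collect the observed values (donors) available in each cell.
    donors = defaultdict(list)
    for cell, v in zip(cells, values):
        if v is not None:
            donors[cell].append(v)

    rows = []
    for i, (cell, v) in enumerate(zip(cells, values)):
        if v is not None:
            # Observed rows keep their own value with full weight.
            rows.append((i, v, 1.0))
            continue
        pool = donors.get(cell, [])
        if not pool:
            raise ValueError(f"no donors available in cell {cell!r}")
        # A missing row receives every donor value from its cell,
        # each carrying an equal share of the row's unit weight.
        w = 1.0 / len(pool)
        rows.extend((i, d, w) for d in pool)
    return rows


def weighted_mean(rows):
    """Mean of the imputed dataset using the fractional weights."""
    total_w = sum(w for _, _, w in rows)
    return sum(v * w for _, v, w in rows) / total_w


if __name__ == "__main__":
    cells = ["a", "a", "a", "b", "b"]
    values = [1.0, 3.0, None, 10.0, None]
    imputed = fractional_hot_deck(cells, values)
    for row in imputed:
        print(row)
    print("weighted mean:", weighted_mean(imputed))
```

In this sketch the missing row in cell "a" is represented by both observed values from that cell with weight 0.5 each, so only real observed values enter the imputed dataset, which is the property the abstract emphasizes.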
