Article

How to Address the Data Quality Issues in Regression Models: A Guided Process for Data Cleaning

Journal

SYMMETRY-BASEL
Volume 10, Issue 4, Pages -

Publisher

MDPI
DOI: 10.3390/sym10040099

Keywords

data cleaning in regression models (DC-RM); data quality issue; data cleaning task; regression model

Funding

  1. COLCIENCIAS
  2. Spanish Ministry of Economy, Industry and Competitiveness [TRA2015-63708-R, TRA2016-78886-C3-1-R]


Today, data availability has gone from scarce to superabundant. Technologies such as the IoT, trends in social media, and the capabilities of smartphones are producing and digitizing large amounts of data that were previously unavailable. This massive increase in data creates opportunities for new business models, but it also demands new techniques and methods for data quality in knowledge discovery, especially when the data come from different sources (e.g., sensors, social networks, cameras). Assessing the quality of a dataset supports conclusions about the information it contains, and this is increasingly done with the aid of data cleaning approaches. Guaranteeing high data quality is therefore considered a primary goal of the data scientist. In this paper, we propose a process for data cleaning in regression models (DC-RM). The proposed data cleaning process is evaluated on real datasets from the UCI Repository of Machine Learning Databases. To assess the data cleaning process, the datasets cleaned by DC-RM were used to train the same regression models proposed by the authors of the UCI datasets. The results achieved by the models trained on the datasets produced by DC-RM are better than or equal to those reported by the datasets' authors.
