Article

Data-Driven Diabetes Risk Factor Prediction Using Machine Learning Algorithms with Feature Selection Technique

Journal

SUSTAINABILITY
Volume 15, Issue 6, Pages -

Publisher

MDPI
DOI: 10.3390/su15064930

Keywords

diabetes; feature selection; risk factors; machine learning

As type 2 diabetes becomes more prevalent worldwide, predicting its risk factors becomes increasingly important. However, a substantial gap remains in predicting the risk factors of this disease. The purpose of this study is therefore to predict diabetes risk factors by applying machine learning (ML) algorithms. A two-fold feature selection approach, combining principal component analysis (PCA) and information gain (IG), was applied to improve prediction accuracy. The optimal features were then fed into five ML algorithms: decision tree, random forest, support vector machine, logistic regression, and k-nearest neighbors (KNN). The primary data used to train the ML models were collected following the safety procedure described in the Declaration of Helsinki (2013), and 738 records were included in the final analysis. The results show an accuracy of over 82.2%, with an AUC (area under the ROC curve) of 87.2%. This research not only identified the most important clinical and non-clinical factors in diabetes prediction, but also found that the clinical risk factor glucose is the most relevant for diabetes prediction, followed by dietary factors. The noteworthy contribution of this research is the identification of factors left unclassified by the previous study, which considered both clinical and non-clinical aspects.
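As a rough illustration of the information gain (IG) step named in the abstract, the sketch below ranks categorical features by how much they reduce the entropy of the class label. It is a minimal sketch under stated assumptions, not the authors' implementation: the feature names, the toy records, and the helper functions `entropy`, `information_gain`, and `rank_features` are all hypothetical, and the PCA and classifier stages are omitted.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Label entropy minus the expected label entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_features(rows, labels):
    """Return (feature name, information gain) pairs, highest gain first."""
    scores = {
        name: information_gain([row[name] for row in rows], labels)
        for name in rows[0]
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy records for illustration only; NOT data from the study.
rows = [
    {"glucose": "high", "sugary_diet": "yes", "smoker": "yes"},
    {"glucose": "high", "sugary_diet": "yes", "smoker": "yes"},
    {"glucose": "high", "sugary_diet": "yes", "smoker": "no"},
    {"glucose": "high", "sugary_diet": "no", "smoker": "no"},
    {"glucose": "normal", "sugary_diet": "no", "smoker": "yes"},
    {"glucose": "normal", "sugary_diet": "no", "smoker": "yes"},
    {"glucose": "normal", "sugary_diet": "no", "smoker": "no"},
    {"glucose": "normal", "sugary_diet": "yes", "smoker": "no"},
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = diabetic, 0 = non-diabetic (made up)

ranking = rank_features(rows, labels)
for name, gain in ranking:
    print(f"{name}: {gain:.3f} bits")
```

Features with higher gain would be kept for the downstream classifiers. In practice, continuous measurements would first need to be discretized, or a library estimator of mutual information could be used instead of this hand-rolled version.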
