Article

LASSO and Elastic Net Tend to Over-Select Features

MATHEMATICS (2023)

Journal

MATHEMATICS

Volume 11, Issue 17, Pages -

Publisher

MDPI

DOI: 10.3390/math11173738

Keywords

logistic regression; machine learning; prediction model; ROC curve; variable selection

Category

Mathematics

Smart summary

This paper examines the standard practice of using machine learning methods to select features and build prediction models, and the problems that arise with popular methods such as LASSO and elastic net. It proposes combining standard regression methods with stepwise variable selection to overcome these problems, and reports advantages of this approach over LASSO and elastic net in terms of statistical significance and prediction accuracy.

Abstract

Machine learning methods have become a standard approach for selecting features that are associated with an outcome and for building a prediction model when the number of candidate features is large. LASSO is one of the most popular approaches to this end. By imposing an L1-norm penalty to overcome the high dimensionality of the candidate features, LASSO selects features with large regression estimates rather than features that are statistically significant. As a result, LASSO may select insignificant features while possibly missing significant ones. Furthermore, in our experience, LASSO tends to select too many features. Selecting features that are not associated with the outcome raises the cost of collecting and managing them whenever the fitted prediction model is used in the future. Using a combination of L1- and L2-norm penalties, elastic net (EN) tends to select even more features than LASSO. The over-selected features that are not associated with the outcome act like white noise, so the fitted prediction model may lose prediction accuracy. In this paper, we propose using standard regression methods, without any penalization, combined with a stepwise variable selection procedure to overcome these issues. Unlike LASSO and EN, this method selects features based on statistical significance. Through extensive simulations, we show that this maximum likelihood estimation-based method selects a very small number of features while maintaining high prediction power, whereas LASSO and EN make a large number of false selections, resulting in a loss of prediction accuracy. In contrast to LASSO and EN, regression combined with stepwise variable selection is a standard statistical method, so any biostatistician can use it to analyze high-dimensional data, even without advanced bioinformatics knowledge.
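To make the selection mechanism in the abstract concrete, the following is a minimal sketch of how an L1-norm penalty picks features: a coefficient is kept only if its estimate survives soft-thresholding, and no significance test is involved. It is not the authors' simulation code. It assumes a linear model without intercept (not the logistic model named in the keywords), independent standard-normal features, and a plain coordinate-descent solver. The function names (simulate, lambda_max, lasso_coordinate_descent, count_selected) and all parameter values are hypothetical choices for illustration, and the stepwise procedure proposed in the paper is not implemented here.

```python
import random


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; anything in [-lam, lam] becomes exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    X is a list of n rows with p entries each; no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta; beta starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Covariance of column j with the residual that excludes feature j.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_b = soft_threshold(rho, lam) / col_sq[j]
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                beta[j] = new_b
    return beta


def lambda_max(X, y):
    """Smallest penalty at which every LASSO coefficient is zero."""
    n, p = len(X), len(X[0])
    return max(abs(sum(X[i][j] * y[i] for i in range(n))) / n for j in range(p))


def simulate(n=100, p=30, n_true=3, effect=1.0, seed=0):
    """Assumed toy design: only the first n_true features affect the outcome."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [sum(effect * row[j] for j in range(n_true)) + rng.gauss(0.0, 1.0) for row in X]
    return X, y


def count_selected(beta, n_true):
    """Return (number selected, true selections, false selections)."""
    selected = [j for j, b in enumerate(beta) if b != 0.0]
    true_pos = sum(1 for j in selected if j < n_true)
    return len(selected), true_pos, len(selected) - true_pos


# Demo: how many features survive as the penalty is relaxed.
X, y = simulate(n=100, p=30, n_true=3, effect=1.0, seed=0)
top = lambda_max(X, y)
for frac in (0.5, 0.2, 0.05):
    beta = lasso_coordinate_descent(X, y, frac * top)
    total, true_pos, false_pos = count_selected(beta, n_true=3)
    print(f"lam = {frac:.2f} * lambda_max: selected {total} "
          f"(true {true_pos}, false {false_pos})")
```

The counts printed by the demo depend on the seed and on the assumed design, so they should not be read as a reproduction of the paper's simulation results. The point of the sketch is only that the selection rule compares each shrunken estimate with zero, so the number of selected features is governed by the penalty level, not by a test of statistical significance.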
