Journal
IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS
Volume 19, Issue 5, Pages 2817-2828
Publisher
IEEE COMPUTER SOC
DOI: 10.1109/TCBB.2021.3089417
Keywords
Feature extraction; Genomics; Bioinformatics; Random forests; Proteins; Ontologies; Support vector machines; Ensemble methods; weighted random sampling; enriched random forest; high-dimensional data; genomic analyses
Funding
- National Heart, Lung, and Blood Institute (NHLBI), National Institutes of Health [R01-HL150065]
Enriched Random Forest is developed to enhance the performance of traditional random forest by reducing the contribution of less informative features. It improves the prediction accuracy, especially when relevant features are few.
Ensemble methods such as random forest work well on high-dimensional datasets. However, when the number of features is extremely large compared to the number of samples and the percentage of truly informative features is very small, the performance of traditional random forest declines significantly. To this end, we develop a novel approach that enhances the performance of traditional random forest by reducing the contribution of trees whose nodes are populated with less informative features. The proposed method selects eligible subsets at each node by weighted random sampling, as opposed to the simple random sampling used in traditional random forest. We refer to this modified random forest algorithm as Enriched Random Forest. Using several high-dimensional microarray datasets, we evaluate the performance of our approach in both regression and classification settings. In addition, we demonstrate the effectiveness of balanced leave-one-out cross-validation in reducing computational load and decreasing sample size while computing feature weights. Overall, the results indicate that enriched random forest improves the prediction accuracy of traditional random forest, especially when relevant features are very few.
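The contrast between weighted and simple random sampling of candidate features at a node can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the feature weights are computed or which weighted sampling scheme is used, so the weights below are made-up values and the function names `weighted_feature_subset` and `uniform_feature_subset` are hypothetical. The weighted draw uses the Efraimidis-Spirakis key method, chosen here only because it needs nothing beyond the Python standard library.

```python
import random


def weighted_feature_subset(weights, k, rng):
    """Draw k distinct feature indices, favouring features with larger weights.

    Assumption: Efraimidis-Spirakis sampling without replacement. Each
    feature with positive weight w gets a key u ** (1 / w), u ~ Uniform[0, 1),
    and the k features with the largest keys are kept.
    """
    eligible = [(idx, w) for idx, w in enumerate(weights) if w > 0]
    if k > len(eligible):
        raise ValueError("k exceeds the number of features with positive weight")
    keys = [(rng.random() ** (1.0 / w), idx) for idx, w in eligible]
    keys.sort(reverse=True)
    return [idx for _, idx in keys[:k]]


def uniform_feature_subset(n_features, k, rng):
    """Simple random sampling, as in a traditional random forest node."""
    return rng.sample(range(n_features), k)


# Illustrative weights only: two "informative" features among 100.
rng = random.Random(0)
weights = [5.0, 5.0] + [0.1] * 98
n_draws = 2000
k = 10

weighted_counts = [0] * len(weights)
uniform_counts = [0] * len(weights)
for _ in range(n_draws):
    for idx in weighted_feature_subset(weights, k, rng):
        weighted_counts[idx] += 1
    for idx in uniform_feature_subset(len(weights), k, rng):
        uniform_counts[idx] += 1

# How often feature 0 was eligible at a node under each scheme.
print("weighted:", weighted_counts[0], "uniform:", uniform_counts[0])
```

Under simple random sampling each feature is eligible at a node with probability k / p regardless of its relevance, so informative features are rarely candidates when p is very large. Weighting the draw raises that probability for features with larger weights, which is the effect the abstract describes.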