Article

A comparison of methods for interpreting random forest models of genetic association in the presence of non-additive interactions

BIODATA MINING (2021)

Journal

BIODATA MINING

Volume 14, Issue 1, Pages -

Publisher

BMC

DOI: 10.1186/s13040-021-00243-0

Keywords

Machine learning; Feature importances; Random forest; Epistasis; Simulation; Alzheimer's disease; Glaucoma

Category

Mathematical & Computational Biology

Funding

National Institutes of Health (USA) [LM010098, AI116794]

Abstract

The study compared different methods for feature importance estimation in real and simulated datasets with non-additive interactions. The results showed that the permutation feature importance metric provides more precise feature importance rank estimation in the presence of non-additive interactions, particularly in simulated datasets.

Background: Non-additive interactions among genes are frequently associated with a number of phenotypes, including complex diseases such as Alzheimer's disease, diabetes, and cardiovascular disease. Detecting these interactions requires careful selection of analytical methods, because some machine learning algorithms are unable or underpowered to detect or model non-additive feature interactions. Random Forest is often used for this purpose because it can detect and model non-additive interactions. It also has a built-in ability to estimate feature importance scores, which allows the model to be interpreted in terms of the rank order and effect size of each feature's association with the outcome. This is very important for epidemiological and clinical studies, where the results of predictive modeling may be used to define the future direction of research. Two alternative ways to interpret the model are the permutation feature importance metric, which uses a permutation approach to calculate a feature contribution coefficient in units of the decrease in the model's performance, and Shapley additive explanations, which use a cooperative game theory approach. It is currently unclear which Random Forest feature importance metric provides a superior estimate of the true informative contribution of features in genetic association analysis.

Results: To address this issue, and to improve the interpretability of Random Forest predictions, we compared different methods for feature importance estimation in real and simulated datasets with non-additive interactions. We detected a discrepancy between the metrics for the real-world datasets and further established that the permutation feature importance metric provides more precise feature importance rank estimation for the simulated datasets with non-additive interactions.

Conclusions: By analyzing both real and simulated data, we established that the permutation feature importance metric provides more precise feature importance rank estimation in the presence of non-additive interactions.
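To make the comparison described in the abstract concrete, here is a minimal sketch of how built-in and permutation feature importances could be computed side by side for a Random Forest. It is not the authors' simulation design or code. It assumes scikit-learn, and the toy data are purely hypothetical: ten binary-coded variants, with the phenotype defined as the XOR of the first two (a simple non-additive interaction) plus about 5% label noise. Shapley additive explanations are omitted because they need an additional library.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_features = 2000, 10

# Hypothetical binary-coded variants (1 = carrier, 0 = non-carrier).
X = rng.integers(0, 2, size=(n_samples, n_features))

# Assumed interaction model: the phenotype is the XOR of features 0 and 1,
# so in expectation neither feature has a marginal effect on its own.
y = X[:, 0] ^ X[:, 1]

# Flip roughly 5% of the labels to mimic noise.
flip = rng.random(n_samples) < 0.05
y = np.where(flip, 1 - y, y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# Built-in (impurity-based) importance, computed from the training data.
builtin = forest.feature_importances_

# Permutation importance: mean drop in held-out accuracy when one
# feature column is shuffled, averaged over n_repeats shuffles.
perm = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=0
)

# Feature indices ordered from most to least important.
builtin_rank = np.argsort(builtin)[::-1]
perm_rank = np.argsort(perm.importances_mean)[::-1]

print("built-in rank:   ", builtin_rank.tolist())
print("permutation rank:", perm_rank.tolist())
```

In this toy setting the two interacting features should dominate the permutation ranking, while the ordering of the remaining noise features is essentially arbitrary. Scoring on held-out data is a design choice of this sketch, made so that the permutation scores reflect generalization rather than fit to the training set.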
