Conditional feature importance for mixed data

ASTA-ADVANCES IN STATISTICAL ANALYSIS (2023)

Journal

ASTA-ADVANCES IN STATISTICAL ANALYSIS

Volume -, Issue -, Pages -

Publisher

SPRINGER

DOI: 10.1007/s10182-023-00477-9

Keywords

Interpretable machine learning; Feature importance; Knockoffs; Explainable artificial intelligence

Category

Statistics & Probability

Abstract

Despite the popularity of feature importance (FI) measures in interpretable machine learning, the statistical adequacy of these methods is rarely discussed. Our work draws attention to the distinction between marginal and conditional FI measures and their implications. We propose a workflow that combines the conditional predictive impact (CPI) framework with sequential knockoff sampling to provide statistically adequate measurement of conditional FI for mixed data.

Despite the popularity of feature importance (FI) measures in interpretable machine learning, the statistical adequacy of these methods is rarely discussed. From a statistical perspective, a major distinction is between analysing a variable's importance before and after adjusting for covariates, i.e., between marginal and conditional measures. Our work draws attention to this rarely acknowledged, yet crucial distinction and showcases its implications. We find that few methods are available for testing conditional FI, and that practitioners have hitherto been severely restricted in applying them because the methods' data requirements do not match the data at hand. Most real-world data exhibits complex feature dependencies and incorporates both continuous and categorical features (i.e., mixed data). Both properties are often neglected by conditional FI measures. To fill this gap, we propose to combine the conditional predictive impact (CPI) framework with sequential knockoff sampling. The CPI enables conditional FI measurement that controls for any feature dependencies by sampling valid knockoffs, that is, synthetic data with similar statistical properties, for the data to be analysed. Sequential knockoffs were deliberately designed to handle mixed data and thus allow us to extend the CPI approach to such datasets. We demonstrate through numerous simulations and a real-world example that our proposed workflow controls type I error, achieves high power, and is in line with results given by other conditional FI measures, whereas marginal FI metrics can result in misleading interpretations. Our findings highlight the necessity of developing statistically adequate, specialized methods for mixed data.
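To make the idea of a knockoff-based conditional FI measure concrete, the following is a minimal sketch and not the authors' implementation. It assumes two standard-normal features with a known correlation `rho`, so that a substitute for one feature can be drawn directly from its conditional distribution given the other; this is a simplified stand-in for knockoff sampling, and it does not cover the sequential knockoffs for mixed data that the paper proposes. The per-observation loss increase is tested with a one-sided test using a normal approximation, which is likewise an assumption made here for illustration. All function names (`simulate`, `fit_ols`, `cpi`) are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def simulate(n, rho, rng):
    """Draw (x1, x2, y) with corr(x1, x2) = rho; y depends on x1 only."""
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0, 1)
        x2 = rho * x1 + math.sqrt(1 - rho**2) * rng.gauss(0, 1)
        y = 2.0 * x1 + rng.gauss(0, 1)
        rows.append((x1, x2, y))
    return rows


def fit_ols(rows):
    """Least squares for y ~ b1*x1 + b2*x2 (no intercept; data have mean zero)."""
    s11 = sum(x1 * x1 for x1, _, _ in rows)
    s12 = sum(x1 * x2 for x1, x2, _ in rows)
    s22 = sum(x2 * x2 for _, x2, _ in rows)
    s1y = sum(x1 * y for x1, _, y in rows)
    s2y = sum(x2 * y for _, x2, y in rows)
    det = s11 * s22 - s12 * s12
    return ((s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det)


def cpi(rows, coefs, j, rho, rng):
    """Mean loss increase when feature j is replaced by a conditional draw.

    Assumption: bivariate standard normal features, so the conditional
    distribution of feature j given the other one is N(rho * other, 1 - rho^2).
    """
    diffs = []
    for x1, x2, y in rows:
        x = [x1, x2]
        loss_orig = (y - coefs[0] * x[0] - coefs[1] * x[1]) ** 2
        # Replace feature j by a draw that keeps its dependence on the other
        # feature but carries no extra information about y.
        x[j] = rho * x[1 - j] + math.sqrt(1 - rho**2) * rng.gauss(0, 1)
        loss_sub = (y - coefs[0] * x[0] - coefs[1] * x[1]) ** 2
        diffs.append(loss_sub - loss_orig)
    est = mean(diffs)
    t_stat = est / (stdev(diffs) / math.sqrt(len(diffs)))
    p_value = 1 - NormalDist().cdf(t_stat)  # one-sided, normal approximation
    return est, t_stat, p_value


rng = random.Random(0)
rho = 0.8
train, test = simulate(1000, rho, rng), simulate(1000, rho, rng)
coefs = fit_ols(train)
cpi_x1, t_x1, p_x1 = cpi(test, coefs, 0, rho, rng)
cpi_x2, t_x2, p_x2 = cpi(test, coefs, 1, rho, rng)
print(f"x1: CPI={cpi_x1:.3f}, p={p_x1:.4f}")
print(f"x2: CPI={cpi_x2:.3f}, p={p_x2:.4f}")
```

In this toy setting only `x1` enters the outcome, so its estimated loss increase should be clearly positive, while the estimate for the correlated but conditionally uninformative `x2` should stay close to zero. The loss differences are computed on held-out data, so the estimate reflects predictive impact rather than in-sample fit.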
