Article

Fair train-test split in machine learning: Mitigating spatial autocorrelation for improved prediction accuracy

JOURNAL OF PETROLEUM SCIENCE AND ENGINEERING (2022)

Journal

JOURNAL OF PETROLEUM SCIENCE AND ENGINEERING

Volume 209, Issue -, Pages -

Publisher

ELSEVIER

DOI: 10.1016/j.petrol.2021.109885

Keywords

Fairness; Spatial autocorrelation; Train-test split; Kriging; Cross-validation

Categories

Energy & Fuels Engineering, Petroleum

Funding

Equinor
University of Texas at Austin

Summary

This study proposes a new method that accounts for spatial autocorrelation in machine learning to design a fair train-test split. Using the semivariogram model and a modified rejection sampling, the method generates a test set whose prediction difficulty is similar to that of the planned real-world use of the model. In several empirical analyses the method outperforms other approaches, and it provides spatially aware sets ready for predictive machine learning problems.

Abstract

Machine learning supports prediction and inference in multivariate and complex datasets where observations are spatially related to one another. Frequently, these datasets exhibit spatial autocorrelation that violates the assumption of independent and identically distributed data. Overlooking this correlation results in overoptimistic models that fail to account for the geographical configuration of the data. Furthermore, although several data split methods account for spatial autocorrelation, they are inflexible, and the parameter training and hyperparameter tuning of the machine learning model are carried out at a different prediction difficulty than the planned real-world use of the model. In other words, the training-testing process is unfair. We present a novel method that considers both spatial autocorrelation and the planned real-world use of the spatial prediction model to design a fair train-test split. Demonstrations include two examples of the planned real-world use of the model, using a realistic multivariate synthetic dataset and the analysis of 148 wells from an undisclosed Equinor play. First, the workflow applies the semivariogram model of the target to compute the simple kriging variance as a proxy for spatial estimation difficulty based on the spatial data configuration. Second, the workflow employs a modified rejection sampling to generate a test set whose prediction difficulty is similar to that of the planned real-world use of the model. Third, we compare 100 test-set realizations to the model's planned real-world use, using probability distributions and two divergence metrics: the Jensen-Shannon distance and the mean squared error. Of the three methods compared (the spatial fair train-test split, the validation set approach, and spatial cross-validation), the analysis finds that only the spatial fair train-test split replicates the prediction difficulty (i.e., the kriging variance) of the planned real-world use. Moreover, the proposed method outperforms the validation set approach, yielding a lower mean percentage error when predicting a target feature in an undisclosed Equinor play using a random forest model. The resulting outputs are training and test sets ready for model fitting and assessment with any machine learning algorithm. Thus, the proposed workflow offers spatially aware sets that are ready for predictive machine learning problems, have an estimation difficulty similar to that of the planned real-world use of the model, and are compatible with any spatial data analysis task.
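
The three steps in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a spherical semivariogram with zero nugget, unit sill, and a range of 300 (hypothetical values), uses the leave-one-out simple kriging variance of each sample as its difficulty, and applies plain histogram-based rejection sampling rather than the modified version the paper describes. All function and variable names are illustrative, and the coordinates are randomly generated.

```python
import numpy as np

def spherical_cov(h, sill, vrange):
    """Covariance C(h) = sill - gamma(h) for a spherical semivariogram, zero nugget."""
    r = np.minimum(np.asarray(h, dtype=float) / vrange, 1.0)
    return sill * (1.0 - 1.5 * r + 0.5 * r**3)

def sk_variance(data_xy, query_xy, sill=1.0, vrange=300.0):
    """Simple kriging variance at each query location given the data locations."""
    d_dd = np.linalg.norm(data_xy[:, None] - data_xy[None, :], axis=2)
    d_dq = np.linalg.norm(data_xy[:, None] - query_xy[None, :], axis=2)
    # Tiny diagonal term keeps the linear solve numerically stable.
    C = spherical_cov(d_dd, sill, vrange) + 1e-10 * np.eye(len(data_xy))
    c0 = spherical_cov(d_dq, sill, vrange)
    w = np.linalg.solve(C, c0)  # kriging weights, one column per query point
    return np.clip(sill - np.sum(w * c0, axis=0), 0.0, sill)

def binned_pdf(values, edges):
    """Return the bin index of each value and the normalized histogram."""
    idx = np.clip(np.digitize(values, edges) - 1, 0, len(edges) - 2)
    counts = np.bincount(idx, minlength=len(edges) - 1)
    return idx, counts / counts.sum()

def js_distance(p, q):
    """Jensen-Shannon distance (base 2) between two discrete distributions."""
    m = 0.5 * (p + q)
    def kl(u, v):
        mask = u > 0
        return np.sum(u[mask] * np.log2(u[mask] / v[mask]))
    return float(np.sqrt(max(0.5 * kl(p, m) + 0.5 * kl(q, m), 0.0)))

def rejection_sample(cand_kv, target_kv, n_test, edges, rng, max_draws=100_000):
    """Pick test samples so their difficulty histogram approaches the target's."""
    idx, cand_pdf = binned_pdf(cand_kv, edges)
    _, target_pdf = binned_pdf(target_kv, edges)
    ratio = target_pdf[idx] / cand_pdf[idx]  # target density over proposal density
    if ratio.max() == 0:
        raise ValueError("no candidate falls in a bin with target mass")
    accept = ratio / ratio.max()
    chosen, remaining = [], list(range(len(cand_kv)))
    for _ in range(max_draws):
        if len(chosen) == n_test or not remaining:
            break
        pos = rng.integers(len(remaining))
        if rng.random() < accept[remaining[pos]]:
            chosen.append(remaining.pop(pos))
    return np.array(sorted(chosen), dtype=int)

rng = np.random.default_rng(0)
wells = rng.uniform(0, 1000, size=(60, 2))     # hypothetical sampled locations
planned = rng.uniform(0, 1000, size=(200, 2))  # hypothetical real-world prediction locations

# Step 1: kriging variance as a proxy for estimation difficulty.
target_kv = sk_variance(wells, planned)
cand_kv = np.array([sk_variance(np.delete(wells, i, axis=0), wells[i:i + 1])[0]
                    for i in range(len(wells))])

# Step 2: rejection sampling of the test set; the rest is the training set.
edges = np.linspace(0.0, 1.0, 11)
test_idx = rejection_sample(cand_kv, target_kv, 15, edges, rng)
train_idx = np.setdiff1d(np.arange(len(wells)), test_idx)

# Step 3: compare the test-set difficulty with the planned use.
_, p_target = binned_pdf(target_kv, edges)
_, p_test = binned_pdf(cand_kv[test_idx], edges)
print("Jensen-Shannon distance:", js_distance(p_target, p_test))
```

A Jensen-Shannon distance close to 0 indicates that the sampled test set has a difficulty distribution close to that of the planned use. Because the sketch draws without replacement from a small pool, it may return fewer than the requested number of test samples when few candidates fall in the high-density bins of the target.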
