Article

Fair train-test split in machine learning: Mitigating spatial autocorrelation for improved prediction accuracy

Journal

Publisher

ELSEVIER
DOI: 10.1016/j.petrol.2021.109885

Keywords

Fairness; Spatial autocorrelation; Train-test split; Kriging; Cross-validation

Funding

  1. Equinor
  2. University of Texas at Austin

This study proposes a new method that accounts for spatial autocorrelation in machine learning to design a fair train-test split. By applying the semivariogram model and modified rejection sampling, the method generates a test set whose prediction difficulty is similar to that of the planned real-world use of the model. The method outperforms other approaches in several empirical analyses and provides spatially aware sets ready for predictive machine learning problems.
Machine learning supports prediction and inference in multivariate and complex datasets where observations are spatially related to one another. Frequently, these datasets exhibit spatial autocorrelation that violates the assumption of independent and identically distributed data. Overlooking this correlation results in overoptimistic models that fail to account for the geographical configuration of the data. Furthermore, although several data split methods account for spatial autocorrelation, these methods are inflexible, and the parameter training and hyperparameter tuning of the machine learning model are performed at a different prediction difficulty than the planned real-world use of the model. In other words, the training-testing process is unfair. We present a novel method that considers both spatial autocorrelation and the planned real-world use of the spatial prediction model to design a fair train-test split. Demonstrations include two examples of the planned real-world use of the model, using a realistic multivariate synthetic dataset and the analysis of 148 wells from an undisclosed Equinor play. First, the workflow applies the semivariogram model of the target to compute the simple kriging variance as a proxy for spatial estimation difficulty based on the spatial data configuration. Second, the workflow employs a modified rejection sampling to generate a test set whose prediction difficulty is similar to that of the planned real-world use of the model. Third, we compare 100 test set realizations to the model's planned real-world use, using probability distributions and two divergence metrics: the Jensen-Shannon distance and the mean squared error. Of the three approaches compared (the spatial fair train-test split, the validation set approach, and spatial cross-validation), the analysis finds that only the spatial fair train-test split replicates the difficulty (i.e., kriging variance) of the planned real-world use.
Moreover, the proposed method outperforms the validation set approach, yielding a smaller mean percentage error when predicting a target feature in an undisclosed Equinor play with a random forest model. The outputs are training and test sets ready for model fitting and assessment with any machine learning algorithm. Thus, the proposed workflow offers spatially aware sets for predictive machine learning problems, with estimation difficulty similar to that of the planned real-world use of the model and compatible with any spatial data analysis task.
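
The three steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the exponential covariance model, the default sill and range values, the use of leave-one-out kriging variance as each candidate's difficulty, the histogram-based acceptance rule, and all function names (covariance, sk_variance, fair_test_indices, js_distance) are assumptions made here for illustration only.

```python
import numpy as np
from scipy.spatial.distance import cdist, jensenshannon


def covariance(h, sill=1.0, vrange=300.0):
    # Assumed exponential model: C(h) = sill * exp(-3h / range).
    return sill * np.exp(-3.0 * h / vrange)


def sk_variance(targets, data, sill=1.0, vrange=300.0):
    # Simple kriging variance at each target: sill - c0^T C^{-1} c0.
    C = covariance(cdist(data, data), sill, vrange)
    c0 = covariance(cdist(data, targets), sill, vrange)
    # Tiny diagonal jitter keeps the solve numerically stable.
    weights = np.linalg.solve(C + 1e-10 * np.eye(len(data)), c0)
    return np.clip(sill - np.sum(weights * c0, axis=0), 0.0, sill)


def fair_test_indices(data, real_use, n_test, sill=1.0, vrange=300.0,
                      bins=10, seed=0, max_passes=50):
    # Simplified rejection sampling (an assumption, not the paper's exact
    # procedure): accept candidates so that their kriging-variance
    # histogram approaches that of the planned real-world use.
    rng = np.random.default_rng(seed)
    target_kv = sk_variance(real_use, data, sill, vrange)
    # Candidate difficulty: leave-one-out kriging variance of each sample.
    cand_kv = np.array([
        sk_variance(data[[i]], np.delete(data, i, axis=0), sill, vrange)[0]
        for i in range(len(data))
    ])
    edges = np.linspace(0.0, sill, bins + 1)
    p, _ = np.histogram(target_kv, bins=edges)
    q, _ = np.histogram(cand_kv, bins=edges)
    ratio = np.divide(p / p.sum(), q / q.sum(),
                      out=np.zeros(bins), where=q > 0)
    if ratio.max() == 0:
        # No overlap between the two difficulty distributions.
        return np.array([], dtype=int)
    bin_idx = np.digitize(cand_kv, edges[1:-1])
    accept_prob = ratio[bin_idx] / ratio.max()
    chosen = []
    for _ in range(max_passes):
        for i in rng.permutation(len(data)):
            if len(chosen) == n_test:
                return np.array(chosen, dtype=int)
            if int(i) not in chosen and rng.random() < accept_prob[i]:
                chosen.append(int(i))
    # May hold fewer than n_test indices if few candidates qualify.
    return np.array(chosen, dtype=int)


def js_distance(a, b, sill=1.0, bins=10):
    # Jensen-Shannon distance between two kriging-variance histograms.
    edges = np.linspace(0.0, sill, bins + 1)
    pa, _ = np.histogram(a, bins=edges)
    pb, _ = np.histogram(b, bins=edges)
    return jensenshannon(pa, pb, base=2)
```

Given an array of sample coordinates and an array of planned prediction locations, fair_test_indices returns the row indices to hold out as the test set, and the remaining rows form the training set. js_distance can then be used to check how closely the kriging variance of the held-out samples matches that of the planned use. In practice, the covariance model and its parameters would come from the fitted semivariogram of the target rather than from the defaults assumed here.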
