The dilemma of determining the superiority of data mining models: optimal sampling balance and end users' perspectives matter

BULLETIN OF ENGINEERING GEOLOGY AND THE ENVIRONMENT (2020)

Journal

BULLETIN OF ENGINEERING GEOLOGY AND THE ENVIRONMENT

Volume 79, Issue 4, Pages 1707-1720

Publisher

SPRINGER HEIDELBERG

DOI: 10.1007/s10064-019-01687-9

Keywords

MaxEnt; SVM; ANFIS-ICA; ROC curve; False negative; False positive

Abstract

This work pinpoints two understated issues in landslide susceptibility modeling: (1) how assumptions about the data sampling balance can significantly affect model performance and (2) how different modeling perspectives, in particular a preference for specific model attributes, can considerably influence the model selection process. Three data mining models and their two-mode ensembles were selected as the basis of our experiment, namely, support vector machine (SVM), maximum entropy (MaxEnt), the ensemble of the adaptive neuro-fuzzy inference system and the imperialistic competitive algorithm (ANFIS-ICA), and their addition/multiplicity ensemble modes (WAE and WME). Further, we imitated four community groups and the main goals they aspire to, namely, a speculative builder or a financial risk analyst (seeking the highest economic opportunities), people or NGOs (seeking the lowest human casualties and economic losses), the government (seeking a trade-off between the two preceding goals), and a mechanical engineering supervisor (seeking the most robust and stable model design). Results revealed that, in contrast to an assumption made by several researchers in the literature, a 70:30% training/validation partition did not give satisfactory results in our study area; instead, a 60:40% partition appears to be a good trade-off between the models' learning and prediction powers. Moreover, the area under the receiver operating characteristic curve (AUROC) suggested that the ANFIS-ICA hybrid shows excellent results compared with its counterparts. Regarding the model selection stage at the optimal sample balance of 60:40%, it was found that although the WME model showed the lowest type II error (false negative) in both the training and validation stages, it also showed the highest type I error (false positive), while the other models fell somewhere in between. Conversely, the WAE model outperformed the other models with the lowest type I error. Further, the robustness analysis suggested that the SVM and MaxEnt models can provide more stable results than their counterparts. Hence, in the process of model selection, perspectives matter most, as no single model performs best for every problem.
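
The idea that the preferred model depends on which error an end user weighs most can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' procedure: the labels and predictions are toy values, the model names (`model_a`, `model_b`, `model_c`) are hypothetical, and the trade-off is taken to be the unweighted sum of the two error rates.

```python
from typing import Dict, List, Tuple


def error_rates(y_true: List[int], y_pred: List[int]) -> Tuple[float, float]:
    """Return (type I rate, type II rate) for binary labels (1 = landslide)."""
    # Type I error: a non-landslide cell predicted as landslide (false positive).
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # Type II error: a landslide cell predicted as non-landslide (false negative).
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp / y_true.count(0), fn / y_true.count(1)


def pick_model(rates: Dict[str, Tuple[float, float]], perspective: str) -> str:
    """Pick the model that minimizes the error a given perspective cares about."""
    if perspective == "min_false_positive":
        return min(rates, key=lambda name: rates[name][0])
    if perspective == "min_false_negative":
        return min(rates, key=lambda name: rates[name][1])
    if perspective == "trade_off":
        # Assumed trade-off: unweighted sum of both error rates.
        return min(rates, key=lambda name: sum(rates[name]))
    raise ValueError(f"unknown perspective: {perspective}")


# Toy validation labels and predictions (hypothetical, not from the study).
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
predictions = {
    "model_a": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],  # misses nothing, over-warns
    "model_b": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],  # never over-warns, misses some
    "model_c": [1, 1, 1, 1, 0, 1, 0, 0, 0, 0],  # one error of each kind
}
rates = {name: error_rates(y_true, pred) for name, pred in predictions.items()}

for perspective in ("min_false_positive", "min_false_negative", "trade_off"):
    print(perspective, "->", pick_model(rates, perspective))
```

With these toy values each criterion selects a different model, which mirrors the point that no single model is best under every perspective. How each criterion maps onto a particular community group is left open here, since that mapping is a judgment call rather than something the sketch can establish.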
