
Exploring issues of training data imbalance and mislabelling on random forest performance for large area land cover classification using the ensemble margin

ISPRS JOURNAL OF PHOTOGRAMMETRY AND REMOTE SENSING (2015)

Journal

ISPRS JOURNAL OF PHOTOGRAMMETRY AND REMOTE SENSING

Volume 105, Pages 155-168

Publisher

ELSEVIER

DOI: 10.1016/j.isprsjprs.2015.03.014

Keywords

Ensemble margin; Training data; Classification; Remote sensing; Imbalance; Mislabelling

Categories

Geography, Physical; Geosciences, Multidisciplinary; Remote Sensing; Imaging Science & Photographic Technology


Abstract

Studies have demonstrated the robust performance of the ensemble machine learning classifier Random Forests (RF) for remote sensing land cover classification, particularly across complex landscapes. This study introduces new ensemble margin criteria to evaluate the performance of RF in the context of large area land cover classification and examines the effect of different training data characteristics (imbalance and mislabelling) on classification accuracy and uncertainty. The study presents a new margin weighted confusion matrix which, used in combination with the traditional confusion matrix, provides confidence estimates associated with correctly classified and misclassified instances in the RF classification model. Landsat TM satellite imagery and topographic and climate ancillary data are used to build binary (forest/non-forest) and multiclass (forest canopy cover classes) classification models, trained using sample aerial photograph maps, across Victoria, Australia. Experiments were undertaken to reveal insights into the behaviour of RF over large and complex data, in which training data are not evenly distributed among classes (imbalance) and contain systematically mislabelled instances. Results of the experiments reveal that while the error rate of the RF classifier is relatively insensitive to mislabelled training data (in the multiclass experiment, overall Kappa fell from 78.3% with no mislabelled instances to 70.1% with 25% mislabelling in each class), the level of associated confidence falls at a faster rate than overall accuracy with increasing amounts of mislabelled training data. In general, balanced training data resulted in the lowest overall error rates for classification experiments (82.3% and 78.3% for the binary and multiclass experiments respectively). However, results of the study demonstrate that imbalance can be introduced to improve error rates of more difficult classes, without adversely affecting overall classification accuracy.
(C) 2015 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS). Published by Elsevier B.V. All rights reserved.
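
The abstract does not define the ensemble margin or the margin weighted confusion matrix, so the following is only an illustrative sketch under stated assumptions. It uses the standard supervised ensemble margin (votes for the true class minus the largest vote count for any other class, divided by the number of trees), and it assumes, purely for illustration, that a margin weighted confusion matrix can be approximated by the mean absolute margin of the instances in each confusion matrix cell. The function names and the toy vote lists are hypothetical; in practice the votes would be the per-tree predictions of a trained random forest.

```python
from collections import Counter


def ensemble_margin(votes, true_label):
    """Supervised margin of one instance: (v_true - v_best_other) / n_trees, in [-1, 1]."""
    counts = Counter(votes)
    v_true = counts.get(true_label, 0)
    v_other = max((n for label, n in counts.items() if label != true_label), default=0)
    return (v_true - v_other) / len(votes)


def majority_vote(votes):
    """Predicted class: the label with the most votes (ties go to the first label seen)."""
    return Counter(votes).most_common(1)[0][0]


def confusion_matrices(all_votes, true_labels, classes):
    """Return (count matrix, mean |margin| matrix); rows = true class, columns = predicted.

    ASSUMPTION: averaging |margin| per cell is an illustrative stand-in,
    not the definition used in the paper.
    """
    index = {c: i for i, c in enumerate(classes)}
    k = len(classes)
    counts = [[0] * k for _ in range(k)]
    margin_sums = [[0.0] * k for _ in range(k)]
    for votes, y in zip(all_votes, true_labels):
        i, j = index[y], index[majority_vote(votes)]
        counts[i][j] += 1
        margin_sums[i][j] += abs(ensemble_margin(votes, y))
    mean_margin = [
        [margin_sums[i][j] / counts[i][j] if counts[i][j] else 0.0 for j in range(k)]
        for i in range(k)
    ]
    return counts, mean_margin


# Toy data: four instances, each voted on by ten hypothetical trees.
classes = ["forest", "non-forest"]
all_votes = [
    ["forest"] * 9 + ["non-forest"] * 1,  # confident and correct
    ["forest"] * 6 + ["non-forest"] * 4,  # correct but low confidence
    ["forest"] * 3 + ["non-forest"] * 7,  # true class is forest: misclassified
    ["non-forest"] * 10,                  # unanimous and correct
]
true_labels = ["forest", "forest", "forest", "non-forest"]

counts, mean_margin = confusion_matrices(all_votes, true_labels, classes)
print(counts)
print(mean_margin)
```

Read together, the two matrices separate how often a class is confused from how confidently the ensemble voted in each cell, which is the kind of complementary view the abstract attributes to combining the margin weighted and traditional confusion matrices.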
