
Exploring issues of training data imbalance and mislabelling on random forest performance for large area land cover classification using the ensemble margin

ISPRS JOURNAL OF PHOTOGRAMMETRY AND REMOTE SENSING (2015)

Journal

ISPRS JOURNAL OF PHOTOGRAMMETRY AND REMOTE SENSING

Volume 105, Pages 155-168

Publisher

ELSEVIER

DOI: 10.1016/j.isprsjprs.2015.03.014

Keywords

Ensemble margin; Training data; Classification; Remote sensing; Imbalance; Mislabelling


Abstract

Studies have demonstrated the robust performance of the ensemble machine learning classifier random forests for remote sensing land cover classification, particularly across complex landscapes. This study introduces new ensemble margin criteria to evaluate the performance of random forests (RF) in the context of large area land cover classification and examines the effect of different training data characteristics (imbalance and mislabelling) on classification accuracy and uncertainty. The study presents a new margin-weighted confusion matrix which, used in combination with the traditional confusion matrix, provides confidence estimates associated with correctly classified and misclassified instances in the RF classification model. Landsat TM satellite imagery and topographic and climate ancillary data are used to build binary (forest/non-forest) and multiclass (forest canopy cover classes) classification models, trained using sample aerial photograph maps, across Victoria, Australia. Experiments were undertaken to reveal insights into the behaviour of RF over large and complex data in which training data are not evenly distributed among classes (imbalance) and contain systematically mislabelled instances. The results reveal that while the error rate of the RF classifier is relatively insensitive to mislabelled training data (in the multiclass experiment, overall Kappa falls from 78.3% with no mislabelled instances to 70.1% with 25% mislabelling in each class), the level of associated confidence falls at a faster rate than overall accuracy with increasing amounts of mislabelled training data. In general, balanced training data resulted in the lowest overall error rates for classification experiments (82.3% and 78.3% for the binary and multiclass experiments respectively). However, the results demonstrate that imbalance can be introduced to improve error rates of more difficult classes without adversely affecting overall classification accuracy.
(C) 2015 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS). Published by Elsevier B.V. All rights reserved.
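
To make the idea of an ensemble margin more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the commonly used supervised margin (votes for the true class minus the largest vote count for any other class, divided by the total number of votes), and it assumes one plausible form of a margin-weighted confusion matrix: the mean margin of the instances falling in each confusion-matrix cell. The abstract does not give the paper's exact definitions, so the function names `ensemble_margin` and `margin_weighted_confusion`, and the toy vote counts, are hypothetical.

```python
import numpy as np

def ensemble_margin(votes, y_true):
    """Supervised margin per instance, in [-1, 1].

    votes  : (n_samples, n_classes) array of per-class tree votes (n_classes >= 2)
    y_true : (n_samples,) integer class labels
    Assumed definition: (votes for true class - max votes for another class) / total votes.
    """
    votes = np.asarray(votes, dtype=float)
    y_true = np.asarray(y_true)
    rows = np.arange(votes.shape[0])
    true_votes = votes[rows, y_true]
    others = votes.copy()
    others[rows, y_true] = -np.inf          # mask the true class
    best_other = others.max(axis=1)         # strongest competing class
    return (true_votes - best_other) / votes.sum(axis=1)

def margin_weighted_confusion(votes, y_true):
    """Return (counts, mean_margin), both indexed [true class, predicted class].

    Hypothetical weighting: each cell holds the mean margin of its instances;
    empty cells are NaN.
    """
    votes = np.asarray(votes, dtype=float)
    y_true = np.asarray(y_true)
    n_classes = votes.shape[1]
    y_pred = votes.argmax(axis=1)           # majority vote
    margins = ensemble_margin(votes, y_true)
    counts = np.zeros((n_classes, n_classes))
    sums = np.zeros((n_classes, n_classes))
    np.add.at(counts, (y_true, y_pred), 1)
    np.add.at(sums, (y_true, y_pred), margins)
    with np.errstate(invalid="ignore", divide="ignore"):
        mean_margin = np.where(counts > 0, sums / counts, np.nan)
    return counts, mean_margin

# Toy example: 4 instances, 2 classes, 100 trees each (made-up vote counts).
votes = [[90, 10], [60, 40], [30, 70], [45, 55]]
y_true = [0, 0, 1, 0]
margins = ensemble_margin(votes, y_true)
counts, mean_margin = margin_weighted_confusion(votes, y_true)
```

Under these assumptions, diagonal cells carry positive mean margins (high values indicate confident correct decisions) and off-diagonal cells carry negative ones, so reading the two matrices together shows not only how many instances were misclassified but how decisively the ensemble voted in each case.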
