Article

A Systematic Approach for Variable Selection With Random Forests: Achieving Stable Variable Importance Values

Journal

IEEE GEOSCIENCE AND REMOTE SENSING LETTERS
Volume 14, Issue 11, Pages 1988-1992

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/LGRS.2017.2745049

Keywords

Mean decrease in accuracy (MDA); mean decrease in Gini (MDG) index; random forest; variable reduction

Funding

  1. Environment and Climate Change Canada
  2. Defence Research and Development Canada

Abstract

Random Forests variable importance measures are often used to rank variables by their relevance to a classification problem and then to reduce the number of model inputs in high-dimensional data sets, thus increasing computational efficiency. However, because training data and predictor variables are randomly selected when constructing each tree and splitting each node, it is well known that variable importance rankings tend to differ between model runs if too few trees are generated. In this letter, we characterize the effect of the number of trees (ntree) and class separability on the stability of variable importance rankings and develop a systematic approach to define the number of model runs and/or trees required to achieve stability in variable importance measures. Results demonstrate that either a large ntree for a single model run or values averaged across multiple model runs with fewer trees is sufficient for achieving stable mean importance values. While the latter is far more computationally efficient, both methods tend to lead to the same ranking of variables. Moreover, the optimal number of model runs differs depending on the separability of classes. Recommendations are made to users regarding how to determine the number of model runs and/or trees that are required to achieve stable variable importance rankings.
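
The idea of averaging importance values across repeated model runs until the ranking settles can be illustrated with a minimal Python sketch. It is not the procedure from the letter: the stopping rule (the ranking of the running mean is unchanged for `patience` consecutive runs), the function names `runs_until_stable` and `simulated_run`, and the `TRUE_IMPORTANCE` values are all assumptions made for illustration. The simulated run is a noisy stand-in for one random forest fit; in practice it would be replaced by a function that trains a forest with a given seed and returns its MDA or MDG values.

```python
import random
from typing import Callable, List, Tuple


def rank_order(values: List[float]) -> List[int]:
    """Return variable indices sorted from most to least important."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)


def runs_until_stable(
    run_importance: Callable[[int], List[float]],
    max_runs: int = 200,
    patience: int = 10,
) -> Tuple[int, List[float], List[int]]:
    """Average importances over successive runs until the ranking of the
    running mean is unchanged for `patience` consecutive runs.

    Assumed stopping rule, chosen for illustration only.
    Returns (number_of_runs, mean_importances, ranking).
    """
    sums: List[float] = []
    means: List[float] = []
    previous: List[int] = []
    unchanged = 0
    for run in range(1, max_runs + 1):
        imp = run_importance(run)  # one model run, seeded by the run index
        sums = list(imp) if not sums else [s + v for s, v in zip(sums, imp)]
        means = [s / run for s in sums]
        current = rank_order(means)
        # Count how many consecutive runs left the ranking untouched.
        unchanged = unchanged + 1 if current == previous else 0
        previous = current
        if unchanged >= patience:
            break
    return run, means, previous


# Hypothetical "true" importances; variables 2-4 are deliberately close,
# mimicking poorly separated predictors whose order flips between runs.
TRUE_IMPORTANCE = [0.40, 0.25, 0.15, 0.12, 0.08]


def simulated_run(seed: int, noise: float = 0.05) -> List[float]:
    """Stand-in for one random forest run: true importance plus noise."""
    rng = random.Random(seed)
    return [t + rng.gauss(0.0, noise) for t in TRUE_IMPORTANCE]


n_runs, mean_imp, ranking = runs_until_stable(simulated_run)
print(f"runs used: {n_runs}")
print(f"ranking of averaged importances: {ranking}")
```

Raising `noise` in `simulated_run` plays the role of using fewer trees per run, and narrowing the gaps in `TRUE_IMPORTANCE` plays the role of lower class separability; in both cases more runs should be needed before the ranking stops changing.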
