Article

Hellinger distance decision trees are robust and skew-insensitive

Journal

DATA MINING AND KNOWLEDGE DISCOVERY
Volume 24, Issue 1, Pages 136-158

Publisher

SPRINGER
DOI: 10.1007/s10618-011-0222-1

Keywords

Imbalanced data; Decision tree; Hellinger distance

Funding

  1. NSF [ECCS-0926170]
  2. US Department of Energy through ASC CSEE [DE-AC04-76DO00789]
  3. Arthur J. Schmitt Fellowship
  4. Division of Electrical, Communications and Cyber Systems, Directorate for Engineering [0926170] (Funding Source: National Science Foundation)

Learning from imbalanced data is an important and common problem. Decision trees, supplemented with sampling techniques, have proven to be an effective way to address the imbalanced data problem. Despite their effectiveness, however, sampling methods add complexity and the need for parameter selection. To bypass these difficulties we propose a new decision tree technique called Hellinger Distance Decision Trees (HDDT) which uses Hellinger distance as the splitting criterion. We analytically and empirically demonstrate the strong skew insensitivity of Hellinger distance and its advantages over popular alternatives such as entropy (gain ratio). We apply a comprehensive empirical evaluation framework testing against commonly used sampling and ensemble methods, considering performance across 58 varied datasets. We demonstrate the superiority (using robust tests of statistical significance) of HDDT on imbalanced data, as well as its competitive performance on balanced datasets. We thereby arrive at the particularly practical conclusion that for imbalanced data it is sufficient to use Hellinger trees with bagging (BG) without any sampling methods. We provide all the datasets and software for this paper online (http://www.nd.edu/~dial/hddt).
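The skew insensitivity claimed in the abstract can be illustrated with a minimal sketch. It assumes the common two-class form of the criterion, in which each branch of a candidate split contributes the squared difference between the square roots of the fraction of positives and the fraction of negatives it receives. The function names are hypothetical and are not taken from the authors' software, and plain information gain stands in for gain ratio to keep the entropy-based comparison short.

```python
import math


def hellinger_split_value(pos_counts, neg_counts):
    """Hellinger distance between the branch distributions of the two classes.

    pos_counts[j] / neg_counts[j] are the numbers of positive / negative
    examples sent to branch j by a candidate split (hypothetical interface).
    """
    total_pos = sum(pos_counts)
    total_neg = sum(neg_counts)
    if total_pos == 0 or total_neg == 0:
        return 0.0  # one class is absent, so no split can separate anything
    return math.sqrt(sum(
        (math.sqrt(p / total_pos) - math.sqrt(n / total_neg)) ** 2
        for p, n in zip(pos_counts, neg_counts)
    ))


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(pos_counts, neg_counts):
    """Entropy-based split value, used here as a simplified stand-in for gain ratio."""
    total = sum(pos_counts) + sum(neg_counts)
    parent = entropy([sum(pos_counts), sum(neg_counts)])
    children = sum(
        ((p + n) / total) * entropy([p, n])
        for p, n in zip(pos_counts, neg_counts)
        if p + n > 0
    )
    return parent - children


# Same split, same per-class proportions; only the class ratio changes.
pos = [80, 20]
neg_balanced = [20, 80]        # 1:1 class ratio
neg_skewed = [2000, 8000]      # 1:100 class ratio

print("Hellinger, balanced:", hellinger_split_value(pos, neg_balanced))
print("Hellinger, skewed:  ", hellinger_split_value(pos, neg_skewed))
print("Info gain, balanced:", information_gain(pos, neg_balanced))
print("Info gain, skewed:  ", information_gain(pos, neg_skewed))
```

Because the Hellinger value depends only on within-class proportions, multiplying the size of one class leaves it unchanged, whereas the entropy-based value shifts with the class prior. In this form the value ranges from 0 (both classes distributed identically across branches) to the square root of 2 (perfect separation).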
