Article

Hellinger distance decision trees for PU learning in imbalanced data sets

Journal

MACHINE LEARNING
Volume -, Issue -, Pages -

Publisher

SPRINGER
DOI: 10.1007/s10994-023-06323-y

Keywords

PU Learning; Weakly supervised learning; Imbalanced classification; Ensemble learning


Learning from positive and unlabeled data (PU learning) is challenging when there is class imbalance. This paper proposes PU Hellinger Decision Tree (PU-HDT) to directly handle imbalanced PU data sets. Moreover, PU Stratified Hellinger Random Forest (PU-SHRF) is introduced as an ensemble method that outperforms existing PU learning methods for imbalanced data sets in most experimental settings.
Learning from positive and unlabeled data, or PU learning, is the setting in which a binary classifier can only train on positive and unlabeled instances, the latter containing both positive and negative instances. Many PU applications, e.g., fraud detection, are also characterized by class imbalance, which creates a challenging setting: not only are there fewer minority class examples than in the case where all labels are known, but only a small fraction of the unlabeled observations are actually positive. Despite the relevance of the topic, only a few studies have considered a class imbalance setting in PU learning. In this paper, we propose a novel technique that can directly handle imbalanced PU data, named the PU Hellinger Decision Tree (PU-HDT). Our technique exploits the class prior to estimate the counts of positives and negatives in every node of the tree. Moreover, the Hellinger distance is used instead of more conventional splitting criteria because it has been shown to be insensitive to class imbalance. This simple yet effective adaptation allows PU-HDT to perform well on highly imbalanced PU data sets. We also introduce the PU Stratified Hellinger Random Forest (PU-SHRF), which uses PU-HDT as its base learner and integrates stratified bootstrap sampling. Our empirical analysis shows that PU-SHRF substantially outperforms state-of-the-art PU learning methods for imbalanced data sets in most experimental settings.
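The abstract's core ideas can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the common SCAR setting with a known label frequency c = P(labeled | positive) to turn labeled/unlabeled counts into estimated positive/negative counts per node, scores candidate splits with the Hellinger distance between the branch-wise class distributions, and resamples labeled and unlabeled instances separately for the stratified bootstrap. All function names are illustrative.

```python
import numpy as np

def estimate_counts(n_labeled_pos, n_unlabeled, label_freq):
    # Under SCAR with known label frequency c = P(labeled | positive),
    # the expected total number of positives in a node is n_labeled / c;
    # the remainder of the node's instances are estimated negatives.
    n_pos = n_labeled_pos / label_freq
    n_neg = max(n_labeled_pos + n_unlabeled - n_pos, 0.0)
    return n_pos, n_neg

def hellinger_split_score(left_counts, right_counts):
    # Hellinger distance between the class-conditional distributions of the
    # two branches; higher is better, and it is insensitive to the class ratio.
    pL, nL = left_counts
    pR, nR = right_counts
    P, N = pL + pR, nL + nR
    if P == 0 or N == 0:
        return 0.0
    return np.sqrt((np.sqrt(pL / P) - np.sqrt(nL / N)) ** 2
                   + (np.sqrt(pR / P) - np.sqrt(nR / N)) ** 2)

def stratified_bootstrap(labeled_idx, unlabeled_idx, rng):
    # Resample labeled positives and unlabeled instances separately so each
    # tree's bootstrap preserves the original labeled/unlabeled proportions.
    boot_l = rng.choice(labeled_idx, size=len(labeled_idx), replace=True)
    boot_u = rng.choice(unlabeled_idx, size=len(unlabeled_idx), replace=True)
    return np.concatenate([boot_l, boot_u])
```

A perfectly separating split (all estimated positives in one branch, all estimated negatives in the other) attains the maximum score of sqrt(2), while a split that leaves both branches with identical class proportions scores 0.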
