☆ 4.5 Article

Quantifying counts and costs via classification

DATA MINING AND KNOWLEDGE DISCOVERY (2008)

Journal

DATA MINING AND KNOWLEDGE DISCOVERY

Volume 17, Issue 2, Pages 164-206

Publisher

SPRINGER

DOI: 10.1007/s10618-008-0097-y

Keywords

supervised machine learning; classification; prevalence estimation; class distribution estimation; quantification research methodology; detecting and tracking trends; concept drift; class imbalance; text mining

Ask authors/readers for more resources

Protocol

Community support

Reagent

Community support

Abstract

Many business applications track changes over time, for example, measuring the monthly prevalence of influenza incidents. In situations where a classifier is needed to identify the relevant incidents, imperfect classification accuracy can cause substantial bias in estimating class prevalence. The paper defines two research challenges for machine learning. The 'quantification' task is to accurately estimate the number of positive cases (or class distribution) in a test set, using a training set that may have a substantially different distribution. The 'cost quantification' variant estimates the total cost associated with the positive class, where each case is tagged with a cost attribute, such as the expense to resolve the case. Quantification has a very different utility model from traditional classification research. For both forms of quantification, the paper describes a variety of methods and evaluates them with a suitable methodology, revealing which methods give reliable estimates when training data is scarce, the testing class distribution differs widely from training, and the positive class is rare, e.g., 1% positives. These strengths can make quantification practical for business use, even where classification accuracy is poor.

Quantifying counts and costs via classification

Journal

DATA MINING AND KNOWLEDGE DISCOVERY

Publisher

SPRINGER

Keywords

Categories

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Quantifying counts and costs via classification

Journal

DATA MINING AND KNOWLEDGE DISCOVERY

Publisher

SPRINGER

Keywords

Categories

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Export Citation

Share Paper