Article

For real: a thorough look at numeric attributes in subgroup discovery

Journal

DATA MINING AND KNOWLEDGE DISCOVERY
Volume 35, Issue 1, Pages 158-212

Publisher

SPRINGER
DOI: 10.1007/s10618-020-00703-x

Keywords

Subgroup discovery; Supervised local pattern mining; Numeric attributes; Discretisation

Subgroup discovery is an exploratory pattern mining paradigm that is particularly useful for large real-world data with multiple attributes and data types. This paper presents a generic framework for dealing with numeric data and conducts experimental comparisons of a wide range of numeric strategies, organized according to four central dimensions. Results suggest that dynamic, fine-grained threshold determination with binary splits and consideration of multiple candidate thresholds per attribute is often the best approach.
Subgroup discovery (SD) is an exploratory pattern mining paradigm that comes into its own on large real-world data, which typically involves many attributes of mixed data types. The ability to deal with numeric attributes is essential, whether they concern the target (a regression setting) or the description attributes (by which subgroups are identified). Various specific algorithms have been proposed in the literature for both cases, but a systematic review of the available options is missing. This paper presents a generic framework that can be instantiated in various ways to create different strategies for dealing with numeric data. The bulk of the paper describes an experimental comparison of a considerable range of numeric strategies in SD, organised according to four central dimensions. The experiments are repeated for both the classification task (nominal target) and the regression task (numeric target), and the strategies are compared on the quality of the top subgroup and on the quality and redundancy of the top-k result set. Results of three search strategies are compared: traditional beam search, complete search, and a variant of diverse subgroup set discovery called cover-based subgroup selection. Although the outcomes of the experiments show various subtleties, the following general conclusions can be drawn: it is often best to determine numeric thresholds dynamically (locally), in a fine-grained manner, with binary splits, while considering multiple candidate thresholds per attribute.
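
To make the concluding recommendation concrete, the following is a minimal Python sketch of one way such a strategy could look for a single numeric description attribute and a binary target. It is an illustration under stated assumptions, not the framework from the paper: the function names (wracc, candidate_thresholds, refine_numeric) are hypothetical, weighted relative accuracy is assumed as the quality measure, and midpoints between consecutive distinct values are assumed as the fine-grained candidate set.

```python
from typing import List, Sequence, Tuple


def wracc(covered: Sequence[bool], target: Sequence[bool]) -> float:
    """Weighted relative accuracy (assumed quality measure):
    coverage * (target share in subgroup - target share in all data)."""
    n = len(target)
    n_cov = sum(covered)
    if n == 0 or n_cov == 0:
        return 0.0
    p_all = sum(target) / n
    p_sub = sum(1 for c, t in zip(covered, target) if c and t) / n_cov
    return (n_cov / n) * (p_sub - p_all)


def candidate_thresholds(values: Sequence[float]) -> List[float]:
    """Fine-grained candidates: midpoints between consecutive distinct values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def refine_numeric(
    values: Sequence[float],
    target: Sequence[bool],
    members: Sequence[bool],
    top: int = 3,
) -> List[Tuple[float, str, float]]:
    """Score binary splits 'x <= t' and 'x > t' inside the current subgroup.

    Dynamic (local): thresholds come only from values of current members.
    Multiple candidates: the `top` best (quality, operator, threshold)
    triples are returned instead of a single best split.
    """
    local_values = [v for v, m in zip(values, members) if m]
    results = []
    for t in candidate_thresholds(local_values):
        for op in ("<=", ">"):
            covered = [
                m and ((v <= t) if op == "<=" else (v > t))
                for v, m in zip(values, members)
            ]
            results.append((wracc(covered, target), op, t))
    # Stable sort: ties keep the order in which thresholds were generated.
    results.sort(key=lambda r: r[0], reverse=True)
    return results[:top]


if __name__ == "__main__":
    # Toy data (made up for illustration): target holds for large values.
    values = [1, 2, 3, 4, 5, 6, 7, 8]
    target = [False, False, False, False, True, True, True, True]
    everyone = [True] * len(values)
    for quality, op, threshold in refine_numeric(values, target, everyone):
        print(f"x {op} {threshold}: WRAcc = {quality:.4f}")
```

In a beam search, a routine like refine_numeric would be called again for every subgroup kept in the beam, with members set to that subgroup's cover, so the thresholds are recomputed on progressively smaller subsets. A static (global) strategy would instead compute the thresholds once on the full data before the search starts.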
