Article

Unsupervised interaction-preserving discretization of multivariate data

Journal

DATA MINING AND KNOWLEDGE DISCOVERY
Volume 28, Issue 5-6, Pages 1366-1397

Publisher

SPRINGER
DOI: 10.1007/s10618-014-0350-5

Keywords

Discretization; Interaction preservation; Pattern mining; Outlier mining; Classification

Funding

  1. German Research Foundation (DFG) [GRK 1194]
  2. YIG program of KIT as part of the German Excellence Initiative
  3. Cluster of Excellence Multimodal Computing and Interaction within the Excellence Initiative of the German Federal Government
  4. Research Foundation-Flanders (FWO)

Discretization is the transformation of continuous data into discrete bins. It is an important and general pre-processing technique, and a critical element of many data mining and data management tasks. The general goal is to obtain data that retains as much of the information in the continuous original as possible. In general, but in particular for exploratory tasks, a key open question is how to discretize multivariate data such that significant associations and patterns are preserved. That is exactly the problem we study in this paper. We propose IPD, an information-theoretic method for unsupervised discretization that focuses on preserving multivariate interactions. To this end, when discretizing a dimension, we consider the distribution of the data over all other dimensions. In particular, our method examines consecutive multivariate regions and combines them if (a) their multivariate data distributions are statistically similar, and (b) this merge reduces the MDL encoding cost. To assess the similarities, we propose a novel interaction distance that does not require assuming a distribution and permits computation in closed form. We give an efficient algorithm for finding the optimal bin merge, as well as a fast well-performing heuristic. Empirical evaluation through pattern-based compression, outlier mining, and classification shows that by preserving interactions we consistently outperform the state of the art in both quality and speed.
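
The merge criterion in the abstract (combine consecutive regions along one dimension when their distributions over the other dimensions are similar) can be illustrated with a minimal sketch. This is not the paper's IPD algorithm: the interaction distance is swapped for a per-dimension two-sample Kolmogorov-Smirnov statistic, the MDL cost check is replaced by a fixed threshold, and the merge order is a simple greedy choice. All names (`merge_bins`, `region_distance`, `n_micro_bins`, `threshold`) are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b))
        for v in a + b
    )


def region_distance(rows_a, rows_b, other_dims):
    """Stand-in distance (an assumption, NOT the paper's interaction distance):
    the largest KS statistic over the dimensions not being discretized."""
    return max(
        ks_statistic([r[d] for r in rows_a], [r[d] for r in rows_b])
        for d in other_dims
    )


def merge_bins(rows, dim, n_micro_bins=4, threshold=0.5):
    """Greedily merge consecutive bins along `dim` while the closest pair of
    neighbours is within `threshold`. Returns a list of bins (lists of rows)."""
    other_dims = [d for d in range(len(rows[0])) if d != dim]
    ordered = sorted(rows, key=lambda r: r[dim])
    size = len(ordered) // n_micro_bins  # equal-frequency micro bins
    bins = [ordered[i * size:(i + 1) * size] for i in range(n_micro_bins - 1)]
    bins.append(ordered[(n_micro_bins - 1) * size:])
    while len(bins) > 1:
        dists = [
            region_distance(bins[i], bins[i + 1], other_dims)
            for i in range(len(bins) - 1)
        ]
        i = min(range(len(dists)), key=dists.__getitem__)
        if dists[i] > threshold:  # fixed cut-off stands in for the MDL check
            break
        bins[i:i + 2] = [bins[i] + bins[i + 1]]
    return bins


# Toy data: the second dimension jumps from about 0 to about 10 at x = 0.5.
rows = [(i / 40, (0.0 if i < 20 else 10.0) + (i % 5) * 0.1) for i in range(40)]
bins = merge_bins(rows, dim=0)
print([len(b) for b in bins])  # prints [20, 20]
```

On this toy data the four initial bins along the first dimension collapse to two, with the single remaining cut point falling where the second dimension changes its distribution. A purely univariate equal-width or equal-frequency scheme would have no reason to place a cut there.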
