Journal
DATA MINING AND KNOWLEDGE DISCOVERY
Volume 28, Issue 5-6, Pages 1366-1397
Publisher
SPRINGER
DOI: 10.1007/s10618-014-0350-5
Keywords
Discretization; Interaction preservation; Pattern mining; Outlier mining; Classification
Funding
- German Research Foundation (DFG) [GRK 1194]
- YIG program of KIT as part of the German Excellence Initiative
- Cluster of Excellence Multimodal Computing and Interaction within the Excellence Initiative of the German Federal Government
- Research Foundation-Flanders (FWO)
Discretization is the transformation of continuous data into discrete bins. It is an important and general pre-processing technique, and a critical element of many data mining and data management tasks. The general goal is to obtain discrete data that retains as much of the information in the original continuous data as possible. In general, and for exploratory tasks in particular, a key open question is how to discretize multivariate data such that significant associations and patterns are preserved. That is exactly the problem we study in this paper. We propose IPD, an information-theoretic method for unsupervised discretization that focuses on preserving multivariate interactions. To this end, when discretizing a dimension, we consider the distribution of the data over all other dimensions. In particular, our method examines consecutive multivariate regions and merges them if (a) their multivariate data distributions are statistically similar, and (b) the merge reduces the MDL encoding cost. To assess the similarities, we propose a novel interaction distance that does not require assuming a distribution and permits computation in closed form. We give an efficient algorithm for finding the optimal bin merge, as well as a fast, well-performing heuristic. Empirical evaluation through pattern-based compression, outlier mining, and classification shows that by preserving interactions we consistently outperform the state of the art in both quality and speed.
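The merge step described in the abstract can be illustrated with a minimal sketch. This is not the IPD algorithm itself: it assumes only two dimensions, replaces the paper's interaction distance with the Kolmogorov-Smirnov statistic between empirical CDFs, and replaces the MDL criterion with a fixed threshold. The names `ecdf_distance`, `merge_bins`, `n_init`, and `threshold` are hypothetical and chosen for illustration only.

```python
from bisect import bisect_right


def ecdf_distance(a, b):
    """Largest gap between the empirical CDFs of a and b (the KS statistic).

    Stand-in for the paper's interaction distance, which is NOT reproduced here.
    """
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def merge_bins(rows, dim, other, n_init=4, threshold=0.5):
    """Greedily merge adjacent bins on dimension `dim`.

    Two neighbouring bins are merged when the distribution of dimension
    `other` inside them is similar. The fixed `threshold` is an assumption
    that stands in for the MDL cost test mentioned in the abstract.
    """
    rows = sorted(rows, key=lambda r: r[dim])
    size = max(1, len(rows) // n_init)
    # Start from fine-grained, roughly equal-frequency bins.
    bins = [rows[i:i + size] for i in range(0, len(rows), size)]
    merged = True
    while merged and len(bins) > 1:
        merged = False
        dists = [
            ecdf_distance(
                [r[other] for r in bins[i]],
                [r[other] for r in bins[i + 1]],
            )
            for i in range(len(bins) - 1)
        ]
        # Pick the most similar adjacent pair and merge it if similar enough.
        i = min(range(len(dists)), key=dists.__getitem__)
        if dists[i] < threshold:
            bins[i:i + 2] = [bins[i] + bins[i + 1]]
            merged = True
    # Report each final bin as (lowest, highest) value on `dim`.
    return [(b[0][dim], b[-1][dim]) for b in bins]


# Toy data: the second dimension changes regime once the first reaches 4.
rows = [(x, x % 2) for x in range(4)] + [(x, 10 + x % 2) for x in range(4, 8)]
print(merge_bins(rows, dim=0, other=1))  # [(0, 3), (4, 7)]
```

On this toy data the cut point at 4 survives because the second dimension behaves differently on either side of it, while bins within each regime are merged. That is the intuition behind preserving interactions, shown here under the simplifying assumptions listed above.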