Journal
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
Volume 17, Issue 9, Pages 1166-1173Publisher
IEEE COMPUTER SOC
DOI: 10.1109/TKDE.2005.135
Keywords
data mining; data preparation; discretization; entropy; heterogeneity
Ask authors/readers for more resources
Discretization, as a preprocessing step for data mining, is a process of converting the continuous attributes of a data set into discrete ones so that they can be treated as the nominal features by machine learning algorithms. Those various discretization methods, that use entropy-based criteria, form a large class of algorithm. However, as a measure of class homogeneity, entropy cannot always accurately reflect the degree of class homogeneity of an interval. Therefore, in this paper, we propose a new measure of class heterogeneity of intervals from the viewpoint of class probability itself, Based on the definition of heterogeneity, we present a new criterion to evaluate a discretization scheme and analyze its property theoretically. Also, a heuristic method is proposed to find the approximate optimal discretization scheme. Finally, our method is compared, in terms of predictive error rate and tree size, with Ent-MDLC, a representative entropy-based discretization method well-known for its good performance. Our method is shown to produce better results than those of Ent-MDLC, although the improvement is not significant. It can be a good alternative to entropy-based discretization methods.
Authors
I am an author on this paper
Click your name to claim this paper and add it to your profile.
Reviews
Recommended
No Data Available