4.7 Article

A fast density-based data stream clustering algorithm with cluster centers self-determined for mixed data

Journal

INFORMATION SCIENCES
Volume 345, Pages 271-293

Publisher

ELSEVIER SCIENCE INC
DOI: 10.1016/j.ins.2016.01.071

Keywords

Data mining; Mixed attributes; Data stream clustering; Peak field intensity; Mixed distance measure metrics

Funding

  1. National Natural Science Foundation of China [61502423]
  2. Zhejiang Provincial Natural Science Foundation [Y14F020092]

Most data streams encountered in real life consist of data objects with mixed numerical and categorical attributes. Most current data stream clustering algorithms suffer from low clustering quality, difficulty in determining cluster centers, and poor handling of outliers. This paper proposes a fast density-based data stream clustering algorithm in which cluster centers are determined automatically during the initialization stage. Based on an analysis of the relationships among data attributes, mixed data sets are classified into three types, and a corresponding distance measure is designed for each type. Based on a field intensity-distance distribution graph of the data objects, a linear regression model and residual analysis are used to find the outliers of the graph, which enables automatic determination of the cluster centers. Once the cluster centers are found, all data objects are clustered according to their distances to the centers. The data stream clustering algorithm adopts an online/offline two-stage processing framework and a new micro-cluster characteristic vector to maintain arriving data objects dynamically. A micro-cluster decay function and a micro-cluster deletion mechanism are used to maintain the micro-clusters, so that they accurately reflect the evolution of the data stream. Finally, the performance of the proposed algorithm is verified through a series of experiments on real-world mixed data sets, in comparison with several leading clustering algorithms, in terms of clustering purity, efficiency, and time complexity. © 2016 Elsevier Inc. All rights reserved.
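
The abstract does not give the form of the decay function or the deletion rule, so the following is only a minimal sketch of how micro-cluster decay and deletion could work in general. It assumes an exponential fading weight of the form 2^(-lam * dt), a convention common in density-based stream clustering but not confirmed for this paper. The names `MicroCluster`, `decay`, `absorb`, `prune`, `lam`, and `min_weight` are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class MicroCluster:
    """Hypothetical micro-cluster summary: only the fields needed for decay."""
    weight: float       # decayed count of absorbed data objects
    last_update: float  # timestamp of the most recent update


def decay(mc: MicroCluster, now: float, lam: float = 0.25) -> None:
    """Fade the weight by 2^(-lam * dt). The decay form is an assumption."""
    dt = now - mc.last_update
    mc.weight *= 2.0 ** (-lam * dt)
    mc.last_update = now


def absorb(mc: MicroCluster, now: float, lam: float = 0.25) -> None:
    """Decay the micro-cluster to the current time, then count one new object."""
    decay(mc, now, lam)
    mc.weight += 1.0


def prune(clusters, now: float, lam: float = 0.25, min_weight: float = 0.1):
    """Decay all micro-clusters and drop those whose weight fell below
    min_weight (an assumed deletion threshold)."""
    for mc in clusters:
        decay(mc, now, lam)
    return [mc for mc in clusters if mc.weight >= min_weight]
```

Under these assumptions, a micro-cluster that stops receiving data objects loses weight steadily and is eventually deleted, while one that keeps absorbing objects stays above the threshold. This is one way a summary structure can track the evolution of a stream.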
