Article

DOLPHIN: An Efficient Algorithm for Mining Distance-Based Outliers in Very Large Datasets

ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA (2009)

Journal

ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA

Volume 3, Issue 1, Pages -

Publisher

ASSOC COMPUTING MACHINERY

DOI: 10.1145/1497577.1497581

Keywords

Data mining; outlier detection; distance-based outliers

Categories

Computer Science, Information Systems; Computer Science, Software Engineering

Abstract

This work presents DOLPHIN, a novel distance-based outlier detection algorithm that works on disk-resident datasets and whose I/O cost corresponds to the cost of sequentially reading the input dataset file twice. It is shown both theoretically and empirically that DOLPHIN's main memory usage amounts to a small fraction of the dataset and that its running time is linear in the dataset size. DOLPHIN gains efficiency by merging three strategies into a unified schema: the policy for selecting which objects to keep in main memory, the use of pruning rules, and similarity search techniques. Importantly, the algorithm performs similarity search without having to index the whole dataset beforehand, as other methods must. The algorithm is simple to implement and can be used with any type of data, whether it belongs to a metric or a nonmetric space. Moreover, a modification to the basic method lets DOLPHIN handle the scenario in which the available main memory buffer is smaller than its standard requirements. In comparisons with state-of-the-art distance-based outlier detection algorithms, DOLPHIN proved to be much more efficient.
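The abstract does not spell out what a distance-based outlier is. The sketch below assumes the common definition, in which an object is an outlier if fewer than `k` other objects lie within distance `radius` of it. It is a naive in-memory nested-loop baseline, not the DOLPHIN algorithm: it only illustrates the simplest kind of pruning rule (stop counting once `k` neighbors are found) and accepts any distance function, metric or not. The names `distance_based_outliers`, `k`, `radius`, and `distance` are illustrative assumptions and do not come from the paper.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def distance_based_outliers(points, k, radius, distance=dist):
    """Return indices of points with fewer than k neighbors within radius.

    Naive quadratic baseline for illustration only; NOT DOLPHIN.
    `distance` may be any callable taking two objects, metric or nonmetric.
    """
    outliers = []
    for i, p in enumerate(points):
        neighbors = 0
        for j, q in enumerate(points):
            if i == j:
                continue  # an object is not its own neighbor
            if distance(p, q) <= radius:
                neighbors += 1
                if neighbors >= k:
                    break  # pruning: p is already proven to be an inlier
        if neighbors < k:
            outliers.append(i)
    return outliers


# Four tightly clustered points and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(distance_based_outliers(points, k=2, radius=0.5))  # [4]
```

This baseline compares every pair of objects in the worst case and keeps the whole dataset in memory. According to the abstract, those are the costs DOLPHIN reduces: it reads the dataset from disk twice and keeps only a small fraction of it in main memory.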