Article

Instance selection of linear complexity for big data

KNOWLEDGE-BASED SYSTEMS (2016)

Journal

KNOWLEDGE-BASED SYSTEMS

Volume 107, Issue -, Pages 83-95

Publisher

ELSEVIER

DOI: 10.1016/j.knosys.2016.05.056

Keywords

Nearest neighbor; Data reduction; Instance selection; Hashing; Big data

Category

Computer Science, Artificial Intelligence

Funding

Spanish Ministry of Economy and Competitiveness [TIN 2011-24046, TIN 2015-67534-P]

Abstract

Over recent decades, database sizes have grown considerably. Larger sizes present new challenges, because machine learning algorithms are not prepared to process such large volumes of information. Instance selection methods can alleviate this problem when the size of the data set is medium to large. However, even these methods face similar problems with very large-to-massive data sets. In this paper, two new algorithms with linear complexity for instance selection purposes are presented. Both algorithms use locality-sensitive hashing to find similarities between instances. While the complexity of conventional methods (usually quadratic, O(n^2), or log-linear, O(n log n)) means that they are unable to process large-sized data sets, the new proposal shows competitive results in terms of accuracy. Even more remarkably, it shortens execution time, as the proposal manages to reduce complexity and make it linear with respect to the data set size. The new proposal has been compared with some of the best-known instance selection methods for testing and has also been evaluated on large data sets (up to a million instances). © 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license.
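
To make the idea in the abstract concrete, the following is a minimal sketch of how locality-sensitive hashing can drive instance selection in a single pass over the data. It is an illustration under stated assumptions, not the authors' published algorithms: the hash family h(x) = floor((a·x + b) / w), the rule "keep the first instance of each class in each bucket", and all names and parameters (make_hash_functions, bucket_key, select_instances, n_functions, bucket_width) are hypothetical choices made for this example.

```python
import math
import random
from collections import defaultdict


def make_hash_functions(dim, n_functions, bucket_width, seed=0):
    """Draw random projections (a, b) for hashes h(x) = floor((a.x + b) / w).

    Assumption: a Gaussian random-projection hash family; the abstract does
    not say which family the paper uses.
    """
    rng = random.Random(seed)
    return [
        ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, bucket_width))
        for _ in range(n_functions)
    ]


def bucket_key(x, hash_functions, bucket_width):
    """Concatenate the individual hash values into one bucket identifier."""
    return tuple(
        math.floor((sum(ai * xi for ai, xi in zip(a, x)) + b) / bucket_width)
        for a, b in hash_functions
    )


def select_instances(X, y, n_functions=4, bucket_width=1.0, seed=0):
    """Single pass: keep the first instance of each class seen in each bucket.

    Each instance is hashed once with a fixed number of hash functions, so the
    work grows linearly with the number of instances.
    """
    hash_functions = make_hash_functions(len(X[0]), n_functions, bucket_width, seed)
    seen = defaultdict(set)  # bucket key -> class labels already represented
    selected = []
    for i, (x, label) in enumerate(zip(X, y)):
        key = bucket_key(x, hash_functions, bucket_width)
        if label not in seen[key]:
            seen[key].add(label)
            selected.append(i)
    return selected


if __name__ == "__main__":
    # Hypothetical toy data: two Gaussian blobs, one per class.
    rng = random.Random(1)
    X = [[rng.gauss(5 * c, 0.3), rng.gauss(5 * c, 0.3)] for c in (0, 1) for _ in range(500)]
    y = [c for c in (0, 1) for _ in range(500)]
    kept = select_instances(X, y)
    print(f"kept {len(kept)} of {len(X)} instances")
```

Because instances that fall into the same bucket are likely to be close to one another, one representative per class and bucket keeps the class boundaries roughly intact while discarding near-duplicates. In this sketch, a larger bucket_width or fewer hash functions produces coarser buckets and therefore a stronger reduction.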
