Article

Distributed nearest neighbor classification for large-scale multi-label data on spark

Publisher

ELSEVIER
DOI: 10.1016/j.future.2018.04.094

Keywords

Apache Spark; MapReduce; Distributed computing; Big data; Multi-label classification; Nearest neighbors

Funding

  1. VCU Research Support Funds, USA
  2. Spanish Ministry of Economy and Competitiveness, Spain
  3. European Regional Development Fund, Belgium [TIN2017-83445-P]


Modern data is characterized by its ever-increasing volume and complexity, particularly when data instances belong to many categories simultaneously. This learning paradigm is known as multi-label classification, and one of its most renowned methods is the multi-label k nearest neighbor (ML-KNN). Traditional implementations of this method are not feasible for large-scale multi-label data due to their computational complexity and memory requirements. We propose a distributed ML-KNN implementation based on the MapReduce programming model, implemented on Apache Spark. We compare three strategies for distributed nearest neighbor search: (1) iteratively broadcasting instances, (2) using a distributed tree-based index structure, and (3) building hash tables to group instances. The experimental study evaluates the trade-off between prediction quality and runtime on 22 benchmark datasets, and compares scalability across different data sizes. The results indicate that the tree-based index strategy outperforms the other approaches, achieving a speedup of up to 266x on the largest dataset while maintaining accuracy equivalent to the exact methods. This strategy enables ML-KNN to scale efficiently with the size of the problem. (C) 2018 Elsevier B.V. All rights reserved.
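The abstract describes ML-KNN, which classifies a query by examining the label sets of its k nearest training instances. The following is a minimal single-machine sketch of that neighbor-voting idea, not the paper's distributed Spark implementation; the helper names and the Laplace smoothing constant `s` (a simplification of ML-KNN's full Bayesian MAP rule) are illustrative assumptions.

```python
import math

def knn_indices(train_X, query, k):
    """Return indices of the k training points nearest to query (Euclidean)."""
    dists = sorted((math.dist(query, x), i) for i, x in enumerate(train_X))
    return [i for _, i in dists[:k]]

def mlknn_predict(train_X, train_Y, query, k, s=1.0):
    """Predict a label set for query: a label is kept when a smoothed
    majority of the k nearest neighbors carry it (simplified ML-KNN)."""
    neigh = knn_indices(train_X, query, k)
    candidate_labels = {lbl for i in neigh for lbl in train_Y[i]}
    predicted = set()
    for lbl in candidate_labels:
        count = sum(1 for i in neigh if lbl in train_Y[i])
        # Laplace-smoothed vote: keep the label if more than half
        # of the neighbors (after smoothing) are positive for it.
        if (count + s) / (k + 2 * s) > 0.5:
            predicted.add(lbl)
    return predicted

# Toy multi-label data: each instance has a feature vector and a label set.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_Y = [{"a"}, {"a"}, {"a", "b"}, {"c"}, {"c"}, {"b", "c"}]
print(mlknn_predict(train_X, train_Y, (0.2, 0.2), k=3))  # -> {'a'}
```

In the paper's setting, the expensive step is `knn_indices` over millions of instances; the three strategies compared (broadcasting, a distributed tree index, and hash-based grouping) are different ways of distributing exactly that search across a Spark cluster.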

