Article

Random Partition Based Adaptive Distributed Kernelized SVM for Big Data

Journal

IEEE ACCESS
Volume 10, Pages 95623-95637

Publisher

IEEE (Institute of Electrical and Electronics Engineers)
DOI: 10.1109/ACCESS.2022.3204114

Keywords

Support vector machines; Distributed databases; Training data; Big data; Data models; Optimization; Learning systems; Distributed processing; Storage management; Classification algorithms; Distributed learning; Large datasets; SVM; Classification; Distributed storage

Abstract
In this paper, we present a distributed classification technique for big data that efficiently uses the distributed storage architecture and data processing units of a cluster. When handling such large data, existing approaches rely on specific data partitioning techniques that demand the complete data be processed before partitioning. This leads to an excessive overhead of computation and data communication. The proposed method does not require any pre-structured data partitioning technique and is also adaptive to big data mining tools. We hypothesize that an effective aggregation of the information generated from data partitions by subprocesses of the complete learning process can lead to accurate prediction results while reducing the overall time complexity. We build three SVM-based classifiers, namely one-phase voting SVM (1PVSVM), two-phase voting SVM (2PVSVM), and similarity-based SVM (SIMSVM). Each of these classifiers utilizes the support vectors as the local information to construct the synthesized learner, efficiently reducing the training time and ensuring minimal communication between processing units. An extensive empirical analysis demonstrates the effectiveness of our classifiers compared to other existing approaches on several benchmark datasets. Among the existing methods and our three proposed methods (1PVSVM, 2PVSVM, and SIMSVM), SIMSVM is the most efficient. On the MNIST dataset, SIMSVM achieves an average speedup ratio of 0.78 and a minimum scalability of 73% when the data size is scaled up to 10 times, while retaining high accuracy (99%) similar to centralized approaches.

