4.4 Article

Time-Aware Data Partition Optimization and Heterogeneous Task Scheduling Strategies in Spark Clusters

Journal

COMPUTER JOURNAL
Volume -, Issue -, Pages -

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/comjnl/bxad017

Keywords

balanced partition of data; efficient scheduling; cache management; Spark

Ask authors/readers for more resources

This article proposes an efficient strategy to address the problems of data partitioning and job scheduling in the Spark framework. It introduces a balanced partitioning strategy for intermediate data and a task scheduling algorithm using a greedy strategy. Experimental results show that these strategies effectively improve system task processing efficiency and decrease overall task completion time.
The Spark computing framework provides an efficient solution to address the major requirements of big data processing, but data partitioning and job scheduling in the Spark framework are the two major bottlenecks that limit Spark's performance. In the Spark Shuffle phase, the data skewing problem caused by unbalanced data partitioning leads to the problem of increased job completion time. In response to the above problems, a balanced partitioning strategy for intermediate data is proposed in this article, which considers the characteristics of intermediate data, establishes a data skewing model and proposes a dynamic partitioning algorithm. In Spark heterogeneous clusters, because of the differences in node performance and task requirements, the default task scheduling algorithm cannot complete scheduling efficiently, which leads to low system task processing efficiency. In order to deal with the above problems, an efficient job scheduling strategy is proposed in this article, which integrates node performance and task requirements, and proposes a task scheduling algorithm using greedy strategy. The experimental results prove that the dynamic partitioning algorithm for intermediate data proposed in this article effectively alleviates the problem that data skew leads to the decrease of system task processing efficiency and shortens the overall task completion time. The efficient job scheduling strategy proposed in this article can efficiently complete the job scheduling tasks under heterogeneous clusters, allocate jobs to nodes in a balanced manner, decrease the overall job completion time and increase the system resource utilization.

Authors

I am an author on this paper
Click your name to claim this paper and add it to your profile.

Reviews

Primary Rating

4.4
Not enough ratings

Secondary Ratings

Novelty
-
Significance
-
Scientific rigor
-
Rate this paper

Recommended

No Data Available
No Data Available