Article

Performance and Cost-Efficient Spark Job Scheduling Based on Deep Reinforcement Learning in Cloud Computing Environments

Journal

IEEE Transactions on Parallel and Distributed Systems

Publisher

IEEE Computer Society

DOI: 10.1109/TPDS.2021.3124670

Keywords

Spark; Cloud computing; Costs; Task analysis; Service level agreements; Big data; Reinforcement learning; cost-efficiency; performance improvement; deep reinforcement learning

Funding

  1. Australian Research Council (ARC)


This article formulates the job scheduling problem of a cloud-deployed Spark cluster and proposes a deep reinforcement learning (DRL) model as a solution. The proposed DRL-based scheduler considers multiple objectives and learns the characteristics of different types of jobs to reduce the total cost.
Big data frameworks such as Spark and Hadoop are widely adopted to run analytics jobs in both research and industry. The cloud offers affordable compute resources that are easier to manage; hence, many organizations are shifting their big data computing clusters to cloud deployments. However, job scheduling is a complex problem in the presence of multiple Service Level Agreement (SLA) objectives, such as monetary cost reduction and job performance improvement. Most existing research does not address multiple objectives together and fails to capture the inherent cluster and workload characteristics. In this article, we formulate the job scheduling problem of a cloud-deployed Spark cluster and propose a novel Reinforcement Learning (RL) model to accommodate the SLA objectives. We develop the RL cluster environment and implement two Deep Reinforcement Learning (DRL) based schedulers in the TF-Agents framework. The proposed DRL-based scheduling agents work at a fine-grained level, placing the executors of jobs while leveraging the pricing model of cloud VM instances. In addition, the DRL-based agents can learn the inherent characteristics of different types of jobs to find placements that reduce both the total cluster VM usage cost and the average job duration. The results show that the proposed DRL-based algorithms can reduce the VM usage cost by up to 30%.
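The executor-placement formulation described in the abstract can be illustrated with a toy RL-style environment. The following is a minimal sketch under assumed parameters (VM count, executor slots per VM, a flat per-VM price) and is not the authors' actual model, reward function, or TF-Agents implementation; it only shows why, under a per-VM pricing model, a cost-aware placement policy learns to pack executors onto fewer billed VMs.

```python
class SparkSchedulingEnv:
    """Toy cluster: place one executor per step onto one of N VMs.

    State  : remaining executor slots on each VM.
    Action : index of the VM that receives the next executor.
    Reward : negative cost; a VM is billed only once it hosts at least one
             executor, so packing executors onto fewer VMs lowers total cost.
    (All sizes and prices below are illustrative assumptions.)
    """

    def __init__(self, num_vms=4, slots_per_vm=4, price_per_vm=1.0):
        self.num_vms = num_vms
        self.slots_per_vm = slots_per_vm
        self.price_per_vm = price_per_vm
        self.reset()

    def reset(self):
        self.free_slots = [self.slots_per_vm] * self.num_vms
        self.used_vms = set()
        return tuple(self.free_slots)

    def step(self, action):
        if self.free_slots[action] == 0:          # invalid placement: full VM
            return tuple(self.free_slots), -10.0, False
        self.free_slots[action] -= 1
        newly_billed = action not in self.used_vms
        self.used_vms.add(action)
        # Pay the VM price only when a new VM is switched on.
        reward = -self.price_per_vm if newly_billed else 0.0
        done = sum(self.free_slots) == 0
        return tuple(self.free_slots), reward, done


def schedule(env, policy, num_executors):
    """Place num_executors executors using policy; return total cost."""
    env.reset()
    cost = 0.0
    for _ in range(num_executors):
        _, reward, _ = env.step(policy(env))
        cost -= reward
    return cost


# Two hand-written baselines a learned agent could be compared against:
# "spread" places each executor on the emptiest VM (performance-oriented),
# "pack" places it on the fullest VM that still has a slot (cost-oriented).
def spread_policy(env):
    return max(range(env.num_vms), key=lambda i: env.free_slots[i])


def pack_policy(env):
    candidates = [i for i in range(env.num_vms) if env.free_slots[i] > 0]
    return min(candidates, key=lambda i: env.free_slots[i])
```

With 8 executors on four 4-slot VMs, the packing baseline bills two VMs while the spreading baseline bills all four, mirroring the cost/performance trade-off the paper's multi-objective agents are trained to balance.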
