4.7 Article

RLQ: Workload Allocation With Reinforcement Learning in Distributed Queues

Journal

IEEE Transactions on Parallel and Distributed Systems

Publisher

IEEE Computer Society
DOI: 10.1109/TPDS.2022.3231981

Keywords

Task analysis; Resource management; Costs; Reinforcement learning; Hardware; Prediction algorithms; Decision making; Distributed task queuing; Task allocation


Distributed workload queues are widely used due to their advantages in decoupling, resilience, and scaling. However, existing task allocation strategies may result in high execution times and costs when task information is unavailable and worker node capabilities are not homogeneous. In this work, we propose RLQ, a reinforcement learning-based task allocation solution, which achieves significant improvements in execution cost, time, and waiting time compared to traditional solutions.
Distributed workload queues are widely used due to their significant advantages in decoupling, resilience, and scaling. Task allocation to worker nodes in distributed queue systems is typically simplistic (e.g., Least Recently Used) or relies on hand-crafted heuristics that require task-specific information (e.g., task resource demands or expected execution time). When such task information is not available and worker node capabilities are not homogeneous, existing placement strategies may lead to unnecessarily long execution times and high usage costs. In this work, we formulate the task allocation problem in the Markov Decision Process framework, in which an agent assigns each task to an available resource and receives a numerical reward signal upon task completion. Our adaptive, learning-based task allocation solution, Reinforcement Learning based Queues (RLQ), is implemented and integrated with the popular Celery task queuing system for Python. We compare RLQ against traditional solutions using both synthetic and real workload traces. On average, using synthetic workloads, RLQ reduces the execution cost by approximately 70%, the execution time by at least 3x, and the waiting time by almost 7x. Using real traces, we observe an improvement of about 20% in execution cost, around 70% in execution time, and a reduction of approximately 20x in waiting time. We also compare RLQ with a strategy inspired by E-PVM, a state-of-the-art solution used in Google's Borg cluster manager, and show that RLQ outperforms it in five out of six scenarios.
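The abstract describes casting task allocation as a Markov Decision Process: an agent observes the state of the worker pool, assigns each incoming task to a worker, and receives a reward when the task completes. The following is only a minimal illustrative sketch of that kind of formulation, not the authors' RLQ implementation; the worker speeds, backlog-based state discretisation, and negative-completion-time reward are assumptions made up for the example.

import random
from collections import defaultdict

class AllocationAgent:
    """Illustrative epsilon-greedy, tabular Q-learning agent that assigns
    each incoming task to one of several worker queues. The state is a
    coarse summary of per-worker backlog; the reward is the negative
    completion cost reported when a task finishes. Hypothetical names,
    not the authors' RLQ code."""

    def __init__(self, n_workers, epsilon=0.1, alpha=0.5, gamma=0.9):
        self.n_workers = n_workers
        self.epsilon = epsilon      # exploration rate
        self.alpha = alpha          # learning rate
        self.gamma = gamma          # discount factor
        self.q = defaultdict(lambda: [0.0] * n_workers)

    def _state(self, backlogs):
        # Discretise each worker's backlog into a few buckets to keep the
        # tabular state space small.
        return tuple(min(b // 5, 4) for b in backlogs)

    def select_worker(self, backlogs):
        # Epsilon-greedy action selection over the per-worker Q-values.
        state = self._state(backlogs)
        if random.random() < self.epsilon:
            return state, random.randrange(self.n_workers)
        values = self.q[state]
        return state, values.index(max(values))

    def update(self, state, worker, reward, next_backlogs):
        # One-step Q-learning update, applied once the task completes and
        # its cost/latency becomes known.
        next_state = self._state(next_backlogs)
        best_next = max(self.q[next_state])
        td_target = reward + self.gamma * best_next
        self.q[state][worker] += self.alpha * (td_target - self.q[state][worker])


# Toy usage on a simulator with three heterogeneous workers; the reward is
# the negative of a simulated completion time.
if __name__ == "__main__":
    speeds = [1.0, 2.0, 4.0]                   # hypothetical worker capabilities
    agent = AllocationAgent(n_workers=3)
    backlogs = [0, 0, 0]
    for _ in range(10_000):
        state, w = agent.select_worker(backlogs)
        completion_time = (backlogs[w] + 1) / speeds[w]
        backlogs[w] += 1
        agent.update(state, w, reward=-completion_time, next_backlogs=backlogs)
        backlogs[w] = max(0, backlogs[w] - 1)  # the assigned task eventually drains
    print(agent.q[(0, 0, 0)])                  # learned preference for faster workers

In RLQ itself the agent is integrated with the Celery task queuing system and the reward is derived from observed execution cost and time; the sketch above only illustrates the state-action-reward loop on a toy simulator.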

