Article

Toward Efficient Online Scheduling for Distributed Machine Learning Systems

Journal

IEEE Transactions on Network Science and Engineering

Publisher

IEEE Computer Society
DOI: 10.1109/TNSE.2021.3104513

Keywords

Servers; Training; Optimization; Scheduling algorithms; Resource management; Approximation algorithms; Heuristic algorithms; Online resource scheduling; distributed machine learning; approximation algorithm

Funding

  1. NSF [CNS-2110259, CNS-2102233, CCF-2110252, ECCS-2140277, CNS-2112694, HKU-17204619, HKU-17208920]
  2. Google


Summary

The rapid growth of distributed machine learning frameworks has posed technical challenges in computing system design and optimization. This paper proposes an online scheduling algorithm that jointly optimizes resource allocation and locality decisions, with performance guarantees established through approximation algorithm design and analysis.

Abstract

Recent years have witnessed a rapid growth of distributed machine learning (ML) frameworks, which exploit the massive parallelism of computing clusters to expedite ML training. However, the proliferation of distributed ML frameworks also introduces many unique technical challenges in computing system design and optimization. In a networked computing cluster that supports a large number of training jobs, a key question is how to design efficient scheduling algorithms to allocate workers and parameter servers across different machines to minimize the overall training time. Toward this end, in this paper, we develop an online scheduling algorithm that jointly optimizes resource allocation and locality decisions. Our main contributions are threefold: i) we develop a new analytical model that considers both resource allocation and locality; ii) based on an equivalent reformulation and observations on the worker-parameter server locality configurations, we transform the problem into a mixed packing and covering integer program, which enables approximation algorithm design; iii) we propose a carefully designed approximation algorithm based on randomized rounding and rigorously analyze its performance. Collectively, our results contribute to the state of the art of distributed ML system optimization and algorithm design.
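The abstract's key algorithmic idea is to cast the scheduling problem as a mixed packing and covering integer program and then apply randomized rounding. As a minimal illustration of that general technique (not the paper's actual formulation; the toy cost vector, constraint matrices, and SciPy solver below are all assumptions for the sketch), one can solve the LP relaxation and round each variable to 1 with probability equal to its fractional value:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

# Toy mixed packing-and-covering program (hypothetical data, not from the paper):
#   minimize  c^T x
#   s.t.      A x <= b   (packing rows, e.g. machine resource capacities)
#             C x >= d   (covering rows, e.g. worker demand must be met)
#             x in {0, 1}
c = np.array([1.0, 2.0, 1.5, 1.0])
A = np.array([[1, 1, 0, 1],
              [0, 1, 1, 1]], dtype=float)   # packing constraints
b = np.array([2.0, 2.0])
C = np.array([[1, 0, 1, 1]], dtype=float)   # covering constraint
d = np.array([2.0])

# Step 1: solve the LP relaxation with x in [0, 1].
# Covering rows C x >= d are folded into <= form as -C x <= -d.
res = linprog(c,
              A_ub=np.vstack([A, -C]),
              b_ub=np.concatenate([b, -d]),
              bounds=[(0, 1)] * len(c),
              method="highs")
x_frac = res.x

# Step 2: randomized rounding -- set x_i = 1 with probability x_i*.
# In expectation the rounded solution matches the LP cost and satisfies
# each constraint; concentration bounds control the violation probability.
x_int = (rng.random(len(c)) < x_frac).astype(int)

print("fractional:", np.round(x_frac, 3))
print("rounded:   ", x_int)
```

The paper's actual algorithm is more involved (it handles online arrivals and worker-parameter server locality), but this two-step pattern, relax then round with the fractional values as probabilities, is the standard core that its analysis builds on.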

