4.7 Article

DeepBoot: Dynamic Scheduling System for Training and Inference Deep Learning Tasks in GPU Cluster

Related References

Note: Only part of the references are listed here; download the original article for the complete reference information.
Article (Computer Science, Theory & Methods)

Elastic Deep Learning in Multi-Tenant GPU Clusters

Yidi Wu et al.

Summary: This study focuses on dynamically adjusting the parallelism of deep neural network training in GPU clusters. The proposed EDL system enables elastic deep learning through a simple API and incorporates techniques that reduce the overhead of adjusting a job's parallelism. Experiments demonstrate that EDL brings significant benefits to GPU cluster management.

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS (2022)

Article (Computer Science, Theory & Methods)

Elastic Parameter Server: Accelerating ML Training With Scalable Resource Scheduling

Shaoqi Wang et al.

Summary: This article introduces Elastic Parameter Server (EPS), a new approach to distributed machine learning training. EPS dynamically adjusts the number of workers and servers in a job, yielding faster training and improved resource utilization. Experimental results show that EPS achieves a 1.5x improvement in training speed over the traditional Parameter Server (PS) model. (A schematic sketch of elastic worker scaling follows this entry.)

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS (2022)
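To make the elastic-scaling idea concrete, below is a minimal, hypothetical sketch of a parameter-server training loop whose worker pool is resized between iterations. It is not the EPS system from the paper; the toy linear model, the `ParameterServer` class, and all parameter values are illustrative assumptions.

```python
# Illustrative sketch only: a toy parameter-server loop whose worker pool can be
# resized between iterations, in the spirit of elastic training. This is NOT the
# EPS implementation from the paper; the model and all names are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

class ParameterServer:
    """Holds the model parameters and applies averaged gradients."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def apply(self, grads):
        # Average the gradients pushed by the current (possibly resized) worker set.
        self.w -= self.lr * np.mean(grads, axis=0)

def worker_gradient(w, batch_x, batch_y):
    # Gradient of mean squared error for a linear model y = x @ w.
    pred = batch_x @ w
    return 2.0 * batch_x.T @ (pred - batch_y) / len(batch_y)

def make_batch(n=32, dim=8):
    x = rng.normal(size=(n, dim))
    true_w = np.arange(dim, dtype=float)
    return x, x @ true_w + 0.01 * rng.normal(size=n)

ps = ParameterServer(dim=8)
num_workers = 2
for step in range(200):
    # Elasticity: the scheduler may add or remove workers between iterations.
    if step == 50:
        num_workers = 4   # scale out when spare GPUs become available
    if step == 150:
        num_workers = 3   # scale in when resources are reclaimed
    grads = []
    for _ in range(num_workers):
        x, y = make_batch()
        grads.append(worker_gradient(ps.w, x, y))
    ps.apply(grads)

print("learned weights:", np.round(ps.w, 2))
```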

Proceedings Paper (Computer Science, Hardware & Architecture)

Distributed Inference with Deep Learning Models across Heterogeneous Edge Devices

Chenghao Hu et al.

Summary: This paper presents EdgeFlow, a distributed inference mechanism designed for deep learning models with general DAG structures. Using a new progressive model-partitioning algorithm, EdgeFlow splits the model layers into independent execution units, improving the runtime performance of distributed inference. Experimental results show that EdgeFlow reduces inference latency by up to 40.2% compared to other approaches. (A schematic sketch of layer partitioning across heterogeneous devices follows this entry.)

IEEE CONFERENCE ON COMPUTER COMMUNICATIONS (IEEE INFOCOM 2022) (2022)
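As a rough illustration of partitioning model layers across heterogeneous devices, the sketch below greedily assigns a chain of layers to devices in proportion to their relative speed. It is a generic illustration under assumed per-layer costs and device speeds, not EdgeFlow's progressive partitioning algorithm for general DAGs.

```python
# Illustrative sketch only: splitting a chain of model layers across heterogeneous
# devices so each device receives a share of work proportional to its speed.
# Per-layer costs, device names, and speeds below are made-up assumptions.
layer_cost_ms = [4.0, 8.0, 8.0, 2.0, 6.0, 12.0, 4.0]        # estimated per-layer latency
device_speed  = {"pi4": 1.0, "jetson": 2.5, "laptop": 4.0}   # relative throughput

def partition_chain(costs, speeds):
    """Greedily assign consecutive layers to devices in proportion to their speed."""
    total, total_speed = sum(costs), sum(speeds.values())
    assignment, start = {}, 0
    devices = list(speeds.items())
    for i, (name, speed) in enumerate(devices):
        if i == len(devices) - 1:
            assignment[name] = list(range(start, len(costs)))  # remainder goes last
            break
        budget = total * speed / total_speed   # this device's share of total work
        taken, acc = [], 0.0
        while start < len(costs) and acc + costs[start] <= budget:
            acc += costs[start]
            taken.append(start)
            start += 1
        assignment[name] = taken
    return assignment

plan = partition_chain(layer_cost_ms, device_speed)
for dev, layers in plan.items():
    local_ms = sum(layer_cost_ms[i] for i in layers) / device_speed[dev]
    print(f"{dev}: layers {layers}, ~{local_ms:.1f} ms of local compute")
```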

Proceedings Paper (Computer Science, Information Systems)

CHRONUS: A Novel Deadline-aware Scheduler for Deep Learning Training Jobs

Wei Gao et al.

Summary: Modern GPU clusters run distributed deep learning training (DLT) jobs, and job scheduling is crucial for performance, resource utilization, and fairness. Chronus is an end-to-end scheduling system that provides deadline guarantees for SLO jobs and optimizes the performance of best-effort jobs by exploiting unique features of DLT jobs. Large-scale simulations demonstrate that Chronus significantly reduces deadline miss rates and completion times compared to existing schedulers. (A schematic sketch of deadline-aware job ordering follows this entry.)

PROCEEDINGS OF THE 2021 ACM SYMPOSIUM ON CLOUD COMPUTING (SOCC '21) (2021)
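The sketch below illustrates deadline-aware ordering in a least-slack-first spirit: SLO jobs with deadlines are admitted before best-effort jobs. It is a generic illustration with made-up job names and numbers, not the Chronus system itself.

```python
# Illustrative sketch only: a least-slack-first admission loop that gives GPUs to
# SLO jobs with deadlines before best-effort jobs. Generic illustration of
# deadline-aware scheduling; job names and numbers are hypothetical.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Job:
    name: str
    gpus: int                                 # GPUs requested
    remaining_hours: float                    # estimated remaining run time
    deadline_hours: Optional[float] = None    # None => best-effort job

def schedule(jobs, free_gpus):
    """Return the jobs to run now, preferring SLO jobs with the least slack."""
    slo = [j for j in jobs if j.deadline_hours is not None]
    best_effort = [j for j in jobs if j.deadline_hours is None]
    # Least slack (deadline minus remaining work) first, i.e. most urgent first.
    slo.sort(key=lambda j: j.deadline_hours - j.remaining_hours)
    running = []
    for job in slo + best_effort:
        if job.gpus <= free_gpus:
            running.append(job)
            free_gpus -= job.gpus
    return running

jobs = [
    Job("bert-finetune", gpus=4, remaining_hours=3.0, deadline_hours=4.0),
    Job("resnet-sweep",  gpus=8, remaining_hours=6.0),                # best-effort
    Job("gnn-train",     gpus=2, remaining_hours=1.0, deadline_hours=8.0),
]
for j in schedule(jobs, free_gpus=12):
    print("run:", j.name)
```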

Proceedings Paper (Computer Science, Information Systems)

Heterogeneity-Aware Distributed Machine Learning Training via Partial Reduce

Xupeng Miao et al.

Summary: The paper introduces partial-reduce, a novel variant of All-reduce that improves robustness and performance in heterogeneous environments by decomposing the synchronous operation into parallel, asynchronous partial-reduce operations over subsets of workers, while retaining a sub-linear convergence rate similar to distributed SGD. (A schematic simulation of the idea follows this entry.)

SIGMOD '21: PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA (2021)
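The toy simulation below conveys the intuition behind partial reduce: instead of blocking a synchronous All-reduce on every worker, each update averages gradients from whichever quorum of workers finishes first. It is a NumPy simulation under assumed worker timings and a toy objective, not the system or API from the paper.

```python
# Illustrative sketch only: simulating partial reduce, where the gradient average
# is taken over whichever workers are ready instead of blocking on all of them.
import numpy as np

rng = np.random.default_rng(1)
num_workers, dim = 8, 4
params = np.zeros(dim)
lr = 0.1

def local_gradient(w):
    # Toy quadratic objective ||w - 1||^2 plus per-worker noise (data heterogeneity).
    return 2.0 * (w - 1.0) + 0.1 * rng.normal(size=w.shape)

for step in range(100):
    # Heterogeneity: each worker finishes its mini-batch at a random time.
    finish_time = rng.exponential(scale=1.0, size=num_workers)
    # A full All-reduce would wait for max(finish_time); partial reduce proceeds as
    # soon as a quorum (here, half of the workers) has produced gradients.
    quorum = num_workers // 2
    ready = np.argsort(finish_time)[:quorum]
    grads = np.stack([local_gradient(params) for _ in ready])
    params -= lr * grads.mean(axis=0)

print("params after partial-reduce updates:", np.round(params, 2))
```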

Proceedings Paper (Computer Science, Hardware & Architecture)

RubberBand: Cloud-based Hyperparameter Tuning

Ujval Misra et al.

Summary: Hyperparameter tuning is essential for achieving state-of-the-art accuracy in machine learning. RubberBand is a framework developed to efficiently and elastically execute hyperparameter tuning jobs in the cloud, reducing costs by up to 2x compared to static allocation baselines.

PROCEEDINGS OF THE SIXTEENTH EUROPEAN CONFERENCE ON COMPUTER SYSTEMS (EUROSYS '21) (2021)

Proceedings Paper (Computer Science, Hardware & Architecture)

Elan: Towards Generic and Efficient Elastic Training for Deep Learning

Lei Xie et al.

2020 IEEE 40TH INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS (ICDCS) (2020)

Proceedings Paper (Computer Science, Hardware & Architecture)

Distributed Inference Acceleration with Adaptive DNN Partitioning and Offloading

Thaha Mohammed et al.

IEEE INFOCOM 2020 - IEEE CONFERENCE ON COMPUTER COMMUNICATIONS (2020)

Proceedings Paper (Computer Science, Information Systems)

Neural Collaborative Filtering

Xiangnan He et al.

PROCEEDINGS OF THE 26TH INTERNATIONAL CONFERENCE ON WORLD WIDE WEB (WWW'17) (2017)

Article (Computer Science, Hardware & Architecture)

Borg, Omega, and Kubernetes

Brendan Burns et al.

COMMUNICATIONS OF THE ACM (2016)

Article (Computer Science, Artificial Intelligence)

The MovieLens Datasets: History and Context

F. Maxwell Harper et al.

ACM TRANSACTIONS ON INTERACTIVE INTELLIGENT SYSTEMS (2016)

Article (Computer Science, Artificial Intelligence)

A fast and elitist multiobjective genetic algorithm: NSGA-II

K. Deb et al.

IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION (2002)