4.7 Article

DeepBoot: Dynamic Scheduling System for Training and Inference Deep Learning Tasks in GPU Cluster

Related References

Note: Only part of the references are listed here; download the original article for the complete reference information.
Article (Computer Science, Theory & Methods)

Elastic Deep Learning in Multi-Tenant GPU Clusters

Yidi Wu et al.

Summary: This study focuses on dynamically adjusting the parallelism of deep neural network training in GPU clusters. The proposed EDL system enables elastic deep learning through a simple API and incorporates techniques that reduce the overhead of adjusting a job's parallelism. Experiments demonstrate that EDL brings significant benefits to GPU cluster management.

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS (2022)

Article (Computer Science, Theory & Methods)

Elastic Parameter Server: Accelerating ML Training With Scalable Resource Scheduling

Shaoqi Wang et al.

Summary: This article introduces Elastic Parameter Server (EPS), a new approach to distributed machine learning training. EPS dynamically adjusts the number of workers and servers in a job, yielding faster training and improved resource utilization. Experimental results show that EPS achieves a 1.5x improvement in training speed over the traditional Parameter Server (PS) model. (A schematic sketch of elastic worker scaling follows this entry.)

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS (2022)
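To make the elastic-scaling idea concrete, below is a minimal, hypothetical sketch of a parameter-server training loop whose worker pool is resized between iterations. It is not the EPS system from the paper; the toy linear model, the `ParameterServer` class, and all parameter values are illustrative assumptions.

```python
# Illustrative sketch only: a toy parameter-server loop whose worker pool can be
# resized between iterations, in the spirit of elastic training. This is NOT the
# EPS implementation from the paper; the model and all names are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

class ParameterServer:
    """Holds the model parameters and applies averaged gradients."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def apply(self, grads):
        # Average the gradients pushed by the current (possibly resized) worker set.
        self.w -= self.lr * np.mean(grads, axis=0)

def worker_gradient(w, batch_x, batch_y):
    # Gradient of mean squared error for a linear model y = x @ w.
    pred = batch_x @ w
    return 2.0 * batch_x.T @ (pred - batch_y) / len(batch_y)

def make_batch(n=32, dim=8):
    x = rng.normal(size=(n, dim))
    true_w = np.arange(dim, dtype=float)
    return x, x @ true_w + 0.01 * rng.normal(size=n)

ps = ParameterServer(dim=8)
num_workers = 2
for step in range(200):
    # Elasticity: the scheduler may add or remove workers between iterations.
    if step == 50:
        num_workers = 4   # scale out when spare GPUs become available
    if step == 150:
        num_workers = 3   # scale in when resources are reclaimed
    grads = []
    for _ in range(num_workers):
        x, y = make_batch()
        grads.append(worker_gradient(ps.w, x, y))
    ps.apply(grads)

print("learned weights:", np.round(ps.w, 2))
```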

Proceedings Paper (Computer Science, Hardware & Architecture)

Distributed Inference with Deep Learning Models across Heterogeneous Edge Devices

Chenghao Hu et al.

Summary: This paper presents EdgeFlow, a distributed inference mechanism designed for deep learning models with general DAG structures. Using a new progressive model-partitioning algorithm, EdgeFlow splits the model layers into independent execution units, improving the runtime performance of distributed inference. Experimental results show that EdgeFlow reduces inference latency by up to 40.2% compared to other approaches. (A schematic sketch of layer partitioning across heterogeneous devices follows this entry.)

IEEE CONFERENCE ON COMPUTER COMMUNICATIONS (IEEE INFOCOM 2022) (2022)
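As a rough illustration of partitioning model layers across heterogeneous devices, the sketch below greedily assigns a chain of layers to devices in proportion to their relative speed. It is a generic illustration under assumed per-layer costs and device speeds, not EdgeFlow's progressive partitioning algorithm for general DAGs.

```python
# Illustrative sketch only: splitting a chain of model layers across heterogeneous
# devices so each device receives a share of work proportional to its speed.
# Per-layer costs, device names, and speeds below are made-up assumptions.
layer_cost_ms = [4.0, 8.0, 8.0, 2.0, 6.0, 12.0, 4.0]        # estimated per-layer latency
device_speed  = {"pi4": 1.0, "jetson": 2.5, "laptop": 4.0}   # relative throughput

def partition_chain(costs, speeds):
    """Greedily assign consecutive layers to devices in proportion to their speed."""
    total, total_speed = sum(costs), sum(speeds.values())
    assignment, start = {}, 0
    devices = list(speeds.items())
    for i, (name, speed) in enumerate(devices):
        if i == len(devices) - 1:
            assignment[name] = list(range(start, len(costs)))  # remainder goes last
            break
        budget = total * speed / total_speed   # this device's share of total work
        taken, acc = [], 0.0
        while start < len(costs) and acc + costs[start] <= budget:
            acc += costs[start]
            taken.append(start)
            start += 1
        assignment[name] = taken
    return assignment

plan = partition_chain(layer_cost_ms, device_speed)
for dev, layers in plan.items():
    local_ms = sum(layer_cost_ms[i] for i in layers) / device_speed[dev]
    print(f"{dev}: layers {layers}, ~{local_ms:.1f} ms of local compute")
```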

Proceedings Paper (Computer Science, Information Systems)

CHRONUS: A Novel Deadline-aware Scheduler for Deep Learning Training Jobs

Wei Gao et al.

Summary: Modern GPU clusters run distributed deep learning training (DLT) jobs, and job scheduling is crucial for performance, resource utilization, and fairness. Chronus is an end-to-end scheduling system that provides deadline guarantees for SLO jobs and optimizes the performance of best-effort jobs by exploiting unique features of DLT jobs. Large-scale simulations demonstrate that Chronus significantly reduces deadline miss rates and completion times compared to existing schedulers. (A schematic sketch of deadline-aware job ordering follows this entry.)

PROCEEDINGS OF THE 2021 ACM SYMPOSIUM ON CLOUD COMPUTING (SOCC '21) (2021)
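The sketch below illustrates deadline-aware ordering in a least-slack-first spirit: SLO jobs with deadlines are admitted before best-effort jobs. It is a generic illustration with made-up job names and numbers, not the Chronus system itself.

```python
# Illustrative sketch only: a least-slack-first admission loop that gives GPUs to
# SLO jobs with deadlines before best-effort jobs. Generic illustration of
# deadline-aware scheduling; job names and numbers are hypothetical.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Job:
    name: str
    gpus: int                                 # GPUs requested
    remaining_hours: float                    # estimated remaining run time
    deadline_hours: Optional[float] = None    # None => best-effort job

def schedule(jobs, free_gpus):
    """Return the jobs to run now, preferring SLO jobs with the least slack."""
    slo = [j for j in jobs if j.deadline_hours is not None]
    best_effort = [j for j in jobs if j.deadline_hours is None]
    # Least slack (deadline minus remaining work) first, i.e. most urgent first.
    slo.sort(key=lambda j: j.deadline_hours - j.remaining_hours)
    running = []
    for job in slo + best_effort:
        if job.gpus <= free_gpus:
            running.append(job)
            free_gpus -= job.gpus
    return running

jobs = [
    Job("bert-finetune", gpus=4, remaining_hours=3.0, deadline_hours=4.0),
    Job("resnet-sweep",  gpus=8, remaining_hours=6.0),                # best-effort
    Job("gnn-train",     gpus=2, remaining_hours=1.0, deadline_hours=8.0),
]
for j in schedule(jobs, free_gpus=12):
    print("run:", j.name)
```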

Proceedings Paper (Computer Science, Information Systems)

Heterogeneity-Aware Distributed Machine Learning Training via Partial Reduce

Xupeng Miao et al.

Summary: The paper introduces partial-reduce, a novel variant of All-reduce that improves robustness and performance in heterogeneous environments by decomposing the synchronous operation into parallel, asynchronous partial-reduce operations over subsets of workers, while retaining a sub-linear convergence rate similar to distributed SGD. (A schematic simulation of the idea follows this entry.)

SIGMOD '21: PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA (2021)
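The toy simulation below conveys the intuition behind partial reduce: instead of blocking a synchronous All-reduce on every worker, each update averages gradients from whichever quorum of workers finishes first. It is a NumPy simulation under assumed worker timings and a toy objective, not the system or API from the paper.

```python
# Illustrative sketch only: simulating partial reduce, where the gradient average
# is taken over whichever workers are ready instead of blocking on all of them.
import numpy as np

rng = np.random.default_rng(1)
num_workers, dim = 8, 4
params = np.zeros(dim)
lr = 0.1

def local_gradient(w):
    # Toy quadratic objective ||w - 1||^2 plus per-worker noise (data heterogeneity).
    return 2.0 * (w - 1.0) + 0.1 * rng.normal(size=w.shape)

for step in range(100):
    # Heterogeneity: each worker finishes its mini-batch at a random time.
    finish_time = rng.exponential(scale=1.0, size=num_workers)
    # A full All-reduce would wait for max(finish_time); partial reduce proceeds as
    # soon as a quorum (here, half of the workers) has produced gradients.
    quorum = num_workers // 2
    ready = np.argsort(finish_time)[:quorum]
    grads = np.stack([local_gradient(params) for _ in ready])
    params -= lr * grads.mean(axis=0)

print("params after partial-reduce updates:", np.round(params, 2))
```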

Proceedings Paper (Computer Science, Hardware & Architecture)

RubberBand: Cloud-based Hyperparameter Tuning

Ujval Misra et al.

Summary: Hyperparameter tuning is essential for achieving state-of-the-art accuracy in machine learning. RubberBand is a framework developed to efficiently and elastically execute hyperparameter tuning jobs in the cloud, reducing costs by up to 2x compared to static allocation baselines.

PROCEEDINGS OF THE SIXTEENTH EUROPEAN CONFERENCE ON COMPUTER SYSTEMS (EUROSYS '21) (2021)

Proceedings Paper (Computer Science, Hardware & Architecture)

Elan: Towards Generic and Efficient Elastic Training for Deep Learning

Lei Xie et al.

2020 IEEE 40TH INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS (ICDCS) (2020)

Proceedings Paper (Computer Science, Hardware & Architecture)

Distributed Inference Acceleration with Adaptive DNN Partitioning and Offloading

Thaha Mohammed et al.

IEEE INFOCOM 2020 - IEEE CONFERENCE ON COMPUTER COMMUNICATIONS (2020)

Proceedings Paper (Computer Science, Information Systems)

Neural Collaborative Filtering

Xiangnan He et al.

PROCEEDINGS OF THE 26TH INTERNATIONAL CONFERENCE ON WORLD WIDE WEB (WWW'17) (2017)

Article (Computer Science, Hardware & Architecture)

Borg, Omega, and Kubernetes

Brendan Burns et al.

COMMUNICATIONS OF THE ACM (2016)

Article (Computer Science, Artificial Intelligence)

The MovieLens Datasets: History and Context

F. Maxwell Harper et al.

ACM TRANSACTIONS ON INTERACTIVE INTELLIGENT SYSTEMS (2016)

Article (Computer Science, Artificial Intelligence)

A fast and elitist multiobjective genetic algorithm: NSGA-II

K. Deb et al.

IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION (2002)