Article

DeepBoot: Dynamic Scheduling System for Training and Inference Deep Learning Tasks in GPU Cluster

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/TPDS.2023.3293835

Keywords

Deep learning system; distributed training; elastic deep learning; GPU cluster scheduling


Abstract

Deep learning tasks (DLTs) comprise training and inference tasks: training DLTs aim to minimize average job completion time (JCT), while inference DLTs need sufficient GPUs to meet real-time performance requirements. Unfortunately, existing work deploys multi-tenant training and inference GPU clusters separately, so training DLTs suffer high JCT under limited GPUs even while the inference cluster sits under-utilized due to its periodic workload. DeepBoot addresses this by utilizing idle GPUs in the inference cluster for training DLTs. Specifically, 1) DeepBoot designs an adaptive task scaling (ATS) algorithm that allocates GPUs across the training and inference clusters to training DLTs and minimizes the performance loss when inference GPUs are reclaimed; 2) DeepBoot implements auto-fast elastic (AFE) training based on Pollux to reduce the restart overhead caused by inference-GPU reclamation. Our testbed implementation and large-scale simulation on the Microsoft deep learning workload show that DeepBoot achieves 32% and 38% average JCT reduction, respectively, compared with a scheduler that does not utilize idle GPUs in the inference cluster.
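The core scheduling idea above, lending idle inference GPUs to training jobs and taking them back when inference load rises, can be sketched in a few lines. This is a minimal illustration only; the names `TrainingJob`, `share_idle_gpus`, and `reclaim`, and the greedy most-starved-first policy, are assumptions for exposition, not DeepBoot's actual ATS algorithm or API.

```python
from dataclasses import dataclass

@dataclass
class TrainingJob:
    name: str
    demand: int      # GPUs the job can productively use
    gpus: int = 0    # GPUs currently allocated
    borrowed: int = 0  # portion lent from the inference cluster

def share_idle_gpus(jobs, idle_inference_gpus):
    """Greedily lend idle inference GPUs to under-provisioned training jobs.

    Returns the number of inference GPUs still idle afterwards.
    """
    for job in sorted(jobs, key=lambda j: j.gpus):  # most-starved first
        lend = min(job.demand - job.gpus, idle_inference_gpus)
        if lend > 0:
            job.gpus += lend
            job.borrowed += lend
            idle_inference_gpus -= lend
    return idle_inference_gpus

def reclaim(jobs, needed):
    """Take borrowed GPUs back when the inference workload picks up.

    Returns the shortfall (0 if the demand was fully satisfied).
    """
    for job in sorted(jobs, key=lambda j: -j.borrowed):
        take = min(job.borrowed, needed)
        job.gpus -= take
        job.borrowed -= take
        needed -= take
        if needed == 0:
            break
    return needed
```

In the paper's design, reclaiming triggers an elastic rescale of the affected training job (the AFE mechanism) rather than a full restart; the sketch only tracks the GPU counts.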
