Journal
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
Volume 31, Issue 12, Pages 2802-2818
Publisher
IEEE COMPUTER SOC
DOI: 10.1109/TPDS.2020.3003307
Keywords
Deep learning; Training; Computational modeling; Parallel processing; Taxonomy; Data models; Biological system modeling; Survey; machine learning; deep learning; distributed systems; stochastic gradient descent; big data
Abstract
Distributed deep learning systems (DDLS) train deep neural network models by utilizing the distributed resources of a cluster. Developers of DDLS must make many decisions to process their particular workloads efficiently in their chosen environment. The advent of GPU-based deep learning and the ever-increasing size of datasets and deep neural network models, combined with the bandwidth constraints of cluster environments, require developers of DDLS to be innovative in order to train high-quality models quickly. Comparing DDLS side by side is difficult because of their extensive feature lists and architectural deviations. We aim to shed light on the fundamental principles at work when training deep neural networks in a cluster of independent machines by analyzing the general properties of deep learning training workloads and how such workloads can be distributed across a cluster to achieve collaborative model training. We thereby provide an overview of the techniques used by contemporary DDLS and discuss their influence on and implications for the training process. To conceptualize and compare DDLS, we group the different techniques into categories, thus establishing a taxonomy of distributed deep learning systems.
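As background for the kind of collaborative training the survey analyzes, the sketch below illustrates one common technique, synchronous data-parallel stochastic gradient descent with gradient averaging, in a single process. The toy model, shard layout, and hyperparameters are illustrative assumptions made here, not the systems or configurations compared in the paper.

```python
# Minimal single-process sketch of synchronous data-parallel SGD,
# one technique used by distributed deep learning systems.
# Workers are simulated in a loop; the gradient average stands in for
# the all-reduce a real cluster would perform. All details are illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data, split into shards as a cluster would partition a dataset.
X = rng.normal(size=(1024, 8))
true_w = rng.normal(size=8)
y = X @ true_w + 0.01 * rng.normal(size=1024)

num_workers = 4
shards = np.array_split(np.arange(len(X)), num_workers)

w = np.zeros(8)   # model parameters, conceptually replicated on every worker
lr = 0.1

for step in range(100):
    # Each worker computes the gradient of the mean squared error on its shard.
    grads = []
    for idx in shards:
        residual = X[idx] @ w - y[idx]
        grads.append(2.0 * X[idx].T @ residual / len(idx))
    # Averaging the per-worker gradients emulates an all-reduce; every replica
    # then applies the same update, keeping the copies synchronized.
    w -= lr * np.mean(grads, axis=0)

print("parameter error:", np.linalg.norm(w - true_w))
```

In a real DDLS the averaging step is where design choices diverge, for example whether workers synchronize through a parameter server or a collective all-reduce, and whether updates are applied synchronously or asynchronously; these are among the dimensions along which such systems can be categorized.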