Article

MPCA SGD: A Method for Distributed Training of Deep Learning Models on Spark

Journal

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
Volume 29, Issue 11, Pages 2540-2556

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/TPDS.2018.2833074

Keywords

Deep learning; distributed computing; machine learning; neural networks; Spark; stochastic gradient descent


Many distributed deep learning systems have been published over the past few years, often accompanied by impressive performance claims. In practice, these figures are often achieved in high-performance computing (HPC) environments with fast InfiniBand network connections. For the average deep learning practitioner this is usually an unrealistic scenario, since access to such facilities is unaffordable. Simple re-implementations of algorithms such as EASGD [1] for standard Ethernet environments often fail to replicate the scalability and performance of the original works [2]. In this paper, we explore this problem domain and present MPCA SGD, a method for distributed training of deep neural networks that is specifically designed to run in low-budget environments. MPCA SGD tries to make the best possible use of the available resources and can operate well even when network bandwidth is constrained. Furthermore, MPCA SGD runs on top of the popular Apache Spark [3] framework, so it can easily be deployed in existing data centers and office environments where Spark is already in use. When training large deep learning models in a gigabit Ethernet cluster, MPCA SGD achieves significantly faster convergence than many popular alternatives. For example, MPCA SGD can train ResNet-152 [4] up to 5.3x faster than state-of-the-art systems like MXNet [5], up to 5.3x faster than bulk-synchronous systems like SparkNet [6], and up to 5.3x faster than decentralized asynchronous systems like EASGD [1].
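The abstract contrasts MPCA SGD with bulk-synchronous systems like SparkNet [6] and asynchronous schemes like EASGD [1]. As a point of reference, the sketch below shows the bulk-synchronous model-averaging pattern on Spark that SparkNet-style systems follow: the driver broadcasts the current weights, each partition runs a few local SGD steps on its data shard, and the driver averages the partition results into the next global model. This is a minimal illustration, not the authors' MPCA SGD implementation; the toy least-squares objective and all function names are assumptions for demonstration only.

```python
# Minimal sketch of bulk-synchronous model averaging on Spark
# (SparkNet-style [6]); NOT the authors' MPCA SGD implementation.
import numpy as np
from pyspark import SparkContext

def local_sgd(weights, shard, steps=10, lr=0.01):
    """Refine the broadcast weights with a few SGD passes over one
    partition's data, using a toy least-squares loss 0.5*(w.x - y)^2."""
    w = weights.copy()
    data = list(shard)
    for _ in range(steps):
        for x, y in data:
            grad = (w @ x - y) * x   # gradient of 0.5*(w.x - y)^2
            w -= lr * grad
    return w

sc = SparkContext(appName="sync-model-averaging-sketch")

# Synthetic linear-regression data, split across 4 partitions.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0, 1.0])
samples = [(x, float(true_w @ x)) for x in rng.standard_normal((1000, 3))]
shards = sc.parallelize(samples, numSlices=4).cache()

w = np.zeros(3)
for _round in range(20):             # one Spark job per averaging round
    w_bc = sc.broadcast(w)
    # Each partition refines the broadcast weights on its local shard ...
    local_models = shards.mapPartitions(
        lambda it: [local_sgd(w_bc.value, it)]
    ).collect()
    # ... and the driver averages them into the next global model.
    w = np.mean(local_models, axis=0)

print("learned weights:", w)          # should approach true_w
sc.stop()
```

Note that in this pattern communication and computation strictly alternate, so the cluster idles while weights are exchanged; per the abstract, MPCA SGD's goal is to use constrained Ethernet bandwidth better than such alternatives.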
