Article

MPCA SGD: A Method for Distributed Training of Deep Learning Models on Spark

Journal

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
Volume 29, Issue 11, Pages 2540-2556

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/TPDS.2018.2833074

Keywords

Deep learning; distributed computing; machine learning; neural networks; Spark; stochastic gradient descent


Many distributed deep learning systems have been published over the past few years, often accompanied by impressive performance claims. In practice, these figures are often achieved in high-performance computing (HPC) environments with fast InfiniBand network connections. For the average deep learning practitioner this is usually an unrealistic scenario, since access to such facilities is unaffordable. Simple re-implementations of algorithms such as EASGD [1] for standard Ethernet environments often fail to replicate the scalability and performance of the original works [2]. In this paper, we explore this problem domain and present MPCA SGD, a method for distributed training of deep neural networks that is specifically designed to run in low-budget environments. MPCA SGD tries to make the best possible use of the available resources and can operate well even when network bandwidth is constrained. Furthermore, MPCA SGD runs on top of the popular Apache Spark [3] framework, so it can easily be deployed in existing data centers and office environments where Spark is already in use. When training large deep learning models in a gigabit Ethernet cluster, MPCA SGD achieves significantly faster convergence than many popular alternatives. For example, MPCA SGD can train ResNet-152 [4] up to 5.3x faster than state-of-the-art systems like MXNet [5], up to 5.3x faster than bulk-synchronous systems like SparkNet [6], and up to 5.3x faster than decentralized asynchronous systems like EASGD [1].
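The abstract contrasts MPCA SGD with bulk-synchronous systems like SparkNet [6] and asynchronous schemes like EASGD [1]. As a point of reference, the sketch below shows the bulk-synchronous model-averaging pattern on Spark that SparkNet-style systems follow: the driver broadcasts the current weights, each partition runs a few local SGD steps on its data shard, and the driver averages the partition results into the next global model. This is a minimal illustration, not the authors' MPCA SGD implementation; the toy least-squares objective and all function names are assumptions for demonstration only.

```python
# Minimal sketch of bulk-synchronous model averaging on Spark
# (SparkNet-style [6]); NOT the authors' MPCA SGD implementation.
import numpy as np
from pyspark import SparkContext

def local_sgd(weights, shard, steps=10, lr=0.01):
    """Refine the broadcast weights with a few SGD passes over one
    partition's data, using a toy least-squares loss 0.5*(w.x - y)^2."""
    w = weights.copy()
    data = list(shard)
    for _ in range(steps):
        for x, y in data:
            grad = (w @ x - y) * x   # gradient of 0.5*(w.x - y)^2
            w -= lr * grad
    return w

sc = SparkContext(appName="sync-model-averaging-sketch")

# Synthetic linear-regression data, split across 4 partitions.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0, 1.0])
samples = [(x, float(true_w @ x)) for x in rng.standard_normal((1000, 3))]
shards = sc.parallelize(samples, numSlices=4).cache()

w = np.zeros(3)
for _round in range(20):             # one Spark job per averaging round
    w_bc = sc.broadcast(w)
    # Each partition refines the broadcast weights on its local shard ...
    local_models = shards.mapPartitions(
        lambda it: [local_sgd(w_bc.value, it)]
    ).collect()
    # ... and the driver averages them into the next global model.
    w = np.mean(local_models, axis=0)

print("learned weights:", w)          # should approach true_w
sc.stop()
```

Note that in this pattern communication and computation strictly alternate, so the cluster idles while weights are exchanged; per the abstract, MPCA SGD's goal is to use constrained Ethernet bandwidth better than such alternatives.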
