Article

DLoBD: A Comprehensive Study of Deep Learning over Big Data Stacks on HPC Clusters

Journal

IEEE Transactions on Multi-Scale Computing Systems (TMSCS)

Publisher

IEEE (Institute of Electrical and Electronics Engineers)
DOI: 10.1109/TMSCS.2018.2845886

Keywords

DLoBD; deep learning; big data; CaffeOnSpark; TensorFlowOnSpark; MMLSpark (CNTKOnSpark); BigDL; RDMA

Funding

  1. US National Science Foundation [OCI-1053575]

Deep Learning over Big Data (DLoBD) is an emerging paradigm for mining value from the massive amounts of data being gathered. Many Deep Learning frameworks, such as Caffe and TensorFlow, are beginning to run over Big Data stacks such as Apache Hadoop and Spark. Despite much activity in this area, there is a lack of comprehensive studies analyzing the impact of RDMA-capable networks and of CPUs versus GPUs on DLoBD stacks. To fill this gap, we propose a systematic characterization methodology and conduct extensive performance evaluations of four representative DLoBD stacks (CaffeOnSpark, TensorFlowOnSpark, MMLSpark/CNTKOnSpark, and BigDL) to expose trends in performance, scalability, accuracy, and resource utilization. Our observations show that an RDMA-based design for DLoBD stacks can achieve up to 2.7x speedup over the IPoIB-based scheme; the RDMA scheme also scales better and utilizes resources more efficiently than IPoIB. In most cases, GPU-based schemes outperform CPU-based designs, but for LeNet on MNIST, CPU + MKL achieves better performance than GPU and GPU + cuDNN on 16 nodes. Through our evaluation and an in-depth analysis of TensorFlowOnSpark, we find that there is substantial room to improve the designs of current-generation DLoBD stacks.
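
For readers reproducing this kind of comparison, the short Python sketch below shows how the headline metrics in the abstract (speedup of an RDMA-based scheme over IPoIB, and scaling behavior across node counts) are typically derived from measured per-epoch training times. The helper functions and all timing values are illustrative placeholders introduced here, not APIs or measurements from the paper.

# Illustrative sketch: deriving speedup and scaling efficiency from
# measured per-epoch training times. All numbers below are hypothetical
# placeholders, NOT results from the paper.

def speedup(baseline_s: float, optimized_s: float) -> float:
    """Speedup of the optimized scheme (e.g., RDMA) over the baseline (e.g., IPoIB)."""
    return baseline_s / optimized_s

def scaling_efficiency(time_small_s: float, time_large_s: float, scale_factor: int) -> float:
    """Fraction of ideal linear scaling retained when the cluster grows by scale_factor."""
    return time_small_s / (time_large_s * scale_factor)

if __name__ == "__main__":
    # Hypothetical per-epoch times (seconds) for one model/dataset at a fixed node count.
    ipoib_epoch_s, rdma_epoch_s = 540.0, 200.0
    print(f"RDMA vs. IPoIB speedup: {speedup(ipoib_epoch_s, rdma_epoch_s):.2f}x")

    # Hypothetical strong-scaling measurement when growing from 4 to 16 nodes (4x).
    t_4_nodes_s, t_16_nodes_s = 1200.0, 340.0
    print(f"Scaling efficiency (4 -> 16 nodes): {scaling_efficiency(t_4_nodes_s, t_16_nodes_s, 4):.0%}")

In the abstract's terms, the first comparison corresponds to the reported up-to-2.7x RDMA-over-IPoIB speedup, while the second quantifies the claim that the RDMA scheme "scales better" across nodes.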
