BPCM: A Flexible High-Speed Bypass Parallel Communication Mechanism for GPU Cluster

IEEE ACCESS (2020)

Journal

IEEE ACCESS

Volume 8, Pages 103256-103272

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/ACCESS.2020.2999096

Keywords

DPDK; GPU cluster; multi-core; multi-NIC; data link layer; bypass parallel communication

Funding

National Natural Science Foundation of China [61572325, 60970012]
Ministry of Education Doctoral Fund of Ph.D. Supervisor of China [20113120110008]
Shanghai Key Science and Technology Project in Information Technology Field [14511107902, 16DZ1203603]
Shanghai Leading Academic Discipline Project [XTKX2012]
Shanghai Engineering Research Center Project [GCZX14014, C14001]
Intel Asia Pacific Research and Development Center

Abstract

As the computational tasks facing artificial intelligence grow more complex, machine learning models keep growing in scale, and the volume and frequency of parameter synchronization grow with them. Communication bandwidth within the GPU cluster thus becomes the biggest bottleneck for distributed model training. Many existing solutions have not been widely adopted because they require specialized equipment, are costly, and are difficult to use. To solve this problem, this paper proposes a multi-NIC bypass parallel communication mechanism based on Intel DPDK that increases bandwidth within the GPU cluster at lower cost and uses the idle CPU resources of the GPU server to accelerate data transmission. First, we propose a data transmission model based on multiple network cards and design a port load balancing algorithm to keep the load balanced across them. Second, we implement a CPU multi-core scheduling model and algorithm to reduce CPU energy consumption, resource occupation, and the impact on other applications. Third, for scenarios with multiple applications, we design and implement a rate adjustment model and algorithm to ensure that applications share bandwidth fairly. Finally, experimental results show that the mechanism provides high bandwidth for GPU clusters using inexpensive network cards, delivers the aggregated bandwidth of multiple network cards over a single connection, offers high reliability and transmission efficiency, and is simple to use and flexible to expand.
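
The abstract does not spell out the port load balancing algorithm, so the following is only a minimal sketch of the general idea: spreading the chunks of one logical connection across several NIC ports by always picking the port with the smallest backlog relative to its capacity. The `Port` class, its fields, and the functions `pick_port` and `distribute` are hypothetical names introduced for illustration; they are not taken from the paper or from the DPDK API.

```python
from dataclasses import dataclass


@dataclass
class Port:
    """One NIC port. All fields are illustrative assumptions, not the paper's model."""
    port_id: int
    capacity_gbps: float
    queued_bytes: int = 0


def pick_port(ports):
    """Return the port whose backlog, normalized by link capacity, is smallest.

    Ties go to the port that appears first in the list.
    """
    return min(ports, key=lambda p: p.queued_bytes / p.capacity_gbps)


def distribute(chunk_sizes, ports):
    """Assign each chunk of a single logical connection to the least-loaded port."""
    assignment = []
    for size in chunk_sizes:
        port = pick_port(ports)
        port.queued_bytes += size  # account for the newly queued chunk
        assignment.append(port.port_id)
    return assignment


# Three equal-capacity ports carrying nine equal-size chunks of one connection.
ports = [Port(0, 10.0), Port(1, 10.0), Port(2, 10.0)]
plan = distribute([1500] * 9, ports)
```

With equal capacities and equal chunk sizes this greedy rule reduces to round-robin; with unequal capacities, faster ports receive proportionally more chunks. A real implementation would also have to drain the queues as packets are sent and reorder chunks at the receiver, which this sketch leaves out.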
