Article

SAFA: A Semi-Asynchronous Protocol for Fast Federated Learning With Low Overhead

Journal

IEEE TRANSACTIONS ON COMPUTERS
Volume 70, Issue 5, Pages 655-668

Publisher

IEEE Computer Society
DOI: 10.1109/TC.2020.2994391

Keywords

Protocols; Training; Machine learning; Data models; Optimization; Convergence; Distributed databases; Distributed computing; Edge intelligence; Federated learning

Funding

  1. Worldwide Byte Security Information Technology Company Ltd.
  2. National Natural Science Foundation of China [61772205]
  3. Guangzhou Development Zone Science and Technology [2018GH17]
  4. Major Program of Guangdong Basic and Applied Research [2019B030302002]
  5. Guangdong project [2017B030314073, 2018B030325002]
  6. EPSRC Centre for Doctoral Training in Urban Science [EP/L016400/1]
  7. Alan Turing Institute under EPSRC [EP/N510129/1]
  8. National Center of Excellence for IoT Systems Cybersecurity [EP/S035362/1]
  9. Alan Turing Institute under EPSRC Grant PETRAS
  10. EPSRC [EP/N510129/1, EP/R007195/1, EP/L016400/1] Funding Source: UKRI

Abstract

Federated learning (FL) has attracted increasing attention as a promising approach to driving a vast number of end devices with artificial intelligence. However, it is challenging to guarantee the efficiency of FL given the unreliable nature of end devices, and the cost of device-server communication cannot be neglected. In this article, we propose SAFA, a semi-asynchronous FL protocol, to address problems in federated learning such as low round efficiency and poor convergence rate in extreme conditions (e.g., clients frequently dropping offline). We introduce novel designs in the steps of model distribution, client selection, and global aggregation to mitigate the impacts of stragglers, crashes, and model staleness, so as to boost efficiency and improve the quality of the global model. We have conducted extensive experiments with typical machine learning tasks. The results demonstrate that the proposed protocol is effective in shortening federated round duration, reducing local resource wastage, and improving the accuracy of the global model at an acceptable communication cost.
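To make the semi-asynchronous idea concrete, the sketch below shows one plausible reading of the abstract: the server force-syncs only clients whose local model is too stale, caches whatever updates arrive, and aggregates once a quorum reports, folding in tolerably stale results. All specifics here (the lag tolerance TAU, the QUORUM fraction, the crash simulation, plain averaging) are illustrative assumptions, not SAFA's actual design, which the paper defines precisely.

```python
import random
import numpy as np

NUM_CLIENTS = 10   # hypothetical population size
TAU = 2            # assumed staleness (lag) tolerance, in rounds
QUORUM = 0.5       # assumed fraction of clients to wait for per round

def local_train(model, data):
    # Placeholder for local SGD: nudge the model toward the client's data.
    return model + 0.1 * (data - model)

def semi_async_round(global_model, versions, cache, r, client_data):
    # Lag-tolerant distribution: force-sync only clients whose cached
    # model is staler than TAU rounds; fresher clients keep their copy.
    for c in range(NUM_CLIENTS):
        if r - versions[c] > TAU:
            cache[c] = global_model.copy()
            versions[c] = r

    # Simulate unreliable devices: ~30% crash or straggle this round.
    finished = [c for c in range(NUM_CLIENTS) if random.random() > 0.3]
    for c in finished:
        cache[c] = local_train(cache[c], client_data[c])
        versions[c] = r

    # Semi-asynchronous aggregation: proceed once a quorum has reported,
    # folding in cached (tolerably stale) results from slower clients.
    if len(finished) >= QUORUM * NUM_CLIENTS:
        usable = [c for c in range(NUM_CLIENTS) if r - versions[c] <= TAU]
        global_model = np.mean([cache[c] for c in usable], axis=0)
    return global_model

# Toy run on a 4-dimensional "model".
global_model = np.zeros(4)
client_data = [np.random.randn(4) for _ in range(NUM_CLIENTS)]
versions = [0] * NUM_CLIENTS
cache = [global_model.copy() for _ in range(NUM_CLIENTS)]
for r in range(1, 6):
    global_model = semi_async_round(global_model, versions, cache, r, client_data)
print(global_model)
```

The point of the sketch is the middle ground it occupies: unlike fully synchronous FL, the round is not blocked by the slowest client, and unlike fully asynchronous FL, updates older than the tolerance never pollute the aggregate.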
