4.5 Article

Scheduling Hyperparameters to Improve Generalization: From Centralized SGD to Asynchronous SGD

Publisher

Association for Computing Machinery (ACM)
DOI: 10.1145/3544782

Keywords

Deep learning optimization; hyperparameter tuning; SGD momentum; multistage QHM; asynchronous SGD; generalization bound


This article investigates how to schedule hyperparameters to improve the generalization of both centralized single-machine stochastic gradient descent (SGD) and distributed asynchronous SGD (ASGD). It proposes a unified framework, multistage quasi-hyperbolic momentum (Multistage QHM), that covers a large family of momentum variants as special cases and achieves better generalization than traditional learning-rate-only tuning.
This article studies how to schedule hyperparameters to improve the generalization of both centralized single-machine stochastic gradient descent (SGD) and distributed asynchronous SGD (ASGD). SGD augmented with momentum variants (e.g., heavy-ball momentum (SHB) and Nesterov's accelerated gradient (NAG)) has been the default optimizer for many tasks, in both centralized and distributed environments. However, many advanced momentum variants, despite their empirical advantage over classical SHB/NAG, introduce extra hyperparameters to tune, and this error-prone tuning is a main barrier to AutoML.

Centralized SGD: We first focus on centralized single-machine SGD and show how to efficiently schedule the hyperparameters of a large class of momentum variants to improve generalization. We propose a unified framework called multistage quasi-hyperbolic momentum (Multistage QHM), which covers a large family of momentum variants as its special cases (e.g., vanilla SGD/SHB/NAG). Existing works mainly focus on scheduling only the decay of the learning rate α, whereas Multistage QHM additionally allows other hyperparameters (e.g., the momentum factor) to vary, and demonstrates better generalization than tuning α alone. We show the convergence of Multistage QHM for general non-convex objectives.

Distributed SGD: We then extend our theory to distributed asynchronous SGD (ASGD), in which a parameter server distributes data batches to several worker machines and updates parameters by aggregating batch gradients from the workers. We quantify the asynchrony between different workers (i.e., gradient staleness), model the dynamics of asynchronous iterations via a stochastic differential equation (SDE), and then derive a PAC-Bayesian generalization bound for ASGD. As a byproduct, we show how a moderately large learning rate helps ASGD generalize better.

Our tuning strategies have rigorous justifications rather than blind trial and error: we theoretically prove why they decrease the derived generalization errors in both cases. The strategies simplify the tuning process and empirically beat competitive optimizers in test accuracy. Our code is publicly available at https://github.com/jsycsjh/centralized-asynchronous-tuning.
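To make the recursion that Multistage QHM builds on concrete, below is a minimal sketch assuming the standard quasi-hyperbolic momentum update (an exponential moving average of gradients mixed with the raw gradient), with hyperparameters held fixed within each stage and changed only at stage boundaries. The function multistage_qhm, the stages schedule, and its field names are illustrative, not the paper's code or notation.

import numpy as np

def multistage_qhm(grad_fn, theta0, stages):
    """Minimal sketch of a multistage QHM loop (illustrative names).

    Within each stage the hyperparameters (alpha, beta, nu) stay fixed and
    only change at stage boundaries, e.g.
        stages = [{"steps": 1000, "alpha": 0.1,  "beta": 0.9,  "nu": 0.7},
                  {"steps": 1000, "alpha": 0.01, "beta": 0.99, "nu": 1.0}]
    """
    theta = np.asarray(theta0, dtype=float)
    g = np.zeros_like(theta)                      # momentum buffer (EMA of gradients)
    for stage in stages:
        alpha, beta, nu = stage["alpha"], stage["beta"], stage["nu"]
        for _ in range(stage["steps"]):
            grad = grad_fn(theta)                 # stochastic gradient at current theta
            g = beta * g + (1.0 - beta) * grad    # update the moving average
            # QHM step: convex mix of the raw gradient and the momentum buffer.
            # nu = 0 gives vanilla SGD, nu = 1 a normalized SHB, nu = beta an NAG-like update.
            theta = theta - alpha * ((1.0 - nu) * grad + nu * g)
    return theta

A learning-rate-only schedule corresponds to changing alpha across stages while keeping beta and nu fixed; the point of Multistage QHM is that beta and nu may vary across stages as well.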
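The asynchronous setting can likewise be illustrated with a toy simulation. The sketch below emulates a parameter server that always applies the oldest in-flight worker gradient, so each applied gradient was computed at a stale parameter snapshot. The round-robin staleness model and all names here are simplifying assumptions for illustration, not the paper's exact setup or SDE analysis.

import numpy as np
from collections import deque

def simulate_asgd(grad_fn, theta0, lr, num_workers, steps):
    """Toy round-robin simulation of asynchronous SGD (illustrative only).

    Each worker computes its gradient at the parameter snapshot it last
    pulled, so the gradient the server applies at a given step is several
    updates stale.
    """
    theta = np.asarray(theta0, dtype=float)
    # Every worker starts by pulling the initial parameters and computing a gradient.
    inflight = deque(grad_fn(theta) for _ in range(num_workers))
    for _ in range(steps):
        stale_grad = inflight.popleft()   # oldest in-flight gradient (staleness ~ num_workers - 1)
        theta = theta - lr * stale_grad   # server applies the stale update
        inflight.append(grad_fn(theta))   # freed worker pulls the current theta and recomputes
    return theta

In this caricature the staleness is fixed at roughly num_workers - 1 steps; the paper quantifies staleness more generally and models the resulting dynamics with an SDE in order to derive its PAC-Bayesian generalization bound for ASGD.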

