4.5 Article

Scheduling Hyperparameters to Improve Generalization: From Centralized SGD to Asynchronous SGD

Publisher

Association for Computing Machinery (ACM)
DOI: 10.1145/3544782

Keywords

Deep learning optimization; hyperparameter tuning; SGD momentum; multistage QHM; asynchronous SGD; generalization bound


This article investigates how to schedule hyperparameters to improve the generalization of both centralized single-machine stochastic gradient descent (SGD) and distributed asynchronous SGD (ASGD). It proposes a unified framework, multistage quasi-hyperbolic momentum (Multistage QHM), that covers a large family of momentum variants as special cases and achieves better generalization than traditional learning-rate-only tuning.
This article studies how to schedule hyperparameters to improve the generalization of both centralized single-machine stochastic gradient descent (SGD) and distributed asynchronous SGD (ASGD). SGD augmented with momentum variants (e.g., heavy-ball momentum (SHB) and Nesterov's accelerated gradient (NAG)) has been the default optimizer for many tasks, in both centralized and distributed environments. However, many advanced momentum variants, despite their empirical advantage over classical SHB/NAG, introduce extra hyperparameters to tune, and this error-prone tuning is a main barrier to AutoML.

Centralized SGD: We first focus on centralized single-machine SGD and show how to efficiently schedule the hyperparameters of a large class of momentum variants to improve generalization. We propose a unified framework called multistage quasi-hyperbolic momentum (Multistage QHM), which covers a large family of momentum variants as its special cases (e.g., vanilla SGD/SHB/NAG). Existing works mainly focus on scheduling only the decay of the learning rate α, whereas Multistage QHM additionally allows other hyperparameters (e.g., the momentum factor) to vary, and demonstrates better generalization than tuning α alone. We show the convergence of Multistage QHM for general non-convex objectives.

Distributed SGD: We then extend our theory to distributed asynchronous SGD (ASGD), in which a parameter server distributes data batches to several worker machines and updates parameters by aggregating batch gradients from the workers. We quantify the asynchrony between different workers (i.e., gradient staleness), model the dynamics of asynchronous iterations via a stochastic differential equation (SDE), and then derive a PAC-Bayesian generalization bound for ASGD. As a byproduct, we show how a moderately large learning rate helps ASGD generalize better.

Our tuning strategies have rigorous justifications rather than blind trial and error: we theoretically prove why they decrease the derived generalization errors in both cases. The strategies simplify the tuning process and empirically beat competitive optimizers in test accuracy. Our code is publicly available at https://github.com/jsycsjh/centralized-asynchronous-tuning.
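To make the recursion that Multistage QHM builds on concrete, below is a minimal sketch assuming the standard quasi-hyperbolic momentum update (an exponential moving average of gradients mixed with the raw gradient), with hyperparameters held fixed within each stage and changed only at stage boundaries. The function multistage_qhm, the stages schedule, and its field names are illustrative, not the paper's code or notation.

import numpy as np

def multistage_qhm(grad_fn, theta0, stages):
    """Minimal sketch of a multistage QHM loop (illustrative names).

    Within each stage the hyperparameters (alpha, beta, nu) stay fixed and
    only change at stage boundaries, e.g.
        stages = [{"steps": 1000, "alpha": 0.1,  "beta": 0.9,  "nu": 0.7},
                  {"steps": 1000, "alpha": 0.01, "beta": 0.99, "nu": 1.0}]
    """
    theta = np.asarray(theta0, dtype=float)
    g = np.zeros_like(theta)                      # momentum buffer (EMA of gradients)
    for stage in stages:
        alpha, beta, nu = stage["alpha"], stage["beta"], stage["nu"]
        for _ in range(stage["steps"]):
            grad = grad_fn(theta)                 # stochastic gradient at current theta
            g = beta * g + (1.0 - beta) * grad    # update the moving average
            # QHM step: convex mix of the raw gradient and the momentum buffer.
            # nu = 0 gives vanilla SGD, nu = 1 a normalized SHB, nu = beta an NAG-like update.
            theta = theta - alpha * ((1.0 - nu) * grad + nu * g)
    return theta

A learning-rate-only schedule corresponds to changing alpha across stages while keeping beta and nu fixed; the point of Multistage QHM is that beta and nu may vary across stages as well.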
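The asynchronous setting can likewise be illustrated with a toy simulation. The sketch below emulates a parameter server that always applies the oldest in-flight worker gradient, so each applied gradient was computed at a stale parameter snapshot. The round-robin staleness model and all names here are simplifying assumptions for illustration, not the paper's exact setup or SDE analysis.

import numpy as np
from collections import deque

def simulate_asgd(grad_fn, theta0, lr, num_workers, steps):
    """Toy round-robin simulation of asynchronous SGD (illustrative only).

    Each worker computes its gradient at the parameter snapshot it last
    pulled, so the gradient the server applies at a given step is several
    updates stale.
    """
    theta = np.asarray(theta0, dtype=float)
    # Every worker starts by pulling the initial parameters and computing a gradient.
    inflight = deque(grad_fn(theta) for _ in range(num_workers))
    for _ in range(steps):
        stale_grad = inflight.popleft()   # oldest in-flight gradient (staleness ~ num_workers - 1)
        theta = theta - lr * stale_grad   # server applies the stale update
        inflight.append(grad_fn(theta))   # freed worker pulls the current theta and recomputes
    return theta

In this caricature the staleness is fixed at roughly num_workers - 1 steps; the paper quantifies staleness more generally and models the resulting dynamics with an SDE in order to derive its PAC-Bayesian generalization bound for ASGD.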

