Article

MAKING THE LAST ITERATE OF SGD INFORMATION THEORETICALLY OPTIMAL

Journal

SIAM JOURNAL ON OPTIMIZATION
Volume 31, Issue 2, Pages 1108-1130

Publisher

SIAM PUBLICATIONS
DOI: 10.1137/19M128908X

Keywords

stochastic gradient descent; convex optimization; gradient descent; last iterate guarantees

Funding

  1. ONR [N00014-17-1-2147]
  2. MIT-IBM Watson AI Lab

Abstract

Stochastic gradient descent (SGD) is one of the most widely used algorithms for large-scale optimization problems. While classical theoretical analysis of SGD for convex problems studies (suffix) averages of iterates and obtains information theoretically optimal bounds on suboptimality, the last point of SGD is, by far, the most preferred choice in practice. The best known results for the last point of SGD [O. Shamir and T. Zhang, Proceedings of the 30th International Conference on Machine Learning, 2013, pp. 71-79], however, are suboptimal compared to information theoretic lower bounds by a log T factor, where T is the number of iterations. Harvey, Liaw, Plan, and Randhawa [Conference on Learning Theory, PMLR, 2019, pp. 1579-1613] show that, in fact, this additional log T factor is tight for the standard step size sequences Θ(1/√t) and Θ(1/t) in the non-strongly convex and strongly convex settings, respectively. Similarly, even for subgradient descent (GD) applied to nonsmooth convex functions, the best known step size sequences still lead to O(log T)-suboptimal convergence rates for the final iterate. The main contribution of this work is to design new step size sequences that enjoy information theoretically optimal bounds on the suboptimality of the last point of SGD as well as GD. We achieve this by designing a modification scheme that converts one sequence of step sizes to another, so that the last point of SGD/GD with the modified sequence has the same suboptimality guarantees as the average of SGD/GD with the original sequence. We also show that our result holds with high probability. We validate our results through simulations, which demonstrate that the new step size sequences indeed improve the final iterate significantly compared to the standard ones.
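To make the high-level idea concrete, below is a minimal Python sketch, not the paper's exact construction, that compares the last iterate of SGD under the standard Θ(1/√t) step sizes against a phase-based, geometrically decreasing sequence in the spirit described above. The objective f(x) = |x|, the noise model, the phase lengths, and all constants are illustrative assumptions.

import numpy as np

def sgd_last_iterate(steps, T, noise=1.0, x0=5.0, seed=0):
    """Run SGD on f(x) = |x| with noisy subgradients; return the last iterate."""
    rng = np.random.default_rng(seed)
    x = x0
    for t in range(T):
        g = np.sign(x) + noise * rng.standard_normal()  # stochastic subgradient of |x|
        x -= steps[t] * g
    return x

T = 2 ** 14

# Standard sequence: eta_t = 1 / sqrt(t + 1), whose final iterate is known
# to be log T suboptimal.
standard = [1.0 / np.sqrt(t + 1) for t in range(T)]

# Phase-based sequence (hypothetical constants): split the horizon into
# roughly log2(T) phases of geometrically shrinking length, holding the
# step size constant at ~ 2^{-i} / sqrt(T) inside phase i.
modified = []
phase, start = 0, 0
while start < T:
    length = min(max(1, T // 2 ** (phase + 1)), T - start)
    modified.extend([1.0 / (np.sqrt(T) * 2 ** phase)] * length)
    start += length
    phase += 1

for name, steps in (("standard 1/sqrt(t)", standard), ("phase-based", modified)):
    finals = [abs(sgd_last_iterate(steps, T, seed=s)) for s in range(20)]
    print(f"{name:>20}: mean suboptimality |x_T| = {np.mean(finals):.4f}")

On this toy problem the phase-based schedule should typically yield a smaller final error at the same iteration budget, mirroring the log T gap that the paper's modification scheme closes.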

