Article

Is SGD a Bayesian sampler? Well, almost

Journal

JOURNAL OF MACHINE LEARNING RESEARCH
Volume 22

Publisher

MICROTOME PUBL

Keywords

stochastic gradient descent; Bayesian neural networks; deep learning; Gaussian processes; generalisation


Deep neural networks exhibit a strong inductive bias in the overparameterised regime, arising primarily from the characteristics of the parameter-function map. The Bayesian posterior probability of a function is a key determinant of a DNN's generalisation and closely tracks the probability that stochastic gradient descent converges on that function.
Deep neural networks (DNNs) generalise remarkably well in the overparameterised regime, suggesting a strong inductive bias towards functions with low generalisation error. We empirically investigate this bias by calculating, for a range of architectures and datasets, the probability P_SGD(f|S) that an overparameterised DNN, trained with stochastic gradient descent (SGD) or one of its variants, converges on a function f consistent with a training set S. We also use Gaussian processes to estimate the Bayesian posterior probability P_B(f|S) that the DNN expresses f upon random sampling of its parameters, conditioned on S. Our main findings are that P_SGD(f|S) correlates remarkably well with P_B(f|S) and that P_B(f|S) is strongly biased towards low-error and low-complexity functions. These results imply that the strong inductive bias in the parameter-function map (which determines P_B(f|S)), rather than a special property of SGD, is the primary explanation for why DNNs generalise so well in the overparameterised regime. While our results suggest that the Bayesian posterior P_B(f|S) is the first-order determinant of P_SGD(f|S), there remain second-order differences that are sensitive to hyperparameter tuning. A function-probability picture, based on P_SGD(f|S) and/or P_B(f|S), can shed light on how variations in architecture or hyperparameter settings such as batch size, learning rate, and optimiser choice affect DNN performance.
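
To make the function-probability picture concrete, the following is a minimal toy sketch, not the authors' setup: it estimates P_SGD(f|S) by repeatedly training a small network with plain SGD and counting which function it converges to, and uses naive rejection sampling of random parameters as a crude stand-in for P_B(f|S); the paper itself estimates P_B(f|S) with a Gaussian-process approximation. The majority-vote target, the one-hidden-layer tanh architecture, and all hyperparameters below are illustrative assumptions.

# Toy sketch: compare P_SGD(f|S) and a rejection-sampling estimate of P_B(f|S).
# Illustrative assumptions throughout; not the paper's architectures or datasets.
import numpy as np
from collections import Counter

N_BITS, WIDTH = 5, 40                       # input size and hidden width (illustrative)
rng = np.random.default_rng(0)

# Toy problem: all 5-bit inputs; target = majority vote of the bits.
X_all = np.array([[int(b) for b in format(i, f"0{N_BITS}b")]
                  for i in range(2 ** N_BITS)], dtype=float)
y_all = (X_all.sum(axis=1) > N_BITS / 2).astype(int)

train_idx = rng.choice(len(X_all), size=8, replace=False)   # small training set S
test_idx = np.setdiff1d(np.arange(len(X_all)), train_idx)
X_tr, y_tr, X_te = X_all[train_idx], y_all[train_idx], X_all[test_idx]

def init_params(rng):
    # Random one-hidden-layer tanh network: this defines the parameter prior.
    W1 = rng.normal(0.0, 1.0 / np.sqrt(N_BITS), (N_BITS, WIDTH))
    W2 = rng.normal(0.0, 1.0 / np.sqrt(WIDTH), (WIDTH, 1))
    return [W1, np.zeros(WIDTH), W2, np.zeros(1)]

def logits(params, X):
    W1, b1, W2, b2 = params
    return (np.tanh(X @ W1 + b1) @ W2 + b2).ravel()

def function_id(params):
    # Identify the function f by its labels on the held-out inputs.
    return tuple((logits(params, X_te) > 0).astype(int))

def fits_S(params):
    return np.all((logits(params, X_tr) > 0).astype(int) == y_tr)

def train_sgd(rng, lr=0.1, epochs=2000):
    # Plain per-example SGD on the logistic loss; stop at zero training error.
    params = init_params(rng)
    for _ in range(epochs):
        for i in rng.permutation(len(X_tr)):
            W1, b1, W2, b2 = params
            x, y = X_tr[i:i + 1], y_tr[i]
            h = np.tanh(x @ W1 + b1)                          # shape (1, WIDTH)
            p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2).ravel()))  # predicted probability
            g = p - y                                         # dLoss/dlogit
            dh = g * W2.ravel() * (1.0 - h.ravel() ** 2)
            params = [W1 - lr * (x.T @ dh[None, :]), b1 - lr * dh,
                      W2 - lr * (h.T * g), b2 - lr * g]
        if fits_S(params):
            return params
    return None

def sample_consistent(rng):
    # Rejection sampling: draw random parameters until the network fits S exactly.
    while True:
        params = init_params(rng)
        if fits_S(params):
            return params

N_RUNS = 200
sgd_counts, bayes_counts = Counter(), Counter()
for s in range(N_RUNS):
    trained = train_sgd(np.random.default_rng(s))
    if trained is not None:
        sgd_counts[function_id(trained)] += 1
    bayes_counts[function_id(sample_consistent(np.random.default_rng(10_000 + s)))] += 1

print("most frequent SGD functions: P_SGD vs P_B estimate, and their test error")
for fid, c in sgd_counts.most_common(5):
    err = np.mean(np.array(fid) != y_all[test_idx])
    print(f"P_SGD ~ {c / N_RUNS:.2f}   P_B ~ {bayes_counts[fid] / N_RUNS:.2f}   error {err:.2f}")

On this toy problem the functions SGD finds most often should also receive high probability under random sampling of parameters and tend to have low test error, which is the qualitative correlation the paper reports at much larger scale.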

