Article

Is SGD a Bayesian sampler? Well, almost

Journal

JOURNAL OF MACHINE LEARNING RESEARCH
Volume 22

Publisher

MICROTOME PUBL

Keywords

stochastic gradient descent; Bayesian neural networks; deep learning; Gaussian processes; generalisation


Deep neural networks exhibit a strong inductive bias in the overparameterised regime, arising primarily from the characteristics of the parameter-function map. The Bayesian posterior probability of a function is a key determinant of a DNN's generalisation and closely tracks the probability that stochastic gradient descent converges on that function.
Deep neural networks (DNNs) generalise remarkably well in the overparameterised regime, suggesting a strong inductive bias towards functions with low generalisation error. We empirically investigate this bias by calculating, for a range of architectures and datasets, the probability P_SGD(f|S) that an overparameterised DNN, trained with stochastic gradient descent (SGD) or one of its variants, converges on a function f consistent with a training set S. We also use Gaussian processes to estimate the Bayesian posterior probability P_B(f|S) that the DNN expresses f upon random sampling of its parameters, conditioned on S. Our main findings are that P_SGD(f|S) correlates remarkably well with P_B(f|S) and that P_B(f|S) is strongly biased towards low-error and low-complexity functions. These results imply that the strong inductive bias in the parameter-function map (which determines P_B(f|S)), rather than a special property of SGD, is the primary explanation for why DNNs generalise so well in the overparameterised regime. While our results suggest that the Bayesian posterior P_B(f|S) is the first-order determinant of P_SGD(f|S), there remain second-order differences that are sensitive to hyperparameter tuning. A function-probability picture, based on P_SGD(f|S) and/or P_B(f|S), can shed light on how variations in architecture or hyperparameter settings such as batch size, learning rate, and optimiser choice affect DNN performance.
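
To make the function-probability picture concrete, the following is a minimal toy sketch, not the authors' setup: it estimates P_SGD(f|S) by repeatedly training a small network with plain SGD and counting which function it converges to, and uses naive rejection sampling of random parameters as a crude stand-in for P_B(f|S); the paper itself estimates P_B(f|S) with a Gaussian-process approximation. The majority-vote target, the one-hidden-layer tanh architecture, and all hyperparameters below are illustrative assumptions.

# Toy sketch: compare P_SGD(f|S) and a rejection-sampling estimate of P_B(f|S).
# Illustrative assumptions throughout; not the paper's architectures or datasets.
import numpy as np
from collections import Counter

N_BITS, WIDTH = 5, 40                       # input size and hidden width (illustrative)
rng = np.random.default_rng(0)

# Toy problem: all 5-bit inputs; target = majority vote of the bits.
X_all = np.array([[int(b) for b in format(i, f"0{N_BITS}b")]
                  for i in range(2 ** N_BITS)], dtype=float)
y_all = (X_all.sum(axis=1) > N_BITS / 2).astype(int)

train_idx = rng.choice(len(X_all), size=8, replace=False)   # small training set S
test_idx = np.setdiff1d(np.arange(len(X_all)), train_idx)
X_tr, y_tr, X_te = X_all[train_idx], y_all[train_idx], X_all[test_idx]

def init_params(rng):
    # Random one-hidden-layer tanh network: this defines the parameter prior.
    W1 = rng.normal(0.0, 1.0 / np.sqrt(N_BITS), (N_BITS, WIDTH))
    W2 = rng.normal(0.0, 1.0 / np.sqrt(WIDTH), (WIDTH, 1))
    return [W1, np.zeros(WIDTH), W2, np.zeros(1)]

def logits(params, X):
    W1, b1, W2, b2 = params
    return (np.tanh(X @ W1 + b1) @ W2 + b2).ravel()

def function_id(params):
    # Identify the function f by its labels on the held-out inputs.
    return tuple((logits(params, X_te) > 0).astype(int))

def fits_S(params):
    return np.all((logits(params, X_tr) > 0).astype(int) == y_tr)

def train_sgd(rng, lr=0.1, epochs=2000):
    # Plain per-example SGD on the logistic loss; stop at zero training error.
    params = init_params(rng)
    for _ in range(epochs):
        for i in rng.permutation(len(X_tr)):
            W1, b1, W2, b2 = params
            x, y = X_tr[i:i + 1], y_tr[i]
            h = np.tanh(x @ W1 + b1)                          # shape (1, WIDTH)
            p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2).ravel()))  # predicted probability
            g = p - y                                         # dLoss/dlogit
            dh = g * W2.ravel() * (1.0 - h.ravel() ** 2)
            params = [W1 - lr * (x.T @ dh[None, :]), b1 - lr * dh,
                      W2 - lr * (h.T * g), b2 - lr * g]
        if fits_S(params):
            return params
    return None

def sample_consistent(rng):
    # Rejection sampling: draw random parameters until the network fits S exactly.
    while True:
        params = init_params(rng)
        if fits_S(params):
            return params

N_RUNS = 200
sgd_counts, bayes_counts = Counter(), Counter()
for s in range(N_RUNS):
    trained = train_sgd(np.random.default_rng(s))
    if trained is not None:
        sgd_counts[function_id(trained)] += 1
    bayes_counts[function_id(sample_consistent(np.random.default_rng(10_000 + s)))] += 1

print("most frequent SGD functions: P_SGD vs P_B estimate, and their test error")
for fid, c in sgd_counts.most_common(5):
    err = np.mean(np.array(fid) != y_all[test_idx])
    print(f"P_SGD ~ {c / N_RUNS:.2f}   P_B ~ {bayes_counts[fid] / N_RUNS:.2f}   error {err:.2f}")

On this toy problem the functions SGD finds most often should also receive high probability under random sampling of parameters and tend to have low test error, which is the qualitative correlation the paper reports at much larger scale.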

