4.7 Article

Is SGD a Bayesian sampler? Well, almost

Journal

JOURNAL OF MACHINE LEARNING RESEARCH

Publisher

MICROTOME PUBL

Keywords

stochastic gradient descent; Bayesian neural networks; deep learning; Gaussian processes; generalisation


The study found that deep neural networks exhibit a strong inductive bias in the overparameterised regime, arising primarily from the characteristics of the parameter-function map. The Bayesian posterior probability is a key determinant of a DNN's generalisation ability and correlates closely with the behaviour of stochastic gradient descent.
Deep neural networks (DNNs) generalise remarkably well in the overparameterised regime, suggesting a strong inductive bias towards functions with low generalisation error. We empirically investigate this bias by calculating, for a range of architectures and datasets, the probability P_SGD(f|S) that an overparameterised DNN, trained with stochastic gradient descent (SGD) or one of its variants, converges on a function f consistent with a training set S. We also use Gaussian processes to estimate the Bayesian posterior probability P_B(f|S) that the DNN expresses f upon random sampling of its parameters, conditioned on S. Our main findings are that P_SGD(f|S) correlates remarkably well with P_B(f|S) and that P_B(f|S) is strongly biased towards low-error and low-complexity functions. These results imply that strong inductive bias in the parameter-function map (which determines P_B(f|S)), rather than a special property of SGD, is the primary explanation for why DNNs generalise so well in the overparameterised regime. While our results suggest that the Bayesian posterior P_B(f|S) is the first-order determinant of P_SGD(f|S), there remain second-order differences that are sensitive to hyperparameter tuning. A function-probability picture, based on P_SGD(f|S) and/or P_B(f|S), can shed light on how variations in architecture or hyperparameter settings such as batch size, learning rate, and optimiser choice affect DNN performance.
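
To make the estimation of P_SGD(f|S) concrete, here is a minimal sketch (not the authors' code): train the same overparameterised network many times from independent random initialisations until it fits S, identify each learned function f by its label vector on a fixed test set, and count frequencies. The architecture, hyperparameters, and full-batch gradient updates are illustrative assumptions; the paper's experiments use mini-batch SGD and its variants on real architectures and datasets.

```python
from collections import Counter

import torch
import torch.nn as nn

def train_once(X_train, y_train, X_test, epochs=500, lr=0.01):
    """One training run; returns the learned function restricted to X_test.

    y_train is a float tensor of 0/1 labels. Full-batch updates are used
    here for simplicity; mini-batch SGD is a drop-in change.
    """
    model = nn.Sequential(nn.Linear(X_train.shape[1], 200), nn.ReLU(),
                          nn.Linear(200, 1))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X_train).squeeze(-1), y_train)
        loss.backward()
        opt.step()
        if loss.item() < 1e-3:  # stop once the training set S is fit
            break
    with torch.no_grad():
        preds = (model(X_test).squeeze(-1) > 0).int()
    return tuple(preds.tolist())  # hashable identifier for the function f

def estimate_p_sgd(X_train, y_train, X_test, n_runs=1000):
    """Empirical distribution over functions found by repeated SGD runs."""
    counts = Counter(train_once(X_train, y_train, X_test)
                     for _ in range(n_runs))
    return {f: c / n_runs for f, c in counts.items()}
```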
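
P_B(f|S) can be sketched in the same function-space terms. The paper estimates it with Gaussian processes whose kernel matches the network's infinite-width limit; the rejection-sampling version below is a simpler, slower stand-in (an assumption, not the paper's estimator) that directly implements "random sampling conditioned on S": draw functions from the GP prior, keep draws whose signs reproduce the training labels, and record the induced test-label vectors. The order-1 arccosine kernel (the infinite-width one-hidden-layer ReLU kernel) is an assumed choice, and this brute-force approach is only feasible for very small training sets.

```python
from collections import Counter

import numpy as np

def arccos_kernel(X1, X2):
    """Order-1 arccosine kernel: infinite-width one-hidden-layer ReLU net."""
    n1 = np.linalg.norm(X1, axis=1)[:, None]
    n2 = np.linalg.norm(X2, axis=1)[None, :]
    cos = np.clip((X1 @ X2.T) / (n1 * n2 + 1e-12), -1.0, 1.0)
    theta = np.arccos(cos)
    return (n1 * n2 / np.pi) * (np.sin(theta) + (np.pi - theta) * cos)

def estimate_p_b(X_train, y_train, X_test, n_samples=200_000, seed=0):
    """Rejection-sampling estimate of the Bayesian posterior over test labels."""
    rng = np.random.default_rng(seed)
    X = np.vstack([X_train, X_test])
    K = arccos_kernel(X, X) + 1e-8 * np.eye(len(X))  # jitter for stability
    L = np.linalg.cholesky(K)
    n_tr = len(X_train)
    counts, accepted = Counter(), 0
    for _ in range(n_samples):
        f = L @ rng.standard_normal(len(X))  # draw from the GP prior N(0, K)
        labels = (f > 0).astype(int)
        if np.array_equal(labels[:n_tr], y_train):  # keep draws consistent with S
            counts[tuple(labels[n_tr:])] += 1
            accepted += 1
    return {f: c / accepted for f, c in counts.items()} if accepted else {}
```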
