Article

Anomalous diffusion dynamics of learning in deep neural networks

Journal

NEURAL NETWORKS
Volume 149, Issue -, Pages 18-28

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.neunet.2022.01.019

Keywords

Deep neural networks; Stochastic gradient descent; Complex systems

Funding

  1. Australian Research Council [DP160104316, DP160104368]


This study reveals the effectiveness of stochastic gradient descent (SGD) in deep learning by investigating its interactions with the geometrical structure of the loss landscape. The study finds that SGD exhibits rich, complex dynamics with superdiffusion in the initial learning phase and subdiffusion at long times. These learning dynamics are observed in different types of deep neural networks and are independent of batch size and learning rate settings. The superdiffusion process is attributed to the interactions between SGD and fractal-like regions of the loss landscape.
Learning in deep neural networks (DNNs) is implemented through minimizing a highly non-convex loss function, typically by a stochastic gradient descent (SGD) method. This learning process can effectively find generalizable solutions at flat minima. In this study, we present a novel account of how such effective deep learning emerges through the interactions of the SGD and the geometrical structure of the loss landscape. We find that the SGD exhibits rich, complex dynamics when navigating through the loss landscape; initially, the SGD exhibits superdiffusion, which attenuates gradually and changes to subdiffusion at long times when approaching a solution. Such learning dynamics happen ubiquitously in different DNN types such as ResNet, VGG-like networks and Vision Transformers; similar results emerge for various batch size and learning rate settings. The superdiffusion process during the initial learning phase indicates that the motion of SGD along the loss landscape possesses intermittent, big jumps; this non-equilibrium property enables the SGD to effectively explore the loss landscape. By adapting methods developed for studying energy landscapes in complex physical systems, we find that such superdiffusive learning processes are due to the interactions of the SGD and the fractal-like regions of the loss landscape. We further develop a phenomenological model to demonstrate the mechanistic role of the fractal-like loss landscape in enabling the SGD to effectively find flat minima. Our results reveal the effectiveness of SGD in deep learning from a novel perspective and have implications for designing efficient deep neural networks. (C) 2022 Elsevier Ltd. All rights reserved.
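The superdiffusion/subdiffusion claim is conventionally quantified through the mean squared displacement (MSD) of the weight vector in parameter space, MSD(t) ∝ t^α, where α > 1 indicates superdiffusion and α < 1 subdiffusion. The sketch below is not the authors' code; it is a minimal illustration of how such an exponent could be estimated from an SGD trajectory. The toy model, data, optimizer settings, and fit windows are all illustrative assumptions.

```python
# Minimal sketch (not the paper's implementation): estimate the
# anomalous-diffusion exponent alpha of SGD in weight space, where
# MSD(t) ~ t^alpha. alpha > 1 suggests superdiffusion, alpha < 1
# subdiffusion. The model, data, and hyperparameters are illustrative.
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy regression task and a small MLP stand in for the DNNs in the paper.
X = torch.randn(512, 20)
y = torch.randn(512, 1)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

def flat_weights(m):
    """Concatenate all parameters into a single 1-D weight vector."""
    return torch.cat([p.detach().reshape(-1) for p in m.parameters()]).numpy()

w0 = flat_weights(model)
displacements = []  # squared distance from the initial weights at each step

for step in range(2000):
    idx = torch.randint(0, X.shape[0], (32,))  # mini-batch sampling
    opt.zero_grad()
    loss_fn(model(X[idx]), y[idx]).backward()
    opt.step()
    w = flat_weights(model)
    displacements.append(np.sum((w - w0) ** 2))

msd = np.array(displacements)
t = np.arange(1, len(msd) + 1)

def fit_alpha(lo, hi):
    """Fit MSD ~ t^alpha on a log-log scale over steps [lo, hi)."""
    return np.polyfit(np.log(t[lo:hi]), np.log(msd[lo:hi]), 1)[0]

print(f"early alpha ~ {fit_alpha(1, 200):.2f}")     # > 1: superdiffusion
print(f"late  alpha ~ {fit_alpha(1000, 2000):.2f}") # < 1: subdiffusion
```

Note that this sketch tracks a single trajectory from initialization; a proper MSD estimate of the kind reported in the paper would average over an ensemble of trajectories (different random seeds and mini-batch orderings) before fitting the exponent.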


