Article

Anomalous diffusion dynamics of learning in deep neural networks

Journal

NEURAL NETWORKS
Volume 149, Pages 18-28

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.neunet.2022.01.019

Keywords

Deep neural networks; Stochastic gradient descent; Complex systems

Funding

  1. Australian Research Council [DP160104316, DP160104368]

Summary

This study examines the effectiveness of stochastic gradient descent (SGD) in deep learning by investigating its interactions with the geometrical structure of the loss landscape. SGD exhibits rich, complex dynamics: superdiffusion in the initial learning phase, crossing over to subdiffusion at long times. These learning dynamics are observed across different types of deep neural networks and persist under various batch size and learning rate settings. The superdiffusion is attributed to the interactions between SGD and fractal-like regions of the loss landscape.
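
For reference, the super/subdiffusion terminology follows the standard characterization of anomalous diffusion via the mean squared displacement (MSD) of the parameter vector \theta_t over a time lag \tau (a textbook definition, not specific to this paper):

\mathrm{MSD}(\tau) = \left\langle \left\| \theta_{t+\tau} - \theta_t \right\|^{2} \right\rangle \propto \tau^{\alpha}

with \alpha > 1 for superdiffusion, \alpha = 1 for normal (Brownian) diffusion, and \alpha < 1 for subdiffusion.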
Abstract

Learning in deep neural networks (DNNs) is implemented by minimizing a highly non-convex loss function, typically with a stochastic gradient descent (SGD) method. This learning process can effectively find generalizable solutions at flat minima. In this study, we present a novel account of how such effective deep learning emerges through the interactions of SGD and the geometrical structure of the loss landscape. We find that SGD exhibits rich, complex dynamics when navigating the loss landscape: initially, SGD exhibits superdiffusion, which attenuates gradually and changes to subdiffusion at long times as it approaches a solution. Such learning dynamics occur ubiquitously across different DNN types, including ResNet, VGG-like networks and Vision Transformers, and similar results emerge for various batch size and learning rate settings. The superdiffusion during the initial learning phase indicates that the motion of SGD along the loss landscape involves intermittent, large jumps; this non-equilibrium property enables SGD to explore the loss landscape effectively. By adapting methods developed for studying energy landscapes in complex physical systems, we find that such superdiffusive learning processes are due to the interactions of SGD with fractal-like regions of the loss landscape. We further develop a phenomenological model to demonstrate the mechanistic role of the fractal-like loss landscape in enabling SGD to find flat minima effectively. Our results reveal the effectiveness of SGD in deep learning from a novel perspective and have implications for designing efficient deep neural networks.
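
To make the measurement concrete, the following minimal Python sketch (not the authors' implementation; the toy model, random data, and all hyperparameters are illustrative placeholders) records the flattened parameter vector after each SGD step and fits the MSD scaling exponent alpha on a log-log scale:

# Minimal sketch (not the authors' code): estimate the diffusion exponent
# alpha of SGD by tracking the mean squared displacement (MSD) of the
# flattened parameter vector during training. The model, data and
# hyperparameters below are illustrative placeholders.
import numpy as np
import torch
import torch.nn as nn

def flat_params(model):
    # Concatenate all parameters into one 1-D vector (a point on the loss landscape).
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])

def msd_exponent(snapshots, lags):
    # Fit MSD(tau) ~ tau^alpha on a log-log scale.
    # snapshots: (T, D) array of parameter vectors, one per SGD step.
    msd = []
    for tau in lags:
        disp = snapshots[tau:] - snapshots[:-tau]        # displacements over lag tau
        msd.append(np.mean(np.sum(disp ** 2, axis=1)))   # mean squared norm
    alpha, _ = np.polyfit(np.log(lags), np.log(msd), 1)  # slope = exponent alpha
    return alpha

# Toy training run on random data.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(512, 20), torch.randint(0, 2, (512,))

snapshots = []
for step in range(400):
    idx = torch.randint(0, 512, (32,))                   # random mini-batch
    opt.zero_grad()
    loss_fn(model(x[idx]), y[idx]).backward()
    opt.step()
    snapshots.append(flat_params(model).numpy())

alpha = msd_exponent(np.stack(snapshots), np.arange(1, 50))
print(f"estimated diffusion exponent alpha = {alpha:.2f}")

In the regime the paper describes, one would look for alpha > 1 on short lags early in training and alpha < 1 on long lags near convergence; a single fit over the whole run, as above, only gives a crude average exponent.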
