
The inverse variance-flatness relation in stochastic gradient descent is critical for finding flat minima.

Yu Feng, Yuhai Tu.

Abstract

Despite the tremendous success of the stochastic gradient descent (SGD) algorithm in deep learning, little is known about how SGD finds generalizable solutions at flat minima of the loss function in high-dimensional weight space. Here, we investigate the connection between SGD learning dynamics and the loss function landscape. A principal component analysis (PCA) shows that SGD dynamics follow a low-dimensional drift-diffusion motion in the weight space. Around a solution found by SGD, the loss function landscape can be characterized by its flatness in each PCA direction. Remarkably, our study reveals a robust inverse relation between the weight variance and the landscape flatness in all PCA directions, which is the opposite of the fluctuation-response relation (aka Einstein relation) in equilibrium statistical physics. To understand the inverse variance-flatness relation, we develop a phenomenological theory of SGD based on the statistical properties of the ensemble of minibatch loss functions. We find that both the anisotropic SGD noise strength (temperature) and its correlation time depend inversely on the landscape flatness in each PCA direction. Our results suggest that SGD serves as a landscape-dependent annealing algorithm: the effective temperature decreases with the landscape flatness, so the system seeks out (prefers) flat minima over sharp ones. Based on these insights, an algorithm with landscape-dependent constraints is developed to mitigate catastrophic forgetting efficiently when learning multiple tasks sequentially. In general, our work provides a theoretical framework for understanding learning dynamics, which may eventually lead to better algorithms for different learning tasks.
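The PCA analysis of SGD dynamics described in the abstract can be sketched in a few lines. This is a minimal illustrative toy setup, not the paper's actual experiment: logistic regression on synthetic data, with all hyperparameter names and values (`lr`, `batch_size`, `n_steps`) chosen only for the sketch. The idea is to record the weight trajectory after SGD has settled near a minimum, then diagonalize the covariance of the weight fluctuations; the eigenvalues are the weight variances along each PCA direction, which the paper relates inversely to the landscape flatness in that direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: logistic regression on synthetic data.
X = rng.normal(size=(512, 10))
y = (X @ rng.normal(size=10) > 0).astype(float)

w = np.zeros(10)
lr, batch_size, n_steps = 0.5, 32, 2000
trajectory = []

for t in range(n_steps):
    idx = rng.integers(0, len(X), size=batch_size)
    z = np.clip(X[idx] @ w, -30.0, 30.0)          # logits, clipped for stability
    p = 1.0 / (1.0 + np.exp(-z))                   # sigmoid
    grad = X[idx].T @ (p - y[idx]) / batch_size    # minibatch gradient
    w -= lr * grad
    if t >= n_steps // 2:                          # keep only late-time (settled) samples
        trajectory.append(w.copy())

W = np.array(trajectory)
W_centered = W - W.mean(axis=0)

# PCA of weight fluctuations: eigendecomposition of the sample covariance.
cov = W_centered.T @ W_centered / len(W_centered)
variances, directions = np.linalg.eigh(cov)
variances = variances[::-1]                        # descending: sigma_1^2 >= sigma_2^2 >= ...
directions = directions[:, ::-1]

print(variances)
```

In the paper's setting, one would additionally probe the minibatch loss profile along each column of `directions` to estimate the flatness there and compare it against the corresponding entry of `variances`.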

Keywords:  generalization; loss landscape; machine learning; statistical physics; stochastic gradient descent

Year:  2021        PMID: 33619091      PMCID: PMC7936325          DOI: 10.1073/pnas.2015617118

Source DB:  PubMed          Journal:  Proc Natl Acad Sci U S A        ISSN: 0027-8424            Impact factor:   11.205


  9 in total

1.  Structure of stochastic dynamics near fixed points.

Authors:  Chulan Kwon; Ping Ao; David J Thouless
Journal:  Proc Natl Acad Sci U S A       Date:  2005-09-01       Impact factor: 11.205

2.  Optimization by simulated annealing.

Authors:  S Kirkpatrick; C D Gelatt; M P Vecchi
Journal:  Science       Date:  1983-05-13       Impact factor: 47.728

3.  Flat minima.

Authors:  S Hochreiter; J Schmidhuber
Journal:  Neural Comput       Date:  1997-01-01       Impact factor: 2.026

4.  Deep learning. (Review)

Authors:  Yann LeCun; Yoshua Bengio; Geoffrey Hinton
Journal:  Nature       Date:  2015-05-28       Impact factor: 49.962

5.  Reconciling modern machine-learning practice and the classical bias-variance trade-off.

Authors:  Mikhail Belkin; Daniel Hsu; Siyuan Ma; Soumik Mandal
Journal:  Proc Natl Acad Sci U S A       Date:  2019-07-24       Impact factor: 11.205

6.  Overcoming catastrophic forgetting in neural networks.

Authors:  James Kirkpatrick; Razvan Pascanu; Neil Rabinowitz; Joel Veness; Guillaume Desjardins; Andrei A Rusu; Kieran Milan; John Quan; Tiago Ramalho; Agnieszka Grabska-Barwinska; Demis Hassabis; Claudia Clopath; Dharshan Kumaran; Raia Hadsell
Journal:  Proc Natl Acad Sci U S A       Date:  2017-03-14       Impact factor: 11.205

7.  Unreasonable effectiveness of learning neural networks: From accessible states and robust ensembles to basic algorithmic schemes.

Authors:  Carlo Baldassi; Christian Borgs; Jennifer T Chayes; Alessandro Ingrosso; Carlo Lucibello; Luca Saglietti; Riccardo Zecchina
Journal:  Proc Natl Acad Sci U S A       Date:  2016-11-15       Impact factor: 11.205

8.  A mean field view of the landscape of two-layer neural networks.

Authors:  Song Mei; Andrea Montanari; Phan-Minh Nguyen
Journal:  Proc Natl Acad Sci U S A       Date:  2018-07-27       Impact factor: 11.205

9.  Shaping the learning landscape in neural networks around wide flat minima.

Authors:  Carlo Baldassi; Fabrizio Pittorino; Riccardo Zecchina
Journal:  Proc Natl Acad Sci U S A       Date:  2019-12-23       Impact factor: 11.205

  2 in total

1.  Let the robotic games begin.

Authors:  Herbert Levine
Journal:  Proc Natl Acad Sci U S A       Date:  2022-04-19       Impact factor: 12.779

2.  Understanding cytoskeletal avalanches using mechanical stability analysis.

Authors:  Carlos Floyd; Herbert Levine; Christopher Jarzynski; Garegin A Papoian
Journal:  Proc Natl Acad Sci U S A       Date:  2021-10-12       Impact factor: 11.205

