
The inverse variance-flatness relation in stochastic gradient descent is critical for finding flat minima.

Yu Feng, Yuhai Tu.

Abstract

Despite the tremendous success of the stochastic gradient descent (SGD) algorithm in deep learning, little is known about how SGD finds generalizable solutions at flat minima of the loss function in high-dimensional weight space. Here, we investigate the connection between SGD learning dynamics and the loss function landscape. A principal component analysis (PCA) shows that SGD dynamics follow a low-dimensional drift-diffusion motion in the weight space. Around a solution found by SGD, the loss function landscape can be characterized by its flatness in each PCA direction. Remarkably, our study reveals a robust inverse relation between the weight variance and the landscape flatness in all PCA directions, which is the opposite of the fluctuation-response relation (aka Einstein relation) in equilibrium statistical physics. To understand the inverse variance-flatness relation, we develop a phenomenological theory of SGD based on the statistical properties of the ensemble of minibatch loss functions. We find that both the anisotropic SGD noise strength (temperature) and its correlation time depend inversely on the landscape flatness in each PCA direction. Our results suggest that SGD serves as a landscape-dependent annealing algorithm: the effective temperature decreases with the landscape flatness, so the system seeks out (prefers) flat minima over sharp ones. Based on these insights, an algorithm with landscape-dependent constraints is developed to mitigate catastrophic forgetting efficiently when learning multiple tasks sequentially. In general, our work provides a theoretical framework for understanding learning dynamics, which may eventually lead to better algorithms for different learning tasks.
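The PCA analysis of SGD dynamics described in the abstract can be sketched in a few lines. This is a minimal illustrative toy setup, not the paper's actual experiment: logistic regression on synthetic data, with all hyperparameter names and values (`lr`, `batch_size`, `n_steps`) chosen only for the sketch. The idea is to record the weight trajectory after SGD has settled near a minimum, then diagonalize the covariance of the weight fluctuations; the eigenvalues are the weight variances along each PCA direction, which the paper relates inversely to the landscape flatness in that direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: logistic regression on synthetic data.
X = rng.normal(size=(512, 10))
y = (X @ rng.normal(size=10) > 0).astype(float)

w = np.zeros(10)
lr, batch_size, n_steps = 0.5, 32, 2000
trajectory = []

for t in range(n_steps):
    idx = rng.integers(0, len(X), size=batch_size)
    z = np.clip(X[idx] @ w, -30.0, 30.0)          # logits, clipped for stability
    p = 1.0 / (1.0 + np.exp(-z))                   # sigmoid
    grad = X[idx].T @ (p - y[idx]) / batch_size    # minibatch gradient
    w -= lr * grad
    if t >= n_steps // 2:                          # keep only late-time (settled) samples
        trajectory.append(w.copy())

W = np.array(trajectory)
W_centered = W - W.mean(axis=0)

# PCA of weight fluctuations: eigendecomposition of the sample covariance.
cov = W_centered.T @ W_centered / len(W_centered)
variances, directions = np.linalg.eigh(cov)
variances = variances[::-1]                        # descending: sigma_1^2 >= sigma_2^2 >= ...
directions = directions[:, ::-1]

print(variances)
```

In the paper's setting, one would additionally probe the minibatch loss profile along each column of `directions` to estimate the flatness there and compare it against the corresponding entry of `variances`.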

Keywords:  generalization; loss landscape; machine learning; statistical physics; stochastic gradient descent

Year:  2021        PMID: 33619091      PMCID: PMC7936325          DOI: 10.1073/pnas.2015617118

Source DB:  PubMed          Journal:  Proc Natl Acad Sci U S A        ISSN: 0027-8424            Impact factor:   11.205


  9 in total

1.  Structure of stochastic dynamics near fixed points.

Authors:  Chulan Kwon; Ping Ao; David J Thouless
Journal:  Proc Natl Acad Sci U S A       Date:  2005-09-01       Impact factor: 11.205

2.  Optimization by simulated annealing.

Authors:  S Kirkpatrick; C D Gelatt; M P Vecchi
Journal:  Science       Date:  1983-05-13       Impact factor: 47.728

3.  Flat minima.

Authors:  S Hochreiter; J Schmidhuber
Journal:  Neural Comput       Date:  1997-01-01       Impact factor: 2.026

4.  Deep learning. (Review)

Authors:  Yann LeCun; Yoshua Bengio; Geoffrey Hinton
Journal:  Nature       Date:  2015-05-28       Impact factor: 49.962

5.  Reconciling modern machine-learning practice and the classical bias-variance trade-off.

Authors:  Mikhail Belkin; Daniel Hsu; Siyuan Ma; Soumik Mandal
Journal:  Proc Natl Acad Sci U S A       Date:  2019-07-24       Impact factor: 11.205

6.  Overcoming catastrophic forgetting in neural networks.

Authors:  James Kirkpatrick; Razvan Pascanu; Neil Rabinowitz; Joel Veness; Guillaume Desjardins; Andrei A Rusu; Kieran Milan; John Quan; Tiago Ramalho; Agnieszka Grabska-Barwinska; Demis Hassabis; Claudia Clopath; Dharshan Kumaran; Raia Hadsell
Journal:  Proc Natl Acad Sci U S A       Date:  2017-03-14       Impact factor: 11.205

7.  Unreasonable effectiveness of learning neural networks: From accessible states and robust ensembles to basic algorithmic schemes.

Authors:  Carlo Baldassi; Christian Borgs; Jennifer T Chayes; Alessandro Ingrosso; Carlo Lucibello; Luca Saglietti; Riccardo Zecchina
Journal:  Proc Natl Acad Sci U S A       Date:  2016-11-15       Impact factor: 11.205

8.  A mean field view of the landscape of two-layer neural networks.

Authors:  Song Mei; Andrea Montanari; Phan-Minh Nguyen
Journal:  Proc Natl Acad Sci U S A       Date:  2018-07-27       Impact factor: 11.205

9.  Shaping the learning landscape in neural networks around wide flat minima.

Authors:  Carlo Baldassi; Fabrizio Pittorino; Riccardo Zecchina
Journal:  Proc Natl Acad Sci U S A       Date:  2019-12-23       Impact factor: 11.205

  2 in total

1.  Let the robotic games begin.

Authors:  Herbert Levine
Journal:  Proc Natl Acad Sci U S A       Date:  2022-04-19       Impact factor: 12.779

2.  Understanding cytoskeletal avalanches using mechanical stability analysis.

Authors:  Carlos Floyd; Herbert Levine; Christopher Jarzynski; Garegin A Papoian
Journal:  Proc Natl Acad Sci U S A       Date:  2021-10-12       Impact factor: 11.205

