Barren plateaus in quantum neural network training landscapes.

Jarrod R McClean¹, Sergio Boixo², Vadim N Smelyanskiy³, Ryan Babbush⁴, Hartmut Neven⁴.

Abstract

Many experimental proposals for noisy intermediate scale quantum devices involve training a parameterized quantum circuit with a classical optimization loop. Such hybrid quantum-classical algorithms are popular for applications in quantum simulation, optimization, and machine learning. Due to their simplicity and hardware efficiency, random circuits are often proposed as initial guesses for exploring the space of quantum states. We show that the exponential dimension of Hilbert space and the gradient estimation complexity make this choice unsuitable for hybrid quantum-classical algorithms run on more than a few qubits. Specifically, we show that for a wide class of reasonable parameterized quantum circuits, the probability that the gradient along any reasonable direction is non-zero to some fixed precision is exponentially small as a function of the number of qubits. We argue that this is related to the 2-design characteristic of random circuits, and that solutions to this problem must be studied.

Year: 2018 PMID: 30446662 PMCID: PMC6240101 DOI: 10.1038/s41467-018-07090-4

Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919

Introduction

Rapid developments in quantum hardware have motivated advances in algorithms to run in the so-called noisy intermediate scale quantum (NISQ) regime[1]. Many of the most promising application-oriented approaches are hybrid quantum–classical algorithms that rely on optimization of a parameterized quantum circuit[2-8]. The resilience of these approaches to certain types of errors and high flexibility with respect to coherence time and gate requirements make them especially attractive for NISQ implementations[3,9-11]. The first implementation of such algorithms was developed in the context of quantum simulation with the variational quantum eigensolver[2,3]. This algorithm has been successfully demonstrated on a number of experimental setups with extensions to excited states and other forms of incoherent error mitigation[2,9,12-16]. Since then, the quantum approximate optimization algorithm was developed in a similar context to address hard optimization problems[5,17-19]. This algorithm has also been demonstrated on quantum devices[20]. These approaches have even been extended to both quantum machine learning and error correction[6,7,20-23]. While the precise formulation of these methods and their domains of applicability differ considerably, they typically tend to rely upon the optimization of some parameterized unitary circuit with respect to an objective function that is typically a simple sum of Pauli operators or fidelity with respect to some state. This framework is reminiscent of the methodology of classical neural networks[23,24]. As with any non-linear optimization, the choice of both the parameterization and the initial state is important. In quantum simulation, there is often a choice inspired by physical domain knowledge[3,17,25-29]. However, in all domains of applicability, there have been implementations that utilize parametrized random circuits of varying depth[7,13,21,23,30]. 
Within quantum simulation, that approach has been referred to as a “hardware efficient ansatz”[13]. This is in contrast to the previous proposals, such as the variational quantum eigensolver[2,3,9], which used parametrized structured circuits inspired by the problem at hand, such as unitary coupled cluster. When little structure is known about the problem, or constraints of the existing quantum hardware may prevent utilizing that structure, choosing a random implementable circuit seems to provide an unbiased choice. One might also expect, based on recent experimental designs for “quantum supremacy”, that random quantum circuits are a powerful tool for such a task[31]. Also, despite concerns about gradient-based methods in classical deep neural networks[32-34], these methods are successful[24], even when using random initialization[33,35]. However, in the quantum case one must remember that the estimation of even a single gradient component will scale as $O(1/\epsilon^{\alpha})$ for some small power $\alpha$[36], as opposed to classical implementations where the same is achieved in $O(\log(1/\epsilon))$ time, where $\epsilon$ is the desired accuracy in the gradient that is inevitably tied to its magnitude. We will present results related to random quantum circuits in the context of the exponential dimension of Hilbert space and gradient-based hybrid quantum–classical algorithms. A cartoon depiction of this is given in Fig. 1. We show that for a large class of random circuits, the average value of the gradient of the objective function is zero, and the probability that any given instance of such a random circuit deviates from this average value by a small constant $\epsilon$ is exponentially small in the number of qubits. This can be understood in the geometric context of concentration of measure[37-39] for high-dimensional spaces. When the measure of the space concentrates in this way, the value of any reasonably smooth function will tend towards its average with probability exponentially close to one, a fact made formal by Levy’s lemma[40]. 
In our context, this means that the gradient is zero over vast reaches of quantum space. The region where the gradient is zero does not correspond to local minima of interest, but rather to an exponentially large plateau of states that have exponentially small deviations in the objective value from the average of the totally mixed state. We argue that the depth of the circuits which achieve these undesirable properties is modest, requiring only $O(n^{1/d})$ depth circuits on a $d$-dimensional array, and numerically evaluate the constant factors one expects to encounter for small instances of this kind. While our results highlight the importance of avoiding random initialization in parametric circuit approaches, they do not discount the value of random quantum circuits in other applications such as information security or demonstrations of quantum supremacy. We close with an outlook on how this result should shape strategies in ansatz design for scaling to larger experiments.

Fig. 1

Cartoon of concentration of quantum observables. The sphere depicts the phenomenon of concentration of measure in quantum space: the fraction of states that fall outside a fixed angular distance from zero along any coordinate decreases exponentially in the number of qubits[40]. This implies a flat plateau where observables concentrate on their average over Hilbert space and the gradient is exponentially small. The fact that only an exponentially small fraction of states fall outside of this band means that searches resembling random walks will have an exponentially small probability of exiting this “barren plateau”.

Results

Gradient concentration in random circuits

We will discuss random parameterized quantum circuits (RPQCs)

$$U(\boldsymbol{\theta}) = U(\theta_1, \ldots, \theta_L) = \prod_{l=1}^{L} U_l(\theta_l) W_l,$$

where $U_l(\theta_l) = \exp(-i\theta_l V_l)$, $V_l$ is a Hermitian operator, and $W_l$ is a generic unitary operator that does not depend on any angle $\theta_l$. Circuits of this form are a natural choice due to a straightforward evaluation of the gradient with respect to most objective functions and have been introduced in a number of contexts already[26,41]. Consider an objective function $E(\boldsymbol{\theta})$ expressed as the expectation value over some Hermitian operator $H$,

$$E(\boldsymbol{\theta}) = \langle 0 | U(\boldsymbol{\theta})^{\dagger} H U(\boldsymbol{\theta}) | 0 \rangle.$$

When the RPQCs are parameterized in this way, the gradient of the objective function takes a simple form:

$$\partial_k E \equiv \frac{\partial E(\boldsymbol{\theta})}{\partial \theta_k} = i \langle 0 | U_-^{\dagger} \left[ V_k, U_+^{\dagger} H U_+ \right] U_- | 0 \rangle,$$

where we introduce the notations $U_- \equiv \prod_{l=1}^{k-1} U_l(\theta_l) W_l$ and $U_+ \equiv \prod_{l=k}^{L} U_l(\theta_l) W_l$, and henceforth drop the subscript $k$ from $V_k \rightarrow V$ for ease of exposition. Finally, we will define our RPQCs $U(\boldsymbol{\theta})$ to have the property that for any gradient direction $\partial_k E$ defined above, the circuit implementing $U(\boldsymbol{\theta})$ is sufficiently random such that either $U_-$, $U_+$, or both match the Haar distribution up to the second moment, and the circuits $U_-$ and $U_+$ are independent.

Our results make use of properties of the Haar measure on the unitary group $d\mu_{\mathrm{Haar}}(U) \equiv d\mu(U)$, which is the unique left- and right-invariant measure such that

$$\int_{U(N)} d\mu(U)\, f(U) = \int d\mu(U)\, f(VU) = \int d\mu(U)\, f(UV)$$

for any $f(U)$ and $V \in U(N)$, where the integration domain will be implied to be $U(N)$ when not explicitly listed. While this property is valuable for proofs, quantum circuits that exactly achieve this invariance generically require exponential resources. This motivates the concept of unitary $t$-designs[42-44], which satisfy the above properties for restricted classes of $f(U)$, often requiring only modest polynomial resources. Suppose $\{p_i, V_i\}$ is an ensemble of unitary operators, with unitary $V_i$ being sampled with probability $p_i$. The ensemble $\{p_i, V_i\}$ is a $t$-design if

$$\sum_i p_i V_i^{\otimes t} \rho \left( V_i^{\dagger} \right)^{\otimes t} = \int d\mu(U)\, U^{\otimes t} \rho \left( U^{\dagger} \right)^{\otimes t}.$$

This definition is equivalent to the property that if $f(U)$ is a polynomial of at most degree $t$ in the matrix elements of $U$ and at most degree $t$ in the matrix elements of $U^*$, then averaging over the $t$-design $\{p_i, V_i\}$ will yield the same result as averaging over the unitary group with respect to the Haar measure.

The average value of the gradient is a concept that requires additional specification because, for a given point, the gradient can only be defined in terms of the circuit that led to that point. We will use a practical definition that leads to the value we are interested in, namely

$$\langle \partial_k E \rangle = \int dU\, p(U)\, \partial_k \langle 0 | U(\boldsymbol{\theta})^{\dagger} H U(\boldsymbol{\theta}) | 0 \rangle,$$

where $p(U)$ is the probability distribution function of $U$. A review on the properties of products of independent random matrices can be found in ref.[45]. The assumptions of independence and of at least one of $U_-$ or $U_+$ forming a 1-design in our RPQCs imply that $\langle \partial_k E \rangle = 0$, as shown in the Methods.

Levy’s lemma informs our intuition about the expected variance of this quantity through simple geometric arguments. In particular, Haar random unitaries on $n$ qubits will output states uniformly in the $D = 2^n - 1$ dimensional hypersphere. The derivative with respect to the parameters $\theta$ is Lipschitz continuous with some parameter $\eta$ that depends on the operator $H$. Levy’s lemma then implies that the variance of measurements will decrease exponentially in the number of qubits. This intuition may be made more precise through explicit calculation of the variance, which is done in more detail in the Methods. The result to first order is

$$\mathrm{Var}[\partial_k E] \approx \begin{cases} -\dfrac{\mathrm{Tr}(\rho^2)}{2^{2n}}\, \mathrm{Tr} \left\langle \left[ V, u^{\dagger} H u \right]^2 \right\rangle_{U_+} \\[2ex] -\dfrac{\mathrm{Tr}(H^2)}{2^{2n}}\, \mathrm{Tr} \left\langle \left[ V, u \rho u^{\dagger} \right]^2 \right\rangle_{U_-} \\[2ex] \dfrac{1}{2^{3n-1}}\, \mathrm{Tr}(H^2)\, \mathrm{Tr}(\rho^2)\, \mathrm{Tr}(V^2) \end{cases}$$

where $\rho = |0\rangle\langle 0|$ is the initial state, the notation $\langle f(u) \rangle_{U_x}$ indicates the average with $u$ drawn from $p(U_x)$, and the first case corresponds to $U_-$ being a 2-design and not $U_+$, the second to $U_+$ being a 2-design but not $U_-$, and the third to both $U_+$ and $U_-$ being 2-designs. We emphasize the fact that this variance depends at most on polynomials of degree 2 in $U$ and polynomials of degree 2 in $U^*$. Whereas a unitary 2-design will exhibit the correct variance[43,46], a unitary 1-design will exhibit the correct average value, but not necessarily the variance. As a result, if a circuit is of sufficient depth that for any $\partial_k E$, either $U_-$ or $U_+$ forms a 2-design, then with high probability one will produce an ansatz state on a barren plateau of the quantum landscape, with no interesting search directions in sight.

From these results, it is clear that only one of $U_+$ or $U_-$ needs to be sufficiently random to poison the gradient for the remainder of the circuit. For example, while it is somewhat unintuitive, even the first element of a circuit, $k = 1$, will have a vanishing gradient due to the circuit following it, $U_+$. Additionally, we see that there is no detailed dependence on the structure of the operators $V_l$, other than the rate at which they help randomize the circuit, which determines at what depth one expects to find an approximate 2-design.
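The exponential decay described above can be checked numerically for small systems. The following is a minimal sketch, not the simulation used in this work: it assumes both $U_-$ and $U_+$ are drawn exactly from the Haar measure (sampled via a QR decomposition), and it picks $V = Z_1$ and $H = Z_1 Z_2$ for illustration. For these choices $\mathrm{Tr}(H^2) = \mathrm{Tr}(V^2) = 2^n$ and $\mathrm{Tr}(\rho^2) = 1$, so the leading-order variance when both halves are 2-designs, $\mathrm{Tr}(H^2)\mathrm{Tr}(\rho^2)\mathrm{Tr}(V^2)/2^{3n-1}$, reduces to $2^{1-n}$. The function names are hypothetical.

```python
import numpy as np

def haar_unitary(dim, rng):
    """Sample a Haar-random unitary via QR of a complex Gaussian matrix."""
    z = (rng.standard_normal((dim, dim)) + 1j * rng.standard_normal((dim, dim))) / np.sqrt(2)
    q, r = np.linalg.qr(z)
    d = np.diag(r)
    return q * (d / np.abs(d))  # rescale column phases so the sample is Haar distributed

def z_diagonal(qubit, n):
    """Diagonal of Pauli Z on `qubit` (0-indexed, qubit 0 is the most significant bit)."""
    bits = (np.arange(2 ** n) >> (n - 1 - qubit)) & 1
    return 1.0 - 2.0 * bits

def sample_gradient(n, rng):
    """One sample of i <0| U_-^dag [V, U_+^dag H U_+] U_- |0> with Haar U_- and U_+."""
    dim = 2 ** n
    v = np.diag(z_diagonal(0, n)).astype(complex)                     # illustrative V
    h = np.diag(z_diagonal(0, n) * z_diagonal(1, n)).astype(complex)  # illustrative H
    u_minus = haar_unitary(dim, rng)
    u_plus = haar_unitary(dim, rng)
    psi = u_minus[:, 0]                                               # U_- |0>
    h_plus = u_plus.conj().T @ h @ u_plus
    comm = v @ h_plus - h_plus @ v
    return (1j * (psi.conj() @ comm @ psi)).real

def gradient_variance(n, samples, rng):
    """Sample variance of the gradient over independent Haar draws."""
    grads = np.array([sample_gradient(n, rng) for _ in range(samples)])
    return grads.var()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for n in range(2, 7):
        print(n, gradient_variance(n, 300, rng), 2.0 ** (1 - n))
```

For very small $n$ the sampled variance is expected to sit somewhat below $2^{1-n}$, since that expression keeps only the dominant term in the inverse dimension.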

Numerical simulations

The previous section shows that for reasonable classes of RPQCs at a sufficient number of qubits and depth, one will end up on a barren plateau. Here we verify this result for even modest depth one-dimensional (1D) random circuits with numerical simulations. This helps to clarify the rate of concentration for realistic circuits and shows the transition as the circuit grows in length from a single layer to a circuit demonstrating statistics analogous to a 2-design. The circuits used in our numerical experiments begin with a layer of $R_Y(\pi/4) = \exp(-i\frac{\pi}{8}Y)$ gates to prevent $X$, $Y$, or $Z$ from being an especially preferential direction with respect to gradients. Then, the circuit proceeds by a number of layers. Each layer consists of a parallel application of single qubit rotations to all qubits, given by $R_{P_{i,j}}(\theta_{i,j})$, where $P_{i,j} \in \{X, Y, Z\}$ is chosen with uniform probability and $\theta_{i,j} \in [0, 2\pi)$ is also chosen uniformly. This layer is followed by a layer of 1D nearest neighbor controlled phase gates, as in Fig. 2. Thus, the number of angles is the number of qubits times the number of layers.

Fig. 2

Structure of quantum circuits. a The generic subunit of circuits we study in this work, with a parameterized component $U_l(\theta_l)$ and non-parameterized unit $W_l$ for each layer $l$. b Example schematic of the 1D random circuits used in our numerical experiments. The circuit begins with $R_Y(\pi/4)$ gates applied to all qubits followed by a specified number of layers of randomly chosen Pauli rotations applied to each qubit and then a 1D ladder of controlled Z gates. The initial $R_Y(\pi/4)$ gates are not repeated in each layer. The indices $i$ and $j$ in $\theta_{i,j}$ index the layer and qubit, respectively. For each layer and qubit, $P_{i,j} \in \{X, Y, Z\}$ and $\theta_{i,j} \in [0, 2\pi)$ are sampled independently.

The objective operator $H$ is chosen to be a single Pauli ZZ operator acting on the first and second qubits, $H = Z_1 Z_2$. The gradient is evaluated with respect to the first parameter, $\theta_{1,1}$. This simple choice helps to extract the exponential scaling. As complex objectives can be written as sums of these operators, the results for large objectives can be inferred from these numbers. Moreover, it is clear that for any polynomial sum of these operators, the exponential decay of the signal in the gradient will not be circumvented. From Fig. 3 we see that for a single 2-local Pauli term, both the expected value of the gradient and its spread decay exponentially as a function of the number of qubits even when the number of layers is a modest linear function. Empirically, for our linear connectivity, the required number of layers is about $10n$, where $n$ is the number of qubits, following the expected scaling of $O(n^{1/d})$, where $d$ is the dimension of the connectivity. For empirical reference, the expected gate depth in a chemistry ansatz such as unitary coupled cluster is at least $O(n^3)$, meaning that if the initial parameters were randomized, this effect could be expected on fewer than 10 orbitals, a truly small problem in chemical terms. We also observe in Fig. 4 that as the number of layers increases, there is a transition to a 2-design where the variance converges. This leads to a distinct plateau as the circuit length increases, where the height of the plateau is determined by the number of qubits. An additional example with an objective function defined by projection on a target state is provided as Supplementary Figures 1 and 2, showing the rapid decay of variance and similar plateaus as a function of circuit length. These results substantiate our conclusion that gradients in modest-sized random circuits tend to vanish without additional mitigating steps.
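The numerical experiment described in this section can be reproduced in outline with a small state-vector simulation. The sketch below is an illustration under stated assumptions rather than the code used for the figures: it assumes each rotation has the form $\exp(-i\theta P)$ (matching the RPQC definition, with the initial layer equal to $\exp(-i\frac{\pi}{8}Y)$), and it evaluates the derivative with respect to the first angle using a two-point shift rule, $E(\theta + \pi/4) - E(\theta - \pi/4)$, which is exact for this gate convention but is a technique chosen here, not one named in the text. All function names are hypothetical.

```python
import numpy as np

# Single-qubit Paulis X, Y, Z, indexed 0, 1, 2.
PAULIS = [
    np.array([[0, 1], [1, 0]], dtype=complex),
    np.array([[0, -1j], [1j, 0]], dtype=complex),
    np.array([[1, 0], [0, -1]], dtype=complex),
]

def rotation(pauli_index, theta):
    """exp(-i * theta * P) for a single-qubit Pauli P (assumed convention)."""
    return np.cos(theta) * np.eye(2) - 1j * np.sin(theta) * PAULIS[pauli_index]

def apply_1q(state, gate, qubit):
    """Apply a 2x2 gate to one axis of a state tensor of shape (2,)*n."""
    state = np.tensordot(gate, state, axes=([1], [qubit]))
    return np.moveaxis(state, 0, qubit)

def apply_cz(state, q1, q2):
    """Controlled-Z: flip the sign of amplitudes where both qubits are 1."""
    state = state.copy()
    index = [slice(None)] * state.ndim
    index[q1], index[q2] = 1, 1
    state[tuple(index)] *= -1
    return state

def energy(paulis, thetas):
    """<Z_1 Z_2> after the circuit; paulis and thetas have shape (layers, n), n >= 2."""
    layers, n = thetas.shape
    state = np.zeros((2,) * n, dtype=complex)
    state[(0,) * n] = 1.0
    for q in range(n):                      # initial layer, applied once
        state = apply_1q(state, rotation(1, np.pi / 8), q)
    for l in range(layers):
        for q in range(n):                  # random Pauli rotations on every qubit
            state = apply_1q(state, rotation(paulis[l, q], thetas[l, q]), q)
        for q in range(n - 1):              # 1D ladder of nearest-neighbor CZ gates
            state = apply_cz(state, q, q + 1)
    probs = np.abs(state) ** 2
    z = np.array([1.0, -1.0])
    return float(z @ probs.reshape(2, 2, -1).sum(axis=2) @ z)

def gradient_first_angle(paulis, thetas):
    """dE/d(theta_{1,1}) via the shift rule for exp(-i theta P) gates."""
    plus, minus = thetas.copy(), thetas.copy()
    plus[0, 0] += np.pi / 4
    minus[0, 0] -= np.pi / 4
    return energy(paulis, plus) - energy(paulis, minus)

def sample_gradient_variance(n, layers, samples, rng):
    """Sample variance of the gradient over random Paulis and angles."""
    grads = []
    for _ in range(samples):
        paulis = rng.integers(0, 3, size=(layers, n))
        thetas = rng.uniform(0, 2 * np.pi, size=(layers, n))
        grads.append(gradient_first_angle(paulis, thetas))
    return float(np.var(grads))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for n in (2, 4, 6, 8):
        print(n, sample_gradient_variance(n, 10 * n, 100, rng))
```

The state is stored as a tensor with one axis per qubit so that single-qubit gates and CZ gates act on axes directly; this keeps the sketch short but limits it to small qubit counts.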

Fig. 3

Exponential decay of variance. The sample variance of the gradient of the energy for the first circuit component of a two-local Pauli term plotted as a function of the number of qubits on a semi-log plot. As predicted, an exponential decay is observed as a function of the number of qubits, n, for both the expected value and its spread. The slope of the fit line is indicative of the rate of exponential decay as determined by the operator

Fig. 4

Convergence to 2-design limit. Here we show the sample variance of the gradient of the energy for the first circuit component of a two-local Pauli term plotted as a function of the number of layers, L, in a 1D quantum circuit. The different lines correspond to all even numbers of qubits between 2 and 24, with 2 qubits being the top line, and the rest being ordered by qubit number. The dotted black lines depict the 2-design asymptotes for this Hamiltonian as determined by our analytic results. This shows the convergence of the second moment as a function of the number of layers to a fixed value determined by the number of qubits.

Contrast with gradients in classical deep networks

Finally, we contrast our results with the vanishing (and exploding) gradient problem of classical deep neural networks[32-34,47]. At least two key differences are present in the quantum case: (i) the different scaling of the vanishing gradient and (ii) the complexity of computing expected values. The gradient in a classical deep neural network can vanish exponentially in the number of layers[32,33], while in a quantum circuit the gradient may vanish exponentially in the number of qubits, as shown above. In the classical case, the gradient for a weight in a neuron depends on the sum of all the paths connecting that neuron to the output, and when the weights are initialized with random values the paths have random signs, which cancel the signal[32]. The number of paths is exponential in the number of layers. In the quantum case, the number of paths is exponential in the number of gates, and the paths also have random signs[31]. The gradient saturates to an exponential in the number of qubits because the output state is normalized. The estimation of the gradient for each training batch for a classical neural network is limited by machine precision and scales with $O(\log(1/\epsilon))$. Even if the gradient is small, as long as it is consistent enough between batches, the method may eventually succeed. On a quantum device, the cost of estimating the gradient scales as $O(1/\epsilon^{\alpha})$[36]. For any number of measurements much lower than $1/\|g\|^{\alpha}$, where $\|g\|$ is the norm of the gradient, a gradient-based optimization will result in a random walk. By concentration of measure, a random walk will have exponentially small probability of exiting the barren plateau. As a result, gradient descent without some additional strategy cannot circumvent this challenge on a quantum device in polynomial time.
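To put rough numbers on the measurement cost, the sketch below assumes plain repeated sampling of an observable with outcome standard deviation 1, so that the standard error falls as $1/\sqrt{\text{shots}}$. That corresponds to $\alpha = 2$, which is an assumption made here for illustration; the text only states some small power $\alpha$. It also assumes a typical gradient magnitude of $\sqrt{2^{1-n}}$, the leading-order root-mean-square value when $H$ and $V$ are Pauli operators and both halves of the circuit are 2-designs. The function names are hypothetical.

```python
import math

def typical_gradient(n_qubits):
    """Leading-order RMS gradient, sqrt(2^(1-n)), under the assumptions stated above."""
    return math.sqrt(2.0 ** (1 - n_qubits))

def shots_to_resolve(gradient, outcome_std=1.0, snr=1.0):
    """Shots needed so the standard error outcome_std / sqrt(shots) equals gradient / snr."""
    return math.ceil((snr * outcome_std / gradient) ** 2)

if __name__ == "__main__":
    for n in (5, 11, 21, 41):
        print(n, shots_to_resolve(typical_gradient(n)))
```

Under these assumptions the shot count per gradient component doubles with every added qubit, whereas a classical evaluation to the same relative precision only needs a number of digits that grows linearly.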

Discussion

We have seen both analytically and numerically that for a wide class of random quantum circuits, the expected values of observables concentrate to their averages over Hilbert space and gradients concentrate to zero. This represents an interesting statement about the geometry of quantum circuits and landscapes related to hybrid quantum–classical algorithms. More practically, it means that randomly initialized circuits of sufficient depth will find relatively little utility in hybrid quantum–classical algorithms. Historically, vanishing gradients may have played a role in the early winter of deep neural networks[32,34,47]. However, multiple techniques have been proposed to mitigate this problem[24,35,48,49], and the amount of training data and computational power available has grown substantially. One approach to avoid these landscapes in the quantum setting is to use structured initial guesses, such as those adopted in quantum simulation. Another possibility is to use pre-training segment by segment, which was an early success in the classical setting[48,50]. These or other alternatives must be studied if these ansatze are to be successful beyond a few qubits.

Methods

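We explicitly show that the expectation value of the gradient is 0 and that under our assumptions the variance decays exponentially in the number of qubits. By our definition of RPQCs, we have that for any specified direction $\partial_k E$, both $U_-$ and $U_+$ are independently distributed and either $U_-$ or $U_+$ matches the Haar distribution up to at least the second moment (it is a 2-design). The assumption of independence is equivalent to

$$p(U) = \int dU_+\, p(U_+) \int dU_-\, p(U_-)\, \delta(U_+ U_- - U),$$

which allows us to rewrite the expression as

$$\langle \partial_k E \rangle = i \int dU_-\, p(U_-)\, \mathrm{Tr} \left\{ \rho_- \left[ V, \int dU_+\, p(U_+)\, U_+^{\dagger} H U_+ \right] \right\}.$$

We will utilize explicit integration over the unitary group with respect to the Haar measure, which up to the first moment can be expressed as[51]

$$\int d\mu(U)\, U_{ij} U^*_{km} = \frac{\delta_{ik} \delta_{jm}}{N},$$

where $N$ is the dimension of the space, typically $2^n$ for $n$ qubits. Using this expression, one may readily verify that

$$M = \int d\mu(U)\, U O U^{\dagger} = \frac{\mathrm{Tr}(O)}{N} I,$$

which we use in the following. Now, making use of the assumption that either $U_+$ or $U_-$ matches the Haar measure up to the first moment (it is a 1-design), we first examine the case where $U_-$ is at least a 1-design and find that

$$\langle \partial_k E \rangle = i \int d\mu(U_-)\, \mathrm{Tr} \left\{ \rho_- \left[ V, \int dU_+\, p(U_+)\, U_+^{\dagger} H U_+ \right] \right\} = \frac{i}{N}\, \mathrm{Tr} \left\{ \left[ V, \int dU_+\, p(U_+)\, U_+^{\dagger} H U_+ \right] \right\} = 0,$$

where we have defined $\rho_- = U_- |0\rangle\langle 0| U_-^{\dagger}$ and used the fact that the trace of a commutator of trace class operators is zero. In the second case, where we assume $U_+$ is at least a 1-design,

$$\langle \partial_k E \rangle = i \int dU_-\, p(U_-)\, \mathrm{Tr} \left\{ \rho_- \left[ V, \int d\mu(U_+)\, U_+^{\dagger} H U_+ \right] \right\} = i\, \frac{\mathrm{Tr}(H)}{N} \int dU_-\, p(U_-)\, \mathrm{Tr} \left\{ \rho_- \left[ V, I \right] \right\} = 0.$$

An advantage of the explicit polynomial formulas is that they allow an analytic calculation of the variance as well, which allows precise specification of the coefficient in Levy’s lemma. In cases where the integrals depend on up to two powers of elements of $U$ and $U^*$, one may make use of the elementwise formula[51]

$$\int d\mu(U)\, U_{i_1 j_1} U_{i_2 j_2} U^*_{i'_1 j'_1} U^*_{i'_2 j'_2} = \frac{\delta_{i_1 i'_1} \delta_{i_2 i'_2} \delta_{j_1 j'_1} \delta_{j_2 j'_2} + \delta_{i_1 i'_2} \delta_{i_2 i'_1} \delta_{j_1 j'_2} \delta_{j_2 j'_1}}{N^2 - 1} - \frac{\delta_{i_1 i'_1} \delta_{i_2 i'_2} \delta_{j_1 j'_2} \delta_{j_2 j'_1} + \delta_{i_1 i'_2} \delta_{i_2 i'_1} \delta_{j_1 j'_1} \delta_{j_2 j'_2}}{N (N^2 - 1)}.$$

The variance of the gradient is defined by

$$\mathrm{Var}[\partial_k E] = \left\langle (\partial_k E)^2 \right\rangle,$$

as we have seen above that $\langle \partial_k E \rangle = 0$. Through use of the above formula for integration up to the second moment of the Haar distribution, one may evaluate this expression in 3 separate cases. For simplicity and relevance, we evaluate them in the asymptotic case including only the dominant contribution as determined by the inverse dimension. In the case where $U_-$ is a 2-design but not $U_+$,

$$\mathrm{Var}[\partial_k E] \approx -\frac{\mathrm{Tr}(\rho^2)}{2^{2n}}\, \mathrm{Tr} \left\langle \left[ V, H_u \right]^2 \right\rangle_{U_+},$$

where $H_u = u^{\dagger} H u$ and we have defined the notation $\langle f(u) \rangle_{U_x}$ to mean the average over $u$ sampled from $p(U_x)$. In the case where $U_+$ is a 2-design but not $U_-$,

$$\mathrm{Var}[\partial_k E] \approx -\frac{\mathrm{Tr}(H^2)}{2^{2n}}\, \mathrm{Tr} \left\langle \left[ V, \rho_u \right]^2 \right\rangle_{U_-},$$

where $\rho_u = u \rho u^{\dagger}$. Finally, in the case where both $U_+$ and $U_-$ are 2-designs,

$$\mathrm{Var}[\partial_k E] \approx 2\, \mathrm{Tr}(H^2)\, \mathrm{Tr}(\rho^2) \left( \frac{\mathrm{Tr}(V^2)}{2^{3n}} - \frac{\mathrm{Tr}(V)^2}{2^{4n}} \right).$$

In all cases, the exponential decay of the gradient as a function of the number of qubits is evident.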
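The first-moment identity used above, namely that averaging $U O U^{\dagger}$ over a 1-design gives $\mathrm{Tr}(O) I / N$, can be checked exactly on a small example. The sketch below uses the uniform ensemble of $n$-qubit Pauli strings, which is a standard example of a unitary 1-design; choosing this ensemble is an assumption made for illustration and is not taken from the text. It also confirms that the averaged gradient vanishes when $U_-$ is drawn from that ensemble, for an arbitrary Hermitian $V$ and $U_+^{\dagger} H U_+$. The function names are hypothetical.

```python
import itertools
import numpy as np

# Single-qubit identity and Paulis.
PAULIS = [
    np.eye(2, dtype=complex),
    np.array([[0, 1], [1, 0]], dtype=complex),
    np.array([[0, -1j], [1j, 0]], dtype=complex),
    np.array([[1, 0], [0, -1]], dtype=complex),
]

def pauli_strings(n):
    """Yield all 4**n tensor products of single-qubit Paulis."""
    for combo in itertools.product(PAULIS, repeat=n):
        op = np.array([[1.0 + 0j]])
        for p in combo:
            op = np.kron(op, p)
        yield op

def twirl(op, n):
    """Uniform average of P op P^dagger over all n-qubit Pauli strings."""
    total = np.zeros_like(op, dtype=complex)
    for p in pauli_strings(n):
        total += p @ op @ p.conj().T
    return total / 4 ** n

def mean_gradient(v, h_plus, n):
    """Average of i <0| P^dagger [V, H_+] P |0> with U_- = P uniform over Pauli strings."""
    comm = v @ h_plus - h_plus @ v
    rho = np.zeros((2 ** n, 2 ** n), dtype=complex)
    rho[0, 0] = 1.0                       # rho = |0><0|
    return (1j * np.trace(twirl(rho, n) @ comm)).real

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, dim = 2, 4
    o = rng.standard_normal((dim, dim)) + 1j * rng.standard_normal((dim, dim))
    print(np.allclose(twirl(o, n), np.trace(o) / dim * np.eye(dim)))
```

A 1-design of this kind reproduces the zero average but, as noted in the Results, need not reproduce the variance, so this check says nothing about how quickly the gradient concentrates.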
