
Analytic Function Approximation by Path-Norm-Regularized Deep Neural Networks.

Aleksandr Beknazaryan

Abstract

We show that neural networks with an absolute value activation function and with network path norm, network sizes and network weights having logarithmic dependence on 1/ε can ε-approximate functions that are analytic on certain regions of C^d.


Keywords:  analytic functions; deep neural networks; exponential convergence; path norm regularization

Year:  2022        PMID: 36010799      PMCID: PMC9407526          DOI: 10.3390/e24081136

Source DB:  PubMed          Journal:  Entropy (Basel)        ISSN: 1099-4300            Impact factor:   2.738


1. Introduction

Deep neural networks have found broad applications in many areas and disciplines, such as computer vision, speech and audio recognition, and natural language processing. Two of the main characteristics of a given class of neural networks are its complexity and its approximating capability. Once the activation function is selected, a class of networks is determined by the specification of the network architecture (namely, its depth and width) and the choice of network weights. Hence, the complexity of a given class is estimated by regularizing (one of) those parameters, and the approximation properties of the obtained regularized classes of networks are then investigated. The capability of shallow networks of depth 1 to approximate continuous functions is shown by the universal approximation theorem ([1]), and approximations of integrable functions by networks with fixed width are presented in [2]. Network-architecture-constrained approximations of analytic functions are given in [3], where it is shown that ReLU networks with depth depending logarithmically on 1/ε and with suitable width can ε-approximate analytic functions on closed subcubes of their domain. The weight regularization of networks is usually carried out by imposing a norm-related constraint on the network weights. The most popular types of such constraints include sparsity constraints, weight-norm constraints and path norm regularization (see [4,5,6] and references therein). Approximations of smooth functions by regularized sparse ReLU networks are given in [5,7], and exponential rates of approximation of analytic functions by weight-regularized networks are derived in [8].

Path-norm-regularized classes of deep ReLU networks are considered in [4], where, together with other characteristics, the Rademacher complexities of those classes are estimated. The independence of those estimates from the network size makes path norm regularization particularly remarkable. As the estimation only uses the Lipschitz continuity (with Lipschitz constant 1), the idempotency and the non-negative homogeneity of the ReLU function, it can be extended to networks with the absolute value activation function. Network characteristics similar to the path norm are also considered in [9,10], where they are called, respectively, a variation and a basis-path norm, and statistical features of classes of networks are described in terms of those characteristics.

The objective of the present paper is the construction of path-norm-regularized networks that approximate analytic functions exponentially fast. Our goal is to achieve such convergence rates with activations that are idempotent, non-negative homogeneous and Lipschitz continuous with Lipschitz constant 1, so that the constructed path-norm-regularized networks fall within the scope of the network classes studied in [4]. It turns out that networks with an absolute value activation function may suit this goal better than networks with a ReLU activation function. More precisely, we show that analytic functions can be ε-approximated by networks with an absolute value activation function and with the path norm, the depth, the width and the weights all depending logarithmically on 1/ε. Such an approximation holds (i) on suitable subsets of the hypercube for functions that are analytic there with absolutely convergent power series; and (ii) on the whole hypercube for functions that can be analytically continued to certain subsets of C^d.
Note that, since the network weights, as well as the total number of weights, depend logarithmically on 1/ε, the weight norms of the constructed approximating deep networks also have logarithmic dependence on 1/ε. Note also that the absolute value activation function considered in this paper is among the common built-in activation functions of the software-based neural network evolving method NEAT-Python ([11]). Training algorithms for networks with an absolute value activation function are developed in [12,13]. In addition, the VC-dimensions and the structures of the loss surfaces of neural networks with piecewise linear activation functions, including the absolute value function, are described in [14,15].

Notation: For a matrix W, we denote by |W| the matrix obtained by taking the absolute values of the entries of W, that is, (|W|)_{ij} = |W_{ij}|. For brevity of presentation, we will say that the matrix |W| is the absolute value of the matrix W (note that, in the literature, there are also other definitions of the notion of an absolute value of a matrix). The path norm of a neural network f is defined in Section 2. The degree of a monomial x^γ = x_1^{γ_1} ··· x_d^{γ_d} is defined to be γ_1 + ··· + γ_d. To ensure that the matrix–vector multiplications are well defined, vectors from R^k may, according to the context, be treated as matrices either from R^{k×1} or from R^{1×k}.
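Anticipating the formal definition given in Section 2, the following NumPy sketch illustrates the matrix absolute value notation and one common reading of the path norm of a bias-free network as the ℓ1 norm of the product |W_L| ··· |W_0|. The helper names abs_matrix and path_norm are ours, and the precise norm used in the paper should be taken from Section 2.

```python
import numpy as np

def abs_matrix(W):
    """Entrywise absolute value |W| of a weight matrix W."""
    return np.abs(W)

def path_norm(weights):
    """Path norm of a bias-free network with weight matrices W_0, ..., W_L,
    read here as the l1 norm of the product |W_L| |W_{L-1}| ... |W_0|, i.e.
    the sum over all input-to-output paths of products of absolute weights."""
    P = abs_matrix(weights[0])
    for W in weights[1:]:
        P = abs_matrix(W) @ P
    return float(np.sum(P))

# Toy two-layer network: input dimension 3 (including the added coordinate 1),
# one hidden layer of width 2, scalar output.
W0 = np.array([[1.0, -2.0, 0.5],
               [0.0,  1.0, -1.0]])
W1 = np.array([[0.5, -0.25]])
print(path_norm([W0, W1]))   # 2.25
```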

2. The Class of Approximant Networks

Neural networks are constituted of weight matrices, biases and nonlinear activation functions acting neuron-wise in the hidden layers. The biases, also called shift vectors, can be omitted by adding a fixed coordinate 1 to the input vector and correspondingly modifying the weight matrices. As the definition of the path norm of networks does not assume the presence of shift vectors, we add a coordinate 1 to the input vector and consider classes of neural networks of the form f(x) = W_L σ(W_{L-1} σ(... σ(W_0 (1, x)) ...)), where the W_i are the weight matrices and the width vector records the dimensions of the layers, the first of which is d + 1 and the last of which is 1. The number of hidden layers L determines the depth of the networks from this class and, in each layer, the activation function σ acts element-wise on its input vector. For f given as above, its path norm is the ℓ1 norm of the vector obtained as the product of the absolute values of the weight matrices of f, that is, of |W_L||W_{L-1}| ··· |W_0|. For B > 0, the path-norm-regularized subclass consists of the networks whose path norm is at most B.

As the results obtained in [4] indicate, path norm regularization is particularly well suited for networks whose activation function is (i) Lipschitz continuous with Lipschitz constant 1; (ii) idempotent, that is, σ(σ(x)) = σ(x) for all x; and (iii) non-negative homogeneous, that is, σ(λx) = λσ(x) for all λ ≥ 0. We therefore aim to choose an activation possessing those properties such that analytic functions can be approximated by networks from the regularized class with a small path norm constraint B. The most popular activation functions satisfying the above conditions are the ReLU function σ(x) = max(0, x) and the absolute value function σ(x) = |x|. Below, we show that, with the absolute value activation function, the path norms of approximant networks may be significantly smaller than the path norms of ReLU networks.

The standard technique of neural network function approximation relies on approximating the product function (x, y) ↦ xy, which then allows us to approximate monomials and polynomials of any desired degree. In [7], the approximation of the product is achieved by approximating the function x ↦ x². The latter is based on the observation that, for the triangle wave defined in (3) and its iterated compositions, and for any positive integer m, a geometrically weighted partial sum of those compositions approximates x² as in (4). The approximation of x² by networks with the ReLU activation function then follows from a representation of the triangle wave as a combination of ReLU units; thus, in this case, we will obtain matrices containing weights 2 and 4, which will make the path norm of the approximant networks large. Note that the same approach is also used in [3] for constructing ReLU network approximations of analytic functions. In [5], the approximation of the product is achieved by approximating a closely related quadratic function, which, in turn, is based on an analogous triangle-wave identity given in (5) and (6). Although in the representation (6) the coefficients (weights) all lie in a fixed bounded set, the approximant in this case does not have the geometrically decreasing factors present in the approximant in (4), which, again, results in large values of path norms. Therefore, in order to take advantage of the presence of those reducing weights, we would like to represent the function in (5) by a linear combination of activation functions with smaller coefficients. This is possible if, instead of the ReLU, we deploy the absolute value activation function σ(x) = |x|. Indeed, in this case, the function in (5) can be represented on the relevant interval as in (7). In the next section, we use the representation (7) to show that analytic functions can be ε-approximated by networks from the regularized class with the depth, the width and B, as well as the network weights, having logarithmic dependence on 1/ε. As all networks will have the same activation function σ(x) = |x|, the corresponding subscript will be omitted in what follows.
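The representation (7) itself is not reproduced above, but the mechanism it exploits can be illustrated with the well-known construction from [7]: on [0, 1] the triangle wave can be written with a single absolute value as g(x) = 1 − |2x − 1|, the function x² is approximated by x minus a geometrically weighted sum of iterated triangle waves, and products follow from the identity xy = ((x + y)² − x² − y²)/2. The NumPy sketch below is our illustration of that scheme, under the assumption that the paper's network uses the same ingredients; it is not the exact network constructed in the paper.

```python
import numpy as np

def g(x):
    """Triangle wave on [0, 1] written with one absolute value:
    g(x) = 1 - |2x - 1|, i.e. 2x for x <= 1/2 and 2(1 - x) for x > 1/2."""
    return 1.0 - np.abs(2.0 * x - 1.0)

def approx_square(x, m):
    """Approximation of x**2 on [0, 1] in the spirit of [7]:
    x**2 is approximated by x - sum_{s=1..m} g_s(x)/4**s, where g_s is the
    s-fold composition of g; the uniform error is at most 4**-(m + 1)."""
    gs, total = x, np.zeros_like(x, dtype=float)
    for s in range(1, m + 1):
        gs = g(gs)
        total = total + gs / 4.0**s
    return x - total

def approx_product(x, y, m):
    """Product on [0, 1]^2 via xy = ((x + y)**2 - x**2 - y**2)/2,
    rescaling (x + y)/2 back into [0, 1] before squaring."""
    return (4.0 * approx_square((x + y) / 2.0, m)
            - approx_square(x, m) - approx_square(y, m)) / 2.0

xs = np.linspace(0.0, 1.0, 1001)
for m in (2, 4, 8):
    err = np.max(np.abs(approx_square(xs, m) - xs**2))
    print(f"square, m={m}: max error {err:.2e} (bound {4.0**-(m + 1):.2e})")

xg, yg = np.meshgrid(xs[::10], xs[::10])
print("product, m=8: max error", np.max(np.abs(approx_product(xg, yg, 8) - xg * yg)))
```

Written this way, each triangle wave costs a single absolute value neuron, whereas a ReLU realization of the same wave needs several units carrying the weights 2 and 4 mentioned above; the paper's representation (7) pursues the same idea while keeping the coefficients, and hence the path norm, small.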

3. Results

We first construct a neural network with activation function σ(x) = |x| that, for a given m, simultaneously approximates all d-dimensional monomials of degree below a prescribed bound up to a prescribed error. The depth and the width of this network are of the orders given in Lemma 1 below, and the entries of the product of the absolute values of the matrices of the network admit a bound that does not depend on m. For a given degree bound, let the quantity appearing in the lemma denote the number of d-dimensional monomials of degree below that bound. Then the following holds.

Lemma 1. There is a neural network Mon, with the stated depth, width and accuracy, that simultaneously approximates all of those monomials; moreover, the entries of the product of the absolute values of its matrices satisfy the stated bound.

Taking m of logarithmic order in the above lemma, we obtain a neural network with L and the width having logarithmic dependence on 1/ε, which simultaneously approximates the monomials of degree at most the prescribed bound with error ε (up to a logarithmic factor). Moreover, the entries of the product of the absolute values of the matrices of this network will also have logarithmic dependence on 1/ε. Below, we use this property to construct neural network approximations of analytic and analytically continuable functions with approximation error ε and with network parameters of logarithmic order; this gives our first theorem (Theorem 1).

Note that an exponential convergence rate of deep ReLU network approximants on subintervals is also given in [3]. In our case, however, not only the depth and the width but also the path norm of the constructed network have logarithmic dependence on 1/ε. Note also that, in the above theorem, as the parameter describing the approximation subsets approaches 0, both the network size and B, as well as the approximation error, grow polynomially in its reciprocal. In the next theorem, we use the properties of Chebyshev series to derive an exponential convergence rate on the whole hypercube.

Recall that the Chebyshev polynomials are defined by T_0(x) = 1, T_1(x) = x and T_{n+1}(x) = 2xT_n(x) − T_{n-1}(x). Chebyshev polynomials play an important role in approximation theory ([16]), and, in particular, it is known ([17], Theorem 3.1) that if f is Lipschitz continuous on [−1, 1], then it has a unique representation as an absolutely and uniformly convergent Chebyshev series. Moreover, in case f can be analytically continued to an ellipse with foci −1 and 1 and with the sum of the semimajor and semiminor axes equal to some ρ > 1, the partial sums of the above Chebyshev series converge to f at a geometric rate, and the coefficients also decay at a geometric rate. This result was first derived by Bernstein in [18], and its extension to the multivariate case was given in [19]. Note that the analyticity condition imposed below places the relevant domain inside a region built from open ellipses with foci 0 and d and a prescribed leftmost point. For such a region and a bound F, let the corresponding class be the space of functions that can be analytically continued to the region and are bounded there by F. Using the extension of Bernstein's theorem to the multivariate case, we obtain Lemma 2, and combining Lemma 1 and Lemma 2 yields Theorem 2.

We conclude this part by estimating the weight regularization of the networks constructed in Theorem 2. First, the total number of weights in those networks is of logarithmic order in 1/ε. From (7), it follows that all of the weights of the network from Lemma 1 lie in a fixed bounded set. In Theorem 2, the network is obtained by adding, to a network from Lemma 1, a layer whose weights are the coefficients of the partial sums of the power series of the approximated function. Thus, using (8), we obtain that the weight norm of the network constructed in Theorem 2 is also of logarithmic order in 1/ε.
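The geometric decay behind Bernstein's theorem is easy to observe numerically. The following sketch is our one-dimensional illustration with an arbitrarily chosen analytic function (it is not the multivariate construction used in the paper): it fits Chebyshev interpolants of increasing degree and prints the uniform error together with the size of the last coefficient, both of which decay geometrically.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

f = lambda x: 1.0 / (2.0 + x**2)   # analytic in a Bernstein ellipse around [-1, 1]

xs = np.linspace(-1.0, 1.0, 2001)
for n in (4, 8, 16, 32):
    coeffs = C.chebinterpolate(f, n)                  # degree-n Chebyshev interpolant
    err = np.max(np.abs(C.chebval(xs, coeffs) - f(xs)))
    print(f"degree {n:2d}: max error {err:.2e}, |last coefficient| {abs(coeffs[-1]):.2e}")
```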

4. Proofs

In the following proofs, I_k denotes the k × k identity matrix and all of the networks have the activation σ(x) = |x|. The proof of Lemma 1 is based on the following two lemmas.

Lemma 3. For any positive integer m, there exists a neural network Mult that approximates the product of two inputs with the stated accuracy, and the product of the absolute values of the matrices presented in Mult has suitably bounded entries.

Proof. For a given length k, consider a row with a prescribed first entry, last entry equal to 1 and all other entries equal to 0, and let the corresponding matrix be obtained by adding this row to the identity matrix; in addition, consider the auxiliary rectangular matrix used in the construction. It then follows from (7) that the composition of these layers produces the function defined in (3). Thus, choosing an appropriate output row, the resulting network computes the approximant of the square function defined by (4). We also have that xy = ((x + y)² − x² − y²)/2. Accordingly, in the first layer of Mult we obtain the vector formed by 1, x, y and x + y, and we then apply the squaring network from the first part of the proof in a parallel manner to each of the pairs (1, x), (1, y) and (1, x + y). More precisely, for a given matrix M of size p × q, we form the block matrix of size 3p × 3q that applies M to each of the three blocks simultaneously. Then, for the network assembled in this way, the approximation of xy follows from the approximations of the three squares together with the triangle inequality, which implies (9). It remains to be noted that the product of the absolute values of the matrices presented in Mult has the claimed form, which completes the proof of the lemma. □

Lemma 4. For any positive integer m and any number of factors, there exists a neural network Mult^q that approximates the product of those factors with the stated accuracy and with the stated bound on the entries of the product of the absolute values of its matrices.

Proof. First, for a given number of factors, we construct a network that multiplies them in pairs. In the first layer, we obtain a vector whose first coordinate is 1, followed by triples built from the pairs of factors to be multiplied; the network is then obtained by applying in parallel the network Mult from Lemma 3 to each triple while keeping the first coordinate equal to 1. The product of the absolute values of the matrices presented in this construction is a matrix whose rows are built from the coordinates obtained in the previous lemma. Let us now construct the network Mult^q. The first hidden layer of Mult^q arranges the inputs into the required form. We then subsequently apply the pairwise multiplication networks and, in the last layer, we multiply the outcome by the final output row. From Lemma 3 and the triangle inequality, each stage adds at most the single-step error; hence, by induction on q, we obtain the claimed accuracy. Note that the product of the absolute values of the matrices in each pairwise multiplication network has the above form, that is, in each row it has at most three nonzero values, each of which is less than 2. As the matrices given in the first and the last layers of Mult^q also satisfy this property, each entry of the product of the absolute values of all matrices of Mult^q does not exceed the claimed bound. □

Proof of Lemma 1. We have that, if |γ| = 0, then x^γ = 1, and if |γ| = 1, then γ has only one non-zero coordinate, say γ_j, which is equal to 1, and x^γ = x_j. Denote by γ^{(1)}, γ^{(2)}, ... the multi-indices satisfying the degree condition and, for each of them, consider the vector listing the coordinates of x with the corresponding multiplicities. The first layer of Mon computes the concatenation of those vectors by multiplying the extended input vector by a matrix of the appropriate size. In the following layers, we do not change the first coordinates (by multiplying them by the identity), and, to each block, we apply in parallel the network Mult^q from Lemma 4. Recall that, in Lemma 4, a specific vector is obtained from the product of the absolute values of the matrices of Mult^q. We then have that the product of the absolute values of the matrices of Mon is built from those vectors and the first-layer matrix. As the first-layer matrix only contains entries 0 and 1, applying Lemma 4, we obtain that the entries of the resulting matrix M are bounded as claimed. □
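The "parallel application" used in the proofs above, applying the same subnetwork simultaneously to several blocks of the current vector, amounts to replacing each weight matrix M by the block-diagonal matrix diag(M, M, M). A minimal NumPy sketch of this bookkeeping follows (the paper's construction additionally keeps the leading coordinate 1, which is omitted here):

```python
import numpy as np

def parallel3(M):
    """Block-diagonal matrix diag(M, M, M): one layer applied simultaneously
    to three stacked input blocks."""
    return np.kron(np.eye(3), M)

M = np.array([[1.0, -0.5],
              [2.0,  0.0]])
x = np.array([0.3, 0.7]); y = np.array([0.1, 0.9]); z = np.array([0.4, 1.0])

out = parallel3(M) @ np.concatenate([x, y, z])
assert np.allclose(out, np.concatenate([M @ x, M @ y, M @ z]))

# The absolute value of the block matrix is the block matrix of absolute values,
# so parallelization does not inflate the entries of the products |W_L|...|W_0|.
assert np.allclose(np.abs(parallel3(M)), parallel3(np.abs(M)))
```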
Proof of Theorem 1. Let f be the function to be approximated, with an absolutely convergent power series. Then, for the chosen truncation level, the remainder of the series is bounded as in (10). Applying Lemma 1 with a suitable m, we obtain (11): the monomials appearing in the retained partial sum are approximated with the required accuracy, where we used elementary inequalities valid for the chosen parameters. In order to approximate the partial sum itself, we add one last layer, with the coefficients of that partial sum, to the network Mon. As the sum of the absolute values of those coefficients is bounded by F, combining (10) and (11), for the obtained network we get the claimed approximation bound, and the bound on its path norm follows from Lemma 1. □

Let us now present the result from [19] that will be used to derive Lemma 2. First, under the stated assumptions ([20], Theorem 4.1), f has a unique representation as an absolutely and uniformly convergent multivariate Chebyshev series. Note that the degree of a d-dimensional polynomial is the maximal degree of its monomials. Then, for any non-negative integers bounding the index in each coordinate, the corresponding partial sum is a polynomial truncation of the multivariate Chebyshev series of f of the corresponding degree. It is shown in [19] that, for such polynomial truncations p of the multivariate Chebyshev series of f, the approximation error decays geometrically in the truncation degree.

Note that, from the recursive definition of the Chebyshev polynomials, it follows that the coefficients of each Chebyshev polynomial are bounded by an explicit exponential quantity. Let p now be a polynomial given by (12) with the prescribed degree. As the number of summands on the right-hand side of (12) is bounded, then, using (13), we obtain that p can be rewritten in the monomial basis with coefficients obeying the stated bound, where the last inequality follows from the imposed condition. □

The proof of Theorem 2 follows from Lemmas 1 and 2 by taking a suitable truncation degree and adding, to the network Mon, the last layer with the coefficients of the polynomial from Lemma 2. For the obtained network, the claimed approximation and path norm bounds follow, where C is the constant from Lemma 2. □
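The last step above, rewriting a truncated Chebyshev expansion in the monomial basis and keeping track of the resulting coefficients, can be mimicked in one dimension with NumPy's basis conversion. The sketch below uses an arbitrary geometrically decaying coefficient sequence as a stand-in for the Chebyshev coefficients of an analytic function; it is an illustration, not the paper's multivariate argument.

```python
import numpy as np
from numpy.polynomial import chebyshev as C
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)
n = 12
# Stand-in for geometrically decaying Chebyshev coefficients a_0, ..., a_n.
a = rng.uniform(-1.0, 1.0, n + 1) * 0.5**np.arange(n + 1)

p = C.cheb2poly(a)          # coefficients of the same polynomial in the monomial basis
print("sum of |Chebyshev coefficients| =", np.abs(a).sum())
print("sum of |monomial coefficients|  =", np.abs(p).sum())

# Sanity check: both coefficient vectors represent the same polynomial on [-1, 1].
xs = np.linspace(-1.0, 1.0, 11)
assert np.allclose(C.chebval(xs, a), P.polyval(xs, p))
```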

5. Discussion

Although various activation functions, including the ReLU, the sigmoid and the Gaussian function, have already been used in the literature for neural network approximations of smooth and analytic functions (see [3,8,21]), the approximating properties of neural networks with an absolute value activation function, which is a built-in activation function of software-based neural network evolving methods (such as NEAT-Python, [11]), have barely been covered previously. Whereas the algorithms developed in [12,13] allow us to train neural networks with an absolute value activation function, in the present paper, we study the capability of those networks to approximate analytic functions. While the popular types of constraints imposed on approximating neural networks either control the norms of the network weights or adjust their architectures, in the present work, we study the approximating properties of neural networks with regularized path norms and show that networks with an absolute value activation function and with network path norms having logarithmic dependence on 1/ε can ε-approximate functions that are analytic on certain regions of C^d. The sizes and the weights of the constructed networks also have logarithmic dependence on 1/ε.
References (5 in total):

1.  Universal Approximation Using Feedforward Neural Networks: A Survey of Some Existing Methods, and Some New Results.

Authors:  Ah Chung Tsoi; Franco Scarselli
Journal:  Neural Netw       Date:  1998-01

2.  A multilayer neural network with piecewise-linear structure and back-propagation learning.

Authors:  R Batruni
Journal:  IEEE Trans Neural Netw       Date:  1991

3.  Canonical piecewise-linear networks.

Authors:  J N Lin; R Unbehauen
Journal:  IEEE Trans Neural Netw       Date:  1995

4.  Statistical guarantees for regularized neural networks.

Authors:  Mahsa Taheri; Fang Xie; Johannes Lederer
Journal:  Neural Netw       Date:  2021-04-30

5.  Error bounds for approximations with deep ReLU networks.

Authors:  Dmitry Yarotsky
Journal:  Neural Netw       Date:  2017-07-13
