Machine Learning Approaches toward Orbital-free Density Functional Theory: Simultaneous Training on the Kinetic Energy Density Functional and Its Functional Derivative.

Ralf Meyer¹, Manuel Weichselbaum¹, Andreas W Hauser¹.

Abstract

Orbital-free approaches might offer a way to boost the applicability of density functional theory by orders of magnitude in system size. An important ingredient for this endeavor is the kinetic energy density functional. Snyder et al. [ Phys. Rev. Lett. 2012, 108, 253002] presented a machine learning approximation for this functional achieving chemical accuracy on a one-dimensional model system. However, poor performance with respect to the functional derivative, a crucial element in iterative energy minimization procedures, necessitated the application of a computationally expensive projection method. In this work, we circumvent this issue by including the functional derivative in the training of various machine learning models. Besides kernel ridge regression, the original method of choice, we also test the performance of convolutional neural network techniques borrowed from the field of image recognition.

Year: 2020 PMID: 32786898 PMCID: PMC7482319 DOI: 10.1021/acs.jctc.0c00580

Source DB: PubMed Journal: J Chem Theory Comput ISSN: 1549-9618 Impact factor: 6.006

Introduction

Over the past decades, density functional theory (DFT) has evolved into a powerful standard tool of computational chemistry.[1,2] Although DFT was originally intended as an orbital-free ansatz, in which all contributions to the electronic energy of a system are represented by functionals of the electron density, the reintroduction of orbitals within the Kohn–Sham framework is the de facto standard in most modern programs.[3,4] The crucial term that triggered this development is the kinetic energy of a system of interacting fermions, which is much better described in the picture of occupied molecular orbitals, i.e., eigenfunctions of an effective one-electron operator in a mean-field approximation. In modern functionals, the small deviations from the true kinetic energy are compensated by approximate functional expressions for the exchange and correlation interactions of an N-electron system. Although the local density approximation (LDA) of Kohn and Sham[4] is uniquely defined by the properties of the uniform electron gas, the strategy for further refinements is not clear at all. Among the most successful current approaches for kinetic energy density functionals (KEDFs) is the class of nonlocal functionals, which are built from three parts:

$$T[n] = T_{\mathrm{TF}}[n] + T_{\mathrm{vW}}[n] + T_{\mathrm{NL}}[n]$$

with $T_{\mathrm{TF}} = C_{\mathrm{TF}} \int n^{5/3}\, d\mathbf{r}$ as the Thomas–Fermi functional,[5−7] $T_{\mathrm{vW}} = \frac{1}{8} \int |\nabla n|^{2} / n \, d\mathbf{r}$ as the semilocal von Weizsäcker functional,[8] and $T_{\mathrm{NL}}$ as an additional nonlocal term. A widely used ansatz for the nonlocal part is of the form

$$T_{\mathrm{NL}}[n] = \iint n^{\alpha}(\mathbf{r})\, \omega(\mathbf{r}, \mathbf{r}')\, n^{\beta}(\mathbf{r}')\, d\mathbf{r}\, d\mathbf{r}'$$

with ω denoting a dimensionless kernel, typically assumed to be a function of |r – r′|, and the exponents α and β as parameters. This form encompasses state-of-the-art nonlocal functionals such as the Wang–Teter,[9] Smargiassi–Madden,[10] Perrot,[11] Wang–Govind–Carter,[12,13] Huang–Carter,[14] and Mi–Genova–Pavanello[15] functionals.
With varying degrees of success, these functionals have been applied to metallic and semiconducting bulk systems containing up to 1 million atoms,[16−19] to metallic clusters,[20−22] and to molecular systems.[23,24] Following ref (25), the idea of using machine learning (ML) methods to approximate density functionals has been investigated by several groups recently. The original ML model for the KEDF has been shown to successfully describe bond breaking[26] and was extended to include basis set independence[27] as well as scale-invariance conditions.[28] The same ML model has also been employed for direct fits of F[n], the universal part of the total energy density functional.[29] A very interesting ML model was investigated by Yao and Parkhill, who used a 1D convolutional neural network to fit the kinetic energy as a function of the density projected onto bond directions.[30] Machine learning approximations, in particular neural networks, have also been suggested for semilocal KEDFs.[31−33] However, all of these ML-based KEDFs were deemed to be inadequate for an application in iterative calculations of the minimum energy density, mostly due to large errors on the predicted functional derivative. As a consequence, the focus has shifted toward a direct prediction of the minimum energy density from the nuclear potential, thereby bypassing the need for iterative calculations.[34−39] One of the earliest machine learning models for density functionals was presented by Tozer et al. 
for the exchange–correlation (XC) functional.[40] In addition to explorations of the XC functional,[41−44] machine learning approximations have also been applied to other technicalities of DFT.[45−48] In this article, we follow up on the first tests by Snyder et al.[25] and investigate whether the original idea of learning the kinetic energy functional for use in iterative calculations can be “salvaged” by simultaneously training the machine learning model on both the kinetic energy functional and its functional derivative. In addition to kernel ridge regression, we evaluate the performance of convolutional neural networks, one of the most successful and widely used ML architectures to date. This choice is motivated by the fact that the mathematical expression underlying a convolutional layer is very similar to the nonlocal contribution TNL introduced above (if the kernel ω is assumed to be a function of |r – r′|) and shows translational invariance, which might enable a better generalization, especially for large systems.

Methods

Data Generation

A one-dimensional model system of noninteracting spinless fermions is used to train and test the ML KEDFs. It consists of N particles in a hard wall box within the interval 0 ≤ x ≤ 1 and an external potential built from a linear combination of three Gaussians:[25]

$$V(x) = -\sum_{i=1}^{3} a_i \exp\left(-\frac{(x - b_i)^2}{2 c_i^2}\right)$$

with parameters $a_i$, $b_i$, and $c_i$ randomly sampled from uniform distributions in the intervals [1, 10], [0.4, 0.6], and [0.03, 0.1], respectively. The 1D Schrödinger equation for these potentials is solved on a grid of G = 500 points using Numerov’s method,[49] yielding a set of eigenfunctions $\psi_k(x)$ and corresponding eigenvalues $E_k$ for each potential V(x), ordered from lowest to highest energy with increasing index k. These solutions are then used to calculate all components of the training data for an N-particle system, namely the density

$$n(x) = \sum_{k=1}^{N} |\psi_k(x)|^2$$

the kinetic energy density τ(x), the kinetic energy

$$T = \int_0^1 \tau(x)\, dx$$

and the kinetic energy functional derivative

$$\frac{\delta T[n]}{\delta n(x)} = \bar{E} - V(x)$$

with $\bar{E} = \frac{1}{N} \sum_{k=1}^{N} E_k$ denoting the total energy per particle. The discretized versions of these functions, written as vectors for clarity, n(x) → n, τ(x) → τ, and δT[n]/δn(x) → ∇T/Δx, are used to train the ML models. The error in the eigenenergies due to discretization is estimated to be below 10⁻³ kcal/mol by comparing the solutions to calculations on a 10 times finer grid. However, in the context of the ML models all of the computed quantities are considered exact. The M = 100 parameter triplets for a, b, and c as listed in the Supporting Information of ref (25) are used as training data in order to recreate the original study as closely as possible. Using the same functional form and sampling intervals for the potential, we generate 1000 additional random potentials as a test set.
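The data generation described above can be illustrated with a short sketch. It is not the authors' code: it swaps Numerov's method for a plain second-order finite-difference Hamiltonian diagonalized with NumPy, and the sign convention of the potential (attractive Gaussian wells) is assumed from ref (25). The function names `potential` and `solve_box` are hypothetical.

```python
import numpy as np

def potential(x, a, b, c):
    """Sum of three Gaussian wells (negative sign assumed, following ref 25)."""
    x = x[:, None]
    return -np.sum(a * np.exp(-(x - b) ** 2 / (2.0 * c ** 2)), axis=1)

def solve_box(V, dx, N):
    """Lowest N states of -1/2 psi'' + V psi = E psi with psi = 0 at both walls.

    Finite-difference stand-in for Numerov's method (an assumption of this sketch).
    """
    interior = V[1:-1]
    main = 1.0 / dx**2 + interior
    off = -0.5 / dx**2 * np.ones(interior.size - 1)
    H = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)
    E, psi = np.linalg.eigh(H)
    psi = psi[:, :N] / np.sqrt(dx)        # normalize so that sum |psi|^2 dx = 1
    psi = np.pad(psi, ((1, 1), (0, 0)))   # append the zeros at the hard walls
    return E[:N], psi

G, N = 500, 1
x = np.linspace(0.0, 1.0, G)
dx = x[1] - x[0]
rng = np.random.default_rng(0)
a = rng.uniform(1.0, 10.0, 3)
b = rng.uniform(0.4, 0.6, 3)
c = rng.uniform(0.03, 0.1, 3)

V = potential(x, a, b, c)
E, psi = solve_box(V, dx, N)
n = np.sum(psi**2, axis=1)                # density on the grid
T = E.sum() - dx * np.sum(V * n)          # kinetic = total - potential energy
dT_dn = E.sum() / N - V                   # functional derivative, constant fixed
                                          # to the total energy per particle
```

For the empty box (V = 0) the lowest eigenvalue of this discretization approaches the analytic value π²/2 hartree, which is a convenient sanity check of the grid.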

Kernel Ridge Regression

We start with a brief review of the kernel ridge regression (KRR) approach as introduced in ref (25). This ansatz is then extended by including the functional derivative in the training in order to improve its capabilities. Details on the derivation of the equations used in the following can be found in section 2 of the Supporting Information. An elaborate discussion of KRR model training can be found in ref (50). In KRR, the simple regularized linear fit of ridge regression is extended toward nonlinear data through the introduction of a kernel function:

$$T^{\mathrm{ML}}(\mathbf{n}) = \sum_{j=1}^{M} \alpha_j\, k(\mathbf{n}_j, \mathbf{n})$$

with $\alpha_j$ as the fit coefficients, the kernel function $k(\mathbf{n}_i, \mathbf{n}_j)$, which can be interpreted as a measure of similarity between two densities, and with $\{\mathbf{n}_1, ..., \mathbf{n}_M\}$ as the M training examples. The coefficients $\alpha_j$ are determined by minimizing the cost function

$$C(\boldsymbol{\alpha}) = \sum_{j=1}^{M} \left( T^{\mathrm{ML}}(\mathbf{n}_j) - T_j \right)^2 + \lambda\, \boldsymbol{\alpha}^{T} \mathbf{K} \boldsymbol{\alpha}$$

where the second term is a regularization function scaled by the parameter λ and $T_j$ are the kinetic energies corresponding to the training densities $\mathbf{n}_j$. Setting the derivative with respect to the $\alpha_j$ equal to zero yields a matrix equation for the fit coefficients:

$$\boldsymbol{\alpha} = (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{T}$$

The matrix K contains the values of the kernel function for the M training examples, so $K_{ij} = k(\mathbf{n}_i, \mathbf{n}_j)$, and I is a unit matrix of size M. Snyder et al.[25] have already shown that the discretized functional derivative can easily be calculated from the kernel expansion, yielding

$$\frac{1}{\Delta x} \nabla_{\mathbf{n}} T^{\mathrm{ML}}(\mathbf{n}) = \frac{1}{\Delta x} \sum_{j=1}^{M} \alpha_j\, \nabla_{\mathbf{n}} k(\mathbf{n}_j, \mathbf{n})$$

In our study, we expand on this idea by including the functional derivatives of the training examples in the model using additional fit coefficients $\boldsymbol{\beta}_j$. In the extended model, the kernel expansion is supplemented by a second sum in which each coefficient vector $\boldsymbol{\beta}_j$ multiplies the gradient of the kernel function with respect to the density. Differentiating the extended model with respect to the input density gives the new formula for the functional derivative, which additionally involves the Hessian of the kernel function. Note that each of the newly introduced coefficients $\boldsymbol{\beta}_j$ is a vector of size G. Therefore, the number of parameters grows from M to M(1 + G).

The cost function is extended by the squared error of the functional derivative and an additional regularization term for the new weights $\boldsymbol{\beta}_j$, with $\nabla T_j/\Delta x$ denoting the reference value for the discretized functional derivative corresponding to the training density $\mathbf{n}_j$. Minimizing this extended cost function with respect to the coefficients yields a linear system for the stacked coefficients α and β. This system contains an extended regularization matrix, built from unit matrices $\mathbf{I}_M$ and $\mathbf{I}_{MG}$ of size M and MG, respectively, and an extended kernel matrix

$$\mathbf{K}_{\mathrm{ext}} = \begin{pmatrix} \mathbf{K} & \mathbf{J}' \\ \mathbf{J} & \mathbf{H} \end{pmatrix}$$

where K is an (M × M) matrix with elements $K_{ij} = k(\mathbf{n}_i, \mathbf{n}_j)$; J and J′ are matrices of size (MG × M) and (M × MG), respectively, and contain the gradient vectors of $k(\mathbf{n}_i, \mathbf{n}_j)$ with respect to the input densities. Finally, H is an (MG × MG) blocked matrix consisting of the Hessian matrices of $k(\mathbf{n}_i, \mathbf{n}_j)$. Following ref (25), we use the squared exponential kernel for all of the presented KRR models:

$$k(\mathbf{n}, \mathbf{n}') = \exp\left( -\frac{\lVert \mathbf{n} - \mathbf{n}' \rVert^2}{2 \sigma^2} \right)$$

where the hyperparameter σ denotes the length scale on which the training densities vary.
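As a rough illustration of the standard KRR model (without the derivative extension), the following sketch fits the coefficients α and evaluates the gradient of the prediction with respect to the input density for the squared exponential kernel. The function names and the toy data are hypothetical and chosen only for this example; dividing the returned gradient by the grid spacing Δx gives the discretized functional derivative.

```python
import numpy as np

def sq_exp_kernel(A, B, sigma):
    """k(n, n') = exp(-||n - n'||^2 / (2 sigma^2)) for all rows of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def krr_fit(n_train, T_train, sigma, lam):
    """Solve (K + lam * I) alpha = T for the fit coefficients."""
    K = sq_exp_kernel(n_train, n_train, sigma)
    return np.linalg.solve(K + lam * np.eye(len(T_train)), T_train)

def krr_predict(n, n_train, alpha, sigma):
    """Kernel expansion sum_j alpha_j k(n_j, n) for a single density n."""
    return sq_exp_kernel(n[None, :], n_train, sigma)[0] @ alpha

def krr_gradient(n, n_train, alpha, sigma):
    """Gradient of the prediction with respect to the grid values of n."""
    k = sq_exp_kernel(n[None, :], n_train, sigma)[0]
    return ((alpha * k)[:, None] * (n_train - n[None, :])).sum(0) / sigma**2

# Toy data (not the paper's densities): 20 random vectors on an 8-point grid.
rng = np.random.default_rng(0)
n_train = rng.random((20, 8))
T_train = np.sum(n_train**2, axis=1)
sigma, lam = 1.0, 1e-3
alpha = krr_fit(n_train, T_train, sigma, lam)
```

Because the gradient is available in closed form, no finite differences are needed during an iterative density search; the quality of that gradient, however, is exactly what the extended model is meant to improve.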

Convolutional Neural Networks

Convolutional neural networks[51,52] (CNNs) are at the forefront of the ongoing deep learning revolution[53,54] and achieve unprecedented accuracy in their main field of application: image recognition.[55] CNNs represent a subclass of standard feed-forward neural networks, designed for the specific purpose of an efficient inclusion of spatial information in pixel-based image processing. Despite their origin in visual pattern recognition, CNNs have been applied successfully to numerous other tasks, including the approximation of density functionals.[30,43,45] A single convolutional layer typically consists of several filters or steps of input processing. A pass through a single convolutional filter in one dimension is given by

$$z_{(g)} = f\left( b + \sum_{i=1}^{\sigma} w_{(i)}\, u_{(g+i-1)} \right)$$

where the index “(g)” refers to the gth element of a vector (parentheses are used to distinguish grid point indices from training example indices), u is the input vector, f is an activation function, b is a bias parameter, w is a vector of weight parameters for the filter (commonly referred to as “kernel”), and σ is the filter width. This equation is only valid for indices (g) where the input and the convolutional kernel w fully overlap (referred to as “valid padding”). The resulting output vector z is therefore smaller than the input. Alternatively, the input vector can be padded with zeros to ensure that the output is of the same size as the input, a technique referred to as “same padding”. In the course of this article we investigate the performance of both a standard CNN and a residual neural network (ResNet),[56] with the latter referring to a network featuring a more sophisticated architecture: In addition to conventional convolutional layers, ResNets use so-called skip connections through which the feed-forward signal can bypass several layers and is directly added to the output of a later layer.
Connections of this type are known to improve the training process, in particular if training data is limited, as they form a less complicated, “coarse” network within the actual network structure. Both investigated models use 32 filters per convolutional layer with a filter width of σ = 100 and employ the softplus activation function.[57,58] The standard CNN consists of five convolutional layers using valid padding. This results in a flattened output vector containing 160 entries, which is then reduced to a single scalar, the kinetic energy prediction, using a weighted sum, referred to as “linear dense layer” in community parlance. The more complicated ResNet model consists of three blocks of two convolutional layers. Each of these blocks is bypassed by a skip connection. The three blocks are followed by a final convolutional layer with a single filter. In order to allow for skip connections, all convolutions employ same padding. This architecture results in an output vector of the same size as the input density, which is interpreted as kinetic energy density. Finally, the kinetic energy is calculated by integrating over this output using the trapezoidal rule. The batch normalization layers[59] typically employed in ResNets worsen the training performance in regression tasks and are therefore not used. Schematics of both models are presented in Figure 1. We use the Keras[60] and TensorFlow[61] Python packages to implement and train both types of neural networks.
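The layer sizes quoted above can be checked with a few lines of arithmetic. The sketch below assumes the usual cross-correlation form of a 1D convolutional filter with stride 1 and uses NumPy instead of Keras; the helper names are hypothetical.

```python
import numpy as np

def softplus(z):
    """Numerically stable softplus, log(1 + exp(z))."""
    return np.logaddexp(0.0, z)

def conv1d_valid(u, w, b):
    """Single filter with valid padding: z_g = f(b + sum_i w_i u_{g+i-1})."""
    return softplus(b + np.correlate(u, w, mode="valid"))

G, width, n_filters, n_layers = 500, 100, 32, 5

# Each valid-padded layer shortens the vector by (width - 1) grid points.
length = G
for _ in range(n_layers):
    length = length - width + 1
n_features = length * n_filters   # size of the flattened output

# One filter applied to a random input of grid size G.
rng = np.random.default_rng(0)
z = conv1d_valid(rng.random(G), rng.standard_normal(width), 0.1)
```

The grid shrinks as 500 → 401 → 302 → 203 → 104 → 5, and 5 grid points times 32 filters reproduce the 160 entries of the flattened output.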

Figure 1

Schematic depiction of the NN architectures used for the standard CNN (left) and the ResNet model (right). Note the appearance of skip connections for the latter.

The bias and weight parameters are determined by minimizing a cost function similar to the extended cost function used for KRR. Since the ResNet model offers predictions of the kinetic energy density τ, an additional error term for τ can be added to the cost function. The cost function thus combines the errors of the kinetic energy, the kinetic energy density, and the functional derivative, weighted by ι, ιτ, and κ, respectively, with an L2-regularization term scaled by λ. The weighting coefficients are set to ι = 0.2, ιτ = 0, κ = 1, and λ = 2.5 × 10⁻⁴ for the standard CNN (since it does not predict the kinetic energy density) and to ι = 0, ιτ = 1, κ = 1, and λ = 2.5 × 10⁻⁴ for the ResNet. The L2-regularization term is applied to weight parameters exclusively and not to bias parameters. The network parameters are initialized randomly according to the “Glorot uniform” method[62,63] as implemented in TensorFlow and trained using the Adam optimizer[64] for 100 000 epochs. We use a two-stage learning rate schedule, where the learning rate stays constant at 10⁻⁴ for the first 21 800 epochs and is lowered by 10% every 1000 epochs for the remaining training procedure, resulting in an exponential decay. This greatly improves the overall convergence, as the inclusion of derivative information leads to large variations of the cost function during the training. A more detailed discussion of the training procedure as well as its convergence behavior is given in section 4 of the Supporting Information.
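The two-stage learning rate schedule can be written as a small function. This is only a sketch: whether the first 10% reduction is applied exactly at epoch 21 800 or one interval later is not stated in the text, and the version below assumes the former. The function name and its parameters are hypothetical.

```python
def learning_rate(epoch, base=1e-4, hold=21_800, every=1_000, factor=0.9):
    """Constant rate during the hold phase, then a 10% cut every `every` epochs.

    Assumption of this sketch: the first cut happens at epoch == hold.
    """
    if hold > epoch:
        return base
    return base * factor ** ((epoch - hold) // every + 1)

# Learning rate at a few selected epochs of the 100 000-epoch training run.
schedule = {e: learning_rate(e) for e in (0, 21_799, 21_800, 22_800, 99_999)}
```

A function of this kind could be passed to a scheduler callback of the training framework; the step-wise cuts approximate an exponential decay of the learning rate.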

Results and Discussion

Each of the following investigations can be split into two different parts with respect to their objectives. In the first part, the model performance is tested by using the exact densities of the test set as input to the ML models and evaluating the error of both the kinetic energy and the functional derivative. In the second part, the derivative prediction of the ML models is used to iteratively find the minimum energy density for the potentials of the test set. For these densities, the error of the kinetic energy is reported together with the deviation from the exact minimum energy density. This way, the impact and the magnitude of both types of error contributions, one stemming from the model itself and the other caused by wrong minimum energy density predictions, should become clear and traceable for the reader.

Training on the Functional Derivative

As a first test, we investigate whether the inclusion of derivative information can improve the fit quality of the machine learning models on the data sets for N = 1. Table 1 summarizes the mean value, the standard deviation, and the maximum value of both the absolute error of the kinetic energy |ΔT| and the integral over the absolute error of the functional derivative for all of the investigated models.

Table 1

Absolute Error Values on the N = 1 Test Set for All of the Machine Learning Models (in kcal/mol)

	|ΔT|			∫|Δ(δT/δn)| dx
model	mean	std	max	mean	std	max
KRR, ref [25]	0.15	0.24	3.2	–	–	–
KRR, this work	0.163	0.29	4.6	29313.2	345.5	30610.9
ext KRR	0.004	0.02	0.6	3.4	4.3	50.7
CNN	0.044	0.10	2.3	31.5	25.0	370.1
ResNet	0.015	0.02	0.3	10.1	7.0	110.7

As a reference, we reproduce the KRR results from ref (25) by using the reported hyperparameters (σ = 43 and λ = 12 × 10⁻¹⁴). Slight deviations between our results and the previous work in Table 1 can be attributed to the fact that we use a different randomly generated test set. The hyperparameters for the extended KRR model including derivative information (referred to as “ext KRR” in Table 1) are determined using a rough grid search and 5-fold cross-validation. The minimum of the sum of the mean absolute validation errors for kinetic energy and functional derivative is obtained for σ = 30.58 and λ = 10⁻¹². The influence of the weighting parameter κ, which is set to 1, as well as a more detailed description of the hyperparameter search is given in section 3 of the Supporting Information. Comparison of the two KRR models shows that the inclusion of derivative information into the KRR approach not only drastically reduces the error on the functional derivative, as illustrated for a sample potential in Figure 2, but also improves the accuracy of the kinetic energy prediction. Section 3 of the Supporting Information shows that the cross-validation error of the kinetic energy is actually lowest for the hyperparameter values σ = 11.50 and λ = 10⁻¹⁴, whereas the lowest error on the functional derivative is obtained for σ = 30.58 and λ = 10⁻¹².

Figure 2

Comparison of the exact functional derivative (solid lines) and predictions by standard KRR with and without the PCA projection detailed in the section “Finding Minimum Energy Densities Using Principal Component Analysis” (dashed lines) as well as predictions from the ML models trained on derivative information (dashed–dotted lines). The parameters for the shown potential are a = {4.43, 7.18, 9.03}, b = {0.0532, 0.587, 0.568}, and c = {0.0754, 0.0406, 0.0554}.

Both the simpler CNN and the more sophisticated ResNet achieve lower mean absolute errors for both the kinetic energy and its derivative than the standard KRR model. Neither of them, however, reaches the accuracy of the extended KRR model. We attribute the better performance of the extended KRR method to a lack of smoothness exhibited by the neural network models, as shown in the bottom panel of Figure 2.

Finding Minimum Energy Densities Using Principal Component Analysis

In the next step we address the question of applicability in a direct, iterative minimization of the energy with respect to the density. As will be shown in the following, an unconstrained search still remains impossible despite the drastic improvements in the prediction accuracy of the functional derivative. The reason for this failure lies in the notoriously noisy nature of machine learning approximations. Already emphasized in ref (25), this was discussed in greater detail in a follow-up investigation on nonlinear gradient denoising.[65] For reasons of comparability, we use the same local principal component analysis (PCA) approach that was introduced in the original publication as an ad hoc remedy and investigate whether the improved accuracy of the models trained on the functional derivative translates to lower errors on the iteratively found densities. Additionally, we test whether the PCA search space can be increased for these models. Starting from the average density of all training examples, $\mathbf{n}^{(0)} = \frac{1}{M} \sum_{i=1}^{M} \mathbf{n}_i$, the minimum energy density is found by simple gradient descent

$$\mathbf{n}^{(j+1)} = \mathbf{n}^{(j)} - \eta\, \mathbf{P}(\mathbf{n}^{(j)}) \left( \frac{\nabla T^{\mathrm{ML}}(\mathbf{n}^{(j)})}{\Delta x} + \mathbf{V} \right)$$

where the projection matrix P(n) is acting on the functional derivative and constraining the search space, η is the step size, and V is the discretized potential. We note that more sophisticated optimization methods such as conjugate gradient are known to significantly accelerate the convergence,[66] but this is not relevant for the intended comparison. For a given density n the local PCA algorithm starts by calculating the difference matrix $\mathbf{X}^{T} = (\mathbf{n}_{j_1} - \mathbf{n}, ..., \mathbf{n}_{j_m} - \mathbf{n})$ for the m closest training densities and diagonalizing the covariance matrix $\mathbf{C} = \mathbf{X}^{T}\mathbf{X}/m$. The projection matrix is then constructed from the eigenvectors $\mathbf{w}_i$ corresponding to the l largest magnitude eigenvalues, $\mathbf{P}(\mathbf{n}) = \sum_{i=1}^{l} \mathbf{w}_i \mathbf{w}_i^{T}$. Figure 2 shows the effect of this projection on the functional derivative prediction made by the standard KRR model with parameters m = 30 and l = 5.
For all calculations presented in this article we keep m = 30, but we vary the size of the search space via the parameter l. Section 7 of the Supporting Information shows the effect of the projection on the functional derivative prediction for different values of l. The iterative minimization algorithm is considered converged once the integral over the absolute projected functional derivative is smaller than 10⁻⁶ hartree/particle. We use a step size of η = 10⁻³ and restrict the maximum number of iterations to 4000 cycles. The results for all of the 1000 random potentials in the test set are summarized in Table 2. Again, the inclusion of derivative information reduces both errors significantly when compared to those obtained with the simple KRR. The final error can be attributed to two different sources. The first contribution stems from the model error due to the ML approximation, as has already been discussed in the section “Training on the Functional Derivative”. A second contribution arises from the difference in the corresponding minimum energy densities Δn, which is in turn caused by the model error and the restriction of the search space in the PCA. This limited flexibility in the search for l = 5 is likely the cause of the similar error values achieved by all of the ML models using derivative information. We therefore also investigated larger l values while keeping m = 30. Runs using standard KRR fail for every value l > 5, not reaching convergence and predicting sharply peaked densities instead. Similarly, the performance of the simple CNN deteriorates quickly for l > 10. For the ResNet and extended KRR models the errors decrease up to an l value of about 15, at which point the iterative algorithm leaves the valid region of the ML models. Note, however, that for l values in the range between 10 and 15 the ML approximations including derivative information achieve errors an order of magnitude lower than standard KRR.
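The local PCA projection and a single projected gradient descent step can be sketched as follows. This follows the verbal description above (m nearest training densities, covariance matrix of the differences, l leading eigenvectors); the function names are hypothetical, and `grad_T` stands for the discretized functional derivative predicted by whichever ML model is used.

```python
import numpy as np

def local_pca_projector(n, n_train, m=30, l=5):
    """Projector onto the l leading principal directions around density n."""
    dist = np.linalg.norm(n_train - n, axis=1)
    nearest = n_train[np.argsort(dist)[:m]]   # m closest training densities
    X = nearest - n                           # rows n_j - n, shape (m, G)
    C = X.T @ X / m                           # (G, G) covariance matrix
    _, eigvec = np.linalg.eigh(C)             # eigenvalues in ascending order
    W = eigvec[:, -l:]                        # eigenvectors of l largest ones
    return W @ W.T

def projected_step(n, grad_T, V, n_train, eta=1e-3, m=30, l=5):
    """One step n <- n - eta * P(n) (grad_T + V)."""
    P = local_pca_projector(n, n_train, m, l)
    return n - eta * P @ (grad_T + V)
```

Since the projector depends on the current density, it has to be rebuilt (including a G × G eigendecomposition) in every iteration, which is the overhead that becomes relevant for larger grids.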

Table 2

Absolute Kinetic Energy Errors ΔT for the Iteratively Found Densities (in kcal/mol) as Well as the Integrated Absolute Error of the Densities Δn for the N = 1 Test Set

		|ΔT|			|Δn| × 10⁴
model	l	mean	std	max	mean	std	max
KRR, ref [25]	5	3.0	5.3	46	–	–	–
KRR, this work	5	2.85	7.00	87.34	45.0	54.0	503.9
ext KRR	5	0.46	1.05	15.35	14.9	11.5	85.0
ext KRR	10	0.04	0.22	5.95	0.8	0.7	10.7
ext KRR	15	0.04	0.22	5.97	0.3	0.5	10.8
CNN	5	0.57	1.40	21.69	15.2	12.0	95.5
CNN	10	0.29	0.77	13.26	5.5	8.5	171.0
ResNet	5	0.51	1.25	19.59	14.9	11.5	85.5
ResNet	10	0.09	0.21	5.72	1.0	0.9	14.7
ResNet	15	0.09	0.22	5.86	2.0	2.4	19.6

Alternatively, KRR can also be used in iterative calculations without the PCA by adding a constant offset to the model and using a small length scale hyperparameter σ in the kernel function. This can be used to penalize densities far from the training data and effectively acts similarly to PCA, while using all of the training densities (m = M). This approach is inspired by the use of Gaussian process regression for molecular geometry optimization,[67−69] where a similar idea ensures that the iterative search does not stray too far from the training data. A more detailed explanation as well as results for the densities provided by this method can be found in section 6 of the Supporting Information. However, both the need for local PCA and the alternative approach of small length scales and a constant offset are indications that the ML density functionals do not properly generalize and are only valid in close vicinity of the training examples.

Toward a Real-World Use Case

While the section “Finding Minimum Energy Densities Using Principal Component Analysis” shows that models trained on the functional derivative allow for significantly larger search spaces, the iterations will, given enough flexibility, inevitably leave the region where the machine learning approximations are valid. This typically leads to sharply peaked or rapidly oscillating densities. A straightforward solution for this problem is to use physically motivated penalty terms for these unphysical densities, such as the von Weizsäcker kinetic energy functional,[8] and to train the machine learning model on the difference between the exact kinetic energy and the von Weizsäcker model:

$$T^{\mathrm{ML}}[n] \approx T[n] - T^{\mathrm{vW}}[n], \qquad T^{\mathrm{vW}}[n] = \frac{1}{8} \int \frac{n'(x)^2}{n(x)}\, dx$$

The prediction for a previously unseen density is then given by the sum of the machine learning and the von Weizsäcker model functionals, TML + TvW, where the derivative term n′(x) = dn/dx in the latter contribution introduces an energy penalty for rapid changes in the density, which effectively restricts the search space to physically reasonable densities. Since the von Weizsäcker model already yields the exact solution for the case of a single spatial orbital (the N = 1 case discussed in the preceding sections), we test this approach for two-particle densities instead. In our toy model, this already corresponds to the occupation of two spatial orbitals since there is no spin degree of freedom taken into consideration. At first, we again investigate the model performance for fixed densities. The hyperparameters for both the standard and the extended KRR models are readjusted using the same grid search and 5-fold cross-validation as in the section “Training on the Functional Derivative”. This yields σ = 35.16 and λ = 10⁻¹² for the standard KRR and σ = 26.59, λ = 10⁻¹², and κ = 1.0 for the extended KRR. More details on the hyperparameter search are provided in section 3 of the Supporting Information. As can be seen in Table 3, all three investigated models achieve significantly better performance on the two-particle densities than on the single-particle densities.
The analysis of the training sets in section 1 of the Supporting Information suggests that even though kinetic energies in the N = 2 data set are showing a larger variance, most of it can be captured by a simple linear model. We attribute this to the fact that the second particle is less influenced by the relatively shallow potentials and the more difficult to learn semilocal contribution is already covered by the von Weizsäcker functional. Even though chemical accuracy is achieved by all models on this less challenging data set, extended KRR yields a mean absolute error for the kinetic energy that is 2 orders of magnitude lower than that obtained with either ResNet or standard KRR.
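The von Weizsäcker term used as a penalty here is easy to evaluate on a grid. The sketch below rewrites the integrand n′²/(8n) as ½(d√n/dx)², an identity wherever n > 0 that avoids dividing by zero at the hard walls; the function names are hypothetical, and simple finite differences are used for the derivative.

```python
import numpy as np

def trapezoid(f, dx):
    """Trapezoidal rule on a uniform grid."""
    return dx * (f.sum() - 0.5 * (f[0] + f[-1]))

def t_vw(n, dx):
    """Von Weizsaecker kinetic energy, evaluated as 1/2 int (d sqrt(n)/dx)^2 dx."""
    phi = np.sqrt(np.maximum(n, 0.0))
    dphi = np.gradient(phi, dx)
    return 0.5 * trapezoid(dphi**2, dx)

x = np.linspace(0.0, 1.0, 500)
dx = x[1] - x[0]
n1 = 2.0 * np.sin(np.pi * x) ** 2   # ground-state density of the empty box, N = 1
T_vw_1 = t_vw(n1, dx)               # should be close to pi^2 / 2
```

For this single-orbital density the functional reproduces the exact kinetic energy π²/2 hartree up to discretization error, which illustrates why the N = 1 case is no longer a meaningful test once the von Weizsäcker term is included.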

Table 3

Error Values of the Machine Learning Approximations on the N = 2 Test Set (in kcal/mol)

	|ΔT|			∫|Δ(δT/δn)| dx
model	mean	std	max	mean	std	max
KRR	0.0355	0.0588	0.8752	2957.85	16.00	2990.36
ext KRR	0.0002	0.0008	0.0233	0.12	0.15	2.18
ResNet	0.0483	0.2837	6.8116	6.78	11.36	223.35

Regarding efficiency and feasibility, the local PCA introduces a significant computational overhead and would most likely prohibit a large-scale application of ML density functionals to realistic problems. We therefore opt for the more traditional approach of using a basis of sine functions instead. This makes it necessary to ensure both the positivity and the proper normalization of the density throughout the iterative algorithm. While the correct norm can simply be enforced by a Lagrange multiplier, the positivity constraint is typically included by iterating on the variable φ = √n instead of the density. The steepest descent update rule for this new variable contains a projection matrix P and the Lagrange multiplier μ, which is used to ensure the conservation of the number of particles N. A detailed derivation of this result, a possible way of determining μ, and an explanation of why this procedure is not necessary for the local PCA are provided in section 5 of the Supporting Information. The matrix P projects the functional derivative onto the basis of sine functions. Note that this matrix does not need to be reconstructed in every iteration, as it no longer depends on the density at step j. The size of the search space is now determined by the maximum wavenumber K. The starting value for the variable φ is the square root of the initial density n(0) projected onto the basis of sine functions, where the starting density is again given by the average over the training data. The maximum number of iterations is restricted to 4000, and calculations are considered converged once the integral over the absolute projected functional derivative drops below a fixed threshold. Note that this differs from the convergence criterion in the section “Finding Minimum Energy Densities Using Principal Component Analysis” because convergence is monitored using the functional derivative with respect to the variable φ instead of the density.
The step size is reduced to η = 10⁻⁴ in order to avoid oscillations in the convergence behavior. Our results are summarized in Table 4. The large errors on the functional derivative of the standard KRR model lead to poor results for iteratively found densities. This is further emphasized by the steadily increasing error when the search space grows from K = 10 to K = 20 and K = 40. The iterative search is, however, stable even for the larger K values due to the von Weizsäcker penalty term. Extended KRR clearly yields the best predictions for the minimum energy densities, while the ResNet approach barely manages to achieve a mean absolute error for the kinetic energy within chemical accuracy.
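The exact update rule and the construction of the projection matrix are given in the Supporting Information and are not reproduced here. The following is only one way such a step could look, under several assumptions that are not taken from the source: the sine functions sin(kπx), k = 1, ..., K, are orthonormalized numerically by a QR decomposition, the chain rule δE/δφ = 2φ(δT/δn + V) is used for φ = √n, and μ is chosen so that the number of particles is conserved to first order in the step size. All function names are hypothetical.

```python
import numpy as np

def sine_projector(x, K):
    """Orthogonal projector onto span{sin(k pi x), k = 1..K} on the grid x."""
    S = np.sin(np.pi * np.outer(x, np.arange(1, K + 1)))
    Q, _ = np.linalg.qr(S)          # orthonormalize the basis columns
    return Q @ Q.T

def phi_step(phi, grad_total, P, eta=1e-4):
    """One projected steepest-descent step on phi = sqrt(n).

    grad_total is (dT/dn + V) on the grid; mu is fixed such that the
    norm of phi is unchanged to first order in eta (assumption).
    """
    phi = P @ phi
    mu = phi @ (phi * grad_total) / (phi @ phi)
    return phi - eta * P @ (2.0 * phi * (grad_total - mu))

x = np.linspace(0.0, 1.0, 200)
P = sine_projector(x, K=10)
phi0 = P @ np.sqrt(2.0 * np.sin(np.pi * x) ** 2)
g = np.random.default_rng(0).standard_normal(x.size)   # stand-in gradient
phi1 = phi_step(phi0, g, P)
```

In contrast to the local PCA, the projector is built once and reused in every iteration, and the density n = φ² stays nonnegative by construction.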

Table 4

Absolute Kinetic Energy Error ΔT for the Iteratively Found Densities (in kcal/mol) as Well as the Integrated Absolute Error of the Densities Δn on the N = 2 Test Set, Compared between the KRR variants and ResNet for Increased Search Spaces

		|ΔT|			|Δn| × 10⁴
model	K	mean	std	max	mean	std	max
KRR	10	8.431	1.138	16.686	109.0	23.0	222.8
KRR	20	21.365	0.826	22.456	133.9	18.4	194.9
KRR	40	23.882	0.846	25.003	139.8	18.4	200.5
ext KRR	10	0.523	0.827	7.353	24.8	16.0	102.1
ext KRR	20	0.074	0.069	0.789	0.5	0.6	2.9
ext KRR	40	0.076	0.069	0.789	0.1	0.1	1.1
ResNet	10	1.239	6.537	142.649	25.8	18.6	248.9
ResNet	20	0.877	6.634	146.324	2.9	11.8	253.3
ResNet	40	0.877	6.635	146.327	2.7	11.8	253.3

Larger Training Set

The previous sections are somewhat biased due to the small number of training examples, a regime where KRR excels. In a final test we therefore investigate the performance of the neural network based density functional for an increasing number of training examples. A similar investigation for the extended KRR model is not feasible, since the training effort for a direct solution of the KRR equations scales cubically with the number of coefficients M(1 + G) and the memory requirements grow quadratically with it. We note, however, that an alternative approach to reduce the computational effort would be to use sparse kernel based machine learning algorithms such as support vector regression[70−72] or sparsified Gaussian process regression.[73−75] These methods could combine the high accuracy of kernel ridge regression even for small training sets with the potential of increasing the region where the model is valid by incorporating a significantly wider range of training examples. The hyperparameters of the ResNet model and the training procedure have to be adjusted due to the increased number of examples. Instead of evaluating the cost function involving all of the training data (batch learning), a random subset or “batch” of 100 examples is used to calculate the gradient descent step and to update the NN parameters (i.e., minibatch learning). The models are trained for a total of 300 000 such iterations with a learning rate of 10⁻⁴ during the first 40 000 steps, after which the learning rate is reduced by 10% every 2000 steps. In addition, the regularization factor λ is lowered as well (see section 4 of the Supporting Information for details). Table 5 shows that the larger number of training examples leads to a steady reduction in every error score. The most significant improvement is observed for the standard deviation and the maximum error, the metrics most closely related to the generalization properties of the model.
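To make the scaling argument concrete, the following back-of-the-envelope estimate computes the storage needed for the extended kernel matrix, assuming it is held as a dense double-precision array (an assumption of this sketch; symmetry or structure could reduce the figure). The function name is hypothetical.

```python
def extended_kernel_bytes(M, G, bytes_per_entry=8):
    """Dense storage of the M(1 + G) x M(1 + G) extended kernel matrix."""
    size = M * (1 + G)
    return size, bytes_per_entry * size**2

size_100, bytes_100 = extended_kernel_bytes(100, 500)     # training set of this work
size_1000, bytes_1000 = extended_kernel_bytes(1000, 500)  # ten times more examples
```

For M = 100 and G = 500 the matrix has 50 100 rows and needs about 20 GB, while M = 1000 already requires roughly 2 TB, before accounting for the cubic cost of solving the linear system.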

Table 5

Absolute Error Values for the Kinetic Energy ΔT and Its Functional Derivative (in kcal/mol) on the N = 2 Test Set, Achieved by the ResNet Model Trained on Sets of Varying Size

	\|ΔT\|			functional derivative
M	mean	std	max	mean	std	max
100	0.049	0.284	6.814	6.78	11.36	223.36
1 000	0.012	0.063	1.922	3.12	2.81	66.31
10 000	0.007	0.018	0.528	2.79	1.92	45.19
100 000	0.007	0.009	0.138	2.39	1.36	19.40

We use a basis of K = 40 sine functions for the iterative calculation of minimum energy densities. Instead of enforcing a convergence threshold, the calculations stop after a fixed number of 10 000 iterations, which simplifies parallelization on GPUs. Typically, the final error values are reached within the first 10% of the iterations. The remaining iterations serve to probe the numerical stability of the procedure on noisy ML predictions of the derivative. The error scores on the iteratively found densities, summarized in Table 6, improve clearly with an increasing number of training examples. The ResNet trained on 100 000 densities achieves a performance similar to the extended KRR model trained on just 100 examples. While this may suggest that KRR should be the obvious choice, one has to keep in mind that the evaluation times for the ResNet model are independent of the number of training examples, and that training examples are typically available in abundance: in fact, every single step in a self-consistent Kohn–Sham DFT calculation could serve as training input.

Table 6

Absolute Kinetic Energy Error ΔT (in kcal/mol) as Well as the Integrated Absolute Error Δn of the Iteratively Found Densities on the N = 2 Test Set, Employing the ResNet on Training Sets of Increasing Size

	\|ΔT\|			\|Δn\| × 10⁴
M	mean	std	max	mean	std	max
100	0.856	6.591	145.58	2.7	11.8	253.0
1 000	0.151	1.196	32.65	0.7	1.9	50.5
10 000	0.062	0.228	6.48	0.6	0.6	13.7
100 000	0.047	0.070	0.89	0.6	0.5	5.5
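For readers who want to reproduce error scores of the kind listed in the tables above, the following Python sketch shows one way to compute them. It is an illustration under stated assumptions, not the authors' evaluation code: the helper names are hypothetical, the standard deviation is taken as the population standard deviation of the absolute errors, and the integrated absolute density error is approximated with the trapezoidal rule on a uniform grid with spacing `dx`.

```python
import statistics

def error_summary(abs_errors):
    """Return (mean, std, max) of a sequence of absolute errors.

    Assumption: std is the population standard deviation.
    """
    return (statistics.fmean(abs_errors),
            statistics.pstdev(abs_errors),
            max(abs_errors))

def integrated_abs_error(n_pred, n_ref, dx):
    """Trapezoidal estimate of the integral of |n_pred(x) - n_ref(x)|.

    Both densities are sampled on the same uniform grid (assumption).
    """
    diffs = [abs(p - r) for p, r in zip(n_pred, n_ref)]
    return dx * (sum(diffs) - 0.5 * (diffs[0] + diffs[-1]))

if __name__ == "__main__":
    # Toy numbers only, not data from the paper.
    print(error_summary([1.0, 2.0, 3.0]))
    print(integrated_abs_error([0.0, 1.0, 0.0], [0.0, 0.0, 0.0], dx=0.5))
```

Applied to a test set, `error_summary` would be called once on the list of |ΔT| values and once on the list of per-density `integrated_abs_error` results.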

Conclusion

The predictive capabilities of kernel ridge regression and convolutional neural networks, two well-established machine learning techniques, have been tested on a one-dimensional model system of noninteracting spinless fermions with respect to the kinetic energy and its functional derivative. Extending the work of Snyder et al.,[25] we have investigated whether the original idea of learning the kinetic energy functional for use in iterative calculations of minimum energy densities can be “salvaged” by training machine learning models simultaneously on both the kinetic energy functional and its functional derivative. Besides kernel ridge regression, the method of choice in the original paper, we have evaluated the performance of convolutional neural networks, one of the most successful and widely used machine learning architectures to date. In general, including the functional derivative in the training not only improves the prediction accuracy for the functional derivative but also leads to better generalization toward out-of-training data. This is underlined by the fact that iterative calculations of the minimum energy density are significantly more stable and lead to lower deviations in both the final kinetic energy and the converged density. However, the use of derivative information in kernel ridge regression increases the computational effort significantly and prohibits its application to larger data sets. Neural networks, on the other hand, do not show these limitations. Of the two flavors tested in this study, conventional convolutional networks and the more advanced ResNets, the latter achieves competitive results already on small training sets and improves its performance steadily with increasing data at minimal additional computational cost.
It has recently been shown for the exchange–correlation functional that convolutional neural network based density functionals can easily be extended toward three-dimensional systems.[43] Using similar techniques for the kinetic energy functional might bring us closer to the ambitious objective of a truly orbital-free density functional theory.
