
Linear Weak Scalability of Density Functional Theory Calculations without Imposing Electron Localization.

Marcel D. Fabian, Ben Shpiro, Roi Baer.

Abstract

Linear scaling density functional theory (DFT) approaches to the electronic structure of materials are often based on the tendency of electrons to localize in large atomic and molecular systems. However, in many cases of actual interest, such as semiconductor nanocrystals, system sizes can reach a substantial extension before significant electron localization sets in, causing a considerable deviation from linear scaling. Herein, we address this class of systems by developing a massively parallel DFT approach which does not rely on electron localization and is formally quadratic scaling yet enables highly efficient linear wall-time complexity in the weak scalability regime. The method extends from the stochastic DFT approach described in Fabian et al. ( WIRES: Comp. Mol. Sci. 2019, e1412) but is entirely deterministic. It uses standard quantum chemical atom-centered Gaussian basis sets to represent the electronic wave functions combined with Cartesian real-space grids for some operators and enables a fast solver for the Poisson equation. Our main conclusion is that when a processor-abundant high-performance computing (HPC) infrastructure is available, this type of approach has the potential to allow the study of large systems in regimes where quantum confinement or electron delocalization prevents linear scaling.

Year:  2022        PMID: 35343234      PMCID: PMC9009081          DOI: 10.1021/acs.jctc.1c00829

Source DB:  PubMed          Journal:  J Chem Theory Comput        ISSN: 1549-9618            Impact factor:   6.006


Introduction

In the past few decades, supercomputers' number-crunching power, measured in floating-point operations per second (FLOPS), has grown a million-fold[1] and is currently pushing toward the exaflop (10^18 FLOPS) realm. Combining this technology with electronic structure calculations can revolutionize computational materials science and biochemistry, provided we complement it with algorithms that efficiently exploit massively parallel infrastructure. One of the key questions then becomes how to quantify the efficiency of a given algorithm on a massively parallel machine. A crucial measure in this regard is the speedup, which we define as the ratio S_M = T1(W)/TM(W) between the wall-time T1(W) for executing a given computational work W using a single processor and the wall-time TM(W) for its execution using M processors working in parallel. In operational regimes where the speedup is nearly proportional to M, i.e., S_M ≈ ηM, there is a clear advantage in using a parallel multiprocessor approach; here η is the efficiency, with η = 1 being ideal.
The efficient use of parallel computing was discussed by Amdahl in his seminal paper,[2] in which he identified in W an inherently serial (subscript s) and a parallelizable (subscript p) part, W = Ws + Wp. He assumed that the execution wall-time for completing Ws is independent of M, while the wall-time for Wp decreases linearly with M. Amdahl defined the serial fraction sA = Ws/(Ws + Wp), as measured on a single-processor machine for a given job, independent of M. With this definition, the speedup can be expressed as S_M = 1/(sA + (1 − sA)/M) (Amdahl's law, also called strong scalability) and saturates once M exceeds the value of 1/sA. Gustafson pointed out[3,4] that in real-world usage the definition of the serial fraction should depend on M, because one does not generally take a fixed-size problem, as Amdahl did, but rather scales the workload W with the available computing power. He then defined the serial fraction sG = Ts/(Ts + Tp), where Ts and Tp are the serial and parallel wall-times measured on the M-processor system, and showed that the speedup can be expressed as S_M = M − (M − 1)sG (Gustafson's law, also called weak scalability), enabling a linear speedup which does not inherently saturate as M increases.
These considerations can be applied to electronic structure calculations of extended systems in DFT codes that lower the cubic scaling by taking advantage of electron localization.[5−25] For linear-scaling schemes, the Amdahl serial fraction is expected to be system-size independent (since the timings in both the numerator and the denominator scale linearly with system size), while for codes of higher algorithmic complexity sA decreases as the system size increases.[26] In a weak scalability analysis of linear-scaling codes, Gustafson's serial fraction is also expected to be system-size independent (for the same reason) and can therefore be modeled as sG = M/(M + M0), where M0 is a constant (depending on the hardware and algorithm). For large M the speedup then saturates at a value of ≈M0, but if M0 is very large there is a sizable regime where M ≪ M0, sG is essentially zero, and an ideal linear speedup emerges, as reported, for example, for the CONQUEST code,[6,27] even up to M = 200 000 cores on the Fujitsu-made K computer. It is clear from the studies mentioned above that it is important to determine the strong and weak scalability properties of codes that can use massively parallel machines, because these properties are sensitive to many details concerning hardware, system size, algorithmic scaling, etc.
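To make the two laws concrete, here is a small numerical sketch (illustrative values only: the Amdahl fraction is of the order reported later for Si1379H476, while M0 is a hypothetical hardware/algorithm constant) evaluating both speedup expressions.

```python
# Sketch: Amdahl vs Gustafson speedups for illustrative (hypothetical) serial fractions.
import numpy as np

def amdahl_speedup(M, s_A):
    """Strong scalability: fixed total workload, serial fraction s_A measured on 1 processor."""
    return 1.0 / (s_A + (1.0 - s_A) / M)

def gustafson_speedup(M, s_G):
    """Weak scalability: workload grows with M, serial fraction s_G measured on M processors."""
    return M - (M - 1) * s_G

M = np.array([1, 10, 100, 1000, 10000])
s_A = 9e-5            # of the order of the Amdahl fraction reported for Si1379H476
M0 = 1e5              # hypothetical constant of the sG model discussed above
s_G = M / (M + M0)    # model form consistent with the discussion above

print("Amdahl   :", np.round(amdahl_speedup(M, s_A), 1))
print("Gustafson:", np.round(gustafson_speedup(M, s_G), 1))
# Amdahl saturates near 1/s_A ~ 1.1e4; Gustafson stays near M as long as M << M0.
```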
In this paper we develop an efficiently parallelizable, (semi)local DFT approach which scales quadratically with system size and does not involve approximations derived from assuming electron localization. It combines several approaches, such as atom-centered Gaussian basis sets and real-space grids for providing the electrostatic and exchange–correlation energies (similar to SIESTA[9] and CP2K/Quickstep[7]), as well as Chebyshev expansion techniques for representing the density matrix.[16,28−30] We describe the theory and implementation in the Method section, where we also illustrate the nonlocalized nature of the electrons in the large benchmarking systems we use (see Figure 1). Next, we present the algorithmic complexity and the parallel strong/weak scalability properties of our approach in the section on scaling properties, and finally, we summarize and discuss the conclusions in the last section.
Figure 1

Fock (FKS), overlap (S), and DM (P) matrices of the Si705H300 and (H2O)471 clusters (calculated within the LDA), shown as color-coded plots. The basis sets of the two systems are similar in size, with K ≈ 6000. For clearer inspection of the sparsity pattern, the rows and columns are permuted so as to achieve a minimum bandwidth around the diagonal. We applied the MinimumBandWidthOrdering command of Mathematica[34] to the atomic proximity matrix D = Θ(R0 − R) (where R is the distance between any pair of atoms A, B and R0 = 10a0 is the proximity distance), giving a permutation which is then used to order the atom-centered basis functions.


Method

In our method, we work with standard quantum chemistry basis sets composed of atom-centered local functions ϕα(r), α = 1, ···, K. For calculating the necessary integrals, solving the Poisson equation, and generating the exchange–correlation potentials, we use a 3D Cartesian real-space grid of equidistant points spanning a simulation box containing the system's atoms and electronic density. For this purpose, we developed an efficient method for evaluating the basis functions on the relevant set of grid points, outlined in Supporting Information A. Our way of combining basis functions and real-space grids is similar in spirit to existing approaches in the literature, such as SIESTA[9] and CP2K/Quickstep,[7] but differs in important details. Unlike SIESTA, we use standard nonorthogonal Gaussian basis sets, and unlike Quickstep, we represent the basis functions on the grid, where all integrals are performed as summations.
The first type of integral that we evaluate on the grid is the overlap matrix, Sαβ = h³ Σg ϕα(rg) ϕβ(rg), where rg are the grid points and h is the grid spacing. Next, the kinetic energy integrals are evaluated (in atomic units) as Tαβ = (h³/2) Σg ∇ϕα(rg)·∇ϕβ(rg), where the derivatives of the basis functions are calculated analytically and then placed on the grid (see Supporting Information A.2.3 for details). To avoid an excessive number of grid points, the equally spaced grid is complemented with norm-conserving pseudopotentials,[31] representing the effects of the tightly bound core electrons (which are not treated explicitly). These enter the KS Hamiltonian, represented by the Fock matrix FKS = T + VNL + VKS, where VNL contains the integrals of the nonlocal pseudopotential and VKS,αβ = h³ Σg ϕα(rg) vKS(rg) ϕβ(rg) are the KS potential integrals, with vKS(r) = vloc(r) + vH[n](r) + vxc[n](r). Here, vH[n](r) is the Hartree potential on the grid, evaluated directly from the grid representation of the electron density n(r) by a reciprocal-space-based method for treating long-range interactions.[32] The exchange–correlation potential vxc[n](r) (within the local density approximation (LDA)) is also determined on the grid directly from the electron density. From the grid representation of the pseudopotentials we obtain the local potential contribution vloc(r) of each nucleus C at its position and, by grid integration, the matrix VNL. All integral calculations are performed in parallel for different basis-function pairs; for more details, see Supporting Information C.
The electron density on the grid is formally defined as n(rg) = 2 Σαβ ϕα(rg) Pαβ ϕβ(rg), where P is the density matrix (DM) and the factor of 2 comes from integration over the spin degrees of freedom. The DM must obey an electron-conserving criterion, namely that the integral over all grid points evaluates to the total number of electrons in the system: h³ Σg n(rg) = Ne. Indeed, performing this integral and using the definitions of S and n(rg), we find 2 Tr(PS) = Ne. This relation is part of a more general requirement, that the Kohn–Sham eigenstates are populated according to the Fermi–Dirac function fFD(ε) = 1/(1 + exp[β(ε − μ)]), where ε is the corresponding energy eigenvalue. For the DM, this condition can be satisfied by defining[21] P = fFD(S⁻¹FKS)S⁻¹. For finite-temperature DFT, β is the inverse temperature and μ is the chemical potential. For ground-state calculations, β obeys β(εL − εH) ≫ 1, where εL (εH) is the Kohn–Sham eigenvalue of the lowest unoccupied (highest occupied) molecular orbital. The chemical potential in the Fermi–Dirac function is adjusted to reproduce the system's number of electrons Ne through the electron-conservation condition above.
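To make the grid-based construction above concrete, here is a minimal, self-contained 1D toy sketch (hypothetical Gaussian centers and exponents; the production code works in 3D with pseudopotentials) showing how an overlap matrix and a Fermi–Dirac density matrix P = fFD(S⁻¹F)S⁻¹ could be assembled from summations over grid points.

```python
# Minimal 1D toy sketch of the grid-based construction: overlap and kinetic matrices by
# grid summation, and the Fermi-Dirac density matrix P = f_FD(S^-1 F) S^-1.
# Gaussian centers/exponents are hypothetical; the production code is 3D and uses pseudopotentials.
import numpy as np
from scipy.linalg import eigh

h = 0.2                                   # grid spacing (a0)
grid = np.arange(-15.0, 15.0, h)          # 1D real-space grid
centers = np.array([-2.0, 0.0, 2.0])      # hypothetical atom positions
alphas = np.array([0.8, 1.0, 0.9])        # hypothetical Gaussian exponents

# Normalized 1D Gaussian basis functions evaluated on the grid: phi[alpha, g]
phi = np.array([(2 * a / np.pi) ** 0.25 * np.exp(-a * (grid - c) ** 2)
                for c, a in zip(centers, alphas)])

S = h * phi @ phi.T                       # overlap by grid summation (h^3 in 3D)

# Kinetic energy from analytic derivatives placed on the grid (atomic units)
dphi = np.array([-2 * a * (grid - c) * p for c, a, p in zip(centers, alphas, phi)])
F = 0.5 * h * dphi @ dphi.T               # kinetic-only stand-in for the Fock matrix F_KS

beta, mu = 100.0, 0.3                     # inverse temperature and chemical potential (hypothetical)
fFD = lambda e: 1.0 / (1.0 + np.exp(np.clip(beta * (e - mu), -500, 500)))

# P = f_FD(S^-1 F) S^-1 via the generalized eigenproblem F C = S C eps (with C^T S C = 1)
eps, C = eigh(F, S)
P = (C * fFD(eps)) @ C.T

print("electron count 2*Tr(PS):", 2 * np.trace(P @ S))   # electron-conservation check
```

In the sketch the DM is built by dense diagonalization for transparency; the linear-scaling, diagonalization-free route used in the actual method (Chebyshev expansion applied to column vectors) is illustrated further below.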
The use of atom-centered local basis functions allows for sparsity in the basic matrices FKS and S, as illustrated in Figure 1 for two systems of similar size but different chemical nature: a 2.5 nm (diameter) semiconductor nanocrystal, Si705H300, and a 3 nm water cluster, (H2O)471. For the matrix representation in Figure 1, we ordered the atoms (and the basis functions associated with them) in a way that takes their spatial proximity into account (nearby atoms tend to have similar indices). It is therefore clear by mere inspection that FKS and S have a relatively small spatial range and are quite sparse. Our approach exploits this property by using sparse matrix algebra. Despite the spatial locality of FKS and S, P in these large systems is highly nonlocal, expressing the physical fact that the electronic coherence in these systems is long-ranged. For the silicon system this fits our intuition, since silicon is by nature a semiconductor, with properties close to those of metals. Although water is a large-band-gap system, it is known that under the LDA it exhibits very small HOMO–LUMO gaps[35−38] (see also Figure 2).
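The bandwidth-reducing permutation behind Figure 1 can be reproduced in spirit with standard sparse-graph tools. The sketch below uses SciPy's reverse Cuthill–McKee ordering as a stand-in for Mathematica's MinimumBandWidthOrdering command and random placeholder coordinates rather than the actual Si705H300/(H2O)471 geometries; it builds the atomic proximity matrix D = Θ(R0 − R) and derives an ordering from it.

```python
# Sketch: bandwidth-reducing ordering of an atomic proximity matrix.
# Reverse Cuthill-McKee (SciPy) stands in for Mathematica's MinimumBandWidthOrdering;
# atom coordinates are random placeholders, not the benchmark geometries.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 30.0, size=(200, 3))   # hypothetical atom positions (a0)
R0 = 10.0                                        # proximity distance (a0), as in Figure 1

R = cdist(coords, coords)                        # all pairwise atomic distances
D = (R < R0).astype(int)                         # proximity matrix D = Theta(R0 - R)

perm = reverse_cuthill_mckee(csr_matrix(D), symmetric_mode=True)
D_perm = D[np.ix_(perm, perm)]                   # permuted matrix, concentrated near the diagonal

bandwidth = lambda A: max(abs(i - j) for i, j in zip(*np.nonzero(A)))
print("bandwidth before:", bandwidth(D), " after:", bandwidth(D_perm))
# Expanding the same permutation over each atom's basis functions orders the rows/columns
# of F_KS and S, exposing the band structure seen in Figure 1.
```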
Figure 2

Density of states (DOS) as a function of energy, shifted by the chemical potential μ, for a water cluster (H2O)100 within the LDA at a fixed geometry. The three panels compare all-electron calculations (performed by Q-Chem[43]) with valence-electron-only calculations (performed with the present approach, using pseudopotentials). Each panel presents the results for a different Gaussian basis set, from single- to triple-ζ quality (STO-3G, 6-31G, and 6-311G). Both calculations use the eigenvalues εi of the converged KS Hamiltonian to obtain the broadened DOS function ρDOS(ε) = Σi (σ√(2π))⁻¹ exp[−(ε − εi)²/(2σ²)], where σ = 0.01Eh. The calculation with the present approach used β = 100Eh⁻¹ with a real-space grid of spacing Δx = 0.33a0.

The various expectation values of relevant observables (i.e., operators in the grid representation) can be expressed as trace operations, ⟨Ô⟩ = 2 Tr(OP), where Oαβ = h³ Σg ϕα(rg) O(rg) ϕβ(rg) is the matrix representation of the one-body operator Ô in the atomic basis. In order to expedite the calculation we need to parallelize the computational work, and this can be done by representing the trace operations as a sum over unit column vectors uα (with coordinates (uα)β = δαβ, i.e., zeros in all positions except at α), computed column by column: Tr(OP) = Σα uαᵀO(Puα). To achieve this, we treat the DM as an operator; that is, we devise a linear-scaling method for applying it to the column vector uα, based on the definition given above: Puα = fFD(S⁻¹FKS)S⁻¹uα. The operation S⁻¹uα is performed by a linear-scaling preconditioned conjugate-gradient approach involving repeated applications of the sparse overlap matrix S to column vectors. The operation of fFD(S⁻¹FKS) on the column vector S⁻¹uα employs a Chebyshev expansion[28,40] of the function fFD(ε), which results in repeated applications of the operator S⁻¹FKS to column vectors. Details are described in Supporting Information B. The entire procedure can be readily distributed over several processors in parallel, each commissioned with a distinct set of uα column vectors. This calculation method has the additional benefit that it avoids storage of the nonsparse DM. We discuss the algorithmic complexity of the approach, as well as its weak and strong scalability, in the next section.
The equations above and the techniques of their application form a set of nonlinear equations that must be solved together to give the self-consistent-field (SCF) solution. The procedure is iterative and uses the direct inversion in the iterative subspace (DIIS) convergence acceleration method.[41] Once converged, various expectation values such as charges and multipoles, density of states, and polarizability can be calculated, as well as forces on the nuclei,[42] which can be used for structure optimization.
In order to check and validate the implementation of the algorithm outlined above, we show in Figure 2 the density of states (DOS), shifted by the chemical potential μ, of a cluster of 100 water molecules, obtained both with our program and with the all-electron calculation performed in the commercially available quantum chemistry program Q-Chem.[43] Our code used β = 100Eh⁻¹, but we also tested larger values of β to ascertain that the results are visibly identical. We made comparisons using three different basis sets, ranging from single- to triple-ζ quality (STO-3G, 6-31G, and 6-311G). To complement the picture, we also give the frontier orbital energies, band gaps, and chemical potentials corresponding to these calculations in Table 1.
Table 1

Comparison of Frontier Energy Levels, the Band Gap εg = εL – εH, and the Chemical Potential μ = (εH + εL)/2 for the DOS Calculations of Figure 2 (all values in eV)

basis     method          εH      εL      εg      μ
STO-3G    present        –4.5    –2.6     1.9    –3.6
          all-electron    1.3     4.5     3.1     2.9
          SBKJC          –5.3    –1.6     3.7    –3.4
6-31G     present        –3.5    –1.9     1.5    –2.7
          all-electron   –3.5    –2.2     1.4    –2.8
6-311G    present        –4.7    –3.3     1.4    –4.0
          all-electron   –4.2    –2.7     1.4    –3.4
Looking at the shifted DOS, the results of both Q-Chem and the present code converge, with increasing basis set quality, to values indistinguishable from those of the all-electron highest-quality-basis calculation. This validates our code, even though a small shift (−0.6 eV) still exists between the chemical potentials, as seen in Table 1. It is noteworthy that the DOS in our code is less sensitive to basis set quality than in the all-electron code, where for the smallest STO-3G basis set the all-electron calculations deviate strongly from the converged basis set values, showing a large (6.3 eV) shift and a band gap which is more than a factor of two too large. The stability of our calculations relative to Q-Chem can be attributed to the use of the norm-conserving pseudopotentials. Indeed, in Supporting Information F we show that effective core potentials stabilize the Q-Chem small-basis-set calculations as well. An additional validation of our approach can be found in Supporting Information G, where we compare the potential energy surface of the H2 molecule calculated with both our code and Q-Chem and show the influence of the grid spacing on the accuracy of the calculation. Overall, the approximations that we employ lead to a systematic difference of ∼0.2% in the electronic energy compared with Q-Chem over most of the examined distance range (and maximally ∼0.4%), as well as a small corrugation which appears when the grid-point spacing is larger than the width of the smallest Gaussian primitive. The relative errors in the electronic energy, and the fact that they are mostly a rigid shift, lead to deviations on the order of 0.05 eV in the bond energy, much smaller than typical 6-311G basis set errors.[44]
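As a minimal illustration of the column-by-column trace evaluation and Chebyshev expansion described above, here is a dense toy sketch (random symmetric matrices and hypothetical K, β, and μ; not the production sparse, MPI-distributed implementation) that applies fFD(S⁻¹FKS)S⁻¹ to unit column vectors and accumulates a trace, with a dense reference value for comparison.

```python
# Toy sketch of the column-by-column trace evaluation: apply P = f_FD(S^-1 F_KS) S^-1 to unit
# vectors u_alpha via a Chebyshev expansion and accumulate Tr(O P). Dense random matrices stand
# in for the sparse F_KS and S of the real code; the spin factor of 2 and the distribution of
# the alpha loop over MPI ranks are omitted.
import numpy as np

rng = np.random.default_rng(1)
K = 200
X = rng.standard_normal((K, K))
S = np.eye(K) + 0.01 * (X + X.T)                 # well-conditioned "overlap" matrix
Y = rng.standard_normal((K, K))
F = (Y + Y.T) / (2.0 * np.sqrt(K))               # symmetric "Fock" matrix
O = np.diag(rng.random(K))                       # some one-body observable in the atomic basis

beta, mu = 20.0, 0.0
fFD = lambda e: 1.0 / (1.0 + np.exp(beta * (e - mu)))

A = np.linalg.solve(S, F)                        # S^-1 F_KS (sparse solves in the real code)
ev = np.linalg.eigvals(A).real
c, d = 0.5 * (ev.max() + ev.min()), 0.51 * (ev.max() - ev.min())   # shift/scale onto [-1, 1]

# Chebyshev coefficients of f_FD on the spectral interval (Chebyshev-Gauss quadrature)
N = 200
theta = (np.arange(N) + 0.5) * np.pi / N
coeff = (2.0 / N) * np.cos(np.outer(np.arange(N), theta)) @ fFD(np.cos(theta) * d + c)
coeff[0] *= 0.5

def apply_P(u):
    """P u = f_FD(S^-1 F) S^-1 u via the recursion T_{n+1} = 2*A_scaled*T_n - T_{n-1}."""
    v = np.linalg.solve(S, u)                    # S^-1 u (preconditioned CG in the real code)
    t_prev, t_curr = v, (A @ v - c * v) / d
    acc = coeff[0] * t_prev + coeff[1] * t_curr
    for n in range(2, N):
        t_prev, t_curr = t_curr, 2.0 * (A @ t_curr - c * t_curr) / d - t_prev
        acc += coeff[n] * t_curr
    return acc

# Tr(O P) = sum_alpha u_alpha^T O (P u_alpha): each alpha is independent, hence trivially parallel
I = np.eye(K)
trace_OP = sum(I[:, a] @ (O @ apply_P(I[:, a])) for a in range(K))

# Reference value from a dense evaluation (feasible only at this toy size)
w, V = np.linalg.eig(A)
P_dense = np.real((V * fFD(w.real)) @ np.linalg.inv(V)) @ np.linalg.inv(S)
print(trace_OP, np.trace(O @ P_dense))
```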

Scaling Properties of the Method

In this section we study the method’s algorithmic complexity and analyze the speedup achievable by parallelization in terms of strong and weak scalability.

Algorithmic Complexity

To understand the algorithmic complexity of our method, we have to examine how each part of our code scales as we increase the system size K. Here we are especially interested in the asymptotic behavior, meaning that the program part with the largest scaling determines the overall algorithmic complexity. Our SCF cycle, described in detail in Supporting Information A.3, includes the various integral calculations, the solution of the Poisson equation, and the calculation of the density. The integral calculation is expected to scale linearly with system size, i.e., O(K), because the relevant matrices (FKS, S) become sparse (see also Figure 1). The Poisson equation is solved by a fast Fourier transform (FFT), which scales as O(N log N), where N is the number of grid points, itself expected to scale linearly with system size. This leaves only the density calculation, which is done according to the trace formulation above. The application of the DM P to a column vector uα, expressed through a Chebyshev series, involves repeated applications of the operator S⁻¹FKS to the column vector v = S⁻¹uα (see Supporting Information B for details). The length of the Chebyshev expansion is independent of the system size K, so the algorithmic complexity of each Puα operation is identical to that of one S⁻¹FKSv operation, namely, linear in K. There are a total of K different Puα operations (one per basis function), so the overall algorithmic complexity of the method is asymptotically quadratic, i.e., O(K²). As the system size grows, our algorithm could be modified to take advantage of the emerging sparsity of the DM, allowing for a K-independent cost of each Puα operation; in such situations one can expect an overall linear-scaling numerical complexity, i.e., O(K). However, in the present paper we focus on the broad class of systems which are very large but for which the DM has not yet localized; hence, we are in the formally quadratic complexity regime.
To show that quadratic complexity is indeed what we achieve with this method, we plot in Figure 3 the wall-time per SCF cycle versus system size for water clusters (taken from http://www.ergoscf.org/xyz/h2o.php, accessed 2022-03-05) and hydrogen-terminated silicon nanocrystals (a series starting from Si35H36 and reaching Si2785H780; for details, see Supporting Information D), using the STO-3G and the larger 6-31G basis sets. Going from the smaller to the larger basis set increases the wall-time by a factor of 10–20. This result is a combination of several characteristics beyond the mere size of the basis set. For example, the magnitude of the Gaussian exponents of the basis set's primitives is relevant for the dimensioning of the grid. Larger Gaussian exponents require a finer mesh and also increase the kinetic energy component of the Hamiltonian, which increases the Chebyshev expansion length. Smaller (diffuse) Gaussian exponents lead to larger grid windows (see also Supporting Information A.1.1) and hence an increase in overall grid size as well. Furthermore, the implementation of the linear-scaling operation of S⁻¹, involving the incomplete Cholesky decomposition and preconditioned conjugate-gradient algorithms, is sensitive to the condition number of S, which is determined by near linear dependencies between basis functions. As seen in the figure, all cases show overall quadratic algorithmic complexity.
It is worth noting that the small and intermediate-sized systems in the figure exhibit a complexity that varies with system size, reflecting the interplay between linear-complexity processes with a large prefactor and stages that are cubic because the Hamiltonian and overlap matrices are not yet sparse.
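The exponents quoted in the caption of Figure 3 come from a power-law fit t = AKⁿ; a minimal sketch of such a fit (with fabricated wall-time data standing in for the measured timings) is:

```python
# Sketch: extract the scaling exponent n from wall-time vs system-size data by a log-log fit.
# The (K, t) pairs below are fabricated placeholders illustrating roughly quadratic growth;
# the actual timings are those plotted in Figure 3.
import numpy as np

K = np.array([2000, 4000, 8000, 16000, 32000])   # number of basis functions (hypothetical)
t = np.array([0.05, 0.21, 0.83, 3.4, 13.5])      # wall-time per SCF cycle in hours (hypothetical)

n, logA = np.polyfit(np.log(K), np.log(t), 1)    # fit log t = n*log K + log A
print(f"fitted exponent n = {n:.2f}, prefactor A = {np.exp(logA):.2e}")
```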
Figure 3

Wall-time as a function of system size, for the water clusters and the silicon nanocrystals, calculated using two basis sets within the LDA. The calculations were performed on eight Intel Xeon Gold 6132 CPUs @ 2.60 GHz with 755 GB RAM (connected through InfiniBand), using 112 cores for all systems. The dotted lines in the panels are guides to the eye indicating quadratic scaling. Fitting the function t = AKⁿ to the data in the larger-size range, one obtains the exponent n = 1.9 (2.1) for both the water and the silicon systems in the STO-3G (6-31G) basis.


Strong Scalability

In Figure 4 we study the strong scalability properties of our code, i.e., the scalability achievable when increasing the number of processors for a given task. The figure shows the speedup and efficiency for a single SCF iteration of the Si1379H476 nanocrystal. Our definition of the speedup requires knowledge of the elapsed wall time it takes a single processor (more precisely, one core) to finish this nanocrystal calculation. Because of (human) time constraints, we extrapolated this timing from a calculation on 36 cores of a single compute node as T1 = 36T36. The results can be analyzed in terms of Amdahl's law, yielding a serial fraction sA = 9 × 10⁻⁵, which indicates a high degree of parallelization. Accordingly, the parallelization efficiency drops very slowly as the number of processors increases, with 96% efficiency even at M = 500 (see the inset in Figure 4). We emphasize that this is achieved with 10 Gb Ethernet network communication; the decay of efficiency could potentially be slowed further by employing a faster communication solution. According to Amdahl's law, the efficiency will drop to 0.5 when M ≈ 1/sA ≈ 1.1 × 10⁴ cores. In Supporting Information E we show results for a smaller system, Si705H300, where the Amdahl serial fraction is larger, sA = 2 × 10⁻⁴, a system-size dependence due to the quadratic complexity of our method (see the discussion of algorithmic complexity above).
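For reference, the Amdahl analysis above can be reproduced from two timing measurements. In the sketch below the 36-core and 500-core timings are made-up placeholders, chosen only so that the efficiency at M = 500 is ∼96% and hence sA comes out near the quoted 9 × 10⁻⁵.

```python
# Sketch: estimate Amdahl's serial fraction from two timings and predict the efficiency curve.
# T1 is extrapolated as 36*T36 (as in the text); T36 and T500 here are made-up placeholders.
import numpy as np

T36 = 2.0                       # hypothetical wall-time on 36 cores (hours)
T1 = 36 * T36                   # extrapolated single-core reference, as done in the paper
M, T_M = 500, T1 / 479.0        # hypothetical measurement at 500 cores (~96% efficiency)

speedup = T1 / T_M
s_A = (M / speedup - 1.0) / (M - 1.0)            # invert Amdahl's law for the serial fraction
print(f"estimated s_A = {s_A:.1e}")

efficiency = lambda m: 1.0 / (1.0 + (m - 1.0) * s_A)
for m in (500, 2000, int(1 / s_A)):
    print(f"M = {m:6d}: efficiency = {efficiency(m):.2f}")
# Efficiency falls to ~0.5 once M approaches 1/s_A.
```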
Figure 4

Strong scalability speedup (upper panel) and efficiency (lower panel) analysis for Si1379H476. The single-processor reference time for the speedup is extrapolated as T1 = 36T36. The inset in the lower panel shows, at higher resolution, the efficiency regime close to unity. The calculations used the 6-31G basis set (11 984 basis functions) within the LDA and were performed on several 2.60 GHz Intel Xeon Gold 6240 nodes with 256 GB RAM, using 10 Gb Ethernet network communication.


Weak Scalability

In this section we focus on the weak scalability properties of our method, namely, how the wall time changes with the system size K when the number of processors M afforded to the calculation grows in fixed proportion, r = K/M. In the left panel of Figure 5 we present the wall-time T as a function of system size K for six series of runs with different fixed ratios, ranging from r = 4 up to r = 120 (in the actual calculation, r is the number of vectors uα assigned to each processor). The markers of each series fall on asymptotically straight lines in the log–log plot which appear parallel to the black-dashed line, indicating a constant slope of 1. This confirms the claim of linear-scaling wall-time in this regime of operation, where r is held constant.
We would also like to examine the speedup in order to determine the efficiency of our calculation on the parallel machine. Calculating the speedup requires an estimate of the single-processor wall time T1(W), which for the large systems is not easily accessible because of (human) time constraints. Therefore, we developed a model for the wall time, TM(K) = Tp(K)/M + Ts(K, M), with which we estimate the M = 1 wall times. The first term on the right is the dominant parallelizable part of the calculation run on M processors (the electron density calculation; see Supporting Information C for more information); it is governed by a time constant τ1 and a crossover size K0: for K ≫ K0 it exhibits quadratic scaling with K, while for K ≪ K0 the scaling is cubic because of the insufficient sparsity of the Hamiltonian and overlap matrices at small K. The second term reflects the timing of the serial part of the calculation, dominated by the communication time needed for specific MPI functions (reduce and broadcast); it is governed by a time constant τ2 and scales linearly with K and logarithmically with M.
Figure 5

Weak scalability speedup analysis. Left: the wall-time for a single SCF cycle versus the number of basis functions K, for several fixed values of r = K/M (where M is the number of processors), on a series of eight hydrogen-terminated silicon nanocrystals (detailed in Supporting Information D), using the 6-31G basis set within the LDA. The black-dashed line is a guide to the eye showing linear-scaling wall-time. The calculations were run on several Intel Xeon Gold 6240 CPUs @ 2.60 GHz with 256 GB RAM, connected through 10 Gb Ethernet network communication. We had access to at most 1584 cores; therefore, for r = 4 we could not treat systems larger than K = 6420, and similar though less stringent limitations applied for r = 8 and 16. The colored dotted lines are the best fit of the data to our wall-time model. Right: the scaled speedup as a function of the number of processors M, calculated for the six values of r using the best-fit parameters of our model. The black-dashed line indicates the "perfect" speedup S = M.

Using the analytical model, the scaled speedup can be obtained in closed form by inserting the wall-time model into the speedup definition with K = rM. From the resulting expression it can be seen that for asymptotically large values of M the speedup approaches a limit controlled by the serial communication term, and that as long as r is not too small the speedup remains close to the ideal value of M. We now fit our model to the timing results from the six constant-r series shown in the left panel of Figure 5 (a total of 32 data points). This gives a best-fit set of parameters (in hours): τ1 = 5.16 × 10⁻⁶ h, τ2 = 5.63 × 10⁻⁷ h, and K0 = 2292.6. The resulting fit functions are plotted in the left panel of the figure as dotted colored lines, one for each value of r, and can be seen to reproduce the actual data (points) quite closely. Having the best-fit parameters, we can now discuss the estimated values of the (scaled) speedup in the Gustafson sense, plotted in the right panel of Figure 5. We see that for r > 16 the speedup is not far from ideal, in accordance with the analysis presented above; however, the speedup is smaller when r decreases, as is clearly visible for r = 8 at small M and for r = 4 at all values of M. Even for these small-r cases, though, the speedup keeps growing as M increases and the calculation remains quite efficient.
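Since the closed-form model equation is not reproduced in this database extraction, the sketch below adopts one plausible functional form consistent with the limits described above (cubic in K below K0, quadratic above it, plus a K·log M serial term) and uses the quoted best-fit parameters to evaluate the scaled speedup. The specific formula Tp(K) = τ1K³/(K + K0) is an assumption for illustration, not the published model.

```python
# Sketch: weak-scalability (Gustafson-style) speedup from a wall-time model.
# ASSUMED functional form: T_M(K) = tau1*K**3/(K + K0)/M + tau2*K*log(M).
# It reproduces the stated limits (cubic for K << K0, quadratic for K >> K0,
# serial part ~ K*log M) but is not necessarily the exact published expression.
import numpy as np

tau1, tau2, K0 = 5.16e-6, 5.63e-7, 2292.6      # best-fit parameters quoted in the text (hours)

def wall_time(K, M):
    parallel = tau1 * K**3 / (K + K0) / M      # density calculation, distributed over M cores
    serial = tau2 * K * np.log(M)              # MPI reduce/broadcast communication
    return parallel + serial

def scaled_speedup(M, r):
    """Gustafson-style speedup: the workload K = r*M grows with the processor count."""
    K = r * M
    return wall_time(K, 1) / wall_time(K, M)

M = np.array([8, 32, 128, 512, 1584])
for r in (4, 16, 120):
    print(f"r = {r:4d}:", np.round(scaled_speedup(M, r), 1))
# With this assumed form, the speedup stays near M for large r and degrades for small r,
# qualitatively matching the right panel of Figure 5.
```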

Summary and Conclusions

In this paper we presented a parallelizable electronic structure approach to finite-temperature density functional theory with (semi)local functionals, using atom-centered Gaussian basis sets, which offers linear wall-time complexity as a function of system size in the weak scalability regime. The inherent time complexity of the method is quadratic, O(K²), as discussed in the section on algorithmic complexity, and it does not involve the truncation of density matrix elements characteristic of linear-scaling approaches. Our trace-based calculation combined with Chebyshev expansions allows for efficient parallelization in the strong scalability sense, as shown above. Because of the quadratic complexity, we found that the Amdahl parameter was system-size dependent, with sA = 2 × 10⁻⁴ for the Si705H300 system and sA = 9 × 10⁻⁵ for Si1379H476. The overall weak scalability performance shows that linear-scaling wall time is achievable, as demonstrated in the weak scalability subsection, and is highly efficient when the number of orbitals per processor, r, is not smaller than ∼10; below that, the efficiency drops by a factor of ∼1.5.
Our main conclusion is that this type of approach has the potential to be a useful and efficient tool for studying large systems in regimes where quantum confinement or electron delocalization prevents traditional linear scaling from setting in. Furthermore, for even larger systems, where electrons do localize, we plan to enable linear scaling either through stochastic orbital methods[21] or by directly exploiting the DM's finite range. While in this paper we were concerned mainly with the scalability of the density calculation, force evaluation, done after the density converges, is also an important goal, high on our list of future plans. We will follow our recent work developing a stochastic estimation of the exact energy-derivative (Hellmann–Feynman) forces.[42] As shown in ref 45, such stochastic estimates lead to noisy forces that can be used only within Langevin dynamics. In our case, we expect that the deterministic evaluation of the exact derivatives will result in deterministic forces of sufficient quality to enable energy-conserving molecular dynamics simulations.
References (23 in total)

1.  Tight-binding electronic-structure calculations and tight-binding molecular dynamics with localized orbitals.

Authors: 
Journal:  Phys Rev B Condens Matter       Date:  1995-04-15

2.  CP2K: An electronic structure and molecular dynamics software package - Quickstep: Efficient and accurate electronic structure calculations.

Authors:  Thomas D Kühne; Marcella Iannuzzi; Mauro Del Ben; Vladimir V Rybkin; Patrick Seewald; Frederick Stein; Teodoro Laino; Rustam Z Khaliullin; Ole Schütt; Florian Schiffmann; Dorothea Golze; Jan Wilhelm; Sergey Chulkov; Mohammad Hossein Bani-Hashemian; Valéry Weber; Urban Borštnik; Mathieu Taillefumier; Alice Shoshana Jakobovits; Alfio Lazzaro; Hans Pabst; Tiziano Müller; Robert Schade; Manuel Guidon; Samuel Andermatt; Nico Holmberg; Gregory K Schenter; Anna Hehn; Augustin Bussy; Fabian Belleflamme; Gloria Tabacchi; Andreas Glöß; Michael Lass; Iain Bethune; Christopher J Mundy; Christian Plessl; Matt Watkins; Joost VandeVondele; Matthias Krack; Jürg Hutter
Journal:  J Chem Phys       Date:  2020-05-21       Impact factor: 3.488

3.  Overlapped embedded fragment stochastic density functional theory for covalently-bonded materials.

Authors:  Ming Chen; Roi Baer; Daniel Neuhauser; Eran Rabani
Journal:  J Chem Phys       Date:  2019-01-21       Impact factor: 3.488

4.  Self-averaging stochastic Kohn-Sham density-functional theory.

Authors:  Roi Baer; Daniel Neuhauser; Eran Rabani
Journal:  Phys Rev Lett       Date:  2013-09-04       Impact factor: 9.161

5.  Stochastic density functional theory: Real- and energy-space fragmentation for noise reduction.

Authors:  Ming Chen; Roi Baer; Daniel Neuhauser; Eran Rabani
Journal:  J Chem Phys       Date:  2021-05-28       Impact factor: 3.488

6.  Equilibrium configurations of large nanostructures using the embedded saturated-fragments stochastic density functional theory.

Authors:  Eitam Arnon; Eran Rabani; Daniel Neuhauser; Roi Baer
Journal:  J Chem Phys       Date:  2017-06-14       Impact factor: 3.488

7.  Stochastic embedding DFT: Theory and application to p-nitroaniline in water.

Authors:  Wenfei Li; Ming Chen; Eran Rabani; Roi Baer; Daniel Neuhauser
Journal:  J Chem Phys       Date:  2019-11-07       Impact factor: 3.488

8.  Large scale and linear scaling DFT with the CONQUEST code.

Authors:  Ayako Nakata; Jack S Baker; Shereif Y Mujahed; Jack T L Poulton; Sergiu Arapan; Jianbo Lin; Zamaan Raza; Sushma Yadav; Lionel Truflandier; Tsuyoshi Miyazaki; David R Bowler
Journal:  J Chem Phys       Date:  2020-04-30       Impact factor: 3.488

9.  Siesta: Recent developments and applications.

Authors:  Alberto García; Nick Papior; Arsalan Akhtar; Emilio Artacho; Volker Blum; Emanuele Bosoni; Pedro Brandimarte; Mads Brandbyge; J I Cerdá; Fabiano Corsetti; Ramón Cuadrado; Vladimir Dikan; Jaime Ferrer; Julian Gale; Pablo García-Fernández; V M García-Suárez; Sandra García; Georg Huhs; Sergio Illera; Richard Korytár; Peter Koval; Irina Lebedeva; Lin Lin; Pablo López-Tarifa; Sara G Mayo; Stephan Mohr; Pablo Ordejón; Andrei Postnikov; Yann Pouillon; Miguel Pruneda; Roberto Robles; Daniel Sánchez-Portal; Jose M Soler; Rafi Ullah; Victor Wen-Zhe Yu; Javier Junquera
Journal:  J Chem Phys       Date:  2020-05-29       Impact factor: 3.488

10.  Performance analysis of electronic structure codes on HPC systems: a case study of SIESTA.

Authors:  Fabiano Corsetti
Journal:  PLoS One       Date:  2014-04-18       Impact factor: 3.240
