Linear-Scaling Systematic Molecular Fragmentation Approach for Perturbation Theory and Coupled-Cluster Methods.

Abstract

The coupled-cluster (CC) singles and doubles with perturbative triples [CCSD(T)] method is frequently referred to as the "gold standard" of modern computational chemistry. However, the high computational cost of CCSD(T) [$O(N^7)$], where N is the number of basis functions, limits its applications to small-sized chemical systems. To address this problem, efficient implementations of linear-scaling coupled-cluster methods, which employ the systematic molecular fragmentation (SMF) approach, are reported. In this study, we aim to do the following: (1) To achieve exact linear scaling and to obtain a pure ab initio approach, we revise the handling of nonbonded interactions in the SMF approach; the revised approach is denoted by LSSMF. (2) A new fragmentation algorithm, which yields smaller-sized fragments and therefore better fits high-level CC methods, is introduced. (3) A modified nonbonded fragmentation scheme is proposed to enhance the existing algorithm. Performances of the LSSMF-CC approaches, such as LSSMF-CCSD(T), are compared with their canonical versions for a set of alkane molecules, C$_n$H$_{2n+2}$ (n = 6–10), which includes 142 molecules. Our results demonstrate that the LSSMF approach introduces negligible errors compared with the canonical methods; mean absolute errors (MAEs) are between 0.20 and 0.59 kcal mol–1 for LSSMF(3,1)-CCSD(T). For a larger alkanes set (L12), C$_n$H$_{2n+2}$ (n = 50–70), the performance of LSSMF for the second-order perturbation theory (MP2) is investigated. For the L12 set, various bonded and nonbonded levels are considered. Our results demonstrate that the combination of bonded level 6 with nonbonded level 2, LSSMF(6,2), provides very accurate results for the MP2 method with an MAE value of 0.32 kcal mol–1. The LSSMF(6,2) approach yields more than a 26-fold reduction in errors compared with LSSMF(3,1). Hence, we obtain substantial improvements over the original SMF approach. 
To illustrate the efficiency and applicability of the LSSMF-CCSD(T) approach, we consider an alkane molecule with 10,004 atoms. For this molecule, the LSSMF(3,1)-CCSD(T)/cc-pVTZ energy computation, on a Linux cluster with 100 nodes, with 4 cores and 5 GB of memory provided to each node, is completed in just ∼24 h. As a second test, we consider a biomolecular complex (PDB code: 1GLA), which includes 10,488 atoms, to assess the efficiency of the LSSMF approach. The LSSMF(3,1)-FNO-CCSD(T)/cc-pVTZ energy computation is completed in ∼7 days for the biomolecular complex. Hence, our results demonstrate that the LSSMF-CC approaches are very efficient. Overall, we conclude the following: (1) The LSSMF(m, n)-CCSD(T) methods can be reliably used for large-scale chemical systems, where the canonical methods are not computationally affordable. (2) The accuracy of bonded level 3 is not satisfactory for large chemical systems. (3) For high-accuracy studies, bonded level 5 (or higher) and nonbonded level 2 should be employed.

Entities: Chemical

Mesh：
Algorithms
Alkanes

Substances：
Alkanes

Year: 2022 PMID： 35972734 PMCID： PMC9476663 DOI： 10.1021/acs.jctc.2c00587

Source DB: PubMed Journal: J Chem Theory Comput ISSN： 1549-9618 Impact factor: 6.578

Introduction

It has been demonstrated that coupled-cluster (CC) methods are accurate for the prediction of molecular properties.[1−5] The coupled-cluster singles and doubles (CCSD) method[6] provides quite accurate results for most molecular systems at equilibrium geometries; nevertheless, a triple excitations correction is required to obtain high accuracy.[7−13] The coupled-cluster singles and doubles with perturbative triples [CCSD(T)] method[10,11,14] provides excellent results for a broad range of chemical systems near equilibrium geometries.[12,15−24] Therefore, the CCSD(T) method is generally referred to as the “gold standard” of computational chemistry. However, the high computational cost of CCSD(T) [$O(N^7)$], where N is the number of basis functions, limits its applications to small-sized chemical systems. There have been many attempts to develop reduced-cost electron correlation methods.[25−35] Some of these studies take advantage of the locality of molecular orbitals (MOs), which is based on the idea that dynamic correlation is a short-range phenomenon. The introduction of a “correlation domain” concept by Pulay and co-workers[25,26] stimulated local correlation approaches. There are a few variants of local CC methods, such as projected atomic orbitals-based local CC methods (PAO-LCC)[28,29] and local pair natural orbitals (LPNOs).[32−34] Other attempts are the cluster-in-molecule (CIM) approach[36−40] and the divide–expand–consolidate (DEC) approach.[41,42] Molecular fragmentation approaches (MFAs) offer an alternative to LCC methods, and a more effective one, for tackling the molecular size dependence problems of electronic structure theories. 
Various molecular fragmentation approaches have been suggested to overcome the steep scaling problem of electronic structure methods.[43−46] In molecular fragmentation approaches, a molecular system is broken up into small molecular units, and energies of the fragments are combined to approximate the energy of the entire system. Although the logic behind all fragmentation approaches is similar, the formation of fragments, as well as the combination of the fragment energies, differs significantly from method to method. Molecular fragmentation methods include the molecular tailoring approach (MTA),[47−49] fragment molecular orbital theory (FMO),[50−52] molecular fractionation with conjugate caps (MFCC),[53,54] systematic molecular fragmentation (by annihilation) [SMF(A)],[44,55−64] combined fragmentation method (CFM),[60,65] generalized energy-based fragmentation (GEBF),[66] kernel energy method (KEM),[67,68] molecules-in-molecules (MIM) approach,[69] many-overlapping-body expansion (MOBE),[70] and generalized many-body expansion (GMBE).[71] In terms of accuracy and general applicability, the SMF approach appears to be very attractive. The SMF energy is a sum of two components: bonded and nonbonded. We may also call them covalent and noncovalent terms. The number of bonded fragments scales linearly [$O(n)$], where n is the number of groups, while the number of nonbonded fragments scales quadratically [$O(n^2)$]. To reduce the high cost of nonbonded fragments, Collins introduced a cutoff distance (R), such as 2 Å.[61] If the distance between monomers of a nonbonded fragment is smaller than R, then it is treated with electronic structure methods; otherwise, it is treated with a simple perturbation theory approach. 
For branched molecules, Collins’ algorithm yields large-sized fragments compared to the chain-like linear alkanes case, which is another difficulty.[55,56] This situation becomes especially problematic for high-level CC approaches, such as CCSD(T), where the computational cost increases steeply with the molecular size. In this research, to achieve exact linear scaling and to obtain a pure ab initio approach, we completely neglect all long-range nonbonded contributions since they already approach zero. Further, we introduce a new fragmentation algorithm for branched molecules, which yields smaller-sized fragments; hence, the new algorithm better fits high-level CC methods. The new linear-scaling SMF algorithm, denoted by LSSMF, has been coded in C++ by the present authors and added to the MacroQC[72] software. The LSSMF approach is integrated with the Dfocc module.[24,73−82] Hence, all methods available in the Dfocc module can be used with the LSSMF approach. The newly proposed LSSMF-CC approaches, such as LSSMF-CCSD and LSSMF-CCSD(T) as well as LSSMF-MP2, are applied to a series of alkane molecules to demonstrate their efficiencies and accuracies.

Systematic Molecular Fragmentation (SMF)

The SMF approach starts with the molecule M divided into different “groups”. Groups are sets of atoms defined by the SMF algorithm. The basic ideas involved in the method can be illustrated for the simplest case involving a chain-like molecule containing N groups connected by single bonds:

$$M = G_1 G_2 G_3 \cdots G_N$$

The target is to derive an accurate value for the total electronic energy. The energy of the molecule M is determined by a summation over the energies of fragments (F), where each fragment is defined as a combination of groups. The sizes of the fragments depend on the “level” of SMF, and the fragments can overlap with each other since a group can be involved in multiple fragments.[59] Hence, additional fragments with negative coefficients are generated to cancel the effects of multiple counting. The bonded energy is

$$E_{\mathrm{bonded}} = \sum_i f_i \, E(F_i)$$

where $f_i$ is the integer coefficient associated with the fragment $F_i$. For a model system of a chain containing five groups, the SMF fragmentation scheme can be expressed as

$$\text{Level 1:}\quad G_1G_2G_3G_4G_5 \rightarrow G_1G_2 + G_2G_3 + G_3G_4 + G_4G_5 - G_2 - G_3 - G_4$$

$$\text{Level 2:}\quad G_1G_2G_3G_4G_5 \rightarrow G_1G_2G_3 + G_2G_3G_4 + G_3G_4G_5 - G_2G_3 - G_3G_4$$

$$\text{Level 3:}\quad G_1G_2G_3G_4G_5 \rightarrow G_1G_2G_3G_4 + G_2G_3G_4G_5 - G_2G_3G_4$$

Thus, the fragment sizes increase with the level used. However, the number of fragments grows linearly with the size of the system. The authors have noted that the different levels used in SMF are related to some older concepts used in the field of theoretical thermochemistry. For example, level 1 reactions are known as “isodesmic reactions”,[83] level 2 as homodesmotic reactions,[84] and level 3 as hyperhomodesmotic[85] reactions. Since the bonded energy only includes nearby interactions, one should also consider the nonbonded interactions between more distant groups. The nonbonded energy may be evaluated as a sum over the “allowed” nonbonded interactions, which are the interactions that are not already included in $E_{\mathrm{bonded}}$.[44,60]
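For an unbranched chain, the bonded fragments at level m are the runs of m + 1 consecutive groups, with the runs of m groups shared by neighbouring fragments subtracted. The following Python sketch generates that pattern and checks that every group is counted exactly once. The function names `chain_fragments` and `group_balance` are illustrative and are not taken from MacroQC; the sketch covers unbranched chains only.

```python
from collections import Counter

def chain_fragments(n_groups, level):
    """Return [(coefficient, groups)] for an unbranched chain of n_groups groups.

    Groups are numbered 1..n_groups, as in the G1G2...GN notation.
    """
    if not 1 <= level < n_groups:
        raise ValueError("level must satisfy 1 <= level < n_groups")
    size = level + 1
    # Main fragments: every run of (level + 1) consecutive groups, coefficient +1.
    main = [(+1, tuple(range(s, s + size)))
            for s in range(1, n_groups - size + 2)]
    # Overlap of two neighbouring main fragments: a run of `level` groups,
    # entered with coefficient -1 to cancel the double counting.
    overlap = [(-1, tuple(range(s, s + level)))
               for s in range(2, n_groups - level + 1)]
    return main + overlap

def group_balance(fragments):
    """Net number of times each group is counted over all fragments."""
    counts = Counter()
    for coefficient, groups in fragments:
        for g in groups:
            counts[g] += coefficient
    return counts

if __name__ == "__main__":
    for coefficient, groups in chain_fragments(5, 3):
        print(coefficient, groups)
```

The number of fragments returned is 2(n − m) − 1 for n groups at level m, which grows linearly with the chain length.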

New Linear-Scaling SMF (LSSMF) Approach

To illustrate the difference of our LSSMF approach from the previous SMF/SMFA approach(es), let us consider an open-chain alkane molecule. For a chain-like C$_n$H$_{2n+2}$ molecule with the SMF scheme (at level 3), the bonded fragments are just butane and propane fragments when hydrogen caps are added. The nonbonded fragments are just methane dimers with different intermolecular distances. The number of bonded fragments scales linearly with the number of carbons, while the number of nonbonded (NB) fragments scales quadratically. However, one may consider only short-range NB fragments, whose number also scales linearly with the system size. Hence, we introduce a nonbonded cutoff tolerance, Δ. If the distance between the closest atoms of two groups is larger than Δ, then this nonbonded fragment is disregarded. We denote this algorithm by distance-based elimination (DBE). An alternative approach uses the ratio of distance to covalent radii (DCRR),[86] which is evaluated from the Cartesian positions X of the atoms in the two fragments and the covalent radii r of those atoms. Atomic covalent radii are obtained from Cordero et al.[87] In this study, we consider different bonded and nonbonded fragmentation levels. Hence, we introduce the LSSMF(m, n) notation, where m and n indicate the bonded and nonbonded levels, respectively.
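The two screening criteria can be sketched in Python as follows. This is a minimal illustration, not the MacroQC implementation. It assumes that DCRR is the smallest value, over all atom pairs drawn from the two fragments, of the interatomic distance divided by the sum of the two covalent radii; the exact expression did not survive in this text, so that form is an assumption. The function names and the fragment representation (a list of `(symbol, (x, y, z))` tuples in Å) are also hypothetical.

```python
import math

# Covalent radii in angstrom for H and sp3 C (values as tabulated by Cordero et al.).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76}

def closest_distance(frag_m, frag_n):
    """Distance between the closest atoms of two fragments (DBE measure)."""
    return min(math.dist(xa, xb) for _, xa in frag_m for _, xb in frag_n)

def dcrr(frag_m, frag_n):
    """Assumed DCRR form: min over atom pairs of R_ab / (r_a + r_b)."""
    return min(
        math.dist(xa, xb) / (COVALENT_RADIUS[a] + COVALENT_RADIUS[b])
        for a, xa in frag_m
        for b, xb in frag_n
    )

def keep_pair(frag_m, frag_n, cutoff, use_dcrr=False):
    """Keep a nonbonded pair only if its measure does not exceed the cutoff."""
    measure = dcrr(frag_m, frag_n) if use_dcrr else closest_distance(frag_m, frag_n)
    return measure <= cutoff

if __name__ == "__main__":
    m = [("C", (0.0, 0.0, 0.0))]
    n = [("C", (6.08, 0.0, 0.0))]
    print(closest_distance(m, n), dcrr(m, n), keep_pair(m, n, 10.0))
```

Because every group contributes only the neighbours inside the cutoff, the number of retained pairs grows linearly with the number of groups for a fixed cutoff.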

LSSMF Fragmentation Algorithm

Before presenting our fragmentation algorithm, let us define the notation: i, j, k, l, ... for atoms; a, b, c, d, ... for groups; and μ, ν, λ, σ, ... for fragments.

1. Define the level of SMF and the tolerances for single, double, and triple bonds as well as the NB cutoff: $\Delta_{\mathrm{single}}$, $\Delta_{\mathrm{double}}$, $\Delta_{\mathrm{triple}}$, and Δ.
2. Read molecular info: Cartesian coordinates (X, Y, and Z), number of atoms (N), atomic masses, and atomic covalent radii ($r_i$).
3. Compute interatomic distances: $R_{ij}$.
4. Compute the bond order matrix $B_{ij}$. If $R_{ij} < r_i + r_j + \Delta_{\mathrm{single}}$, then $B_{ij} = 1$. If $R_{ij} < r_i + r_j - \Delta_{\mathrm{double}}$, then $B_{ij} = 2$. If $R_{ij} < r_i + r_j - \Delta_{\mathrm{triple}}$, then $B_{ij} = 3$. Else $B_{ij} = 0$.
5. Catch the first nonhydrogen atom. The first such atom is assigned to group 1 (in fact, group 0 in C++).
6. Assign the remaining non-H atoms. Starting from the first non-H atom, loop over atom pairs i, j. If $B_{ij} > 1$, then assign j to the group of the ith atom. Otherwise, assign it to the next group.
7. Catch double/triple bonded non-H atoms in different groups and merge them.
8. Assign the hydrogen atoms to each group according to the values of $B_{ij}$.
9. Form the group connectivity matrix L. If two groups are connected to each other, then the corresponding element of L is 1; otherwise, it is zero. Further, determine the bonded atoms of two connected groups: LA.
10. Determine the number of caps per group.
11. Form bonded and nonbonded domains for each group. For group $G_i$, the bonded domain is the list of groups $G_j$ (j ≠ i) that are connected to $G_i$. Similarly, the nonbonded domain is the list of groups $G_j$ (j ≠ i) that are not connected to $G_i$ and lie within the nonbonded cutoff tolerance.
12. Form lists of groups and bonded and nonbonded fragments according to the SMF level. Details of the bonded and nonbonded fragment algorithms are provided in the sections “New Fragmentation Algorithm for Branched Molecules” and “Modified Nonbonded Algorithms”, respectively.
13. Add embedded charges to groups and fragments in the case of polar molecules.[88]
14. Write MacroQC input files for groups and bonded and nonbonded fragments.
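The bond order assignment from interatomic distances and covalent radii can be sketched as below. The tolerance values (0.40, 0.10, and 0.25 Å) are hypothetical placeholders chosen only so that typical C–C, C=C, and C≡C distances are classified as expected; the values actually used by the authors are not given in this text. The later, stricter tests deliberately overwrite the earlier ones, mirroring the order of the conditions in the algorithm.

```python
import math

# Covalent radii in angstrom for H and sp3 C (values as tabulated by Cordero et al.).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76}

def bond_order(r_ij, r_i, r_j, tol_single=0.40, tol_double=0.10, tol_triple=0.25):
    """Classify a contact as 0 (no bond), 1, 2, or 3 from distance and radii.

    The three tolerances are illustrative assumptions, in angstrom.
    """
    limit = r_i + r_j
    order = 0
    if r_ij < limit + tol_single:
        order = 1
    if r_ij < limit - tol_double:
        order = 2
    if r_ij < limit - tol_triple:
        order = 3
    return order

def bond_order_matrix(symbols, coords):
    """Symmetric bond order matrix B for a list of atoms."""
    n = len(symbols)
    b = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r_ij = math.dist(coords[i], coords[j])
            b[i][j] = b[j][i] = bond_order(
                r_ij, COVALENT_RADIUS[symbols[i]], COVALENT_RADIUS[symbols[j]]
            )
    return b

if __name__ == "__main__":
    # Two carbons at a typical C=C distance of 1.33 angstrom.
    print(bond_order_matrix(["C", "C"], [(0.0, 0.0, 0.0), (1.33, 0.0, 0.0)]))
```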

Capping Hydrogens

In each final fragment, bonds that connect groups in the fragment to groups that are not in the fragment are “missing”. These missing bonds are replaced by bonds to hydrogen atoms.[55] The total number of hydrogen atoms added to fragments with a sign of +1 is exactly equal to the number added to fragments with a sign of −1. The position of each H atom is taken to lie along the missing bond vector at a distance which is proportional to the expected ratio of bond lengths. That is, the H atom is placed on the line from $X_i$ toward $X_j$, where $X_i$ denotes the Cartesian position of the atom in the fragment, and $X_j$ denotes the Cartesian position of the atom that is not available in the fragment.
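A minimal sketch of the cap placement is given below. It assumes the commonly used covalent-radius scaling, in which the cap sits at X_i + [(r_i + r_H)/(r_i + r_j)](X_j − X_i); the exact expression is not preserved in this text, so both the formula and the function name `cap_position` should be read as assumptions.

```python
def cap_position(x_in, x_out, r_in, r_out, r_h=0.31):
    """Place a capping hydrogen along the missing bond.

    x_in  : position of the atom kept in the fragment
    x_out : position of the bonded atom left outside the fragment
    r_in, r_out, r_h : covalent radii in angstrom (r_h defaults to hydrogen)
    Assumed scaling: (r_in + r_h) / (r_in + r_out).
    """
    scale = (r_in + r_h) / (r_in + r_out)
    return tuple(a + scale * (b - a) for a, b in zip(x_in, x_out))

if __name__ == "__main__":
    # Cutting a C-C bond of length 1.52 angstrom (sum of two sp3 carbon radii).
    print(cap_position((0.0, 0.0, 0.0), (1.52, 0.0, 0.0), 0.76, 0.76))
```

With this scaling, a cut C–C bond whose length equals the sum of the carbon radii gives a C–H cap distance equal to the sum of the carbon and hydrogen radii.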

New Fragmentation Algorithm for Branched Molecules

Our fragmentation algorithm is identical to the one suggested by Deev and Collins[55] for unbranched chain-like molecules. However, in the case of branched molecules, we propose a new algorithm. In order to illustrate the difference between the two algorithms, let us consider the 2,4-dimethylpentane (24DMP) molecule (Figure 1), for which the fragmentation result at level 3 was reported by Deev and Collins.[55]

Figure 1

2,4-Dimethylpentane (24DMP) molecule.

In the 24DMP molecule, each carbon atom defines a group, with a total of seven groups. The fragmentation suggested by Deev and Collins yields the following bonded fragments at level 3:[55]

$$G_1G_2G_3G_4G_5G_6G_7 \rightarrow G_1G_2G_3G_4G_5 + G_1G_4G_5G_6G_7 - G_1G_4G_5$$

where $G_1G_2G_3G_4G_5G_6G_7$ represents the whole molecule. In this case, fragments $G_1G_2G_3G_4G_5$ and $G_1G_4G_5G_6G_7$ are formed from the combination of five groups. However, in the case of an open-chain analog, the fragments form from the combination of four groups. Hence, Deev and Collins’ algorithm yields fragments of different sizes for open-chain and branched molecules. In the latter case, it yields much larger fragments, which may be a problem for high-level CC methods, where the computational cost increases steeply. Therefore, one of the authors (U.B.) suggests a new fragmentation algorithm for branched molecules, in which smaller-sized fragments form as in the case of open-chain molecules. For the 24DMP molecule at level 3, our algorithm yields bonded fragments in which the fragments formed by the combination of four groups are called the main fragments. The remaining fragments are considered for chemical balance. Hence, we may call them neutralizing fragments or renormalization terms, reminiscent of many-body perturbation theory. The logic of the proposed fragmentation approach is to form all possible bonded fragments combining four different groups as in the case of open-chain alkane molecules. Our algorithm produces $6F_4 + 5F_3 + 2F_2$, where $F_i$ denotes a fragment formed from i different groups, whereas Deev and Collins’ algorithm produces $2F_5 + F_3$. Hence, our algorithm yields smaller fragments, while Deev and Collins’ algorithm yields a smaller number of fragments. For high-level CC computations with large basis sets, the size of a fragment is more important than the number of additional small fragments. Moreover, a group can be as small as CH4 and H2O but can be as large as benzene and naphthalene. 
Hence, in the case of large groups, such as benzene and naphthalene, decreasing the size of the fragment from $F_5$ to $F_4$ is still very important to reduce the cost even though small basis sets are employed. Therefore, our algorithm is more efficient in terms of computational cost and better fits high-level CC methods, such as CCSD(T). Our new bonded fragmentation algorithm is as follows. Let us assume that we are employing bonded level m fragmentation. Then, the size of our main fragments will be m + 1, which means they will include m + 1 groups. To form bonded fragments, we need to loop over groups. At the bonded level m, one may loop over m + 1 groups, which would be an $O(n^{m+1})$ loop, where n is the number of groups. However, with the concept of the bonded domain, we can reduce the cost dramatically. Hence, we just run a single loop over groups. For each group ($G_i$), we get its bonded domain $BD_i$, which includes the $G_j$ groups (j ≠ i). Then, for each $G_j$ group, we get its bonded domain $BD_j$. Finally, we form the union of bonded domains (UBD). Please note that each bonded domain may include just a few groups, in the case of alkanes as many as four. For example, UBD may include a maximum of 16 groups at level 3. Since some groups may be present simultaneously in different BDs, the actual size of UBD is much smaller. Once we form UBD, we loop over the elements of UBD and form all possible main bonded fragments of size m + 1. The groups of the fragments which are formed should be connected in the original molecule. Repetitive groups, in the form of the largest possible fragments, in the main bonded fragments are added to the list of bonded fragments with appropriate negative coefficients. For example, at level 3, our main fragments include four groups ($F_4$). Hence, we first search for repeating three groups ($F_3$) that are connected in the bonded list. If we find any $F_3$, then we add them to the list with appropriate negative coefficients. Then, we repeat this procedure for $F_2$ fragments and finally for groups. 
The mentioned negative coefficients are obtained following the chemical balance rules.
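The search for main fragments through bonded domains can be sketched as follows. This is a simplified illustration that finds only the main fragments; the renormalization terms and their coefficients are not generated. The group numbering of 24DMP used here (0-based, with groups 1 and 4 as the branching carbons) is an arbitrary choice and is not the numbering of the original figure. For levels other than 3, growing the union of bonded domains by ceil(m/2) hops is our own generalization.

```python
from itertools import combinations

def bonded_domains(n_groups, bonds):
    """Map each group to the set of groups bonded to it."""
    bd = {g: set() for g in range(n_groups)}
    for a, b in bonds:
        bd[a].add(b)
        bd[b].add(a)
    return bd

def union_of_bonded_domains(g, bd, hops):
    """Group g plus every group reachable within `hops` bonds."""
    ubd, frontier = {g}, {g}
    for _ in range(hops):
        frontier = set().union(*(bd[x] for x in frontier)) - ubd
        ubd |= frontier
    return ubd

def is_connected(groups, bd):
    """True if the groups form one connected piece of the molecule."""
    groups = set(groups)
    start = next(iter(groups))
    seen, stack = {start}, [start]
    while stack:
        g = stack.pop()
        for nb in (bd[g] & groups) - seen:
            seen.add(nb)
            stack.append(nb)
    return seen == groups

def main_fragments(n_groups, bonds, level):
    """All connected fragments of level + 1 groups, found with one loop over groups."""
    bd = bonded_domains(n_groups, bonds)
    size, hops = level + 1, (level + 1) // 2
    found = set()
    for g in range(n_groups):
        ubd = union_of_bonded_domains(g, bd, hops)
        for combo in combinations(sorted(ubd), size):
            if g in combo and is_connected(combo, bd):
                found.add(combo)
    return sorted(found)

if __name__ == "__main__":
    dmp24 = [(0, 1), (1, 2), (1, 3), (3, 4), (4, 5), (4, 6)]
    for frag in main_fragments(7, dmp24, level=3):
        print(frag)
```

For the 24DMP connectivity, the sketch returns six four-group fragments, in agreement with the $6F_4$ count quoted in the text. The work per group is bounded by the size of its domain union, so the total cost grows linearly with the number of groups.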

Modified Nonbonded Algorithms

In the original SMF approach, Collins and co-workers present a simple and effective way to consider nonbonded interactions.[44,55−63] In Collins and co-workers’ NB approach, only two-body interactions are considered, which may be called the NB level 1 fragmentation. For example, for the linear C7H16 molecule, the bonded and NB fragments can be written compactly by representing each group $G_i$ by its index i, for example, 234 = $G_2G_3G_4$, and by letting i ↔ j denote the NB interaction between the ith and jth groups. However, for accurate treatment of the NB interactions, one needs to consider larger contributions than two-body interactions. In a 2009 study, Addicoat and Collins reported an improved algorithm for the NB interactions using a level–level approach in addition to a three-body expansion method.[86] Even though the level–level approach of Addicoat and Collins[86] is an improvement over the two-body expansion, it includes a limited number of higher terms. Furthermore, the three-body expansion approach of Addicoat and Collins[86] yields a very large number of fragments. Hence, we propose a modified NB algorithm, which includes all three-body terms, while lower level interactions come from the renormalization terms. Our modified algorithm is obtained by applying our NB cutoff approaches to Addicoat and Collins’ three-body expansion method. In our algorithm, we first build all three-body terms, then we investigate all subunits and cancel the repeating terms. Our nonbonded fragmentation algorithm is as follows. The distant groups that are not present together in the bonded fragments are considered in nonbonded fragments. At the nonbonded level 1, we form “dimers” of the capped groups ($G_i$ ↔ $G_j$) that are not simultaneously present in bonded fragments. Of course, the distance between these two groups should be within the nonbonded cutoff limits. Otherwise, they will be disregarded. 
In nonbonded level 2, we form three-body complexes (3BCs). Each 3BC is formed between a group and a fragment of two groups (G ↔ $F_2$). The $F_2$ fragments can be obtained by bonded level 1 fragmentation. Of course, the G group and the $F_2$ fragment should not be simultaneously present in bonded fragments. Next, we inspect these 3BCs for repeating “dimers” of groups. If we find any repeating “dimer”, then we add them (dimers of capped groups) to the list with appropriate negative coefficients. The mentioned negative coefficients are again obtained following the chemical balance rules. As in the case of nonbonded level 1, we consider nonbonded fragments within the nonbonded cutoff limits. Even though our nonbonded level 2 (NB2) appears to be identical to the three-body expansion approach of Addicoat and Collins,[86] our algorithm yields a dramatically reduced number of fragments because of the nonbonded cutoff limits employed. For example, for the C70H142 molecule, our algorithm yields 1540 NB fragments, while the number would be ∼4 times larger in the case of the three-body expansion approach of Addicoat and Collins.
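The nonbonded level 1 selection rule can be sketched as below: a pair of groups becomes a dimer only if the two groups never appear together in a bonded fragment and, optionally, lie within the cutoff. The sketch covers level 1 only. The group spacing of 1.27 Å per chain position used in the usage example is a made-up stand-in for the closest-atom distance between capped groups, and the function names are illustrative.

```python
from itertools import combinations

def nb_level1_pairs(n_groups, bonded_fragments, distance=None, cutoff=None):
    """Pairs (i, j) of groups, numbered from 1, that form NB level 1 dimers.

    bonded_fragments : iterable of tuples of group indices (main fragments)
    distance         : optional callable (i, j) -> distance in angstrom
    cutoff           : optional NB cutoff in angstrom; pairs beyond it are dropped
    """
    together = set()
    for frag in bonded_fragments:
        together.update(combinations(sorted(frag), 2))
    pairs = []
    for pair in combinations(range(1, n_groups + 1), 2):
        if pair in together:
            continue  # already described by the bonded energy
        if cutoff is not None and distance(*pair) > cutoff:
            continue  # long-range pair, neglected
        pairs.append(pair)
    return pairs

if __name__ == "__main__":
    # n-Heptane at bonded level 3: main fragments 1234, 2345, 3456, 4567.
    heptane = [(1, 2, 3, 4), (2, 3, 4, 5), (3, 4, 5, 6), (4, 5, 6, 7)]
    print(nb_level1_pairs(7, heptane))
    # Hypothetical distance model: 1.27 angstrom per position along the chain.
    print(nb_level1_pairs(7, heptane, lambda i, j: 1.27 * abs(i - j), cutoff=6.0))
```

Without a cutoff, the number of such pairs for a chain of n groups at bonded level 3 is (n − 4)(n − 3)/2, which is the quadratic growth that the cutoff removes.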

Results and Discussion

Results from the HF, MP2, CCSD, CCSD(T), LSSMF(3,1)-HF, LSSMF(3,1)-MP2, LSSMF(3,1)-CCSD, and LSSMF(3,1)-CCSD(T) methods were obtained for a set of alkanes, C$_n$H$_{2n+2}$ (n = 6–10), for comparison of the absolute energies. Further, for a set of larger systems, C$_n$H$_{2n+2}$ (n = 50–70), results from MP2 and LSSMF-MP2 are compared for the total energies. For the alkanes, Dunning’s correlation-consistent polarized valence double-, triple-, and quadruple-ζ basis sets (cc-pVDZ, cc-pVTZ, and cc-pVQZ) were employed with the frozen core approximation.[89,90] The density-fitting approach was used for the LSSMF methods considered.[24,74,78,79] For the cc-pVXZ primary basis sets, cc-pVXZ-JKFIT[91] and cc-pVXZ-RI[92] auxiliary basis sets were employed for reference and correlation energies, respectively. Geometries of the C$_n$H$_{2n+2}$ structures considered were optimized at the B3LYP/cc-pVDZ and universal force field (UFF)[93] levels for n = 6–10 and n = 50–70, respectively. Previous studies demonstrated that the accuracies of the level 1 and level 2 approaches are not satisfactory for bonded fragments, and that at least level 3 should be used.[44,60] Hence, in this study, we consider levels 3–6 for bonded fragments, while we consider levels 1 and 2 for nonbonded fragments.

S142 Set

To assess the accuracy of the LSSMF(3,1) approach with respect to the canonical methods, we consider a set of alkanes, C$_n$H$_{2n+2}$ (n = 6–10), which includes 142 molecules, denoted by S142. For the first step of our assessment, we choose a safe cutoff value for nonbonded interactions: Δ = 10.0 Å. In the next section, effects of different Δ values are evaluated. Mean absolute errors (MAEs) of the LSSMF(3,1)-HF, LSSMF(3,1)-MP2, LSSMF(3,1)-CCSD, and LSSMF(3,1)-CCSD(T) methods with respect to canonical methods are depicted in Figure 2. For the C$_n$H$_{2n+2}$ set, total energies from MP2, CCSD, CCSD(T), LSSMF(3,1)-MP2, LSSMF(3,1)-CCSD, and LSSMF(3,1)-CCSD(T) methods and percentages of the LSSMF energies with respect to the canonical methods are reported in Tables S1–S6.

Figure 2

Mean absolute errors in the total energies of the C$_n$H$_{2n+2}$ (n = 6–10) isomers for the LSSMF(3,1)-HF, LSSMF(3,1)-MP2, LSSMF(3,1)-CCSD, and LSSMF(3,1)-CCSD(T) methods with respect to canonical methods. All computations are performed with the cc-pVDZ basis set and with Δ = 10.0 Å.

For the C10H22 isomers, which are the largest members of the test set considered, total energies from MP2, CCSD, CCSD(T), LSSMF(3,1)-MP2, LSSMF(3,1)-CCSD, and LSSMF(3,1)-CCSD(T) methods and percentages of the LSSMF energies with respect to the canonical methods are reported in Tables S5 and S6. For the correlated methods, the percent coverage values are in the range 99.9990%–100.0005%, while those of LSSMF(3,1)-HF are in the range 99.9978%–100.0001%. Hence, all considered LSSMF methods cover a satisfactory portion of the total energy of the full methods. The MAE values (Figure 2) in total energies are 1.61 [LSSMF(3,1)-HF], 0.59 [LSSMF(3,1)-MP2], 0.58 [LSSMF(3,1)-CCSD], and 0.59 [LSSMF(3,1)-CCSD(T)] kcal mol–1. Further, the maximum absolute error ($\Delta_{\max}$) values for total energies are 5.30 [LSSMF(3,1)-HF], 2.56 [LSSMF(3,1)-MP2], 2.27 [LSSMF(3,1)-CCSD], and 2.01 [LSSMF(3,1)-CCSD(T)] kcal mol–1. Hence, considering both error measures, MAE and $\Delta_{\max}$, the results of the correlated LSSMF methods are in good agreement with the canonical methods. These results demonstrate that the high-level electron correlation methods are less prone to fragmentation errors since dynamical electron correlation is a local phenomenon. Considering the results obtained for the whole alkane set, one can safely rely on the LSSMF-CC methods for high-accuracy studies in large-sized chemical systems, where the canonical methods are not computationally affordable.
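The error measures and the percent coverage discussed in this section can be computed from paired total energies as in the following sketch. It assumes that the second error measure is the largest absolute error over the set. The energies in the usage example are invented numbers for illustration only and are not taken from the supporting tables.

```python
HARTREE_TO_KCAL = 627.5095  # kcal/mol per hartree

def error_stats(fragment_energies, canonical_energies):
    """MAE and largest absolute error (kcal/mol) plus percent coverage.

    Both inputs are total energies in hartree, in the same molecule order.
    """
    errors = [
        abs(f - c) * HARTREE_TO_KCAL
        for f, c in zip(fragment_energies, canonical_energies)
    ]
    coverage = [
        100.0 * f / c for f, c in zip(fragment_energies, canonical_energies)
    ]
    return {
        "mae": sum(errors) / len(errors),
        "max": max(errors),
        "coverage": coverage,
    }

if __name__ == "__main__":
    # Hypothetical energies (hartree), for illustration only.
    lssmf = [-100.001, -200.000]
    canonical = [-100.000, -200.002]
    print(error_stats(lssmf, canonical))
```

Because total energies of this size are hundreds of hartree, an error of about 1 kcal mol–1 corresponds to a coverage that differs from 100% only in the fourth or fifth decimal place, which is why the coverage values quoted in the text are so close to 100%.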

Cutoff

In the second step of our assessment of the LSSMF approaches, we investigate the effect of nonbonded cutoff tolerances on the accuracy. For this purpose, we consider five isomers of C10H22: 2,2,3,3-tetramethylhexane (2233TMH), 4-ethyl-2,4-dimethylhexane (4E24DMH), 4-isopropylheptane (4IPH), 5-methylnonane (5MN), and n-decane (decane). For these molecules, the total energies of the LSSMF(3,1)-CCSD(T) approach are computed with Δ = 3–10 Å. The errors at each Δ value with respect to the full methods are depicted in Figure 3. Our results indicate that the maximum error is generally obtained at 3 Å, as expected, and that the errors remain constant from 6 Å onward. In the case of the n-decane molecule, we obtain the lowest error at Δ = 3 Å. The lowest error is obtained at the shortest distance because the bonded energy of n-decane is closer to the CCSD(T) energy than the total LSSMF energy is; the latter covers 100.0005% of the CCSD(T) energy. In other words, by adding more nonbonded contributions through a larger Δ, one obtains energies lower than that of CCSD(T). Overall, even though we use Δ = 10 Å throughout this study, a Δ value of 6.0 Å appears to be enough for most purposes, which corresponds to a DCRR value of ∼4.0.

Figure 3

Errors of the LSSMF(3,1)-CCSD(T) method with respect to the full method with different cutoff distances for 2,2,3,3-tetramethylhexane (2233TMH), 4-ethyl-2,4-dimethylhexane (4E24DMH), 4-isopropylheptane (4IPH), 5-methylnonane (5MN), and n-decane (decane) molecules. All computations are performed with the cc-pVDZ basis set.

Frozen Natural Orbitals

To further increase the applicability of the LSSMF(3,1)-CCSD(T) approach, we also consider frozen natural orbitals (FNOs).[72,94−97] The FNO approximation is very helpful to reduce the computational cost of CCSD(T), while it introduces negligible errors with tight enough occupation tolerances, such as $10^{-5}$. To improve the FNO–CC results, we employ the δ correction as suggested by DePrince and Sherrill.[97] With the FNO approximation, we can consider larger basis sets for the canonical methods; hence, we employ the cc-pVTZ basis set. For n-decane and the four lowest energy isomers, we obtain MAE and $\Delta_{\max}$ values of 0.74 and 1.04 kcal mol–1, respectively, for the LSSMF(3,1)-FNO–CCSD(T) approach (Tables S7 and S8). Hence, the fragmentation error is tolerable for the FNO–CCSD(T) method, as in the case of CCSD(T).

Basis Set Effects

To investigate the effect of basis sets, we also carry out total energy computations for the LSSMF(3,1)-FNO–CCSD(T) method with cc-pVDZ, cc-pVTZ, and cc-pVQZ basis sets for three C7H16 isomers. One of these isomers is n-heptane, and the others are the lowest energy isomers: 2,2,3-trimethylbutane and 2,2-dimethylhexane. The MAE values with respect to FNO–CCSD(T) for different basis sets are depicted in Figure 4. The MAE values are 0.33 (cc-pVDZ), 0.38 (cc-pVTZ), and 0.45 (cc-pVQZ) kcal mol–1. Even though there is a slight increase with basis set size, the errors are still of tolerable magnitude.

Figure 4

Mean absolute errors in the total energies of three C7H16 isomers for the LSSMF(3,1)-FNO–CCSD(T) method with respect to FNO–CCSD(T). All computations are performed with the FNO occupation tolerance of $10^{-5}$ and Δ = 10.0 Å in the cc-pVDZ (DZ), aug-cc-pVDZ (aDZ), cc-pVTZ (TZ), aug-cc-pVTZ (aTZ), cc-pVQZ (QZ), and aug-cc-pVQZ (aQZ) basis sets.

To investigate the effect of diffuse basis sets, we also carried out energy computations for the LSSMF(3,1)-FNO–CCSD(T) method with the aug-cc-pVDZ, aug-cc-pVTZ, and aug-cc-pVQZ basis sets for the same isomers (Figure 4). The MAE values with respect to FNO–CCSD(T) are 0.71 (aug-cc-pVDZ), 0.47 (aug-cc-pVTZ), and 0.49 (aug-cc-pVQZ) kcal mol–1. The MAE of aug-cc-pVDZ is about 2-fold larger than that of cc-pVDZ, while the MAEs of the triple- and quadruple-ζ basis sets are only slightly increased compared to cc-pVTZ and cc-pVQZ. Nevertheless, the errors are still of tolerable magnitude.

L12 Set

To further investigate the performance of the LSSMF approach for larger molecules, we consider the L12 set (Table S9), which consists of 12 large molecules including 50–70 carbon atoms. For the L12 set, conventional CC computations are not computationally feasible. Hence, for this set, we investigate the errors of LSSMF-MP2 with respect to the canonical MP2 for absolute energies. For the L12 set, the cc-pVDZ basis set is employed. For the S142 set, only bonded level 3 is considered because higher levels cover either the whole molecules or a large portion of them. Hence, for the L12 set, we explore the effect of higher levels. For the L12 set, bonded levels 3–6 and nonbonded levels 1 and 2 are considered. For each combination of bonded and nonbonded levels, cutoff values of 7.5 and 10.0 Å are employed for nonbonded interactions. For the C50H102–C70H142 molecules, total energies from HF, MP2, LSSMF-HF, and LSSMF-MP2 methods and percentages of the LSSMF energies with respect to the canonical methods are reported in Tables S10–S25. For the entire set, the percent coverage values are in the ranges 99.9983%–100.0002% and 100.0000%–100.0008% for LSSMF-HF and LSSMF-MP2, respectively. Hence, both LSSMF methods cover large portions of the total energies of the corresponding canonical methods. With the nonbonded level 1 and the cutoff value of 7.5 Å, the MAE values (Figure 5) in the LSSMF-HF total energies with respect to HF are 9.95, 2.35, 0.59, and 0.27 kcal mol–1 for the bonded levels 3, 4, 5, and 6, respectively. With the nonbonded level 1 and the cutoff value of 7.5 Å, the MAE values (Figure 6) in the LSSMF-MP2 total energies with respect to MP2 are 8.34, 2.99, 2.12, and 1.05 kcal mol–1 for the bonded levels 3, 4, 5, and 6, respectively. Hence, at higher bonded levels, the errors of the LSSMF approach decrease systematically. 
The accuracy of bonded level 6 is substantially better than that of the lower levels considered; there is a 7.9-fold reduction in the MP2 error compared to level 3, which was advocated as accurate in previous studies. To further improve our results, we also consider the nonbonded level 2 scheme proposed in this study. With the nonbonded level 2 and the cutoff value of 7.5 Å, the MAE values (Figure 5) in the LSSMF-HF total energies with respect to HF are 6.55, 1.83, 0.48, and 0.19 kcal mol–1 for the bonded levels 3, 4, 5, and 6, respectively. With the nonbonded level 2 and the cutoff value of 7.5 Å, the MAE values (Figure 6) in the LSSMF-MP2 total energies with respect to MP2 are 6.64, 1.48, 0.92, and 0.32 kcal mol–1 for the bonded levels 3, 4, 5, and 6, respectively. For LSSMF-MP2, the NB level 2 fragmentation significantly improves upon NB level 1 and provides 1.26-, 2.02-, 2.30-, and 3.28-fold reductions in errors compared with NB level 1 for bonded levels 3, 4, 5, and 6, respectively. Thus, the combination of bonded level 6 and NB level 2, which may be denoted by LSSMF(6,2), provides the best results. Further, it is also noteworthy that the LSSMF(5,2) level provides lower errors compared to LSSMF(6,1). Similarly, the LSSMF(4,2) level is better than LSSMF(5,1). Hence, it appears that, instead of going to a higher bonded level, it is better to first go to a higher nonbonded level.
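The fold reductions quoted for LSSMF-MP2 follow directly from the MAE values reported at the 7.5 Å cutoff, as this short check shows. The dictionary below simply restates the numbers from the text; the function name is illustrative.

```python
# LSSMF-MP2 MAEs (kcal/mol) for the L12 set at the 7.5 angstrom cutoff,
# keyed by (bonded level, nonbonded level), as reported in the text.
MAE_MP2 = {
    (3, 1): 8.34, (4, 1): 2.99, (5, 1): 2.12, (6, 1): 1.05,
    (3, 2): 6.64, (4, 2): 1.48, (5, 2): 0.92, (6, 2): 0.32,
}

def fold_reduction(reference, improved):
    """Ratio of the reference MAE to the improved MAE."""
    return MAE_MP2[reference] / MAE_MP2[improved]

if __name__ == "__main__":
    for m in (3, 4, 5, 6):
        print(f"NB1 -> NB2 at bonded level {m}: {fold_reduction((m, 1), (m, 2)):.2f}")
    print(f"(3,1) -> (6,1): {fold_reduction((3, 1), (6, 1)):.1f}")
    print(f"(3,1) -> (6,2): {fold_reduction((3, 1), (6, 2)):.1f}")
```

The last ratio, between LSSMF(3,1) and LSSMF(6,2), is the "more than 26-fold" reduction cited in the abstract.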

Figure 5

Mean absolute errors in the total energies of the L12 set (the largest member is C70H142) for the LSSMF-HF method with respect to HF. All computations are performed with Δ = 7.5 Å and the cc-pVDZ basis set. The (m, n) notation indicates the bonded and nonbonded levels, respectively.

Figure 6

Mean absolute errors in the total energies of the L12 set (the largest member is C70H142) for the LSSMF-MP2 method with respect to MP2. All computations are performed with Δ = 7.5 Å and the cc-pVDZ basis set. The (m, n) notation indicates the bonded and nonbonded levels, respectively.

With nonbonded level 1 and the cutoff value of 10.0 Å, the MAE values (Figure 7) in the LSSMF-HF total energies with respect to HF are 9.94, 2.34, 0.59, and 0.28 kcal mol–1 for bonded levels 3, 4, 5, and 6, respectively. With the same settings, the MAE values (Figure 8) in the LSSMF-MP2 total energies with respect to MP2 are 8.45, 3.11, 2.24, and 1.17 kcal mol–1 for bonded levels 3, 4, 5, and 6, respectively. With nonbonded level 2 and the cutoff value of 10.0 Å, the MAE values (Figure 7) in the LSSMF-HF total energies with respect to HF are 6.57, 1.86, 0.48, and 0.17 kcal mol–1 for bonded levels 3, 4, 5, and 6, respectively. With the same settings, the MAE values (Figure 8) in the LSSMF-MP2 total energies with respect to MP2 are 6.55, 1.48, 0.92, and 0.33 kcal mol–1 for bonded levels 3, 4, 5, and 6, respectively. These results are virtually the same as those obtained with the cutoff value of 7.5 Å, which again demonstrates that the 7.5 Å cutoff is sufficiently accurate. Overall, our results demonstrate that the LSSMF(6,2) level approaches the accuracy of the canonical method. Therefore, one may rely on the LSSMF methods for high-accuracy studies of large chemical systems, where the canonical methods are not computationally feasible.
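To make the cutoff comparison concrete, the reported MAE values for the two cutoffs can be differenced level by level. This is a small sketch over the numbers quoted in the text; the variable names are illustrative only.

```python
# MAE values (kcal/mol) from the text for bonded levels 3, 4, 5, 6.
# Keys are (method, nonbonded level).
mae_7_5 = {
    ("HF", 1): [9.95, 2.35, 0.59, 0.27],
    ("MP2", 1): [8.34, 2.99, 2.12, 1.05],
    ("HF", 2): [6.55, 1.83, 0.48, 0.19],
    ("MP2", 2): [6.64, 1.48, 0.92, 0.32],
}
mae_10_0 = {
    ("HF", 1): [9.94, 2.34, 0.59, 0.28],
    ("MP2", 1): [8.45, 3.11, 2.24, 1.17],
    ("HF", 2): [6.57, 1.86, 0.48, 0.17],
    ("MP2", 2): [6.55, 1.48, 0.92, 0.33],
}

# Change in MAE when the cutoff is raised from 7.5 to 10.0 Å.
shifts = {
    key: [round(b - a, 2) for a, b in zip(mae_7_5[key], mae_10_0[key])]
    for key in mae_7_5
}
largest_shift = max(abs(s) for values in shifts.values() for s in values)

for key, values in shifts.items():
    print(key, values)
print("largest |shift| (kcal/mol):", largest_shift)
```

The largest change is on the order of 0.1 kcal mol–1, which is small relative to the differences between bonded levels.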

Figure 7

Mean absolute errors in the total energies of the L12 set (the largest member is C70H142) for the LSSMF-HF method with respect to HF. All computations are performed with Δ = 10.0 Å and the cc-pVDZ basis set. The (m, n) notation indicates the bonded and nonbonded levels, respectively.

Figure 8

Mean absolute errors in the total energies of the L12 set (the largest member is C70H142) for the LSSMF-MP2 method with respect to MP2. All computations are performed with Δ = 10.0 Å and the cc-pVDZ basis set. The (m, n) notation indicates the bonded and nonbonded levels, respectively.

Timing

In our LSSMF implementation, we first form the groups and the bonded and nonbonded fragments (step 1); then, we write all fragment input files to disk (step 2). In the third step, we simultaneously submit all fragment jobs to our Linux clusters. Finally, we collect the energy values from the output files, merge them, and compute the final LSSMF energy. Hence, our implementation is naturally parallel. Fragment formation is the fastest step; owing to our efficient fragmentation algorithm, all fragments can be formed in just a few minutes. Writing the fragment input files generally takes several minutes. Hence, the overall cost is dominated by the CC jobs (step 3), whose wall time depends on the number of cores available. To illustrate the efficiency of our fragmentation algorithm, we consider a set of alkanes containing 10,004–50,012 atoms. Total wall times (in min) for the LSSMF(3,1) code (step 1 + step 2) for the CnH2n+2 (n = 3334; 6668; 10,002; 13,336; 16,670) set are depicted in Figure 9. For the largest member of the set, C16670H33342, the total time for the LSSMF code is just 8.4 min on a single-node (1 core) computer. Hence, our LSSMF code forms fragments and prepares the necessary input files very efficiently.
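The independent-job structure of steps 3 and 4 can be sketched as follows. This is a minimal illustration, not the authors' code: the `Fragment` class, the signed integer coefficients, the stand-in `run_fragment_job` function, and the placeholder energies are all assumptions introduced here. It assumes only that the final energy is a weighted sum of independently computed fragment energies.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    name: str          # hypothetical label for a fragment input file
    coefficient: int   # hypothetical signed weight in the energy sum


# Placeholder energies (arbitrary numbers, NOT computed values).
_PLACEHOLDER_ENERGIES = {"A": -3.0, "B": -3.0, "A_and_B_overlap": -2.0}


def run_fragment_job(fragment: Fragment) -> float:
    """Stand-in for submitting one fragment job and parsing its energy."""
    return _PLACEHOLDER_ENERGIES[fragment.name]


def assemble_energy(fragments, runner=run_fragment_job) -> float:
    """Run all fragment jobs independently, then merge the energies."""
    # Each job depends only on its own input, so the jobs can run concurrently.
    with ThreadPoolExecutor() as pool:
        energies = list(pool.map(runner, fragments))
    return sum(f.coefficient * e for f, e in zip(fragments, energies))


fragments = [
    Fragment("A", +1),
    Fragment("B", +1),
    Fragment("A_and_B_overlap", -1),  # overlap counted twice, so subtract once
]
total = assemble_energy(fragments)
print(total)
```

Because no fragment job depends on another, the wall time of step 3 is set by the largest fragment and by how many jobs can run at once.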

Figure 9

Total wall time (in min) for the LSSMF(3,1) code for the CnH2n+2 set. All procedures were performed on a single-node (1 core) Intel(R) Xeon(R) Gold 5218 CPU @ 2.30 GHz computer.

To illustrate the efficiency and applicability of the LSSMF(3,1)-CCSD(T) approach, we consider the C3334H6670 molecule, which includes 10,004 atoms. For the C3334H6670 molecule, the LSSMF(3,1)-CCSD(T)/cc-pVTZ energy computation is performed on a Linux cluster with 100 nodes, with 4 cores and 5 GB of memory provided to each node. On this system, the total wall time for the energy computation is ∼24 h, which indicates that the introduced method is extremely efficient. As a second example, we consider a biomolecular complex (PDB code: 1GLA), which includes 10,488 atoms, to illustrate the efficiency of the LSSMF approach. For 1GLA, the LSSMF(3,1)-FNO–CCSD(T)/cc-pVTZ energy computation with Δ = 5 Å is performed on a cluster with 50 nodes (each node has 8 cores, 64 GB of memory, and a Xeon Scalable 6148 2.40 GHz CPU). If this system were treated as a whole molecule, there would be 231,408 basis functions. At the LSSMF(3,1) level, 3170 groups, 11,445 bonded fragments, and 62,716 nonbonded fragments are formed for 1GLA. The largest fragment has only 736 basis functions. The LSSMF(3,1)-FNO–CCSD(T) energy of the molecule is −267,117.064554 hartree (with an FNO occupation tolerance of 10⁻⁴). This computation is completed in ∼7 days, which shows the efficiency of our LSSMF method. The number of atoms is similar for the biomolecular complex and the linear alkane considered, C3334H6670. However, the biomolecular complex yields significantly larger fragments due to aromatic bonds in amino acids, which explains the different computational times.
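A rough sense of the savings for 1GLA can be obtained from the formal O(N7) scaling of CCSD(T) and the counts reported above. The sketch below is a crude estimate under stated assumptions: it ignores prefactors, the FNO truncation, and memory limits, and it pessimistically treats every fragment as being as large as the largest one. The variable names are illustrative.

```python
# Counts reported in the text for 1GLA at the LSSMF(3,1) level.
n_basis_whole = 231_408        # basis functions if treated as one molecule
n_basis_largest_fragment = 736
n_fragments = 11_445 + 62_716  # bonded + nonbonded fragments

SCALING_POWER = 7  # formal CCSD(T) scaling with the number of basis functions

# Formal cost of the whole-molecule computation (arbitrary units).
whole_cost = n_basis_whole ** SCALING_POWER

# Upper bound on the summed fragment cost: every fragment is assumed to be
# as large as the largest one, which overestimates the real work.
fragment_cost_bound = n_fragments * n_basis_largest_fragment ** SCALING_POWER

# Because the fragment cost is an upper bound, this ratio is a lower bound
# on the formal savings.
ratio = whole_cost / fragment_cost_bound
print(f"formal cost ratio (lower bound): {ratio:.1e}")
```

Even under these pessimistic assumptions the formal ratio is many orders of magnitude, consistent with the whole-molecule computation being infeasible while the fragment-based one completes in days.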

Conclusions

In this research, efficient implementations of linear-scaling coupled-cluster methods, which employ the systematic molecular fragmentation approach, have been reported. For branched molecules, a new fragmentation algorithm, which yields smaller fragments compared with previous studies, has been introduced. The new linear-scaling SMF algorithm is denoted by LSSMF. Performances of the developed LSSMF-CC approaches, such as LSSMF-CCSD and LSSMF-CCSD(T), have been compared with their canonical versions for a set of alkane molecules, CnH2n+2 (n = 6–10), which includes 142 molecules. Our results demonstrate that the LSSMF approach introduces negligible errors compared with the canonical methods. For the alkanes set, the MAE values are 0.19–0.58 and 0.20–0.59 kcal mol–1 for the LSSMF(3,1)-CCSD and LSSMF(3,1)-CCSD(T) methods, respectively. A similar performance has been observed for the frozen natural orbitals-based CCSD(T) approach [LSSMF(3,1)-FNO–CCSD(T)]. Further, we investigate basis set effects on the LSSMF methods using the cc-pVXZ (X = D, T, Q) basis sets. Our results indicate that the performance of the LSSMF(3,1)-FNO–CCSD(T) approach with large basis sets is similar to that with small basis sets. To further assess the performance of the LSSMF approaches for large molecular systems, we consider the L12 set, which consists of 12 large molecules containing 50–70 carbon atoms. For the L12 set, various bonded and nonbonded levels are considered. Our results demonstrate that the combination of bonded level 6 with nonbonded level 2, LSSMF(6,2), yields highly accurate results for the MP2 method. The MAE value for the LSSMF(6,2)-MP2 method with respect to MP2 is 0.32 kcal mol–1 with the cutoff value of 7.5 Å. The LSSMF(6,2) approach yields more than a 26-fold reduction in errors compared with the LSSMF(3,1) approach.
Hence, we obtain dramatic improvements over Collins’ original SMF approach.[59] To illustrate the efficiency and applicability of the LSSMF approach, we first consider an alkane molecule with 10,004 atoms. For the C3334H6670 molecule, the LSSMF(3,1)-CCSD(T)/cc-pVTZ energy computation, on a Linux cluster with 100 nodes, with 4 cores and 5 GB of memory provided to each node, is performed in just ∼24 h. Furthermore, we consider a biomolecular complex (PDB code: 1GLA), which includes 10,488 atoms, as a second test of the efficiency of LSSMF. The LSSMF(3,1)-FNO–CCSD(T)/cc-pVTZ single-point energy computation is completed in ∼7 days for the biomolecular complex. Even though the numbers of atoms are similar, the biomolecular complex includes larger fragments than the linear alkane considered, which accounts for the difference in the reported wall times. Hence, our results demonstrate that the LSSMF-CC approaches are very efficient and that the LSSMF(6,2) level approaches the accuracy of the canonical method. Overall, we conclude that the LSSMF approach is promising for high-accuracy applications of electron correlation methods in large-scale chemical systems where canonical methods are computationally prohibitive.
