Uğur Bozkaya1, Betül Ermiş1. 1. Department of Chemistry, Hacettepe University, Ankara 06800, Turkey.
Abstract
The coupled-cluster (CC) singles and doubles with perturbative triples [CCSD(T)] method is frequently referred to as the "gold standard" of modern computational chemistry. However, the high computational cost of CCSD(T) [O(N⁷)], where N is the number of basis functions, limits its applications to small chemical systems. To address this problem, efficient implementations of linear-scaling coupled-cluster methods, which employ the systematic molecular fragmentation (SMF) approach, are reported. In this study, we aim to do the following: (1) To achieve exact linear scaling and to obtain a pure ab initio approach, we revise the handling of nonbonded interactions in the SMF approach; the revised approach is denoted by LSSMF. (2) A new fragmentation algorithm, which yields smaller fragments and therefore better fits high-level CC methods, is introduced. (3) A modified nonbonded fragmentation scheme is proposed to enhance the existing algorithm. The performance of the LSSMF-CC approaches, such as LSSMF-CCSD(T), is compared with that of their canonical versions for a set of alkane molecules, CₙH₂ₙ₊₂ (n = 6–10), which includes 142 molecules. Our results demonstrate that the LSSMF approach introduces negligible errors compared with the canonical methods; mean absolute errors (MAEs) are between 0.20 and 0.59 kcal mol⁻¹ for LSSMF(3,1)-CCSD(T). For a larger alkane set (L12), CₙH₂ₙ₊₂ (n = 50–70), the performance of LSSMF for second-order perturbation theory (MP2) is investigated. For the L12 set, various bonded and nonbonded levels are considered. Our results demonstrate that the combination of bonded level 6 with nonbonded level 2, LSSMF(6,2), provides very accurate results for the MP2 method, with an MAE value of 0.32 kcal mol⁻¹. The LSSMF(6,2) approach yields more than a 26-fold reduction in errors compared with LSSMF(3,1). Hence, we obtain substantial improvements over the original SMF approach.
To illustrate the efficiency and applicability of the LSSMF-CCSD(T) approach, we consider an alkane molecule with 10,004 atoms. For this molecule, the LSSMF(3,1)-CCSD(T)/cc-pVTZ energy computation, on a Linux cluster with 100 nodes, 4 cores, and 5 GB of memory provided to each node, is completed in just ∼24 h. As a second test, we consider a biomolecular complex (PDB code: 1GLA), which includes 10,488 atoms, to assess the efficiency of the LSSMF approach. The LSSMF(3,1)-FNO-CCSD(T)/cc-pVTZ energy computation is completed in ∼7 days for the biomolecular complex. Hence, our results demonstrate that the LSSMF-CC approaches are very efficient. Overall, we conclude the following: (1) The LSSMF(m, n)-CCSD(T) methods can be reliably used for large-scale chemical systems, where the canonical methods are not computationally affordable. (2) The accuracy of bonded level 3 is not satisfactory for large chemical systems. (3) For high-accuracy studies, bonded level 5 (or higher) and nonbonded level 2 should be employed.
It has been demonstrated
that coupled-cluster (CC) methods are
accurate for the prediction of molecular properties.[1−5] The coupled-cluster singles and doubles (CCSD) method[6] provides quite accurate results for most molecular
systems at equilibrium geometries; nevertheless, a triple-excitation
correction is required to obtain high accuracy.[7−13] The coupled-cluster singles and doubles with perturbative triples
[CCSD(T)] method[10,11,14] provides excellent results for a broad range of chemical systems
near equilibrium geometries.[12,15−24] Therefore, the CCSD(T) method is generally referred to as the “gold
standard” of computational chemistry. However, the high computational
cost of CCSD(T) [O(N⁷)], where N is the number of basis functions, limits
its applications to small chemical systems.

There have been many attempts to develop reduced-cost electron
correlation methods.[25−35] Some of these studies take advantage of the locality of molecular
orbitals (MO), which is based on the idea that dynamic correlation
is a short-range phenomenon. The introduction of a “correlation
domain” concept, by Pulay and co-workers,[25,26] stimulated local correlation approaches. There are a few variants
of local CC methods, such as projected atomic orbitals-based local
CC methods (PAO-LCC)[28,29] and local pair natural orbitals
(LPNOs).[32−34] Other attempts are the cluster-in-molecule (CIM)
approach[36−40] and the divide–expand–consolidate (DEC) approach.[41,42]

Molecular fragmentation approaches (MFAs) are an alternative, and more effective, way than LCC
methods
to tackle the molecular size dependence problems of electronic structure
theories. Various
molecular fragmentation approaches have been suggested to overcome
the steep scaling problem of electronic structure methods.[43−46] In molecular fragmentation approaches, a molecular system is broken
up into small molecular units, and energies of the fragments are combined
to approximate the energy of the entire system. Although the logic
behind all fragmentation approaches is similar, the formation of fragments
and the combination of the fragment energies differ significantly
from method to method. Molecular fragmentation methods include the
molecular tailoring approach (MTA),[47−49] fragment molecular orbital
theory (FMO),[50−52] molecular fractionation with conjugate caps (MFCC),[53,54] systematic molecular fragmentation (by annihilation) [SMF(A)],[44,55−64] combined fragmentation method (CFM),[60,65] generalized
energy-based fragmentation (GEBF),[66] kernel
energy method (KEM),[67,68] molecules-in-molecules (MIM)
approach,[69] many-overlapping-body expansion
(MOBE),[70] and generalized many-body expansion
(GMBE).[71]

In terms of accuracy and general applicability, the SMF approach
appears to be very attractive. The SMF energy is a sum of two components:
bonded and nonbonded. We may also call them covalent and noncovalent
terms. The number of bonded fragments scales linearly [O(n)], where n is the number of
groups, while the number of nonbonded fragments scales quadratically
[O(n²)]. To reduce the
high cost of nonbonded fragments, Collins introduced a cutoff distance
(R), such as 2 Å.[61] If the distance between the monomers of a nonbonded
fragment is smaller than R, then the fragment is treated with electronic structure methods; otherwise,
it is treated with a simple perturbation theory approach. For branched molecules,
Collins’ algorithm yields larger fragments than in
the chain-like linear alkane case, which is another difficulty.[55,56] This becomes especially problematic for high-level CC
approaches, such as CCSD(T), where the computational cost increases
steeply with the molecular size.

In this research, to achieve exact linear scaling and to obtain
a pure ab initio approach, we completely neglect
all long-range nonbonded contributions since they already approach
zero. Further, we introduce a new fragmentation algorithm for
branched molecules, which yields smaller fragments; hence, the
new algorithm better fits high-level CC methods. The new linear-scaling
SMF algorithm, denoted by LSSMF, has been coded in the C++ language
by the present authors and added to the MacroQC[72] software. The LSSMF approach is integrated with
the Dfocc module.[24,73−82] Hence, all methods available in the Dfocc module can be
used with the LSSMF approach. The newly proposed LSSMF-CC approaches,
such as LSSMF-CCSD and LSSMF-CCSD(T) as well as LSSMF-MP2, are applied
to a series of alkane molecules to demonstrate their efficiencies
and accuracies.
Systematic Molecular Fragmentation (SMF)
The SMF approach starts with the molecule M divided
into different “groups”. Groups are sets of atoms defined
by the SMF algorithm. The basic ideas involved in the method can be
illustrated for the simplest case involving a chain-like molecule
containing N groups connected by single bonds:

$$M = G_1 G_2 G_3 \cdots G_N$$

The target is to derive an accurate value
for the total electronic energy, which is the sum of the bonded and nonbonded components:

$$E = E_b + E_{nb}$$

The energy of the molecule M is determined by
summation of the energies of the fragments ($F_i$), which are defined in terms of combinations of groups.
The sizes of the fragments depend on the “level”
of SMF, and the fragments can overlap with each other since a group
can be involved in multiple fragments.[59] Hence, additional fragments with negative coefficients are generated
to cancel the effects of multiple counting.

The bonded energy is

$$E_b = \sum_i f_i \, E(F_i)$$

where $f_i$ is the integer coefficient associated with the fragment $F_i$.

For a model system
of a chain containing five groups, the SMF fragmentation
scheme can be expressed as

$$\text{Level 1: } G_1G_2G_3G_4G_5 \rightarrow G_1G_2 + G_2G_3 + G_3G_4 + G_4G_5 - G_2 - G_3 - G_4$$

$$\text{Level 2: } G_1G_2G_3G_4G_5 \rightarrow G_1G_2G_3 + G_2G_3G_4 + G_3G_4G_5 - G_2G_3 - G_3G_4$$

$$\text{Level 3: } G_1G_2G_3G_4G_5 \rightarrow G_1G_2G_3G_4 + G_2G_3G_4G_5 - G_2G_3G_4$$

Thus, the fragment sizes increase with
the level used. However,
the number of fragments grows linearly with the size of the system.
The authors have noted that the different levels used in SMF are related
to some older concepts used in the field of theoretical thermochemistry.
For example, level 1 reactions are known as “isodesmic reactions”,[83] level 2 as homodesmotic reactions,[84] and level 3 as hyperhomodesmotic[85] reactions.

Since the bonded energy only
includes nearby interactions, one
should also consider the nonbonded interactions between more distant groups.
The nonbonded energy is evaluated as a sum over the “allowed” nonbonded interactions, that is,
the interactions that are not already included in the bonded energy.[44,60]
New Linear-Scaling SMF (LSSMF) Approach
To illustrate how our LSSMF approach differs from the previous
SMF/SMFA approach(es), let us consider an open-chain alkane molecule.
For a chain-like CₙH₂ₙ₊₂ molecule with the SMF
scheme (at level 3), the bonded fragments are just butane and propane
fragments when hydrogen caps are added. The nonbonded fragments are
just methane dimers with different intermolecular distances. For n carbon atoms,
there are n − 3 butane and n − 4 propane fragments (2n − 7 bonded fragments in total)
and (n − 3)(n − 4)/2 methane dimers. The number of bonded fragments therefore scales linearly
with the number of carbons, while the number of nonbonded fragments
(NB) scales quadratically. However, one may consider only short-range
NB fragments, and their number also scales linearly with the system
size.

Hence, we introduce a nonbonded cutoff tolerance,
$\Delta_{NB}$. If the distance between the
closest atoms of two groups is larger than $\Delta_{NB}$, then this nonbonded fragment is disregarded. We denote this
algorithm by distance-based elimination (DBE). An
alternative approach uses the ratio of distance to covalent radii
(DCRR),[86] which is defined in terms of $X_i$, the Cartesian position of atom i in the fragment m, and $r_i$,
the covalent radius of that atom. Atomic covalent radii are obtained
from Cordero et al.[87]

In this study,
we consider different bonded and nonbonded fragmentation levels. Hence,
we introduce the LSSMF(m, n) notation,
where m and n indicate the bonded
and nonbonded levels, respectively.
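The two screening criteria described above can be illustrated with a short Python sketch. It is not taken from the MacroQC code. The DBE test follows the text directly (closest-atom distance against the cutoff), whereas the DCRR expression used here, the minimum over atom pairs of the interatomic distance divided by the sum of the two covalent radii, is an assumed form. The function names and the two-fragment example are hypothetical.

```python
import math

# Covalent radii in Å for H and sp3 C (values from Cordero et al.).
COVALENT_RADII = {"H": 0.31, "C": 0.76}

def closest_distance(frag_m, frag_n):
    """Distance (Å) between the closest atoms of two fragments (DBE criterion)."""
    return min(math.dist(xi, xj) for _, xi in frag_m for _, xj in frag_n)

def dcrr(frag_m, frag_n, radii=COVALENT_RADII):
    """ASSUMED DCRR form: min over atom pairs of |X_i - X_j| / (r_i + r_j)."""
    return min(
        math.dist(xi, xj) / (radii[si] + radii[sj])
        for si, xi in frag_m
        for sj, xj in frag_n
    )

def keep_nonbonded(frag_m, frag_n, cutoff, use_dcrr=False):
    """Return True if the nonbonded fragment survives the cutoff screening."""
    value = dcrr(frag_m, frag_n) if use_dcrr else closest_distance(frag_m, frag_n)
    return value <= cutoff

# Hypothetical example: a C-H unit and a lone carbon atom on the z axis.
frag_a = [("C", (0.0, 0.0, 0.0)), ("H", (0.0, 0.0, 1.09))]
frag_b = [("C", (0.0, 0.0, 6.08))]
print(keep_nonbonded(frag_a, frag_b, cutoff=6.0))  # DBE with a 6 Å cutoff
print(keep_nonbonded(frag_a, frag_b, cutoff=3.0))  # DBE with a 3 Å cutoff
```

Each atom is a `(symbol, (x, y, z))` pair, so the same functions work for groups and for larger fragments.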
LSSMF Fragmentation Algorithm
Before presenting our fragmentation algorithm, let us define the notation: i, j, k, l, ... for atoms; a, b, c, d, ... for groups; and μ, ν, λ, σ, ... for fragments.

1. Define the level of SMF and the tolerances for single, double, and triple bonds as well as the NB cutoff: $\Delta_{s}$, $\Delta_{d}$, $\Delta_{t}$, and $\Delta_{NB}$.
2. Read the molecular info: Cartesian coordinates (X, Y, and Z), number of atoms (N), atomic masses, and atomic covalent radii ($r_i$).
3. Compute the interatomic distances: $R_{ij}$.
4. Compute the bond order matrix, $B_{ij}$. If $R_{ij} < r_i + r_j + \Delta_{s}$, then $B_{ij} = 1$. If $R_{ij} < r_i + r_j - \Delta_{d}$, then $B_{ij} = 2$. If $R_{ij} < r_i + r_j - \Delta_{t}$, then $B_{ij} = 3$. Else $B_{ij} = 0$.
5. Catch the first nonhydrogen atom. The first such atom is assigned to group 1 (in fact, group 0 in C++).
6. Assign the remaining non-H atoms. Starting from the first non-H atom, loop over atom pairs i, j. If $B_{ij} > 1$, then assign j to the group of the ith atom, $G_a$. Otherwise, assign it to the next group, $G_{a+1}$.
7. Catch double/triple bonded non-H atoms in different groups and merge them.
8. Assign the hydrogen atoms to each group according to the values of $B_{ij}$.
9. Form the group connectivity matrix, $L_{ab}$. If two groups are connected to each other, then $L_{ab} = 1$; otherwise, it is equal to zero. Further, determine the bonded atoms of two connected groups: $LA_{ab}$.
10. Determine the number of caps per group.
11. Form bonded and nonbonded domains for each group. For group $G_i$, the bonded domain is the list of groups $G_j$ (j ≠ i) that are connected to $G_i$. Similarly, the nonbonded domain is the list of groups $G_j$ (j ≠ i) that are not connected to $G_i$, within the nonbonded cutoff tolerance.
12. Form lists of groups and bonded and nonbonded fragments according to the SMF level. Details of bonded and nonbonded fragment algorithms are provided in Sections and 2.5, respectively.
13. Add embedded charges to groups and fragments in the case of polar molecules.[88]
14. Write MacroQC input files for groups and bonded and nonbonded fragments.
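The first part of the procedure (bond orders, groups, and group connectivity) can be sketched in Python as follows. This is a simplified illustration rather than the authors' C++ implementation. The tolerance values are hypothetical, the group assignment uses a union-find merge instead of the sequential loop described in the text, and the four-atom test geometry is artificial (only its interatomic distances are meaningful).

```python
import math
from itertools import combinations

RADII = {"H": 0.31, "C": 0.76}  # Å, Cordero et al. values for H and sp3 C
# Hypothetical tolerances (Å) for single, double, and triple bonds.
TOL_SINGLE, TOL_DOUBLE, TOL_TRIPLE = 0.30, 0.12, 0.26

def bond_orders(atoms):
    """Bond order matrix B from distances and covalent radii."""
    n = len(atoms)
    B = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        (si, xi), (sj, xj) = atoms[i], atoms[j]
        R = math.dist(xi, xj)
        rsum = RADII[si] + RADII[sj]
        order = 0
        if R < rsum + TOL_SINGLE:
            order = 1
        if R < rsum - TOL_DOUBLE:   # shorter contacts override the single bond
            order = 2
        if R < rsum - TOL_TRIPLE:
            order = 3
        B[i][j] = B[j][i] = order
    return B

def assign_groups(atoms, B):
    """Group index per atom; multiply bonded heavy atoms share a group."""
    n = len(atoms)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    heavy = [i for i, (s, _) in enumerate(atoms) if s != "H"]
    for i, j in combinations(heavy, 2):
        if B[i][j] > 1:                      # merge double/triple bonded atoms
            parent[find(j)] = find(i)
    for h in range(n):
        if atoms[h][0] == "H":
            for i in heavy:
                if B[h][i] == 1:             # attach H to its bonded heavy atom
                    parent[h] = find(i)
                    break
    roots = sorted({find(i) for i in heavy})
    label = {r: g for g, r in enumerate(roots)}
    return [label.get(find(i), -1) for i in range(n)]  # -1: unassigned H

def group_connectivity(groups, B):
    """Connectivity matrix L: L[a][b] = 1 if groups a and b are bonded."""
    ng = max(groups) + 1
    L = [[0] * ng for _ in range(ng)]
    for i, j in combinations(range(len(groups)), 2):
        a, b = groups[i], groups[j]
        if a >= 0 and b >= 0 and a != b and B[i][j] > 0:
            L[a][b] = L[b][a] = 1
    return L

# Artificial C=C-C skeleton with one H on the last carbon.
atoms = [("C", (0.0, 0.0, 0.0)), ("C", (1.34, 0.0, 0.0)),
         ("C", (2.84, 0.0, 0.0)), ("H", (2.84, 1.09, 0.0))]
B = bond_orders(atoms)
groups = assign_groups(atoms, B)
print(groups, group_connectivity(groups, B))
```

In this example the two doubly bonded carbons end up in one group, and the third carbon and its hydrogen form a second group connected to the first.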
Capping Hydrogens
In each final fragment,
bonds that connect groups in the fragment to groups that
are not in the fragment are “missing”. These missing
bonds are replaced by bonds to hydrogen atoms.[55] The total number of hydrogen atoms added to fragments with
a sign of +1 is exactly equal to the number added to fragments with
a sign of −1. The position of each H atom is taken to lie along
the missing bond vector at a distance which is proportional to the
expected ratio of bond lengths. That is,

$$X_H = X_i + \frac{r_i + r_H}{r_i + r_j}\,(X_j - X_i)$$

where $X_i$ denotes the Cartesian position of the atom in the fragment,
$X_j$ denotes the Cartesian
position of the atom that is not available in the fragment, and $r_i$, $r_j$, and $r_H$ are the corresponding covalent radii.
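A minimal Python sketch of the cap placement is given below. It assumes that the “expected ratio of bond lengths” is the ratio of covalent-radius sums, (r_i + r_H)/(r_i + r_j), as in the Deev and Collins scheme. The function name and the example geometry are hypothetical.

```python
def cap_position(x_in, x_out, r_in, r_out, r_h=0.31):
    """Position of a capping H atom that replaces a missing bond.

    x_in  : position of the atom kept in the fragment
    x_out : position of the bonded atom that is not in the fragment
    r_in, r_out, r_h : covalent radii (Å) of the two atoms and of hydrogen
    ASSUMPTION: the cap lies on the missing bond vector, scaled by the
    ratio of covalent-radius sums (r_in + r_h) / (r_in + r_out).
    """
    scale = (r_in + r_h) / (r_in + r_out)
    return tuple(a + scale * (b - a) for a, b in zip(x_in, x_out))

# Hypothetical C-C bond of 1.54 Å along x, cut between the two carbons.
cap = cap_position((0.0, 0.0, 0.0), (1.54, 0.0, 0.0), r_in=0.76, r_out=0.76)
print(cap)
```

For a cut C–C bond this places the hydrogen at roughly 1.08 Å from the retained carbon, close to a typical C–H bond length.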
New Fragmentation Algorithm for Branched Molecules
Our fragmentation algorithm is identical to the one suggested by
Deev and Collins[55] for unbranched chain-like
molecules. However, in the case of branched molecules, we propose
a new algorithm. In order to illustrate the difference between the two
algorithms, let us consider the 2,4-dimethylpentane (24DMP) molecule
(Figure 1), for which
the fragmentation result at level 3 was reported by Deev and Collins.[55]
Figure 1
2,4-Dimethylpentane (24DMP) molecule.

In the 24DMP molecule, each carbon atom defines
a group, with a
total of seven groups. The fragmentation suggested by Deev and Collins
yields the following bonded fragments at level 3:[55]

$$G_1G_2G_3G_4G_5G_6G_7 \rightarrow G_1G_2G_3G_4G_5 + G_1G_4G_5G_6G_7 - G_1G_4G_5$$

where G1G2G3G4G5G6G7 represents the whole molecule.
In this case, fragments G1G2G3G4G5 and G1G4G5G6G7 are
formed from the combination of five groups. However, in the case
of an open-chain analog, the fragments are formed from the combination of
four groups. Hence, Deev and Collins’ algorithm yields fragments
of different sizes for open-chain and branched molecules. In the latter
case, it yields much larger fragments, which may be a problem for
high-level CC methods, where the computational cost increases steeply.
Therefore, one of the authors (U.B.) suggests a new fragmentation
algorithm for branched molecules, in which smaller fragments
are formed, as in the case of open-chain molecules.

In the fragmentation that our algorithm produces for the 24DMP molecule at level 3, the fragments formed by
the combination of four groups are called the main fragments. The
remaining fragments are considered for chemical balance. Hence, we
may call them neutralizing fragments or renormalization
terms, reminiscent of many-body perturbation theory.
The logic of the proposed fragmentation approach is to form all possible
bonded fragments combining four different groups, as in the case of
open-chain alkane molecules. Our algorithm produces 6$F_4$ + 5$F_3$ + 2$F_2$, where $F_i$ denotes a fragment formed from i different groups,
whereas Deev and Collins’ algorithm produces 2$F_5$ + $F_3$. Hence, our algorithm
yields smaller fragments, while Deev and Collins’ algorithm
yields a smaller number of fragments. For high-level CC computations
with large basis sets, the size of a fragment is more important than
the number of additional small fragments. Moreover, a group can be
as small as CH₄ and H₂O but can be as large
as benzene and naphthalene. Hence, in the case of large groups, such
as benzene and naphthalene, decreasing the size of the fragment from $F_5$ to $F_4$ is still
very important to reduce the cost even though small basis sets are
employed. Therefore, our algorithm is more efficient in terms of computational
cost and better fits high-level CC methods, such as CCSD(T).

Our new bonded fragmentation algorithm is as follows:

1. Let us assume that we are employing bonded level m fragmentation. Then, the size of our main fragments will be m + 1, which means they will include m + 1 groups.
2. To form bonded fragments, we need to loop over groups. At the bonded level m, one may loop over m + 1 groups, which would be an $O(n^{m+1})$ loop, where n is the number of groups. However, with the concept of the bonded domain, we can reduce the cost dramatically. Hence, we just run a single loop over groups. For each group ($G_i$), we get the bonded domain $BD_i$, which includes the $G_j$ groups (j ≠ i). Then, for each $G_j$ group, we get the bonded domain $BD_j$. Finally, we form the union of bonded domains (UBD). Please note that each bonded domain may include just a few groups, in the case of alkanes as many as four. For example, UBD may include a maximum of 16 groups at level 3. Since some groups may be present simultaneously in different BDs, the actual size of UBD is much smaller.
3. Once we form UBD, we loop over the elements of UBD and form all possible main bonded fragments of size m + 1. The groups of the fragments that are formed should be connected in the original molecule.
4. Repetitive groups, in the form of the largest possible fragments, in the main bonded fragments are added to the list of bonded fragments with appropriate negative coefficients. For example, at level 3, our main fragments include four groups ($F_4$). Hence, we first search for repeating three-group fragments ($F_3$) that are connected in the bonded list. If we find any $F_3$, then we add them to the list with appropriate negative coefficients. Then, we repeat this procedure for $F_2$ fragments and finally for groups. The mentioned negative coefficients are obtained following the chemical balance rules.
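The formation of the main bonded fragments can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: instead of building the UBD explicitly, it grows connected sets one bonded neighbor at a time, which also touches only the local bonded domain of each group. The group numbering used for 24DMP is an assumption inferred from the fragments quoted above (G1 is the central CH₂, G4 and G5 are the CH groups, and the rest are methyl groups). The neutralizing fragments and their coefficients are not computed.

```python
def main_bonded_fragments(adjacency, level):
    """All connected sets of level + 1 groups (the main bonded fragments).

    adjacency maps each group label to the list of groups bonded to it,
    i.e., its bonded domain.
    """
    size = level + 1
    frontier = {frozenset([g]) for g in adjacency}
    for _ in range(size - 1):
        # Extend every connected set by one group bonded to any of its members.
        frontier = {
            frag | {h}
            for frag in frontier
            for g in frag
            for h in adjacency[g]
            if h not in frag
        }
    return frontier

# 2,4-Dimethylpentane with the ASSUMED numbering described above.
dmp24 = {1: [4, 5], 2: [4], 3: [4], 4: [1, 2, 3], 5: [1, 6, 7], 6: [5], 7: [5]}
# Unbranched chain of five groups.
chain5 = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

for frag in sorted(sorted(f) for f in main_bonded_fragments(dmp24, level=3)):
    print(frag)
print(len(main_bonded_fragments(chain5, level=3)))
```

For 24DMP at level 3 this enumeration gives six four-group fragments, in agreement with the 6$F_4$ term quoted in the text, and for the five-group chain it gives the two four-group fragments of the level 3 scheme.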
Modified Nonbonded Algorithms
In
the original SMF approach, Collins and co-workers present a simple
and effective way to consider nonbonded interactions.[44,55−63] In Collins and co-workers’ NB approach, only two-body interactions
are considered, which may be called the NB level 1 fragmentation.
For example, for the linear C₇H₁₆ molecule,
the bonded and NB fragments can be written in a shorthand in which each $G_i$ group is represented by the symbol i
(for example, 234 = G2G3G4) and i ↔ j denotes the NB interaction between the ith and jth groups.

However, for an accurate treatment
of the NB interactions, one needs
to consider larger contributions than two-body interactions. In a
2009 study, Addicoat and Collins reported an improved algorithm for
the NB interactions using a level–level approach in addition
to a three-body expansion method.[86]

Even though Addicoat and Collins’[86] level–level
approach is an improvement over the two-body expansion, it includes a
limited number of higher terms. Furthermore, Addicoat and Collins’[86] three-body expansion approach yields a very
large number of fragments. Hence, we propose a modified NB algorithm,
which includes all three-body terms, while lower level interactions
come from the renormalization terms. Our modified algorithm is obtained by
applying our NB cutoff approaches to Addicoat and Collins’
three-body expansion method. In our algorithm, we first
build all three-body terms; we then inspect all subunits and cancel
the repeating terms.

Our nonbonded fragmentation algorithm is as follows:

1. The distant groups that are not present together in the bonded fragments are considered in nonbonded fragments.
2. At nonbonded level 1, we form “dimers” of the capped groups ($G_i$ ↔ $G_j$) that are not simultaneously present in bonded fragments. The distance between these two groups must be within the nonbonded cutoff limits; otherwise, the dimer is disregarded.
3. At nonbonded level 2, we form three-body complexes (3BCs). Each 3BC is formed between a group and a fragment of two groups ($G_i$ ↔ $F_2$). The $F_2$ fragments can be obtained by bonded level 1 fragmentation. The $G_i$ group and the $F_2$ fragment should not be simultaneously present in bonded fragments. Next, we inspect these 3BCs for repeating “dimers” of groups. If we find any repeating “dimer”, then we add it (as a dimer of capped groups) to the list with an appropriate negative coefficient. The negative coefficients are again obtained following the chemical balance rules. As in the case of nonbonded level 1, we consider only nonbonded fragments within the nonbonded cutoff limits.

Even though our nonbonded level 2 (NB2) appears to be
identical
to the three-body expansion approach of Addicoat and Collins,[86] our algorithm yields a dramatically reduced
number of fragments because of the
nonbonded cutoff limits employed. For example, for the C₇₀H₁₄₂ molecule, our algorithm yields 1540 NB fragments, while
the three-body expansion
approach of Addicoat and Collins would yield ∼4 times as many.
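The nonbonded level 1 step, and the effect of the cutoff on the number of dimers, can be illustrated with the following Python sketch for an unbranched chain. Several simplifications are assumed: bonded level 3 is used, groups are numbered consecutively along the chain, and the distance between groups a and b is approximated as |a − b| times a hypothetical spacing of 1.27 Å rather than the closest-atom distance used in the actual DBE criterion. Nonbonded level 2 (3BCs and their coefficients) is not covered.

```python
from itertools import combinations

def chain_main_fragments(n_groups, level):
    """Main bonded fragments of an unbranched chain: runs of level + 1 groups."""
    size = level + 1
    return [tuple(range(i, i + size)) for i in range(1, n_groups - size + 2)]

def nonbonded_dimers(n_groups, bonded_fragments, spacing=None, cutoff=None):
    """Group pairs (a, b) that never share a bonded fragment.

    If cutoff is given, pairs farther apart than cutoff are dropped, using
    the ASSUMED distance model (b - a) * spacing.
    """
    together = set()
    for frag in bonded_fragments:
        together.update(combinations(sorted(frag), 2))
    dimers = []
    for a, b in combinations(range(1, n_groups + 1), 2):
        if (a, b) in together:
            continue
        if cutoff is not None and (b - a) * spacing > cutoff:
            continue
        dimers.append((a, b))
    return dimers

heptane = chain_main_fragments(7, level=3)
print(nonbonded_dimers(7, heptane))

# Quadratic growth without a cutoff versus linear growth with one.
for n in (10, 30, 70):
    frags = chain_main_fragments(n, level=3)
    all_pairs = nonbonded_dimers(n, frags)
    screened = nonbonded_dimers(n, frags, spacing=1.27, cutoff=6.0)
    print(n, len(all_pairs), len(screened))
```

Under these assumptions, n-heptane has six level 1 dimers, and the unscreened count for an n-group chain is (n − 3)(n − 4)/2, whereas the screened count grows only linearly with n.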
Results and Discussion
Results from
the HF, MP2, CCSD, CCSD(T), LSSMF(3,1)-HF, LSSMF(3,1)-MP2,
LSSMF(3,1)-CCSD, and LSSMF(3,1)-CCSD(T) methods were obtained for
a set of alkanes, CₙH₂ₙ₊₂ (n = 6–10), for comparison of the
absolute energies. Further, for a set of larger systems, CₙH₂ₙ₊₂ (n = 50–70),
results from MP2 and LSSMF-MP2 are compared for the total energies.
For the alkanes, Dunning’s correlation-consistent polarized
valence double-, triple-, and quadruple-ζ basis sets (cc-pVDZ,
cc-pVTZ, and cc-pVQZ) were employed with the frozen core approximation.[89,90] The density-fitting approach was used for the LSSMF methods considered.[24,74,78,79] For the cc-pVXZ primary basis sets, cc-pVXZ-JKFIT[91] and cc-pVXZ-RI[92] auxiliary basis
sets were employed for reference and correlation energies, respectively.
Geometries of the CₙH₂ₙ₊₂ structures considered were optimized at the B3LYP/cc-pVDZ
and universal force field (UFF)[93] levels
for n = 6–10 and n = 50–70,
respectively.

Previous studies demonstrated that the accuracies
of level 1 and
level 2 approaches are not satisfactory for bonded fragments, and
that at least level 3 should be used.[44,60] Hence, in
this study, we consider levels 3–6 for bonded fragments, while
we consider levels 1 and 2 for nonbonded fragments.
S142 Set
To assess the accuracy of
the LSSMF(3,1) approach with respect to the canonical methods, we
consider a set of alkanes, CₙH₂ₙ₊₂ (n = 6–10), which includes
142 molecules, denoted by S142. For the first step of our assessment,
we choose a safe cutoff value for nonbonded interactions: Δ = 10.0 Å. In the next section, the effects
of different Δ values are evaluated.
Mean absolute errors (MAEs) of the LSSMF(3,1)-HF, LSSMF(3,1)-MP2,
LSSMF(3,1)-CCSD, and LSSMF(3,1)-CCSD(T) methods with respect to canonical
methods are depicted in Figure 2. For the CₙH₂ₙ₊₂ set, total energies from MP2, CCSD, CCSD(T), LSSMF(3,1)-MP2,
LSSMF(3,1)-CCSD, and LSSMF(3,1)-CCSD(T) methods and percentages of
the LSSMF energies with respect to the canonical methods are reported
in Tables S1–S6.
Figure 2
Mean absolute errors
in the total energies of the CₙH₂ₙ₊₂ (n = 6–10) isomers for
the LSSMF(3,1)-HF, LSSMF(3,1)-MP2, LSSMF(3,1)-CCSD,
and LSSMF(3,1)-CCSD(T) methods with respect to canonical methods.
All computations are performed with the cc-pVDZ basis set and with
Δ = 10.0 Å.

For the C₁₀H₂₂ isomers, which
are the largest
members of the test set considered, total energies from MP2, CCSD, CCSD(T),
LSSMF(3,1)-MP2, LSSMF(3,1)-CCSD, and LSSMF(3,1)-CCSD(T) methods and
percentages of the LSSMF energies with respect to the canonical methods
are reported in Tables S5 and S6. For the
correlated methods, the percent coverage values are in the range 99.9990%–100.0005%,
while those of LSSMF(3,1)-HF are in the range 99.9978%–100.0001%. Hence,
all considered LSSMF methods cover a satisfactory portion of the total
energy of the full methods. The MAE values (Figure 2) in total energies are 1.61 [LSSMF(3,1)-HF],
0.59 [LSSMF(3,1)-MP2], 0.58 [LSSMF(3,1)-CCSD], and 0.59 [LSSMF(3,1)-CCSD(T)]
kcal mol⁻¹. Further, the Δ values for total energies are 5.30 [LSSMF(3,1)-HF], 2.56 [LSSMF(3,1)-MP2],
2.27 [LSSMF(3,1)-CCSD], and 2.01 [LSSMF(3,1)-CCSD(T)] kcal mol⁻¹. Hence, considering both error measures, MAE and
Δ, the results of the correlated
LSSMF methods are in good agreement with the canonical methods. These
results demonstrate that the high-level electron correlation methods
are less prone to fragmentation errors since dynamical electron
correlation is a local phenomenon. Considering the results obtained
for the whole alkane set, one can safely rely on the LSSMF-CC methods
for high-accuracy studies of large chemical systems, where the
canonical methods are not computationally affordable.
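The two accuracy measures used in this section can be written compactly as follows. This Python sketch assumes that the percent coverage is 100 × E(LSSMF)/E(canonical) and that MAEs are computed from total energies in hartree and converted to kcal mol⁻¹. The energies in the example are invented for illustration and are not data from this study.

```python
HARTREE_TO_KCAL = 627.5095  # kcal/mol per hartree

def percent_coverage(e_lssmf, e_canonical):
    """Percentage of the canonical total energy recovered by LSSMF (assumed form)."""
    return 100.0 * e_lssmf / e_canonical

def mean_absolute_error(e_lssmf, e_canonical):
    """MAE in kcal/mol between two lists of total energies given in hartree."""
    errors = [abs(a - b) * HARTREE_TO_KCAL for a, b in zip(e_lssmf, e_canonical)]
    return sum(errors) / len(errors)

# Hypothetical total energies (hartree) for two molecules.
canonical = [-100.000, -200.000]
fragmented = [-100.001, -199.999]
print(mean_absolute_error(fragmented, canonical))
print([percent_coverage(f, c) for f, c in zip(fragmented, canonical)])
```

A coverage above 100% means that the fragment-based energy lies below the canonical one, as in the first hypothetical molecule.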
Cutoff
In the second step of our
assessment of the LSSMF approaches, we investigate the effect of nonbonded
cutoff tolerances on the accuracy. For this purpose, we consider five
isomers of C₁₀H₂₂: 2,2,3,3-tetramethylhexane
(2233TMH), 4-ethyl-2,4-dimethylhexane (4E24DMH), 4-isopropylheptane
(4IPH), 5-methylnonane (5MN), and n-decane (decane).
For these molecules, the total energies of the LSSMF(3,1)-CCSD(T)
approach are computed with Δ =
3–10 Å. The errors at each Δ value with respect to the full methods are depicted in Figure 3. Our results indicate
that the maximum error is generally obtained at 3 Å, as expected,
and that the errors remain essentially constant from 6 Å onward. In the case of
the n-decane molecule, we obtain the lowest error
at Δ = 3 Å. The
lowest error is obtained at the shortest distance because the bonded energy of n-decane is closer to the CCSD(T) energy
than the total LSSMF energy, which covers 100.0005% of the
CCSD(T) energy. In other words, adding more nonbonded contributions
by increasing Δ yields energies that are lower
than the CCSD(T) energy. Overall, even though we use Δ = 10 Å throughout this study, a Δ value of 6.0 Å appears to be sufficient
for most purposes, which corresponds to a DCRR value of ∼4.0.
Figure 3
Errors
of the LSSMF(3,1)-CCSD(T) method with respect to the full
method with different cutoff distances for 2,2,3,3-tetramethylhexane
(2233TMH), 4-ethyl-2,4-dimethylhexane (4E24DMH), 4-isopropylheptane
(4IPH), 5-methylnonane (5MN), and n-decane (decane)
molecules. All computations are performed with the cc-pVDZ basis set.
Frozen Natural Orbitals
To further
increase the applicability of the LSSMF(3,1)-CCSD(T) approach, we
also consider frozen natural orbitals (FNOs).[72,94−97] The FNO approximation substantially reduces the computational
cost of CCSD(T), while it introduces negligible errors when sufficiently tight
occupation tolerances, such as 10⁻⁵, are used. To improve
the FNO–CC results, we employ the δ correction as suggested by DePrince and Sherrill.[97] With the FNO approximation, we can consider
larger basis sets for the canonical methods; hence, we employ the
cc-pVTZ basis set. For n-decane and the four lowest-energy
isomers, we obtain MAE and Δ values of 0.74 and 1.04 kcal mol⁻¹, respectively,
for the LSSMF(3,1)-FNO–CCSD(T) approach (Tables S7 and S8). Hence, the fragmentation error is tolerable
for the FNO–CCSD(T) method, as in the case of CCSD(T).
Basis Set Effects
To investigate
the effect of basis sets, we also carry out total energy computations
for the LSSMF(3,1)-FNO–CCSD(T) method with the cc-pVDZ, cc-pVTZ,
and cc-pVQZ basis sets for three C₇H₁₆ isomers.
One of these isomers is n-heptane, and the others are
the lowest-energy isomers: 2,2,3-trimethylbutane and 2,2-dimethylhexane.
The MAE values with respect to FNO–CCSD(T) for different basis
sets are depicted in Figure 4. The MAE values are 0.33 (cc-pVDZ), 0.38 (cc-pVTZ), and 0.45
(cc-pVQZ) kcal mol⁻¹. Even though there is a slight
increase with basis set size, the errors are still of tolerable
magnitude.
Figure 4
Mean absolute errors in the total energies of three C₇H₁₆ isomers for the LSSMF(3,1)-FNO–CCSD(T) method
with respect to FNO–CCSD(T). All computations are performed
with the FNO occupation tolerance of 10⁻⁵ and Δ = 10.0 Å in the cc-pVDZ (DZ), aug-cc-pVDZ
(aDZ), cc-pVTZ (TZ), aug-cc-pVTZ (aTZ), cc-pVQZ (QZ), and aug-cc-pVQZ
(aQZ) basis sets.

To investigate the effect of diffuse basis sets,
we also carried
out energy computations for the LSSMF(3,1)-FNO–CCSD(T) method
with the aug-cc-pVDZ, aug-cc-pVTZ, and aug-cc-pVQZ basis sets for
the same isomers (Figure 4). The MAE values with respect to FNO–CCSD(T) are 0.71
(aug-cc-pVDZ), 0.47 (aug-cc-pVTZ), and 0.49 (aug-cc-pVQZ) kcal mol⁻¹. The MAE of aug-cc-pVDZ is roughly twice that of
cc-pVDZ, while the MAEs of the triple- and quadruple-ζ
basis sets are only slightly larger than those of cc-pVTZ and cc-pVQZ.
Nevertheless, the errors remain at tolerable magnitudes.
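The effect of the diffuse functions can be quantified directly from the six MAE values quoted in this section. The short Python sketch below takes the ratio of each aug-cc-pVXZ MAE to its cc-pVXZ counterpart; the variable and function names are illustrative only.

```python
# MAE values (kcal/mol) quoted in the text for LSSMF(3,1)-FNO-CCSD(T).
mae_plain = {"DZ": 0.33, "TZ": 0.38, "QZ": 0.45}    # cc-pVXZ
mae_diffuse = {"DZ": 0.71, "TZ": 0.47, "QZ": 0.49}  # aug-cc-pVXZ


def diffuse_ratios(plain, diffuse):
    """Ratio of the aug-cc-pVXZ MAE to the cc-pVXZ MAE for each basis set size."""
    return {label: diffuse[label] / plain[label] for label in plain}


ratios = diffuse_ratios(mae_plain, mae_diffuse)
for label, value in ratios.items():
    print(f"{label}: aug/plain MAE ratio = {value:.2f}")
```

The double-ζ ratio is about 2.2, whereas the triple- and quadruple-ζ ratios stay close to 1, which is the trend described above.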
L12 Set
To further investigate the
performance of the LSSMF approach for larger molecules, we consider
the L12 set (Table S9), which consists
of 12 large molecules including 50–70 carbon atoms. For the
L12 set, conventional CC computations are not computationally feasible.
Hence, for this set, we investigate the errors of LSSMF-MP2 with respect
to the canonical MP2 for absolute energies. For the L12 set, the cc-pVDZ
basis set is employed. For the S142 set, only bonded level 3 is considered
because higher levels cover either the whole molecule or a large
portion of it. Hence, for the L12 set, we explore the effect of
higher levels: bonded levels 3–6 and nonbonded
levels 1 and 2 are considered. For each combination of bonded and
nonbonded levels, cutoff values of 7.5 and 10.0 Å are employed
for nonbonded interactions.

For the C50H102–C70H142 molecules,
total energies from HF, MP2, LSSMF-HF, and LSSMF-MP2 methods and percentages
of the LSSMF energies with respect to the canonical methods are reported
in Tables S10–S25. For the entire
set, the percent coverage values lie in the ranges 99.9983%–100.0002%
and 100.0000%–100.0008% for LSSMF-HF and LSSMF-MP2, respectively.
Hence, both LSSMF methods recover essentially all of the total energies
of the corresponding canonical methods.

With the nonbonded level
1 and the cutoff value of 7.5 Å,
the MAE values (Figure 5) in the LSSMF-HF total energies with respect to HF are 9.95, 2.35,
0.59, and 0.27 kcal mol⁻¹ for the bonded levels
3, 4, 5, and 6, respectively. With the nonbonded level 1 and the cutoff
value of 7.5 Å, the MAE values (Figure 6) in the LSSMF-MP2 total energies with respect
to MP2 are 8.34, 2.99, 2.12, and 1.05 kcal mol⁻¹ for the bonded levels 3, 4, 5, and 6, respectively. Hence, at higher
bonded levels, the errors of the LSSMF approach decrease systematically.
The accuracy of bonded level 6 is substantially better than that of the lower
levels considered; for LSSMF-MP2, there is a 7.9-fold reduction in error compared
to level 3, which was advocated as accurate in previous studies.
To further improve our results, we also consider the nonbonded level
2 scheme proposed in this study. With the nonbonded level 2 and the
cutoff value of 7.5 Å, the MAE values (Figure 5) in the LSSMF-HF total energies with respect
to HF are 6.55, 1.83, 0.48, and 0.19 kcal mol⁻¹ for
the bonded levels 3, 4, 5, and 6, respectively. With the nonbonded
level 2 and the cutoff value of 7.5 Å, the MAE values (Figure 6) in the LSSMF-MP2
total energies with respect to MP2 are 6.64, 1.48, 0.92, and 0.32
kcal mol⁻¹ for the bonded levels 3, 4, 5, and 6,
respectively. The NB level 2 fragmentation significantly improves
upon NB level 1 and provides 1.26-, 2.02-, 2.30-, and 3.28-fold reductions
in the LSSMF-MP2 errors for bonded levels 3, 4, 5, and 6,
respectively. Thus, the bonded level 6, NB level 2 combination, which
may be denoted by LSSMF(6,2), provides the best results. Further, it
is also noteworthy that, for MP2, the LSSMF(5,2) level provides lower errors
than LSSMF(6,1). Similarly, the LSSMF(4,2) level is better
than LSSMF(5,1). Hence, it appears that raising the nonbonded level first is more effective than going to a higher
bonded level.
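The fold reductions discussed above follow directly from the LSSMF-MP2 MAE values reported for the 7.5 Å cutoff. The Python sketch below recomputes them; the helper name `fold_reduction` and the dictionary layout are illustrative choices, while the numbers are the MAEs quoted in the text.

```python
BONDED_LEVELS = (3, 4, 5, 6)

# LSSMF-MP2 MAEs (kcal/mol) at the 7.5 Å cutoff, as quoted in the text.
mae_mp2_nb1 = dict(zip(BONDED_LEVELS, (8.34, 2.99, 2.12, 1.05)))  # nonbonded level 1
mae_mp2_nb2 = dict(zip(BONDED_LEVELS, (6.64, 1.48, 0.92, 0.32)))  # nonbonded level 2


def fold_reduction(reference_mae, improved_mae):
    """How many times smaller the improved MAE is than the reference MAE."""
    return reference_mae / improved_mae


# Gain from nonbonded level 2 over level 1 at each bonded level.
nb2_vs_nb1 = {
    m: round(fold_reduction(mae_mp2_nb1[m], mae_mp2_nb2[m]), 2) for m in BONDED_LEVELS
}
# Gain from bonded level 6 over level 3 at nonbonded level 1.
level6_vs_level3 = fold_reduction(mae_mp2_nb1[3], mae_mp2_nb1[6])
# Gain of LSSMF(6,2) over LSSMF(3,1).
best_vs_31 = fold_reduction(mae_mp2_nb1[3], mae_mp2_nb2[6])

print(nb2_vs_nb1)
print(f"(6,1) vs (3,1): {level6_vs_level3:.1f}-fold")
print(f"(6,2) vs (3,1): {best_vs_31:.1f}-fold")
```

The same table also shows why raising the nonbonded level pays off first: the (4,2) and (5,2) MP2 errors are below the (5,1) and (6,1) errors, respectively.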
Figure 5
Mean absolute errors in the total energies of the L12 set (the
largest member is C70H142) for the LSSMF-HF
method with respect to HF. All computations are performed with
Δ = 7.5 Å in the cc-pVDZ
basis set. The (m, n) notation
indicates the bonded and nonbonded levels, respectively.
Figure 6
Mean absolute errors in the total energies of the L12 set (the
largest member is C70H142) for the LSSMF-MP2
method with respect to MP2. All computations are performed with
Δ = 7.5 Å in the cc-pVDZ
basis set. The (m, n) notation
indicates the bonded and nonbonded levels, respectively.
With the nonbonded level 1 and the cutoff value
of 10.0 Å,
the MAE values (Figure 7) in the LSSMF-HF total energies with respect
to HF are 9.94, 2.34, 0.59, and 0.28 kcal mol⁻¹ for
the bonded levels 3, 4, 5, and 6, respectively. With the nonbonded
level 1 and the cutoff value of 10.0 Å, the MAE values (Figure 8) in the LSSMF-MP2
total energies with respect to MP2 are 8.45, 3.11, 2.24, and 1.17 kcal
mol⁻¹ for the bonded levels 3, 4, 5, and 6, respectively.
With the nonbonded level 2 and the cutoff value of 10.0 Å, the
MAE values (Figure 7) in the LSSMF-HF total energies with respect
to HF are 6.57, 1.86, 0.48, and 0.17 kcal mol⁻¹ for
the bonded levels 3, 4, 5, and 6, respectively. With the nonbonded
level 2 and the cutoff value of 10.0 Å, the MAE values (Figure 8) in the LSSMF-MP2
total energies with respect to MP2 are 6.55, 1.48, 0.92, and 0.33 kcal
mol⁻¹ for the bonded levels 3, 4, 5, and 6, respectively.
These results are virtually the same as those obtained
with the cutoff value of 7.5 Å, which again demonstrates that
our cutoff tolerance is accurate enough. Overall, our results demonstrate
that the LSSMF(6,2) level approaches the canonical method. Therefore,
one may rely on the LSSMF methods for high-accuracy studies of large
chemical systems, where the canonical methods are not computationally
feasible.
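The insensitivity to the cutoff can be checked by comparing the two sets of MAE values entry by entry. The Python sketch below does this with the numbers quoted in the text for the 7.5 and 10.0 Å cutoffs; the data layout and the function name are illustrative only.

```python
# MAEs (kcal/mol) quoted in the text, keyed by (method, nonbonded level).
# Each tuple lists bonded levels 3, 4, 5, and 6 in that order.
mae_cutoff_7p5 = {
    ("HF", 1): (9.95, 2.35, 0.59, 0.27),
    ("MP2", 1): (8.34, 2.99, 2.12, 1.05),
    ("HF", 2): (6.55, 1.83, 0.48, 0.19),
    ("MP2", 2): (6.64, 1.48, 0.92, 0.32),
}
mae_cutoff_10p0 = {
    ("HF", 1): (9.94, 2.34, 0.59, 0.28),
    ("MP2", 1): (8.45, 3.11, 2.24, 1.17),
    ("HF", 2): (6.57, 1.86, 0.48, 0.17),
    ("MP2", 2): (6.55, 1.48, 0.92, 0.33),
}


def largest_shift(short_cutoff, long_cutoff):
    """Largest absolute change in MAE when the cutoff is increased."""
    return max(
        abs(a - b)
        for key in short_cutoff
        for a, b in zip(short_cutoff[key], long_cutoff[key])
    )


shift = largest_shift(mae_cutoff_7p5, mae_cutoff_10p0)
print(f"Largest MAE change between cutoffs: {shift:.2f} kcal/mol")
```

The largest change, about 0.12 kcal mol⁻¹, occurs for LSSMF-MP2 at nonbonded level 1; all other entries move by less than 0.1 kcal mol⁻¹.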
Figure 7
Mean absolute errors in the total energies of the L12 set (the
largest member is C70H142) for the LSSMF-HF
method with respect to HF. All computations are performed with
Δ = 10.0 Å in the cc-pVDZ
basis set. The (m, n) notation
indicates the bonded and nonbonded levels, respectively.
Figure 8
Mean absolute errors in the total energies of the L12 set (the
largest member is C70H142) for the LSSMF-MP2
method with respect to MP2. All computations are performed with
Δ = 10.0 Å in the cc-pVDZ
basis set. The (m, n) notation
indicates the bonded and nonbonded levels, respectively.
Timing
In our LSSMF implementation,
we first form the groups and the bonded and nonbonded fragments (step 1); then, we
write all fragment input files to disk (step 2). In the third step, we simultaneously
submit all fragment jobs to our Linux clusters. Finally, we collect
the energy values from the output files, merge them, and compute the final
LSSMF energy. Hence, our implementation is naturally parallel. The
fragment formation procedure (step 1) is the fastest step: we can
form all fragments in just a few minutes owing to our efficient fragmentation
algorithm. Writing the fragment input files (step 2) generally takes several minutes.
Hence, the overall cost is dominated by the CC jobs (step 3), whose wall time
depends on the number of cores that are available.

To illustrate the efficiency of our fragmentation
algorithm, we consider a set of alkanes that includes 10,004–50,012
atoms. Total wall times (in min) for the LSSMF(3,1) code (step 1 +
step 2) for the CnH2n+2 (n = 3334, 6668, 10,002, 13,336, 16,670)
set are depicted in Figure 9. For the largest member of the alkane set considered, C16670H33342, the total time for the LSSMF code is
just 8.4 min on a single-node (1 core) computer. Hence, our LSSMF
code forms the fragments and prepares the necessary input
files very efficiently.
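The workflow described above (independent fragment jobs followed by a final merge) can be sketched in a few lines of Python. This is only a schematic, not the authors' code: the `Fragment` class, the signed integer coefficients, the thread pool standing in for cluster submission, and the placeholder energies are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    name: str
    coefficient: int  # signed weight of this fragment in the final energy sum


# Placeholder energies (hartree), NOT real results; a real run would parse
# these from the output file of each fragment job.
PLACEHOLDER_ENERGIES = {"frag_a": -2.0, "frag_b": -3.0, "overlap_ab": -1.0}


def run_fragment_job(fragment):
    """Stand-in for one independent electronic structure job."""
    return PLACEHOLDER_ENERGIES[fragment.name]


def assemble_energy(fragments, max_workers=4):
    """Run all fragment jobs concurrently, then merge the weighted energies."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        energies = list(pool.map(run_fragment_job, fragments))
    return sum(f.coefficient * e for f, e in zip(fragments, energies))


fragments = [
    Fragment("frag_a", 1),
    Fragment("frag_b", 1),
    Fragment("overlap_ab", -1),  # region counted twice, so it is subtracted once
]
total_energy = assemble_energy(fragments)
print(f"Assembled energy: {total_energy} hartree")
```

Because no fragment job depends on another, the wall time of the merge step is negligible and the total time is set by how many jobs can run at once.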
Figure 9
Total wall time (in min) for the LSSMF(3,1) code for a CnH2n+2 set. All procedures
were performed on a single node (1 core) Intel(R) Xeon(R) Gold 5218
CPU @ 2.30 GHz computer.
To illustrate the efficiency and applicability
of the LSSMF(3,1)-CCSD(T)
approach, we consider the C3334H6670 molecule,
which includes 10,004 atoms. For the C3334H6670 molecule, the LSSMF(3,1)-CCSD(T)/cc-pVTZ energy computation is performed
on a Linux cluster using 100 nodes, with 4 cores and 5 GB of memory provided
to each node. On this system, the total wall time for the energy computation
is ∼24 h, which indicates that the introduced method is highly
efficient. As a second example, we consider a biomolecular complex
(PDB code: 1GLA), which includes 10,488 atoms, to illustrate the efficiency of the
LSSMF approach. For 1GLA, the LSSMF(3,1)-FNO–CCSD(T)/cc-pVTZ energy computation with
Δ = 5 Å is performed on a
cluster with 50 nodes (each node has 8 cores, 64 GB of memory, and a
Xeon Scalable 6148 2.40 GHz CPU). If this chemical system were treated
as a whole molecule, there would be 231,408 basis functions. At the
LSSMF(3,1) level, 3170 groups, 11,445 bonded fragments, and 62,716 nonbonded
fragments are formed for 1GLA. The largest fragment has only 736 basis
functions. The LSSMF(3,1)-FNO–CCSD(T) energy of the molecule
is −267,117.064554 hartree (with the FNO occupation tolerance
of 10⁻⁴). This computation is completed in ∼7
days, which shows the efficiency of our LSSMF method. The number of
atoms is similar for the biomolecular complex and the linear alkane
considered, C3334H6670. However, the biomolecular
complex yields significantly larger fragments due to aromatic bonds
in amino acids. Therefore, we observe different computational times.
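The two timing examples can be put on a common footing by converting them to approximate core-hours. The Python sketch below does so under two stated assumptions: the alkane run is read as 4 cores on each of 100 nodes, and the approximate wall times (∼24 h and ∼7 days) are taken at face value. The helper name is illustrative.

```python
def core_hours(nodes, cores_per_node, wall_hours):
    """Approximate resource use, assuming every core is busy for the whole run."""
    return nodes * cores_per_node * wall_hours


# C3334H6670, LSSMF(3,1)-CCSD(T)/cc-pVTZ: 100 nodes x 4 cores, about 24 h.
alkane_core_hours = core_hours(100, 4, 24)
# 1GLA, LSSMF(3,1)-FNO-CCSD(T)/cc-pVTZ: 50 nodes x 8 cores, about 7 days.
complex_core_hours = core_hours(50, 8, 7 * 24)

# Fragment statistics quoted in the text for 1GLA.
total_fragments = 11_445 + 62_716          # bonded + nonbonded
largest_fragment_share = 736 / 231_408     # largest fragment vs whole-molecule basis

print(f"Alkane:  ~{alkane_core_hours} core-hours")
print(f"Complex: ~{complex_core_hours} core-hours")
print(f"Ratio:   ~{complex_core_hours / alkane_core_hours:.0f}x")
print(f"Fragments: {total_fragments}, largest uses "
      f"{100 * largest_fragment_share:.2f}% of the full basis")
```

Under these assumptions the biomolecular complex needs roughly seven times the core-hours of the alkane, even though the two systems have a similar number of atoms, while its largest fragment still spans only about 0.3% of the whole-molecule basis.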
Conclusions
In this research, efficient
implementations of linear-scaling coupled-cluster
methods, which employ the systematic molecular fragmentation approach,
have been reported. For branched molecules, a new fragmentation
algorithm, which yields smaller fragments compared with previous
studies, has been introduced. The new linear-scaling SMF algorithm
is denoted by LSSMF. The performance of the developed LSSMF-CC approaches,
such as LSSMF-CCSD and LSSMF-CCSD(T), has been compared with that of their
canonical versions for a set of alkane molecules, CnH2n+2 (n = 6–10),
which includes 142 molecules. Our results demonstrate that the LSSMF
approach introduces negligible errors compared with the canonical
methods. For the alkane set, the MAE values are 0.19–0.58 and
0.20–0.59 kcal mol⁻¹ for the LSSMF(3,1)-CCSD
and LSSMF(3,1)-CCSD(T) methods, respectively. A similar performance
has been observed in the case of the frozen natural orbitals-based
CCSD(T) approach [LSSMF(3,1)-FNO–CCSD(T)]. Further, we investigated
basis set effects on the LSSMF methods using the cc-pVXZ (X = D, T, Q) basis
sets. Our results indicate that the performance of the LSSMF(3,1)-FNO–CCSD(T)
approach with large basis sets is similar to that with small basis sets.

To further assess the performance of the LSSMF approaches for
large molecular systems, we consider the L12 set, which consists of
12 large molecules including 50–70 carbon atoms. For the L12
set, various bonded and nonbonded levels are considered. Our results
demonstrate that the combination of bonded level 6 with nonbonded
level 2, LSSMF(6,2), yields highly accurate results for the
MP2 method. The MAE value for the LSSMF(6,2)-MP2 method with respect
to MP2 is 0.32 kcal mol⁻¹ with the cutoff value
of 7.5 Å. The LSSMF(6,2) approach yields more than a 26-fold
reduction in errors compared with the LSSMF(3,1) approach. Hence,
we obtain dramatic improvements over Collins’ original SMF
approach.[59]

To illustrate the efficiency
and applicability of the LSSMF approach,
we first consider an alkane molecule with 10,004 atoms. For the
C3334H6670 molecule, the LSSMF(3,1)-CCSD(T)/cc-pVTZ
energy computation, on a Linux cluster using 100 nodes with 4 cores and
5 GB of memory provided to each node, is completed in just ∼24
h. Furthermore, we consider a biomolecular complex (PDB code: 1GLA), which includes
10,488 atoms, as a second test of the efficiency of
LSSMF. The LSSMF(3,1)-FNO–CCSD(T)/cc-pVTZ single-point energy
computation is completed in ∼7 days for the biomolecular complex.
Even though the numbers of atoms are similar, the biomolecular
complex includes larger fragments than the linear alkane considered,
which accounts for the difference in the reported wall times. Hence, our
results demonstrate that the LSSMF-CC approaches are very efficient.

Our results demonstrate that the LSSMF(6,2) level approaches the
canonical method. Overall, we conclude that
the LSSMF approach is promising for high-accuracy applications of electron correlation
methods in large-scale chemical systems where canonical methods are
computationally prohibitive.