Literature DB >> 35930763

Efficient Crystal Structure Prediction for Structurally Related Molecules with Accurate and Transferable Tailor-Made Force Fields.

Alessandra Mattei¹, Richard S Hong¹, Hanno Dietrich², Dzmitry Firaha², Julian Helfferich², Yifei Michelle Liu², Kiran Sasikumar², Nathan S Abraham¹, Rajni Miglani Bhardwaj¹, Marcus A Neumann², Ahmad Y Sheikh¹.

Abstract

Crystal structure prediction (CSP) is generally used to complement experimental solid form screening and applied to individual molecules in drug development. The fast development of algorithms and computing resources offers the opportunity to use CSP earlier and for a broader range of applications in the drug design cycle. This study presents a novel paradigm of CSP specifically designed for structurally related molecules, referred to as Quick-CSP. The approach prioritizes more accurate physics through robust and transferable tailor-made force fields (TMFFs), such that significant efficiency gains are achieved through the reduction of expensive ab initio calculations. The accuracy of the TMFF is increased by the introduction of electrostatic multipoles, and the fragment-based force field parameterization scheme is demonstrated to be transferable for a family of chemically related molecules. The protocol is benchmarked with structurally related compounds from the Bromodomain and Extraterminal (BET) domain inhibitors series. A new convergence criterion is introduced that aims at performing only as many ab initio optimizations of crystal structures as required to locate the bottom of the crystal energy landscape within a user-defined accuracy. The overall approach provides significant cost savings ranging from three- to eight-fold less than the full-CSP workflow. The reported advancements expand the scope and utility of the underlying CSP building blocks as well as their novel reassembly to other applications earlier in the drug design cycle to guide molecule design and selection.

Entities: Chemical

Mesh：

Year: 2022 PMID： 35930763 PMCID： PMC9476662 DOI： 10.1021/acs.jctc.2c00451

Source DB: PubMed Journal: J Chem Theory Comput ISSN： 1549-9618 Impact factor: 6.578

Introduction

Crystal structure prediction (CSP) of organic molecules aims to generate and rank the crystal structures with various three-dimensional (3D) packing arrangements of molecules, starting from the chemical diagram alone. The field has made substantial progress since the challenge laid down in the seminal commentary by Dunitz in the early 2000s.[1] This progress is best manifested by the results of the latest blind test organized by the Cambridge Crystallographic Data Center (CCDC),[2] a sixth in the series where the predictions have become more accurate, despite increased chemical complexity and conformational flexibility.[2−7] Overall, much of the underlying research focus is directed toward accurately calculating the relative energies of all putative crystal packing arrangements. Hence, all CSP approaches are structured hierarchically so that the overall workflow resembles a funnel of increasingly narrow criteria and accurate energy models. Relatively less-demanding calculations based on force fields are used to generate a large number of candidate crystal structures, while more advanced first-principles computational methods are applied for the final ranking of the most promising crystal structures. Tailor-made force fields (TMFFs) represented an initial breakthrough in the field, allowing sufficiently accurate calculations of the energy minima associated with crystal packings.[8,9] Currently, TMFFs and first-principles energy ranking approaches with dispersion-inclusive hybrid functionals[10,11] and many-body dispersion interactions[12,13] are the most successful methods for yielding reliable and accurate lattice energies.[14,15] More recently, molecular dynamics and enhanced molecular dynamics have also been rigorously utilized to probe free energy stabilities and to eliminate labile structures.[16−18] Such algorithms have been thoroughly benchmarked and widely applied to predict the crystal energy landscape of organic molecules, including complex chemical systems of pharmaceutical relevance. In parallel, modern machine learning potentials combined with state-of-the-art first-principles calculations have been demonstrated to approach the accuracy required for predictive stability rankings with affordable computational costs.[18,19] In addition, the unprecedented level of parallelization available through cloud computing has permitted the resource-intensive CSP calculations to be completed in weeks, if not days.[20] Driven by the fast development of algorithmic improvements and powerful computing resources, CSP has fueled considerable interest across the pharmaceutical industry to complement experimental solid form screening.[21−24] Contemporary prediction of crystal energy landscapes is routinely used to assess and manage the robustness of the solid form selected for development.[25−27] In the context of drug design, when CSP is performed earlier in the design cycle, crucial insights can be gained by understanding the structural determinants of key physicochemical properties inherently rooted in essential crystal structure features, such as conformational flexibility and intermolecular interactions.[28,29] Beyond these downstream calculations, fundamental building blocks of the CSP workflow, such as force fields,[30] enhanced Monte-Carlo simulations,[31] and dispersion-corrected density functional theory (DFT-D),[32] are also broadly applicable in the molecular design itself.[33] The need for computational efficiency in computer aided drug design often leads to reliance on traditional transferable force fields, such as OPLS,[34,35] AMBER,[36] CHARMM,[37] COMPASS,[38] and GROMOS.[39] While limited in accuracy by their underlying functional form, these generic force fields are parameterized with experimental and quantum mechanics data for small organic molecules and/or macromolecules, conferring good parameter transferability.[40] On the other hand, TMFFs are parameterized on molecule-by-molecule basis and thus are more accurate.[8] In addition to requiring more computational effort, they are obviously not transferable. Before the immense value of TMFFs can be realized for broader applications beyond CSP, improvements on both the transferability and computational efficiency are therefore needed. The limitations of standard CSP workflows even when executed at ever-increasing speeds leave many opportunities that, if addressed, can expand the scope and utility of the underlying building blocks and their novel reassemblies. In this study, we present a new paradigm of CSP specifically designed for structurally related molecules, referred to as Quick-CSP, which improves the TMFF accuracy, broadens the concept of fragment-based parameterization to impart transferability, and significantly reduces the extent of ab initio calculations for unprecedented levels of overall efficiency and versatility. We demonstrate the performance of this novel approach on moderately flexible molecules in the pyrrolopyridone-based series of Bromodomain and Extraterminal (BET) domain inhibitors, where the implications of subtle chemical modifications on conformational flexibility, crystal packing, and the resulting aqueous crystalline solubility were recently reported.[29] Our new paradigm provides a stepping-stone to the simultaneous prediction of crystal structures for more than one molecule. The resulting 3D packing arrangements of molecules can thereby be incorporated in physicochemical property predictions, and the insight into the structural determinants of the properties can enable rational drug design.[28,30,41−44]

Results and Discussion

Figure shows key differentiating features of the Quick-CSP workflow compared to the standard Full-CSP, as implemented in the GRACE[45] software suite. Here, the accuracy of the TMFF is improved by the introduction of electrostatic multipoles[46−48] attached to local atomic coordinate systems. The fragment-based approach for force field parameterization is expanded to a family of chemically related molecules, resulting in higher accuracy and some level of transferability. Furthermore, the efficiency of the fragment-based approach systematically increases as the number of molecules under consideration increases, and the approach is scalable to chemical series of virtually any size. In the Quick-CSP workflow, CSPs are run for all compounds in parallel, relaxing the convergence requirements. A key feature is that the bottom energy of the crystal energy landscape is determined through a statistical analysis together with estimates of multiple crystal structures’ DFT-D energies and their associated error. This allows convergence with only a few, robustly selected crystal structures to be DFT-D-optimized, ultimately reducing the computational cost.

Figure 1

Comparison of Quick-CSP and Full-CSP workflows with the differences highlighted in blue text.

Comparison of Quick-CSP and Full-CSP workflows with the differences highlighted in blue text. The Quick-CSP approach was applied to BET bromodomain inhibitors, which have generated significant interest for studying various types of cancers, including acute myeloid leukemia (AML) and multiple myeloma (MM).[49−51] Recent structure–activity-relationship (SAR) efforts have led to the identification of pyrrolopyridone core derivatives that demonstrated submicromolar activity in binding assays against BRD4 and substantial efficacy in several in vivo mouse xenograft models.[52] This family of molecules also shows sufficient diversity in terms of both the features related to developability within the context of Rule of 5 (Ro5) framework and structural variance to challenge each step of the Quick-CSP and Full-CSP workflows.

Features of BET Pyrrolopyridone Core Molecules

Eight molecules of the BET bromodomain inhibitors were selected in this study from the lead optimization efforts.[52] As displayed in Table , these molecules feature a pyrrolopyridone ring that acts as a hydrogen bond acceptor and donor through the pyrrolopyridone carbonyl and the pyrrole amino proton, with the ability of forming a bidentate hydrogen bond. The hydrogen bonding groups and dominant molecular shape are conserved when small chemical modifications are made on the distal ether group or the central aryl ring and/or when the functional group attached to the central aryl ring (i.e., sulfonamide, sulfone, or reverse amide moieties) is substituted. The eight molecules show a moderate conformational flexibility with major torsion angles being around the terminal chain attached to the central aryl ring (Φ1), the distal cyclic ether group (Φ2), and the pyrrolopyridone ring (Φ3). The facile reorientation of the functional group attached to the central aryl ring, as measured by Φ1, is not sterically impeded by the distal ether or the pyrrolopyridone moieties, because of the strong cooperativity between neighboring torsion angles.[29] The three key flexible torsion angles can however impact the degree of π-delocalization or conjugation throughout the molecule. Notably, the positioning of the sulfonamide moiety allows for some degree of conjugation through the central aryl ring and the pyrrolopyridone ring with a more planar conformation, as measured by Φ1 and Φ3, showing the maximum conjugation. Thus, the pyrrolopyridone-based molecules can adopt a variety of 3D molecular conformations with different relative positions of the hydrogen bond acceptors and donors.[29]

Table 1

Molecular Diagram of BET Bromodomain Inhibitors with Chemical Modifications on the Pyrrolopyridone Scaffolda

The three flexible torsion angles are also indicated.

The three flexible torsion angles are also indicated. The range of 3D molecular conformations can in turn result in different crystal packing arrangements and specific intermolecular interactions in the crystalline state. Electrostatic forces between the polar and aromatic pyrrolopyridone core molecules are expected to play an important role in crystal structures because of the large dipole moment (Table S1). Similarly, π–π stacking intermolecular interactions, involving the distal ether aryl and/or the pyrrolopyridone rings, can contribute considerably to stabilize the crystal packing. While it is not always clear a priori, the intricacy of this family of molecules suggests the need of a TMFF that accounts for the electrostatics effects of features, such as π-electron density, on hydrogen bonding and aromatic π-stacking interactions to improve accuracy upfront in a CSP workflow.

More Accurate TMFF Method

The Quick-CSP workflow was run twice using the multipole TMFF in one case and point charge TMFF in the other. For each molecule, the accuracy of the backfitted TMFF was assessed by comparing the lattice energies of force field-optimized and ab initio-optimized crystal structures, the latter being taken from the eight full-CSP runs. The root-mean-square deviation (σ), expressed in terms of an error per atom, between the TMFF and PBE-NP[10] lattice energies was then calculated. The multipole TMFF shows systematically lower errors than the point charge TMFF across the pyrrolopyridone-based molecules (Figure ). The multipole TMFF contains more parameters that allow for a better description of the electron density distribution, resulting in more accurate lattice energies. The large number of parameters can, however, make the fitting process more complicated and result in decreased accuracy if the fitting algorithm gets trapped in an unfavorable local minimum. The results demonstrate that GRACE’s force field parameter fitting algorithm transformed the increased complexity, through the introduction of electrostatic multipoles, into increased accuracy. In addition, switching point charge to multipole TMFFs reduces the force field error by 16.2% after backfitting and 14.2% before backfitting (Figure S1 and Table S2). The multipole TMFF shows the strongest improvements for the molecules containing a sulfonamide moiety (i.e., molecules 1–6). Figure a provides more detailed insights into the performance of the two backfitted TMFFs compared to the PBE-NP benchmark for molecule (2), which exhibits the largest difference in force field accuracies of the molecules with the sulfonamide moiety. The most striking feature of the correlation plot is an enhanced agreement of the multipole TMFF energies with PBE-NP calculations that results in a tighter crystal lattice energy distribution. As expected from the smaller σ-value, the multipole TMFF provides improved lattice energies of the generated crystal structures, yielding a mean absolute error (MAE) of 1.21 kcal/mol. In contrast, there appears to be a significant scatter in the predictions of the relative energies computed by the point charge TMFF, leading to a large MAE of 1.57 kcal/mol.

Figure 2

Figure 3

Comparison of backfitted point charge (gray) and multipole (blue) TMFFs relative energies to the corresponding reference PBE-NP energies for (a) molecule (2) and (b) molecule (7). Insets show crystal structures with large deviations in backfitted TMFFs relative energies with respect to PBE-NP.

Accuracy, as measured by σ-value, of backfitted point charge (gray) and multipole (blue) TMFFs for each pyrrolopyridone-based molecule. Reporting the σ-value per atom has the advantage of removing the dependency on the number of atoms. Comparison of backfitted point charge (gray) and multipole (blue) TMFFs relative energies to the corresponding reference PBE-NP energies for (a) molecule (2) and (b) molecule (7). Insets show crystal structures with large deviations in backfitted TMFFs relative energies with respect to PBE-NP. Closer inspection of the crystal lattice energies revealed that the point charge TMFF can miscalculate the strength of hydrogen bonding interactions. The largest negative deviation relative to PBE-NP calculations occurred for the generated crystal structure that shows a bidentate hydrogen bond between pyrrolopyridone rings and lacks intermolecular hydrogen bonding interactions involving the sulfonamide moiety (Figure a top inset (red)). The latter structural feature leads to a large electron density on the electronegative sulfonamide group. The specific description of electrostatic effects by the lone pair electrons and the electron density distribution can make a significant difference to the ability of representing the directionality and strength of hydrogen bonding interactions.[53] The error of the point charge TMFF may thereby originate from systematically underestimating the electrostatic interactions and overestimating the hydrogen bond strength, as the directionality of the intermolecular hydrogen bonding is not accounted. On the other hand, the multipole TMFFs account for the anisotropy of the electron density around the nitrogen and oxygen atoms of the sulfonamide group and considerably reduce the energy error. Further assessment of the crystal lattice energy distribution shows that the multipole TMFF performs well for relatively planar molecular conformations that allow the π-delocalization or conjugation to extend to the central aryl ring via the nitrogen lone pair of the sulfonamide group. This is consistent with the electrostatic potential around conjugated systems, aromatic systems, and/or lone pairs being more realistically reproduced by electrostatic multipole models.[54] In addition, as a result of the planar molecular conformation and extended π-delocalization, the generated crystal structure with a large positive deviation displays aromatic π-stacking interactions that are known to be more correctly modeled by electrostatic multipole functionals (Figure a bottom inset (yellow)).[55] In such a case, the point charge TMFF error stems from the under-stabilization of the extended π-delocalization as well as the underestimation of the intermolecular aromatic π-stacking interactions present in the more planar conformation. Similarly, other molecules containing the sulfonamide moiety show a MAE range of 1.09–1.26 kcal/mol for relative energies computed by the multipole TMFF, whereas MAE range of 1.37–1.57 kcal/mol was obtained by the point charge model (Figure S2a–e). For molecules (7) and (8) that contain the sulfonyl group and the reverse amide directly attached to the central aryl ring, respectively, both TMFFs yield crystal lattice energies that are in very good agreement with the reference PBE-NP method, as indicated by the low and comparable MAE values (Figures b and S2f). The origin of these results may reflect the extent to which molecular conformations, intermolecular interactions, and their interplay contribute toward the stability of molecular organic crystals. The incorporation of a hydrogen bond acceptor directly attached to the central aryl ring alters the extent of the overall π-delocalization in the system. For instance, the molecular conformation of compound (7) in the crystal structure that shows a large positive energy error exhibits a Φ1 of about 90°, a position that completely “breaks” the conjugation between the sulfonyl group and the central aryl ring (Figure b inset). Accordingly, it decreases the overall π-delocalization of the system. Therefore, the subtle difference in σ-values between the two backfitted TMFFs may reflect the negligible contribution of the degree of electron delocalization to the accurate description of the molecular conformation. It is worth emphasizing that crystal lattice energy distributions using the point charge and multipole TMFFs generated before backfitting are comparable to those with backfitted TMFFs, achieving similar MAE ranges across the eight molecules (Figures S3 and S4).

Fragment-Based Force Field for Structurally Related Molecules

Because many molecular fragments share force field parameters, the accuracy of these parameters depends on the transferability between fragments. To evaluate if the force field parameterization for a family of structurally related molecules results in decreased accuracy, the error per atom of the combined TMFF with multipoles was compared to that of the TMFF derived for individual molecules. The σ-values obtained from the combined TMFF are slightly lower than those from the individual TMFFs (Figure ), indicating that for a family of molecules, the TMFF accuracy does not decrease due to the lack of transferability or the difficulty of fitting more parameters. On the contrary, the higher data-to-parameter ratio results in a better definition of force field parameters and/or a more robust fitting behavior. These results refer to the backfitted TMFF. On average, backfitting reduces the energy error by 5.6 and 4.7% in Full-CSP and Quick-CSP workflows, respectively (Table S3). The small size of these changes demonstrates that the TMFFs are well calibrated, even without backfitting, as such removing the backfitting in the Quick-CSP workflow may be acceptable to further reduce the central processing unit (CPU) time requirements. However, additional validation on other structurally related compounds is required to ensure similar performance. Backfitting conflicts with transferability because it is carried out for specific molecules. Nevertheless, it may remain useful in the context of Quick-CSP because force field parameterization errors can be overcome, and the slightly improved force field accuracy may result in downstream savings that are greater than the backfitting effort. If the focus is on force field generation for a large number of chemical species, backfitting can and should be avoided.

Figure 4

Accuracy, as measured by σ-value, of the combined TMFF generated in the Quick-CSP workflow (blue), backfitted to each molecule, and the backfitted TMFF derived for each molecule in the Full-CSP workflow (green). The number of molecular fragments used for the individual and combined TMFFs is shown in Figure a. The individual force field generation for the eight molecules in the Full-CSP workflow used a total of 40 and 19 fragments for the force field parameterization of intramolecular interactions (bonded fragments) and additional torsion terms (torsion fragments), respectively. The combined force field generation in the Quick-CSP workflow employed 20 and 10 fragments for intramolecular interactions and torsion terms, respectively. The two-fold reduction is reflected in the CPU time requirements for the force field parameterization. Generating the combined TMFF for the pyrrolopyridone core molecules indeed decreased the CPU timings by a factor of 1.9 compared to the cumulative time of the individual TMFFs (Figure b). Figure c highlights specific bonded and torsion fragments that are in common among the eight molecules and afford the CPU timing reduction in the combined force field generation. A complete list of molecular fragments used for the individual and combined TMFFs is displayed in Figures S5–S8. For a larger family of chemically related molecules, the fragmentation approach is expected to provide even more economy of scale. Indeed, for a family of 30 BET molecules from the first and next generations of the drug design cycle[52,56] (Table S4) there was a four-fold reduction in the number of molecular fragments for the combined force field generation in the Quick-CSP workflow (Figure S9). The individual force field generation for 30 molecules in the Full-CSP workflow used a total of 208 and 81 bonded and torsion fragments, respectively. On the other hand, the combined force field generation in the Quick-CSP workflow employed only 57 and 31 fragments for the force field parameterization of intramolecular interactions and torsion terms, respectively.

Figure 5

(a) Number of molecular fragments for the individual and combined TMFFs generated in the full-CSP and Quick-CSP workflows, respectively; (b) CPU timings for generating the individual and combined TMFFs. The timing of the TMFF generation for each molecule in the full-CSP is color-coded based on the computational cost requirement; the overall cost saving by creating the combined TMFF for the eight molecules is also indicated. (c) Bonded and torsion fragments that are in common among the eight molecules; hydrogen atoms are omitted for clarity. Most significantly, the combined TMFF not only achieves an improved accuracy and speedup compared to the individual TMFF but also provides transferability across the chemical series. This represents a significant advancement because, to date, each new compound required a newly and individually parameterized TMFF. A key factor underlying this success is that the transferability requirement shifts from the force field parameters themselves, as in traditional transferable force fields, to the shared molecular fragments and the methodology of generating force field parameters. To quantitatively assess transferability improvements, the accuracy of the combined TMFF was compared with that obtained by the COMPASS II force field,[57] a newly parameterized, transferable force field. The combined TMFF for the pyrrolopyridone-based molecules outperforms the COMPASS II force field in reproducing the lattice energies of putative crystal structures, giving a lower σ-value than that of the COMPASS II force field by a factor of 2 (Figure S10). Using multipoles instead of point charges has a stronger impact on the force field accuracy than the backfitting. Multipoles improve the TMFF accuracy over point charges by about 14%, while backfitting improves it by only 5%. Compared to COMPASS II force field, the rest of the improvements stems from the more detailed force field atom types and most likely from the use of explicit off-diagonal van der Waals parameters instead of combination rules. This ability to employ an accurate TMFF that exhibits transferability within structurally related molecules represents a paradigm shift in computing crystal energy landscapes with unprecedented efficiency and accuracy. In addition, it opens the opportunity for CSP components to be leveraged for computer-guided molecule design in advance of molecular synthesis to make accurate yet computationally affordable predictions of physicochemical properties, including solubility, permeability, and potency.[30,33,58]

Crystal Energy Landscapes

Completeness of predicted structures within a given energy window and energy accuracy represent an important metric for the performance of a CSP calculation. The energies of the top 20 ranked crystal structures from each Full-CSP of the three molecules (1), (2) and (6), that had progressed far enough to have experimental structures (all three were experiemntally found to be monomorphic), were calculated at the PBE-NP and PBE(0)+MBD+Fvib level of theory (Figure ). Crystal energy landscapes of the three molecules are shown in Figure S11. PBE(0)+MBD+Fvib is currently considered the most accurate and affordable method with an error of the order of 0.25 kcal/mol.[59] The root-mean-square deviations between the two methods are 0.85, 0.35, and 0.63 kcal/mol for the three molecules, respectively. It should be noted that these values include not only the inaccuracy of the PBE-NP method but also the neglect of zero-point vibrational energy and finite temperature enthalpy and entropy contributions. It is thus concluded that the PBE-NP method has an inherent inaccuracy of about 0.5 kcal/mol for the pyrrolopyridone core molecules. The Quick-CSP calculations were configured to determine the bottom energy of the crystal energy landscapes within an accuracy of 0.5 kcal/mol.

Figure 6

Energies of (a) molecule (1), (b) molecule (2), and (c) molecule (6) crystal structures as predicted at the PBE-NP and the PBE(0)+MBD+Fvib level of theory in the Full-CSP workflow. Experimentally observed crystal forms are highlighted in red, while all other predicted structures are in gray. Figure plots the energies from Quick-CSP and Full-CSP at the PBE-NP level of theory. Across all eight molecules, most of the low-energy structures of the crystal energy landscape from Quick-CSP were identified and reproduced with high-fidelity compared to the Full-CSP method. Details of the relative energies and corresponding ranking of the crystal structures from Quick-CSP and Full-CSP workflows are shown in Tables S5–S7 for molecules (1), (2), and (6). Comparison to experimental crystal structures or powder diffraction patterns are provided in Figures S12 and S13. The crystal structure of molecule (6) was readily solved by matching the experimental powder diffraction pattern to the powder diffraction patterns of the Quick-CSP and Full-CSP predicted crystal structures. The availability of crystal energy landscapes in early stages of drug development allows for cheap and timely crystal structure solution from laboratory quality powder data. In the Quick-CSP workflow, the expected value for the energy at the bottom of the landscape and its error bar were calculated and are displayed in Figure . Table shows that the difference between the actual bottom energy computed in the Full-CSP workflow and the estimate bottom energy calculated in the Quick-CSP workflow is smaller than the calculated standard deviation of the estimate for five molecules and is at the 1σ threshold for one more molecule. For a Gaussian distribution, one would expect 5.5 cases to fall inside one standard deviation on average. The results therefore demonstrate the ability to locate the bottom of the crystal energy landscape within a user-defined accuracy.

Figure 7

Table 2

Energy Difference between the Actual Bottom Energy from the Full-CSP Workflow and the Estimated Bottom Energy from the Quick-CSP Workflow and the Standard Deviation (σ) of the Estimated Bottom Energy

compound ID	Δ bottom energy (kcal/mol)	σ (kcal/mol)a
1	0.099	0.272
2	0.494	0.493
3	–0.577	0.348
4	0.371	0.499
5	0.185	0.368
6	–0.110	0.427
7	0.337	0.414
8	0.601	0.489

σ is expressed as error per molecule that is related to the error per atom by Gaussian error propagation.

Absolute energies of crystal structures as predicted by the Full-CSP and Quick-CSP workflows. The PBE-NP optimized crystal structures are shown in gray, while crystal structures with an estimate of the PBE-NP energy are highlighted in orange. Experimentally observed crystal forms for molecules (1), (2), and (6) are highlighted in red. The estimate for the bottom of the energy window from the Quick-CSP workflow and its error bar are also displayed in black. σ is expressed as error per molecule that is related to the error per atom by Gaussian error propagation. To further highlight the advantages of Quick-CSP over the established Full-CSP workflow, the computational resources employed for crystal structure generation and ranking were compared. The Quick-CSP workflow provides a substantial reduction in the computational cost, associated with crystal structure generation and final energy ranking, as shown in Figure . The extent of cost savings however varied on a molecule-by-molecule basis because of the diversity of crystal packings that these molecules can adopt because of different spatial arrangements of substituents. It ranged from the lowest saving by a factor of 2.7 for molecule (3) to the highest by a factor of 7.8 for molecule (5). Analyzing the computational cost of each step, the effectiveness of the Quick-CSP workflow is due mainly to the CPU timing saving imparted by the reduced number of ab initio crystal structure optimizations in the final energy ranking (Figure S14). It is worth emphasizing that the extent of these savings, while variable and difficult to predict a priori, are significant when compared with the marginal gains achieved through a variety of conventional efficiency gain strategies in standard CSP workflows for these molecules.[29]

Figure 8

Comparison of the total computational cost, associated with the crystal-structure generation and reranking, between the Full-CSP (green) and Quick-CSP (blue) workflows for each pyrrolopyridone-based molecule. Numbers indicate the factor of computational cost savings obtained by the Quick-CSP workflow.

Conclusions

The BET case study shows that the first-of-a-kind Quick-CSP offers an attractive speed-accuracy compromise that allows for the earlier and broader use of the predicted crystal structures. The underlying design concept of Quick-CSP prioritizes more accurate physics earlier in the workflow through robust and transferable TMFFs. The TMFF, a necessary building block for modern CSP, can be generated cost-effectively for families of structurally related molecules by exploiting economy of scale, building on the fragment-based force field parameterization approach implemented in GRACE. For the BET family of molecules, it has been demonstrated that force field parameters are transferable between fragments, resulting in improved accuracy when a combined TMFF is derived for the entire family rather than parameterizing individual TMFFs. The use of crystal structure reference data for the entire molecules in the fitting procedure, known as backfitting in the context of CSP, only results in a marginal improvement of the force field accuracy and may be omitted. The completeness and accuracy requirements for the generated crystal energy landscapes are relaxed to the generation of representative crystal packings and the identification of the most stable crystal structure within a targeted accuracy. The ability to perform only as many ab initio crystal structure optimizations as required to locate the bottom of the crystal energy landscape affords significant cost saving, ranging from a factor of three up to eight, compared to the Full-CSP workflow. Combining this cost and CPU time reduction with parallel execution of the individual CSPs with one molecule per asymmetric unit as part of the Quick-CSP workflow, crystal energy landscapes for all molecules of the family could be obtained within a couple of days. The described gains open the opportunity for CSP applications in early stages of the drug/material design cycle, where crystal packing can be utilized for various key physicochemical property prediction, and the resulting structural insights can be fed into rational molecular design and developability assessment. At the same time, the highly sophisticated, accurate, and modular CSP building blocks can be utilized in other workflows related for instance to conformational scoring and molecular dynamics in the drug/material discovery cycle to guide molecule design and selection.

Computational Methods

All Full-CSP calculations were carried out using the computer program GRACE (version 3.0)[45] on the Rescale cloud computing platform.[60] The Quick-CSP workflow was run on in-house LINUX clusters. Details of the two hardware configurations and the translation from the CPU time consumption on the in-house LINUX cluster to that on Rescale are described in the Supplementary Material.

Ab Initio Methods

Most ab initio calculations in this study were carried out with the Vienna Ab Initio Simulation Package (VASP)[61−63] and the Perdew–Burke–Ernzerhof density functional,[64,65] which was augmented with a dispersion correction by Neumann and Perrin, referred to as PBE-NP.[10] The method is also referred to as DFT-D in this work. VASP was used with a plane wave cutoff energy of 520 eV and a k-point spacing of roughly 0.07 Å–1. Some low-energy crystal structures were further processed at the PBE(0)+MBD+Fvib level of theory using the all-electron FHI-aims (Fritz Haber Institute ab initio molecular simulations) package for the calculation of ab initio energies and forces.[66] Crystal structures are first optimized using the PBE-NP with the light basis set, the hybrid PBE(0) functional,[67] and the many-body dispersion (MBD) model.[12,68−71] In addition, the phonon energy (Fvib) was evaluated in the harmonic approximation with some additional corrections using PBE-NP with the light basis set. In the force field generation procedure, isolated molecule reference data were generated with Turbomole 7.2 Rev. 1.[72,73] The PBE-NP/def2-SV(P) method with geometry counterpoise (gCP) correction[74] for the def2-SV(P) basis set[75] was used to perform coarse gas-phase molecular optimization. Subsequently, the PBE-NP/aug-cc-pVDZ method with gCP for the aug-cc-pVDZ basis set[76] was applied to compute the energies and forces used in the force field parameter fitting.

TMFF

TMFFs in GRACE feature atomic point charges calculated from bond increments, isotropic van der Waals interactions using the Leonard-Jones-9-6 potential, and anharmonic intramolecular bond, angle, inversion as well as torsion terms. For the van der Waals parameters, the full interaction matrix including off-diagonal terms is fitted. The description of Coulomb interactions was improved by the addition of electrostatic dipole and quadrupole moments, defined with respect to local atomic coordinate systems that move with each atom and its covalently bonded neighbors. The functional form of the TMFF is provided in the Supporting Information. A detailed description of the reference data and the fitting process is published elsewhere.[8] Instead of using predefined Force Field Atom Types (FFATs), GRACE constructs on-the-fly, as required, canonical names that describe local chemical environments. The canonical names take into account bond properties, such as bond order and aromaticity, and atomic properties, such as atomic number and formal charge. Primary FFATs consider the central atom and its bonds and are used to index the van der Waals parameters. Secondary FFATs consider the central atom, its bonds, its neighbors, and their associated bonds and are used to index all other parameters. As an example, the force field atom types for molecule (1) are shown in Figure S15 together with Tables S7 and S8. The syntax of the force field atom types is also described in the Supporting Information. Large molecules and families of molecules are cut automatically into three types of smaller, chemically reasonable fragments. When bonds need to be cut, they are terminated with hydrogen atoms or methyl groups. The smallest fragments, called nonbonded fragments, are used to parameterize van der Waals interactions. The fragments of intermediate size, called bonded fragments, define most of the intramolecular and electrostatic force field parameters. The largest fragments, termed torsion fragments, are used to parameterize some torsion terms that are not covered by the bonded fragments. The force field parameterization procedure occurs in three stages, from the smallest to the largest fragments. At each stage, force fields are generated first for the individual fragments with parameters available from a previous stage being kept constant. In the first stage, the individual force field parameterization is executed for all nonbonded fragments and all pairs of nonbonded fragments and stored in a database. The CPU time requirements reported in this work refer to a complete database (i.e., all force field parameterizations for nonbonded fragments and their pairs were fetched from the database). At the second stage, each bonded force field parameterization also includes a small amount of interaction data for the bonded fragment with the nonbonded fragments to achieve a balanced description of intra- and intermolecular interactions. The new force field parameters introduced for the individual fragments are combined by a Monte Carlo genetic algorithm procedure, which removes conflicts between parameters for the same interaction. Finally, a local parameter optimization is carried out in which all force field parameters are fitted to the reference data simultaneously. For the dipole and quadrupole moments, only components that are compatible with the local symmetry of the secondary force field atom types are allowed. The multipole moments are determined as part of the general force field fitting procedure in which all force field parameters are fitted simultaneously to ab initio electrostatic potentials, energies, atomic forces, and cell derivatives. Modest restraints pulling multipole moments to zero are used to alleviate ambiguities related to buried atoms. Atomic polarizability[77,78] is not explicitly considered; however, it is possible that the off-diagonal terms of the van der Waals parameters capture the polarization effects to a certain extent. In this work, force field errors with respect to ab initio reference data are expressed in terms of an energy error per atom, σ. The error per molecule, σmol, is related to the error per atom by Gaussian error propagation (i.e., where N is the number of atoms per molecule).

Full-CSP

To avoid force field mis-parameterization, each CSP started with a short and incomplete crystal structure generation from which a diverse set of crystal structures was selected for a training set and a test set. For both sets, PBE-NP lattice energy minimizations were carried out. The training set was added to the reference data for force field parameterization, and all force field parameters were optimized against all the reference data. This force field parameter optimization is called backfitting, which generates an improved force field. The test set was used to assess the force field accuracy before and after backfitting. If the accuracy change was very large, the backfitting was repeated. CSP was performed with one molecule per asymmetric unit (Z′ = 1) and consisted of three distinct steps. First, a Monte Carlo parallel tempering algorithm was used in conjunction with the TMFF to generate all relevant crystal packings in a certain energy window. The completeness of the crystal structure generation was estimated from the average number of times each crystal structure was generated and the fraction of unique generated structures. The crystal structure generation was stopped when 99% of all crystal structures had been generated. The energy window was determined by a user-defined target energy window (1 kcal/mol) plus n-times (usually n = 4) the force field accuracy. In a second reranking step, force field optimized structures were subjected to ab initio minimization. Since ab initio optimizations of all generated structures would be computationally expensive, statistical correlations were built up between structural similarity, ab initio single point calculations on force field optimized structures, and full ab initio optimizations. These statistical correlations were used to select crystal structures for further ab initio optimizations. The re-ranking always starts with 100 ab initio single point calculations and optimizations to initialize the statistical models. The details of these statistical models are rather intricate and go beyond the scope of this manuscript. The important point is that for each crystal structure, there exists an ab initio energy estimation with known standard deviations. These energy estimations were used to calculate an estimated value for the number of structures in the user-defined target window. The re-ranking was terminated when the number of ab initio optimized structures in the target window divided by the estimated value was more than a certain threshold (typically 99%). Finally, all structures in the target energy window were further optimized with stronger convergence criteria.

Quick-CSP

First, a TMFF was parameterized for a family of chemically related molecules. Then, independent CSPs were executed simultaneously for all compounds of the family. Like in the Full-CSP workflow, backfitting was carried out. It is worth emphasizing that each CSP of the Quick-CSP workflow produced its own backfitted force fields (i.e., it tailors the transferable TMFF more toward the respective target molecule). The user-defined target energy window was 1 kcal/mol. In the crystal structure generation, a 98% convergence was targeted. During the re-ranking, statistical models, as also used in Full-CSP, were initialized with 100 ab initio calculations, but compared to the Full-CSP workflow a different convergence criterion was used. The reranking stopped when the bottom of the energy landscape was determined within a certain tolerance. From the ab initio energies and their corresponding estimates, 1000 random landscapes were generated by considering the error bars of the estimates and an average bottom energy with standard deviation was then computed. The re-ranking was stopped when this standard deviation reached the target tolerance. As an additional condition for convergence, the most stable predicted crystal structure must be ab initio-optimized. For the crystal structures that were not ab initio-optimized, the Quick-CSP workflow reports the estimated energies and the TMFF-optimized crystal structures.

58 in total

1. A test of crystal structure prediction of small organic molecules.

Authors: J P Lommerse; W D Motherwell; H L Ammon; J D Dunitz; A Gavezzotti; D W Hofmann; F J Leusen; W T Mooij; S L Price; B Schweizer; M U Schmidt; P Verwer; D E Williams
Journal: Acta Crystallogr B Date: 2000-08

2. Fragment-based discovery of bromodomain inhibitors part 1: inhibitor binding modes and implications for lead discovery.

Authors: Chun-Wa Chung; Anthony W Dean; James M Woolven; Paul Bamborough
Journal: J Med Chem Date: 2012-01-11 Impact factor: 7.446

3. Effects of Dispersion in Density Functional Based Quantum Mechanical/Molecular Mechanical Calculations on Cytochrome P450 Catalyzed Reactions.

Authors: Richard Lonsdale; Jeremy N Harvey; Adrian J Mulholland
Journal: J Chem Theory Comput Date: 2012-09-04 Impact factor: 6.006

4. COMPASS II: extended coverage for polymer and drug-like molecule databases.

Authors: Huai Sun; Zhao Jin; Chunwei Yang; Reinier L C Akkermans; Struan H Robertson; Neil A Spenley; Simon Miller; Stephen M Todd
Journal: J Mol Model Date: 2016-01-27 Impact factor: 1.810

5. Discovery of N-Ethyl-4-[2-(4-fluoro-2,6-dimethyl-phenoxy)-5-(1-hydroxy-1-methyl-ethyl)phenyl]-6-methyl-7-oxo-1H-pyrrolo[2,3-c]pyridine-2-carboxamide (ABBV-744), a BET Bromodomain Inhibitor with Selectivity for the Second Bromodomain.

Authors: George S Sheppard; Le Wang; Steven D Fidanze; Lisa A Hasvold; Dachun Liu; John K Pratt; Chang H Park; Kenton Longenecker; Wei Qiu; Maricel Torrent; Peter J Kovar; Mai Bui; Emily Faivre; Xiaoli Huang; Xiaoyu Lin; Denise Wilcox; Lu Zhang; Yu Shen; Daniel H Albert; Terrance J Magoc; Ganesh Rajaraman; Warren M Kati; Keith F McDaniel
Journal: J Med Chem Date: 2020-05-07 Impact factor: 7.446

6. Predicting crystal structures of organic compounds.

Authors: Sarah L Price
Journal: Chem Soc Rev Date: 2013-11-22 Impact factor: 54.564

7. A Prolific Solvate Former, Galunisertib, under the Pressure of Crystal Structure Prediction, Produces Ten Diverse Polymorphs.

Authors: Rajni M Bhardwaj; Jennifer A McMahon; Jonas Nyman; Louise S Price; Sumit Konar; Iain D H Oswald; Colin R Pulham; Sarah L Price; Susan M Reutzel-Edens
Journal: J Am Chem Soc Date: 2019-08-21 Impact factor: 15.419

8. Report on the sixth blind test of organic crystal structure prediction methods.

Authors: Anthony M Reilly; Richard I Cooper; Claire S Adjiman; Saswata Bhattacharya; A Daniel Boese; Jan Gerit Brandenburg; Peter J Bygrave; Rita Bylsma; Josh E Campbell; Roberto Car; David H Case; Renu Chadha; Jason C Cole; Katherine Cosburn; Herma M Cuppen; Farren Curtis; Graeme M Day; Robert A DiStasio; Alexander Dzyabchenko; Bouke P van Eijck; Dennis M Elking; Joost A van den Ende; Julio C Facelli; Marta B Ferraro; Laszlo Fusti-Molnar; Christina Anna Gatsiou; Thomas S Gee; René de Gelder; Luca M Ghiringhelli; Hitoshi Goto; Stefan Grimme; Rui Guo; Detlef W M Hofmann; Johannes Hoja; Rebecca K Hylton; Luca Iuzzolino; Wojciech Jankiewicz; Daniël T de Jong; John Kendrick; Niek J J de Klerk; Hsin Yu Ko; Liudmila N Kuleshova; Xiayue Li; Sanjaya Lohani; Frank J J Leusen; Albert M Lund; Jian Lv; Yanming Ma; Noa Marom; Artëm E Masunov; Patrick McCabe; David P McMahon; Hugo Meekes; Michael P Metz; Alston J Misquitta; Sharmarke Mohamed; Bartomeu Monserrat; Richard J Needs; Marcus A Neumann; Jonas Nyman; Shigeaki Obata; Harald Oberhofer; Artem R Oganov; Anita M Orendt; Gabriel I Pagola; Constantinos C Pantelides; Chris J Pickard; Rafal Podeszwa; Louise S Price; Sarah L Price; Angeles Pulido; Murray G Read; Karsten Reuter; Elia Schneider; Christoph Schober; Gregory P Shields; Pawanpreet Singh; Isaac J Sugden; Krzysztof Szalewicz; Christopher R Taylor; Alexandre Tkatchenko; Mark E Tuckerman; Francesca Vacarro; Manolis Vasileiadis; Alvaro Vazquez-Mayagoitia; Leslie Vogt; Yanchao Wang; Rona E Watson; Gilles A de Wijs; Jack Yang; Qiang Zhu; Colin R Groom
Journal: Acta Crystallogr B Struct Sci Cryst Eng Mater Date: 2016-08-01

9. A complete description of thermodynamic stabilities of molecular crystals.

Authors: Venkat Kapil; Edgar A Engel
Journal: Proc Natl Acad Sci U S A Date: 2022-02-08 Impact factor: 11.205

10. Data-efficient machine learning for molecular crystal structure prediction.

Authors: Simon Wengert; Gábor Csányi; Karsten Reuter; Johannes T Margraf
Journal: Chem Sci Date: 2021-02-11 Impact factor: 9.825