Literature DB >> 36237276

A Machine Learning Model of Chemical Shifts for Chemically and Structurally Diverse Molecular Solids.

Manuel Cordova^1,2, Edgar A Engel³, Artur Stefaniuk¹, Federico Paruzzo¹, Albert Hofstetter¹, Michele Ceriotti^2,4, Lyndon Emsley^1,2.

Abstract

Nuclear magnetic resonance (NMR) chemical shifts are a direct probe of local atomic environments and can be used to determine the structure of solid materials. However, the substantial computational cost required to predict accurate chemical shifts is a key bottleneck for NMR crystallography. We recently introduced ShiftML, a machine-learning model of chemical shifts in molecular solids, trained on minimum-energy geometries of materials composed of C, H, N, O, and S that provides rapid chemical shift predictions with density functional theory (DFT) accuracy. Here, we extend the capabilities of ShiftML to predict chemical shifts for both finite temperature structures and more chemically diverse compounds, while retaining the same speed and accuracy. For a benchmark set of 13 molecular solids, we find a root-mean-squared error of 0.47 ppm with respect to experiment for 1H shift predictions (compared to 0.35 ppm for explicit DFT calculations), while reducing the computational cost by over four orders of magnitude.

Entities: Chemical

Year: 2022 PMID： 36237276 PMCID： PMC9549463 DOI： 10.1021/acs.jpcc.2c03854

Source DB: PubMed Journal: J Phys Chem C Nanomater Interfaces ISSN： 1932-7447 Impact factor: 4.177

Introduction

The atomic-level structures of solid materials are of high interest across many areas of chemistry. While X-ray diffraction (XRD) is the most well-established method for determining the structure of crystalline compounds, many materials lack the long-range order required to perform single-crystal XRD. Solid-state nuclear magnetic resonance (NMR) directly probes local atomic environments and so does not require a long-range order, making it a popular method for studying the structure of microcrystalline and disordered solids. Notably, combining solid-state NMR experiments with chemical shift calculations, a process typically referred to as NMR crystallography, allows determination of a wide range of structures,[1−4] from pharmaceuticals[5−7] to capping groups on nanoparticle surfaces[8] to the spacer layers in two-dimensional hybrid perovskite materials.[9] Striking recent examples include the determination of the structures of drug molecules in pharmaceutical formulations,[10,11] and the precise determination of the structure of active sites in enzyme reaction pathways[12,13] and of the disordered structure of an amorphous drug.[14] A key step in NMR crystallography is the computation of chemical shifts for candidate structures. Here, high accuracy is required in order to capture the effect of the particular conformation and packing of the molecular building blocks on the chemical shifts and to allow the identification of the correct structure among a set of potential candidates based on a comparison between computed and measured chemical shifts.[15−19] With the current best calculations, the root-mean-square error (RMSE) between the experiment and calculation can be as low as 1.5 ppm for 13C and 0.2 ppm for 1H.[2,18,20−22] Plane-wave density functional theory (DFT) methods using the gauge including projected augmented wave (GIPAW) formalism[23−25] generally offer a good tradeoff between accuracy and computational cost for computing chemical shifts in small periodic structures. Consequently, DFT has been widely used in NMR crystallography to determine the structure of powdered solids.[1−3,26] However, the computational cost of DFT methods severely limits the size of systems accessible, preventing the study of large or disordered systems. In recent years, machine-learning models have proven a powerful tool for supplementing and bypassing intensive quantum-mechanical calculations of molecular and atomic properties. In particular, NMR chemical shifts have been modeled using kernel methods[27−29] and neural networks.[30−35] Such approaches have proven able to yield chemical shifts to within DFT accuracy at a fraction of the computational cost, allowing applications to large ensembles of large systems. We have previously introduced ShiftML,[15,36] a machine-learning model of chemical shifts trained on GIPAW DFT data for 3,546 structures from the Cambridge structural database (CSD),[37] allowing fast and accurate predictions of chemical shifts for any molecular solid containing C, H, N, O, and S atoms. We have further demonstrated how this model can be used to enable new approaches in NMR crystallography. It has enabled structure determination of amorphous materials through chemical shift computations of molecular dynamics (MD) snapshots containing thousands of atoms[14] and enabled accounting for the effects of thermal and quantum-mechanical nuclear motion on the experimentally observable chemical shifts of crystalline solids based on path-integral molecular dynamics (PIMD) simulations.[19] Moreover, it has allowed on-the-fly chemical shift calculations in a chemical shift-driven direct structure determination protocol.[38] ShiftML also opens up the possibility to transform databases of crystal structures into databases of chemical shifts, which we have used, for example, to build a Bayesian framework to assign the NMR spectra of organic crystals.[39] Although ShiftML constitutes a powerful method for computing chemical shifts with high accuracy and at a low computational cost, two important limitations prevent its more widespread use. First, the model is currently limited to compounds containing only C, H, N, O, and S atoms. While these elements are among the most prevalent in the CSD, numerous organic crystals contain elements outside of this set, leaving them beyond the scope of ShiftML. Second, the training set of ShiftML only contains structures that were geometry-optimized using DFT, resulting in lower accuracy for predictions on finite temperature or distorted structures, or for structures that are geometry-optimized using other methods (such as semi-empirical electronic structure calculations[40,41]). Here, we present ShiftML2, an updated version of ShiftML, trained on GIPAW DFT chemical shifts for an extended set of over 14,000 structures containing any of 12 common elements (H, C, N, O, S, F, P, Cl, Na, Ca, Mg, and K) and composed of roughly equal amounts of relaxed and thermally perturbed structures of crystals extracted from the CSD. ShiftML2 shows slight improvements over the previous versions of ShiftML on DFT-relaxed structures (1H RMSE of 0.47 ppm against 0.51 ppm for the ShiftML model described in ref (15), which we refer to as ShiftML1 here). More importantly, it effectively retains this accuracy for distorted structures, for which the performance of ShiftML1 degrades dramatically, while additionally allowing chemical shift computations for more chemically diverse structures.

Methods

Configurational Sampling

In order to construct suitable reference data for an accurate and robust ShiftML2 model, we first extracted all crystal structures from the CSD with unit cells containing no more than 200 atoms (for which high-throughput first-principles calculations are comparatively affordable) and including H and C, but no additional elements other than N, O, S, F, P, Cl, Na, Ca, Mg, and K. We note that we initially allowed the presence of Br and I atoms but later discarded the structures containing these atoms due to the need for relativistic corrections to obtain accurate shieldings for atoms in their vicinity. After extracting a random selection of 1,000 molecular crystals as a test set, the selection of the training set was performed by farthest point sampling (FPS)[42] of the remaining 140,373 structures based on the kernel-induced pairwise distances.Here, the kernel function k( · , · ) = (X · X)2 measures the similarity of the average smooth overlap of atomic positions (SOAP) power spectra[43] of the constituent atoms within a crystal structure, X, computed using the hyperparameters specified in Supplementary Table S1. The first 10,000 FPS-sorted (most structurally diverse) structures were selected as the training set. All training and test structures were relaxed using DFT-fixed cell geometry optimizations using the Quantum ESPRESSO (QE) electronic structure package[44,45] with the PBE density functional,[46] a Grimme D2 dispersion correction,[47,48] wavefunction and charge density energy cut-offs of 60 and 240 Ry, respectively, and ultrasoft pseudopotentials with GIPAW reconstruction.[49,50] To render this computation efficient, only the Gamma point was accounted for. Further details may be found in the SI. Subsequently, short constant-volume MD simulations of 500 fs were performed using i-PI[51,52] to drive the dynamics, and the above QE setup to evaluate energies and forces. We used a timestep of 1 fs and a Generalized Langevin equation thermostat[53,54] to equilibrate the system at 300 K. Finally, we collected two structures for each molecular crystal in the training and test sets, the relaxed structure, and a thermalized MD structure (the last in the trajectory) and proceeded to compute the associated GIPAW-DFT chemical shieldings for all 22,000 resulting structures.

GIPAW-DFT Chemical Shieldings

The GIPAW NMR calculations were performed using the QE code with the same DFT parameters as for the structure relaxation above but using refined plane wave and charge density energy cut-offs of 100 and 400 Ry, respectively, a Monkhorst–Pack k-point grid[55] with a maximum spacing of 0.06 Å–1, and the ultrasoft pseudopotentials with GIPAW reconstruction from the USSP pseudopotential database v1.0.0. Finally, all structures were discarded, which displayed at least one outlier shift (defined as being outside the range of chemical shifts between the 1st and 99th percentile of all shifts of that element by at least 1.5 times that range), or where the calculation failed. Overall, 2650 structures were discarded because the self-consistent loop did not reach the high level of convergence needed for reliable GIPAW calculations, we removed 3313 additional structures containing Br or I atoms, and we discarded 24 structures that displayed outlier shieldings. This led to final training and test sets containing 14,254 and 1759 structures, respectively.

Machine Learning Model

We use kernel ridge regression (KRR)[56] to predict the isotropic chemical shielding of an atom based on its local atomic environment as follows:where X and X are symmetry-adapted descriptors, which encode the local atomic environment around the atom of interest and those in the training set, respectively, and w denotes the regression weight associated with training sample i. k(·,·) is the kernel function that defines the similarity between two atomic environments. Here, we measure the similarity between two environments as the scalar product between the vectors corresponding to their descriptor, raised to a power ζ. Training a KRR model involves determining the weights w such that eq is best satisfied for the training data, with an additional regularization term that reduces the magnitude of regression weights. Further information is available in the SI.

Uncertainty Estimation

Uncertainty estimation is performed using a resampling approach to generate a committee of M = 32 KRR models,[57] trained on random two-fold splits of the training data. The final prediction for a sample i in the test set, σ̂, is given by the mean of the prediction for each model, and the estimated uncertainty is defined as the standard deviation s of the prediction of each model, rescaled by a factor α given by[57]where Ntest is the size of the test set, and the sum runs over all test samples.

Local Atomic Environment Descriptor

We describe local atomic environments using smooth overlap of atomic positions (SOAP) power spectra[43] as implemented in librascal.[58] We use a sparse implementation of the SOAP descriptors, making use of the sparsity of elements in local environments around individual atoms. The relevant hyperparameters were optimized by five-fold cross-validation performed on the 1H environments of a subset of 1000 training structures, selected at random other than including all training structures containing Na, Ca, Mg, or K. The latter ensures that these elements are represented during hyperparameter optimization, despite their low abundance in the training data. The structures selected for hyperparameter optimization contain a total of 27,802 1H environments. In each cross-validation fold, the training data were partitioned into three equal parts, and a KRR model was trained on each part. This was done in order to reduce the computational resources required to train the models for each split. The selection of descriptor parameter values was based on the RMSE obtained on the validation data. The explored and selected hyperparameter values can be found in the SI. We note that ref (15) found almost identical hyperparameters to be optimal for H, C, N, O, and S through independent optimizations for the different elements. We therefore apply the hyperparameters optimized for 1H to the other elements without further optimization, except for the optimal radial basis,[59] which was constructed individually with the complete final training data for each element.

Farthest Point Sampling of Training Environments

The training data were sorted using FPS[42] based on distances between pairs of environments X and X defined as in eq . This serves two purposes: first, it permits the removal of duplicate environments arising from, for example, equivalent atomic sites related by the crystal symmetries in relaxed structures. Second, it identifies the most structurally diverse set of training environments. To eliminate redundant environments and distill a computationally manageable number of informative environments, we split the training data into randomly selected batches of 50,000 samples (atomic environments) (because FPS is not computationally feasible on the whole set). FPS was then used on each batch and stopped once the minimum distance between FPS-selected samples reached 10–2 for 1H and 10–3 for all other elements. The FPS selection was then repeated after shuffling the environments, recombining them into different batches of 50,000 samples and increasing the distance threshold in each batch by steps of 10–3, until a total of fewer than 100,000 environments remained.

Outlier Detection and Model Training

When required, the FPS-selected training environments were randomly selected to a maximum of 216 samples in order to limit the size of the kernel required to predict chemical shifts. Then, five-fold cross-validation was performed. For each fold, a committee of eight KRR models was trained. To this end, the training split was further subsampled, training each KRR model on a random selection of half of the training split for a given fold. For each fold, the predictions and associated uncertainty estimates for the validation split were used to identify and discard outlier environments. In practice, environments were discarded if the residual error exceeded both the standard deviation of the shifts in the training data and twice the associated uncertainty estimate. After removing these outliers, 32 KRR models were trained on randomly selected environments making up half of the remaining curated data to construct the final model of shifts. The rescaling factor α for uncertainty estimation was obtained from the predictions on the test set.

Atom Type Identification

The different atom types, defined here as hybridization and formal charge, in the training and test structures were identified using the RDKit[60] Python package on the asymmetric unit of the crystals extracted using the CSD Python API.[37] The structures where RDKit failed to identify bonds and/or formal charges were discarded from the atom-type analysis. Carbon atoms identified as charged were set to a neutral charge, as well as nitrogen atoms identified with a negative charge and oxygen atoms identified as positively charged. This was done upon visual inspection of a subset of crystal structures displaying such unusual atom types, confirming that such atom types were incorrectly determined by the package. In total, atomic types of 6,960 out of the 10,593 final training structures and 1,443 out of the 1,759 test structures were identified.

Comparison with Experimental Chemical Shifts

To further test the resulting models, we performed plane-wave DFT calculations for 13 structures with assigned experimental chemical shifts with the same level of theory as for the computation of DFT shieldings of the training and test sets. Comparison between computed (or predicted) shieldings and experimental chemical shifts was performed by linear regression of the shieldings computed with the corresponding experimental shifts, using average values of chemically equivalent shifts and resolving any assignment ambiguity by selecting the assignment resulting in the minimum RMSE.

Results and Discussion

Training Set Selection and Model Training

Because of the lack of large databases of experimental chemical shifts in molecular solids, we trained the model on shielding values computed by DFT, as was done previously for ShiftML1.[15,36] This ensures both consistency in the training data as well as the ability to perform high-throughput computations to obtain a substantial amount of training data in reasonable time. The training structures were chosen to be as diverse as possible through FPS. Because computed shieldings are related to chemical shifts by a simple linear relationship, we use the two terms interchangeably. High quality of the training data is key to producing an accurate machine learning model. In addition, the kernel model framework used here has a linear time and memory complexity with respect to the training set size for inference. It is thus important to reduce the amount of training data while retaining diverse atomic environments and removing outliers to obtain both fast and accurate predictions of chemical shifts. To this end, we performed an iterative, batched FPS of the chemical environments, as described in the Methods section. Figure A shows the first and last FPS iterations on typical batches. The significant drop in minimum distance between FPS-selected samples after selecting 30,000 of the 50,000 environments in an initial batch corresponds to symmetrically equivalent atomic sites in relaxed crystal structures. After gathering the FPS-selected environments from all batches after the final iteration, we obtained 67,535 1H environments.

Figure 1

(A) First (blue) and last (red) FPS selection step for a batch of up to 50,000 1H environments. The blue and red dashed lines show the threshold for the minimum distance between FPS-selected samples set to select environments in the first and last FPS selection steps, respectively. (B) Comparison of DFT-computed 1H shieldings and predictions for the training environments obtained through 5-fold cross-validation. (C) Comparison of the absolute error of the prediction and predicted uncertainty for the training environments selected by FPS. The red lines indicate the criteria used to discard outliers (red points in B and C). Figure B, C highlights the outliers among the selected 1H training environments identified following the scheme described in the Methods section. In total, 145 1H environments were considered as outliers because they exhibit both relatively large prediction error and comparatively small prediction uncertainty (red points and lines in Figure C). Among the final 1H training environments, 86% were from distorted structures and 14% from relaxed structures. This highlights the importance of the presence of distorted structures in the training data in order to obtain a uniform sampling of the space of possible atomic environments. The final model was constructed by training 32 models on random half splits of the remaining training environments. Prediction uncertainties were estimated as the rescaled standard deviation of the 32 predictions to fit the error distribution, as described in ref (61).

Model Evaluation and Comparison to ShiftML1

Figure shows correlation plots between predicted and DFT-computed 1H shieldings in the test set as well as the associated distribution of prediction errors. We obtain an RMSE of 0.52 ppm and an R2 coefficient of 0.97, with 95% of the predictions having an error below 1 ppm. The RMSE was found to be slightly lower in relaxed structures (0.48 ppm) compared to MD structures (0.53 ppm). The presence of sodium or magnesium in crystal structures was found to raise both the prediction error (Figure C) and, to a lesser extent, uncertainty (Figure D). We attribute that to the relatively low number of structures containing these elements in the training set (226 structures containing Na, 65 containing Mg), coupled to the high charge density of these ions which induces a large change in the shielding on neighboring atomic sites. Although calcium and potassium are not significantly better represented in the training set (145 structures containing Ca, 176 containing K), their reduced charge densities compared to Mg and Na induce lower perturbations of the shielding of neighboring atomic sites, which are better captured by the kernel.

Figure 2

(A) Comparison of DFT-computed 1H shieldings and ShiftML2 predictions on the test set. (B) Histogram of the of prediction error between ShiftML2 predictions and DFT-computed shieldings for 1H environments. Comparison of 1H (C) chemical shift RMSE and (D) average prediction uncertainties on test structures containing (blue) or lacking (red) a given element. Comparison of DFT-computed 1H shieldings and ShiftML2 predictions on (E) relaxed and (F) MD structures in the test set. Black lines in (A, E, and F) show perfect correlations. We observe a reduced prediction uncertainty and error for shieldings above 20 ppm (see Supplementary Figure S8). This behavior is expected considering that 90% of the training data have DFT shieldings computed above 20 ppm, which corresponds to typical chemical shifts of aliphatic and aromatic CH protons (<10 ppm). The reduced density of training data at lower shieldings (corresponding to higher chemical shifts) results in increased error and uncertainty of the predictions. To compare ShiftML1 and ShiftML2, we apply both models to the ShiftML1 test set, as well as all structures from the current test set which contain exclusively H, C, N, O, and S atoms (i.e., those for which ShiftML1 is applicable). Figure shows the 1H shift predictions of the two models for the ShiftML1 test set (Figure A, B) and for the relaxed (Figure C,D) and finite temperature (Figure E,F) structures from the ShiftML2 test set, which only contain H, C, N, O, and S atoms. Table summarizes the results obtained by both models. There are two striking conclusions that are illustrated by the figure and table. First, overall, ShiftML2 displays slight improvements over ShiftML1 for relaxed structures (0.47 ppm RMSE compared to 0.49 ppm on the ShiftML1 test set, and 0.47 ppm RMSE compared to 0.51 ppm on relaxed structures from the ShiftML2 test set), indicating that the increase in the number of training environments was sufficient to avoid deterioration of the accuracy despite the greater chemical diversity. Second, ShiftML2 is substantially more accurate for finite temperature structures (0.53 ppm RMSE for ShiftML2 compared to 0.98 ppm for ShiftML1), highlighting the greater robustness of a model trained on finite temperature structures when predicting atomic properties for distorted structures. To confirm the robustness of ShiftML2 toward distorted structures, we evaluated the error against DFT-computed 1H shieldings for up to 50 snapshots taken every 100 fs from MD simulations of the crystal structures of cocaine, AZD5718 and form 4 of AZD8329. We found that the average RMSEs along the MD trajectories were only slightly above the RMSEs obtained for the relaxed structures (0.58 ppm against 0.55 ppm RMSE for AZD5718, 0.50 ppm against 0.45 ppm RMSE for form 4 of AZD8329, and 0.49 ppm against 0.42 ppm RMSE for cocaine, see Supplementary Figure S9).

Figure 3

Table 1

Chemical Shift Root-Mean-Square Error (RMSE), Mean Absolute Error (MAE), and R2 Coefficient of ShiftML1 and ShiftML2 Compared to DFT-Computed Shieldingsa

test set	RMSE [ppm]	MAE [ppm]	R²
ShiftML1	0.48/0.46	0.37/0.35	0.98/0.98
ShiftML2, relaxed only	0.51/0.47	0.38/0.35	0.98/0.98
ShiftML2, MD only	0.98/0.53	0.71/0.40	0.91/0.97
ShiftML2, all	0.78/0.50	0.54/0.38	0.94/0.98

The values are given for ShiftML1 and ShiftML2, separated by a slash.

Comparison of DFT-computed 1H shieldings and predictions using ShiftML1 (A, C, E) or ShiftML2 (B, D, F) on (A, B) the ShiftML1 test set, (C, D) relaxed structures containing only H, C, N, O, and S in the ShiftML2 test set, and (E, F) MD structures containing only H, C, N, O, and S in the ShiftML2 test set. Black lines show perfect correlations. The values are given for ShiftML1 and ShiftML2, separated by a slash. This is a key improvement compared to the previous ShiftML version because it allows accurate predictions of chemical shifts beyond relaxed structures and yields a better description of shifts in (PI)MD snapshots, and for intermediate structures during structural optimization. The ability of the model to generalize to distorted structures is key in many applications of NMR crystallography. In particular, it allows accurate computation of chemical shifts on structures that are geometry optimized with different levels of theories (e.g., force fields or DFTB), which is important for the accurate description of shifts in MD simulations of materials.[14] It also enables more confident on-the-fly computations of chemical shifts of intermediate structures during the optimization of crystal structures in chemical-shift driven structure determination protocols, resulting in a potentially more powerful driving force toward the experimental structure.[38] Figure shows the prediction error for different types of protons in the test set. The two most common proton types H–C(sp3) and H–C(aromatic), making up 90% of the test set, yielded chemical shift RMSEs below 0.5 ppm. All other proton types displayed chemical shift RMSEs below 0.9 ppm, with the exception of alkyne protons, for which the RMSE was found to be 5.3 ppm. Such a high error is explained by the presence of only two alkyne protons identified in the final training data. Interestingly, we find that protons attached to nitrogens in charged groups display a lower error compared to their neutral counterparts. Molecular salts were found to display comparable shift RMSEs to neutral compounds. H-bonded protons yielded a chemical shift RMSE of 0.79 ppm.

Figure 4

Chemical shift RMSE for different types of (A) 1H, (B) 13C, (C) 31P, (D) 15N, (E) 17O, (F) 33S, and (G) 35Cl in the test set. The number of environments (or structures) in the test set contributing to each bar is indicated next to it.

Experimental Benchmark Set and Polymorphs

We evaluate the accuracy of the model with respect to experimental 1H chemical shifts using a benchmark set of 13 molecular crystals made up of cocaine, form 4 of AZD8329, theophylline, uracil, naproxen, the co-crystal of 3,5-dimethylimidazole, 4,5-dimethylimidazole, AZD5718, furosemide, flutamide, the co-crystal of indomethacin and nicotinamide, flufenamic acid, the potassium salt of penicillin G, and phenylphosphonic acid.[14,26,36,62−65]Figure A compares the predicted and experimentally measured shifts for this set. We obtain a RMSE of 0.47 ppm, compared to 0.35 ppm using DFT. For reference, the RMSE obtained on the experimental benchmark set for ShiftML1 (containing the six first molecular solids mentioned previously) is 0.41 ppm for ShiftML2, compared to 0.39 ppm for ShiftML1 and 0.36 for DFT. This further highlights that the accuracy of ShiftML1 for relaxed structures has been retained by ShiftML2, while extending the capabilities of the model to predict shifts for more chemically and structurally diverse structures. Notably, within the limits of the small experimental set used here, the accuracy against experimental shifts is found to decrease when including structures containing F, Cl, P, or K atoms, while DFT remained at the same level of accuracy. Because no such deterioration is observed for the structures in the test set (see Figure C), we attribute this to the chemical environments in the experimental set, which are not well represented in the training data.

Figure 5

(A) Comparison between predicted and experimental 1H shifts for 13 molecular solids. Black line shows perfect correlation. Chemical shift RMSE obtained by ShiftML2 (blue) and DFT (red) against experimental shifts for candidate structures of (B) cocaine, (C) AZD8329 form 4, and (D) AZD5718. The correct crystal structure is indicated by the gray zone. The black horizontal lines indicate the expected RMSE between ShiftML2 predictions and experimental shifts (0.47 ppm). Computing DFT chemical shifts for the 13 structures required over 56 CPU days, while ShiftML2 required less than 20 CPU minutes to predict the shifts of all atoms in the structures considered. If only 1H chemical shifts are required, this time is reduced to less than four CPU minutes, which corresponds to a more than 24,000-fold speedup compared to DFT shift computation. The ability to determine the correct structure from among a set of candidates based on comparison between experimental and computed shifts is key to NMR crystallography. Figure B–D shows the RMSE between experimental and predicted 1H shifts for different sets of candidate structures for cocaine, form 4 of AZD8329, and AZD5718. The correct candidates systematically yielded a chemical shift RMSE below 0.6 ppm and corresponded to the lowest RMSE among the sets of candidates for form 4 of AZD8329 and AZD5718 and to the second lowest RMSE for cocaine.

Models for Other Nuclei

In addition to 1H, we constructed models for all the other nuclei present in the training data. Figure , Supplementary Figure S7, and Table compare the resulting predictions for the nuclei beyond 1H to GIPAW DFT shieldings. We note that although we refer to a particular nucleus (e.g., 15N), the isotropic chemical shift of all NMR-active isotopes of a particular element can be predicted with the same accuracy, by adapting the offset (and slope) used to convert computed shieldings into chemical shifts. We obtain strong correlations (R2 > 0.95) for 13C, 15N, 17O, 19F, and 35Cl. This indicates that ShiftML2 can accurately predict chemical shifts for these elements, although the absolute error is higher than that for 1H because of the larger chemical shift ranges for these nuclei (see Table ). The lower number of training environments for 31P, 23Na, 43Ca, 25Mg, and 39K was found to lead to lower correlation with DFT-computed shifts. While we still provide models for these nuclei, we acknowledge that more accurate models based on more extensive training data would be required to obtain more accurate predictions for these elements. We reiterate that the main purpose of including these elements in the training data was to allow prediction of 1H, 13C, or 15N chemical shifts for structures containing such elements. Detailed ShiftML2 prediction accuracies for different types of 13C, 15N, 17O, 31P, 33S, and 35Cl nuclei are shown in Figure B–G. As for 1H, we observe a loss of accuracy for sp-hybridized 13C and 15N. The other nuclei (19F, 23Na, 43Ca, 25Mg, and 39K) each displayed a unique atomic type across the test set.

Figure 6

Comparison of DFT-computed and predicted (A)13C, (B) 15N, (C) 19F, and (D) 35Cl chemical shifts in the test set. Black lines show perfect correlation.

Table 2

Training and Test Size, Chemical Shift RMSE, MAE, and R2 Coefficient for ShiftML2 Models Trained on Nuclei beyond 1H

nucleus	training set size	test set size	RMSE [ppm]	MAE [ppm]	R²
¹³C	65,498	60,406	4.53	3.12	0.99
¹⁵N	65,506	6514	15.02	9.99	0.98
¹⁷O	65,488	11,330	23.18	16.21	0.98
¹⁹F	23,958	865	9.70	6.85	0.97
³³S	18,509	1470	57.53	35.12	0.87
³¹P	5337	235	32.61	17.64	0.70
³⁵Cl	15,780	757	23.58	17.02	0.97
²³Na	728	14	5.77	4.58	0.57
⁴³Ca	386	8	13.01	10.77	0.99
²⁵Mg	186	10	12.27	8.21	0.94
³⁹K	632	9	9.33	7.07	0.39

Comparison of DFT-computed and predicted (A)13C, (B) 15N, (C) 19F, and (D) 35Cl chemical shifts in the test set. Black lines show perfect correlation.

Conclusions

We have presented a machine learning model of chemical shifts that improves on our previously published model[16] in two key ways. First, the chemical diversity covered by the model has been extended from 5 to 12 elements, meaning that shifts for a much larger space of compounds can now be accessed. Second, finite temperature structures have been included in the training data, allowing reliable chemical shift predictions for distorted structures. Compared to GIPAW DFT, we obtain R2 correlation coefficients above 0.95 for 1H, 13C, 15N, 17O, 19F, and 35Cl shifts, and a chemical shift RMSE below 0.5 ppm for 1H. The model is able to massively accelerate the computation of shifts in molecular solids while retaining DFT-level accuracy with respect to experimental shifts for 1H (0.47 ppm RMSE). Importantly, the cases of cocaine, form 4 of AZD8329, and AZD5718 demonstrate that ShiftML2 permits fast and reliable NMR crystal structure determination for complex organic molecular crystals. The capacity to calculate shifts for distorted structures is important for two reasons. First, it allows reliable shifts to be calculated for structures that are not geometry-optimized using DFT, such as structures optimized using (semi-)empirical approaches such as DFTB and for structures from MD simulations. Second, it means that shifts calculated for structures generated in a simulated annealing structure determination protocol[38] will be accurate even when the trial structure is not in an energy minimum, potentially providing a much more efficient driving force toward the correct structures, and this will be the subject of future studies. The model presented here scales linearly with respect to the number of local atomic environments in a structure of interest, making shifts for large ensembles of large structures accessible. The new model will thus accelerate NMR crystallography by allowing large-scale computations for candidate structures, either from MD trajectories or in direct optimization methods. The models are freely available on https://dx.doi.org/10.5281/zenodo.7097427.

47 in total

1. Assigning powders to crystal structures by high-resolution (1)H-(1)H double quantum and (1)H-(13)C J-INEPT solid-state NMR spectroscopy and first principles computation. A case study of penicillin G.

Authors: Nicolas Mifsud; Bénédicte Elena; Chris J Pickard; Anne Lesage; Lyndon Emsley
Journal: Phys Chem Chem Phys Date: 2006-06-20 Impact factor: 3.676

2. Semiempirical GGA-type density functional constructed with a long-range dispersion correction.

Authors: Stefan Grimme
Journal: J Comput Chem Date: 2006-11-30 Impact factor: 3.376

3. QUANTUM ESPRESSO: a modular and open-source software project for quantum simulations of materials.

Authors: Paolo Giannozzi; Stefano Baroni; Nicola Bonini; Matteo Calandra; Roberto Car; Carlo Cavazzoni; Davide Ceresoli; Guido L Chiarotti; Matteo Cococcioni; Ismaila Dabo; Andrea Dal Corso; Stefano de Gironcoli; Stefano Fabris; Guido Fratesi; Ralph Gebauer; Uwe Gerstmann; Christos Gougoussis; Anton Kokalj; Michele Lazzeri; Layla Martin-Samos; Nicola Marzari; Francesco Mauri; Riccardo Mazzarello; Stefano Paolini; Alfredo Pasquarello; Lorenzo Paulatto; Carlo Sbraccia; Sandro Scandolo; Gabriele Sclauzero; Ari P Seitsonen; Alexander Smogunov; Paolo Umari; Renata M Wentzcovitch
Journal: J Phys Condens Matter Date: 2009-09-01 Impact factor: 2.333

4. Identifying the intermolecular hydrogen-bonding supramolecular synthons in an indomethacin-nicotinamide cocrystal by solid-state NMR.

Authors: Keisuke Maruyoshi; Dinu Iuga; Oleg N Antzutkin; Amjad Alhalaweh; Sitaram P Velaga; Steven P Brown
Journal: Chem Commun (Camb) Date: 2012-10-01 Impact factor: 6.222

5. A Bayesian approach to NMR crystal structure determination.

Authors: Edgar A Engel; Andrea Anelli; Albert Hofstetter; Federico Paruzzo; Lyndon Emsley; Michele Ceriotti
Journal: Phys Chem Chem Phys Date: 2019-10-21 Impact factor: 3.676

6. Elucidating an Amorphous Form Stabilization Mechanism for Tenapanor Hydrochloride: Crystal Structure Analysis Using X-ray Diffraction, NMR Crystallography, and Molecular Modeling.

Authors: Sten O Nilsson Lill; Cory M Widdifield; Anna Pettersen; Anna Svensk Ankarberg; Maria Lindkvist; Peter Aldred; Sandra Gracin; Norman Shankland; Kenneth Shankland; Staffan Schantz; Lyndon Emsley
Journal: Mol Pharm Date: 2018-03-12 Impact factor: 4.939

7. Multiresolution 3D-DenseNet for Chemical Shift Prediction in NMR Crystallography.

Authors: Shuai Liu; Jie Li; Kochise C Bennett; Brad Ganoe; Tim Stauch; Martin Head-Gordon; Alexander Hexemer; Daniela Ushizima; Teresa Head-Gordon
Journal: J Phys Chem Lett Date: 2019-07-30 Impact factor: 6.475

8. Advanced capabilities for materials modelling with Quantum ESPRESSO.

Authors: P Giannozzi; O Andreussi; T Brumme; O Bunau; M Buongiorno Nardelli; M Calandra; R Car; C Cavazzoni; D Ceresoli; M Cococcioni; N Colonna; I Carnimeo; A Dal Corso; S de Gironcoli; P Delugas; R A DiStasio; A Ferretti; A Floris; G Fratesi; G Fugallo; R Gebauer; U Gerstmann; F Giustino; T Gorni; J Jia; M Kawamura; H-Y Ko; A Kokalj; E Küçükbenli; M Lazzeri; M Marsili; N Marzari; F Mauri; N L Nguyen; H-V Nguyen; A Otero-de-la-Roza; L Paulatto; S Poncé; D Rocca; R Sabatini; B Santra; M Schlipf; A P Seitsonen; A Smogunov; I Timrov; T Thonhauser; P Umari; N Vast; X Wu; S Baroni
Journal: J Phys Condens Matter Date: 2017-10-24 Impact factor: 2.333

9. Polymorphic Forms of Valinomycin Investigated by NMR Crystallography.

Authors: Jiří Czernek; Jiří Brus
Journal: Int J Mol Sci Date: 2020-07-11 Impact factor: 5.923