Literature DB >> 34826232

Bayesian probabilistic assignment of chemical shifts in organic solids.

Manuel Cordova^1,2, Martins Balodis¹, Bruno Simões de Almeida¹, Michele Ceriotti^2,3, Lyndon Emsley^1,2.

Abstract

A prerequisite for NMR studies of organic materials is assigning each experimental chemical shift to a set of geometrically equivalent nuclei. Obtaining the assignment experimentally can be challenging and typically requires time-consuming multidimensional correlation experiments. An alternative solution for determining the assignment involves statistical analysis of experimental chemical shift databases, but no such database exists for molecular solids. Here, by combining the Cambridge Structural Database with a machine learning model of chemical shifts, we construct a statistical basis for probabilistic chemical shift assignment of organic crystals by calculating shifts for more than 200,000 compounds, enabling the probabilistic assignment of organic crystals directly from their two-dimensional chemical structure. The approach is demonstrated with the 13C and 1H assignment of 11 molecular solids with experimental shifts and benchmarked on 100 crystals using predicted shifts. The correct assignment was found among the two most probable assignments in more than 80% of cases.

Entities: Chemical

Year: 2021 PMID： 34826232 PMCID： PMC8626066 DOI： 10.1126/sciadv.abk2341

Source DB: PubMed Journal: Sci Adv ISSN： 2375-2548 Impact factor: 14.136

INTRODUCTION

Chemical shift assignment is the starting point of any detailed nuclear magnetic resonance (NMR) study (). In organic solids at natural isotopic abundance, this is still a laborious and often challenging process. In particular, 13C resonance assignment typically requires the use of the through-bond 13C-13C double-quantum/single-quantum correlation (INADEQUATE) experiment (, ). For materials for which the crystal structure is already known, the assignment can be determined at least partially by comparing the experimental chemical shifts with shifts computed using density functional theory (DFT) in the gauge-invariant projector-augmented wave method (, ) or fragment-based methods (, ). However, in most applications, the full structure is not known and, in particular, de novo chemical shift–based NMR crystallography relies on chemical shift assignment to confidently identify the crystal structure from among a set of candidates generated, for example, through crystal structure prediction (–). An illustrative example of the assignment problem is shown in Fig. 1, with the 13C cross-polarization (CP) magic angle spinning (MAS) spectrum of ritonavir. The spectrum contains 32 peaks, corresponding to the 37 magnetically inequivalent carbon atoms in the molecule, and assigning the peaks to the atoms is not at all obvious. Several straightforward experimental methods can be used to simplify the assignment process in organic solids. Heteronuclear correlation (HETCOR) experiments (, ) provide pairwise 1H-X (where X = 13C, 15N, etc.) correlations and allow the separation of NMR signals along two dimensions, which simplifies the identification of the bonding environment associated with the observed peaks. In addition, spectral editing (–) can be used to identify the carbon multiplicity (i.e., the number of bonded protons) associated to each observed peak, allowing the reduction of the assignment problem to subsets of peaks and corresponding atomic sites.

Fig. 1.

13C CPMAS spectrum of ritonavir.

Molecular structure of ritonavir and the 13C CPMAS spectrum recorded for a powder sample of ritonavir form II. ppm, parts per million.

13C CPMAS spectrum of ritonavir.

Molecular structure of ritonavir and the 13C CPMAS spectrum recorded for a powder sample of ritonavir form II. ppm, parts per million. Chemical shift assignment of biomolecules such as proteins and RNA can be obtained directly from their sequence through statistical analysis of chemical shifts (–). In addition, simultaneous chemical shift assignment and structure determination can be obtained from matching atomic contacts to nuclear Overhauser effect experiments (). These approaches rely on the existence of a large database of experimental chemical shifts and molecular structures, such as the Biological Magnetic Resonance Data Bank (BMRB) () and Protein Data Bank (), respectively. For example, the BMRB contains more than 9.4 million instances of experimental chemical shifts for 279 types of proton, carbon, and nitrogen sites in the 20 amino acids that make up proteins, with e.g., more than 89,000 instances of the NH shift in alanine alone. Such large and diverse chemical shift databases however do not exist, to our knowledge, for organic crystals. Recently, ShiftML, a machine learning model able to predict chemical shifts in molecular solids, was introduced (, ). This model allows chemical shifts to be obtained directly from the structure of a molecular solid, bypassing the need for an optimized wave function and making the shifts of large ensembles of large structures accessible with DFT accuracy (, ). Here, we show how combining this model with a database of three-dimensional (3D) structures such as the Cambridge Structural Database (CSD) () enables the probabilistic assignment of organic crystals using chemical shift statistics without any knowledge of the 3D structure. We generate a large database of chemical shifts for organic crystals by predicting shifts using ShiftML on structures extracted from the CSD. By relating the shifts obtained to molecular fragment descriptors, we obtain probabilistic assignments of organic crystals directly from their molecular structure.

RESULTS

The framework presented here was applied to a set of various organic molecules for which the carbon chemical shift assignment was already (at least partially) determined experimentally. The selected set is composed of theophylline (), thymol (), cocaine (), strychnine, AZD5718 (), lisinopril (), ritonavir, the K salt of penicillin G (), β-piroxicam (), decitabine (), and simvastatin (). The experimental spectra used for the assignment of strychnine and ritonavir are shown in figs. S1 and S2. Experimental shifts of lisinopril were obtained from a dihydrate form (). Experimental shifts of ritonavir were obtained from the polymorphic form II. Graph generation is the starting point of statistical assignment and can be performed directly from the 2D representation of the molecule. Figure 2 (A and B) shows the graphs generated for illustrative carbon atoms in theophylline with a depth w = 3. The chemical shift distributions of the carbon labeled 4 in theophylline corresponding to different graph depths are shown in Fig. 2C, together with the corresponding graphs. As expected, the distribution changes as w is increased, until at w = 3 and above where they are found to be highly similar, with a width dominated by the uncertainty in the ShiftML prediction. We thus selected a minimum number of 10 instances to construct each probability distribution and used the maximum graph depth that fulfils this requirement for each nucleus.

Fig. 2.

Graph and statistical distribution.

Graph and statistical distribution.

(A) 2D structure and carbon labeling scheme of theophylline. (B) Graphs of carbons 1, 2, 3, and 4 of theophylline constructed at a depth w = 3. In each graph, the red vertex corresponds to the central atom (for which the chemical shift distribution is extracted), and blue vertices indicate the atoms at the maximum shortest path from the central vertex. (C) Chemical shift distributions obtained corresponding to the carbon labeled 4, with different graph depths w. The number of instances from the database used to construct each distribution is indicated in parentheses. The prior statistical distribution of chemical shifts for each atom in a molecule can be constructed from the shifts predicted for all atoms in the database that share the same graph. Evaluating the obtained statistical distributions at the observed shifts yields the probability of observing each shift originating from each nucleus in the molecule (Fig. 3, A and B). The possible combinations of individual assignments, based on a Bayesian construction, make it possible to associate a probability to each global assignment of all shifts. After obtaining the probability for each global assignment in the set, marginalization yields individual assignment probabilities (Fig. 3C). In this case, the most probable individual assignment for each carbon, as well as the most probable global assignment, was found to correspond to the experimental assignment of theophylline (black dots in Fig. 3C).

Fig. 3.

Probabilistic assignment of theophylline.

Probabilistic assignment of theophylline.

(A) Statistical 13C chemical shift distributions for theophylline (colored lines). The carbon labels follow Fig. 2A. Experimental shifts are indicated by black vertical lines below the distributions and are labeled (a) through (f) in order of decreasing chemical shift (see table S9). (B) Probabilities of observing each chemical shift of theophylline for a given carbon nucleus. (C) Marginal individual assignment probabilities of the 13C chemical shifts of theophylline after Bayesian inference of the possible global assignments. The dots indicate the experimentally determined correct assignment. Overlap of the chemical shift distributions can lead to highly ambiguous assignments. A common method to separate overlapping NMR signals consists in spreading them along multiple dimensions. The HETCOR experiment yields high-sensitivity correlated 1H and 13C chemical shifts of dipolar coupled nuclei and can be tuned to obtain a spectrum dominated by one-bond correlations (, ). The correlated statistical distributions of chemical shifts corresponding to a simulated HETCOR can be obtained by considering bonded CH pairs in the molecule. This additional dimension often helps separate overlapping 1D statistical distributions and chemical shifts by incorporating the additional information given by the 1H chemical shift. In addition, this can also be used to simultaneously assign 13C and 1H chemical shifts. Figure 4 depicts the probabilistic assignment of bonded 13C-1H chemical shifts of thymol using 2D correlated statistical shift distributions. The pair of topologically equivalent bonded C-H groups (labeled 9 and 10) was assigned to a pair of experimental shifts in Fig. 3D, as the disambiguation of topologically equivalent nuclei cannot be performed from the 2D representation of a molecule. As seen in Fig. 3B, the assignment of the carbon labeled 8 would have been much more ambiguous using only 13C chemical shifts. The probability of assigning carbon 8 to chemical shift e is 34% using only statistical distributions of 13C chemical shifts (Fig. 4E) and 100% using correlated statistical distributions of 1H and 13C chemical shifts (Fig. 4D). We note that the most probable assignments of carbons 6 and 7 and of the methyl groups 1, 9, and 10 do not match the experimentally determined ones. We attribute these discrepancies to substantial overlap between the corresponding statistical distributions of chemical shifts that arise because of similar local bonding environments of carbons 6 and 7 and of methyl groups.

Fig. 4.

Probabilistic assignment of thymol using simulated HETCOR.

Probabilistic assignment of thymol using simulated HETCOR.

(A) Carbon labeling scheme of thymol. (B) Contour plot of the correlated statistical chemical shift distributions of bonded 13C-1H in thymol. The carbon labels follow (A). Experimental shifts are indicated by black dots and are labeled alphabetically in order of decreasing 13C chemical shift (see fig. S10). The statistical distributions, normalized such that their maximum is one, are drawn as contour plots at levels 0.1, 0.5, and 0.9. (C) Probabilities of observing each 13C-1H shift pair in thymol for a given carbon nucleus. (D) Marginal individual assignment probabilities of unique directly bonded CH pairs and of pairs of topologically equivalent CH pairs (insert) in thymol. (E) Marginal individual assignment probabilities of unique carbons and of pairs of topologically equivalent carbons (insert) in thymol using only 13C chemical shift distributions. In (D) and (E), the dots indicate the experimentally determined correct assignment. In addition to HETCOR, spectral editing methods are also straightforward high-sensitivity experiments that can be performed routinely to aid assignment. These experiments are able to separate 13C chemical shifts according to the number of bonded protons (multiplicity) (–). The method can thus be directly applied to the statistical assignment framework presented here to break down the statistical assignment problem into smaller subproblems of reduced complexity. This is especially useful when considering molecules yielding substantial overlap of statistical distributions. Knowledge of the multiplicity of 13C chemical shifts can also be used to select a subset of HETCOR peaks to assign. Figure 5 shows the assignment of 13C and 1H-13C chemical shifts of strychnine using the combination of spectral editing and correlated statistical distributions of chemical shifts. In Fig. 5D, the chemical shifts of carbons without any proton attached were assigned using the 1D 13C chemical shift distributions of the associated nuclei. Carbons with a single bonded proton were assigned using the correlated 1H-13C statistical chemical shift distributions. The carbons with two attached protons were assigned to pairs of correlated 1H-13C chemical shifts, restricting the 13C shift to be unique in each pair.

Fig. 5.

Probabilistic assignment of strychnine using simulated HETCOR and spectral editing.

Probabilistic assignment of strychnine using simulated HETCOR and spectral editing.

(A) Carbon labeling scheme of strychnine. (B) 125-MHz 13C CPMAS NMR spectrum of strychnine. (C) 1H-13C HETCOR spectrum of strychnine. (D) Marginal individual assignment probabilities of the carbon nuclei of strychnine. The carbon multiplicity is indicated above each probability map. The HETCOR shifts were used to assign CH and CH2 carbons. The shifts are labeled alphabetically in order of decreasing chemical shift (see table S11). (E) The three most probable global assignments for the different blocks assigned individually along with their probability. The individual assignments making up the global assignments are indicated in blue if they correspond to the experimentally determined assignment and in red otherwise. Carbons 1, 8, and 12 were assigned without ambiguity (P = 100%) directly from the evaluation of their statistical distributions of chemical shifts on the observed shifts. Figure 5E summarizes the three most probable global assignments of strychnine. For each assignment, the global assignment is broken down into blocks by multiplicity and then potentially into sub-blocks where there is no significant probability of overlap according to a threshold (here, a factor 100 with respect to the highest probability for each nucleus). For each subassignment, there is an associated probability. The most probable assignment of each block was found to match the experimentally determined one, except for the assignment of CH2 groups, where the assignments of carbons 21 and 19 are swapped compared to the experimentally determined assignment. This is due to the large difference between the distribution of chemical shifts and experimental shift of carbon 19 (see fig. S9), which could come from an unusual intermolecular environment of that atomic site in the crystal structure. We consider that a reliable assignment is difficult to extract from the set of global assignments and associated probabilities, especially in cases with a large number of overlapping distributions and shifts, which yield many possible global assignments. Marginalization helps simplify the analysis of global assignments and identify ambiguities more easily. This can be seen in Fig. 5D, where the assignment of carbon 7 to shift d is favored compared to shifts b and c, which suggests only a pairwise uncertainty between carbon 2 and 14. In addition to strychnine, shown in Fig. 5, the marginal individual assignment probabilities obtained for a set of nine selected molecules with complete experimental assignments (except for the two phenyl rings of ritonavir) using spectral editing and correlated 1H-13C statistical chemical shift distributions are shown in Fig. 6 and figs. S16 to S18. The assignment of carbon nuclei without any attached proton were obtained from the 1D statistical distributions of 13C chemical shifts. The statistical distributions of chemical shifts for each example are shown in figs. S3 to S10. Notably, the assignment of lisinopril was found to be possible even when omitting the water molecules present in the crystal structure.

Fig. 6.

Probabilistic assignment of six organic crystals.

Probabilistic assignment of six organic crystals.

Marginal individual assignment probabilities of 13C chemical shifts of (A) theophylline, (B) thymol, (C) cocaine, (D) ritonavir, (E) lisinopril, and (F) AZD5718 using correlated 1H-13C chemical shift distributions and spectral editing. For each probability map, labels along the vertical axis indicate nuclei, and labels along the horizontal axis denote experimental shifts labeled alphabetically in order of decreasing 13C shift (see tables S9 to S14). The carbon multiplicity is indicated above each marginal assignment probability map. In (D), the assignment of carbons 9 to 13 and 19 to 23 is not shown, as their experimental assignment is ambiguous. Nevertheless, the associated peaks were considered during the assignment process. The assignment probabilities of the aromatic CH groups of ritonavir are shown in fig. S11. Figure 7 shows the assignment of the K salt of penicillin G. Only the organic ion was considered to construct the graph descriptors used to extract statistical distributions of chemical shifts. As for the presence of the water molecule in the case of lisinopril above, here, the presence of the potassium ion, which is absent from the database, did not lead to a significant decrease in the ability of the model to predict the assignment, highlighting its generality beyond molecules for which chemical shifts can be computed by ShiftML. While ShiftML would not be able to compute shifts for crystals where even only one atom is different from C, H, N, O, and S, this model only requires the molecule to be assigned to only contain these elements to obtain the probabilistic assignment. If the additional component in a salt or a cocrystal were to lead to a very different crystalline environment from those included in the database, then this might lead to poor performance of the probabilistic assignment.

Fig. 7.

Probabilistic assignment of the K salt of penicillin G.

Carbon labeling scheme and marginal individual assignment probabilities of the K salt of penicillin G. The shifts are labeled (a) through (m) in order of decreasing chemical shift (see table S16).

Probabilistic assignment of the K salt of penicillin G.

Carbon labeling scheme and marginal individual assignment probabilities of the K salt of penicillin G. The shifts are labeled (a) through (m) in order of decreasing chemical shift (see table S16). The marginal individual assignment probabilities obtained directly from the 2D representation of the molecules were found to match the experimentally determined assignment in most cases. We observe that assignment ambiguities generally involve pairs or triplets of nuclei and shifts, leaving only a few possibilities for the NMR spectroscopist to further investigate to obtain the complete chemical shift assignment. Of the 178 experimental individual assignments considered in Figs. 5 to 7 and figs. S16 to S18, only 8 were associated with a probability below 10% and two below 1%. These low probabilities were generally associated with crowded regions in experimental spectra or with statistical outlier shifts compared to the distributions, which could have originated from unusual intermolecular environments. To validate these results in a statistically significant manner, we evaluated the performance of the framework presented here on a benchmark set of a hundred crystal structures having between 10 and 20 different carbon atoms, randomly selected from the CSD database. In total, this corresponds to 1214 inequivalent carbon atoms. We used the ShiftML-predicted shifts for each atom as the correct assignment and excluded those shifts from the statistical distributions used to assign the molecules. The benchmark set was separated into five subsets containing 20 structures each that were evaluated independently to obtain SDs. Although using shifts predicted by ShiftML may introduce a bias, as the same method was used to construct the database of shifts, we assumed that the Gaussian width used to construct the statistical distributions of chemical shifts and the exclusion of the shifts assigned from the sets of shifts used to construct those distributions mitigate this issue. Figure 8 summarizes the performance of the probabilistic assignment model on the experimental (Fig. 8A) and synthetic (Fig. 8B) sets of molecules selected. The use of spectral editing and correlated 1H-13C chemical shift distributions was found to improve the ability of the model to correctly assign carbon chemical shifts. Using either 2D statistical distributions of chemical shifts, spectral editing, or combining both led to the experimental assignment being among the two most probable marginal assignments in more than 80% of cases. Overall, the performances on the experimental benchmark set were consistent with the synthetic benchmark set, except when using spectral editing where a slight improvement in the experimental set compared to the synthetic set was observed.

Fig. 8.

Model performances.

Comparison of probabilistic assignment performances using 1D (13C) or 2D (1H-13C) statistical distributions and including spectral editing (SE). Proportion of the experimental assignments being within the n (n = 1, 2, and 3) most probable marginal individual assignments in the (A) experimental and (B) synthetic benchmark sets of molecules. Error bars indicate the SD over the five subsets making up the synthetic benchmark set.

Model performances.

DISCUSSION

The framework presented here allows chemical shift assignment of organic crystals directly from their 2D structure. This was achieved through the chemical shift prediction for more than 200,000 organic crystal structures, which yields statistical distributions of chemical shifts for given covalent environments. A Bayesian framework was then used to obtain probabilistic marginal assignments of individual nuclei from the probabilities of the set of global assignments generated. Overall, using correlated 1H-13C chemical shift distributions in tandem with spectral editing, the method was found to include the experimental assignment among the two most probable marginal assignments in more than 80% of cases. Furthermore, in most cases, any ambiguity is found in small subgroups of shifts. This is highlighted in lisinopril in, for example, the CH2 carbons because of significant overlap between the corresponding statistical distributions of chemical shifts and because of similar experimental shifts (see fig. S7). In summary, the approach presented here can provide marginal assignments based only on the 2D molecular structure, where typically most of the resonances will be assigned with high probabilities, and only a few resonances will show ambiguities among doubles or triples that can then be the subject of targeted experiments for disambiguation, if needed, or left unresolved and assigned such that the error is minimized when compared with computed shifts for model structures (e.g., when performing NMR-driven crystal structure determination). This can greatly accelerate the assignment process. In particular, the method is shown to provide assignments for molecules such as strychnine, lisinopril, AZD5718, and ritonavir, which have crowded 13C spectra with between 20 and 40 distinct carbons and which would have been previously completely unaddressable without resorting to natural abundance 13C-13C correlations. For example, in strychnine, of the 21 carbons, 14 are correctly assigned with more than 75% confidence. The model was also successfully applied to the assignment of a hydrate and an organic salt, with no significant performance loss compared to the benchmark set. We expect that a more accurate model of chemical shifts could lead to improved probabilistic assignment through the framework presented here. The method shown here is not restricted to 1H and 13C and can be used to assign the isotropic shifts of any NMR-active isotope of hydrogen, carbon, nitrogen, and oxygen in principle. To illustrate that, fig. S19 and Supplementary Text describe the probabilistic assignment of the 15N shifts of AZD5718. The code is publicly available at https://github.com/manucordova/ProbAsn, and a user guide is available in Supplementary Text and on the GitHub webpage. A suggested workflow to assign an organic solid is also described in Supplementary Text.

MATERIALS AND METHODS

NMR spectroscopy

The samples of strychnine and ritonavir form II were purchased from Sigma-Aldrich and Tokyo Chemical Industry, respectively. Experiments were performed on Bruker Ascend 400 and Ascend 500 wide-bore Avance III and 900 US2 wide-bore Avance Neo NMR spectrometers. The spectrometers operate at 1H Larmor frequencies of 400, 500, and 900 MHz, respectively, and are equipped with H/X/Y 3.2-mm, H/X/Y 4.0-mm, H/C/N/D 1.3-mm, and H/C/N 0.7-mm CPMAS probes. 1D 1H MAS NMR spectra were recorded at a temperature of 298 K using rotor spinning rates (νr) up to 111 kHz. 1D 13C CPMAS () NMR spectra were acquired at 298 K with νr of 12.5 and 22 kHz for ritonavir and strychnine, respectively. During the signal acquisition SPINAL-64 decoupling () was applied with a 1H radio frequency (rf) field amplitude of 100 kHz. For ritonavir spectral editing, experiments were used to distinguish carbons with different numbers of protons attached to them. To selectively remove quaternary carbons, a 1D version of MAS-J-HSQC () was used; to remove quaternary and primary carbons, a double quantum filter was added to the MAS-J-HSQC () sequence, and to remove primary and secondary carbons, a simple CP experiment with an inserted delay of 0.5 ms before acquisition and after the CP pulse was applied (). 2D 1H-13C HETCOR experiments were carried out at 298 K using νr = 22 kHz. During t1, 100-kHz eDUMBO-122 was applied to decouple the 1H-1H dipolar coupling (), and during t2, 100-kHz SPINAL-64 decoupling was applied. The natural abundance 2D 13C-13C refocused INADEQUATE (, ) spectra required for the direct experimental assignment for ritonavir and strychnine were acquired using a Bruker 400-MHz Ascend NMR spectrometer. The probe was configured into 1H/13C double resonance mode. Variable amplitude CP () was used to transfer polarization from 1H to 13C. SPINAL-64 () heteronuclear 1H decoupling with rf fields of 100 kHz was applied in all cases. The temperature of the sample for ritonavir was 250 K, and a 4-mm rotor was used with a spinning frequency of 12.5 kHz. Experiments (2 × 120 hours) were acquired and combined in postprocessing to obtain the final spectrum (total time, 10 days). For strychnine, dynamic nuclear polarization (DNP) was used (). The sample was impregnated with 10 mM AMUPol dissolved in 60:30:10 glycerol-d8:D2O:H2O. The spectrometer is equipped with a low-temperature magic angle spinning 3.2-mm probe and connected through a corrugated waveguide to a 263-GHz gyrotron capable of outputting ca. 5 to 10 W of continuous-wave microwaves (). The sweep coil of the main magnetic field was optimized so that the microwave irradiation gave the maximum positive proton DNP enhancement with binitroxide cross-effect–based polarizing agents (e.g., AMUPol and TEKPol). The temperature of the sample for ritonavir was 92 K, and a 3.2-mm rotor was used with a spinning frequency of 12.5 kHz. A DNP enhancement of 36 was determined on the basis of the ratio of the area of the spectra acquired with and without microwave irradiation. The DNP-enhanced natural abundance 2D 13C-13C refocused INADEQUATE experiment () was run for 45 hours. All chemical shifts were referenced via alanine. The full set of acquisition parameters is given in tables S1 to S4.

Selection of crystal structures

The structures used to construct the chemical shift database were obtained from the CSD (). Only the organic crystal structures suitable for chemical shift predictions were selected. The corresponding selection criteria were that every structure must only contain C, H, N, O, and S atoms and that the disorder is resolvable. Missing protons were added automatically using the tool built into the CSD Python API. In total, 205,069 valid structures were selected.

Relaxation and chemical shift prediction

Because proton positions in published single-crystal x-ray diffraction structures may not correspond to the actual hydrogen positions in the crystals, they have to be optimized. Because of the large number of structures selected, DFT relaxation would be prohibitively costly. The semiempirical density functional tight binding (DFTB) method () was thus chosen to relax proton positions in all structures. The structures were optimized at the DFTB3-D3H5 level of theory (, ) using the 3ob-3-1 parameter set (, ). Further computational details are given in Supplementary Text. Instances where the structure relaxation failed were discarded. A total of 203,303 structures were successfully relaxed and considered for chemical shift prediction. All chemical shift predictions were performed using ShiftML version 1.2 (publicly available at https://shiftml.epfl.ch) (, ). Conversions of predicted shieldings to chemical shifts were performed by least squares fitting of the shieldings obtained for benchmark sets of DFTB-relaxed structures to their experimental chemical shifts, fixing the slope to a value of −1. The offsets obtained were found to be 30.96 parts per million (ppm) for 1H, 168.64 ppm for 13C, 185.99 ppm for 15N, and 205.08 ppm for 17O. This corresponds to 1H and 13C shifts relative to tetramethylsilane (TMS), 15N shifts relative to NH4Cl, and 17O shifts relative to liquid H2O. The sets of structures and isotropic chemical shifts used to determine shielding-to-shift conversions are described in tables S5 to S8. We note that chemical shieldings are stored in the database and converted to chemical shifts on the fly during the construction of chemical shift distributions. In total, the database contains 5,243,129 unique 1H, 4,847,864 unique 13C, 466,370 unique 15N, and 867,446 unique 17O chemical shifts, respectively.

Molecular fragment descriptors

For assignment of the spectrum of a molecule of unknown structure, classification of the predicted shifts should be done such that a statistical distribution of chemical shifts can be obtained for any nucleus from the 2D representation of a molecule. The molecular fragment descriptor should thus not contain any information about conformation or molecular packing in the crystal structures. Among the topological atom-centered descriptors that fit these requirements (–), we chose to represent topological atomic environments by graphs where vertices represent atoms and edges represent covalent connectivities. The vertices were labeled by element, and the edges were kept unlabeled. Graphs were cut to a maximum depth w of 6, defined as the maximum shortest path between the central vertex (for which the chemical shift is predicted) and any other vertex in the path. Conversion of the 3D crystal structures to their corresponding graphs was performed by identifying atom pairs as covalently bonded when the distance between the atoms in the pair is less than 1.1 times the sum of the covalent radii of the atoms involved.

Database construction and search

A given topological atomic environment can be searched by identifying which graphs in the database match the graph of the selected atomic environment. However, there is no known algorithm able to solve the graph isomorphism problem required for each database entry in polynomial time (, ). Thus, the search was simplified by using the Weisfeiler-Lehman hash () as a unique graph identifier. If the number of instances of a given atomic environment identified in the database was deemed too small to produce statistically significant chemical shift distributions, then the atomic environment was searched again after reducing the graph depth. For this work, we chose a minimum number of instances of 10. Further details concerning the database architecture and search can be found in Supplementary Text.

Construction of probability distributions

We use a notation and a conceptual framework extending the Bayesian selection of crystal structure prediction candidate structures compatible with measured shifts (). From the set of chemical shifts and uncertainties {y, σ} predicted by ShiftML for the CSD structures that share the same graph G as the atom i in the molecule of interest, we define the probability of observing a chemical shift y for that atom as proportional to the sum of Gaussian functions centered on each predicted shift y and with a width σ given by its prediction uncertainty Similarly, we define the probability of observing a cross-peak (y(1), y(2)) for a pair of bonded atoms (i, j) in a molecule as proportional to the sum of uncorrelated 2D Gaussian functionswhere and are the sets of chemical shifts and predicted uncertainties computed for all the bonded atoms in the reference dataset that share the same graph G as the pair being considered.

Probabilistic assignment

Considering the vector of observed shifts y, the probability that one of its elements y originates from atom i is obtained by evaluating Eq. 1 (or Eq. 2) for all elements in y For a given assignment a (defined as the vector mapping atoms in the molecule to experimental shifts such that a = j if atom i is assigned to shift j), the probability of observing a vector of chemical shifts y is given by Applying Bayes theorem on Eq. 4 yields the probability of an assignment a given the observed vector of shifts y In Eq. 5, we assume that p(a) is a nonzero constant if the assignment is valid (i.e., if all nuclei are assigned to only one chemical shift and if all observed shifts are assigned at least one nucleus) and zero otherwise. Whenever some of the assignments can be made according to experimental data or heuristic arguments, such prior information can be incorporated in the definition through p(a). By combining individual assignments, the complete set of possible global assignments can be generated. Because of the combinatorial complexity of generating all possible global assignments, several procedures were implemented to reduce the global assignment generation cost while ensuring that the most probable assignments are generated, and these are described in Supplementary Text. Note that if the probability of any shift originating from a given nucleus is lower by a set threshold (typically a factor of 100) than the maximum probability for that nucleus, then it is discarded. This results in some nuclei being assigned unambiguously independently of the rest of the global assignment (e.g., shift “e” in Fig. 3). Equation 5 assigns a distinct probability to each possible assignment of the entries of the measured shifts vector y to all the environments. It is the correct probabilistic metric to compare two assignments but is hard to interpret. A more compact indicator is given by the marginal probability that atom i is assigned to shift j, which can be extracted from the set of generated assignments by considering only the vectors a containing that particular individual assignment. This is shown in Eq. 6 by the Kronecker delta δ, which selects the assignments for which a = j For topologically equivalent nuclei, which have identical graphs and probability distributions, tuples of nuclei were assigned to tuples of experimental shifts (which can be partly or entirely identical).

Synthetic benchmark set

A set of 100 randomly selected crystal structures from the database were selected to benchmark the probabilistic assignment. The selection was restricted to crystals having between 10 and 20 unique carbon atoms. The selected structures are listed in Supplementary Text. The ShiftML-predicted shifts associated to each nucleus were used as ground-truth assignment. The structure to assign was systematically excluded from the database search performed to construct statistical distributions of chemical shifts. The synthetic benchmark set was separated into five sets containing 20 crystals each and 241, 260, 212, 259, and 242 unique carbon atoms, respectively.

34 in total

1. Extended-connectivity fingerprints.

Authors: David Rogers; Mathew Hahn
Journal: J Chem Inf Model Date: 2010-05-24 Impact factor: 4.956

2. Assigning powders to crystal structures by high-resolution (1)H-(1)H double quantum and (1)H-(13)C J-INEPT solid-state NMR spectroscopy and first principles computation. A case study of penicillin G.

Authors: Nicolas Mifsud; Bénédicte Elena; Chris J Pickard; Anne Lesage; Lyndon Emsley
Journal: Phys Chem Chem Phys Date: 2006-06-20 Impact factor: 3.676

Review 3. Chemical shift computations on a crystallographic basis: some reflections and comments.

Authors: Robin K Harris; Paul Hodgkinson; Chris J Pickard; Jonathan R Yates; Vadim Zorin
Journal: Magn Reson Chem Date: 2007-12 Impact factor: 2.447

4. A new algorithm for reliable and general NMR resonance assignment.

Authors: Elena Schmidt; Peter Güntert
Journal: J Am Chem Soc Date: 2012-07-23 Impact factor: 15.419

5. An improved broadband decoupling sequence for liquid crystals and solids.

Authors: B M Fung; A K Khitrin; K Ermolaev
Journal: J Magn Reson Date: 2000-01 Impact factor: 2.229

6. Solid-state NMR spectroscopy.

Authors: Bernd Reif; Sharon E Ashbrook; Lyndon Emsley; Mei Hong
Journal: Nat Rev Methods Primers Date: 2021-01-14

7. A Bayesian approach to NMR crystal structure determination.

Authors: Edgar A Engel; Andrea Anelli; Albert Hofstetter; Federico Paruzzo; Lyndon Emsley; Michele Ceriotti
Journal: Phys Chem Chem Phys Date: 2019-10-21 Impact factor: 3.676

8. Effects of polymorphic differences for sulfanilamide, as seen through 13C and 15N solid-state NMR, together with shielding calculations.

Authors: Alessia Portieri; Robin K Harris; Richard A Fletton; Robert W Lancaster; Terence L Threlfall
Journal: Magn Reson Chem Date: 2004-03 Impact factor: 2.447

9. Quantifying weak hydrogen bonding in uracil and 4-cyano-4'-ethynylbiphenyl: a combined computational and experimental investigation of NMR chemical shifts in the solid state.

Authors: Anne-Christine Uldry; John M Griffin; Jonathan R Yates; Marta Pérez-Torralba; M Dolores Santa María; Amy L Webber; Maximus L L Beaumont; Ago Samoson; Rosa María Claramunt; Chris J Pickard; Steven P Brown
Journal: J Am Chem Soc Date: 2008-01-01 Impact factor: 15.419

10. 13C and (15)N chemical shift tensors in adenosine, guanosine dihydrate, 2'-deoxythymidine, and cytidine.

Authors: Dirk Stueber; David M Grant
Journal: J Am Chem Soc Date: 2002-09-04 Impact factor: 15.419

1 in total

1. A Machine Learning Model of Chemical Shifts for Chemically and Structurally Diverse Molecular Solids.

Authors: Manuel Cordova; Edgar A Engel; Artur Stefaniuk; Federico Paruzzo; Albert Hofstetter; Michele Ceriotti; Lyndon Emsley
Journal: J Phys Chem C Nanomater Interfaces Date: 2022-09-23 Impact factor: 4.177

1 in total