Stefan Kuhn, Björn Egert, Steffen Neumann, Christoph Steinbeck.
Abstract
BACKGROUND: Current efforts in metabolomics, such as the Human Metabolome Project, collect structures of biological metabolites as well as data for their characterisation, such as spectra for identifying substances and measurements of their concentration. Still, only a fraction of existing metabolites and their spectral fingerprints are known. Computer-Assisted Structure Elucidation (CASE) of biological metabolites will be an important tool to address this lack of knowledge. Modules that predict spectra for hypothetical structures are indispensable for CASE. This paper evaluates different statistical and machine learning methods for predicting proton NMR spectra based on data from our open database NMRShiftDB.
Year: 2008 PMID: 18817546 PMCID: PMC2605476 DOI: 10.1186/1471-2105-9-400
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. An example of an assigned proton spectrum from the NMRShiftDB (molid 10019871).
Mean absolute error in ppm from a decision tree prediction by proton category classes.
| | aromatic | non-aromatic π | rigid aliphatic systems | non-rigid aliphatic systems |
| number of protons | 4013 | 1279 | 5062 | 8318 |
| mean abs. error | 0.168 | 0.273 | 0.204 | 0.135 |
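As a quick arithmetic check, the per-category errors in the table can be pooled into a single proton-weighted mean absolute error. The sketch below only uses the counts and errors from the table; the pooled value is derived here and is not a figure reported in the source.

```python
# Proton counts and per-category mean absolute errors (ppm), copied from the table.
categories = {
    "aromatic": (4013, 0.168),
    "non-aromatic pi": (1279, 0.273),
    "rigid aliphatic": (5062, 0.204),
    "non-rigid aliphatic": (8318, 0.135),
}

total_protons = sum(count for count, _ in categories.values())

# Weight each category's error by the number of protons it contains.
pooled_mae = sum(count * mae for count, mae in categories.values()) / total_protons

print(f"{total_protons} protons, pooled MAE = {pooled_mae:.3f} ppm")
```

The four categories add up to 18672 protons, the same total quoted for the data set in the Figure 7 caption, and the pooled error comes out at roughly 0.17 ppm.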
Figure 2. Hierarchy of the descriptor set, with proportion of types. Taken from 1 = [9] and 2 = [12].
Figure 3. Coverage of chemical space by NMRShiftDB. A set of atomic descriptors was calculated for protons in PubChem (50,000 structures/465,000 protons; in black) and for the protons with assigned shifts in the NMRShiftDB (in red; 1829 structures/18692 protons attached to carbons) and plotted using principal component analysis.
Figure 4. PCA of the numerical attributes in the descriptor set. The symbols correspond to the four different proton categories suggested in [9]: "○" for protons in aromatic rings, "△" for protons in non-aromatic π systems, "+" for protons in rigid aliphatic systems and "×" for protons in non-rigid aliphatic systems.
Figure 5. Histogram of the shift values. The most prominent regions are the aliphatic and aromatic values at 1.3 and 7.4 ppm, respectively.
Figure 6. Circle segments plot of the categorical attributes in the descriptor set. The values in each segment are sorted by increasing shift from the centre of the circle outwards. The chemical shift in the range [0..10.7] is mapped to the coloured interval [0..100]. The number of distinct attribute levels i is given in square brackets. Those values are mapped to the colour map as i/max(i) * 100. The influence of the dichotomous spatial variables PkHyb* and TopoPicontact* on the chemical shift becomes less discernible the farther away the respective heavy atom is.
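The two colour mappings described in the Figure 6 caption can be written out as a short sketch. The caption only gives the ranges and the formula i/max(i) * 100; the linear form of the shift mapping, the clipping of out-of-range shifts and the function names are assumptions made for illustration.

```python
SHIFT_MAX_PPM = 10.7  # upper end of the shift range stated in the caption


def shift_to_colour(shift_ppm: float) -> float:
    """Map a shift in [0, 10.7] ppm onto [0, 100] (linear mapping assumed)."""
    clipped = min(max(shift_ppm, 0.0), SHIFT_MAX_PPM)  # clipping is an assumption
    return clipped / SHIFT_MAX_PPM * 100.0


def level_to_colour(level: int, max_level: int) -> float:
    """Map attribute level i onto the colour scale as i / max(i) * 100."""
    return level / max_level * 100.0


print(shift_to_colour(7.4))   # colour value for a typical aromatic shift
print(level_to_colour(2, 4))  # second of four attribute levels
```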
Figure 7. Comparison of the real shift values and the predicted values on the data set (18672 protons) for a Random Forest with selected features.
Figure 8. Mean absolute error of the investigated classifiers; the standard error (SE) is given in brackets. Classifiers trained with selected features are annotated with an additional #. Bagged classifiers are annotated with an additional * and boosted classifiers carry an additional + symbol. MAE/SE are calculated with two decimal places for δ and one decimal place for δ in terms of classification.
Figure 9. Mean Decrease Accuracy (%IncMSE) and Mean Decrease Gini (IncNodePurity) of attributes as assigned by the random forest, sorted in decreasing order from top to bottom. The abbreviations of the descriptors can be found in table 1 of additional file 1. For the mean decrease in accuracy, the most relevant descriptors relate to the proton itself, to the carbon it is connected to, or to atoms close to it. The most important types of descriptors are hybridisation, electronegativity, distances and whether the proton is joined to a conjugated π system.
Figure 10. Representative examples of typical misclassifications ('hard errors'). NMRShiftDB-molid: 10019871, atomID: 10189925 at heavy atom nr. 16 (0.62 ppm), atomID: 10189928 at heavy atom nr. 17 (0.27 ppm). NMRShiftDB-molid: 10021849, atomID: 10324987 at heavy atom nr. 22 (0.73 ppm), atomID: 10324990 at heavy atom nr. 23 (0.27 ppm). NMRShiftDB-molid: 20062578, atomID: 20958256 at heavy atom nr. 17 (3.53 ppm), atomID: 20958257 at heavy atom nr. 17 (2.72 ppm). NMRShiftDB-molid: 10027587, atomID: 11110950 at heavy atom nr. 7 (2.11 ppm), atomID: 11110951 at heavy atom nr. 7 (1.45 ppm). NMRShiftDB-molid: 10021815, atomID: 10323000 at heavy atom nr. 13 (1.96 ppm), atomID: 10323001 at heavy atom nr. 13 (1.11 ppm). NMRShiftDB-molid: 10022075, atomID: 10335470 at heavy atom nr. 13 (3.61 ppm), atomID: 10335471 at heavy atom nr. 13 (2.90 ppm). NMRShiftDB-molid: 21156, atomID: 283794 at heavy atom nr. 1 (8.21 ppm). NMRShiftDB-molid: 21324, atomID: 296084 at heavy atom nr. 2 (6.73 ppm). The shift values provided are experimental, not predicted. Predictions are given in table 2. The corresponding descriptor set is provided in table 2 of additional file 1. The HOSE codes of these atoms are shown in table 4. The different descriptor sets for atoms 10189925 and 10189928 are shown in table 3.
The set of descriptors with distinctive values for the atomIDs of CH3 protons from NMRShiftDB-molid: 10019871 at heavy atoms 16 and 17, from figure 10 (top left).
| Descriptor | 10189925 | 10189928 |
| PkcPeriod12spat | 20.00 | 40.00 |
| PkcPeriod13spat | 40.00 | 20.00 |
| PkcPeriod15spat | 20.00 | 40.00 |
| PkcPiEN16spat | 16.57 | 0.00 |
| PkcSigmaEN12spat | 73.73 | 79.00 |
| PkcSigmaEN13spat | 78.73 | 73.60 |
| PkcSigmaEN15spat | 80.59 | 97.03 |
| PkcValenceelectrons12spat | 14.29 | 57.14 |
| PkcValenceelectrons13spat | 57.14 | 14.29 |
| PkcValenceelectrons15spat | 14.29 | 85.71 |
| PkcVdwradius12spat | 57.14 | 80.95 |
| PkcVdwradius13spat | 80.95 | 57.14 |
| PkcVdwradius15spat | 57.14 | 72.38 |
| PkHyb15spat | 0.00 | 100.00 |
| SpaMindistToHy08spat | 38.80 | 29.55 |
| SpaMindistToHy12spat | 42.07 | 29.40 |
| SpaMindistToHy14spat | 42.40 | 36.07 |
| SpaMindistToHy15spat | 37.96 | 25.46 |
| SpaMindistToHy16spat | 39.37 | 29.20 |
| SpatAvdistohy12spat | 44.16 | 37.44 |
| SpatAvdistohy14spat | 44.58 | 37.87 |
| SpatAvdistohy15spat | 43.27 | 34.91 |
| SpatAvdistohy16spat | 42.74 | 36.12 |
| SpatDisttoatom12spat | 42.70 | 36.98 |
| SpatDisttoatom14spat | 44.74 | 37.29 |
| SpatDisttoatom15spat | 43.94 | 37.62 |
| SpatDisttoatom16spat | 41.95 | 35.81 |
| TopoBondsToAtom08spat | 15.79 | 21.05 |
HOSE Codes for given atomIDs (hard errors)
| atomID | HOSE code |
| 10189925 | H-1;C(HHC/HCC/HHC,HHH)HCN/=OO,HS/ |
| 10189928 | H-1;C(HHC/HCC/HHC,HHH)HCN/=OO,HS/ |
| 10324987 | H-1;C(HHC/HCC/HCN,HHH)HHO,CC/&,=OC,=O&/ |
| 10324990 | H-1;C(HHC/HCC/HCN,HHH)HHO,CC/&,=OC,=O&/ |
| 20958256 | H-1;C(HCC/HCN,*C*C/HHO,CC,H,H,*C,*C)&,=OC,=O&,H,H,*C,*&/,,=CC,,H*&/ |
| 20958257 | H-1;C(HCC/HCN,*C*C/HHO,CC,H,H,*C,*C)&,=OC,=O&,H,H,*C,*&/,,=CC,,H*&/ |
| 11110950 | H-1;C(HCC/=CC,HHC/CC,CCC,=CC)HHC,HHH,HC&,HHH,HHH,HC,HHH/HH&,HHC,HHC/ |
| 11110951 | H-1;C(HCC/=CC,HHC/CC,CCC,=CC)HHC,HHH,HC&,HHH,HHH,HC,HHH/HH&,HHC,HHC/ |
| 10323000 | H-1;C(HCC/CCC,HHC/H,H,C,C,C,C,HHH,HC&)H,H&,C,C,C,HH,HH&,&,CCC/,HH,HH,C,C,HHH,H&C,HHC,HHH/ |
| 10323001 | H-1;C(HCC/CCC,HHC/H,H,C,C,C,C,HHH,HC&)H,H&,C,C,C,HH,HH&,&,CCC/,HH,HH,C,C,HHH,H&C,HHC,HHH/ |
| 10335470 | H-1;C(HCC/HCC,*C*C/=O&,HHH,*C,*C,*C,&),*C*C,H,H*C,*&/,H,H*C,*&,*&O/ |
| 10335471 | H-1;C(HCC/HCC,*C*C/=O&,HHH,*C,*C,*C,&),*C*C,H,H*C,*&/,H,H*C,*&,*&O/ |
| 283794 | H-1;C(*C*C/*C*N,H*C/*C*S,*C,H*&)H*&,*&,*&S/,=O=OC/ |
| 296084 | H-1;C(*C*C/*CC,H*C/H*C,H=N,*&C)H*&,C,H=N/,HHC,C/ |
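Several atomIDs in the table above carry identical HOSE codes, so a purely code-based lookup cannot tell them apart. The sketch below groups a subset of the rows by code to make this visible; the grouping itself is an illustration and not a step described in the source.

```python
from collections import defaultdict

# (atomID, HOSE code) pairs copied from a subset of the table rows.
rows = [
    (10189925, "H-1;C(HHC/HCC/HHC,HHH)HCN/=OO,HS/"),
    (10189928, "H-1;C(HHC/HCC/HHC,HHH)HCN/=OO,HS/"),
    (10324987, "H-1;C(HHC/HCC/HCN,HHH)HHO,CC/&,=OC,=O&/"),
    (10324990, "H-1;C(HHC/HCC/HCN,HHH)HHO,CC/&,=OC,=O&/"),
    (283794, "H-1;C(*C*C/*C*N,H*C/*C*S,*C,H*&)H*&,*&,*&S/,=O=OC/"),
    (296084, "H-1;C(*C*C/*CC,H*C/H*C,H=N,*&C)H*&,C,H=N/,HHC,C/"),
]

atoms_by_code = defaultdict(list)
for atom_id, hose_code in rows:
    atoms_by_code[hose_code].append(atom_id)

# Keep only the codes that are shared by more than one atom.
shared = {code: ids for code, ids in atoms_by_code.items() if len(ids) > 1}

for code, ids in shared.items():
    print(ids, "share", code)
```

Protons that share a code necessarily receive the same code-based prediction, which matches the identical values for each such pair in the HOSE column of the predictions table.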
Predictions of hard errors for all classifiers in ppm.
| atomID | shift | LR | HOSE | J48 | J48+ | J48* | J48#+ | J48#* | J48# | RF | RF# | RF+ | IBk | SVM |
| 10189925 | 0.62 | 0.91 | 0.85 | 0.90 | 0.80 | 0.90 | 0.90 | 0.90 | 0.90 | 0.90 | 0.90 | 0.90 | 0.87 | 0.91 |
| 10189928 | 0.27 | 0.94 | 0.85 | 0.80 | 0.90 | 0.90 | 0.90 | 0.90 | 0.90 | 0.80 | 0.90 | 0.90 | 0.78 | 0.94 |
| 10324987 | 0.73 | 1.05 | 0.95 | 0.80 | 0.80 | 0.80 | 0.80 | 1.10 | 0.80 | 0.80 | 0.80 | 0.80 | 0.82 | 0.87 |
| 10324990 | 0.27 | 1.16 | 0.95 | 0.90 | 0.90 | 1.10 | 1.00 | 0.80 | 0.80 | 1.10 | 0.80 | 1.10 | 1.33 | 0.79 |
| 20958256 | 3.53 | 2.43 | 3.01 | 2.70 | 2.70 | 2.70 | 2.70 | 2.70 | 2.70 | 2.70 | 2.70 | 2.70 | 2.72 | 2.82 |
| 20958257 | 2.72 | 2.43 | 3.01 | 2.70 | 3.50 | 3.50 | 3.50 | 3.50 | 3.50 | 3.50 | 3.50 | 3.50 | 3.53 | 2.91 |
| 11110950 | 2.11 | 1.68 | 2.24 | 1.40 | 1.40 | 1.40 | 2.30 | 2.30 | 1.40 | 2.00 | 1.40 | 1.40 | 1.45 | 2.24 |
| 11110951 | 1.45 | 1.96 | 2.24 | 2.10 | 2.10 | 2.10 | 2.30 | 2.30 | 2.10 | 2.00 | 2.10 | 2.10 | 2.11 | 2.33 |
| 10323000 | 1.96 | 1.79 | 1.52 | 1.40 | 1.10 | 1.10 | 1.10 | 1.10 | 1.10 | 1.10 | 1.10 | 1.10 | 1.11 | 1.45 |
| 10323001 | 1.11 | 1.82 | 1.52 | 1.40 | 2.00 | 2.00 | 2.00 | 2.00 | 1.40 | 2.00 | 2.00 | 2.00 | 1.96 | 1.48 |
| 10335470 | 3.61 | 2.92 | 2.91 | 2.90 | 2.90 | 2.90 | 2.90 | 2.80 | 2.90 | 2.90 | 2.90 | 2.90 | 2.90 | 3.06 |
| 10335471 | 2.90 | 2.87 | 2.91 | 2.90 | 3.60 | 3.60 | 3.60 | 3.60 | 3.60 | 3.60 | 3.60 | 3.60 | 3.61 | 3.08 |
| 283794 | 8.21 | 7.28 | 8.04 | 7.00 | 7.50 | 7.00 | 7.20 | 7.20 | 6.70 | 7.60 | 8.00 | 7.50 | 7.47 | 7.29 |
| 296084 | 6.73 | 7.33 | 7.52 | 7.70 | 7.50 | 7.40 | 8.10 | 7.60 | 7.60 | 8.10 | 7.60 | 7.40 | 7.38 | 7.53 |
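To show how a mean absolute error and its standard error follow from such predictions, the sketch below recomputes both for the HOSE and SVM columns of the table. The source does not define how SE was obtained; the sample standard deviation of the absolute errors divided by sqrt(n) is assumed here. Because the table lists only the hard errors, these values are not comparable with the data-set-wide errors in Figure 8.

```python
from math import sqrt
from statistics import mean, stdev

# Experimental shifts and two prediction columns (ppm), copied from the table.
experimental = [0.62, 0.27, 0.73, 0.27, 3.53, 2.72, 2.11,
                1.45, 1.96, 1.11, 3.61, 2.90, 8.21, 6.73]
predictions = {
    "HOSE": [0.85, 0.85, 0.95, 0.95, 3.01, 3.01, 2.24,
             2.24, 1.52, 1.52, 2.91, 2.91, 8.04, 7.52],
    "SVM": [0.91, 0.94, 0.87, 0.79, 2.82, 2.91, 2.24,
            2.33, 1.45, 1.48, 3.06, 3.08, 7.29, 7.53],
}


def mae_and_se(observed, predicted):
    """Return (MAE, SE); SE = stdev(|error|) / sqrt(n) is an assumed definition."""
    errors = [abs(o - p) for o, p in zip(observed, predicted)]
    return mean(errors), stdev(errors) / sqrt(len(errors))


results = {name: mae_and_se(experimental, values)
           for name, values in predictions.items()}

for name, (mae, se) in results.items():
    print(f"{name}: MAE = {mae:.2f} ppm (SE {se:.2f})")
```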
Figure 11. Effect of feature selection on quality of prediction. Protons in black are those correctly classified by at least one classifier; the misclassified protons ('hard errors') are shown in red with the corresponding atomID. Left: unreduced descriptor set. Right: reduced descriptor set after feature selection. The protons are sorted by increasing shift value (0 to 10.7 ppm). The y-axis shows the Manhattan distance to the nearest neighbour (NN) proton in the numeric descriptor space.
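The quantity on the y-axis of Figure 11 can be illustrated with a minimal sketch of a Manhattan nearest-neighbour distance. The three descriptor vectors are hypothetical toy values, not data from the paper, and the brute-force search is chosen for clarity rather than speed.

```python
def manhattan(a, b):
    """Sum of absolute coordinate differences between two descriptor vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest_neighbour_distance(index, vectors):
    """Manhattan distance from vectors[index] to its closest other vector."""
    target = vectors[index]
    return min(manhattan(target, other)
               for i, other in enumerate(vectors) if i != index)


# Hypothetical three-descriptor vectors for three protons (not from the paper).
protons = [
    [20.0, 40.0, 0.0],
    [40.0, 20.0, 0.0],
    [21.0, 41.0, 0.0],
]

distances = [nearest_neighbour_distance(i, protons) for i in range(len(protons))]
print(distances)
```

A proton whose nearest neighbour is very close in descriptor space but carries a clearly different shift is hard for any descriptor-based method to predict.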
Figure 12. Principle of HOSE codes. The HOSE code is built sphere-wise around the atom described; the carbon shown would have the HOSE code (four spheres): C-arom;*C*CC(*C,*C,=OC/*CX,*&,,CC/&C,,CN,C).
A lexicographically ordered section of NMRShiftDB's table of HOSE codes and associated shifts to illustrate the connection between similar HOSE codes and similar NMR shifts.
| HOSE code | Shift (ppm) |
| H-1;C(*C*C/*CC,*CO/*CO,CCC,*&C,H)H*&,H,HHH,HHH,HHH,CCC/,HHH,HHH,HHH/ | 6.89 |
| H-1;C(*C*C/*CC,*CO/*CO,H=C,*&C,C)H*&,C,CC,H=C,HHC/,HHC,*C*C,%N,CC,HHC/ | 7.88 |
| H-1;C(*C*C/*CC,*CO/*CO,H=C,*&C,C)H*&,C,CC,H=C,HHC/,HHC,*C*C,%N,HC,HHC/ | 7.88 |
| H-1;C(*C*C/*CC,*CO/*CO,H=C,*&C,C)H*&,C,HC,CCC,HHC/,HHC,*C*C,HHH,HHH,HHH,HHC/ | 7.19 |
| H-1;C(*C*C/*CC,*CO/*CO,H=C,*&C,C)H*&,C,HC,H=C,=OC/,=OC,*C*N,HC,,HHH/ | 7.18 |
| H-1;C(*C*C/*CC,*CO/*CO,H=C,*&C,C)H*&,C,HC,H=C,HHC/,HHC,*C*C,CC,HHC/ | 7.14 |
| H-1;C(*C*C/*CC,*CO/*CO,H=C,*&C,C)H*&,C,HC,H=C,HHC/,HHC,*C*C,HC,HCC/ | 7.14 |
| H-1;C(*C*C/*CC,*CO/*CO,H=C,*&C,C)H*&,C,HC,H=C,HHC/,HHC,*C*C,HC,HHC/ | 7.13 |
| H-1;C(*C*C/*CC,*CO/*CO,H=C,*&C,C)H*&,C,HC,H=C,HHC/,HHC,*N*N,HC,HHC/ | 7.13 |
| H-1;C(*C*C/*CC,*CO/*CO,H=C,*&C,C)H*&,C,HC,H=C,HHC/,HHC,=OO,HC,HHC/ | 6.99 |
| H-1;C(*C*C/*CC,*CO/*CO,H=C,*&C,C)H*&,C,HC,H=C,HHH/,HHH,*C*C,HC/ | 7.10 |
| H-1;C(*C*C/*CC,*CO/*CO,H=C,*&C,C)H*&,C,HC,H=O,HHC/,HHC,*C*C,,HCC/ | 7.20 |
| H-1;C(*C*C/*CC,*CO/*CO,H=C,*&C,C)H*&,C,HC,H=O,HHC/,HHC,*C*C,,HHC/ | 7.16 |
| H-1;C(*C*C/*CC,*CO/*CO,H=C,*&C,C)H*&,C,HC,H=O,HHC/,HHC,=OC,,HHC/ | 7.35 |
| H-1;C(*C*C/*CC,*CO/*CO,H=C,*&C,C)H*&,C,HC,HHH,HHC/,HHC,*C*C,HCC/ | 7.19 |
| H-1;C(*C*C/*CC,*CO/*CO,H=C,*&Y,C)H*&,C,HC,,HHC/,HHC,*C*C,HHC/ | 6.89 |
| H-1;C(*C*C/*CC,*CO/*CO,H=C,H*&,C)H*&,C,CC,HHC/,HHC,*C*C,%N,HHC/ | 7.10 |
| H-1;C(*C*C/*CC,*CO/*CO,H=C,H*&,C)H*&,C,CC,HHH/,HHH,*C*C,%N/ | 7.20 |
| H-1;C(*C*C/*CC,*CO/*CO,H=C,H*&,C)H*&,C,HC,HHC/,HHC,*C*C,HHC/ | 7.13 |
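The link between similar codes and similar shifts can be probed with a small sketch that sorts a few rows of the table and compares each code with its lexicographic neighbour. The character-wise common prefix is used here as a crude similarity measure for illustration only; it is an assumption and not the sphere-wise matching a HOSE-based predictor would normally perform.

```python
def common_prefix_length(a: str, b: str) -> int:
    """Number of leading characters that two strings have in common."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


# (HOSE code, shift in ppm) pairs copied from a subset of the table rows.
table = [
    ("H-1;C(*C*C/*CC,*CO/*CO,H=C,*&C,C)H*&,C,HC,H=C,HHC/,HHC,=OO,HC,HHC/", 6.99),
    ("H-1;C(*C*C/*CC,*CO/*CO,H=C,*&C,C)H*&,C,HC,H=C,HHC/,HHC,*C*C,HC,HHC/", 7.13),
    ("H-1;C(*C*C/*CC,*CO/*CO,H=C,*&C,C)H*&,C,CC,H=C,HHC/,HHC,*C*C,%N,CC,HHC/", 7.88),
    ("H-1;C(*C*C/*CC,*CO/*CO,H=C,*&C,C)H*&,C,HC,H=C,HHC/,HHC,*C*C,CC,HHC/", 7.14),
    ("H-1;C(*C*C/*CC,*CO/*CO,H=C,*&C,C)H*&,C,HC,H=C,HHC/,HHC,*N*N,HC,HHC/", 7.13),
    ("H-1;C(*C*C/*CC,*CO/*CO,H=C,*&C,C)H*&,C,HC,H=C,HHC/,HHC,*C*C,HC,HCC/", 7.14),
]
table.sort()  # lexicographic order on the code string

# For each adjacent pair: shared prefix length and absolute shift difference.
pairs = [
    (common_prefix_length(code_a, code_b), round(abs(shift_a - shift_b), 2))
    for (code_a, shift_a), (code_b, shift_b) in zip(table, table[1:])
]

for prefix_length, shift_gap in pairs:
    print(f"shared prefix: {prefix_length:3d} characters, shift gap: {shift_gap:.2f} ppm")
```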