Anna Katharina Dehof, Simon Loew, Hans-Peter Lenhof, Andreas Hildebrandt.
Abstract
BACKGROUND: NMR chemical shift prediction plays an important role in various applications in computational biology. Among others, structure determination, structure optimization, and the scoring of docking results can profit from efficient and accurate chemical shift estimation from a three-dimensional model.
A variety of NMR chemical shift prediction approaches have been presented in the past, but nearly all of them rely on laborious manual data set preparation, and the training itself is not automated. As a result, retraining a model (e.g., when new data becomes available) or testing new models is a time-consuming manual chore.
Year: 2013 PMID: 23496927 PMCID: PMC3682865 DOI: 10.1186/1471-2105-14-98
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Summary and comparison of data sets used by different hybrid shift prediction approaches
| Method | Year | Training set: proteins (shifts) | Test set: proteins (shifts) | | |
| Meiler | 2003 | 292 (n. a.) | 30 (n. a.) | N | |
| Sparta | 2007 | 200 (n. a.) | 25 (n. a.) | X-ray | N |
| CamShift | 2009 | n. a. (224,036) | 35 (n. a.) | X-ray | N |
| ShiftX2 | 2010 | 197 (206,903) | 61 (n. a.) | X-ray | Y |
| NightShift | 2012 | 515 (326,652) | 344 (217,768) | Y |
The third and fourth columns show the size of the data set, measured as the number of proteins contained within. The CamShift publication reported only the overall number of shifts, which we denote in parentheses. Note that for the NightShift set, we already excluded homologous proteins. The table only includes methods for which these numbers could be reliably determined.
Definition and number of data points for our atom super classes using notation borrowed from the Amber force field
| Super class | Definition | Number of data points |
| N | Backbone nitrogen | 65,246 |
| CA | Alpha backbone carbon | 66,579 |
| CB | Beta backbone carbon | 60,353 |
| C | Backbone carboxy carbon | 48,442 |
| H | Backbone hydrogens attached to the backbone nitrogen | 68,461 |
| HA | Side chain hydrogens on alpha positions (HA, 1HA, 2HA) | 71,066 |
| HB | Hydrogens on beta positions (HB, 1HB, 2HB) | 62,106 |
| HD | Hydrogens on delta positions (2HD, HD1, HD2, 1HD1, 1HD2, and 2HD2) | 37,514 |
| HG | Hydrogens on gamma positions (HG, 1HG1, 1HG2, 2HG, 2HG1, HG1) | 43,221 |
| HEHZ | Remaining hydrogens (HE, HE1, HE2, HE3, 2HE, 1HE, 1HE2, 2HE2, HH2, HZ, 1HZ, HZ2, HZ3) | 21,532 |
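The grouping in this table can be expressed as a simple lookup from atom name to super class. The following is only a minimal sketch: the hydrogen atom names are copied from the table, while the single atom names used for the N, CA, CB, C, and H classes are an assumption (the table describes those classes in words only), and the names `SUPER_CLASSES` and `super_class` are hypothetical.

```python
# Atom names per super class. The hydrogen side chain names are taken from
# the table; the entries for N, CA, CB, C and H are assumed atom names.
SUPER_CLASSES = {
    "N": {"N"},      # assumption: backbone nitrogen is named N
    "CA": {"CA"},    # assumption
    "CB": {"CB"},    # assumption
    "C": {"C"},      # assumption
    "H": {"H"},      # assumption: amide hydrogen is named H
    "HA": {"HA", "1HA", "2HA"},
    "HB": {"HB", "1HB", "2HB"},
    "HD": {"2HD", "HD1", "HD2", "1HD1", "1HD2", "2HD2"},
    "HG": {"HG", "1HG1", "1HG2", "2HG", "2HG1", "HG1"},
    "HEHZ": {"HE", "HE1", "HE2", "HE3", "2HE", "1HE", "1HE2", "2HE2",
             "HH2", "HZ", "1HZ", "HZ2", "HZ3"},
}

# Invert the mapping once so that each lookup is a single dictionary access.
ATOM_TO_CLASS = {
    atom: cls for cls, atoms in SUPER_CLASSES.items() for atom in atoms
}


def super_class(atom_name):
    """Return the super class for an atom name, or None if it has none."""
    return ATOM_TO_CLASS.get(atom_name.strip().upper())


if __name__ == "__main__":
    for atom in ("CA", "1HD2", "HZ3", "OXT"):
        print(atom, "->", super_class(atom))
```

Atoms that fall outside every listed class (for example a terminal oxygen) return `None` in this sketch, so a caller can skip them explicitly.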
Performance (correlation coefficients and rmse) of our models in comparison to ShiftX2 using the test set created by our pipeline
| Prediction method | N correlation (rmse) | CA correlation (rmse) | CB correlation (rmse) | C correlation (rmse) | H correlation (rmse) |
| Spinster | 0.817 (2.977) | 0.956 (1.425) | 0.992 (1.582) | 0.731 (1.524) | 0.593 (0.505) |
| ShiftX2 | 0.554 (5.606) | 0.953 (1.475) | 0.984 (2.238) | 0.711 (1.65) | 0.534 (0.583) |
| Training / test size | 39,147 / 26,099 | 39,947 / 26,632 | 36,211 / 24,142 | 29,065 / 19,377 | 41,076 / 27,385 |
| Prediction method | HA correlation (rmse) | HB correlation (rmse) | HD correlation (rmse) | HEHZ correlation (rmse) | HG correlation (rmse) |
| Spinster | 0.997 (2.889) | 0.82 (0.559) | 0.994 (0.324) | 0.981 (0.375) | 0.86 (0.365) |
| ShiftX2 | 0.517 (5.998) | 0.721 (1.033) | 0.012 (0.706) | 0.816 (0.505) | -0.147 (0.967) |
| Training / test size | 42,639 / 28,427 | 37,263 / 24,843 | 22,508 / 15,006 | 12,919 / 8,613 | 25,932 / 17,289 |
The size is measured in the number of available atomic shifts.
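The two figures of merit in this table can be computed from paired lists of predicted and experimental shifts. The sketch below assumes that the correlation coefficient is Pearson's r, which the text does not state explicitly; the function names and the toy shift values are hypothetical and are not data from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def rmse(predicted, observed):
    """Root mean square error between predicted and observed values."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(predicted))


if __name__ == "__main__":
    # Made-up amide proton shifts in ppm, for illustration only.
    experimental = [8.1, 8.4, 7.9, 8.6]
    predicted = [8.0, 8.5, 8.0, 8.4]
    print("r    =", round(pearson_r(predicted, experimental), 3))
    print("rmse =", round(rmse(predicted, experimental), 3))
```

The two measures answer different questions: the correlation is insensitive to a constant offset in the predictions, while the rmse penalizes it. This is one reason to report both, as the table does.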
Figure 1. Our NightShift pipeline for data set generation and training of NMR chemical shift prediction models.
Number of shifts and features of the data set per atom super class, evaluated on the ‘raw’ BMRB data set
| | N | CA | CB | C | H | HA | HB | HD | HEHZ | HG |
| Orig num shifts | 65,440 | 66,870 | 60,761 | 48,686 | 68,496 | 71,243 | 62,116 | 37,523 | 21,548 | 43,227 |
| Orig num features | 111 | 111 | 111 | 111 | 111 | 111 | 111 | 111 | 111 | 111 |
| Final num shifts train | 39,147 | 39,947 | 36,211 | 29,065 | 41,076 | 42,639 | 37,263 | 22,508 | 12,919 | 25,932 |
| Final num shifts test | 26,099 | 26,632 | 24,142 | 19,377 | 27,385 | 28,427 | 24,843 | 15,006 | 8,613 | 17,289 |
| Final num features | 44 | 40 | 39 | 44 | 44 | 44 | 38 | 43 | 38 | 38 |
The maximum sequence identity between proteins in the test and training data sets was below 10%. The first two lines show the numbers for the raw data set before applying the training procedure, lines three and four show the number of shifts in the final training and test sets, and the last line shows the number of features used by the models (numbers for the RefDB are similar).
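The arithmetic behind this table can be checked with a few lines of code. The sketch below assumes that the ten columns follow the order N, CA, CB, C, H, HA, HB, HD, HEHZ, HG. This is an inference: with that order, the training and test counts add up to the per-class totals listed in the atom super class table. The function name `summarize` is hypothetical.

```python
# Assumed column order (inferred, see the paragraph above).
CLASSES = ["N", "CA", "CB", "C", "H", "HA", "HB", "HD", "HEHZ", "HG"]

# Values copied from the table rows.
ORIG = [65440, 66870, 60761, 48686, 68496, 71243, 62116, 37523, 21548, 43227]
TRAIN = [39147, 39947, 36211, 29065, 41076, 42639, 37263, 22508, 12919, 25932]
TEST = [26099, 26632, 24142, 19377, 27385, 28427, 24843, 15006, 8613, 17289]


def summarize():
    """Per class: shifts kept, shifts dropped, and training fraction."""
    rows = []
    for name, orig, train, test in zip(CLASSES, ORIG, TRAIN, TEST):
        kept = train + test
        rows.append({
            "class": name,
            "kept": kept,
            "dropped": orig - kept,           # removed from the raw set
            "train_fraction": train / kept,   # share used for training
        })
    return rows


if __name__ == "__main__":
    for row in summarize():
        print(f"{row['class']:5s} kept={row['kept']:6d} "
              f"dropped={row['dropped']:4d} "
              f"train_fraction={row['train_fraction']:.3f}")
```

Under that column assumption, every class comes out at a training fraction of about 0.600, i.e. a 60/40 split between training and test shifts, and only a few hundred shifts or fewer per class are dropped from the raw set.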
Performance of models (correlation coefficients and rmse) in comparison to ShiftX2 using a RefDB-based test set created by our pipeline
| Prediction method | N correlation (rmse) | CA correlation (rmse) | CB correlation (rmse) | C correlation (rmse) | H correlation (rmse) |
| Spinster-Ref | 0.847 (2.426) | 0.97 (1.161) | 0.996 (1.093) | 0.794 (1.146) | 0.563 (0.393) |
| ShiftX2 | 0.844 (2.53) | 0.972 (1.122) | 0.995 (1.133) | 0.817 (1.114) | 0.578 (0.44) |
| Training / test size | 20,038 / 13,359 | 17,953 / 11,969 | 28,742 / 19,162 | 12,844 / 8,563 | 37,912 / 25,276 |
| Prediction method | HA correlation (rmse) | HB correlation (rmse) | HD correlation (rmse) | HEHZ correlation (rmse) | HG correlation (rmse) |
| Spinster-Ref | 0.998 (2.218) | 0.963 (0.232) | 0.997 (0.212) | 0.993 (0.24) | 0.94 (0.192) |
| ShiftX2 | 0.998 (2.314) | 0.964 (0.227) | 0.995 (0.227) | 0.992 (0.244) | 0.903 (0.199) |
| Training / test size | 20,915 / 13,944 | 34,521 / 23,015 | 15,530 / 10,354 | 7,962 / 5,309 | 21,264 / 14,177 |
The size is measured in the number of available atomic shifts.
Figure 2. NightShift workflow via the web interface.