Wiebke Timm, Alexandra Scherbart, Sebastian Böcker, Oliver Kohlbacher, Tim W Nattkemper.
Abstract
BACKGROUND: Mass spectrometry is a key technique in proteomics and can be used to analyze complex samples quickly. One key problem with the mass spectrometric analysis of peptides and proteins, however, is the fact that absolute quantification is severely hampered by the unclear relationship between the observed peak intensity and the peptide concentration in the sample. While there are numerous approaches to circumvent this problem experimentally (e.g. labeling techniques), reliable prediction of the peak intensities from peptide sequences could provide a peptide-specific correction factor. Thus, it would be a valuable tool towards label-free absolute quantification.
Year: 2008 PMID: 18937839 PMCID: PMC2600826 DOI: 10.1186/1471-2105-9-443
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Features constituting the sss feature set
| Feature | Description | Selected |
|---|---|---|
| GB500 | Estimated gas-phase basicity at 500 K (Zhang | 20 |
| VASM830103 | Relative population of conformational state E (Vasquez | 11 |
| NADH010106 | Hydropathy scale (36% accessibility) (Naderi-Manesh | 9 |
| FAUJ880111 | Positive charge (Fauchere | 6 |
| WILM950102 | Hydrophobicity coefficient in RP-HPLC, C8 with 0.1%TFA/MeCN/H2O (Wilce | 6 |
| OOBM850104 | Optimized average non-bonded energy per atom (Oobatake | 2 |
| | Molecular mass of the peptide | - |
| | The Kerr-constant increments (Khanarian-Moore, 1980) | - |
| | Hydropathy scale (50% accessibility) (Naderi-Manesh | - |
| | Information measure for extended without H-bond (Robson-Suzuki, 1976) | - |
| | Helix-coil equilibrium constant (Finkelstein-Ptitsyn, 1977) | - |
| | Signal sequence helical potential (Argos | - |
| | | |
| R | No. of arginine residues | 20 |
| F | No. of phenylalanine residues | 20 |
| M | No. of methionine residues | 17 |
| Q | No. of glutamine residues | 5 |
| Y | No. of tyrosine residues | 4 |
| | No. of histidine residues | - |
The "selected" column gives the number of runs, out of twenty runs of forward stepwise selection, in which the corresponding feature was selected. Hand-picked features are printed in bold face. Feature selection on the aa (above the separating line) and seq (below) feature sets was done independently. The seq feature set fully includes mono. No di- or tri-peptide string was selected consistently.
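The residue-count features listed in the lower part of the table can be illustrated with a minimal sketch. The function name `residue_counts` and the example peptide are hypothetical; the one-letter codes follow the table rows, with `H` (the standard code for histidine) assumed for the row whose label is missing here.

```python
from collections import Counter

# Hypothetical helper (not from the paper): counts selected residues
# in a peptide given as a string of one-letter amino acid codes.
def residue_counts(peptide, residues="RFMQYH"):
    """Return {residue: number of occurrences} for the listed residues."""
    counts = Counter(peptide.upper())
    return {aa: counts.get(aa, 0) for aa in residues}

# Made-up example peptide, for illustration only.
features = residue_counts("MFRQHYRK")
```

Each peptide is thereby mapped to a fixed-length vector of counts, which is the form a regression model would need as input.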
Figure 1. Within-peptide variances of target values. Scatter plots and correlation coefficients depicting the within-peptide peak intensity variance between runs for all peptides of both datasets (left: dataset A, right: dataset B). The recorded correlations can be considered as upper bounds of the achievable prediction performance if single measurements are used. The corresponding plots with trimmed mean values can be found in additional file 7: tmbetweenpeptidecorrelation.
Overview of Pearson's correlation coefficients using mic normalization
| 10-fold CV | A | 0.66 | 0.52 | 0.60 | |
| 0.66 | 0.67 | 0.51 | |||
| 0.64 | 0.52 | ||||
| 0.57 | 0.34 | 0.34 | |||
| B | 0.53 | 0.46 | 0.49 | ||
| 0.53 | 0.49 | ||||
| 0.47 | 0.53 | 0.48 | |||
| 0.44 | 0.27 | 0.41 | |||
| across datasets | A | 0.65 | 0.52 | 0.14 | |
| 0.63 | 0.59 | 0.47 | |||
| 0.57 | 0.45 | ||||
| 0.46 | 0.21 | 0.40 | |||
| B | 0.45 | 0.24 | 0.01 | ||
| 0.44 | 0.39 | ||||
| 0.45 | 0.39 | 0.32 | |||
| 0.32 | 0.05 | 0.28 | |||
| across datasets without duplicates | A | 0.58 | 0.47 | 0.00 | |
| 0.58 | 0.55 | 0.41 | |||
| 0.52 | 0.39 | ||||
| 0.37 | 0.21 | 0.22 | |||
| B | 0.44 | 0.42 | 0.00 | |
| 0.46 | 0.40 | ||||
| 0.46 | 0.44 | 0.32 | |||
| 0.32 | 0.00 | 0.03 | |||
Values "0.00" indicate that the correlation coefficient was in the range (-0.005,0.005). The best value in each section is printed in bold face.
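As a reminder of how the reported quantity is computed, the following is a minimal sketch of Pearson's correlation coefficient together with the "0.00" display convention described in the note above. The function names `pearson_r` and `format_r` are assumptions made for illustration, not part of the original work.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)

def format_r(r):
    """Two-decimal display; values in (-0.005, 0.005) are shown as 0.00."""
    return "0.00" if abs(r) < 0.005 else f"{r:.2f}"
```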
Summary table for additional scatter plots
| 10-fold CV | A | additional file | additional file | |
| additional file | additional file | |||
| B | additional file | additional file | ||
| additional file | additional file | |||
| across datasets | A | additional file | additional file | |
| additional file | additional file | |||
| B | additional file | additional file | ||
| additional file | additional file | |||
This table summarizes the scatter plots available as additional files and indicates in which file a given plot can be found. The LLM scatter plots for the seq feature set, as well as all LM scatter plots, have not been included because the results (correlation values) are poor. "DS" abbreviates "dataset".
Figure 2. Scatter plot of target vs. predicted values. Prediction results for dataset A with the ν-SVR indicate that peak intensity prediction is feasible. Left: Cross-validation on dataset A. Right: Prediction using a model parameter-tuned on dataset B. r denotes Pearson's correlation between target and predicted values. Plots for dataset B and the other feature sets are shown in additional files. A summary of all additional files showing scatter plots is presented in Table 4.
Figure 3. Prediction results with randomly shuffled sequences. When assigning randomly shuffled sequences to the target values of dataset A, prediction by ν-SVR shows no correlation in 10-fold cross-validation. This indicates that we are picking up the true signal, i.e. the predicted values are correlated to the peptide sequence.
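The shuffling control described in this caption can be sketched as follows. The caption does not state whether sequences are reassigned among peptides or whether residues are shuffled within each sequence; this sketch assumes the former. The function name, the seed parameter, and the example data are hypothetical.

```python
import random

def shuffle_assignment(sequences, targets, seed=0):
    """Pair each target value with a sequence drawn at random (without
    replacement) from the same set, breaking the sequence/target link."""
    rng = random.Random(seed)
    shuffled = list(sequences)  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    return list(zip(shuffled, targets))

# Made-up peptides and target values, for illustration only.
pairs = shuffle_assignment(["AAK", "GFR", "DSK", "MHR"], [3.1, 4.2, 2.7, 3.9])
```

A model trained and cross-validated on such pairs should show no correlation between predicted and target values if the original signal really depends on the sequence.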
Figure 4. Analysis of absolute prediction error. Plot of target value vs. prediction error. Data were pooled into 20 bins according to their target values. For each bin, the mean absolute prediction error is plotted on the left y-axis, and the number of values falling into the bin is shown with squares on the right y-axis. The lowest error is achieved for intermediate target values; the highest error occurs for low ones. The absolute error is not correlated with the number of values per bin. Thus, intensities within a certain range are more difficult to predict than others.
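The binning behind this plot can be sketched in a few lines. The caption does not say how the bin edges were chosen; equal-width bins over the range of target values are an assumption here, and the function name `binned_abs_error` is hypothetical.

```python
def binned_abs_error(targets, predictions, n_bins=20):
    """Return (mean absolute error per bin, number of values per bin).

    Bins are equal-width intervals over the range of the target values
    (an assumption); empty bins get a mean of None.
    """
    lo, hi = min(targets), max(targets)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width if all targets are equal
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for t, p in zip(targets, predictions):
        idx = min(int((t - lo) / width), n_bins - 1)  # maximum goes into last bin
        sums[idx] += abs(t - p)
        counts[idx] += 1
    means = [s / c if c else None for s, c in zip(sums, counts)]
    return means, counts
```

Returning the counts alongside the means makes it possible to check, as the caption does, whether the error simply tracks the number of values per bin.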
Figure 5. Feature importance. Plot of percentage increase of the prediction error if the corresponding feature is randomly permuted, using random forests for regression [42]. Of all features in the sss feature set, the relative population of conformational state E (VASM830103, [38]), the estimated gas-phase basicity (GB500, [36]), and the theoretical mass lead to the highest increase of the error if the peptide's values are permuted. The number of positive charges (FAUJ880111, [41]) and the number of glutamine residues (Q) are rated the least important features.
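The idea of permutation-based importance can be illustrated with a generic sketch. This is not the random forest procedure of [42]: it works with any prediction function and uses the mean squared error on the supplied data, both of which are assumptions made for illustration. All names are hypothetical.

```python
import random

def mse(predict, rows, y):
    """Mean squared error of predict(row) against the targets y."""
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, y)) / len(y)

def permutation_importance(predict, rows, y, col, seed=0):
    """Percentage increase of the MSE after randomly permuting column `col`.

    `rows` is a list of feature lists; `predict` maps one row to a number.
    """
    rng = random.Random(seed)
    base = mse(predict, rows, y)
    if base == 0:
        raise ValueError("baseline error is zero; percentage increase undefined")
    values = [r[col] for r in rows]
    rng.shuffle(values)
    permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, values)]
    return 100.0 * (mse(predict, permuted, y) - base) / base
```

A feature the model ignores yields an increase of exactly zero, while permuting a feature the model relies on typically raises the error.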
Two-sample t-test results
| H | 2.25e-05 | 3.20 | 3.90 | 338 | 77 |
| VKe | 4.64e-05 | 3.36 | 2.31 | 403 | 12 |
| VK | 8.92e-05 | 3.36 | 2.39 | 402 | 13 |
| VF | 1.25e-04 | 3.29 | 4.65 | 403 | 12 |
| Y | 2.11e-04 | 3.18 | 3.71 | 296 | 119 |
| F | 2.99e-04 | 3.15 | 3.67 | 272 | 143 |
| GF | 5.83e-04 | 3.29 | 4.48 | 400 | 15 |
| Q | 8.97e-04 | 3.16 | 3.61 | 260 | 155 |
| TKe | 0.001 | 3.36 | 2.45 | 401 | 14 |
| SV | 0.001 | 3.28 | 4.15 | 393 | 22 |
| TK | 0.003 | 3.36 | 2.53 | 400 | 15 |
| GK | 0.009 | 3.37 | 2.76 | 390 | 25 |
| PR | 0.009 | 3.30 | 4.51 | 403 | 12 |
| PRe | 0.009 | 3.30 | 4.51 | 403 | 12 |
| DS | 0.009 | 3.29 | 4.10 | 395 | 20 |
| DK | 3.80e-07 | 4.38 | 3.35 | 1112 | 22 |
| DKe | 1.18e-06 | 4.37 | 3.38 | 1113 | 21 |
| GM | 1.67e-05 | 4.38 | 3.23 | 1112 | 22 |
| AKe | 2.27e-05 | 4.39 | 3.64 | 1085 | 49 |
| NKe | 5.82e-05 | 4.38 | 3.42 | 1111 | 23 |
| GRe | 9.18e-05 | 4.31 | 4.93 | 1054 | 80 |
| QRe | 1.37e-04 | 4.33 | 5.10 | 1100 | 34 |
| W | 2.63e-04 | 4.40 | 3.93 | 1034 | 100 |
| AK | 2.75e-04 | 4.39 | 3.73 | 1083 | 51 |
| NK | 3.03e-04 | 4.37 | 3.51 | 1110 | 24 |
| GR | 6.16e-04 | 4.32 | 4.86 | 1051 | 83 |
| QR | 7.46e-04 | 4.34 | 5.01 | 1098 | 36 |
| FRe | 0.001 | 4.34 | 5.28 | 1111 | 23 |
| IK | 0.001 | 4.37 | 3.47 | 1113 | 21 |
| IKe | 0.001 | 4.37 | 3.47 | 1113 | 21 |
| GK | 0.002 | 4.37 | 3.80 | 1101 | 33 |
| DT | 0.003 | 4.38 | 3.80 | 1083 | 51 |
| AM | 0.003 | 4.38 | 3.40 | 1108 | 26 |
| S | 0.004 | 4.47 | 4.25 | 538 | 596 |
| P | 0.006 | 4.25 | 4.46 | 579 | 555 |
| TKe | 0.008 | 4.37 | 3.76 | 1110 | 24 |
| VRe | 0.008 | 4.33 | 4.77 | 1069 | 65 |
Results of two-sample t-tests of set s+ (normalized intensities of peptides containing a substring s) against the set s- of those not containing it in the corresponding dataset (A or B). Only substrings that occur in more than 10 (A)/20 (B) peptides and with a p-value ≤ 0.001 are shown. An "a" as a prefix denotes that the substring is located at the beginning of the string, "e" as suffix means it is located at the end of the string. Otherwise, the substring can occur anywhere in the peptide (including terminal positions). Rows in bold face mark substrings that are present in the lists of both datasets.
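The comparison described in this caption can be sketched as follows. The caption does not state which variant of the two-sample t-test was used; Welch's unequal-variance statistic is an assumption here, and the p-value is omitted because it requires the t distribution. The function names are hypothetical. The label convention (lowercase "a" prefix, lowercase "e" suffix) follows the caption.

```python
import math

def contains(peptide, label):
    """True if the peptide matches a substring label such as 'VK', 'VKe' or 'aVK'."""
    if label.endswith("e"):          # substring at the end of the peptide
        return peptide.endswith(label[:-1])
    if label.startswith("a"):        # substring at the beginning of the peptide
        return peptide.startswith(label[1:])
    return label in peptide          # anywhere, including terminal positions

def split_by_substring(peptides, intensities, label):
    """Return (s_plus, s_minus): intensities of peptides with / without the substring."""
    s_plus = [y for p, y in zip(peptides, intensities) if contains(p, label)]
    s_minus = [y for p, y in zip(peptides, intensities) if not contains(p, label)]
    return s_plus, s_minus

def welch_t(a, b):
    """Welch's t statistic for two samples, each with at least two values."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)
```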