Ralph Patrick, Kim-Anh Lê Cao, Melissa Davis, Bostjan Kobe, Mikael Bodén.
Abstract
BACKGROUND: The half-life of a protein is regulated by a range of system properties, including the abundance of components of the degradative machinery and protein modifiers. It is also influenced by protein-specific properties, such as a protein's structural make-up and interaction partners. New experimental techniques coupled with powerful data integration methods now enable us to not only investigate what features govern protein stability in general, but also to build models that identify what properties determine each protein's metabolic stability.
Year: 2012 PMID: 22682214 PMCID: PMC3439251 DOI: 10.1186/1752-0509-6-60
Source DB: PubMed Journal: BMC Syst Biol ISSN: 1752-0509
Comparison of protein degradation data sets
| | | |
|---|---|---|
| GPSP | 0.976 | -0.002305 |
| GPSP | 0.7525 | -0.003241 |
| GPSP | 0.4642 | -0.008504 |
| SILAC | 0.3826 | -0.01887 |
Comparisons were made between the Global Protein Stability Profiling (GPSP) method and the Stable Isotope Labeling by Amino acids in Cell culture (SILAC) method. This comparison was repeated with transmembrane and secreted proteins removed. The bleach-chase data were also compared with GPSP and SILAC. For every comparison, a Spearman rank correlation was computed and a linear model was fitted.
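The two statistics named in this caption can be sketched with the standard library alone. This is a minimal illustration, not the authors' analysis code: the function names and the sample values are hypothetical, and Spearman's correlation is computed here as the Pearson correlation of average ranks.

```python
from statistics import mean

def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def linear_fit(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

# Hypothetical paired measurements for the same five proteins.
gpsp_values = [1.2, 2.5, 3.1, 4.8, 5.0]
silac_values = [10.0, 30.0, 20.0, 80.0, 95.0]
rho = spearman(gpsp_values, silac_values)
slope, intercept = linear_fit(gpsp_values, silac_values)
```

Working on ranks makes the correlation insensitive to the different scales of the two assays, while the fitted slope keeps the original units.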
Figure 1. Heat-map and cluster analysis generated from the output of the Yen experiment [1]. A Euclidean distance metric was used for the cluster analysis. Bright (yellow) colouring in the heat-map represents a value approaching 1, with values approaching 0 having a dull (red) colouring. Labels on the right show how the data can naturally be broken up into groups of stable and unstable proteins, with the remainder being classed as “non-assigned”.
Feature analysis of proteins in unstable class
| Feature | E-value |
|---|---|
| Transmembrane | 3.75e-96 ‡ |
| Signal Peptide | 9.15e-50 ‡ |
| Loops/Coils | 1.36e-06 |
| Immunoglobulin | 1.77e-04 |
| Immunoglobulin Like | 2.01e-03 ‡ |
| Immunoglobulin C | 4.286e-03 ‡ |
| Zinc Finger C2 | 1.73e-02 |
| Cadherin | 2.39e-02 ‡ |
| | |
| Phosphorylation | 2.29e-15 |
| Acetylation | 6.15e-13 |
| Glycosylation | 9.73e-04 ‡ |
| | |
| Hot loops: 0-20% | 8.77e-03 ‡ |
| Loops/Coils: 20-40% | 1.06e-03 |
| Hot loops: 20-40% | 9.11e-06 |
| Rem465: 20-40% | 3.41e-05 |
| Destabilizing N-terminal residue | 4.86e-80 ‡ |
Feature types found to be significantly (Fisher’s Exact test, E-value < 0.05) over-/under-represented in the unstable protein class when compared to stable and non-assigned proteins. E-values for over-represented features are indicated with the symbol ‡, otherwise the E-value represents an under-representation. Features include domain/architecture types, PTMs, and disorder classifications based on three disorder types (loops/coils, hotloops and rem465) and percentage of sequence containing the disorder type. Lastly, the presence of a destabilizing N-terminal residue is shown, defined as the N-terminal residue of a mature protein being one of R, K, H, F, L, W, I or Y. These results are from the full data set (without removal of transmembrane or secreted proteins).
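The over-/under-representation test described in this caption can be illustrated with a one-sided Fisher's exact test on a 2×2 table. The sketch below is an assumption about how such a test might be set up, not the authors' code: the function names and the example counts are hypothetical, and it stops at raw p-values because the correction used to obtain the reported E-values is not described here.

```python
from math import comb

def hypergeom_pmf(k, n_with, n_without, n_drawn):
    """P(exactly k feature-carrying proteins among n_drawn sampled proteins)."""
    return (comb(n_with, k) * comb(n_without, n_drawn - k)
            / comb(n_with + n_without, n_drawn))

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact test on a 2x2 contingency table.

    a: proteins in the class of interest that have the feature
    b: proteins in the class of interest that lack the feature
    c: proteins outside the class that have the feature
    d: proteins outside the class that lack the feature
    Returns (p_over, p_under): the probability of seeing at least `a`,
    and at most `a`, feature-carrying proteins in the class by chance.
    """
    n_with, n_without, n_drawn = a + c, b + d, a + b
    lowest = max(0, n_drawn - n_without)
    highest = min(n_with, n_drawn)
    p_over = sum(hypergeom_pmf(k, n_with, n_without, n_drawn)
                 for k in range(a, highest + 1))
    p_under = sum(hypergeom_pmf(k, n_with, n_without, n_drawn)
                  for k in range(lowest, a + 1))
    return p_over, p_under

# Hypothetical counts: 3 of 4 class members carry the feature,
# compared with 1 of 4 proteins outside the class.
p_over, p_under = fisher_one_sided(3, 1, 1, 3)
```

A small `p_over` suggests the feature is over-represented in the class (the rows marked ‡), and a small `p_under` suggests under-representation.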
Feature analysis of proteins in stable class
| Feature | E-value |
|---|---|
| Transmembrane | 3.99e-22 |
| Signal Peptide | 3.39e-17 |
| RNA Recognition Motif | 1.20e-03 ‡ |
| | |
| Acetylation | 4.11e-28 ‡ |
| Phosphorylation | 1.56e-21 ‡ |
| Glycosylation | 5.65e-03 |
| | |
| Hot loops: 0-20% | 5.72e-03 ‡ |
| Coils: 80-100% | 1.56e-02 ‡ |
| Hot loops: 80-100% | 4.72e-03 ‡ |
| Destabilizing N-terminal residue | 5.11e-32 |
Feature types found to be significantly (Fisher’s Exact test, E-value < 0.05) over-/under-represented in the stable class when compared to unstable and non-assigned proteins. E-values for over-represented features are indicated with the symbol ‡, otherwise the E-value represents an under-representation. Features include domain/architecture types, PTMs, and disorder classifications based on three disorder types (“loops/coils”, “hotloops” and “rem465” as defined by Linding and colleagues [26]) and percentage of sequence containing the disorder type. Lastly, the presence of a destabilizing N-terminal residue is shown, defined as the N-terminal residue of a mature protein being one of R, K, H, F, L, W, I or Y. These results are from the full data set (without removal of transmembrane or secreted proteins).
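Two of the features defined in this caption are simple enough to sketch directly. The residue set comes from the caption; everything else is an assumption for illustration: the function names are hypothetical, the mature sequence is taken as given (deriving it is outside this sketch), and the handling of percentages that fall exactly on a bin boundary is a guess.

```python
# Destabilizing N-terminal residues as listed in the caption.
DESTABILIZING_N_TERMINAL = frozenset("RKHFLWIY")

def has_destabilizing_n_terminus(mature_sequence):
    """True if the first residue of the mature protein is destabilizing."""
    return (bool(mature_sequence)
            and mature_sequence[0].upper() in DESTABILIZING_N_TERMINAL)

def disorder_bin(disorder_flags):
    """Map per-residue disorder flags to a 20%-wide bin label.

    disorder_flags: one boolean per residue, True where the residue is
    predicted disordered for a given disorder type. A boundary value such
    as exactly 20% is placed in the upper bin here (an assumption).
    """
    pct = 100.0 * sum(disorder_flags) / len(disorder_flags)
    lower = min(int(pct // 20) * 20, 80)  # 100% stays in the 80-100% bin
    return f"{lower}-{lower + 20}%"

# Hypothetical protein: 3 of 10 residues predicted disordered.
label = disorder_bin([True] * 3 + [False] * 7)  # "20-40%"
```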
Figure 2. Graph representing the BN+SVM model. The types of model parameters are indicated by conditional probability tables (CPT), noisy-OR (conditional probability table with a noisy-OR assumption) and Gaussian density tables (GDT) representing continuous values. For the sake of clarity, two continuous nodes that are children of the tyrosine and serine/threonine phosphorylation PTM nodes are not included in the graph. These continuous nodes contain the PWM scores for the sequence. The BN model is identical, but with “Sequence SVM” removed. The SVM model contained only the “Stability” node with the “Sequence SVM” node as a child.
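For readers unfamiliar with the noisy-OR assumption mentioned in this caption, the standard textbook form can be sketched as follows. This is not the authors' parameterisation: the function name, the optional leak term and the probabilities are hypothetical.

```python
def noisy_or(active_parent_probs, leak=0.0):
    """Probability that a binary child node is on under a noisy-OR model.

    active_parent_probs: for each parent that is currently on, the
    probability that this parent alone would switch the child on.
    leak: probability that the child is on when no parent is on.
    The child stays off only if the leak and every active parent
    independently fail to switch it on.
    """
    p_off = 1.0 - leak
    for p in active_parent_probs:
        p_off *= 1.0 - p
    return 1.0 - p_off

# Two active parents, each with a hypothetical 0.5 activation probability.
p_child_on = noisy_or([0.5, 0.5])  # 1 - 0.5 * 0.5 = 0.75
```

The appeal of this assumption is that a node with n binary parents needs only n parameters (plus an optional leak) rather than a full table with 2^n rows.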
Figure 3. Comparison of true positive and false positive rates for protein stability prediction models. Receiver operating characteristic (ROC) curves for the BN, SVM and BN+SVM models were calculated using 10-fold cross-validation. We evaluated the three models on a total of 743 unstable proteins and 794 stable proteins.
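An ROC curve of the kind shown in this figure can be built by sweeping a decision threshold over the model scores. The sketch below is illustrative only: the function names and data are hypothetical, the choice of which class counts as positive (label 1) is an assumption, and the cross-validation loop that would produce held-out scores is omitted.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points.

    scores: model outputs, higher meaning more likely positive.
    labels: 1 for the positive class, 0 for the negative class.
    The threshold is lowered one distinct score at a time, so tied
    scores enter the curve together.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical held-out scores for four proteins.
curve = roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
area = auc(curve)  # 0.75
```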
Summary of performance metrics for each of the trained models
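| Model | AUC (μ) | AUC (σ) | F-score (μ) | F-score (σ) | MCC (μ) | MCC (σ) | Sensitivity (μ) | Sensitivity (σ) | Specificity (μ) | Specificity (σ) |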
|---|---|---|---|---|---|---|---|---|---|---|
| BN+SVM(1) | 0.85 | 0.0026 | 0.8 | 0.002 | 0.58 | 0.01 | 0.9 | 0.031 | 0.67 | 0.045 |
| BN(1) | 0.84 | 0.0012 | 0.8 | 0.0002 | 0.58 | 0.0007 | 0.9 | 0.005 | 0.66 | 0.0076 |
| SVM(1) | 0.74 | 0.0043 | 0.73 | 0.004 | 0.42 | 0.014 | 0.83 | 0.027 | 0.58 | 0.044 |
| BN+SVM(2) | 0.75 | 0.0052 | 0.77 | 0.009 | 0.39 | 0.035 | 0.88 | 0.032 | 0.46 | 0.072 |
| BN(2) | 0.66 | 0.0046 | 0.73 | 0.001 | 0.29 | 0.07 | 0.82 | 0.08 | 0.45 | 0.18 |
| SVM(2) | 0.73 | 0.0041 | 0.76 | 0.004 | 0.35 | 0.03 | 0.85 | 0.039 | 0.46 | 0.08 |
Performance metrics for each of the models trained on the full data set and on the trimmed data set. For example, BN+SVM(1) refers to the BN+SVM model trained on the full data set, while BN+SVM(2) refers to the same model trained on the trimmed data set. For each model we report the area under the curve (AUC) of a receiver operating characteristic analysis and the maximum F-score. The threshold giving the maximum F-score was also used to calculate the Matthews correlation coefficient (MCC), sensitivity and specificity. For each metric the mean (μ) and standard deviation (σ) are shown for five cross-validation runs.
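The procedure in this caption (find the threshold with the maximum F-score, then report MCC, sensitivity and specificity at that threshold) can be sketched as below. The function names and example data are hypothetical, and the convention that a score at or above the threshold counts as a positive prediction is an assumption.

```python
from math import sqrt

def confusion(scores, labels, threshold):
    """Count (tp, fp, tn, fn), predicting positive when score >= threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def f_score(tp, fp, fn):
    """F1: harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when it is undefined."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def metrics_at_max_f(scores, labels):
    """Scan every observed score as a threshold and keep the best F-score."""
    best = None
    for threshold in sorted(set(scores)):
        tp, fp, tn, fn = confusion(scores, labels, threshold)
        f = f_score(tp, fp, fn)
        if best is None or f > best["f_score"]:
            best = {
                "threshold": threshold,
                "f_score": f,
                "mcc": mcc(tp, fp, tn, fn),
                "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
                "specificity": tn / (tn + fp) if tn + fp else 0.0,
            }
    return best

# Hypothetical scores and labels for five proteins.
result = metrics_at_max_f([0.9, 0.8, 0.7, 0.6, 0.2], [1, 1, 0, 1, 0])
```

Repeating this over several cross-validation runs and taking the mean and standard deviation of each metric would give a table of the same shape as the one above.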
Figure 4. Comparison with the IFS and NN classifier. The performance of the BN, SVM and BN+SVM models was compared against the IFS plus NN method of Huang and colleagues [7] through calculation of ROC curves. The models were evaluated on a subset of 250 genes overlapping between our stable/unstable classes and the extra long/short classes as defined by Huang and colleagues [7].
Figure 5. Comparison of protein stability prediction models trained on a trimmed data set. The performance of the BN, SVM and BN+SVM models is shown for a trimmed data set with secreted and transmembrane proteins removed. Because this set contains fewer samples (300 stable proteins and 227 unstable proteins), 25-fold cross-validation was used to calculate the ROC curves.
Breakdown of numbers of proteins across stability groups for experimental and predicted classes
| | Stable | Non-assigned | Unstable |
|---|---|---|---|
| Training/testing data | 795 | 2442 | 743 |
| Predictions 1 (P1) | 6974 | 10683 | 6357 |
| Predictions 2 (P2) | 3319 | 17475 | 3220 |
The first row shows the number of proteins in each of the stable, unstable and non-assigned groups that were allotted based on the cluster analysis of the GPSP data [1]. This data was employed in all of the statistical analyses and the proteins in the stable and unstable classes were used for training and testing the models. The second and third rows show how the predictions made by the BN+SVM model trained on the full data set (P1) and the predictions made by the BN+SVM model trained on the trimmed data set (P2) can be assigned to stability classes when using thresholds to define predicted “stable” and “unstable” proteins. For P1, a stable protein was defined as scoring above 0.75 and an unstable protein had to score below 0.2. For P2, a stable protein was required to score above 0.7 and an unstable protein to score below 0.3. For both P1 and P2, if a protein scored above the “unstable” threshold and below the “stable” threshold it was classified as “non-assigned”. These thresholds were set such that the number of proteins predicted to fall into the three different stability groups reflected the numbers from the cluster analysis on the GPSP data.
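The thresholding rule in this caption can be written out directly. The threshold values are those stated in the caption; the function name, the dictionary layout and the example scores are hypothetical, and assigning a score that falls exactly on a threshold to “non-assigned” is an assumption.

```python
from collections import Counter

# Thresholds stated in the caption for the two sets of predictions.
P1_THRESHOLDS = {"stable_above": 0.75, "unstable_below": 0.2}
P2_THRESHOLDS = {"stable_above": 0.7, "unstable_below": 0.3}

def assign_class(score, stable_above, unstable_below):
    """Map a model score to a predicted stability class."""
    if score > stable_above:
        return "stable"
    if score < unstable_below:
        return "unstable"
    return "non-assigned"

# Hypothetical model scores for five proteins.
scores = [0.91, 0.72, 0.45, 0.25, 0.05]
p1_counts = Counter(assign_class(s, **P1_THRESHOLDS) for s in scores)
p2_counts = Counter(assign_class(s, **P2_THRESHOLDS) for s in scores)
```

With these example scores, the wider non-assigned band of P1 places 0.72 and 0.25 in the middle group, whereas the P2 thresholds call them stable and unstable respectively.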