Guan Ning Lin, Zheng Wang, Dong Xu, Jianlin Cheng.
Abstract
BACKGROUND: Protein folding rate is an important property of a protein. Predicting it is useful for understanding the protein folding process and for guiding protein design. Most previous methods for predicting protein folding rate require the tertiary structure of a protein as input, and most do not distinguish the kinetic nature (two-state or multi-state folding) of the proteins. Here we developed a method, SeqRate, that uses support vector machines to predict both the folding kinetic type (two-state versus multi-state) and the real-value folding rate from sequence length, amino acid composition, contact order, contact number, and secondary structure information, all derived or predicted from the protein sequence alone.
Year: 2010 PMID: 20438647 PMCID: PMC2863059 DOI: 10.1186/1471-2105-11-S3-S1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
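Two of the inputs named in the abstract, sequence length and amino acid composition, can be computed directly from a sequence. The following is a minimal sketch, not the authors' implementation: the function name `sequence_features`, the feature ordering, and the toy sequence are assumptions made for illustration.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_features(sequence: str) -> list[float]:
    """Return [length, 20 composition fractions] for a protein sequence.

    Hypothetical helper: the feature layout is an assumption, not SeqRate's.
    """
    seq = sequence.upper()
    counts = Counter(seq)
    # Normalize by standard residues only, so ambiguous codes (e.g. X)
    # do not distort the fractions.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    composition = [counts[aa] / total for aa in AMINO_ACIDS]
    return [float(len(seq))] + composition


# Hypothetical toy sequence, not taken from IvankovData.
features = sequence_features("MKVLAAGIVG")
```

A vector like this could serve as part of the input to a support vector machine, alongside the predicted contact and secondary structure features.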
Correlation between estimated and real contacts and experimental folding rates
| | estLRCN | rLRCN | estLRCO | rLRCO | rate |
|---|---|---|---|---|---|
| estLRCN | 1 (1) | 0.95 (0.84) | 0.61 (0.54) | | |
| rLRCN | - | 1 (1) | 0.79 (0.75) | 0.87 (0.81) | |
| estLRCO | - | - | 1 (1) | | |
| rLRCO | - | - | - | 1 (1) | |
| rate | - | - | - | - | 1 (1) |
Correlation between folding rates and estimated long-range contact number (estLRCN), estimated long-range contact order (estLRCO), real long-range contact number (rLRCN) and real long-range contact order (rLRCO) using 37 two-state proteins in IvankovData and 24 multi-state proteins in IvankovData (data shown in the parentheses).
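Each entry in a correlation table of this kind is a pairwise correlation between two series measured over the same proteins. The sketch below assumes the Pearson coefficient, which the text does not state explicitly, and uses made-up numbers rather than values from IvankovData.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy values standing in for estimated and real contact numbers.
est_lrcn = [1.0, 2.0, 3.0, 4.0]
real_lrcn = [2.0, 4.0, 6.0, 8.0]
r = pearson(est_lrcn, real_lrcn)
```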
Correlation between predicted folding rates and experimental folding rates using sequence length and other estimated predictors on IvankovData.
| | L | LRCO | CO | LRCN | Coil content | | |
|---|---|---|---|---|---|---|---|
| Two-state folding rate | -0.32 | 0.72 | 0.61 | 0.68 | -0.51 | 0.57 | 0.13 |
| Multi-state folding rate | -0.80 | 0.46 | 0.33 | 0.55 | -0.18 | 0.11 | 0.05 |
L = protein sequence length, LRCO = estimated long-range contact order, CO = estimated contact order in [15], LRCN = estimated long-range contact number. IvankovData is used, which contains 37 two-state proteins and 24 multi-state proteins.
Figure 1. Classification accuracy surface vs. variations of parameters C and γ.
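An accuracy surface over C and γ is typically produced by a grid search: each (C, γ) pair is scored and the best pair is kept. The sketch below shows only that loop. The exponent grids and the `toy_accuracy` stand-in are assumptions for illustration; a real run would replace `toy_accuracy` with cross-validated SVM accuracy.

```python
import math
from itertools import product


def grid_search(evaluate, c_values, gamma_values):
    """Score every (C, gamma) pair; return the best pair and the full surface."""
    surface = {}
    for c, gamma in product(c_values, gamma_values):
        surface[(c, gamma)] = evaluate(c, gamma)
    best_c, best_gamma = max(surface, key=surface.get)
    return surface[(best_c, best_gamma)], best_c, best_gamma, surface


def toy_accuracy(c, gamma):
    """Hypothetical stand-in for cross-validated accuracy, peaking at
    C = 2**3 and gamma = 2**-5. Not real data."""
    return 1.0 / (1.0 + abs(math.log2(c) - 3) + abs(math.log2(gamma) + 5))


# Assumed exponential grids; the source does not give the ranges used.
c_grid = [2.0 ** k for k in range(-5, 16, 2)]
gamma_grid = [2.0 ** k for k in range(-15, 4, 2)]

best_acc, best_c, best_gamma, surface = grid_search(toy_accuracy, c_grid, gamma_grid)
```

The `surface` dictionary holds one accuracy value per grid point, which is the data a surface plot of this kind would draw.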
Linear regression analysis using different combinations of predictors
| | estCN | estLRO | estLRO+estCN | α + β Contents | All Predictors |
|---|---|---|---|---|---|
| Cor2Rate | 0.64 | 0.66 | 0.69 | 0.67 | 0.72 |
| RMSE | 1.3 | 1.14 | 1.12 | 1.27 | 1.16 |
| F-value | 21.27 | 37.03 | 38.69 | 34.66 | 35.52 |
| P-value | 6.9e-05 | 1.1e-06 | 8.7e-07 | 6.5e-05 | 7.3e-06 |
Results of linear regression analysis using the R package [35] with different combinations of predictors: estCN (estimated contact number), estLRO (estimated long-range order), estLRO+estCN (the combination of estCN and estLRO), α + β Contents (the combination of α-helix content and β-sheet content), and All Predictors (all 4 predictors). Results are reported as Cor2Rate (correlation between estimated and real rates using the selected predictors), RMSE (root mean square error of the regression), F-value, and P-value.
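For a single predictor, the reported quantities can be reproduced with ordinary least squares. The sketch below is an illustration in Python rather than the R analysis cited above, uses made-up data, and assumes RMSE is computed as sqrt(SSE / n); R's residual standard error divides by n - 2 instead. The P-value is omitted because it requires the F distribution.

```python
import math


def simple_linear_regression(xs, ys):
    """Fit y = intercept + slope * x by least squares and summarize the fit."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two series of equal length >= 3")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    predictions = [intercept + slope * x for x in xs]
    sse = sum((y - p) ** 2 for y, p in zip(ys, predictions))  # residual sum of squares
    sst = sum((y - mean_y) ** 2 for y in ys)                  # total sum of squares
    r_squared = 1.0 - sse / sst
    # F statistic with 1 and n - 2 degrees of freedom.
    f_value = (sst - sse) / (sse / (n - 2)) if sse > 0 else math.inf
    return {
        "slope": slope,
        "intercept": intercept,
        # Correlation between fitted and observed values (non-negative for OLS).
        "cor2rate": math.sqrt(max(r_squared, 0.0)),
        "rmse": math.sqrt(sse / n),  # assumption: divide by n, not n - 2
        "f_value": f_value,
    }


# Hypothetical toy data: one predictor against log folding rates.
fit = simple_linear_regression([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0])
```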
Comparison among different folding rate prediction methods based on IvankovData
| Methods | Method Type | Fold kinetic Classification Accuracy | Correlation | MAD |
|---|---|---|---|---|
| Effective length method | sequence | NA | 0.70 | 0.96 |
| LRCO method | sequence | NA | 0.61 | 0.81 |
| FOLD-RATE | sequence | NA | 0.91 | 1.1 |
| K-Fold | structure | 81% | 0.74 | 0.75 |
| Multi-predictor SVM (two-state) | sequence | 80% | 0.81 | 0.79 |
| Multi-predictor SVM (multi-state) | sequence | 80% | 0.80 | 0.68 |
Method 1: Effective length method [14]
Method 2: LRCO method [19]
Method 3: FOLD-RATE [16]
Method 4: K-Fold [7]
Method 5: Our multi-predictor SVM (two-state)
Method 6: Our multi-predictor SVM (multi-state)
Method Type indicates whether the method uses experimental structural data (structure) or only sequence data (sequence). Correlation is the correlation between predicted and experimental rates. MAD is the mean absolute difference between predicted and experimental rates.
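The MAD column follows directly from its definition: the average of the absolute differences between predicted and experimental rates. A minimal sketch, with made-up rate values used purely for illustration:

```python
def mean_absolute_difference(predicted, experimental):
    """Average absolute difference between paired predicted and experimental values."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("need two non-empty series of equal length")
    total = sum(abs(p - e) for p, e in zip(predicted, experimental))
    return total / len(predicted)


# Hypothetical toy folding rates, not taken from the comparison table.
predicted_rates = [1.0, 2.5, -0.5]
experimental_rates = [1.5, 2.0, 0.5]
mad = mean_absolute_difference(predicted_rates, experimental_rates)
```

Unlike correlation, MAD is sensitive to a constant offset between predictions and measurements, which is why the table reports both.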