Xiaofeng Song, Tao Zhou, Hao Jia, Xuejiang Guo, Xiaobai Zhang, Ping Han, Jiahao Sha.
Abstract
Protein turnover plays important roles in cell cycle progression, signal transduction, and differentiation. Proteins with short half-lives are involved in various regulatory processes. To better understand the regulation of cellular processes, it is important to study the key sequence-derived factors affecting short-lived protein degradation. Most protein half-lives are still unknown because they are difficult to measure in human cells with traditional experimental methods. To investigate the molecular determinants that affect short-lived proteins, we propose a computational method that recognizes short-lived proteins in human cells from sequence-derived features. We systematically analyzed many features that may correlate with short-lived protein degradation and found that a large fraction of human proteins with signal peptides and transmembrane regions have short half-lives. Because short-lived proteins play pivotal roles in the control of various cellular processes, we constructed an SVM-based classifier to recognize them. Applying the SVM model to the human dataset, we achieved 80.8% average sensitivity and 79.8% average specificity on ten testing datasets (TE1-TE10). We also obtained average accuracies of 89.9%, 99% and 83.9% on the independent validation datasets iTE1, iTE2 and iTE3, respectively. The proposed approach provides a valuable alternative for recognizing short-lived proteins in human cells and is more accurate than the traditional N-end rule. The web server SProtP (http://reprod.njmu.edu.cn/sprotp) has been developed and is freely available.
Year: 2011 PMID: 22114707 PMCID: PMC3218052 DOI: 10.1371/journal.pone.0027836
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Features derived from protein sequence.
| feature terms (category) | feature terms (feature) | feature for dataset (count) | abbreviation |
| amino acids content (723) | amino acids content | mono-peptide (20) | AA_* |
| | | di-peptide (400) | |
| | grouped amino acids content | single (6) | aa_* |
| | | dyad (36) | |
| | | triplet (216) | |
| | | transition (15) | *_>* |
| | | distribution (30) | num%* |
| physicochemical property (4) | sequence length | sequence length (1) | len. |
| | isoelectric point | isoelectric point (1) | isoele. |
| | sulphur content | sulphur content (1) | Sulphur |
| | hydrophobicity of protein | hydrophobicity of protein (1) | Hydrophobicity |
| structure-related (7) | disorder region | the total length of disorder regions (1) | disorder_len |
| | | the average of scores (1) | disorder_score |
| | | the number of disorder regions (1) | disorder_num |
| | | length of max disorder region (1) | disorder_max |
| | protein secondary structure | helix content (1) | helix |
| | | sheet content (1) | sheet |
| | | turn content (1) | turn |
| degradation motif (35) | KEN box | the existence of KEN box (1) | KEN |
| | destruction box | geminin content (1) | D_g |
| | | cyclinA content (1) | D_cA |
| | | cyclinB content (1) | D_cB |
| | | securin content (1) | D_s |
| | PEST region | number of PEST regions | PEST_num |
| | | max length of PEST regions | PEST_max |
| | | the average of PEST scores | PEST_score |
| | | the relative position of PEST regions | PEST_posi |
| | low complexity region | total length of LCRs (1) | LCR_len. |
| | | the number of LCRs (1) | LCR_num |
| | | length of max LCRs (1) | LCR_max |
| | N terminal | amino acids of N end (20) | N_* |
| | the existence of signal peptide | the existence of signal peptide (1) | SP |
| | transmembrane | transmembrane enrichment (1) | TM |
| | | transmembrane region length (1) | TM_len. |
| protein modification (7) | phosphorylation | the content of modification site (3) | Phos |
| | C-glycosylation | the content of modification site (1) | Cglyc |
| | N-glycosylation | the content of modification site (1) | Nglyc |
| | O-glycosylation | the content of modification site (2) | Oglyc |
| total | | 776 | |
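To illustrate the first two rows of the table, the following is a minimal sketch of how the 20 mono-peptide and 400 di-peptide content features could be computed from a sequence. The function name `composition_features`, the normalization by sequence length (or by the number of adjacent pairs), and the `AA_` prefix for di-peptide keys are assumptions made for illustration; the source does not specify how SProtP computes these values.

```python
from itertools import product

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_features(seq):
    """Return 20 mono-peptide and 400 di-peptide frequencies for a sequence.

    Hypothetical helper: key names and normalization are assumptions,
    not the published SProtP implementation.
    """
    seq = seq.upper()
    n = len(seq)
    feats = {}

    # Mono-peptide content: fraction of residues that are each amino acid.
    for aa in AMINO_ACIDS:
        feats["AA_" + aa] = seq.count(aa) / n if n else 0.0

    # Di-peptide content: fraction of adjacent residue pairs of each type.
    pairs = max(n - 1, 0)
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for i in range(pairs):
        dipeptide = seq[i:i + 2]
        if dipeptide in counts:  # skip pairs with non-standard residues
            counts[dipeptide] += 1
    for dipeptide, count in counts.items():
        feats["AA_" + dipeptide] = count / pairs if pairs else 0.0

    return feats


features = composition_features("MKLL")
print(len(features))  # 20 mono-peptide + 400 di-peptide entries
```

The grouped amino acid features (single, dyad, triplet, transition, distribution) would follow the same pattern after mapping residues to classes, but the class definitions are not given here, so they are left out of the sketch.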
Figure 1. ROC curves of the proposed SProtP using different cutoffs.
Figure 2. The AUC value of the SProtP model as a function of the number of features in the human dataset.
The reduced optimal set of 11 features.
| No. | features | Feature description | F-value |
| 1 | TM | transmembrane enrichment | 0.4653 |
| 2 | Hydrophobicity | hydrophobicity of protein | 0.4162 |
| 3 | SP | Signal peptide | 0.4114 |
| 4 | aa_ba | composition of amino acid class b and a combined dyad | 0.2950 |
| 5 | aa_aab | composition of amino acid class a, a, and b combined triplet | 0.2875 |
| 6 | len. | sequence length | 0.2827 |
| 7 | aa_da | composition of amino acid class d and a combined dyad | 0.2722 |
| 8 | AA_D | composition of amino acid “aspartic acid” | 0.2682 |
| 9 | aa_cd | composition of amino acid class c and d combined dyad | 0.2430 |
| 10 | AA_LL | composition of amino acid “leucine” | 0.2408 |
| 11 | aa_bb | composition of amino acid class b and b combined dyad | 0.2255 |
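The table ranks features by an "F-value" without defining it. One common choice for ranking features before SVM training is the F-score, which compares the separation of the class means with the within-class variances. The sketch below assumes that definition; whether it matches the statistic used for SProtP is not confirmed by the text, and the function name `f_score` is hypothetical.

```python
from statistics import mean, variance


def f_score(pos_values, neg_values):
    """F-score of one feature for a two-class problem (assumed definition).

    Numerator: squared distances of each class mean from the overall mean.
    Denominator: sum of the sample variances (n - 1 divisor) of both classes.
    Each class needs at least two values for the variance to be defined.
    """
    overall = mean(list(pos_values) + list(neg_values))
    between = (mean(pos_values) - overall) ** 2 + (mean(neg_values) - overall) ** 2
    within = variance(pos_values) + variance(neg_values)
    return between / within if within else float("inf")


# Toy values for a single feature in short-lived vs. long-lived proteins.
short_lived = [1.0, 2.0, 3.0]
long_lived = [4.0, 5.0, 6.0]
print(f_score(short_lived, long_lived))
```

Under this definition a larger score means the feature separates the two classes better on its own; it says nothing about redundancy between features, which is one reason a reduced set is usually validated by retraining the classifier.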
Figure 3. The propensity score of each feature for short-lived and long-lived proteins.
Figure 4. Variation of feature values with protein half-life.
Comparison between the proposed SProtP and the N-end rule.
| | SProtP | N-end rule |
| 5-fold cross-validation | 0.81±1.65 | — |
| TP | 82±2 | 55±2 |
| FN | 20±2 | 47±2 |
| TN | 3126±37 | 2236±10 |
| FP | 794±37 | 1684±10 |
| SE | 0.808±0.018 | 0.539±0.019 |
| SP | 0.798±0.009 | 0.570±0.003 |
| ACC | 0.798±0.009 | 0.570±0.002 |
| MCC | 0.231±0.009 | 0.035±0.006 |
| AUC | 0.848±0.013 | — |