Guohua Huang, Chen Chu, Tao Huang, Xiangyin Kong, Yunhua Zhang, Ning Zhang, Yu-Dong Cai.
Abstract
Although the number of available protein sequences is growing exponentially, functional protein annotations lag far behind. Therefore, accurate identification of protein functions remains one of the major challenges in molecular biology. In this study, we presented a novel approach to predict mouse protein functions. The approach was a sequential combination of a similarity-based approach, an interaction-based approach and a pseudo amino acid composition-based approach. The method achieved an accuracy of about 0.8450 for the 1st-order predictions in the leave-one-out and ten-fold cross-validations. In the leave-one-out cross-validation, the similarity-based approach alone achieved an accuracy of 0.8756, but it was unable to predict the functions of proteins with no homologues. By comparison, the pseudo amino acid composition-based approach alone reached an accuracy of 0.6786. Although this accuracy was lower than that of the similarity-based approach, the method could predict the functions of almost all proteins, even proteins with no homologues. The combined method therefore balanced the advantages and disadvantages of the individual approaches to achieve efficient performance. Furthermore, the results of the ten-fold cross-validation indicate that the combined method remains effective and stable when no close homologues are available. However, the accuracy of the predicted functions can only be determined against known protein functions based on current knowledge, and many protein functions remain unknown. An examination of proteins for which the 1st-order predicted functions were wrong but the 2nd-order predicted functions were correct showed that the wrongly predicted 1st-order functions were closely associated with the genes encoding the proteins. These so-called wrongly predicted functions could therefore prove correct upon future experimental verification, and the accuracy of the presented method may be much higher in reality.
Year: 2016 PMID: 27846315 PMCID: PMC5112993 DOI: 10.1371/journal.pone.0166580
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
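The sequential combination described in the abstract can be read as a fallback cascade: a protein is passed to the similarity-based predictor first, then to the interaction-based predictor, and finally to the PseAAC-based predictor, which can handle almost any sequence. The sketch below is a minimal illustration under that reading, not the authors' implementation. The function names, the toy predictors, and the protein IDs (`protA`, `protB`, `protC`) are all hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A predictor returns functional categories ranked from most to least likely,
# or None when it cannot make a prediction (e.g. no homologue is found).
Predictor = Callable[[str], Optional[List[str]]]


def combined_predict(
    protein_id: str, predictors: Sequence[Tuple[str, Predictor]]
) -> Tuple[Optional[str], List[str]]:
    """Try each predictor in order and return the first available ranking."""
    for name, predict in predictors:
        ranking = predict(protein_id)
        if ranking:  # skip predictors that return None or an empty ranking
            return name, ranking
    return None, []


# Hypothetical toy data standing in for real homology and interaction lookups.
SIMILARITY_HITS = {"protA": ["METABOLISM", "ENERGY"]}
INTERACTION_HITS = {"protB": ["TRANSCRIPTION", "CELL FATE"]}


def similarity_based(protein_id: str) -> Optional[List[str]]:
    return SIMILARITY_HITS.get(protein_id)


def interaction_based(protein_id: str) -> Optional[List[str]]:
    return INTERACTION_HITS.get(protein_id)


def pseaac_based(protein_id: str) -> Optional[List[str]]:
    # Stand-in for a sequence-composition model that always yields a ranking.
    return ["SUBCELLULAR LOCALIZATION", "METABOLISM"]


CASCADE = [
    ("similarity", similarity_based),
    ("interaction", interaction_based),
    ("pseaac", pseaac_based),
]

for pid in ("protA", "protB", "protC"):
    used, ranking = combined_predict(pid, CASCADE)
    print(pid, used, ranking[0])
```

In this toy run, `protA` is answered by the similarity stand-in, `protB` falls through to the interaction stand-in, and `protC` reaches the PseAAC stand-in. The first element of each ranking plays the role of the 1st-order prediction.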
The number of mouse proteins in each category in our dataset.
| Functional Number | Functional categories | Number of proteins |
|---|---|---|
| 1 | METABOLISM | 2,401 |
| 2 | ENERGY | 522 |
| 3 | CELL CYCLE AND DNA PROCESSING | 971 |
| 4 | TRANSCRIPTION | 1,921 |
| 5 | PROTEIN SYNTHESIS | 399 |
| 6 | PROTEIN FATE (folding, modification, destination) | 2,187 |
| 7 | PROTEIN WITH BINDING FUNCTION OR COFACTOR REQUIREMENT (structural or catalytic) | 7,330 |
| 8 | REGULATION OF METABOLISM AND PROTEIN FUNCTION | 972 |
| 9 | CELLULAR TRANSPORT, TRANSPORT FACILITIES AND TRANSPORT ROUTES | 2,078 |
| 10 | CELLULAR COMMUNICATION/SIGNAL TRANSDUCTION MECHANISM | 3,143 |
| 11 | CELL RESCUE, DEFENSE AND VIRULENCE | 656 |
| 12 | INTERACTION WITH THE ENVIRONMENT | 1,212 |
| 13 | SYSTEMIC INTERACTION WITH THE ENVIRONMENT | 1,454 |
| 14 | TRANSPOSABLE ELEMENTS, VIRAL AND PLASMID PROTEINS | 9 |
| 15 | CELL FATE | 1,180 |
| 16 | DEVELOPMENT (Systemic) | 939 |
| 17 | BIOGENESIS OF CELLULAR COMPONENTS | 769 |
| 18 | CELL TYPE DIFFERENTIATION | 317 |
| 19 | TISSUE DIFFERENTIATION | 313 |
| 20 | ORGAN DIFFERENTIATION | 491 |
| 21 | SUBCELLULAR LOCALIZATION | 8,467 |
| 22 | CELL TYPE LOCALIZATION | 232 |
| 23 | TISSUE LOCALIZATION | 261 |
| 24 | ORGAN LOCALIZATION | 542 |
| Total | — | 38,766 |
The physicochemical and biochemical properties of the 20 amino acids.
| Amino acid | Polarity | Secondary structure | Molecular volume | Codon diversity | Electrostatic charge |
|---|---|---|---|---|---|
| A | -0.591 | -1.302 | -0.733 | 1.57 | -0.146 |
| C | -1.343 | 0.465 | -0.862 | -1.02 | -0.255 |
| D | 1.05 | 0.302 | -3.656 | -0.259 | -3.242 |
| E | 1.357 | -1.453 | 1.477 | 0.113 | -0.837 |
| F | -1.006 | -0.59 | 1.891 | -0.397 | 0.412 |
| G | -0.384 | 1.652 | 1.33 | 1.045 | 2.064 |
| H | 0.336 | -0.417 | -1.673 | -1.474 | -0.078 |
| I | -1.239 | -0.547 | 2.131 | 0.393 | 0.816 |
| K | 1.831 | -0.561 | 0.533 | -0.277 | 1.648 |
| L | -1.019 | -0.987 | -1.505 | 1.266 | -0.912 |
| M | -0.663 | -1.524 | 2.219 | -1.005 | 1.212 |
| N | 0.945 | 0.828 | 1.299 | -0.169 | 0.933 |
| P | 0.189 | 2.081 | -1.628 | 0.421 | -1.392 |
| Q | 0.931 | -0.179 | -3.005 | -0.503 | -1.853 |
| R | 1.538 | -0.055 | 1.502 | 0.44 | 2.897 |
| S | -0.228 | 1.399 | -4.76 | 0.67 | -2.647 |
| T | -0.032 | 0.326 | 2.213 | 0.908 | 1.313 |
| V | -1.337 | -0.279 | -0.544 | 1.242 | -1.262 |
| W | -0.595 | 0.009 | 0.672 | -2.128 | -0.184 |
| Y | 0.26 | 0.83 | 3.097 | -0.838 | 1.512 |
Prediction accuracies of the three individual methods and the combined method for the 1st-, 2nd- and 3rd-order predictions.
| Method | Number of proteins in the testing dataset | 1st-order | 2nd-order | 3rd-order |
|---|---|---|---|---|
| Similarity-based | 10,252 | 0.8756 | 0.7132 | 0.5158 |
| Interaction-based | 10,539 | 0.7535 | 0.6296 | 0.5299 |
| PseAAC-based | 12,478 | 0.6786 | 0.5874 | 0.2519 |
| Combined | 12,478 | 0.8464 | 0.6814 | 0.4996 |
Contributions of the three approaches to the predicted results.
| Method | Number of proteins | Proportion | 1st-order accuracy |
|---|---|---|---|
| Similarity-based approach | 10,252 | 82.16% | 0.8756 |
| Interaction-based approach | 1,876 | 15.03% | 0.7154 |
| PseAAC-based approach | 350 | 2.81% | 0.6943 |
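The contribution table can be cross-checked against the combined 1st-order accuracy reported earlier (0.8464 on 12,478 proteins). Assuming the last column of the table is each approach's 1st-order accuracy on the proteins it handled, the combined accuracy should equal the protein-count-weighted average of the three values. A short check, with the numbers copied from the table:

```python
# (number of proteins handled, accuracy on those proteins), copied from the table
contributions = {
    "similarity": (10252, 0.8756),
    "interaction": (1876, 0.7154),
    "pseaac": (350, 0.6943),
}

# Total number of proteins covered by the three approaches together
total = sum(n for n, _ in contributions.values())

# Accuracy of the combined method as a weighted average of the three accuracies
weighted_accuracy = sum(n * acc for n, acc in contributions.values()) / total

print(total)                        # 12478
print(round(weighted_accuracy, 4))  # 0.8464
```

Both values agree with the "Combined" row of the accuracy table. The interaction-based and PseAAC-based accuracies here (0.7154 and 0.6943) differ from their stand-alone values because they are measured only on the proteins left over by the preceding approach.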
Performance of the combined method evaluated by ten-fold cross-validation.
| Order | 1 | 2 | 3 | 4 | 5 | Mean ± std |
|---|---|---|---|---|---|---|
| 1st | 0.8429 | 0.8416 | 0.8420 | 0.8440 | 0.8419 | 0.8425 ± 0.0010 |
| 2nd | 0.6768 | 0.6792 | 0.6781 | 0.6787 | 0.6802 | 0.6786 ± 0.0013 |
| 3rd | 0.5023 | 0.4972 | 0.5007 | 0.4977 | 0.4998 | 0.4995 ± 0.0021 |
Note: std denotes standard deviation.
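The summary column of the ten-fold cross-validation table can be reproduced from the five per-column values. The sketch below assumes that columns 1 to 5 are five separate runs of the ten-fold cross-validation and that "std" is the sample standard deviation (divisor n - 1); the reported figures are consistent with that choice, while the population form (divisor n) would give slightly smaller values.

```python
from statistics import mean, stdev

# Accuracies from columns 1-5 of the table, one list per prediction order
runs = {
    "1st": [0.8429, 0.8416, 0.8420, 0.8440, 0.8419],
    "2nd": [0.6768, 0.6792, 0.6781, 0.6787, 0.6802],
    "3rd": [0.5023, 0.4972, 0.5007, 0.4977, 0.4998],
}

# Mean and sample standard deviation, rounded to four decimals as in the table
summary = {
    order: (round(mean(values), 4), round(stdev(values), 4))
    for order, values in runs.items()
}

for order, (m, s) in summary.items():
    print(f"{order}: {m:.4f} ± {s:.4f}")
```

The standard deviations are at most about 0.002, which is the basis for describing the combined method as stable across runs.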
The sixteen significant proteins with "wrong" 1st-order predictions but "right" 2nd-order predictions based on the sequence similarity-based approach.
| Protein ID | Name | "wrong" predicted function in 1st-order prediction |
|---|---|---|
| mc11000118 | MYO1G | SUBCELLULAR LOCALIZATION |
| mc9001073 | NEO1 | SUBCELLULAR LOCALIZATION |
| mc5002204 | SDK1 | SUBCELLULAR LOCALIZATION |
| mc17000153 | PLG | PROTEIN FATE (folding, modification, destination) |
| mc2000415 | GM711 | PROTEIN FATE (folding, modification, destination) |
| mc15000840 | MAPK15 | PROTEIN FATE (folding, modification, destination) |
| mc7000273 | PRKD2 | PROTEIN FATE (folding, modification, destination) |
| mc11002342 | STRADA | PROTEIN FATE (folding, modification, destination) |
| mc7001424 | NTRK3 | PROTEIN FATE (folding, modification, destination) |
| mc14000439 | BMPR1A | PROTEIN FATE (folding, modification, destination) |
| mc11001586 | KSR1 | PROTEIN FATE (folding, modification, destination) |
| mc6000496 | EPHB6 | PROTEIN FATE (folding, modification, destination) |
| mc7000874 | KLK9 | PROTEIN FATE (folding, modification, destination) |
| mc15001663 | KRT2 | PROTEIN WITH BINDING FUNCTION OR COFACTOR REQUIREMENT (structural or catalytic) |
| mc17001082 | PTK7 | PROTEIN WITH BINDING FUNCTION OR COFACTOR REQUIREMENT (structural or catalytic) |
| mc1000962 | SPEG | PROTEIN WITH BINDING FUNCTION OR COFACTOR REQUIREMENT (structural or catalytic) |
The twenty-two significant proteins with "wrong" 1st-order predictions but "right" 2nd-order predictions based on the weighted interaction-based approach.
| Protein ID | Name | "wrong" predicted function in 1st-order prediction |
|---|---|---|
| mc2003319 | ADRM1 | SUBCELLULAR LOCALIZATION |
| mc6000275 | ATP6V1F | SUBCELLULAR LOCALIZATION |
| mc4002507 | AURKAIP1 | SUBCELLULAR LOCALIZATION |
| mc17001119 | BYSL | SUBCELLULAR LOCALIZATION |
| mc13001367 | DHFR | SUBCELLULAR LOCALIZATION |
| mc1001293 | DTYMK | SUBCELLULAR LOCALIZATION |
| mc4000473 | GNE | SUBCELLULAR LOCALIZATION |
| mc5001787 | HPD | SUBCELLULAR LOCALIZATION |
| mc3000151 | HPS3 | SUBCELLULAR LOCALIZATION |
| mc4001314 | MAGOH | SUBCELLULAR LOCALIZATION |
| mc9000131 | MED17 | SUBCELLULAR LOCALIZATION |
| mc4001915 | NUDC | SUBCELLULAR LOCALIZATION |
| mc11000229 | PNOL | SUBCELLULAR LOCALIZATION |
| mcx000234 | RGN | SUBCELLULAR LOCALIZATION |
| mc9000734 | RPS25 | SUBCELLULAR LOCALIZATION |
| mc8000054 | SHCBP1 | SUBCELLULAR LOCALIZATION |
| mc6000048 | SHFM1 | SUBCELLULAR LOCALIZATION |
| mc2002263 | NCAPH | PROTEIN WITH BINDING FUNCTION OR COFACTOR REQUIREMENT (structural or catalytic) |
| mc2000861 | RIF1 | PROTEIN WITH BINDING FUNCTION OR COFACTOR REQUIREMENT (structural or catalytic) |
| mc19000070 | CDCA5 | PROTEIN WITH BINDING FUNCTION OR COFACTOR REQUIREMENT (structural or catalytic) |
| mc7001471 | PRC1 | PROTEIN WITH BINDING FUNCTION OR COFACTOR REQUIREMENT (structural or catalytic) |
| mc15001589 | NPFF | CELLULAR COMMUNICATION/SIGNAL TRANSDUCTION MECHANISM |
The two significant proteins with "wrong" 1st-order predictions but "right" 2nd-order predictions based on the PseAAC-based approach.
| Protein ID | Name | "wrong" predicted function in 1st-order prediction |
|---|---|---|
| mc4000691 | AKAP2 | SUBCELLULAR LOCALIZATION |
| mc1001669 | KISS1 | SUBCELLULAR LOCALIZATION |