Loris Nanni, Alessandra Lumini, Sheryl Brahnam.
Abstract
Many domains would benefit from reliable and efficient systems for automatic protein classification. An area of particular interest in recent studies on automatic protein classification is the exploration of new methods for extracting features from a protein that work well for specific problems. These methods, however, are not generalizable and have proven useful in only a few domains. Our goal is to evaluate several feature extraction approaches for representing proteins by testing them across multiple datasets. Different types of protein representations are evaluated: those starting from the position-specific scoring matrix (PSSM) of the protein, those derived from the amino-acid sequence, two matrix representations, and features taken from the 3D tertiary structure of the protein. We also test new variants of protein descriptors. We develop our system experimentally by comparing and combining different descriptors taken from the protein representations. Each descriptor is used to train a separate support vector machine (SVM), and the results are combined by the sum rule. Some stand-alone descriptors work well on some datasets but not on others. Through fusion, the combined descriptors perform well across all tested datasets, in some cases better than the state of the art.
Year: 2014 PMID: 25028675 PMCID: PMC4084589 DOI: 10.1155/2014/236717
Source DB: PubMed Journal: ScientificWorldJournal ISSN: 1537-744X
Figure 1: Schema of the proposed method.
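The fusion step described in the abstract (one classifier per descriptor, outputs combined by the sum rule) can be illustrated with a minimal sketch. It assumes each descriptor-specific classifier has already produced per-class scores on a comparable scale; the score values, the names `svm_a` and `svm_b`, and the helper functions are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

Scores = Sequence[Sequence[float]]  # one row per sample, one column per class


def sum_rule(scores_per_classifier: Sequence[Scores]) -> List[List[float]]:
    """Add up the class scores of several classifiers, sample by sample."""
    n_samples = len(scores_per_classifier[0])
    n_classes = len(scores_per_classifier[0][0])
    fused = [[0.0] * n_classes for _ in range(n_samples)]
    for scores in scores_per_classifier:
        for i, row in enumerate(scores):
            for c, value in enumerate(row):
                fused[i][c] += value
    return fused


def predict(fused: Scores) -> List[int]:
    """Pick, for every sample, the class with the largest summed score."""
    return [max(range(len(row)), key=row.__getitem__) for row in fused]


# Hypothetical, already-normalized scores from two descriptor-specific
# classifiers on 3 samples and 2 classes (illustrative values only).
svm_a = [[0.9, 0.1], [0.4, 0.6], [0.55, 0.45]]
svm_b = [[0.8, 0.2], [0.3, 0.7], [0.2, 0.8]]

fused = sum_rule([svm_a, svm_b])
labels = predict(fused)
```

In this toy case the first classifier alone would assign the third sample to class 0, while the summed scores favor class 1, which is the kind of correction fusion is meant to provide.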
Summarized description of the datasets (if available, the number of training and independent samples is given in column “number of samples”). The column BKB reports whether it is possible from the dataset to obtain the PDB of the proteins for extracting the backbone structure.
| Name | Short name | Number of samples | Number of classes | Protocol | BKB |
|---|---|---|---|---|---|
| Membrane subcellular | MEM | 3249 + 4333 | 8 | Independent training and testing sets | NO |
| Human pairs | HU | 1882 | 2 | 10-fold cross validation | NO |
| Protein fold | PF | 698 | 27 | Independent training and testing sets | YES |
| GPCR | GP | 730 | 2 | 10-fold cross validation | NO |
| GRAM | GR | 452 | 5 | 10-fold cross validation | NO |
| Viral | VR | 112 | 4 | 10-fold cross validation | NO |
| Cysteines | CY | 957 | 3 | 10-fold cross validation | YES |
| SubCell | SC | 121 | 3 | 10-fold cross validation | YES |
| DNA-binding proteins | DNA | 349 | 2 | 10-fold cross validation | YES |
| Enzyme | ENZ | 1094 | 6 | 10-fold cross validation | YES |
| GO dataset | GO | 168 | 4 | 10-fold cross validation | YES |
| Human interaction | HI | 8161 | 2 | 10-fold cross validation | NO |
| Submitochondria locations | SL | 317 | 3 | 10-fold cross validation | NO |
| Virulent independent set 1 | VI1 | 2055 + 83 | 2 | Independent training and testing sets | NO |
| Virulent independent set 2 | VI2 | 2055 + 284 | 2 | Independent training and testing sets | NO |
| Adhesins | AD | 2055 + 1172 | 2 | Independent training and testing sets | NO |
Figure 2: DM images extracted from 2 sample proteins of the DNA dataset.
Summary of the descriptors (short names are defined in Sections 3 and 4).
| Protein representation | Descriptor | Size |
|---|---|---|
| AAS | AS | 20 |
| | 2G | 400 |
| | QRC | 1200 |
| | AC | 40 |
| | P2G | 800 |
| | AA | 65 |
| | GE | 480 |
| | NG | 400, 225, 512, 125, 64 |
| | SAC | 20 |
| | DW | 52 |
| PSSM/SMR | AB | 400 |
| | SAN | 400 |
| | SA | 400 |
| | AM | 300 |
| | PP | 320 |
| | SVD | Depends on the input representation |
| | DCT | 400 |
| | LHF_G | 176 |
| | LPQ_G | 512 |
| | LHF_L | 528 |
| | LPQ_L | 1536 |
| | BGR | 400 |
| | TGR | 8000 |
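To make the sizes in the descriptor table concrete, here is a minimal sketch of two sequence-based feature vectors. It assumes that AS denotes the amino-acid composition (20 values, one per standard residue) and 2G the 2-gram frequencies (20 × 20 = 400 values); the short names are defined in sections not reproduced here, so this reading, the function names, and the example sequence are assumptions.

```python
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def composition(seq: str) -> List[float]:
    """Relative frequency of each standard residue (length-20 vector)."""
    counts = dict.fromkeys(AMINO_ACIDS, 0)
    for residue in seq:
        if residue in counts:  # non-standard symbols are ignored
            counts[residue] += 1
    total = sum(counts.values())
    return [counts[a] / total if total else 0.0 for a in AMINO_ACIDS]


def two_gram(seq: str) -> List[float]:
    """Relative frequency of each ordered residue pair (length-400 vector)."""
    counts = dict.fromkeys(PAIRS, 0)
    for first, second in zip(seq, seq[1:]):
        if first + second in counts:
            counts[first + second] += 1
    total = sum(counts.values())
    return [counts[p] / total if total else 0.0 for p in PAIRS]


example = "AACD"  # hypothetical toy sequence
as_vector = composition(example)
g2_vector = two_gram(example)
```

Both vectors have a fixed length regardless of sequence length, which is what allows proteins of different sizes to be fed to the same classifier.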
Comparison among the different feature extractors in terms of the statistical rank on the different datasets. The 2 best descriptors for each representation are in boldface.
| Protein representation | Descriptor | Rank |
|---|---|---|
| AAS | AS | 23.42 |
| | 2G | 27.25 |
| | QRC | 21.54 |
| | AC | |
| | P2G | 39.78 |
| | AA | |
| | GE | 30.24 |
| | NG | 27.85 |
| | SAC | 23.45 |
| | DW | 29.48 |
| PSSM | AB | 15.25 |
| | SAN | |
| | SA | 13.20 |
| | AM | 20.50 |
| | PP | |
| | SVD | 39.56 |
| | DCT | 28.56 |
| | LHF_G | 24.10 |
| | LPQ_G | 14.87 |
| | LHF_L | 31.81 |
| | LPQ_L | 26.72 |
| | BGR | 12.44 |
| | TGR | 15.68 |
| SMR | AB | 28.78 |
| | SAN | 24.80 |
| | SA | 24.82 |
| | AM | 40.52 |
| | PP | |
| | SVD | 29.20 |
| | DCT | 32.45 |
| | LHF_G | |
| | LPQ_G | 17.22 |
| | LHF_L | 26.24 |
| | LPQ_L | 31.24 |
| | BGR | 19.86 |
| | TGR | 23.24 |
| PR (ensemble of 25 PR | SVD | |
| | DCT | |
| | LHF_G | 41.25 |
| | LPQ_G | 38.38 |
| | LHF_L | 44.02 |
| | LPQ_L | 38.48 |
| WAVE (ensemble of 25 WAVE | SVD | 40.25 |
| | DCT | 47.00 |
| | LHF_G | |
| | LPQ_G | |
| | LHF_L | 41.10 |
| | LPQ_L | 40.20 |
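As an illustration of how a rank averaged over datasets can be obtained, the sketch below ranks three methods on three datasets and averages the ranks. It assumes rank 1 goes to the highest AUC on each dataset and that ties share the mean rank; the exact ranking procedure is not stated in this excerpt, and the function name is hypothetical. The AUC values are the DNA, HU, and HI entries of the 2-class ensemble comparison table in this document.

```python
from typing import Dict, List


def average_ranks(auc: Dict[str, List[float]]) -> Dict[str, float]:
    """Rank methods per dataset (1 = highest AUC) and average over datasets."""
    names = list(auc)
    n_datasets = len(auc[names[0]])
    totals = dict.fromkeys(names, 0.0)
    for d in range(n_datasets):
        ordered = sorted(names, key=lambda n: auc[n][d], reverse=True)
        i = 0
        while i < len(ordered):
            # Extend j over every method tied with position i on this dataset.
            j = i
            while j + 1 < len(ordered) and auc[ordered[j + 1]][d] == auc[ordered[i]][d]:
                j += 1
            mean_rank = (i + j) / 2 + 1  # tied methods share the mean rank
            for k in range(i, j + 1):
                totals[ordered[k]] += mean_rank
            i = j + 1
    return {n: totals[n] / n_datasets for n in names}


# AUC on DNA, HU, HI for three stand-alone descriptors.
auc_table = {
    "PSSM(PP)": [95.5, 81.2, 94.8],
    "PSSM(SAN)": [95.2, 76.4, 95.7],
    "AAS(AC)": [92.6, 71.8, 95.9],
}
ranks = average_ranks(auc_table)
```

With only three methods the averages stay between 1 and 3; the values in the table above are much larger because many descriptors are ranked together.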
Comparison in terms of AUC in 2 class problems.
| Protein representation | Descriptor | DNA | HU | HI | GP | AD | VI1 | VI2 |
|---|---|---|---|---|---|---|---|---|
| AAS | AC | 92.6 | 71.8 | 95.9 | 99.1 | 80.9 | 90.0 | 76.5 |
| | AA | 90.6 | 68.3 | — | 98.8 | 78.9 | | 75.6 |
| PSSM | PP | 95.5 | 81.2 | 94.8 | 99.8 | 87.7 | 86.2 | 87.2 |
| | SAN | 95.2 | 76.4 | 95.7 | 99.7 | 82.7 | 87.3 | 85.7 |
| SMR | PP | 92.9 | 73.8 | — | 99.5 | 79.8 | 88.5 | 76.0 |
| | LHF_G | 89.3 | 69.0 | — | 99.3 | 81.6 | 83.4 | 71.1 |
| PR | SVD | 79.6 | 74.2 | — | 98.0 | 72.3 | 59.1 | 73.3 |
| | DCT | 83.4 | 67.7 | — | 95.9 | 73.4 | 68.4 | 63.0 |
| WAVE | LPQ_G | 83.1 | 68.6 | — | 98.5 | 74.0 | 67.4 | 67.6 |
| | LHF_G | 77.7 | 68.6 | — | 97.8 | 68.9 | 65.1 | 60.8 |
Comparison in terms of AUC in multiclass problems.
| Protein representation | Descriptor | MEM | PF | ENZ | GR | VR | SL | CY | GO | SC |
|---|---|---|---|---|---|---|---|---|---|---|
| AAS | AC | 93.6 | 84.8 | 66.7 | 92.7 | 81.8 | 93.2 | 78.4 | 70.0 | 67.6 |
| | AA | 90.4 | 84.2 | 63.7 | 92.6 | 72.2 | 91.1 | 76.5 | 69.5 | 65.5 |
| PSSM | PP | 96.8 | 93.1 | 78.0 | 80.8 | 81.8 | 95.7 | 79.4 | | 70.3 |
| | SAN | 95.5 | 87.7 | 71.1 | | 72.0 | 94.1 | 81.8 | 78.6 | 73.4 |
| SMR | PP | 94.2 | 85.9 | 66.2 | | 76.9 | 92.2 | 78.7 | 69.0 | 66.2 |
| | LHF_G | | 87.6 | 65.6 | 91.3 | | 89.5 | 78.2 | 72.4 | 62.9 |
| PR | SVD | 94.4 | 83.5 | 59.4 | 80.8 | 76.0 | 85.4 | 73.5 | 59.7 | 60.3 |
| | DCT | 91.7 | 79.5 | 60.8 | 82.6 | 74.2 | 83.9 | 71.7 | 65.3 | 64.2 |
| WAVE | LPQ_G | 94.2 | 87.2 | 63.2 | 82.7 | 79.2 | 83.4 | 68.1 | 65.7 | 58.1 |
| | LHF_G | 92.7 | 86.2 | 61.5 | 80.3 | 80.6 | 81.0 | 66.6 | 65.2 | 57.0 |
Comparisons with previous versions of WAVE and PR.
| Protein representation | Descriptor | HU | GP | AD |
|---|---|---|---|---|
| WAVE | Best in [ | 66.1 | 96.6 | 67.1 |
| PR | Best in [ | 62.8 | 87.8 | 57.5 |
| WAVE | LPQ_G | 68.6 | | 72.3 |
| PR | SVD | | 98.0 | |
Comparison among ensembles and best stand-alone descriptors in terms of AUC in 2 class problems.
| Protein representation | DNA | HU | HI | GP | AD | VI1 | VI2 |
|---|---|---|---|---|---|---|---|
| PSSM(PP) | 95.5 | 81.2 | 94.8 | 99.8 | 87.7 | 86.2 | 87.2 |
| PSSM(SAN) | 95.2 | 76.4 | 95.7 | 99.7 | 82.7 | 87.3 | 85.7 |
| AAS(AC) | 92.6 | 71.8 | 95.9 | 99.1 | 80.9 | 90.0 | 76.4 |
| FUS1 | 97.2 | | | | | | |
| FUS2 | | — | — | — | — | — | — |
Comparison among ensembles and best stand-alone descriptors in terms of AUC in multiclass problems.
| Protein representation | MEM | PF | ENZ | GR | VR | SL | CY | GO | SC |
|---|---|---|---|---|---|---|---|---|---|
| PSSM(PP) | 96.8 | 93.1 | 78.0 | 80.8 | 81.8 | 95.7 | 79.4 | | 70.3 |
| PSSM(SAN) | 95.5 | 87.7 | 71.1 | | 72.0 | 94.1 | 81.8 | 78.6 | 73.4 |
| AAS(AC) | 93.6 | 84.8 | 66.7 | 92.7 | 81.8 | 93.2 | 78.4 | 70.0 | 67.6 |
| FUS1 | | 92.7 | | 92.3 | | | | 83.8 | 75.3 |
| FUS2 | — | | 80.1 | — | — | — | 84.3 | 82.8 | |
Comparison with the state-of-the-art using AUC as performance indicator.
| Methods | HU | PF | GP | GR | VR | DNA | ENZ | MEM | GO | SL | HI | AD | VI1 | VI2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [ | 77.0 | 87.0 | 83.4 | |||||||||||
| [ | 93.3 | 72.5 | 50.0 | |||||||||||
| [ | 72.5 | 99.7 | | 82.5 | 96.0 | 82.9 | 86.1 | 76.0 | ||||||
| [ | 98.2 | |||||||||||||
| [ | 81.6 | 91.2 | 84.1 | |||||||||||
| [ | 95.9 | 79.4 | 96.8 | 93.8 | 98.0 | 87.1 | 87.9 | |||||||
| FUS1 | | 92.7 | | 92.3 | | | | | | | | | | |
Comparison with the state-of-the-art using accuracy as performance indicator.
| Methods | HU | PF | GP | GR | VR | DNA | ENZ | MEM | GO | SL | HI | AD | VI1 | VI2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [ | 56.50 | |||||||||||||
| [ | 65.50 | |||||||||||||
| [ | 58.18 | |||||||||||||
| [ | 61.04 | |||||||||||||
| [ | 70.0 | |||||||||||||
| [ | 69.60 | |||||||||||||
| [ | 91.6 | |||||||||||||
| [ | 91.6 | |||||||||||||
| [ | 84.1 | |||||||||||||
| [ | 92.7 | |||||||||||||
| [ | 92.6 | |||||||||||||
| [ | 70.0 | 98.1 | 84.4 | 78.6 | 91.5 | |||||||||
| [ | 56.2 | 94.1 | 59.4 | 85.8 | 93.1 | 85.5 | 81.7 | |||||||
| FUS1 | | 68.6 | | 87.9 | | | | 94.3 | 64.3 | | 93.9 | | | |
| FUS2 | 74.6 | | | 63.0 | ||||||||||
Comparison between AAS(RC) and AAS(AC).
| Methods | HU | PF | GP | GR | VR | DNA | ENZ | MEM | GO | SL | HI | AD | VI1 | VI2 | CY | SC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AAS(RC) | 70.3 | | 98.9 | 90.0 | 69.0 | 86.2 | 64.5 | | 68.3 | 87.8 | | | 89.2 | 75.9 | 77.6 | 62.4 |
| AAS(AC) | 71.8 | 84.8 | 99.1 | 92.7 | 81.8 | 92.6 | 66.7 | 93.6 | 70.0 | 93.2 | 95.9 | 80.9 | 90.0 | | 78.4 | 67.6 |
Comparison among ensembles and best stand-alone descriptors in terms of AUC.
| Methods | HU | PF | GP | GR | VR | DNA | ENZ | MEM | GO | SL | HI | AD | VI1 | VI2 | CY | SC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PSSM(PP) | 81.2 | 93.1 | 99.8 | 80.8 | 81.8 | 95.5 | 78.0 | 96.8 | | 95.7 | 94.8 | 87.7 | 86.2 | 87.2 | 79.4 | 70.3 |
| PSSM(SAN) | 76.4 | 87.7 | 99.7 | | 72.0 | 95.2 | 71.1 | 95.5 | 78.6 | 94.1 | 95.7 | 82.7 | 87.3 | 85.7 | 81.8 | 73.4 |
| PSSM(LPQ_G) | 72.0 | 89.5 | | 82.3 | 77.7 | 89.5 | 66.2 | 93.6 | 73.0 | 93.7 | | 86.8 | 82.3 | 83.9 | 70.3 | 61.6 |