Zhijun Liao, Ying Ju, Quan Zou.
Abstract
G protein-coupled receptors (GPCRs) are the largest receptor superfamily. In this paper, we represent GPCRs using physicochemical properties derived from SVM-Prot. A random forest classifier was used to distinguish them from other protein sequences. The MEME suite was used to detect the 10 most significant conserved motifs of human GPCRs. On the testing datasets, the average accuracy was 91.61%, and the average AUC was 0.9282. MEME discovery analysis showed that many motifs aggregated in the seven hydrophobic transmembrane helix regions, consistent with the characteristics of GPCRs. Together, these results indicate that our machine-learning method can successfully distinguish GPCRs from non-GPCRs.
Year: 2016 PMID: 27529053 PMCID: PMC4978840 DOI: 10.1155/2016/8309253
Source DB: PubMed Journal: Scientifica (Cairo) ISSN: 2090-908X
The number of proteins and composition for every class of GPCRs (from GPCRdb).
| GPCRdb family | Number of proteins (human) | Composition |
|---|---|---|
| Class A (rhodopsin) | 16526 (311) | Aminergic receptors, peptide receptors, protein receptors, lipid receptors, melatonin receptors, nucleotide receptors, steroid receptors, alicarboxylic acid receptors, sensory receptors, orphan receptors, and others |
| Class B1 (secretin) | 748 (15) | Peptide receptors |
| Class B2 (adhesion) | 381 (33) | Orphan receptors |
| Class C (glutamate) | 1038 (22) | Ion receptors, amino acid receptors, sensory receptors, and orphan receptors |
| Class F (frizzled) | 48 (11) | Peptide receptors |
| Other GPCRs | 37 (6) | Orphan receptors |
The composition of 188D features of a protein.
| Physicochemical property | Dimensions |
|---|---|
| Amino acid composition | 20 |
| Hydrophobicity | 21 |
| Normalized Van der Waals volume | 21 |
| Polarity | 21 |
| Polarizability | 21 |
| Charge | 21 |
| Surface tension | 21 |
| Secondary structure | 21 |
| Solvent accessibility | 21 |
| Total | 188 |
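The total in the table is 20 composition values plus 8 properties with 21 values each. A minimal sketch of how one such 21-dimensional block might be computed, assuming the composition/transition/distribution (CTD) scheme (3 + 3 + 15 values) and an illustrative three-class hydrophobicity grouping. The grouping and the names `amino_acid_composition` and `ctd_descriptor` are assumptions for illustration, not taken from the paper:

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Illustrative three-class hydrophobicity grouping (an assumption; the
# table above does not list how residues are grouped).
HYDROPHOBICITY = {"1": "RKEDQN", "2": "GASTPHY", "3": "CLVIMFW"}


def amino_acid_composition(seq):
    """Return the 20 residue frequencies of seq."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]


def ctd_descriptor(seq, groups):
    """Return 21 values: 3 composition + 3 transition + 15 distribution."""
    lookup = {aa: g for g, members in groups.items() for aa in members}
    s = "".join(lookup[aa] for aa in seq if aa in lookup)
    if len(s) < 2:
        raise ValueError("need at least two classifiable residues")
    labels = sorted(groups)
    n = len(s)

    # Composition: fraction of residues falling in each class.
    comp = [s.count(g) / n for g in labels]

    # Transition: fraction of adjacent pairs that switch between two classes.
    pairs = list(zip(s, s[1:]))
    trans = []
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            hits = sum(1 for x, y in pairs if {x, y} == {a, b})
            trans.append(hits / len(pairs))

    # Distribution: relative position of the first, 25%, 50%, 75% and last
    # occurrence of each class (0.0 if the class never occurs).
    dist = []
    for g in labels:
        pos = [i + 1 for i, c in enumerate(s) if c == g]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            if pos:
                k = max(0, math.ceil(frac * len(pos)) - 1)
                dist.append(pos[k] / n)
            else:
                dist.append(0.0)
    return comp + trans + dist


seq = "MKRLLVAGCTWYEDQNGASTPHFI"  # toy sequence, not a real GPCR
features = amino_acid_composition(seq) + ctd_descriptor(seq, HYDROPHOBICITY)
print(len(features))  # 41 = 20 + 21; eight property blocks would give 188
```

Repeating `ctd_descriptor` with a grouping for each of the eight properties and concatenating the results would yield the 188 dimensions listed above.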
Numbers of positive (GPCR) and negative (non-GPCR) samples in each training and test dataset.
| Fold | Part | Number of GPCRs | Number of non-GPCRs | Total number |
|---|---|---|---|---|
| 1st | Training | 1996 | 8309 | 10305 |
| 1st | Test | 499 | 2077 | 2576 |
| 2nd | Training | 1996 | 8309 | 10305 |
| 2nd | Test | 499 | 2077 | 2576 |
| 3rd | Training | 1996 | 8309 | 10305 |
| 3rd | Test | 499 | 2077 | 2576 |
| 4th | Training | 1996 | 8309 | 10305 |
| 4th | Test | 499 | 2077 | 2576 |
| 5th | Training | 1996 | 8308 | 10304 |
| 5th | Test | 499 | 2078 | 2577 |
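The counts in the table are consistent with a stratified five-fold split of 2495 GPCRs and 10386 non-GPCRs (the per-class sums of any training/test pair). A minimal sketch that reproduces the sizes; the five-fold interpretation is inferred from the numbers, and `fold_sizes` is a hypothetical helper:

```python
def fold_sizes(n_samples, n_folds):
    """Split n_samples into n_folds near-equal parts; extras go to the last folds."""
    base, extra = divmod(n_samples, n_folds)
    return [base + (1 if i >= n_folds - extra else 0) for i in range(n_folds)]


# Class totals implied by the table (training + test counts of any fold).
N_GPCR, N_NON_GPCR, K = 2495, 10386, 5

gpcr_test = fold_sizes(N_GPCR, K)
non_gpcr_test = fold_sizes(N_NON_GPCR, K)

for i, (pos, neg) in enumerate(zip(gpcr_test, non_gpcr_test), start=1):
    # Everything outside the test fold is used for training.
    train_pos, train_neg = N_GPCR - pos, N_NON_GPCR - neg
    print(f"fold {i}: train {train_pos}+{train_neg}={train_pos + train_neg}, "
          f"test {pos}+{neg}={pos + neg}")
```

Which fold receives the one leftover non-GPCR sequence is arbitrary; here it is placed in the last fold to match the table.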
Performance measures for the random forest classifier trained on SVM-Prot features.
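| Measure | Formula | Meaning |
|---|---|---|
| Sensitivity (Sn) | $\frac{TP}{TP + FN}$ | Measure to avoid type II error |
| Specificity (Sp) | $\frac{TN}{TN + FP}$ | Measure to avoid type I error |
| Accuracy (Acc) | $\frac{TP + TN}{TP + TN + FP + FN}$ | Measure of correctness |
| Matthews correlation coefficient (MCC) | $\frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$ | Correlation coefficient |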
TP (true positive) stands for the number of true GPCRs that are predicted correctly, TN (true negative) stands for the number of true non-GPCRs that are predicted correctly, FP (false positive) is the number of true non-GPCRs that are incorrectly predicted to be GPCRs, and FN (false negative) is the number of true GPCRs that are incorrectly predicted to be non-GPCRs.
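The four measures can be computed directly from the confusion counts defined above. A minimal sketch, assuming the standard definitions of sensitivity, specificity, accuracy, and the Matthews correlation coefficient; the function name `classification_measures` and the counts are made up for illustration and are not results from the paper:

```python
import math


def classification_measures(tp, tn, fp, fn):
    """Return Sn, Sp, Acc and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                      # fraction of true GPCRs recovered
    sp = tn / (tn + fp)                      # fraction of non-GPCRs rejected
    acc = (tp + tn) / (tp + tn + fp + fn)    # fraction of all calls that are right
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the confusion matrix is empty;
    # returning 0.0 in that case is a common convention.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


# Hypothetical counts, chosen only to exercise the formulas.
measures = classification_measures(tp=90, tn=180, fp=20, fn=10)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

Unlike accuracy, MCC uses all four counts symmetrically, which makes it more informative when the two classes differ greatly in size, as they do here.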
Performance measures on each test dataset, obtained with the model built from the corresponding training dataset.
| Test dataset | Sn | Sp | Acc | MCC | AUC |
|---|---|---|---|---|---|
| 1st | 0.5952 | 0.9812 | 0.7882 | 0.6248 | 0.930 |
| 2nd | 0.5832 | 0.9807 | 0.7820 | 0.6146 | 0.909 |
| 3rd | 0.6013 | 0.9620 | 0.7817 | 0.5763 | 0.879 |
| 4th | 0.7675 | 0.9726 | 0.8700 | 0.7562 | 0.943 |
| 5th | 0.9238 | 0.9654 | 0.9446 | 0.8900 | 0.980 |
| Mean ± SD | 0.6942 ± 0.1491 | 0.9724 ± 0.0087 | 0.8333 ± 0.0726 | 0.6924 ± 0.1296 | 0.928 ± 0.038 |
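The Mean ± SD row can be rechecked from the five per-fold rows. A minimal sketch, assuming the SD is the sample standard deviation (n − 1 in the denominator); because the inputs are already rounded, the last digit may differ slightly from the table:

```python
from statistics import mean, stdev

# Per-fold values copied from the table above.
folds = {
    "Sn": [0.5952, 0.5832, 0.6013, 0.7675, 0.9238],
    "Sp": [0.9812, 0.9807, 0.9620, 0.9726, 0.9654],
    "Acc": [0.7882, 0.7820, 0.7817, 0.8700, 0.9446],
    "MCC": [0.6248, 0.6146, 0.5763, 0.7562, 0.8900],
    "AUC": [0.930, 0.909, 0.879, 0.943, 0.980],
}

# statistics.stdev computes the sample standard deviation.
summary = {name: (mean(values), stdev(values)) for name, values in folds.items()}

for name, (m, s) in summary.items():
    print(f"{name}: {m:.4f} ± {s:.4f}")
```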
AUC, also called the receiver operating characteristic (ROC) area, is the area under the ROC curve and serves as a measure of the accuracy of a classification model.
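The area under the ROC curve equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. A minimal sketch of that pairwise computation; `auc_from_scores` is a hypothetical helper and the labels and scores are toy values, not data from the paper:

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 1 = GPCR, 0 = non-GPCR; scores are classifier confidences.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]
auc = auc_from_scores(labels, scores)
print(f"AUC = {auc:.4f}")
```

The double loop is quadratic in the number of samples, which is fine for illustration; a rank-based formulation gives the same value more efficiently on large test sets.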
Figure 1: The discovered motifs of human GPCRs from the MEME system (for details see Table 6). (a) MEME run showing the combined block diagram for the distribution of the top ten motifs, with the corresponding sequence IDs and E-values (E-value threshold: 0.01, showing 31 GPCR sequences). (b) The ten motif logos found by MEME.
Top 10 conserved motifs of human GPCR sequences found by the MEME system.
| Motif | Width | | Best possible match |
|---|---|---|---|
| 1 | 40 | 4.3 | KMACTIMAMFLHYFYLAAFFWMLIEGLHLYLMAVMVWHHE |
| 2 | 29 | 1.5 | VMHYLFTIFNSFQGFFIFIFHCLLNRQVR |
| 3 | 41 | 4.4 | CLDRPIPPCRSLCERARQGCEPLMNKFGFPWPEMMKCDKFP |
| 4 | 50 | 5.3 | VITWVGIIISLVCLLICIFTFLFCRAIQNTRTSIHKNLCICLFLAHLLFL |
| 5 | 21 | 3.8 | NKTHTTCRCNHLTNFAVLMAH |
| 6 | 29 | 1.0 | GTDKRCWLHLDKGFIWSFIGPVCVIILVN |
| 7 | 50 | 3.9 | IFFIITLWIMKRHLSSLNPEVSTLQNTRMWAFKAFAQLFILGCTWCFGIL |
| 8 | 29 | 1.8 | LQVHQWYPLVKKQCHPDLKFFLCSMYAPV |
| 9 | 29 | 1.6 | CQPIDIPLCHDIGYNQMIMPNLLNHETQE |
| 10 | 50 | 2.0 | MKHDGTKTEKLEKLMIRIGVFSVLYTVPATIVIACYFYEQAFRDHWERTW |
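One rough way to see which motifs could sit in hydrophobic transmembrane helices is to average a hydropathy scale over each best-match string. A minimal sketch using the Kyte-Doolittle scale; the choice of scale and the helper `mean_hydropathy` are assumptions made here for illustration and are not part of the MEME analysis in the paper:

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Motif number -> (width, best possible match), copied from the table above.
MOTIFS = {
    3: (41, "CLDRPIPPCRSLCERARQGCEPLMNKFGFPWPEMMKCDKFP"),
    4: (50, "VITWVGIIISLVCLLICIFTFLFCRAIQNTRTSIHKNLCICLFLAHLLFL"),
    7: (50, "IFFIITLWIMKRHLSSLNPEVSTLQNTRMWAFKAFAQLFILGCTWCFGIL"),
}


def mean_hydropathy(seq):
    """Average Kyte-Doolittle hydropathy over all residues of seq."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


hydropathy = {}
for motif_id, (width, seq) in MOTIFS.items():
    # Sanity check: the string length should equal the Width column.
    if len(seq) != width:
        raise ValueError(f"motif {motif_id}: width mismatch")
    hydropathy[motif_id] = mean_hydropathy(seq)
    print(f"motif {motif_id}: width {width}, mean hydropathy {hydropathy[motif_id]:.2f}")
```

On this scale motif 4 comes out hydrophobic on average and motif 3 hydrophilic, which fits a transmembrane versus non-membrane location, but a single averaged score is only a coarse indicator.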