Fang Ge1, Yi-Heng Zhu1, Jian Xu1, Arif Muhammad1,2, Jiangning Song3,4, Dong-Jun Yu1.
Abstract
Transmembrane proteins have critical biological functions and play a role in a multitude of cellular processes, including cell signaling and the transport of molecules and ions across membranes. Approximately 60% of transmembrane proteins are considered drug targets. Missense mutations in such proteins can lead to many diverse diseases and disorders, such as neurodegenerative diseases and cystic fibrosis. However, there are limited studies on mutations in transmembrane proteins. In this work, we first design a new feature encoding method, termed weight attenuation position-specific scoring matrix (WAPSSM), which builds upon protein evolutionary information. Then, we propose a new mutation prediction algorithm (cascade XGBoost) by leveraging ideas from consensus predictors and gcForest. Multi-level experiments illustrate the effectiveness of the WAPSSM and cascade XGBoost algorithms. Finally, based on WAPSSM and three other types of features, in combination with the cascade XGBoost algorithm, we develop a new transmembrane protein mutation predictor, named MutTMPredictor. We benchmark the performance of MutTMPredictor against several existing predictors on seven datasets. On the 546 mutations dataset, MutTMPredictor achieves an accuracy (ACC) of 0.9661 and a Matthews correlation coefficient (MCC) of 0.8950. On the 67,584 mutations dataset, MutTMPredictor achieves an MCC of 0.7523 and an area under the curve (AUC) of 0.8746, which are 0.1625 and 0.0801 higher, respectively, than those of the best existing predictor (fathmm). In addition, MutTMPredictor outperforms two specific predictors on the Pred-MutHTP datasets. The results suggest that MutTMPredictor can be used as an effective method for predicting and prioritizing missense mutations in transmembrane proteins. The MutTMPredictor webserver and datasets are freely accessible at http://csbio.njust.edu.cn/bioinf/muttmpredictor/ for academic use.
Keywords: 1000 Genomes, 1000 genomes project consortium; APOGEE, pathogenicity prediction through the logistic model tree; BorodaTM, boosted regression trees for disease-associated mutations in transmembrane proteins; COSMIC, catalogue of somatic mutations in cancer; Cascade XGBoost; ClinVar, clinical variants; Condel, consensus deleteriousness score of missense mutations; Disease-associated mutations; Entprise, entropy and predicted protein structure; ExAC, the exome aggregation consortium; Meta-SNP, meta single nucleotide polymorphism; Mutation prediction; PROVEAN, protein variation effect analyzer; PolyPhen, polymorphism phenotyping; PolyPhen-2, polymorphism phenotyping v2; Pred-MutHTP, prediction of mutations in human transmembrane proteins; PredictSNP1, predict single nucleotide polymorphism v1; Protein evolutionary information; REVEL, rare exome variant ensemble learner; SDM, site-directed mutate; SIFT, sorting intolerant from tolerant; SNAP, screening for non-acceptable polymorphisms; SNP&GO, single nucleotide polymorphisms and gene ontology annotations; SwissVar, variants in UniProtKB/Swiss-Prot; TMSNP, transmembrane single nucleotide polymorphisms; Transmembrane protein; WEKA, waikato environment for knowledge analysis; fathmm, functional analysis through hidden markov models; humsavar, human polymorphisms and disease mutations
Year: 2021 PMID: 34938415 PMCID: PMC8649221 DOI: 10.1016/j.csbj.2021.11.024
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Statistical summary of the seven benchmark datasets used in this study.
| Order | Name | Disease mutations (proteins with mutations) | Neutral mutations (proteins with mutations) | Note |
|---|---|---|---|---|
| 1 | 546 mutations | 392 (31) | 154 (51) | From BorodaTM |
| 2 | Whole data*(Pred-MutHTP) | 11,846 (1,014) | 9,533 (2,958) | From Pred-MutHTP |
| 3 | Cytoplasmic or inside | 4,416 (625) | 2,958 (1,513) | From Pred-MutHTP |
| 4 | Membrane | 2,421 (454) | 1,285 (853) | From Pred-MutHTP |
| 5 | Extracellular or outside | 4,948 (677) | 5,083 (1,800) | From Pred-MutHTP |
| 6 | 67,584 mutations | 29,020 (2,581) | 38,564 (11,597) | From BorodaTM |
| 7 | TMSNP | 2,624 (354), 437 likely (143) | 196,705 (2,924) | From TMSNP |
Note: For each dataset, the number of proteins with mutations is given in parentheses. Whole data*(Pred-MutHTP): all mutations in human transmembrane proteins are considered.
Fig. 1. An overall workflow of MutTMPredictor.
Performance evaluation of WAPSSM and the original PSSM features on the test data of the 546 mutations dataset.
| Features | ACC | MCC | Pre | F | SP | Recall | TP | TN | FP | FN |
|---|---|---|---|---|---|---|---|---|---|---|
| WAPSSM | 0.6364 | 0.2657 | 0.6190 | 0.7222 | 0.3600 | 0.8667 | 26 (47.3%) | 9 (16.4%) | 16 (29.1%) | 4 (7.27%) |
| PSSM | 0.5636 | 0.0969 | 0.5750 | 0.6571 | 0.3200 | 0.7667 | 23 (41.8%) | 8 (14.5%) | 17 (30.9%) | 7 (12.7%) |
Performance comparison of “Original”, “WAPSSM”, “Individuals’ output”, and “Combined” features on the test data of the 546 mutations dataset.
| Features | TP | TN | FP | FN | Pre | Recall | F | ACC | MCC |
|---|---|---|---|---|---|---|---|---|---|
| WAPSSM | 26 (47.3%) | 9 (16.4%) | 16 (29.1%) | 4 (7.27%) | 0.6190 | 0.8667 | 0.7222 | 0.6364 | 0.2657 |
| Original | 27 (49.1%) | 12 (21.8%) | 13 (23.6%) | 3 (5.45%) | 0.6750 | 0.9000 | 0.7714 | 0.7091 | 0.4249 |
| Individuals’ output | 28(50.91%) | 13(23.64%) | 12(21.82%) | 2(3.64%) | 0.7000 | 0.9333 | 0.8000 | 0.7455 | 0.5068 |
| Combined | 28 (50.9%) | 18 (32.7%) | 7 (12.7%) | 2 (3.64%) | 0.8000 | 0.9333 | 0.8615 | 0.8364 | 0.6763 |
Performance comparison of XGBoost and cascade XGBoost on the test data of the 546 mutations dataset.
| Model | TP | TN | FP | FN | ACC | Pre | Recall | F | MCC |
|---|---|---|---|---|---|---|---|---|---|
| XGBoost# | 28(50.9%) | 18(32.7%) | 7(12.7%) | 2(3.64%) | 0.8364 | 0.8000 | 0.9333 | 0.8615 | 0.6763 |
| cascade XGBoost# | 35(63.64%) | 13(23.64%) | 6(10.91%) | 1(1.81%) | 0.8727 | 0.8537 | 0.9722 | 0.9091 | 0.7166 |
Note: we adopted the programs in the iLearn toolkit [62] to implement the CHI2, IG, and MI methods; the comparison results and descriptions of the CHI2, IG, MI, and mRMR methods can be found in Supplementary Figs. S1 (A)-S1 (D), Tables S4-S5, and Text S3, respectively. XGBoost#: all features were used; cascade XGBoost#: the top 27 features selected by mRMR [54] were applied.
Performance comparison of MutTMPredictor and six existing predictors on 442 mutations.
| Predictor | TP | TN | FP | FN | Pre | Recall | F |
|---|---|---|---|---|---|---|---|
| fathmm# | 227(51.36%) | 51(11.54%) | 41(9.28%) | 123(27.83%) | 0.8470 | 0.6486 | 0.7346 |
| PROVEAN# | 305(69.00%) | 54(12.22%) | 38(8.60%) | 45(10.18%) | 0.8892 | 0.8714 | 0.8802 |
| SIFT# | 322(72.85%) | 50(11.31%) | 42(9.50%) | 28(6.33%) | 0.8846 | 0.9200 | 0.9020 |
| PolyPhen-2# | 326(73.76%) | 45(10.18%) | 47(10.63%) | 24(5.43%) | 0.8740 | 0.9314 | 0.9018 |
| Entprise | 299(54.76%) | 168(30.77%) | 46(8.42%) | 33(6.04%) | 0.8667 | 0.9006 | 0.8833 |
| BorodaTM | 360(65.93%) | 151(27.61%) | 3(0.56%) | 32(5.86%) | 0.9917 | 0.9184 | 0.9536 |
| MutTMPredictor | 347(78.51%) | 80(18.10%) | 12(2.71%) | 3(0.68%) | 0.9666 | 0.9914 | 0.9788 |
Note: PROVEAN#/SIFT#, http://provean.jcvi.org; PolyPhen-2#, http://genetics.bwh.harvard.edu/pph2; fathmm#, http://fathmm.biocompute.org.uk/inherited.html. BorodaTM [33] and Entprise [72] were not included when generating the outputs of the individual predictors, so the proteins used in Entprise and BorodaTM were not removed from the 546 mutations when constructing the new dataset. If no protein sequences are removed, the evaluation values of MutTMPredictor on the 546 mutations are: TP, 384 (70.33%); TN, 142 (26.01%); FP, 12 (2.20%); FN, 8 (1.47%); Pre, 0.9697; Recall, 0.9796; F, 0.9746; MCC, 0.9090; and ACC, 0.9634. In terms of the TP, FP, FN, Recall, F, MCC, and ACC values, MutTMPredictor is clearly superior to Entprise and BorodaTM on the 546 mutations dataset. To avoid confusion, we did not list the prediction results of MutTMPredictor on the 546 mutations in Table 5.
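The metric values quoted in this note follow directly from the four confusion-matrix counts. As a minimal sketch, assuming the standard definitions of precision, recall, F-measure, accuracy, and MCC (the function name `binary_metrics` is illustrative and does not come from the paper), the MutTMPredictor values on the 546 mutations can be recomputed as follows:

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    pre = tp / (tp + fp)                      # precision
    recall = tp / (tp + fn)                   # recall (sensitivity)
    f = 2 * pre * recall / (pre + recall)     # F-measure (harmonic mean)
    acc = (tp + tn) / total                   # accuracy
    # Matthews correlation coefficient
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den
    return {"Pre": pre, "Recall": recall, "F": f, "ACC": acc, "MCC": mcc}

# Counts taken from the note: TP 384, TN 142, FP 12, FN 8.
m = binary_metrics(tp=384, tn=142, fp=12, fn=8)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Rounded to four decimals, this reproduces the Pre, Recall, F, ACC, and MCC values listed in the note (0.9697, 0.9796, 0.9746, 0.9634, and 0.9090).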
Fig. 2. MCC and ACC values of fathmm, PROVEAN, SIFT, PolyPhen-2, Entprise, BorodaTM, and MutTMPredictor on 442 mutations.
Fig. 3. The 3D structure, mutation site, and residues within 5Å of the mutation site for proteins 5ER7 and 5UEN. 3(A): 3D structure of 5ER7 (i.e. P29033) and the mutation site I203 in “sticks + spheres” format, where I203T denotes that residue I at position 203 is mutated to T. The dashed box in 3(A) depicts the residues within 5Å of the mutation site I203 in sticks format. 3(B): 3D structure of 5UEN (i.e. P30542) and the mutation site R105 in “sticks + spheres” format, where R105H denotes that residue R at position 105 is mutated to H. Likewise, the dashed box in 3(B) depicts the residues within 5Å of the mutation site R105 in sticks format. Single-letter abbreviations for the 20 types of native amino acids used in Fig. 3 and this section are: G, Glycine; A, Alanine; V, Valine; L, Leucine; I, Isoleucine; P, Proline; F, Phenylalanine; Y, Tyrosine; W, Tryptophan; S, Serine; T, Threonine; C, Cysteine; M, Methionine; N, Asparagine; Q, Glutamine; D, Aspartic acid; E, Glutamic acid; K, Lysine; R, Arginine; H, Histidine.
Performance comparison of MutTMPredictor, PredictSNP, MAPP, PhDSNP, PolyPhen1, and SNAP on 546 mutations dataset.
| Predictor | ACC | Pre | Recall | F | MCC | SN | AUC | SP | NPV |
|---|---|---|---|---|---|---|---|---|---|
| PredictSNP | 0.7985 | 0.8750 | 0.8393 | 0.8568 | 0.5190 | 0.8393 | 0.7670 | 0.6948 | 0.6294 |
| MAPP | 0.7161 | 0.8496 | 0.7347 | 0.7880 | 0.3743 | 0.7347 | 0.7670 | 0.6688 | 0.4976 |
| PhDSNP | 0.7967 | 0.8539 | 0.8648 | 0.8593 | 0.4932 | 0.8648 | 0.7441 | 0.6234 | 0.6443 |
| PolyPhen1 | 0.7289 | 0.8486 | 0.7577 | 0.8005 | 0.3879 | 0.7577 | 0.7067 | 0.6558 | 0.5153 |
| SNAP | 0.7363 | 0.8780 | 0.7347 | 0.8000 | 0.4364 | 0.7347 | 0.7375 | 0.7403 | 0.5229 |
| MutTMPredictor | 0.9634 | 0.9697 | 0.9796 | 0.9746 | 0.9090 | 0.9796 | 0.9508 | 0.9221 | 0.9467 |
Note: the PredictSNP webserver provides prediction results for nine predictors: MAPP [82], nsSNPAnalyzer [84], PANTHER [85], PhD-SNP [83], PolyPhen1 [10], PolyPhen-2 [11], SIFT [6], SNAP [9], and PredictSNP [13]. In Tables 6 and 7, the results of nsSNPAnalyzer, PANTHER, SIFT, and PolyPhen-2 are not listed because (1) there were too many “unknown” outputs from nsSNPAnalyzer and PANTHER, and (2) the performance comparison with SIFT and PolyPhen-2 is discussed in the previous section.
Performance comparison of MutTMPredictor, PredictSNP, MAPP, PhDSNP, PolyPhen1, and SNAP in terms of TP, TN, FP, FN, and three types of error on 546 mutations dataset.
| Predictor | TP | TN | FP | FN | ER | FPR | FNR |
|---|---|---|---|---|---|---|---|
| PredictSNP | 329(60.26%) | 107(19.60%) | 47(8.61%) | 63(11.54%) | 0.0861 | 0.3052 | 0.1607 |
| MAPP | 288(52.75%) | 103(18.86%) | 51(9.34%) | 104(19.05%) | 0.0934 | 0.3312 | 0.2653 |
| PhDSNP | 339(62.09%) | 96(17.58%) | 58(10.62%) | 53(9.71%) | 0.1062 | 0.3766 | 0.1352 |
| PolyPhen1 | 297(54.40%) | 101(18.50%) | 53(9.71%) | 95(17.40%) | 0.0971 | 0.3442 | 0.2423 |
| SNAP | 288(52.75%) | 114(20.88%) | 40(7.33%) | 104(19.05%) | 0.0733 | 0.2597 | 0.2653 |
| MutTMPredictor | 384(70.33%) | 142(26.01%) | 12 (2.20%) | 8 (1.47%) | 0.0220 | 0.0779 | 0.0204 |
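The three error columns in this comparison are consistent with the confusion-matrix counts in the same rows. The sketch below assumes that the first column is the share of false positives among all mutations (called ER in a later note), the second is the false positive rate, and the third is the false negative rate. These definitions are inferred from the reported numbers rather than quoted from the paper, and the function name `error_rates` is illustrative.

```python
def error_rates(tp, tn, fp, fn):
    """Three error measures derived from confusion-matrix counts.

    Assumed definitions (inferred from the reported values):
      ER  = FP / (TP + TN + FP + FN)
      FPR = FP / (FP + TN)
      FNR = FN / (FN + TP)
    """
    total = tp + tn + fp + fn
    return {
        "ER": fp / total,
        "FPR": fp / (fp + tn),
        "FNR": fn / (fn + tp),
    }

# MutTMPredictor row: TP 384, TN 142, FP 12, FN 8.
rates = error_rates(tp=384, tn=142, fp=12, fn=8)
print({name: round(value, 4) for name, value in rates.items()})
```

Under these assumptions the MutTMPredictor row (0.0220, 0.0779, 0.0204) is reproduced, and the same holds for the PredictSNP row (0.0861, 0.3052, 0.1607).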
Performance comparison of MutTMPredictor, Pred-MutHTP, and mCSM-membrane on the 546 mutations dataset.
| Predictor | TP | TN | FP | FN | ACC | Pre | Recall | F | SP | NPV |
|---|---|---|---|---|---|---|---|---|---|---|
| Pred-MutHTP (546 mutations) | 362(66.30%) | 103(18.86%) | 50(9.16%) | 31(5.68%) | 0.8516 | 0.8786 | 0.9211 | 0.8994 | 0.6732 | 0.7687 |
| mCSM-membrane (546 mutations) | 322(59.08%) | 110(20.18%) | 42(7.71%) | 71(13.03%) | 0.7927 | 0.8846 | 0.8193 | 0.8507 | 0.7237 | 0.6077 |
| mCSM-membrane (445 mutations) | 321(72.13%) | 111(24.94%) | 12(2.70%) | 1(0.22%) | 0.9708 | 0.9640 | 0.9969 | 0.9802 | 0.9024 | 0.9911 |
| MutTMPredictor (546 mutations) | 384(70.33%) | 142(26.01%) | 12(2.20%) | 8 (1.47%) | 0.9634 | 0.9697 | 0.9796 | 0.9746 | 0.9221 | 0.9467 |
Pred-MutHTP: https://www.iitm.ac.in/bioinfo/PredMutHTP/; mCSM-membrane: http://biosig.unimelb.edu.au/mcsm_membrane/.
Performance comparison of MutTMPredictor, Pred-MutHTP, and mCSM-membrane in terms of three types of error on 546 mutations dataset.
| Predictor | ER | FPR | FNR |
|---|---|---|---|
| Pred-MutHTP (546 mutations) | 0.0916 | 0.3268 | 0.0789 |
| mCSM-membrane (546 mutations) | 0.0771 | 0.2763 | 0.1807 |
| mCSM-membrane (445 mutations) | 0.0270 | 0.0976 | 0.0031 |
| MutTMPredictor (546 mutations) | 0.0220 | 0.0779 | 0.0204 |
Fig. 4. Performance comparison of Pred-MutHTP, mCSM-membrane, and MutTMPredictor in terms of MCC and AUC on 546 mutations dataset.
Confusion matrix and three types of errors of MutTMPredictor and four existing predictors on 67,584 mutations dataset.
| Predictor | TP | TN | FP | FN | ER | FPR | FNR |
|---|---|---|---|---|---|---|---|
| SIFT# | 24290(35.94%) | 25031(37.04%) | 13533(20.02%) | 4730(7.00%) | 0.2002 | 0.3509 | 0.1630 |
| PolyPhen-2# | 25638(38.13%) | 23864(35.49%) | 14695(21.85%) | 3043(4.53%) | 0.2185 | 0.3811 | 0.1061 |
| PROVEAN# | 23665(35.02%) | 28240(41.79%) | 10324(15.28%) | 5355(7.92%) | 0.1527 | 0.2677 | 0.1845 |
| fathmm# | 22069(32.65%) | 31948(47.27%) | 6616(9.79%) | 6915(10.28%) | 0.0979 | 0.1716 | 0.2386 |
| MutTMPredictor | 24743(36.61%) | 34572(51.15%) | 3992(5.91%) | 4277(6.33%) | 0.0591 | 0.1035 | 0.1474 |
Performance evaluation of MutTMPredictor and four existing predictors on 67,584 mutations dataset.
| Predictor | ACC | Pre | Recall | F | SP | NPV |
|---|---|---|---|---|---|---|
| SIFT# | 0.7298 | 0.6422 | 0.8370 | 0.7268 | 0.6491 | 0.8411 |
| PolyPhen-2# | 0.7362 | 0.6357 | 0.8939 | 0.7430 | 0.6189 | 0.8869 |
| PROVEAN# | 0.7680 | 0.6963 | 0.8155 | 0.7512 | 0.7323 | 0.8406 |
| fathmm# | 0.7993 | 0.7694 | 0.7605 | 0.7649 | 0.8284 | 0.8213 |
| MutTMPredictor | 0.8776 | 0.8641 | 0.8526 | 0.8567 | 0.8965 | 0.8914 |
Note: PROVEAN#: ; PolyPhen-2#: ; fathmm#: .
Fig. 5. Performance assessment of five predictors in terms of the MCC and AUC values on 67,584 mutations dataset.
Performance evaluation of MutTMPredictor and Pred-MutHTP in terms of confusion matrix and three types of errors on mutations located in three different topological regions of membrane proteins.
| Dataset | Predictor | Num of fea# | Validation# | TP | TN | FP | FN | ER | FPR | FNR |
|---|---|---|---|---|---|---|---|---|---|---|
| Whole data | MutTMPredictor | 61 | 10-fold | 10501(49.72%) | 7803(36.98%) | 1523(7.20%) | 1284(6.11%) | 0.0720 | 0.1629 | 0.1094 |
| | | | test | 2175(51.50%) | 1569(37.15%) | 288(6.82%) | 191(4.52%) | 0.0682 | 0.1551 | 0.0807 |
| | Pred-MutHTP | 20 | 10-fold-group-wise | 9130(42.71%) | 6822(31.91%) | 2593(12.13%) | 6822(13.25%) | 0.1213 | 0.2754 | 0.2368 |
| | | | test | 1380(32.28%) | 1972(46.13%) | 537(12.56%) | 386(9.03%) | 0.1256 | 0.214 | 0.2186 |
| Cytoplasmic or Inside | MutTMPredictor | 20 | 10-fold | 3856(52.24%) | 2289(31.07%) | 669(9.09%) | 560(7.60%) | 0.0909 | 0.2264 | 0.1270 |
| | | | test | 790(53.56%) | 434(29.42%) | 132(8.95%) | 119(8.07%) | 0.0895 | 0.2332 | 0.1309 |
| | Pred-MutHTP# | 15 | 10-fold | 2666(36.16%) | 2711(36.77%) | 975(13.23%) | 1020(13.84%) | 0.1323 | 0.2646 | 0.2768 |
| | | | test | 790(53.57%) | 325(22.07%) | 102(6.92%) | 257(17.44%) | 0.0692 | 0.2387 | 0.2456 |
| Membrane | MutTMPredictor | 60 | 10-fold | 2304(62.50%) | 1137(30.71%) | 148(3.80%) | 117(2.99%) | 0.0380 | 0.1102 | 0.0456 |
| | | | test | 457(61.59%) | 237(31.94%) | 36(4.85%) | 12(1.62%) | 0.0485 | 0.1319 | 0.0256 |
| | Pred-MutHTP# | 15 | 10-fold-group-wise | 2074(55.99%) | 865(23.34%) | 291(7.86%) | 474(12.81%) | 0.0786 | 0.2519 | 0.1862 |
| | | | test | 366(49.42%) | 266(36.00%) | 51(6.96%) | 56(7.62%) | 0.0696 | 0.162 | 0.1336 |
| Extracellular or Outside | MutTMPredictor | 25 | 10-fold | 4332(43.23%) | 4431(44.12%) | 652(6.47%) | 616(6.18%) | 0.0647 | 0.1280 | 0.1250 |
| | | | test | 902(44.94%) | 882(43.95%) | 114(5.68%) | 109(5.43%) | 0.0568 | 0.1145 | 0.1078 |
| | Pred-MutHTP# | 19 | 10-fold-group-wise | 1679(16.74%) | 5794(57.76%) | 1948(19.42%) | 610(6.08%) | 0.1942 | 0.2516 | 0.2665 |
| | | | test | 969(48.34%) | 579(28.90%) | 194(9.68%) | 262(13.08%) | 0.0968 | 0.2510 | 0.2129 |
Note: TP, TN, FP, and FN values of Pred-MutHTP# were calculated based on the given SN, SP, ACC, and the number of total/20% test mutations in Pred-MutHTP [31]. Based on the obtained TP, TN, FP, and FN values, we further calculated the ER, FPR, FNR, Pre, F, and NPV values of Pred-MutHTP. “Num of fea#” is the number of features used in the model prediction. Validation#: in Pred-MutHTP [31], the authors used CD-HIT [44] to aggregate sequences into ten clusters and performed 10-fold-group-wise cross-validation on the datasets. However, the authors did not provide the specific sequences in ten clusters. Herein, we applied 10-fold cross-validation to the corresponding datasets. “test” means 20% independent test.
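The note states that the Pred-MutHTP confusion-matrix counts were derived from the published SN, SP, and ACC values and the number of mutations, but does not spell out the calculation. One way such a back-calculation could be done is sketched below; it is an assumption, not the authors' documented procedure, and the function name `confusion_from_rates` is hypothetical. It solves ACC · N = SN · P + SP · (N − P) for the number of positives P and then rounds the counts.

```python
def confusion_from_rates(sn, sp, acc, total):
    """Recover TP, TN, FP, FN from SN, SP, ACC and the total count.

    Solves ACC * total = SN * P + SP * (total - P) for the number of
    positives P. The result is ill-conditioned when SN and SP are close,
    because the inputs are rounded to four decimals.
    """
    if sn == sp:
        raise ValueError("SN equals SP: class sizes cannot be recovered from ACC")
    positives = round(total * (acc - sp) / (sn - sp))
    negatives = total - positives
    tp = round(sn * positives)
    tn = round(sp * negatives)
    return {"TP": tp, "TN": tn, "FP": negatives - tn, "FN": positives - tp}

# Pred-MutHTP, "Cytoplasmic or Inside", 10-fold row:
# SN 0.7232, SP 0.7354, ACC 0.7293; the four reported counts sum to 7372.
counts = confusion_from_rates(sn=0.7232, sp=0.7354, acc=0.7293, total=7372)
print(counts)
```

With these inputs the sketch returns TP 2666, TN 2711, FP 975, and FN 1020, matching the corresponding row of the table. Other rows may differ by a few counts because of rounding in the published rates.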
Performance evaluation of MutTMPredictor and Pred-MutHTP on mutations located in three different topological regions of membrane proteins.
| Dataset | Predictor | Num of fea# | Validation# | SN | SP | ACC | MCC | AUC |
|---|---|---|---|---|---|---|---|---|
| Whole data | MutTMPredictor | 61 | 10-fold | 0.8906 | 0.8371 | 0.867 | 0.7297 | 0.8048 |
| | | | test | 0.9193 | 0.8449 | 0.8866 | 0.7693 | 0.8821 |
| | Pred-MutHTP# | 20 | 10-fold-group-wise | 0.7632 | 0.7246 | 0.7462 | 0.4800 | 0.8200 |
| | | | test | 0.7814 | 0.7860 | 0.7841 | 0.5000 | 0.8600 |
| Cytoplasmic or Inside | MutTMPredictor | 20 | 10-fold | 0.8732 | 0.7738 | 0.8333 | 0.6527 | 0.8235 |
| | | | test | 0.8691 | 0.7668 | 0.8298 | 0.6388 | 0.8179 |
| | Pred-MutHTP# | 15 | 10-fold | 0.7232 | 0.7354 | 0.7293 | 0.4500 | 0.7900 |
| | | | test | 0.7544 | 0.7613 | 0.7564 | 0.4700 | 0.8100 |
| Membrane | MutTMPredictor | 60 | 10-fold | 0.9544 | 0.8898 | 0.9321 | 0.8490 | 0.9141 |
| | | | test | 0.9744 | 0.8681 | 0.9353 | 0.8605 | 0.9213 |
| | Pred-MutHTP# | 15 | 10-fold-group-wise | 0.8138 | 0.7481 | 0.7933 | 0.5400 | 0.8400 |
| | | | test | 0.8664 | 0.8380 | 0.8542 | 0.7000 | 0.9100 |
| Extracellular or Outside | MutTMPredictor | 25 | 10-fold | 0.8750 | 0.8720 | 0.8735 | 0.7470 | 0.8889 |
| | | | test | 0.8922 | 0.8855 | 0.8889 | 0.7778 | 0.8889 |
| | Pred-MutHTP# | 19 | 10-fold-group-wise | 0.7335 | 0.7484 | 0.7450 | 0.4400 | 0.8100 |
| | | | test | 0.7871 | 0.7490 | 0.7724 | 0.5300 | 0.8400 |
Note: the SN, SP, ACC, MCC, and AUC values of Pred-MutHTP were collected from Pred-MutHTP [31].
Performance evaluation of MutTMPredictor, four non-specific, and two specific predictors on mutations located in three different topological regions of membrane proteins.
| Topology | Predictor types* | Predictor | ACC | Pre | SN | F | SP | NPV | MCC | AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| Cytoplasmic or Inside | Non-specific | SIFT | 0.6958 | 0.7480 | 0.7421 | 0.7450 | 0.6268 | 0.6194 | 0.3681 | 0.6844 |
| | | PolyPhen-2 | 0.7255 | 0.754 | 0.8039 | 0.7782 | 0.6085 | 0.6752 | 0.4207 | 0.7062 |
| | | PROVEAN | 0.7017 | 0.7850 | 0.6911 | 0.7351 | 0.7174 | 0.6087 | 0.4010 | 0.7042 |
| | | fathmm | 0.7529 | 0.8182 | 0.7552 | 0.7854 | 0.7495 | 0.6722 | 0.4975 | 0.7524 |
| | Specific | Pred-MutHTP# | 0.7293 | 0.7321 | 0.7232 | 0.7276 | 0.7354 | 0.7265 | 0.4500 | 0.7900 |
| | | TMSNP(1.91%)# | 0.9362 | 0.9603 | 0.9680 | 0.9641 | 0.6875 | 0.7333 | 0.6743 | 0.8277 |
| | | MutTMPredictor# | 0.8333 | 0.8537 | 0.8732 | 0.8627 | 0.7738 | 0.8049 | 0.6527 | 0.8235 |
| Membrane | Non-specific | SIFT | 0.7563 | 0.7906 | 0.853 | 0.8206 | 0.5743 | 0.6746 | 0.4458 | 0.6844 |
| | | PolyPhen-2 | 0.7760 | 0.7911 | 0.8930 | 0.8390 | 0.5556 | 0.7338 | 0.4853 | 0.7062 |
| | | PROVEAN | 0.7542 | 0.811 | 0.8133 | 0.8121 | 0.6428 | 0.6463 | 0.4567 | 0.7042 |
| | | fathmm | 0.7669 | 0.8978 | 0.7257 | 0.8026 | 0.8444 | 0.6204 | 0.5435 | 0.7524 |
| | Specific | Pred-MutHTP# | 0.7933 | 0.8769 | 0.8138 | 0.8442 | 0.7481 | 0.6457 | 0.5400 | 0.8400 |
| | | TMSNP(49.92%)# | 0.8919 | 0.9606 | 0.9126 | 0.9360 | 0.7581 | 0.5732 | 0.5983 | 0.8353 |
| | | MutTMPredictor# | 0.9321 | 0.9426 | 0.9544 | 0.9485 | 0.8898 | 0.9113 | 0.8490 | 0.9141 |
| Extracellular or Outside | Non-specific | SIFT | 0.7006 | 0.6863 | 0.7239 | 0.7046 | 0.6779 | 0.7161 | 0.4022 | 0.7009 |
| | | PolyPhen-2 | 0.7359 | 0.6993 | 0.8153 | 0.7528 | 0.6587 | 0.7855 | 0.4793 | 0.7370 |
| | | PROVEAN | 0.7173 | 0.7189 | 0.7009 | 0.7098 | 0.7332 | 0.7158 | 0.4344 | 0.7171 |
| | | fathmm | 0.7450 | 0.8331 | 0.6041 | 0.7003 | 0.8822 | 0.6959 | 0.5072 | 0.7431 |
| | Specific | Pred-MutHTP# | 0.7450 | 0.4629 | 0.7450 | 0.5676 | 0.7484 | 0.9047 | 0.4400 | 0.8100 |
| | | TMSNP(1.83%)# | 0.9185 | 0.9688 | 0.9394 | 0.9538 | 0.7368 | 0.5833 | 0.6110 | 0.8381 |
| | | MutTMPredictor# | 0.8735 | 0.8697 | 0.8750 | 0.8724 | 0.8720 | 0.8772 | 0.7470 | 0.8889 |
Note: The evaluation values of Pred-MutHTP# and MutTMPredictor# are from the “10-fold”/“10-fold-group-wise” rows in Table 13. TMSNP#: we downloaded the entire TMSNP database (i.e. TMSNPdb_2021-09-17.csv) and searched for each mutation in the “Cytoplasmic or Inside”, “Membrane”, and “Extracellular or Outside” datasets. As many mutations were not stored in the TMSNP database, we calculated the evaluation metrics based on the mutations that were found. Values in parentheses after TMSNP# denote the ratio of the number of mutations stored in TMSNP to the total number in the dataset. For example, TMSNP (49.92%)# means that 49.92% of mutations in the “Membrane” dataset can be found in the TMSNP database.
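The coverage percentages attached to TMSNP# can be checked against numbers reported elsewhere in this record. The sketch below assumes that the number of mutations found in TMSNP equals the sum of the TP, TN, FP, and FN counts in the TMSNP rows of the confusion-matrix comparison, and that the dataset size is the sum of disease and neutral mutations in the benchmark summary table. The variable and function names are illustrative.

```python
# Dataset sizes: disease + neutral mutations from the benchmark summary table.
dataset_sizes = {
    "Cytoplasmic or Inside": 4416 + 2958,
    "Membrane": 2421 + 1285,
    "Extracellular or Outside": 4948 + 5083,
}

# Mutations found in TMSNP: TP + TN + FP + FN from the TMSNP rows
# of the confusion-matrix comparison (assumed to be all matched mutations).
found_in_tmsnp = {
    "Cytoplasmic or Inside": 121 + 11 + 5 + 4,
    "Membrane": 1462 + 188 + 60 + 140,
    "Extracellular or Outside": 155 + 14 + 5 + 10,
}

def coverage_percent(found, total):
    """Percentage of a dataset's mutations that are present in the database."""
    return 100.0 * found / total

coverage = {
    name: round(coverage_percent(found_in_tmsnp[name], size), 2)
    for name, size in dataset_sizes.items()
}
print(coverage)
```

Under these assumptions the computed values are 1.91%, 49.92%, and 1.83%, in agreement with the percentages shown next to TMSNP# in the tables.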
Performance comparison of MutTMPredictor, four non-specific, and two specific predictors in terms of the confusion matrix and three types of errors for predicting the mutations located in three different topological regions of membrane proteins.
| Topology | Predictor types* | Predictor | TP | TN | FP | FN | ER | FPR | FNR |
|---|---|---|---|---|---|---|---|---|---|
| Cytoplasmic or Inside region | Non-specific | SIFT | 3277(44.44%) | 1854(25.14%) | 1104(14.97%) | 1139(15.45%) | 0.1497 | 0.3732 | 0.2579 |
| | | PolyPhen-2 | 3550(48.14%) | 1800(24.41%) | 1158(15.70%) | 866(11.74%) | 0.1570 | 0.3915 | 0.1961 |
| | | PROVEAN | 3052(41.39%) | 2122(28.78%) | 836(11.34%) | 1364(18.50%) | 0.1134 | 0.2826 | 0.3089 |
| | | fathmm | 3335(45.23%) | 2217(30.07%) | 741(10.05%) | 1081(14.66%) | 0.1005 | 0.2505 | 0.2448 |
| | Specific | Pred-MutHTP | 2666(36.16%) | 2711(36.77%) | 975(13.23%) | 1020(13.84%) | 0.1323 | 0.2646 | 0.2768 |
| | | TMSNP(1.91%)# | 121(85.82%) | 11(7.80%) | 5(3.55%) | 4(2.84%) | 0.0355 | 0.3125 | 0.0320 |
| | | MutTMPredictor | 3856(52.24%) | 2289(31.07%) | 669(9.09%) | 560(7.60%) | 0.0909 | 0.2264 | 0.1270 |
| Membrane | Non-specific | SIFT | 2065(55.72%) | 738(19.91%) | 547(14.76%) | 356(9.61%) | 0.1476 | 0.4257 | 0.147 |
| | | PolyPhen-2 | 2162(58.34%) | 714(19.27%) | 571(15.41%) | 259(6.99%) | 0.1541 | 0.4444 | 0.107 |
| | | PROVEAN | 1969(53.13%) | 826(22.29%) | 459(12.39%) | 452(12.20%) | 0.1239 | 0.3572 | 0.1867 |
| | | fathmm | 1757(47.41%) | 1085(29.28%) | 200(5.40%) | 664(17.92%) | 0.054 | 0.1556 | 0.2743 |
| | Specific | Pred-MutHTP | 2074(55.99%) | 865(23.34%) | 291(7.86%) | 474(12.81%) | 0.0786 | 0.2519 | 0.1862 |
| | | TMSNP(49.92%)# | 1462(79.03%) | 188(10.16%) | 60(3.24%) | 140(7.57%) | 0.0324 | 0.2419 | 0.0874 |
| | | MutTMPredictor | 2304(62.50%) | 1137(30.71%) | 148(3.80%) | 117(2.99%) | 0.0380 | 0.1102 | 0.0456 |
| Extracellular or Outside | Non-specific | SIFT | 3582(35.71%) | 3446(34.35%) | 1637(16.32%) | 1366(13.62%) | 0.1632 | 0.3221 | 0.2761 |
| | | PolyPhen-2 | 4034(40.22%) | 3348(33.38%) | 1735(17.30%) | 914(9.11%) | 0.173 | 0.3413 | 0.1847 |
| | | PROVEAN | 3468(34.57%) | 3727(37.15%) | 1356(13.52%) | 1480(14.75%) | 0.1352 | 0.2668 | 0.2991 |
| | | fathmm | 2989(29.80%) | 4484(44.70%) | 599(5.97%) | 1959(19.53%) | 0.0597 | 0.1178 | 0.3959 |
| | Specific | Pred-MutHTP | 1679(16.74%) | 5794(57.76%) | 1948(19.42%) | 610(6.08%) | 0.1942 | 0.2516 | 0.2665 |
| | | TMSNP(1.83%)# | 155(84.24%) | 14(7.61%) | 5(2.72%) | 10(5.43%) | 0.0272 | 0.2632 | 0.0606 |
| | | MutTMPredictor | 4332(43.23%) | 4431(44.12%) | 652(6.47%) | 616(6.18%) | 0.0647 | 0.1280 | 0.1250 |
| Gaussian WAPSSM algorithm |
|---|
| Cascade XGBoost algorithm |
|---|