Phasit Charoenkwan1, Warot Chotpatiwetchkul2, Vannajan Sanghiran Lee3, Chanin Nantasenamat4, Watshara Shoombuatong5.
Abstract
Owing to their ability to maintain a thermodynamically stable fold at extremely high temperatures, thermophilic proteins (TPPs) play a critical role in basic research and a variety of applications in the food industry. As a result, the development of computational models for rapidly and accurately identifying novel TPPs from a large number of uncharacterized protein sequences is desirable. Although computational models have already been developed for characterizing thermophilic proteins, their performance and interpretability remain unsatisfactory. We present a novel sequence-based thermophilic protein predictor, termed SCMTPP, for improving model predictability and interpretability. First, an up-to-date and high-quality dataset consisting of 1853 TPPs and 3233 non-TPPs was compiled from published literature. Second, the SCMTPP predictor was created by combining the scoring card method (SCM) with estimated propensity scores of g-gap dipeptides. Benchmarking experiments revealed that SCMTPP had a cross-validation accuracy of 0.883, which was comparable to that of a support vector machine-based predictor (0.906-0.910) and 2-17% higher than that of commonly used machine learning models. Furthermore, SCMTPP outperformed the state-of-the-art approach (ThermoPred) on the independent test dataset, with accuracy and MCC of 0.865 and 0.731, respectively. Finally, the SCMTPP-derived propensity scores were used to elucidate the critical physicochemical properties for protein thermostability enhancement. In terms of interpretability and generalizability, comparative results showed that SCMTPP was effective for identifying and characterizing TPPs. We have implemented the proposed predictor as a user-friendly online web server at http://pmlabstack.pythonanywhere.com/SCMTPP to allow easy access to the model.
SCMTPP is expected to be a powerful tool for facilitating community-wide efforts to identify TPPs on a large scale and guiding experimental characterization of TPPs.
Year: 2021 PMID: 34893688 PMCID: PMC8664844 DOI: 10.1038/s41598-021-03293-w
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Summary of existing ML-based models for thermophilic protein prediction.
| Author (year) | Classifier (a) | Features (b) | Evaluation strategy (c) | Web server availability (d) |
|---|---|---|---|---|
| Zhang et al.[ | PLS | AAC | 5CV/IND | No |
| Zhang et al.[ | LogitBoost | AAC | 5CV/IND | No |
| Gromiha et al.[ | NN | AAC | 5CV/IND | No |
| Montanucci et al.[ | SVM | AAC, DPC | 5CV | Not accessible |
| Lin et al.[ | SVM | AAC, GGAC | Jackknife | Yes |
| Wang et al.[ | SVM | AAC, DPC, PCP, CTD | 5CV | No |
| Nakariyakul et al.[ | SVM | AAC, DPC | 5CV/IND | No |
| Zuo et al.[ | KNN | AAC | Jackknife | Not accessible |
| Wang et al.[ | SVM | AAC, GGAC | 5CV/IND | No |
| Fan et al.[ | SVM | AAC, pka, PSSM | 10CV/IND | No |
| Tang et al.[ | SVM | k-mer | 5CV | No |
| Feng et al.[ | SVM | ACC, DPC, PCP, RAAC | 10CV/IND | No |
| Charoenkwan et al. (this study) | SCM | DPS | 10CV/IND | Yes |
(a) KNN, k-nearest neighbor; NN, neural networks; PLS, partial least-squares regression; SVM, support vector machine.
(b) AAC, amino acid composition; CTD, composition-transition-distribution; DPC, dipeptide composition; DPS, dipeptide propensity scores; GGAC, g-gap dipeptide composition; k-mer, fragment-based technique; pka, acid dissociation constant; PCP, physicochemical properties; PseAAC, pseudo amino acid composition; PSSM, position-specific scoring matrix; RAAC, reduced amino acid composition; TC, tripeptide composition.
(c) 5CV, fivefold cross-validation; 10CV, tenfold cross-validation; Jackknife, jackknife cross-validation; IND, independent test.
(d) Not accessible: the web server was not functional during the preparation of this manuscript.
Figure 1. Schematic framework of the development of SCMTPP. The workflow can be summarized in five main steps: (i) preparation of training and independent test datasets, (ii) feature extraction, (iii) SCM-based model development, (iv) TPP characterization and (v) SCMTPP web server construction.
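
Steps (ii) and (iii) can be illustrated with a minimal sketch of how a scoring-card-style classifier might score a sequence. This is not the authors' implementation: the function names, the use of a composition-weighted sum of dipeptide propensity scores, the strict `>` comparison and the toy propensity values are all assumptions made for illustration. Only the threshold of 418 comes from the tables and captions in this record.

```python
from collections import Counter

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def ggap_dipeptide_composition(sequence, g=0):
    """Return the fraction of each residue pair separated by g positions.

    g=0 gives ordinary (adjacent) dipeptide composition.
    """
    seq = sequence.upper()
    pairs = [seq[i] + seq[i + g + 1] for i in range(len(seq) - g - 1)]
    # Skip pairs containing non-standard residues (assumption).
    pairs = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    if not pairs:
        return {}
    counts = Counter(pairs)
    total = len(pairs)
    return {dipeptide: n / total for dipeptide, n in counts.items()}


def scm_score(sequence, propensity, g=0):
    """Composition-weighted sum of dipeptide propensity scores (hypothetical form)."""
    composition = ggap_dipeptide_composition(sequence, g)
    return sum(frac * propensity.get(dp, 0.0) for dp, frac in composition.items())


def predict_tpp(sequence, propensity, threshold=418.0, g=0):
    """Label a sequence as thermophilic when its score exceeds the threshold."""
    return scm_score(sequence, propensity, g) > threshold


# Toy propensity table with made-up values, for illustration only.
toy_propensity = {"AC": 600.0, "CA": 300.0}
toy_score = scm_score("ACAC", toy_propensity)
toy_label = predict_tpp("ACAC", toy_propensity)
```

In a real model the propensity table would hold one optimized score for each of the 400 dipeptides rather than the two toy entries used here.
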
Cross-validation results of SCM models using different optimal propensity scores of g-gap dipeptides.
| No. | R | Cutoff | ACC | Sn | Sp | MCC | AUC |
|---|---|---|---|---|---|---|---|
| 0 | 0.650 | 418 | 0.883 | 0.878 | 0.887 | 0.766 | 0.926 |
| 1 | 0.592 | 420 | 0.872 | 0.879 | 0.865 | 0.744 | 0.918 |
| 2 | 0.634 | 414 | 0.867 | 0.865 | 0.868 | 0.734 | 0.919 |
| 3 | 0.653 | 412 | 0.869 | 0.864 | 0.874 | 0.739 | 0.916 |
| 4 | 0.602 | 417 | 0.865 | 0.867 | 0.862 | 0.730 | 0.918 |
| 5 | 0.601 | 416 | 0.867 | 0.873 | 0.861 | 0.735 | 0.918 |
| 6 | 0.601 | 407 | 0.865 | 0.862 | 0.868 | 0.730 | 0.913 |
| 7 | 0.664 | 415 | 0.862 | 0.885 | 0.840 | 0.726 | 0.911 |
| 8 | 0.668 | 415 | 0.862 | 0.848 | 0.875 | 0.724 | 0.912 |
| 9 | 0.585 | 425 | 0.861 | 0.885 | 0.837 | 0.724 | 0.909 |
| Mean | 0.625 | 416 | 0.867 | 0.871 | 0.864 | 0.735 | 0.916 |
| SD | 0.032 | 4.77 | 0.006 | 0.012 | 0.015 | 0.013 | 0.005 |
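
The columns ACC, Sn, Sp and MCC are not defined in this record. The sketch below assumes the standard confusion-matrix definitions of accuracy, sensitivity, specificity and Matthews correlation coefficient; the function name and the example counts are hypothetical.

```python
from math import sqrt


def classification_metrics(tp, tn, fp, fn):
    """Compute ACC, Sn, Sp and MCC from confusion-matrix counts."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    sn = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity (recall on TPPs)
    sp = tn / (tn + fp) if (tn + fp) else 0.0  # specificity (recall on non-TPPs)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "Sn": sn, "Sp": sp, "MCC": mcc}


# Hypothetical counts, not taken from the study.
example_metrics = classification_metrics(tp=8, tn=9, fp=1, fn=2)
```

Because MCC uses all four counts, it is less sensitive than ACC to the class imbalance between TPPs and non-TPPs.
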
Independent test results of SCM models using different optimal propensity scores of g-gap dipeptides.
| No. | R | Cutoff | ACC | Sn | Sp | MCC | AUC |
|---|---|---|---|---|---|---|---|
| 0 | 0.650 | 418 | 0.865 | 0.849 | 0.881 | 0.731 | 0.925 |
| 1 | 0.592 | 420 | 0.844 | 0.846 | 0.841 | 0.687 | 0.912 |
| 2 | 0.634 | 414 | 0.863 | 0.868 | 0.857 | 0.725 | 0.918 |
| 3 | 0.653 | 412 | 0.860 | 0.836 | 0.884 | 0.721 | 0.908 |
| 4 | 0.602 | 417 | 0.852 | 0.863 | 0.841 | 0.704 | 0.909 |
| 5 | 0.601 | 416 | 0.852 | 0.854 | 0.849 | 0.704 | 0.915 |
| 6 | 0.601 | 407 | 0.867 | 0.863 | 0.871 | 0.733 | 0.914 |
| 7 | 0.664 | 415 | 0.853 | 0.860 | 0.846 | 0.706 | 0.909 |
| 8 | 0.668 | 415 | 0.840 | 0.822 | 0.857 | 0.680 | 0.910 |
| 9 | 0.585 | 425 | 0.837 | 0.849 | 0.825 | 0.674 | 0.897 |
| Mean | 0.625 | 416 | 0.853 | 0.851 | 0.855 | 0.706 | 0.912 |
| SD | 0.032 | 4.77 | 0.011 | 0.014 | 0.019 | 0.021 | 0.007 |
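
As a quick consistency check, the Mean and SD rows can be recomputed from the ten per-run accuracy values in the independent test table above. The sketch assumes that SD denotes the sample standard deviation (n - 1 denominator), which the record does not state; the variable names are illustrative.

```python
from statistics import mean, stdev

# Independent-test accuracy of the ten runs, copied from the table above.
independent_acc = [0.865, 0.844, 0.863, 0.860, 0.852,
                   0.852, 0.867, 0.853, 0.840, 0.837]

acc_mean = mean(independent_acc)
acc_sd = stdev(independent_acc)  # sample standard deviation (assumption)

print(f"mean ACC = {acc_mean:.3f}, SD = {acc_sd:.3f}")
```

Under that assumption the recomputed values agree with the reported 0.853 and 0.011 to three decimal places.
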
Figure 2. Propensity scores of 400 dipeptides as obtained from the proposed SCMTPP.
Cross-validation and independent test results of SCM-based classifiers using initial-PS and optimized-PS.
| Evaluation strategy | Feature | ACC | Sn | Sp | MCC | AUC |
|---|---|---|---|---|---|---|
| Tenfold CV | Initial-PS | 0.844 | 0.858 | 0.829 | 0.688 | 0.910 |
| Tenfold CV | Optimized-PS | 0.883 | 0.878 | 0.887 | 0.766 | 0.926 |
| Independent test | Initial-PS | 0.848 | 0.852 | 0.844 | 0.695 | 0.914 |
| Independent test | Optimized-PS | 0.865 | 0.849 | 0.881 | 0.731 | 0.925 |
Figure 3. Histograms of the scores of thermophilic and non-thermophilic proteins derived from SCMTPP using initial (A) and optimized (B) dipeptide propensity scores on the training dataset, where the mean and standard deviation are indicated by bars and closed circles, respectively.
Figure 4. Performance evaluations of SCMTPP and conventional TPP predictors. (A,B) Tenfold cross-validation ACC and MCC of SCMTPP versus conventional TPP predictors. (C,D) Independent test ACC and MCC of SCMTPP versus conventional TPP predictors.
Cross-validation and independent test results of SCMTPP and ThermoPred.
| Evaluation strategy | Method (a) | ACC | Sn | Sp | MCC |
|---|---|---|---|---|---|
| Tenfold CV | ThermoPred | – | – | – | – |
| Tenfold CV | SCMTPP | 0.883 | 0.878 | 0.887 | 0.766 |
| Independent test | ThermoPred | 0.860 | 0.938 | 0.782 | 0.729 |
| Independent test | SCMTPP | 0.865 | 0.849 | 0.881 | 0.731 |
(a) Results were obtained by submitting the protein sequences in the independent test set to the ThermoPred web server.
Top ten TPPs having the highest PS-TPP derived from the proposed SCMTPP.
| Rank | Name (UniProt) | PS-TPP | UniProt ID | Function | Organism |
|---|---|---|---|---|---|
| 1 | 50S ribosomal protein L38E | 528.74 | Q9YFR9 | Structural constituent of ribosome | |
| 2 | Uncharacterized protein MJ0223 | 527.79 | Q57676 | Unknown | |
| 3 | 50S ribosomal protein L31e | 525.29 | Q9YD25 | Structural constituent of ribosome | |
| 4 | Protein GrpE | 519.54 | Q9WZV4 | Hyperosmotic and heat shock by preventing the aggregation of stress-denatured proteins | |
| 5 | Elongation factor 1-beta | 519.28 | Q8TYN8 | Promote the exchange of GDP for GTP in EF-1-alpha/GDP | |
| 6 | 50S ribosomal protein L29 | 518.45 | Q8TX34 | Structural constituent of ribosome | |
| 7 | DNA double-strand break repair Rad50 ATPase | 516.88 | Q8TXI4 | Facilitate opening of the processed DNA ends to aid in the recruitment of HerA and NurA | |
| 8 | Putative antitoxin VapB21 | 516.77 | O28071 | Possibly the antitoxin component of a type II toxin-antitoxin (TA) system | |
| 9 | V-type ATP synthase subunit E | 514.51 | Q8TWL9 | Produces ATP from ADP in the presence of a proton gradient across the membrane | |
| 10 | 50S ribosomal protein L18Ae | 513.46 | P58289 | Structural constituent of ribosome | |
Figure 5. Three-dimensional structures of the TPPs with the highest TPP scores (Q9YFR9, Q57676 and Q9YD25; 528.74, 527.79 and 525.29, respectively) and the non-TPPs with the lowest TPP scores (Q8ZDC4, Q66A07 and A1AZ52; 319.67, 331.20 and 340.61, respectively). The optimal threshold value is 418.
Propensity scores of twenty amino acids in becoming a thermophilic protein (PS-TPP) along with amino acid compositions (%) of TPPs and non-TPPs.
| Amino acid | PS-TPP | TPP (%) | Non-TPP (%) | Difference |
|---|---|---|---|---|
| E-Glu | 510.18 (1) | 9.28 | 6.49 | 2.79 (1) |
| K-Lys | 480.00 (2) | 7.83 | 5.79 | 2.04 (2) |
| V-Val | 470.75 (3) | 8.45 | 7.09 | 1.36 (3) |
| R-Arg | 464.08 (4) | 6.47 | 5.14 | 1.32 (4) |
| I-Ile | 435.65 (5) | 7.41 | 6.45 | 0.96 (5) |
| G-Gly | 433.48 (6) | 7.34 | 7.12 | 0.22 (7) |
| Y-Tyr | 425.93 (7) | 3.42 | 2.89 | 0.53 (6) |
| P-Pro | 421.40 (8) | 4.26 | 4.13 | 0.13 (8) |
| C-Cys | 388.28 (9) | 0.92 | 1.07 | −0.15 (9) |
| M-Met | 387.10 (10) | 2.33 | 2.50 | −0.17 (11) |
| D-Asp | 386.25 (11) | 5.18 | 5.34 | −0.17 (10) |
| W-Trp | 383.25 (12) | 0.88 | 1.09 | −0.22 (12) |
| L-Leu | 367.18 (13) | 9.35 | 10.14 | −0.79 (15) |
| H-His | 364.58 (14) | 1.65 | 2.22 | −0.57 (14) |
| S-Ser | 363.20 (15) | 4.85 | 5.90 | −1.05 (17) |
| F-Phe | 351.25 (16) | 3.63 | 4.06 | −0.43 (13) |
| N-Asn | 332.48 (17) | 3.33 | 4.14 | −0.80 (16) |
| A-Ala | 323.63 (18) | 7.29 | 8.90 | −1.61 (19) |
| T-Thr | 306.00 (19) | 4.13 | 5.32 | −1.20 (18) |
| Q-Gln | 255.43 (20) | 2.01 | 4.21 | −2.20 (20) |
| R | 1.00 | 0.54 | 0.12 | 0.96 |
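
The final row (R) appears to give the correlation of each column with PS-TPP. Assuming it is the Pearson correlation coefficient, which this record does not state, the value for the Difference column can be recomputed from the table with the following sketch; the function and variable names are illustrative.

```python
from math import sqrt

# (PS-TPP, Difference) pairs copied from the table above, in row order E..Q.
ps_tpp = [510.18, 480.00, 470.75, 464.08, 435.65, 433.48, 425.93, 421.40,
          388.28, 387.10, 386.25, 383.25, 367.18, 364.58, 363.20, 351.25,
          332.48, 323.63, 306.00, 255.43]
difference = [2.79, 2.04, 1.36, 1.32, 0.96, 0.22, 0.53, 0.13,
              -0.15, -0.17, -0.17, -0.22, -0.79, -0.57, -1.05, -0.43,
              -0.80, -1.61, -1.20, -2.20]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


r_difference = pearson_r(ps_tpp, difference)
print(f"R(PS-TPP, Difference) = {r_difference:.2f}")
```

A value close to the reported 0.96 indicates that residues with high propensity scores are also the ones enriched in TPPs relative to non-TPPs.
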
Summary of four important physicochemical properties as determined by SCMTPP.
| Amino acid | PS-TPP (Rank) | FUKS010101 (Rank) | FUKS010102 (Rank) | ZIMJ680101 (Rank) |
|---|---|---|---|---|
| E-Glu | 510.18 (1) | 16.56 (1) | 12.93 (1) | 0.65 (13) |
| K-Lys | 480.00 (2) | 12.98 (2) | 10.20 (2) | 1.6 (7) |
| V-Val | 470.75 (3) | 4.05 (10) | 3.57 (13) | 1.79 (6) |
| R-Arg | 464.08 (4) | 8.48 (3) | 6.87 (5) | 0.83 (12) |
| I-Ile | 435.65 (5) | 3.3 (13) | 2.72 (15) | 3.07 (1) |
| G-Gly | 433.48 (6) | 8.29 (4) | 7.95 (4) | 0.1 (18) |
| Y-Tyr | 425.93 (7) | 2.75 (15) | 2.26 (16) | 2.97 (2) |
| P-Pro | 421.40 (8) | 5.41 (6) | 4.79 (11) | 2.7 (4) |
| C-Cys | 388.28 (9) | 0.29 (20) | 0.31 (20) | 1.48 (8) |
| M-Met | 387.10 (10) | 1.71 (18) | 1.87 (18) | 1.4 (9) |
| D-Asp | 386.25 (11) | 7.05 (5) | 8.57 (3) | 0.64 (14) |
| W-Trp | 383.25 (12) | 0.67 (19) | 0.54 (19) | 0.31 (16) |
| L-Leu | 367.18 (13) | 5.06 (7) | 4.43 (12) | 2.52 (5) |
| H-His | 364.58 (14) | 1.74 (17) | 2.80 (14) | 1.1 (10) |
| S-Ser | 363.20 (15) | 4.27 (9) | 5.41 (8) | 0.14 (17) |
| F-Phe | 351.25 (16) | 2.32 (16) | 1.92 (17) | 2.75 (3) |
| N-Asn | 332.48 (17) | 3.89 (11) | 5.50 (7) | 0.09 (19) |
| A-Ala | 323.63 (18) | 4.47 (8) | 6.77 (6) | 0.83 (11) |
| T-Thr | 306.00 (19) | 3.83 (12) | 5.36 (9) | 0.54 (15) |
| Q-Gln | 255.43 (20) | 2.87 (14) | 5.24 (10) | 0 (20) |
| R | 1.00 | 0.616 | 0.348 | 0.307 |
Figure 6. Interpolated charge surface of Q9YFR9 (TPP) and P0A223 (non-TPP), which have TPP scores of 528.74 and 341.99, respectively; the optimal threshold value is 418. Blue, white and red colors denote high, medium and low interpolated charge, respectively.
Figure 7. Surface hydrophobicity of Q9YFR9 (TPP) and P0A223 (non-TPP), which have TPP scores of 528.74 and 341.99, respectively; the optimal threshold value is 418. Brown, white and blue colors denote high, medium and low hydrophobicity, respectively.