Hocheol Lim1,2, Hyeon-Nae Jeon2, Seungcheol Lim3, Yuil Jang1, Taehee Kim1, Hyein Cho1, Jae-Gu Pan4, Kyoung Tai No1,2.
Abstract
Protein engineering has become increasingly important in the research and development of biopharmaceuticals and biomaterials. Machine learning in computer-aided protein engineering can markedly reduce the experimental effort needed to identify, from a large number of possible protein sequences, the optimal sequences that satisfy the desired properties. To develop general protein descriptors for computer-aided protein engineering tasks, we devised four new protein descriptors: one sequence-based descriptor (PCgrades) and three structure-based descriptors (PCspairs, 3D-SPIEs_5.4 Å, and 3D-SPIEs_8Å). While PCgrades and PCspairs encode general, statistical information on the physicochemical properties of single amino acids and amino acid pairs, respectively, the 3D-SPIEs encode specific, quantum-mechanical information obtained from parameterized quantum mechanical calculations (FMO2-DFTB3/D/PCM). To evaluate the descriptors, we built prediction models with the new descriptors and with previously developed descriptors on diverse protein datasets, including protein expression and binding affinity change in the SARS-CoV-2 spike glycoprotein. The newly devised descriptors performed well across these datasets, with PCspairs performing best (R² = 0.783 for protein expression and R² = 0.711 for binding affinity). Similar approaches with these descriptors would be promising and useful if the prediction models are trained with sufficient quantitative experimental data from high-throughput assays for industrial enzymes or protein drugs.
Keywords: Fragment molecular orbitals; Machine learning; Protein descriptor; Protein engineering; Quantum mechanics
Year: 2022 PMID: 35222841 PMCID: PMC8841378 DOI: 10.1016/j.csbj.2022.01.027
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Summary of data sets in this work.
| Protein name | Species | Abbreviation | Observable variable | All set | Training set | Test set |
|---|---|---|---|---|---|---|
| Bacteriorhodopsin | | GR-wave | Max absorption wavelength | 71 | 56 | 15 |
| | | GR-shift | Max absorption wavelength shift | | | |
| Epoxide hydrolase | | ANEH-evalue | Enantiomeric selectivity (e-value) | 163 | 130 | 33 |
| | | ANEH-ddG | Enantiomeric selectivity (ddG‡) | | | |
| Nitric oxide dioxygenase | | RmaNOD-ee | Enantiomeric excess | 552 | 441 | 111 |
| Spike glycoprotein | SARS-CoV-2 | SARS2-expr | Protein expression | 3799 | 3039 | 760 |
| | | SARS2-bind | Binding affinity with hACE2 | 3803 | 3042 | 761 |
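The training and test sizes listed in the table are consistent with a 20% hold-out split in which the test size is rounded up. The record does not state how the split was made, so the following is only a minimal sketch under that assumption; `holdout_sizes` is a hypothetical helper name, not something from the paper.

```python
def holdout_sizes(n_samples: int, test_percent: int = 20) -> tuple[int, int]:
    """Return (n_train, n_test) for a hold-out split, rounding the test size up.

    Assumption: a 20% test fraction with ceiling rounding. This reproduces the
    table's numbers, but the original procedure is not given in this record.
    """
    n_test = -(-n_samples * test_percent // 100)  # integer ceiling division
    return n_samples - n_test, n_test


# "All set" sizes copied from the table, keyed by protein or abbreviation.
all_set_sizes = {
    "Bacteriorhodopsin": 71,
    "Epoxide hydrolase": 163,
    "Nitric oxide dioxygenase": 552,
    "SARS2-expr": 3799,
    "SARS2-bind": 3803,
}

splits = {name: holdout_sizes(n) for name, n in all_set_sizes.items()}
for name, (n_train, n_test) in splits.items():
    print(f"{name}: train={n_train}, test={n_test}")
```

Integer arithmetic is used for the ceiling so that the result does not depend on floating-point rounding of `0.2 * n`.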
Fig. 1. Workflow of protein descriptor generation for computer-aided rational protein engineering tasks in this work.
Fig. 2. Violin plots of each dataset in this work. (A) Maximum absorption wavelength of bacteriorhodopsin (GR-wave). (B) Maximum absorption wavelength shift of bacteriorhodopsin from wild-type (GR-shift). (C) Enantiomeric selectivity (e-value) of epoxide hydrolase (ANEH-evalue). (D) Enantiomeric selectivity (ddG) of epoxide hydrolase (ANEH-ddG). (E) Enantiomeric excess of nitric oxide dioxygenase (RmaNOD-ee). (F) Protein expression of spike glycoprotein of SARS-CoV-2 (SARS2-expr). (G) Binding affinity between spike glycoprotein of SARS-CoV-2 and human angiotensin converting enzyme 2 (SARS2-bind).
Mean R-squared of predictions on the 10-fold cross-validation sets.
| Data set | Method | PCscores | PCgrades | sPairs | PCspairs | UniRep fusion | 3D-SPIEs_5.4 Å | 3D-SPIEs_8Å |
|---|---|---|---|---|---|---|---|---|
| GR-wave | RF | 0.822 ± 0.166 | 0.813 ± 0.150 | 0.754 ± 0.188 | 0.799 ± 0.196 | 0.677 ± 0.340 | 0.766 ± 0.140 | 0.761 ± 0.164 |
| GR-wave | XGB | 0.821 ± 0.126 | 0.823 ± 0.125 | 0.756 ± 0.226 | 0.761 ± 0.208 | 0.739 ± 0.183 | 0.748 ± 0.166 | 0.718 ± 0.203 |
| GR-shift | RF | 0.822 ± 0.166 | 0.813 ± 0.151 | 0.755 ± 0.188 | 0.799 ± 0.196 | 0.677 ± 0.340 | 0.767 ± 0.140 | 0.761 ± 0.164 |
| GR-shift | XGB | 0.757 ± 0.216 | 0.766 ± 0.220 | 0.743 ± 0.190 | 0.733 ± 0.227 | 0.772 ± 0.155 | 0.745 ± 0.171 | 0.729 ± 0.179 |
| ANEH-evalue | RF | 0.712 ± 0.161 | 0.713 ± 0.153 | 0.706 ± 0.159 | 0.706 ± 0.187 | 0.528 ± 0.202 | 0.555 ± 0.176 | 0.570 ± 0.119 |
| ANEH-evalue | XGB | 0.630 ± 0.349 | 0.632 ± 0.348 | 0.666 ± 0.139 | 0.693 ± 0.198 | 0.568 ± 0.258 | 0.555 ± 0.165 | 0.568 ± 0.138 |
| ANEH-ddG | RF | 0.802 ± 0.127 | 0.800 ± 0.129 | 0.818 ± 0.099 | 0.809 ± 0.112 | 0.706 ± 0.141 | 0.673 ± 0.115 | 0.685 ± 0.116 |
| ANEH-ddG | XGB | 0.763 ± 0.173 | 0.756 ± 0.180 | 0.751 ± 0.101 | 0.768 ± 0.141 | 0.701 ± 0.186 | 0.696 ± 0.133 | 0.701 ± 0.125 |
| RmaNOD-ee | RF | 0.707 ± 0.080 | 0.717 ± 0.068 | 0.706 ± 0.073 | 0.719 ± 0.074 | 0.641 ± 0.097 | 0.666 ± 0.080 | 0.678 ± 0.071 |
| RmaNOD-ee | XGB | 0.696 ± 0.088 | 0.695 ± 0.087 | 0.676 ± 0.089 | 0.709 ± 0.078 | 0.659 ± 0.105 | 0.667 ± 0.086 | 0.691 ± 0.082 |
| SARS2-expr | RF | 0.724 ± 0.032 | 0.728 ± 0.031 | 0.736 ± 0.025 | 0.790 ± 0.025 | 0.517 ± 0.019 | 0.635 ± 0.034 | 0.649 ± 0.028 |
| SARS2-expr | XGB | 0.705 ± 0.043 | 0.708 ± 0.039 | 0.718 ± 0.028 | 0.760 ± 0.035 | 0.614 ± 0.024 | 0.662 ± 0.032 | 0.673 ± 0.027 |
| SARS2-bind | RF | 0.701 ± 0.040 | 0.695 ± 0.048 | 0.695 ± 0.055 | 0.752 ± 0.045 | 0.484 ± 0.019 | 0.650 ± 0.033 | 0.660 ± 0.037 |
| SARS2-bind | XGB | 0.686 ± 0.032 | 0.680 ± 0.027 | 0.696 ± 0.050 | 0.748 ± 0.045 | 0.602 ± 0.022 | 0.680 ± 0.032 | 0.689 ± 0.036 |
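To make the "mean ± spread of R-squared over 10 folds" entries concrete, here is a minimal standard-library sketch of the procedure. Several things in it are assumptions rather than the authors' setup: a one-feature least-squares line stands in for the RF and XGB regressors, the data are synthetic, the folds are shuffled with a fixed seed, and the spread is reported as a sample standard deviation (the record does not say what follows the ± sign).

```python
import random
import statistics


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_line(xs, ys):
    """Least-squares line; a simple stand-in for the RF/XGB regressors."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def cross_val_r2(xs, ys, k=10, seed=0):
    """Return (mean, sample std) of R^2 over k shuffled folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering all samples
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        preds = [slope * xs[i] + intercept for i in held_out]
        scores.append(r_squared([ys[i] for i in held_out], preds))
    return statistics.fmean(scores), statistics.stdev(scores)


# Synthetic data (NOT from the paper): y = 2x + 1 plus Gaussian noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]

mean_r2, std_r2 = cross_val_r2(xs, ys)
print(f"{mean_r2:.3f} ± {std_r2:.3f}")
```

With small datasets such as GR-wave (56 training samples), each fold holds only a handful of points, which is one plausible reason the spreads in the table are much larger there than for the SARS2 datasets.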
R-squared of test set predictions for the models with the best-found parameters.
| Data set | Method | PCscores | PCgrades | sPairs | PCspairs | UniRep fusion | 3D-SPIEs_5.4 Å | 3D-SPIEs_8Å |
|---|---|---|---|---|---|---|---|---|
| GR-wave | RF | 0.931 | 0.934 | 0.926 | 0.906 | 0.795 | 0.877 | 0.862 |
| GR-wave | XGB | 0.892 | 0.896 | 0.894 | 0.928 | 0.834 | 0.934 | 0.947 |
| GR-shift | RF | 0.931 | 0.934 | 0.926 | 0.906 | 0.795 | 0.877 | 0.862 |
| GR-shift | XGB | 0.901 | 0.892 | 0.922 | 0.950 | 0.849 | 0.915 | 0.921 |
| ANEH-evalue | RF | 0.844 | 0.848 | 0.831 | 0.859 | 0.685 | 0.685 | 0.660 |
| ANEH-evalue | XGB | 0.836 | 0.837 | 0.833 | 0.851 | 0.780 | 0.747 | 0.732 |
| ANEH-ddG | RF | 0.935 | 0.938 | 0.926 | 0.929 | 0.830 | 0.751 | 0.783 |
| ANEH-ddG | XGB | 0.923 | 0.929 | 0.915 | 0.923 | 0.886 | 0.845 | 0.839 |
| RmaNOD-ee | RF | 0.723 | 0.708 | 0.701 | 0.718 | 0.637 | 0.659 | 0.659 |
| RmaNOD-ee | XGB | 0.691 | 0.693 | 0.706 | 0.702 | 0.637 | 0.675 | 0.675 |
| SARS2-expr | RF | 0.708 | 0.724 | 0.739 | 0.783 | 0.490 | 0.588 | 0.608 |
| SARS2-expr | XGB | 0.690 | 0.712 | 0.689 | 0.743 | 0.606 | 0.607 | 0.630 |
| SARS2-bind | RF | 0.651 | 0.651 | 0.653 | 0.711 | 0.464 | 0.590 | 0.600 |
| SARS2-bind | XGB | 0.648 | 0.671 | 0.638 | 0.702 | 0.576 | 0.629 | 0.628 |
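As a reading aid for the table, the short sketch below picks, for each dataset, the method and descriptor combination with the highest test-set R-squared. The values are transcribed from the table; the names `TEST_R2` and `best_models` are illustrative and not taken from the paper.

```python
DESCRIPTORS = ["PCscores", "PCgrades", "sPairs", "PCspairs",
               "UniRep fusion", "3D-SPIEs_5.4 Å", "3D-SPIEs_8Å"]

# Test-set R^2 values transcribed from the table, in DESCRIPTORS order.
TEST_R2 = {
    ("GR-wave", "RF"):      [0.931, 0.934, 0.926, 0.906, 0.795, 0.877, 0.862],
    ("GR-wave", "XGB"):     [0.892, 0.896, 0.894, 0.928, 0.834, 0.934, 0.947],
    ("GR-shift", "RF"):     [0.931, 0.934, 0.926, 0.906, 0.795, 0.877, 0.862],
    ("GR-shift", "XGB"):    [0.901, 0.892, 0.922, 0.950, 0.849, 0.915, 0.921],
    ("ANEH-evalue", "RF"):  [0.844, 0.848, 0.831, 0.859, 0.685, 0.685, 0.660],
    ("ANEH-evalue", "XGB"): [0.836, 0.837, 0.833, 0.851, 0.780, 0.747, 0.732],
    ("ANEH-ddG", "RF"):     [0.935, 0.938, 0.926, 0.929, 0.830, 0.751, 0.783],
    ("ANEH-ddG", "XGB"):    [0.923, 0.929, 0.915, 0.923, 0.886, 0.845, 0.839],
    ("RmaNOD-ee", "RF"):    [0.723, 0.708, 0.701, 0.718, 0.637, 0.659, 0.659],
    ("RmaNOD-ee", "XGB"):   [0.691, 0.693, 0.706, 0.702, 0.637, 0.675, 0.675],
    ("SARS2-expr", "RF"):   [0.708, 0.724, 0.739, 0.783, 0.490, 0.588, 0.608],
    ("SARS2-expr", "XGB"):  [0.690, 0.712, 0.689, 0.743, 0.606, 0.607, 0.630],
    ("SARS2-bind", "RF"):   [0.651, 0.651, 0.653, 0.711, 0.464, 0.590, 0.600],
    ("SARS2-bind", "XGB"):  [0.648, 0.671, 0.638, 0.702, 0.576, 0.629, 0.628],
}


def best_models(table):
    """Map each dataset to its (method, descriptor, R^2) with the highest R^2."""
    best = {}
    for (dataset, method), scores in table.items():
        for descriptor, score in zip(DESCRIPTORS, scores):
            if dataset not in best or score > best[dataset][2]:
                best[dataset] = (method, descriptor, score)
    return best


best = best_models(TEST_R2)
for dataset, (method, descriptor, score) in best.items():
    print(f"{dataset}: {method} + {descriptor} (R^2 = {score:.3f})")
```

PCspairs comes out on top for four of the seven datasets (GR-shift, ANEH-evalue, SARS2-expr, and SARS2-bind).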
Fig. 3. Correlation plots of the test set predictions of the best models for the seven datasets. (A) XGB model trained with the 3D-SPIEs_8Å in the GR-wave dataset. (B) XGB model trained with the PCspairs in the GR-shift dataset. (C) RF model trained with the PCspairs in the ANEH-evalue dataset. (D) RF model trained with the PCgrades in the ANEH-ddG dataset. (E) RF model trained with the PCscores in the RmaNOD-ee dataset. (F) RF model trained with the PCspairs in the SARS2-expr dataset. (G) RF model trained with the PCspairs in the SARS2-bind dataset. The green dotted line indicates the identity function and the red dotted line indicates the trend line. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Fig. 4. Box plot comparison of ranks. (A) Ranks of descriptors by R-squared on the test set, (B) ranks of descriptors by RMSE on the test set, and (C) ranks of models (machine learning method and descriptor combinations) by R-squared on the test set.
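The record does not describe how the ranks behind these box plots were computed. One plausible reading, sketched here purely as an assumption, is to rank the seven descriptors within each dataset and method row by test-set R-squared (1 = best) and then summarize each descriptor's ranks across rows. The example uses only the two SARS2-expr rows of the test-set table; `rank_row` is a hypothetical helper.

```python
from statistics import median

DESCRIPTORS = ["PCscores", "PCgrades", "sPairs", "PCspairs",
               "UniRep fusion", "3D-SPIEs_5.4 Å", "3D-SPIEs_8Å"]

# Two rows of test-set R^2 values from the table (SARS2-expr only).
ROWS = {
    ("SARS2-expr", "RF"):  [0.708, 0.724, 0.739, 0.783, 0.490, 0.588, 0.608],
    ("SARS2-expr", "XGB"): [0.690, 0.712, 0.689, 0.743, 0.606, 0.607, 0.630],
}


def rank_row(scores):
    """Rank scores in descending order (1 = highest); tied scores share a rank."""
    return [1 + sum(other > s for other in scores) for s in scores]


# Collect each descriptor's rank in every row, then take the median.
ranks = {d: [] for d in DESCRIPTORS}
for scores in ROWS.values():
    for descriptor, rank in zip(DESCRIPTORS, rank_row(scores)):
        ranks[descriptor].append(rank)

median_rank = {d: median(r) for d, r in ranks.items()}
for descriptor in sorted(median_rank, key=median_rank.get):
    print(f"{descriptor}: ranks={ranks[descriptor]}, median={median_rank[descriptor]}")
```

A box plot per descriptor over all fourteen rows would then show the distribution of these ranks; for an RMSE-based ranking the comparison in `rank_row` would flip, since lower RMSE is better.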