| Literature DB >> 32316682 |
Jiarui Chen, Shirley W. I. Siu.
Abstract
Protein structures play a very important role in biomedical research, especially in drug discovery and design, which require accurate protein structures in advance. However, experimental determination of protein structures is prohibitively costly and time-consuming, and computational predictions of protein structures have not been perfected. Methods that assess the quality of protein models can help in selecting the most accurate candidates for further work. Driven by this demand, many structural bioinformatics laboratories have developed methods for estimating model accuracy (EMA). In recent years, EMA methods based on machine learning (ML) have consistently ranked among the top-performing methods in the community-wide CASP challenge. Accordingly, we systematically review all the major ML-based EMA methods developed within the past ten years. The methods are grouped by their employed ML approach (support vector machine, artificial neural networks, ensemble learning, or Bayesian learning), and their significance is discussed from a methodology viewpoint. To orient the reader, we also briefly describe the background of EMA, including the CASP challenge and its evaluation metrics, and introduce the major ML/DL techniques. Overall, this review provides an introductory guide to modern research on protein quality assessment and directions for future research in this area.
Keywords: CASP; DL; EMA; ML; MQA; deep learning; estimating model quality; machine learning; model quality assessment; protein structure prediction
Year: 2020 PMID: 32316682 PMCID: PMC7226485 DOI: 10.3390/biom10040626
Source DB: PubMed Journal: Biomolecules ISSN: 2218-273X
Figure 1. Four major approaches in the quality assessment of protein models: single-model, multi-model, quasi-single-model, and hybrid-model approaches. Hybrid-model methods combine models selected by different methods.
Figure 2. Three approaches to protein structure prediction.
Top-performing estimating model accuracy (EMA) methods in CASP13 [20].
| Name | Approach | Ref. | Advantage/Use Case | Available | S/C 1 | Links |
|---|---|---|---|---|---|---|
| FaeNNz | NN | [ | Top 1 accuracy estimate, absolute accuracy estimate | Y | C/S | |
| ModFOLD7 | NN | [ | Both global and local accuracy estimates | N | - | - |
| ProQ3D | NN | [ | Top 1 accuracy estimate | Y | S | |
| ProQ4 | CNN | [ | Per-target ranking | Y | C | |
| SART | Linear regression | [ | Local accuracy estimate and prediction of inaccurately modeled regions | N | - | - |
| VoroMQA | Statistical potential | [ | Local accuracy estimate and prediction of inaccurately modeled regions; good for native oligomeric structures | Y | C/S | |
| MULTICOM | NN | [ | Top 1 accuracy estimate, absolute accuracy estimate | Y | C/S | |
1 “S” denotes server, and “C” denotes code.
Categories of commonly used features in ML-based EMA methods.
| Categories | Abbr. | Brief Description | Examples |
|---|---|---|---|
| Physicochemical properties | PC | Basic physical or chemical properties extracted directly from the protein structural model | Residue or atom contact information, atom density map, hydrophobicity, polarity, charge, dihedral angle, etc. |
| Surface exposure area | SE | Features calculated from the different types of a molecule’s surface area | |
| Solvent accessibility | SA | Features based on the molecule’s surface area that is accessible to solvent | |
| Primary structure | PS | Protein sequence or features calculated from the sequence | |
| Secondary structure | SS | Secondary structure or features calculated from the secondary structure | |
| Evolutionary property | EI | Features based on the protein profile providing evolutionary information, collected from a family of similar protein sequences | |
| Energetic properties | ER | Features based on different energy terms | |
| Statistical potential | SP | Features involving statistical calculation or statistical potential | |
| Properties from other evaluation methods | FOM | Scores or features directly generated by other prediction methods |
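To make the categories above concrete, the sketch below assembles a per-residue feature vector from a few of them (primary structure as a one-hot amino-acid encoding, secondary structure, solvent accessibility, and a physicochemical scalar). All names and values here are illustrative assumptions, not taken from any specific EMA method.

```python
# Hypothetical sketch: composing a per-residue feature vector from the
# feature categories in the table above. Names and values are illustrative.

SS_CLASSES = ["H", "E", "C"]              # helix, strand, coil
AA = "ACDEFGHIKLMNPQRSTVWY"               # 20 standard amino acids

def residue_features(aa, ss, rel_solvent_acc, hydrophobicity):
    """Concatenate one-hot and scalar features into one flat vector."""
    aa_onehot = [1.0 if a == aa else 0.0 for a in AA]            # PS
    ss_onehot = [1.0 if s == ss else 0.0 for s in SS_CLASSES]    # SS
    return aa_onehot + ss_onehot + [rel_solvent_acc, hydrophobicity]  # SA, PC

vec = residue_features("A", "H", 0.35, 1.8)
print(len(vec))  # 20 + 3 + 2 = 25
```

A real EMA method would concatenate many more such blocks (evolutionary profiles, energy terms, scores from other predictors) and feed the result to its ML model.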
Several commonly used data sources for training and testing EMA methods.
| Data Sources | No. of Structures/Targets | URLs | Reference |
|---|---|---|---|
| CASP (CASP 7–13) | - 1 | | [ |
| PISCES | - 1 | | [ |
| CAMEO | 50,187/- 2 | | [ |
| 3DRobot | 300 per target/200 | | [ |
| I-TASSER Decoy Set I | 12,500–32,000 per target/56 | | [ |
| I-TASSER Decoy Set II | 300–500 per target/56 | | |
| MESHI | 36,682/308 | | [ |
1 The amount of data varies according to the demands of researchers. 2 As of 4 February 2020.
Comparison of different EMA methods.
| Name | Year | Dataset | Approach | Ref. | PC | SE | SA | PS | SS | EI | ER | SP | FOM | Available | S/C 3 | Links |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ProQ2 | 2012 | CASP7-9 | SVM | [ | • | • | • | ∘ | • | • | ∘ | ∘ | ∘ | Y | S | |
| DL-Pro (NN) 2 | 2014 | CASP | NN | [ | • | ∘ | ∘ | ∘ | ∘ | ∘ | ∘ | ∘ | ∘ | N | - | - |
| RFMQA | 2014 | CASP8-10 | EL | [ | ∘ | ∘ | • | ∘ | • | ∘ | • | ∘ | • | Y | C | |
| Wang deep1 2 | 2015 | CASP11 | NN | [ | • | ∘ | • | • | • | • | ∘ | ∘ | • | N | - | - |
| Wang deep2 2 | 2015 | CASP11 | NN | [ | • | ∘ | • | • | • | • | ∘ | ∘ | • | N | - | - |
| Wang deep3 2 | 2015 | CASP11 | NN | [ | • | ∘ | • | • | • | • | ∘ | ∘ | • | N | - | - |
| Wang SVM 2 | 2015 | CASP11 | SVM | [ | • | ∘ | • | • | • | • | ∘ | ∘ | • | N | - | - |
| QACon 2 | 2016 | CASP9, 11 | NN | [ | • | • | • | ∘ | • | ∘ | • | • | • | N | - | - |
| ProQ3 | 2016 | CASP9, 11, CAMEO | SVM | [ | • | • | • | ∘ | • | • | • | ∘ | • | Y | S | |
| SVM-e | 2016 | CASP8-10, MESHI | SVM | [ | • | ∘ | ∘ | ∘ | • | ∘ | • | • | • | N | - | - |
| MESHI-score | 2016 | CASP8-10, MESHI | EL | [ | • | ∘ | ∘ | ∘ | • | ∘ | • | • | • | N | - | - |
| DeepQA | 2016 | CASP8-11, 3DRobot, PISCES | DBN | [ | ∘ | • | • | ∘ | • | ∘ | • | • | • | Y | C | |
| ProQ3D | 2017 | CASP9-11, CAMEO | NN | [ | • | • | • | ∘ | • | • | • | ∘ | • | Y | S | |
| SVMQA | 2017 | CASP8-12 | SVM | [ | ∘ | ∘ | • | ∘ | • | ∘ | • | ∘ | • | Y | C | |
| ModFOLD6 | 2017 | CASP12, CAMEO | NN | [ | • | ∘ | ∘ | • | ∘ | ∘ | ∘ | ∘ | • | Y | S | |
| Qprob | 2017 | CASP9, 11, PISCES | BL | [ | • | • | • | ∘ | ∘ | ∘ | • | ∘ | • | Y | S | |
| 3DCNN MQA | 2018 | CASP7-10, 11-12, CAMEO, 3DRobot | CNN | [ | • | ∘ | ∘ | ∘ | ∘ | ∘ | ∘ | ∘ | ∘ | Y | C | |
| ProQ4 | 2018 | CASP9-11, CAMEO, PISCES | CNN | [ | ∘ | ∘ | ∘ | • | • | ∘ | • | • | ∘ | Y | C | |
| ModFOLD7 | 2018 | CASP10-13 | NN | [ | • | ∘ | ∘ | • | ∘ | ∘ | ∘ | ∘ | • | N | - | - |
| MULTICOM | 2018 | CASP8-13 | NN | [ | • | ∘ | ∘ | ∘ | ∘ | ∘ | ∘ | ∘ | • | Y | C/S | |
| Ornate | 2019 | CASP11-12 | CNN | [ | • | ∘ | ∘ | ∘ | ∘ | ∘ | ∘ | ∘ | ∘ | Y | C | |
| AngularQA | 2019 | 3DRobot, CASP9-12 | LSTM | [ | • | ∘ | ∘ | • | • | ∘ | ∘ | ∘ | ∘ | Y | C | |
| 3DCNN (Sato) | 2019 | 3DRobot, CASP11-12 | CNN | [ | • | ∘ | ∘ | ∘ | ∘ | ∘ | ∘ | ∘ | ∘ | Y | C | |
1 “•”/“∘” denotes that the input property category (PC–FOM) is adopted/not adopted by the EMA method. 2 Owing to their relatively low popularity and the unavailability of servers or source code, these models are not reviewed here. 3 “S” denotes server, and “C” denotes code.
Figure 3. A simple illustration of an SVM; + and - represent the sample labels.
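The separating hyperplane sketched in Figure 3 can be found by minimizing the regularized hinge loss. Below is a minimal linear SVM trained by sub-gradient descent on toy 2-D points with +/- labels; it is an illustrative sketch only, not the library SVMs (with many structural features) used by methods such as ProQ2 or SVMQA.

```python
import random

# Minimal linear SVM: minimize lam*||w||^2 + hinge loss by
# stochastic sub-gradient descent on toy, linearly separable 2-D data.

def train_svm(points, labels, lam=0.01, lr=0.1, epochs=200, seed=0):
    rnd = random.Random(seed)
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        idx = list(range(len(points)))
        rnd.shuffle(idx)
        for i in idx:
            x, y = points[i], labels[i]
            margin = y * (w[0] * x[0] + w[1] * x[1] + b)
            if margin < 1:  # inside margin: hinge-loss sub-gradient is active
                w[0] += lr * (y * x[0] - 2 * lam * w[0])
                w[1] += lr * (y * x[1] - 2 * lam * w[1])
                b += lr * y
            else:           # outside margin: only the regularizer acts
                w[0] -= lr * 2 * lam * w[0]
                w[1] -= lr * 2 * lam * w[1]
    return w, b

def predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1

pts = [(2, 2), (3, 3), (2, 3), (-2, -2), (-3, -1), (-1, -3)]
ys = [1, 1, 1, -1, -1, -1]
w, b = train_svm(pts, ys)
print([predict(w, b, p) for p in pts])
```

For EMA, the same machinery is used as support vector *regression*: instead of a +/- label, the target is a continuous quality score such as GDT_TS.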
Figure 4. Diagram of the NN used by ProQ3D. Hidden Layers 1 and 2 contain 600 and 200 cells, respectively.
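A forward pass through a network with the layer sizes of Figure 4 can be sketched as follows. Only the 600/200 hidden-layer sizes come from the figure; the activations, input width, and random weights are assumptions for illustration, not ProQ3D's actual implementation.

```python
import math, random

# Sketch of a ProQ3D-style feed-forward pass:
# input features -> 600 units -> 200 units -> one sigmoid quality score.

def make_layer(n_in, n_out, rnd):
    scale = 1.0 / math.sqrt(n_in)
    return [[rnd.uniform(-scale, scale) for _ in range(n_in)]
            for _ in range(n_out)]

def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def relu(v):
    return [max(0.0, a) for a in v]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, layers):
    h1 = relu(matvec(layers[0], x))           # Hidden Layer 1: 600 cells
    h2 = relu(matvec(layers[1], h1))          # Hidden Layer 2: 200 cells
    return sigmoid(matvec(layers[2], h2)[0])  # scalar score in (0, 1)

rnd = random.Random(42)
n_features = 50  # assumed input width, for illustration only
layers = [make_layer(n_features, 600, rnd),
          make_layer(600, 200, rnd),
          make_layer(200, 1, rnd)]
score = forward([rnd.random() for _ in range(n_features)], layers)
print(0.0 < score < 1.0)
```

The sigmoid output keeps the predicted quality in (0, 1), matching score ranges such as normalized GDT_TS or S-score.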
Performances of the ProQ models in CASP11 [58].
| Model | CC-Glob | CC-Target | CC-Loc | CC-Model | AUC | GDT_TS Loss |
|---|---|---|---|---|---|---|
| ProQ | 0.60 | 0.44 | 0.50 | 0.39 | 0.78 | 0.06 |
| ProQ2 | 0.81 | 0.65 | 0.69 | 0.47 | 0.83 | 0.06 |
| ProQ2D | 0.85 | 0.68 | 0.72 | 0.49 | 0.89 | 0.05 |
| ProQ3 | 0.85 | 0.65 | 0.73 | 0.51 | 0.89 | 0.06 |
CC-Glob and CC-Target are the Pearson correlations of the global model quality, calculated over the whole dataset and averaged per target, respectively. CC-Loc and CC-Model are the equivalent Pearson correlations of the local model quality. AUC is the area under the ROC curve for local predictions; residues closer than 3.8 Å to their positions in the target are considered correct. GDT_TS loss is the average difference in global distance test total score (GDT_TS) between the selected model and the best possible model of that target.
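The two global metrics defined above (per-target Pearson correlation and GDT_TS loss) can be computed as in the generic sketch below; the scores are made up for illustration and do not come from any reviewed method.

```python
import math

# Per-target Pearson correlation between predicted and true GDT_TS
# scores, and GDT_TS loss: true score of the best model minus the true
# score of the model ranked first by the predictor.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def gdt_ts_loss(pred, true):
    best = max(true)
    chosen = true[max(range(len(pred)), key=lambda i: pred[i])]
    return best - chosen

true_scores = [0.62, 0.55, 0.71, 0.40]  # hypothetical GDT_TS values
pred_scores = [0.60, 0.50, 0.65, 0.45]  # hypothetical predictions
print(round(pearson(pred_scores, true_scores), 3))
print(round(gdt_ts_loss(pred_scores, true_scores), 2))
```

Here the predictor ranks the truly best model first, so the loss is zero even though the predicted values themselves are imperfect; this is why loss and correlation are reported together.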
Comparison of local scoring performances of EMA methods in CAMEO [34].
| Model | AUC | StdErr | AUC 0–0.1 | AUC 0–0.1 Rescaled |
|---|---|---|---|---|
| ModFOLD4 | 0.8638 | 0.00099 | 0.0467 | 0.4669 |
| ProQ2 | 0.8374 | 0.00107 | 0.0428 | 0.4283 |
| Verify3d | 0.7020 | 0.00134 | 0.0208 | 0.2081 |
| Dfire v1.1 | 0.6606 | 0.00138 | 0.0168 | 0.1675 |
Twenty-six weeks of data collected between 29 April and 21 October 2016. AUC = area under the ROC curve. StdErr = standard error of the AUC score. AUC 0–0.1 = area under the ROC curve with false positive rate ≤ 0.1. The table is sorted by AUC score.
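The full AUC and the truncated AUC 0–0.1 from the table above can be computed directly from the ROC curve, as in the sketch below (toy scores and labels; real CAMEO evaluation uses per-residue correctness labels).

```python
# ROC AUC and partial AUC with false positive rate <= 0.1, computed by
# stepping through thresholds in decreasing score order. Toy data only.

def roc_points(scores, labels):
    """Return (fpr, tpr) points of the ROC curve, one per threshold."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    pts = [(0.0, 0.0)]
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        pts.append((fp / neg, tp / pos))
    return pts

def auc(pts, fpr_cap=1.0):
    """Area under the ROC step curve, up to a false-positive-rate cap."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        x1c = min(x1, fpr_cap)
        if x1c > x0:
            area += (x1c - x0) * y0  # horizontal segments have y0 == y1
    return area

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]  # predicted residue scores
labels = [1, 1, 0, 1, 0, 0]              # 1 = residue modeled correctly
curve = roc_points(scores, labels)
print(round(auc(curve), 3))          # full AUC
print(round(auc(curve, 0.1), 4))     # AUC 0-0.1
print(round(auc(curve, 0.1) / 0.1, 3))  # rescaled to [0, 1]
```

The rescaled column in the table divides the partial area by its maximum possible value (0.1), so all three quantities lie in [0, 1].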
Figure 5. The convolutional neural network structure of 3DCNN.
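The core operation in 3DCNN-style EMA methods is a 3-D convolution over a voxelized atom-density grid. The sketch below shows that single operation for one channel and one filter, with illustrative values; real networks stack many such filters with nonlinearities and pooling.

```python
# Minimal valid (no-padding) 3-D convolution of a voxel grid with one
# cubic kernel, written with plain nested loops for clarity.

def conv3d(grid, kernel):
    G, K = len(grid), len(kernel)
    out_size = G - K + 1
    out = [[[0.0] * out_size for _ in range(out_size)]
           for _ in range(out_size)]
    for x in range(out_size):
        for y in range(out_size):
            for z in range(out_size):
                s = 0.0
                for i in range(K):
                    for j in range(K):
                        for k in range(K):
                            s += grid[x + i][y + j][z + k] * kernel[i][j][k]
                out[x][y][z] = s
    return out

# 4x4x4 grid of ones with a 3x3x3 kernel of ones:
# each output voxel sums 27 grid cells, and the output is 2x2x2.
grid = [[[1.0] * 4 for _ in range(4)] for _ in range(4)]
kernel = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
out = conv3d(grid, kernel)
print(len(out), out[0][0][0])  # 2 27.0
```

In practice the input grid is built per residue or per model from atom positions, with one channel per atom type.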
Local-score performance comparison of ProQ4 and ProQ3D on CASP11 [21].
| Method | R-Local | RMSE Local | R-per Model |
|---|---|---|---|
| ProQ3D | | | |
| ProQ4 | 0.77 | 0.147 | 0.56 |
R-local: Pearson correlation between all local predicted and true scores in the dataset; RMSE local: root mean squared error between the predicted local scores and their true values; R-per model: average Pearson correlation between the predicted and true scores of each model in the dataset.
Global-score performance comparison of ProQ4 and ProQ3D on CASP11 [21].
| Method | R-Global | RMSE Global | R-per Target | First Rank Loss |
|---|---|---|---|---|
| ProQ3D | 0.90 | | 0.82 | 0.040 |
| ProQ4 | | | | 0.085 |
R-global: correlation between all global predicted and true scores in the dataset; RMSE global: RMSE between the global predicted and true scores; R-per target: average correlation of the global scores of the models for each protein; First Rank Loss: average difference between the true scores of the best model and the top-ranked model for each target.
Performances of 3DCNN_MQA and other state-of-the-art MQA methods on CASP11 Stage 1 and Stage 2 [91].
| Method | Loss 1 | Pearson | Spearman | Kendall |
|---|---|---|---|---|
| Stage 1 | | | | |
| ProQ3D | 0.046 | 0.755 | 0.673 | 0.529 |
| ProQ2D | 0.064 | 0.729 | 0.604 | 0.468 |
| VoroMQA | 0.087 | 0.637 | 0.521 | 0.394 |
| RWplus | 0.122 | 0.512 | 0.402 | 0.303 |
| Stage 2 | | | | |
| VoroMQA | 0.063 | 0.457 | 0.449 | 0.321 |
| ProQ3D | 0.066 | 0.452 | 0.433 | 0.307 |
| ProQ2D | 0.072 | 0.437 | 0.422 | 0.299 |
| RWplus | 0.089 | 0.206 | 0.248 | 0.176 |
1 Loss: difference between the GDT_TS of the best decoy and the GDT_TS of the decoy with the lowest predicted score s [91].
Figure 6. Structure of DeepQA. The neurons in each restricted Boltzmann machine (RBM) are independent and unconnected within a layer, but fully connected between layers.
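The RBM layers in Figure 6 can each be sampled in a single pass, precisely because units are independent within a layer and fully connected between layers. The sketch below shows one up/down Gibbs-sampling step of a binary RBM with toy random weights; it illustrates the building block of a deep belief network like DeepQA's, not DeepQA's trained parameters.

```python
import math, random

# One Gibbs step of a binary RBM: sample hidden units given visible
# units, then reconstruct the visible units given the hidden sample.

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sample_hidden(v, W, b_h, rnd):
    """P(h_j = 1 | v) = sigmoid(b_h[j] + sum_i v[i] * W[i][j])."""
    probs = [sigmoid(b_h[j] + sum(v[i] * W[i][j] for i in range(len(v))))
             for j in range(len(b_h))]
    return [1 if rnd.random() < p else 0 for p in probs]

def sample_visible(h, W, b_v, rnd):
    """P(v_i = 1 | h) = sigmoid(b_v[i] + sum_j h[j] * W[i][j])."""
    probs = [sigmoid(b_v[i] + sum(h[j] * W[i][j] for j in range(len(h))))
             for i in range(len(b_v))]
    return [1 if rnd.random() < p else 0 for p in probs]

rnd = random.Random(1)
n_v, n_h = 6, 4  # toy layer sizes, for illustration only
W = [[rnd.uniform(-1, 1) for _ in range(n_h)] for _ in range(n_v)]
b_v, b_h = [0.0] * n_v, [0.0] * n_h

v0 = [1, 0, 1, 1, 0, 0]                  # toy binary feature vector
h = sample_hidden(v0, W, b_h, rnd)       # up-pass
v1 = sample_visible(h, W, b_v, rnd)      # down-pass (reconstruction)
print(len(h), len(v1))
```

Stacking such RBMs and fine-tuning the stack with backpropagation yields the deep belief network architecture used for quality prediction.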
Comparison between DeepQA and other top-performing single-model QA methods on CASP11 (Stage 1 and Stage 2) [74].
| Method | Corr. on Stage 1 | Loss on Stage 1 | Corr. on Stage 2 | Loss on Stage 2 |
|---|---|---|---|---|
| DeepQA | 0.64 | 0.09 | 0.42 | 0.06 |
| ProQ2 | 0.64 | 0.09 | 0.37 | 0.06 |
| Qprob | 0.63 | 0.10 | 0.38 | 0.07 |
| Wang_SVM | 0.66 | 0.11 | 0.36 | 0.09 |
| Wang_deep_2 | 0.63 | 0.12 | 0.31 | 0.09 |
| Wang_deep_1 | 0.61 | 0.13 | 0.30 | 0.09 |
| Wang_deep_3 | 0.63 | 0.12 | 0.30 | 0.09 |
| RFMQA | 0.54 | 0.12 | 0.29 | 0.08 |
| ProQ3 | 0.65 | 0.07 | 0.38 | 0.06 |
Corr.: average per-target Pearson correlation between the real and predicted GDT_TS scores of all models; Loss: average per-target loss, defined as the difference between the GDT_TS scores of the selected model and the best model in the model pool.
Figure 7. Simplified neural cell of an LSTM, showing its three gates. c_t, x_t, and h_t represent the cell state, input, and output at time step t, respectively.
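One forward step of the LSTM cell in Figure 7 can be written out explicitly: the forget, input, and output gates combine the previous cell state, the current input, and the previous output. The scalar weights below are illustrative assumptions; AngularQA's actual network uses vector-valued states and learned parameters.

```python
import math

# One forward step of a single scalar LSTM cell with its three gates.

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate
    c = f * c_prev + i * g   # new cell state c_t
    h = o * math.tanh(c)     # new output h_t
    return h, c

# Illustrative constant parameters; a real cell learns these.
params = {k: 0.5 for k in
          ("wf", "uf", "bf", "wi", "ui", "bi",
           "wo", "uo", "bo", "wg", "ug", "bg")}

h, c = 0.0, 0.0
for x in (1.0, -0.5, 0.25):  # toy per-residue input sequence
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

Feeding per-residue features through such a recurrent cell, and reading out a score at the end of the chain, is the basic idea behind sequence-based quality predictors like AngularQA.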
Global score performances of AngularQA on CASP12 (Stage 1 and Stage 2) [78].
| Method | Corr. on Stage 1 | Loss on Stage 1 | Corr. on Stage 2 | Loss on Stage 2 |
|---|---|---|---|---|
| AngularQA | 0.545 | 0.116 | 0.393 | 0.128 |
| ProQ3 | 0.638 | 0.048 | 0.616 | 0.068 |
| DeepQA | 0.654 | 0.078 | 0.578 | 0.100 |
| Wang1 | 0.462 | 0.170 | 0.256 | 0.144 |
| QMEAN | 0.342 | 0.174 | 0.292 | 0.125 |
Corr.: average per-target Pearson correlation between the real and predicted GDT scores of all models; Loss: average per-target loss, defined as the difference between the GDT scores of the selected model and the best model in the model pool.