Yuma Takei, Takashi Ishida.
Abstract
Model quality assessment (MQA), which selects near-native structures from structure models, is an important process in protein tertiary structure prediction. The three-dimensional convolutional neural network (3DCNN) was applied to the task, but its performance was only comparable to existing methods because it used only atom-type features as the input. Thus, we added sequence profile-based features, which are also used in other methods, to improve the performance. We developed a single-model MQA method for protein structures based on 3DCNN using sequence profile-based features, namely, P3CMQA. Performance evaluation using a CASP13 dataset showed that profile-based features improved the assessment performance, and the proposed method was better than currently available single-model MQA methods, including the previous 3DCNN-based method. We also implemented a web interface for the method to make it more user-friendly.
Keywords: 3DCNN; CASP; deep learning; estimation of model accuracy (EMA); machine learning; model quality assessment (MQA); protein structure prediction
Year: 2021 PMID: 33808604 PMCID: PMC8003382 DOI: 10.3390/bioengineering8030040
Source DB: PubMed Journal: Bioengineering (Basel) ISSN: 2306-5354
Figure 1. Overall workflow of this work. First, a bounding box is generated for each residue from the coordinate information of the model structure. Then, 14-dimensional atom-type features are obtained from the model structure. In addition, 20-dimensional sequence profile features and 4-dimensional local structure features are generated from the sequences. These features are then input to the three-dimensional convolutional neural network to predict a local score for each residue. Finally, the local scores are averaged to obtain a global score for the entire model.
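The last step of the workflow, and the feature dimensions named in the caption, can be illustrated with a minimal sketch. The function name `global_score` and the example scores are hypothetical and are not taken from the authors' code; only the feature dimensions (14, 20, 4) and the averaging step come from the caption.

```python
from statistics import fmean

# Feature dimensions stated in the Figure 1 caption.
N_ATOM_TYPE = 14
N_PROFILE = 20
N_LOCAL_STRUCTURE = 4
N_FEATURES = N_ATOM_TYPE + N_PROFILE + N_LOCAL_STRUCTURE  # 38 feature dimensions in total


def global_score(local_scores):
    """Average per-residue local scores into one global score for a model."""
    if not local_scores:
        raise ValueError("a model needs at least one residue score")
    return fmean(local_scores)


# Hypothetical local scores for a five-residue model (illustrative values only).
example_local_scores = [0.82, 0.75, 0.40, 0.91, 0.62]
print(N_FEATURES, round(global_score(example_local_scores), 3))
```

Because the global score is a plain mean, every residue contributes equally regardless of its position in the structure.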
14 atom-type features.
| Type | Description | Residue:Atom |
|---|---|---|
| 1 | Sulfur/selenium | CYS:SG, MET:SD, MSE:SE |
| 2 | Nitrogen (amide) | ASN:ND2, GLN:NE2, backbone N (including N-terminal) |
| 3 | Nitrogen (aromatic) | HIS:ND1/NE2, TRP:NE1 |
| 4 | Nitrogen (guanidinium) | ARG:NE/NH * |
| 5 | Nitrogen (ammonium) | LYS:NZ |
| 6 | Oxygen (carbonyl) | ASN:OD1, GLN:OE1, backbone O (except C-terminal) |
| 7 | Oxygen (hydroxyl) | SER:OG, THR:OG1, TYR:OH |
| 8 | Oxygen (carboxyl) | ASP:OD *, GLU:OE *, C-terminal O, C-terminal OXT |
| 9 | Carbon (sp2) | ARG:CZ, ASN:CG, ASP:CG, GLN:CD, GLU:CD, backbone C |
| 10 | Carbon (aromatic) | HIS:CG/CD2/CE1, PHE:CG/CD */CE */CZ, TRP:CG/CD */CE */CZ */CH2, TYR:CG/CD */CE */CZ |
| 11 | Carbon (sp3) | ALA:CB, ARG:CB/CG/CD, ASN:CB, ASP:CB, CYS:CB, GLN:CB/CG, GLU:CB/CG, HIS:CB, ILE:CB/CG */CD1, LEU:CB/CG/CD *, LYS:CB/CG/CD/CE, MET:CB/CG/CE, MSE:CB/CG/CE, PHE:CB, PRO:CB/CG/CD, SER:CB, THR:CB/CG2, TRP:CB, TYR:CB, VAL:CB/CG *, backbone CA |
| 12 | Occupancy | *:* |
| 13 | Backbone | *:N, *:CA, *:C |
| 14 | CA | *:CA |
Features 1–11 are those used by Derevyanko et al. [4], and features 12–14 were added by Sato et al. [5]. An asterisk (*) in the residue position matches all residues, and an asterisk in the atom position stands for 1, 2, or 3. The exception is the Occupancy row, where the atom asterisk matches all atoms.
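As an illustration of how the table can be read, the sketch below maps a (residue, atom) pair to the feature indices it activates. It covers only a small subset of the rows and ignores the special handling of terminal atoms. The names `SIDE_CHAIN_TYPES`, `BACKBONE_TYPES`, and `atom_channels` are hypothetical, and treating the features as a multi-hot set is an assumption drawn from the overlapping rows (for example, CA appears under types 11, 13, and 14).

```python
# Subset of the (residue, atom) -> chemical type mapping from the table; not exhaustive.
SIDE_CHAIN_TYPES = {
    ("CYS", "SG"): 1, ("MET", "SD"): 1, ("MSE", "SE"): 1,
    ("LYS", "NZ"): 5,
    ("SER", "OG"): 7, ("THR", "OG1"): 7, ("TYR", "OH"): 7,
    ("ALA", "CB"): 11,
}
# Backbone atoms for non-terminal residues (terminal N/O/OXT handling is omitted here).
BACKBONE_TYPES = {"N": 2, "O": 6, "C": 9, "CA": 11}


def atom_channels(residue, atom):
    """Return the sorted 1-based feature indices that an atom activates."""
    channels = {12}  # Occupancy applies to every atom of every residue.
    chemical = SIDE_CHAIN_TYPES.get((residue, atom), BACKBONE_TYPES.get(atom))
    if chemical is not None:
        channels.add(chemical)
    if atom in ("N", "CA", "C"):
        channels.add(13)  # Backbone
    if atom == "CA":
        channels.add(14)  # CA
    return sorted(channels)


print(atom_channels("ALA", "CA"))
print(atom_channels("LYS", "NZ"))
```

A full implementation would need every row of the table plus the C-terminal oxygen rule, which moves the backbone O from type 6 to type 8.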
Neural network architecture of Sato-3DCNN.
| Layer Name | Output Shape | Detail |
|---|---|---|
| Input | | |
| Conv3D | | Batch Normalization, PReLU |
| Conv3D | | Batch Normalization, PReLU |
| Conv3D | | Batch Normalization, PReLU |
| Conv3D | | Batch Normalization, PReLU |
| Conv3D | | Batch Normalization, PReLU |
| Conv3D | | Batch Normalization, PReLU |
| Global Average Pooling | 1024 | |
| Linear | 1024 | Batch Normalization, PReLU |
| Linear | 256 | Batch Normalization, PReLU |
| Linear | 1 | |
The first column shows the layer name; the second column shows the output shape of the layer; and the third column shows the detailed information of the layer. The total number of parameters is about 22 million.
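For a sense of scale, the weights in the three Linear layers can be counted from the output shapes listed in the table (1024, 1024, 256, 1). This is a rough sketch under two assumptions that the table does not state: each Linear layer has a bias term, and the Batch Normalization and PReLU parameters are ignored.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer (bias assumed present)."""
    return n_in * n_out + n_out


# Global Average Pooling output followed by the three Linear output sizes.
sizes = [1024, 1024, 256, 1]
head_params = sum(linear_params(a, b) for a, b in zip(sizes, sizes[1:]))

TOTAL_PARAMS = 22_000_000  # "about 22 million", as stated in the table note
print(head_params, f"{head_params / TOTAL_PARAMS:.1%}")
```

Under these assumptions the fully connected head holds about 1.3 million parameters, a small fraction of the stated total, which suggests that most parameters sit in the convolutional layers.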
Details of the training dataset and the test dataset.
| Dataset | Subset | Number of Targets | Number of Model Structures per Target |
|---|---|---|---|
| Train | Train | 337 | |
| Train | Validation | 85 | |
| Test | CASP12 | 51 | |
| Test | CASP13 | 66 | |
Prediction performance for the combination of each feature.
| Atom-Type Features | Evolutionary Information | Predicted Local Structure | Pearson (Validation) |
|---|---|---|---|
| ✓ | ✗ | ✗ | |
| ✓ | ✓ | ✗ | |
| ✓ | ✗ | ✓ | |
| ✗ | ✓ | ✓ | |
| ✓ | ✓ | ✓ | |
Columns 1–3 represent combinations of features referenced in Section 2.1. The fourth column represents the average Pearson correlation coefficient for each target in the validation dataset. The best performance in the fourth column is in boldface.
Performance in the CASP12 stage 2 test dataset.
| Method | Pearson | Spearman | Loss | Z-Score |
|---|---|---|---|---|
| Proposed | | | | |
| | (−) | (−) | (−) | (−) |
| Sato-3DCNN | | | | |
| ProQ3D | | | | |
| SBROD | | | | |
| VoroMQA | | | | |
The first column represents the method name. The second and third columns show the average Pearson and Spearman correlation coefficients per target. The fourth and fifth columns show the average GDT_TS loss and average Z-score of selected models for each target. The values in parentheses are the p-values calculated by the Wilcoxon signed-rank test for the difference between the proposed method and the comparison method. The best values and p-values smaller than 0.01 are shown in bold.
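The four per-target metrics named in this legend can be sketched in plain Python. The definitions below follow common MQA usage and are assumptions rather than the authors' exact implementation: the loss is the gap between the best true GDT_TS in the target and the true GDT_TS of the top-ranked model, and the Z-score standardizes the selected model's true GDT_TS against all models of the target using the population standard deviation. All function names and example values are hypothetical.

```python
from statistics import fmean, pstdev


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def selected_index(predicted):
    """Index of the model with the highest predicted score."""
    return max(range(len(predicted)), key=lambda i: predicted[i])


def gdt_ts_loss(predicted, true_gdt):
    """Best true GDT_TS in the target minus that of the selected model."""
    return max(true_gdt) - true_gdt[selected_index(predicted)]


def z_score(predicted, true_gdt):
    """Standardized true GDT_TS of the selected model within its target."""
    chosen = true_gdt[selected_index(predicted)]
    return (chosen - fmean(true_gdt)) / pstdev(true_gdt)


# Hypothetical scores for four models of one target (illustrative values only).
predicted = [0.30, 0.55, 0.50, 0.80]
true_gdt = [0.35, 0.60, 0.70, 0.65]
print(pearson(predicted, true_gdt), spearman(predicted, true_gdt))
print(gdt_ts_loss(predicted, true_gdt), z_score(predicted, true_gdt))
```

Each function works on the models of a single target; the table values would then be the averages of these per-target results over all targets.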
Performance in the CASP13 stage 2 test dataset.
| Method | Pearson | Spearman | Loss | Z-Score |
|---|---|---|---|---|
| Proposed | | | | |
| | (−) | (−) | (−) | (−) |
| Sato-3DCNN | | | | |
| ProQ3D | | | | |
| SBROD | | | | |
| VoroMQA | | | | |
The legends are the same as those in Table 3.
The average Pearson correlation coefficient for each category of targets on the CASP13 dataset.
| Method | FM (12 Targets) | FM/TBM (15 Targets) | TBM (37 Targets) |
|---|---|---|---|
| Proposed | | | |
| Sato-3DCNN (AMSGrad) | | | |
| ProQ3D | | | |
| SBROD | | | |
| VoroMQA | | | |
The first column represents the method name. The second column shows the average Pearson correlation coefficients for targets in category FM. Columns 3 and 4 similarly show the values for targets in categories FM/TBM and TBM, respectively. The values in parentheses are the p-values calculated by the Wilcoxon signed-rank test for the difference between the proposed method and the comparison method. The best values and p-values smaller than 0.01 are shown in bold.
Figure A1. Swarm plot and box plot of the Pearson correlation coefficient for each target on CASP13. The x-axis represents the Pearson correlation coefficient, and the y-axis represents the method. A point represents a target, and the color of the point represents the category of the target.
Figure 2. The input page of the web tool. The email address and the model structure in PDB format or mmCIF format are required inputs, and the sequence in FASTA format is an optional input. You can check the number of running jobs and the number of waiting jobs.
Figure 3. The output page of the prediction results. The predicted score for the whole model, the predicted score for each residue, and the three-dimensional structure colored by the local score are shown. The parts colored in blue represent high local scores, and the parts colored in red represent low local scores. The results can be downloaded in multiple formats.
Performance comparison in the validation dataset by feature combinations.
| Atom-Type Features | Evolutionary Information | Predicted Local Structure | Pearson | Spearman | Loss | Z-Score | AUC |
|---|---|---|---|---|---|---|---|
| ✓ | ✗ | ✗ | | | | | |
| ✓ | ✓ | ✗ | | | | | |
| ✓ | ✗ | ✓ | | | | | |
| ✗ | ✓ | ✓ | | | | | |
| ✓ | ✓ | ✓ | | | | | |
Columns 1–3 represent combinations of features referenced in Section 2.1. The fourth and fifth columns show the average Pearson and Spearman correlation coefficients for each target in the validation dataset. The sixth and seventh columns show the average GDT_TS loss and average Z-score of selected models for each target. The eighth column shows the area under the ROC curve (AUC) calculated from the predicted local scores and the local labels.
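The AUC in the eighth column can be illustrated with a pairwise sketch: the AUC equals the probability that a randomly chosen positive residue receives a higher predicted local score than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` and the example data are hypothetical, and the assumption that the local labels are binary (1 for a correctly placed residue, 0 otherwise) is not spelled out in the legend.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ordering
    return wins / (len(positives) * len(negatives))


# Hypothetical local scores and binary local labels for five residues.
local_scores = [0.9, 0.4, 0.7, 0.3, 0.6]
local_labels = [1, 1, 0, 0, 1]
print(roc_auc(local_scores, local_labels))
```

The pairwise form is quadratic in the number of residues, so a rank-based formulation would be preferable for a large validation set.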