Abstract
Since Anfinsen demonstrated in 1973 that the information encoded in a protein's amino acid sequence determines its structure, solving the protein structure prediction problem has been the Holy Grail of structural biology. The goal of protein structure prediction is to use computational modeling to determine the spatial location of every atom in a protein molecule, starting from only its amino acid sequence. Depending on whether homologous structures can be found in the Protein Data Bank (PDB), structure prediction methods have historically been categorized as template-based modeling (TBM) or template-free modeling (FM) approaches. Until recently, TBM was the most reliable approach to predicting protein structures, and modeling accuracy declined sharply in the absence of reliable templates. Nevertheless, the results of the most recent community-wide Critical Assessment of protein Structure Prediction experiment (CASP14) have demonstrated that the protein structure prediction problem can be largely solved through the use of end-to-end deep machine learning techniques, where correct folds could be built for nearly all single-domain proteins without using PDB templates. Critically, model quality exhibited little correlation with either the quality of the available template structures or the number of sequence homologs detected for a given target protein. Thus, deep-learning techniques have essentially broken through the 50-year-old modeling border between TBM and FM approaches and have made the success of high-resolution structure prediction significantly less dependent on template availability in the PDB library.
Keywords: contact map; deep learning; distance prediction; end-to-end structure prediction; free modeling; multiple sequence alignment; protein structure prediction; template-based modeling
Year: 2021 PMID: 34119522 PMCID: PMC8254035 DOI: 10.1016/j.jbc.2021.100870
Source DB: PubMed Journal: J Biol Chem ISSN: 0021-9258 Impact factor: 5.486
Useful methods for protein structure prediction covered in this review, with links to the available resources
| Multiple sequence alignment (MSA) construction | |
| PSI-BLAST | |
| HHBlits | Web server: |
| Jackhmmer | |
| Hmmsearch | |
| DeepMSA | |
| Threading and Fold-recognition | |
| LOMETS | |
| HHsearch | |
| MUSTER | |
| map_align | |
| EigenTHREADER | |
| CEthreader | |
| DisCovER | |
| RaptorX | |
| Full-length Structure Assembly for Template-Based Modeling (TBM) | |
| I-TASSER | |
| MODELLER | |
| RosettaCM | |
| SWISS-MODEL | |
| Phyre2 | |
| Fragment Assembly Simulation Methods for Free Modeling (FM) | |
| Rosetta | Web server: |
| QUARK | |
| FragFold | |
| Co-evolution and Deep Learning-Based Contact/Distance Prediction | |
| PSICOV | |
| CCMpred | |
| GREMLIN | |
| NeBcon | |
| MetaPSICOV | |
| ResPRE | |
| TripletRes | |
| RaptorX-Contact | |
| MSA Transformer | |
| Deep Learning-Based Full-length Structure Prediction | |
| AlphaFold | |
| D-I-TASSER | |
| D-QUARK | |
| trRosetta | |
| DMPfold | Web server: |
Figure 1. Important milestones in protein structure prediction that are covered in this review.
Figure 2. Typical steps in a homology-based modeling pipeline. Starting from a query sequence, templates are identified using sequence-based alignment algorithms. Then the structural framework of the best template alignment is copied, and the unaligned regions are constructed to produce the final model.
Figure 3. Typical steps in template/fragment assembly and gradient descent-based protein structure prediction pipelines. Starting from a query sequence, a multiple sequence alignment (MSA) is constructed by identifying homologous sequences from a sequence database. Then, using profiles or predicted structural features derived from the MSA, either global template structures (for TBM) or local fragments (for FM) are identified from databases of solved protein structures. Additionally, coevolutionary analysis of the MSA is fed into deep neural networks to predict pairwise restraints such as distance maps, interresidue orientations, and hydrogen bond networks. The structure assembly stage may assemble the local fragments or global template structures, or it may directly minimize the structure using rapid gradient descent methods. From here, a model is selected either by clustering the conformations generated during the structure assembly stage or by identifying the lowest energy structure; the selected model is then further refined using atomic-level refinement simulations to produce the final model.
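The direct-minimization route mentioned in the caption can be illustrated with a toy sketch. It assumes harmonic penalties on a handful of target distances and plain fixed-step gradient descent; the function names, step size, and reference coordinates are illustrative assumptions, not the potentials or optimizers used by any method in this review.

```python
import math
import random


def restraint_loss(coords, restraints):
    """Sum of squared deviations between model distances and target distances."""
    return sum((math.dist(coords[i], coords[j]) - target) ** 2
               for i, j, target in restraints)


def minimize_restraints(coords, restraints, steps=2000, lr=0.01):
    """Fixed-step gradient descent on restraint_loss; returns new coordinates."""
    coords = [list(p) for p in coords]  # work on a copy
    for _ in range(steps):
        grads = [[0.0] * 3 for _ in coords]
        for i, j, target in restraints:
            dist = math.dist(coords[i], coords[j]) or 1e-9  # avoid division by zero
            # d/dx_i of (dist - target)^2 = 2 * (dist - target) * (x_i - x_j) / dist
            scale = 2.0 * (dist - target) / dist
            for k in range(3):
                g = scale * (coords[i][k] - coords[j][k])
                grads[i][k] += g
                grads[j][k] -= g
        for point, grad in zip(coords, grads):
            for k in range(3):
                point[k] -= lr * grad[k]
    return coords


# Hypothetical 4-residue reference used only to generate consistent target distances.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 3.8)]
restraints = [(i, j, math.dist(reference[i], reference[j]))
              for i in range(4) for j in range(i + 1, 4)]

rng = random.Random(0)  # fixed seed so the run is repeatable
initial = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(4)]
final = minimize_restraints(initial, restraints)

print("loss before:", restraint_loss(initial, restraints))
print("loss after: ", restraint_loss(final, restraints))
```

Because distances are unchanged by rotation, translation, and reflection, the minimized coordinates need not coincide with the reference; only the pairwise distances are expected to approach their targets.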
Figure 4. Interresidue spatial restraints that are often used to assist protein 3D structure assembly simulations. The protein backbone atoms include the N, Cα, and C atoms, while the side chains include the Cβ atoms (absent in glycine) as well as the R groups, which distinguish the different amino acid residues. A, Cα/Cβ contacts and distances; B, interresidue torsion angles; C, hydrogen bond networks. Here, the backbone hydrogen bonds are represented using a Cα-based model, where three consecutive Cα atoms form a local coordinate system, from which various vectors and their orientations represent regular hydrogen bonding patterns observed in native proteins. D, typical pipeline for spatial restraint prediction. Starting from the amino acid sequence of a target protein, homologous protein sequences are collected from sequence databases and compiled to form a multiple sequence alignment (MSA). From the MSA, coevolutionary relationships are deduced and fed into a deep neural network, which may output the predicted contact/distance maps, interresidue orientations, and hydrogen bond networks.
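As a concrete illustration of the contact restraints in panel A, the following minimal sketch derives a binary contact map from one representative atom per residue. The 8 Å cutoff is an assumed, commonly used threshold that the caption itself does not state, and `contact_map` and the toy coordinates are hypothetical.

```python
import math


def contact_map(coords, cutoff=8.0, min_separation=1):
    """Return an L x L boolean matrix that is True where two residues are in contact.

    coords: one (x, y, z) tuple per residue, e.g. the Cβ atom (Cα for glycine).
    cutoff: distance threshold in angstroms (assumed value, not from the source).
    min_separation: minimum sequence separation |i - j| for a pair to be considered.
    """
    n = len(coords)
    contacts = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) < cutoff:
                contacts[i][j] = contacts[j][i] = True  # the map is symmetric
    return contacts


# Toy input: four pseudo-residues placed 3.8 Å apart along a straight line.
toy_coords = [(3.8 * k, 0.0, 0.0) for k in range(4)]
toy_map = contact_map(toy_coords)

for row in toy_map:
    print("".join("#" if c else "." for c in row))
```

A distance map, by contrast, would keep the value of `math.dist` for each pair (or a binned version of it) instead of thresholding it.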
Figure 5. Summary of contact map prediction results in CASP11 to CASP14. A, contact prediction results for different groups on all FM and FM/TBM targets. Groups are sorted in descending order of the average precision of their top L/5 long-range contacts, where L is the protein length and long-range contacts occur between positions that are separated by at least 24 residues. B, relationship between contact prediction precision and the MSA Neff value obtained by the DeepMSA program (184), where lines are the best fit on the individual targets by linear regression.
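The ranking metric in panel A can be sketched as follows. This is a minimal illustration of the definition given in the caption (top L/5 predictions among pairs separated by at least 24 residues); the function name, the input layout, and the toy numbers are assumptions for illustration.

```python
def top_l5_long_range_precision(predicted, native_contacts, length, min_sep=24):
    """Precision of the top L/5 long-range contact predictions.

    predicted: dict mapping (i, j) residue index pairs to a predicted contact score.
    native_contacts: set of (i, j) pairs with i < j that are true contacts.
    length: protein length L.
    """
    # Keep only long-range pairs and normalize each pair to (smaller, larger).
    long_range = [(score, (min(i, j), max(i, j)))
                  for (i, j), score in predicted.items()
                  if abs(i - j) >= min_sep]
    long_range.sort(key=lambda item: item[0], reverse=True)

    top = long_range[:max(1, length // 5)]
    if not top:
        return 0.0
    hits = sum(1 for _, pair in top if pair in native_contacts)
    # If fewer than L/5 long-range pairs were predicted, divide by what is available.
    return hits / len(top)


# Hypothetical predictions for a 30-residue protein (so L/5 = 6).
predicted = {
    (0, 24): 0.9, (0, 25): 0.8, (1, 26): 0.7, (2, 28): 0.6,
    (3, 29): 0.5, (4, 28): 0.4, (5, 29): 0.3,
    (0, 10): 0.99,  # high score, but separation 10 is not long-range
}
native = {(0, 24), (1, 26), (3, 29), (0, 10)}

precision = top_l5_long_range_precision(predicted, native, length=30)
print(precision)  # 3 of the top 6 long-range pairs are native contacts: 0.5
```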
Summary of the current state-of-the-art structure prediction methods, including their results in the most recent CASP experiment and their web server URL addresses
| Method | CASP14 group name | CASP14 results | Description; URL address |
|---|---|---|---|
| D-I-TASSER | Zhang-Server | First Place Server | Template and deep learning distance/orientation/hydrogen bond network-guided folding; |
| D-QUARK | QUARK | Second Place Server | Deep learning distance/orientation-guided folding; |
| AlphaFold2 | AlphaFold2 | First Place Human Group | End-to-end deep learning-based model prediction; |
| Rosetta | BAKER | Second Place Human Group | Deep learning distance/orientation-guided folding; Robetta Server: |
Methods in CASP are divided into server and human groups. Predictions by server groups are fully automated, whereas those by human groups do not have to be.
Figure 6. Summary of structure prediction results in the recent CASP experiments. A, relationship between the best TM score of the first submitted model and the Neff value of the MSA generated by the DeepMSA program (184). B, mean TM score of the best first TBM and FM models submitted in the corresponding CASP competitions. C, results for the best first TBM models (including TBM, TBM-easy, TBM-hard, and FM/TBM) submitted by any group in CASP7/11 to 14, where the models are categorized into one of three categories based on their TM scores: [0, 0.5), [0.5, 0.914], (0.914, 1.0]. D, results for the best first FM models submitted by any group in CASP7/11 to 14, categorized into the same three TM score ranges.
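The three TM score ranges used in panels C and D translate into a simple binning rule. The sketch below follows the interval boundaries exactly as written in the caption; the function names and the example scores are hypothetical.

```python
from collections import Counter


def tm_score_category(tm):
    """Map a TM score to one of the three ranges used in the caption."""
    if not 0.0 <= tm <= 1.0:
        raise ValueError(f"TM score must lie in [0, 1], got {tm}")
    if tm < 0.5:
        return "[0, 0.5)"
    if tm <= 0.914:
        return "[0.5, 0.914]"
    return "(0.914, 1.0]"


def summarize(tm_scores):
    """Count how many models fall into each TM score range."""
    return Counter(tm_score_category(tm) for tm in tm_scores)


# Hypothetical scores chosen to exercise both interval boundaries.
counts = summarize([0.31, 0.5, 0.77, 0.914, 0.92])
for label in ("[0, 0.5)", "[0.5, 0.914]", "(0.914, 1.0]"):
    print(label, counts[label])
```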
Summary of AlphaFold2’s modeling performance on CASP14 multidomain targets and each constituent domain
| Target | Domain (length) | TM-score |
|---|---|---|
| T1038 | Full Length (L = 190) | 0.92 |
| | Domain 1 (L = 114) | 0.90 |
| | Domain 2 (L = 76) | 0.91 |
| T1047s2 | Full Length (L = 346) | 0.77 |
| | Domain 1 (L = 147) | 0.96 |
| | Domain 2 (L = 83) | 0.93 |
| | Domain 3 (L = 116) | 0.62 |
| T1052 | Full Length (L = 832) | 0.69 |
| | Domain 1 (L = 539) | 0.96 |
| | Domain 2 (L = 213) | 0.99 |
| | Domain 3 (L = 80) | 0.98 |
| T1053 | Full Length (L = 576) | 0.97 |
| | Domain 1 (L = 405) | 0.99 |
| | Domain 2 (L = 171) | 0.95 |
| T1058 | Full Length (L = 382) | 0.96 |
| | Domain 1 (L = 221) | 0.94 |
| | Domain 2 (L = 161) | 0.96 |
| T1061 | Full Length (L = 838) | 0.77 |
| | Domain 1 (L = 464) | 0.93 |
| | Domain 2 (L = 271) | 0.81 |
| | Domain 3 (L = 103) | 0.95 |
| T1070 | Full Length (L = 321) | 0.49 |
| | Domain 1 (L = 76) | 0.62 |
| | Domain 2 (L = 101) | 0.97 |
| | Domain 3 (L = 76) | 0.78 |
| | Domain 4 (L = 68) | 0.95 |
| T1085 | Full Length (L = 406) | 0.94 |
| | Domain 1 (L = 167) | 0.95 |
| | Domain 2 (L = 182) | 0.98 |
| | Domain 3 (L = 57) | 0.83 |
| T1086 | Full Length (L = 381) | 0.94 |
| | Domain 1 (L = 193) | 0.96 |
| | Domain 2 (L = 188) | 0.96 |
| T1093 | Full Length (L = 629) | 0.94 |
| | Domain 1 (L = 141) | 0.88 |
| | Domain 2 (L = 382) | 0.95 |
| | Domain 3 (L = 106) | 0.93 |
| T1094 | Full Length (L = 484) | 0.91 |
| | Domain 1 (L = 277) | 0.87 |
| | Domain 2 (L = 207) | 0.96 |
| T1096 | Full Length (L = 426) | 0.56 |
| | Domain 1 (L = 255) | 0.94 |
| | Domain 2 (L = 171) | 0.85 |
| Average | Full Length (L = 484.3) | 0.82 |
| | Domains (L = 187.5) | 0.91 |
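The averages in the last two rows can be checked directly from the per-target entries. The sketch below transcribes the table into a hypothetical data structure, recomputes the means, and lists the targets whose full-length TM-score falls below that of every constituent domain, which is one simple way to flag possible domain-orientation errors.

```python
from statistics import mean

# (target, full length, full-length TM-score, [(domain length, domain TM-score), ...])
casp14_multidomain = [
    ("T1038", 190, 0.92, [(114, 0.90), (76, 0.91)]),
    ("T1047s2", 346, 0.77, [(147, 0.96), (83, 0.93), (116, 0.62)]),
    ("T1052", 832, 0.69, [(539, 0.96), (213, 0.99), (80, 0.98)]),
    ("T1053", 576, 0.97, [(405, 0.99), (171, 0.95)]),
    ("T1058", 382, 0.96, [(221, 0.94), (161, 0.96)]),
    ("T1061", 838, 0.77, [(464, 0.93), (271, 0.81), (103, 0.95)]),
    ("T1070", 321, 0.49, [(76, 0.62), (101, 0.97), (76, 0.78), (68, 0.95)]),
    ("T1085", 406, 0.94, [(167, 0.95), (182, 0.98), (57, 0.83)]),
    ("T1086", 381, 0.94, [(193, 0.96), (188, 0.96)]),
    ("T1093", 629, 0.94, [(141, 0.88), (382, 0.95), (106, 0.93)]),
    ("T1094", 484, 0.91, [(277, 0.87), (207, 0.96)]),
    ("T1096", 426, 0.56, [(255, 0.94), (171, 0.85)]),
]

domains = [d for _, _, _, doms in casp14_multidomain for d in doms]

full_len_mean = mean(length for _, length, _, _ in casp14_multidomain)
full_tm_mean = mean(tm for _, _, tm, _ in casp14_multidomain)
domain_len_mean = mean(length for length, _ in domains)
domain_tm_mean = mean(tm for _, tm in domains)

# Targets where the full-length model scores below every one of its domains.
below_all_domains = [name for name, _, tm, doms in casp14_multidomain
                     if tm < min(d_tm for _, d_tm in doms)]

print(f"full length: L = {full_len_mean:.2f}, TM-score = {full_tm_mean:.3f}")
print(f"domains:     L = {domain_len_mean:.2f}, TM-score = {domain_tm_mean:.3f}")
print("full-length TM-score below all domains:", below_all_domains)
```

The recomputed means agree with the reported averages to the precision shown in the table.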
Figure 7. Representative examples of AlphaFold2 models for multidomain protein structures in CASP14. The experimental structures are shown in red cartoons, while the predicted models are shown in different colors for different domains. A, modeling results for T1038, where AlphaFold2 achieved excellent performance on both the domain-level and full-length models. B, modeling results for T1052, where the domain-level models achieved extremely high accuracy, but the full-length assembled structure had incorrect domain orientations.