Krzysztof Ginalski, Nick V Grishin, Adam Godzik, Leszek Rychlewski.
Abstract
Despite recent efforts to develop automated protein structure determination protocols, structural genomics projects are slow in generating fold assignments for complete proteomes, and spatial structures remain unknown for many protein families. Alternative cheap and fast methods to assign folds using prediction algorithms continue to provide valuable structural information for many proteins. The development of high-quality prediction methods has been boosted in recent years by objective community-wide assessment experiments. This paper gives an overview of the currently available practical approaches to protein structure prediction capable of generating accurate fold assignments. Recent advances in the assessment of prediction quality are also discussed.
Year: 2005 PMID: 15805122 PMCID: PMC1074308 DOI: 10.1093/nar/gki327
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Sequence-to-structure-to-function paradigm. The leftmost picture shows the structure of Bacteriocin AS-48 (1e68, left) from Enterococcus faecalis, a 70-residue cyclic bacterial lysin (100). This protein is structurally and functionally related to mammalian NK-lysin (101) (1nkl, right) despite undetectable sequence similarity, as only 4% of residues are identical after structural superposition. The Bacteriocin sequence was target T0102 in the CASP-4 experiment. An excellent model (middle), with an RMSD of 3.5 Å over all 70 residues, was obtained by the Baker (59) group using the ab initio method Rosetta. No other method was able to predict this fold with similar accuracy. A search of the protein structure database with this model yielded NK-lysin as the first structural match of comparable length. This illustrates that the ab initio approach was able to predict a structure that could in turn be used to predict the function of the protein.
Figure 2. Protein structure prediction methods. (a) Sequence–sequence, profile–sequence and sequence–profile comparison methods represent a traditional evolutionary-based approach to predict structures of proteins. The simplest method (I) aligns the sequence of the target with the sequence of the template using a substitution matrix. More sensitive methods (II) define scores for aligning different amino acids separately for each position of the target sequence (PSI-BLAST) or the template sequence (RPS-BLAST). The scores are taken from the analysis of sequence variability in multiple alignments of the corresponding sequence families. Such position-specific scores are also called profiles. They are similar in format to the representation of sequence families used by prediction methods based on HMMs. (b) Profile–profile comparison methods utilize the profiles generated by the above-mentioned sequence alignment methods. Instead of looking up a substitution score, they compare two vectors with each other when building the dynamic programming matrix used to draw the alignment. The comparison is usually conducted by calculating a dot product of the two positional vectors (as shown in the figure) or by multiplying one vector times a substitution matrix times the other vector. Depending on the choice of the comparison function, the vectors are often rescaled before the operation. The sequence variability vectors are sometimes also augmented with meta information, such as predicted secondary structure, as indicated in the figure. (c) Threading or hybrid methods utilize the structure of the template protein in the comparison function. The position-specific alignment scores are computed for the template protein by replacing the side-chain of a residue with side-chains of all possible amino acids and by calculating the resulting substitution scores using statistically derived contact potentials. 
In addition, factors such as matching of predicted and observed secondary structure or burial preferences are also taken into account when aligning two positions. Most threading methods use the frozen approximation, where the sequence is threaded through the template structure and contacts are calculated between the target side-chain and side-chains of the residues of the template. In the much slower defrosted threading, template side-chains are replaced with side-chains of the target according to the alignment before calculating the contact scores. (d) Ab initio methods represent a physical approach to predict the structure of the target protein. The methods are based on an energy function, which estimates the conformational energy of the chain of the modeled protein. The energy can be calculated in a similar fashion as in the threading methods, i.e. utilizing contact potentials. The advantage of ab initio is that the database of folds does not constrain the set of possible results and theoretically any conformation can be generated and tested. Ab initio methods differ in the energy functions they employ and in the way conformational modifications are generated. The most common methods employ fragment insertion techniques or constrain the move set by placing the molecule on a lattice. (e) Meta predictors represent statistical approaches to improve the accuracy of protein structure predictions. Simple meta predictors collect models from prediction servers, compare the models and select the one that is most similar to the other models. The consensus model corresponds to a model selected from the collected set and represents the final prediction. More advanced meta predictors are able to modify the set of collected models either by filling missing parts with ab initio or loop modeling or by creating hybrid models from segments of structures collected from prediction servers. 
Hybrid models have a higher chance to provide a more complete model but are sometimes unphysical in terms of chain connectivity.
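The profile–profile comparison described in panel (b) can be illustrated with a minimal sketch. It assumes each profile is a list of per-position score vectors, uses a plain dot product as the comparison function, and fills a global dynamic programming matrix with a linear gap penalty. The function names, the gap value and the toy two-letter alphabet are hypothetical choices for illustration and do not reproduce any particular server.

```python
from typing import List, Sequence

Profile = List[Sequence[float]]  # one score vector per sequence position


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two positional vectors (the comparison function)."""
    return sum(a * b for a, b in zip(u, v))


def profile_profile_score(target: Profile, template: Profile,
                          gap: float = -1.0) -> float:
    """Best global alignment score of two profiles.

    Assumption: linear gap penalty `gap` per skipped position; real methods
    typically rescale vectors and use more elaborate gap models.
    """
    n, m = len(target), len(template)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Instead of a substitution-matrix lookup, compare two vectors.
            match = dp[i - 1][j - 1] + dot(target[i - 1], template[j - 1])
            dp[i][j] = max(match, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]


# Toy profiles over a two-letter alphabet (hypothetical data).
same = profile_profile_score([[1, 0], [0, 1]], [[1, 0], [0, 1]])
swapped = profile_profile_score([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

In this toy case the identical profiles score higher than the swapped ones, which is the behavior the dot-product comparison is meant to capture.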
Table 1. Publicly available fold recognition servers
| Code | Description |
|---|---|
| Sequence-only methods (no structural information required) | |
| PDBb | PDB-BLAST is based on the PSI-BLAST ( |
| FFAS FFA3 | FFAS ( |
| ORFs ORF2 | ORFeus ( |
| mBAS BasD BasP | Meta-BASIC (mBAS) ( |
| ST99 | Sam-T99 ( |
| SFAM SFPP | SUPERFAMILY ( |
| FRT1 | FORTE-1 ( |
| Hybrid methods (use structural information of the template) | |
| ST02 | SAM-T2K ( |
| 3DPS | 3D-PSSM ( |
| GETH MGTH | GenTHREADER (GETH) ( |
| FUG2 FUG3 | In FUGUE ( |
| RAPT | RAPTOR ( |
| SPKS | SPARKS (Sequence, secondary structure Profiles And Residue-level Knowledge-based Score for fold recognition) ( |
| PRO2 | PROSPECT (PROtein Structure Prediction and Evaluation Computer Toolkit) ( |
| INBG | INBGU ( |
| SHGU | ShotGun-INBGU ( |
| Structure meta predictors (build consensus from other servers) | |
| PCO2 PCO3 PCO4 PCO5 PMOD PMO3 PMO4 | Pcons ( |
| 3DS3 3DS5 | ShotGun ( |
| 3JAa 3JBa 3JCa 3JA1 3JB1 3JC1 | 3D-Jury ( |
| Ab initio | |
| RBTA | Robetta ( |
| PRCM | PROTINFO-CM ( |
The table lists selected, publicly available servers that took part in LiveBench-7 or LiveBench-8 and gives a short description of the underlying algorithm of each. The new ab initio meta predictors are quite slow and, because of this, not yet available for public use. Their additional weakness is the lack of a confidence score. Sequence-only methods neglect by definition any information about the structure of the template and, in contrast to hybrid methods, can be used as general homology inference methods between any protein families. Structure meta predictors currently offer the highest utility by producing accurate models with reliable confidence assessment.
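The simplest consensus strategy of a structure meta predictor, selecting the collected model that is most similar to all the others, might be sketched as follows. This is an illustration under strong assumptions: the models are given as CA coordinate lists already placed in a common frame, and similarity is taken to be the number of equivalent residues closer than a 3 Å cutoff. The server names, the coordinates and the similarity function are hypothetical and are not the scoring scheme of any server in the table.

```python
import math
from typing import Dict, List, Tuple

Coords = List[Tuple[float, float, float]]  # CA positions, one per residue


def similarity(a: Coords, b: Coords, cutoff: float = 3.0) -> int:
    """Count equivalent residues closer than `cutoff` (assumed measure).

    Assumes both models are already superimposed in the same frame.
    """
    return sum(1 for p, q in zip(a, b) if math.dist(p, q) < cutoff)


def consensus_model(models: Dict[str, Coords]) -> str:
    """Return the name of the model most similar to all other models."""
    def support(name: str) -> int:
        return sum(similarity(models[name], other)
                   for key, other in models.items() if key != name)
    return max(models, key=support)


# Hypothetical two-residue models from four servers.
models = {
    "serverA": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    "serverB": [(0.5, 0.0, 0.0), (1.5, 0.0, 0.0)],
    "serverC": [(10.0, 0.0, 0.0), (11.0, 0.0, 0.0)],
    "serverD": [(3.0, 0.0, 0.0), (4.0, 0.0, 0.0)],
}
best = consensus_model(models)
```

Here `serverB` is selected because it agrees with two other models, while the outlier `serverC` receives no support.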
Table 2. Selected evaluation measures used to assess the quality of 3D models
| GDT TS (Global Distance Test) ( |
| LG-score ( |
| Mammoth ( |
| MaxSub ( |
| 3D-score ( |
| CA-atoms<3 Å ( |
| Q(CA-atoms<3 Å) is aimed at evaluating the specificity of the alignment and penalizes wrong sections of the models. It is equal to the square of (CA-atoms<3 Å) divided by the number of residues in the model. This is the only measure used in LiveBench that penalizes overpredictions (alignments that are too long). Servers that always return coordinates for all residues of the target perform worse with this measure than when evaluated with other measures. |
| Contact(A&B) ( |
Methods performing sequence-independent superposition (first three) are relatively slow and are not used in current LiveBench experiments. Only one measure [Q(CA-atoms<3 Å)] penalizes for wrong parts of models. All methods, except the contact measure [Contact (A&B)], conduct rigid body superposition. The contact measure can handle the evaluation of multiple domains. GDT TS and MaxSub divide the score by the size of the target. Mammoth and LG-score estimate the probability of non-random structural similarity expressed as E-value. The scores of the others are proportional to the size of the model.
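The definition of Q(CA-atoms<3 Å) given in the table can be written out directly. The sketch below assumes that the CA-atoms<3 Å count has already been computed elsewhere; the function name and the example numbers are hypothetical.

```python
def q_ca_atoms(correct_ca: int, model_length: int) -> float:
    """Q(CA-atoms<3 Å): square of the CA-atoms<3 Å count divided by
    the number of residues in the model."""
    if model_length <= 0:
        raise ValueError("model must contain at least one residue")
    return correct_ca ** 2 / model_length


# Same number of correctly placed residues, different model lengths
# (hypothetical values): the longer model is penalized.
compact = q_ca_atoms(60, 80)
overpredicted = q_ca_atoms(60, 120)
```

With 60 correctly placed CA atoms, an 80-residue model scores 45.0 and a 120-residue model scores 30.0, which shows how the measure penalizes overprediction.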
Table 3. Comparison of selected servers participating in LiveBench-8
Only publicly available servers that provide a description of the underlying algorithm are listed. Results obtained in LiveBench-7 are also displayed if available. Results for the PROTINFO-CM server are not presented because of late predictions. Servers are colored blue (sequence only methods), red (hybrid methods) and black (structure meta predictors). The ‘Code’ column shows the code of the method as provided in Table 1. Results obtained using the 3D-score assessment measure are shown (see Table 2). The ‘Sum’ column prints the sum of scores obtained for correct models of difficult targets (no PSI-BLAST assignment with E-value below 0.001). The ‘FR’ column shows the number of correct models generated for difficult targets. The ‘All’ column shows the number of correct models generated for all targets including the easy ones. The ‘ROC’ (Receiver Operator Characteristic) value describes the specificity of the confidence scores reported by the methods. It corresponds to the average number of correct models that have a higher confidence score than the first, second … tenth false prediction. The ‘ROC%’ column prints the ‘ROC’ value divided by the total number of targets and multiplied by 100. Robetta is not listed here since it does not provide confidence scores. The ‘Score’ column reports the score of the third false positive prediction and the ‘3’ column presents the number of correct predictions with higher score than the third false one. The score can be used as an approximate value for the confidence threshold, below which false positive predictions become frequent. The ‘Lost’ column shows the number of missing predictions for each server. Servers that have more than a few missing predictions cannot be properly evaluated. Some servers that entered LiveBench in the eighth round have missing scores in LiveBench-7. The new sequence-only methods exhibit high specificity and can compete with structure meta predictors in this ranking. 
The structure meta predictors rank much higher in the sensitivity (FR) and model quality (Sum) based evaluation. In LiveBench-8 the meta–meta predictors, such as 3JC1 and 3JCa, which use results of other meta predictors, profit greatly from their parasitic nature. Servers that are not maintained over a long period of time become obsolete due to outdated fold libraries. As an example, FFAS now seems to perform similarly to PDBb, while it used to perform much better in the first rounds of LiveBench. High-quality servers are able to generate ∼50% more correct models than PDBb (‘All’ column).
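The ‘ROC’ and ‘ROC%’ values described above can be illustrated with a short sketch. It assumes each prediction is a (confidence, is_correct) pair, treats ties conservatively by ranking a false prediction ahead of a correct one with the same score, and, when a server has fewer than ten false predictions, pads the remaining cutoffs with the total number of correct models. The padding and tie rules are assumptions made for this example and are not taken from the LiveBench protocol.

```python
from typing import List, Tuple


def roc_value(predictions: List[Tuple[float, bool]],
              max_false: int = 10) -> float:
    """Average number of correct models scoring higher than the
    first, second, ... `max_false`-th false prediction."""
    # Highest confidence first; on ties, false predictions come first so a
    # correct model must score strictly higher to be counted.
    ranked = sorted(predictions, key=lambda p: (-p[0], p[1]))
    counts: List[int] = []
    correct = 0
    for _, is_correct in ranked:
        if is_correct:
            correct += 1
        else:
            counts.append(correct)
            if len(counts) == max_false:
                break
    # Assumption: pad missing cutoffs with the total number of correct models.
    counts.extend([correct] * (max_false - len(counts)))
    return sum(counts) / max_false


def roc_percent(roc: float, n_targets: int) -> float:
    """'ROC' divided by the total number of targets, multiplied by 100."""
    return roc / n_targets * 100


# Hypothetical confidence scores for five targets.
preds = [(9.0, True), (8.0, True), (7.0, False), (6.0, True), (5.0, False)]
roc = roc_value(preds, max_false=2)
```

In this toy set two correct models outrank the first false prediction and three outrank the second, so the value averaged over two cutoffs is 2.5.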
Table 4. Comparison of rankings obtained using different evaluation measures
Only publicly available servers participating in LiveBench-8 that provide a description of the underlying algorithm are listed. Results obtained in LiveBench-7 are also displayed if available. Servers are colored blue (sequence only methods), red (hybrid methods) and black (structure meta predictors). The ‘Code’ column shows the code of the method as provided in Table 1. Rankings of servers obtained using five different assessment measures (see Table 2): 3D-score, MaxSub, CA-atoms<3 Å, Q(CA-atoms<3 Å), Contact(A) and Contact(B) are shown in columns ‘3D’, ‘MS’, ‘CA’, ‘Q’, ‘CA’ and ‘CB’, respectively. The ‘Avg’ column prints the average ranking of the server.