Diversity of protein structures and difficulties in fold recognition: the curious case of protein G.

Abstract

We examine the ability of current state-of-the-art methods in protein structure prediction to discriminate topologically distant folds encoded by highly similar (>90% sequence identity) designed proteins in blind protein structure prediction experiments. We detail the corresponding prognosis for the protein fold recognition field and highlight the features of the methodologies that successfully deciphered this folding riddle.

Year: 2009 PMID: 20209018 PMCID: PMC2832337 DOI: 10.3410/B1-69

Source DB: PubMed Journal: F1000 Biol Rep ISSN: 1757-594X

Introduction and context

Natural proteins with over 35% sequence identity tend to fold into similar conformations [1], yet several evolutionarily related natural protein pairs with up to 40% identity have been observed to adopt substantially different topologies [2,3]. Two sequences of the same length that differ at only three residues were posted as sequential targets in the recent 8th Community-Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP8). Targets T0498 and T0499 therefore posed a riddle for the international protein fold prediction community: do these 95% identical sequences maintain the same topological fold or adopt different ones? The proteins were engineered in the group of John Orban and Philip Bryan [4] to study how much sequence identity the 3-α and α/β folds of streptococcal protein G domain A (GA) and domain B (GB), respectively, can tolerate while remaining distinct. The two protein G domains, which share only 16% sequence identity, were brought together in sequence space first by adding terminal tails to GA to make it equal in length to GB and then by progressively mutating sites of nonidentity. The key to this approach was linking each fold to its natural function: human serum albumin binding for the 3-α GA fold and IgG binding for the α/β GB fold. This linkage of fold to function allowed the application of powerful biologic selection methods to determine clusters of sites in each protein that could be substituted with the corresponding amino acid in the other protein. Iteratively combining mutations identified by the selection methods resulted in two 88% identical proteins [4]. More recently, two 95% identical sequences possessing different folds, GA95 and GB95, were designed and provided as the two CASP8 targets discussed here [5]. The designed protein pairs maintain the fold and specific binding function of the proteins from which they were derived, with no measurable structural or functional character of the domain represented in the alternate protein [4].

A prerequisite of recognizing a fold is prior observation of the fold. Structural genomics consortia contribute thousands of new protein structures each year, yet previously unobserved folds are seldom found [6]. This pattern seems to indicate that the majority of folds that can be detected by current laboratory techniques have already been observed. The completeness of the structural fold space has been addressed using a subset of 1,489 proteins covering the Protein Data Bank [7] at the level of 35% sequence identity; all but two folds can be resolved using templates found within the same set [8]. Thus, template-based modeling appears to be feasible given the best template(s) within the set. The search for the best template for a given query protein is known as ‘fold recognition’.
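The 95% figure quoted for the CASP8 targets follows from simple arithmetic: three nonidentical residues in a 56-residue protein leave 53 of 56 positions identical. The sketch below illustrates that calculation for two equal-length, ungapped sequences. The function names are hypothetical, and the placeholder sequences and substituted positions are invented for illustration; they are not the real GA95 and GB95 sequences.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of positions carrying the same residue (equal length, no gaps)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("this ungapped comparison needs equal-length sequences")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def differing_positions(seq_a: str, seq_b: str) -> list:
    """1-based positions at which the two sequences differ."""
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b), start=1) if a != b]


# Placeholder 56-residue sequences (assumption: NOT the real targets).
base = "A" * 56
residues = list(base)
for pos in (5, 25, 50):          # arbitrary illustrative positions
    residues[pos - 1] = "G"
variant = "".join(residues)

print(round(percent_identity(base, variant), 1))   # 94.6, i.e. roughly 95%
print(differing_positions(base, variant))          # [5, 25, 50]
```

Real sequence comparisons normally require an alignment step first; it is skipped here only because the two targets are described as having the same length.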

Major recent advances

The best performing freely available fold recognition web server methods are maintained by Yang Zhang [9] within the local meta-threading server (LOMETS) fold recognition pipeline of I-TASSER (iterative threading assembly refinement algorithm), the best performing protein structure prediction server in the past two CASP experiments. As an isolated meta-threading server, LOMETS runs its component methods locally, avoiding the dependence on remote internet services that undermines so many meta-servers [10]. The nine methods of LOMETS are representative of the fold recognition field (normally targeted toward naturally occurring proteins) and can be summarized as various combinations of the following: comparison of target sequence profiles with those of known structures, secondary structure preferences, environmental fitness, pairwise contact probabilities, structure profiles, simulated mutations, single-body or residue-specific knowledge-based potentials, and profile hidden Markov models (HMMs) [10]. Most web server groups predicted both T0498 and T0499 to adopt the α/β fold of protein GB (Figure 1a). For example, our own predictions for T0498 did not significantly resemble the target structure [37.2 global distance test total score (GDT-TS); Figure 1b, left], whereas all five of our predictions for T0499 ranked within the top 10 predictions overall (88.4 GDT-TS; Figure 1b, right). The models for T0499 exemplify progress in another major challenge in protein structure prediction: refinement of model quality from the best template [11].
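To make the GDT-TS values quoted above more concrete, the following sketch computes a simplified version of the score: the average, over distance cutoffs of 1, 2, 4, and 8 Å, of the percentage of alpha carbons lying within each cutoff of their experimental positions. It assumes the model has already been superposed onto the experimental structure; the full procedure searches over many superpositions for each cutoff, so this sketch can only underestimate the true score. The function name and the toy coordinates are hypothetical.

```python
import math


def gdt_ts(model_ca, reference_ca, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT-TS for pre-superposed alpha-carbon coordinates.

    model_ca and reference_ca are equal-length lists of (x, y, z) tuples in
    angstroms, paired residue by residue.
    """
    if len(model_ca) != len(reference_ca) or not model_ca:
        raise ValueError("need two non-empty, equal-length coordinate lists")
    distances = [math.dist(m, r) for m, r in zip(model_ca, reference_ca)]
    # Fraction of residues within each cutoff, then the mean across cutoffs.
    fractions = [sum(d <= c for d in distances) / len(distances) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Toy four-residue example (invented coordinates, for illustration only).
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
model = [(0.0, 0.5, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 9.0, 0.0)]

print(gdt_ts(model, reference))   # 56.25
```

In the toy case the four residues deviate by 0.5, 1.5, 3.0, and 9.0 Å, so 25%, 50%, 75%, and 75% of residues fall within the four cutoffs, giving a mean of 56.25.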

Figure 1.

Difficulties in fold recognition for the redesigned streptococcal protein domains GA95 versus GB95

(a) Only four out of over 150 contributing groups recognized the difference in fold caused by three nonidentical residues in the 56 residue proteins: HHpred (cyan), Feig (black), FOLDpro (blue), and Coma (others are in orange). The results are shown in global distance test (GDT) plot format, in which the alpha carbon atoms of the predicted model and experimental structure are spatially aligned within distance cutoffs of 0.5 Å, 1 Å, and 1.5 Å up to 10 Å, such that lower lines denote higher accuracy. A common trend of these four groups was to predict the alternate fold as a lower confidence model. Most groups correctly identified the GB95 T0499 fold, yet most models were no better than random for GA95 T0498. The ability of four automated servers to disentangle this riddle provides a positive outlook for the fold recognition field. (b) Predictions made by our group (purple) for T0498 and T0499 compared with the experimental structures for GA95 and GB95 (cyan), respectively. While our predictions were among the very best for T0499/GB95 (the only group with all five submissions in the top 10), the incorrect fold assignment led to highly inaccurate predictions for T0498/GA95. As with so many other protein structure prediction groups, we failed to predict that profoundly similar sequences would produce different folds. CA, alpha carbon; GA95, the artificial protein with GA fold and 95% sequence identity to GB95; GB95, the artificial protein with GB fold and 95% sequence identity to GA95.

The side chain interactions visible in the experimental structures of GA95 and GB95, as well as the simulated mutant models depicted in Figure 2, demonstrate interactions within a relatively stable, folded state, which are not necessarily illustrative of the interactions occurring during the folding process. Even when the structures are known, it is difficult to ascertain exactly what makes the two proteins follow different folding trajectories.
Yet fold recognition methods do not simulate folding. Rather, they rely on calculated interactions within simulated mutants of these folded structures to test the accuracy of fit for a possible template; thus, even with a perfect energy function, mistakes in fold recognition could occur.

Figure 2.

Putative causes of CASP8 fold recognition failure and success for redesigned streptococcal protein GA95 versus GB95

A sequence-to-structure cross of GA95 and GB95 is presented to demonstrate determinants of fold recognition from side chain packing of the nonidentical residues (red). The lack of profound steric clashes created by applying the side chain identities from T0498 to the structure of GB95 (left) misleads predictors into identifying an incorrect fold topology. Conversely, the clash that occurs between F30 and A20 when applying the side chain identities from T0499 to the structure of GA95 (right) signals an incorrect fold to predictors. GA95, the artificial protein with GA fold and 95% sequence identity to GB95; GB95, the artificial protein with GB fold and 95% sequence identity to GA95.

In this case, a multitude of experimentally derived structures for GA and GB and detectable sequence similarity within this group reasonably limit the fold search to these topologies. Crossing fold assignments for GA95 and GB95 enables interrogation of side chain packing for the three nonidentical residues (Figure 2). The clash occurring between F30 and A20 when the nonidentities from T0499 are applied to the structure of GA95 (Figure 2, right) signals an incorrect fold to predictors. Conversely, minimal steric clashes emerge when the T0498 sequence is applied to the structure of GB95 (Figure 2, left). This absence of incriminating evidence for the pairing of the T0498 sequence with the GB95 fold could mislead predictors into selecting this fold topology. Out of over 150 contributing teams, four groups recognized the difference in fold caused by three nonidentical residues in the 56 amino acid proteins: HHpred, FOLDpro, Feig, and Coma. The accurate predictions of these groups demonstrate sensitivity to subtle changes affecting folding not previously demonstrated in a bona fide blind prediction scenario.
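The kind of steric check described above can be sketched as a pairwise distance test: two atoms are flagged as clashing when they sit closer than the sum of their van der Waals radii minus a tolerance. This is a minimal illustration, not the procedure used by any CASP8 group. The radii are commonly quoted approximate values, the 0.4 Å tolerance is an assumption, and the atom coordinates are invented rather than taken from the GA95 or GB95 structures.

```python
import math
from itertools import product

# Approximate van der Waals radii in angstroms (commonly quoted values).
VDW_RADII = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}


def find_clashes(atoms_a, atoms_b, tolerance=0.4):
    """Return (atom_a, atom_b, distance) for atom pairs closer than allowed.

    Each atom is a (name, element, (x, y, z)) tuple. The tolerance is a
    hypothetical allowance for permissible overlap.
    """
    clashes = []
    for (name_a, elem_a, xyz_a), (name_b, elem_b, xyz_b) in product(atoms_a, atoms_b):
        limit = VDW_RADII[elem_a] + VDW_RADII[elem_b] - tolerance
        distance = math.dist(xyz_a, xyz_b)
        if distance < limit:
            clashes.append((name_a, name_b, round(distance, 2)))
    return clashes


# Invented coordinates standing in for a phenylalanine ring and an alanine CB.
phe_atoms = [("CZ", "C", (0.0, 0.0, 0.0)), ("CE1", "C", (1.4, 0.0, 0.0))]
ala_atoms = [("CB", "C", (-1.0, 2.4, 0.0))]

for atom_a, atom_b, distance in find_clashes(phe_atoms, ala_atoms):
    print(f"{atom_a} clashes with {atom_b} at {distance} A")
```

With these toy coordinates only the CZ and CB atoms fall inside the 3.0 Å limit for a carbon pair. An absence of flagged pairs, as for the T0498 sequence on the GB95 structure, gives a predictor no reason to reject the template.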

HHpred

The Söding group uses HMM emission sequences to evaluate target template matches. The emission sequence of HMMs includes position-specific insertion and deletion probabilities along with the sequence distributions found in multiple sequence alignment profiles. HHsearch specifically includes secondary structure via a substitution matrix derived from comparing the secondary structure measured on the template with that predicted for the target, taking the confidence of the prediction into account. To interrogate alignments, the HHsearch method maximizes the coemission log-odds probability for the pair of HMMs derived for a given protein pair. HHsearch directs the structural similarity search hierarchically by searching databases of alignments organized by fold family rather than lists of disconnected sequences [12]. The CSI-BLAST (context-specific iterative basic local alignment search tool) sequence similarity search method recently published by the group was likely used to build the profile input to the HMMs for each alignment [13].
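As a rough illustration of what a coemission log-odds score measures, the sketch below scores one pair of profile columns by asking how much more likely the two columns are to emit the same residue than chance would predict. It assumes the log-sum-of-odds form, log2 of the sum over residues of q(a)·p(a)/f(a), and leaves out the transition probabilities, the secondary structure term, and the alignment search that a full HMM-HMM comparison needs. The two-letter alphabet, background frequencies, and column values are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
import math


def column_score(q_col, p_col, background):
    """Log-odds (bits) that two profile columns emit the same residue.

    q_col and p_col map residue -> emission probability; background maps
    residue -> background frequency. Positive scores suggest related columns.
    """
    odds_sum = sum(q_col[a] * p_col[a] / background[a] for a in background)
    return math.log2(odds_sum)


# Toy two-letter alphabet (assumption: real profiles cover 20 amino acids).
background = {"A": 0.5, "G": 0.5}
mostly_a = {"A": 0.9, "G": 0.1}
mostly_g = {"A": 0.1, "G": 0.9}
uninformative = {"A": 0.5, "G": 0.5}

print(column_score(mostly_a, mostly_a, background) > 0)   # similar columns score positive
print(column_score(mostly_a, mostly_g, background) < 0)   # dissimilar columns score negative
print(abs(column_score(mostly_a, uninformative, background)) < 1e-9)  # no information, near zero
```

An alignment score would sum such column scores along an alignment path together with transition terms, and the best-scoring path would be found by dynamic programming.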

FOLDpro

The Cheng group uses a supervised classification approach previously used for fold classification, invoking support vector machines to combine global profile-profile alignment, secondary structure, solvent accessibility, contact map, and strand hydrogen bond pairing [14].

Feig

The Feig group used a very typical set of methods, including fold recognition functions overlapping those in LOMETS (including HHsearch/HHpred), standard model construction, and a modified cluster calculation using a standard discriminatory potential function [15]. Other promising work by this group in the refinement category includes the use of an implicit continuum dielectric solvent based on generalized Born theory to drive lattice-based coarse-grained searches, Monte Carlo molecular dynamics, and restrained normal mode sampling [16].

Coma

The Venclovas group invokes a profile comparison method for detection of distant evolutionary relationships across profile databases, adding a modified two-level SEG (segment sequences by local complexity) algorithm to filter noninformative profile regions, variable gap penalties, and adaptive parameterization. The underlying sequence similarity search is driven by their PSI-BLAST-ISS (position-specific iterative BLAST intermediate sequence search), which evaluates and refines output profile alignments [17]. The manual submissions by this group displayed the overall best performance in CASP8.
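The idea of filtering noninformative regions by local complexity can be sketched with a sliding-window Shannon entropy: windows whose residue composition is too uniform are masked. This is a simplified, single-level stand-in and not the modified two-level SEG algorithm used by the group. The window length and threshold loosely follow commonly quoted SEG defaults but should be treated as assumptions, and the example sequence is invented.

```python
import math
from collections import Counter


def window_entropy(window: str) -> float:
    """Shannon entropy (bits) of the residue composition of a window."""
    total = len(window)
    counts = Counter(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(sequence: str, window: int = 12, threshold: float = 2.2) -> str:
    """Replace residues in low-entropy windows with 'x' (hypothetical helper)."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if window_entropy(sequence[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = "x"
    return "".join(masked)


# Invented sequence with a serine-rich low-complexity stretch in the middle.
example = "MKTAYIAKQRQISFVKSHFSRQ" + "S" * 12 + "LEERLGLIEVQAPILSRVGDGT"
print(mask_low_complexity(example))
```

Windows that only partly overlap the repeat can also fall below the threshold, so a few flanking residues are masked along with the serine stretch. A two-level scheme would typically refine such boundaries with a second threshold.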

Future directions

A handful of the automated algorithms were able to recognize the fold switch caused by the three nonidentical residues of GA95 and GB95 (Figure 2). However, the experimentally unobserved 60% of naturally occurring proteins [6] and the prospect of designing new folds heralded by Top7 [18] demand more methods sensitive enough to detect subtle triggers in fold switching and predict previously unobserved topologies. Developments in the protein fold prediction field can often be limited to incremental engineering optimizations. In this fold recognition problem, the proper application of support vector machines and HMM methods enabled success for two groups. Also, two groups created their own improvements on PSI-BLAST [19]: CSI-BLAST [13] and PSI-BLAST-ISS [17], which both enhance quality and relevance of a search by interrogating low-quality regions in the alignment by context and together comprise the first significant improvements on the enormously popular algorithm in a decade. The novel algorithmic adjustments in fold recognition used in CASP8 demonstrate significant progress amounting to new tools for the field. Future developments are anticipated to include the steady stream of mathematical enhancements observed since the inception of the protein structure prediction field but also include new conceptual paradigms such as functional signatures [20] and the use of template-free modeling [21] to drive the difficult fold recognition problems.

21 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01

2. The protein structure prediction problem could be solved using the current PDB library.

Authors: Yang Zhang; Jeffrey Skolnick
Journal: Proc Natl Acad Sci U S A Date: 2005-01-14

3. Toward high-resolution de novo structure prediction for small proteins.

Authors: Philip Bradley; Kira M S Misura; David Baker
Journal: Science Date: 2005-09-16

4. A correlation-based method for the enhancement of scoring functions on funnel-shaped energy landscapes.

Authors: Andrew W Stumpff-Kane; Michael Feig
Journal: Proteins Date: 2006-04-01

5. A machine learning information retrieval approach to protein fold recognition.

Authors: Jianlin Cheng; Pierre Baldi
Journal: Bioinformatics Date: 2006-03-17

6. Transitive homology-guided structural studies lead to discovery of Cro proteins with 40% sequence identity but different folds.

Authors: Christian G Roessler; Branwen M Hall; William J Anderson; Wendy M Ingram; Sue A Roberts; William R Montfort; Matthew H J Cordes
Journal: Proc Natl Acad Sci U S A Date: 2008-01-28

7. Sequence context-specific profiles for homology searching.

Authors: A Biegert; J Söding
Journal: Proc Natl Acad Sci U S A Date: 2009-02-20

8. A novel method for predicting and using distance constraints of high accuracy for refining protein structure prediction.

Authors: Tianyun Liu; Jeremy A Horst; Ram Samudrala
Journal: Proteins Date: 2009-10