
Automated side-chain model building and sequence assignment by template matching.

Abstract

An algorithm is described for automated building of side chains in an electron-density map once a main-chain model is built and for alignment of the protein sequence to the map. The procedure is based on a comparison of electron density at the expected side-chain positions with electron-density templates. The templates are constructed from average amino-acid side-chain densities in 574 refined protein structures. For each contiguous segment of main chain, a matrix with entries corresponding to an estimate of the probability that each of the 20 amino acids is located at each position of the main-chain model is obtained. The probability that this segment corresponds to each possible alignment with the sequence of the protein is estimated using a Bayesian approach and high-confidence matches are kept. Once side-chain identities are determined, the most probable rotamer for each side chain is built into the model. The automated procedure has been implemented in the RESOLVE software. Combined with automated main-chain model building, the procedure produces a preliminary model suitable for refinement and extension by an experienced crystallographer.

Substances: Peptide Fragments

Year: 2002 PMID: 12499538 PMCID: PMC2745879 DOI: 10.1107/s0907444902018048

Source DB: PubMed Journal: Acta Crystallogr D Biol Crystallogr ISSN: 0907-4449

Introduction

Building side chains of an atomic model into an electron-density map is a very different problem from building the main chain. Normally the main chain is built first, so the approximate location of the Cβ atom of the side chain and the approximate direction of the Cα–Cβ and Cα–C bonds are already known by the time side chains are built. On the other hand, the identities of the side chains at each position in the map are normally not known at this stage, and at moderate resolution (∼2–3 Å) in a map of moderate quality it can be very difficult to distinguish many of the side chains. Consequently, the bulk of the problem is not placing the side chains but rather identifying them. A number of methods have been developed to address the problem of identifying and fitting side-chain density. Most of the methods use the electron density at the coordinates of atoms of a side-chain model to evaluate the model. Jones et al. (1991) used a rotamer library (Ponder & Richards, 1987) to assist in manual fitting of side chains. Oldfield (2002) and Levitt (2001) fit rotamer libraries to side-chain density, considering density at the positions of the atomic coordinates. In contrast, Morris et al. (2002) directly build a side-chain model from the coordinates of free atoms representing peaks in the density. In another very different approach, Holton et al. (2000) used machine-learning techniques to identify side chains in a map using methods that are rotation-independent, so that they require Cα positions but not the directions of the Cα–Cβ or Cα–C bonds; the fitting of side chains to the density is then carried out by maximizing the correlation of density with that from a side-chain template. Pavelcik et al. (2002) described a method for template matching of arbitrary fragments of structure to a map by a rotation/translation search that could be used to identify side chains without knowledge of the main-chain coordinates.
Here, we describe a method for side-chain identification that uses the correlation of side-chain rotamer templates with density in the map to evaluate the fit of a side chain to the map. The side-chain templates are built from average density in refined models so that they reflect the patterns of density found in real structures. A Bayesian approach is used to estimate the probability that each side chain is present at each site. These probabilities are then used to align the protein sequence to the main-chain tracing of the map.

Methods

Side-chain rotamer library

Libraries of side-chain rotamers have been constructed numerous times (Ponder & Richards, 1987; Dunbrack & Cohen, 1997; Lovell et al., 2000) and it would be possible to use one of these as the basis for modeling side chains. However, for the present purposes it was necessary to have a paired rotamer library and corresponding averaged template density library, so it was most convenient to construct both libraries at once from the same database. A library of side-chain rotamers was constructed using 574 refined protein structures chosen arbitrarily from non-redundant PDB files (Berman et al., 2000; Hobohm et al., 1993) with R factors of 20% or lower and a resolution of 1.8 Å or better. In order to limit the total number of rotamers for all amino acids, the maximum number of rotamers considered for any one amino acid was 40. For each amino-acid position in these refined protein structures, the coordinates of the main-chain N, Cα and C atoms were used to place the side chain in a standard orientation. Then, for each amino-acid type, all conformations of the amino-acid side chain in this group of structures were listed. A library was generated consisting of the smallest subset of these conformations that could be found such that every conformation in the list differed from a member of the library by at most 0.8 Å r.m.s. For several amino acids (methionine, lysine, glutamine, glutamate, asparagine and tyrosine) this was not possible with just 40 rotamers in the library and in these cases some configurations are not represented. Additionally, for arginine the Nη1 and Nη2 atoms were not included in the r.m.s.d. calculation for defining the libraries; even so, more than 40 rotamers would have been required to represent all configurations found and the list was truncated at 40 rotamers. A total of 503 side-chain rotamers were present in the entire library of side chains.
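The text does not say how the "smallest subset that could be found" was searched for. As an illustration only, the sketch below uses a greedy covering heuristic, which is an assumption and not necessarily the procedure used in RESOLVE. The names `rmsd` and `build_library` are hypothetical, and conformations are assumed to be already superimposed in the standard orientation.

```python
import math

def rmsd(a, b):
    """R.m.s. distance between two conformations given as lists of (x, y, z)."""
    sq = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(sq / len(a))

def build_library(conformations, cutoff=0.8, max_rotamers=40):
    """Greedy cover (an assumed heuristic): repeatedly pick the conformation
    that represents the most still-uncovered conformations within `cutoff` Å,
    stopping once everything is covered or `max_rotamers` is reached.
    Returns (indices of library members, indices left unrepresented)."""
    uncovered = set(range(len(conformations)))
    library = []
    while uncovered and len(library) < max_rotamers:
        best, best_cov = None, set()
        for i in range(len(conformations)):
            cov = {j for j in uncovered
                   if rmsd(conformations[i], conformations[j]) <= cutoff}
            if len(cov) > len(best_cov):
                best, best_cov = i, cov
        library.append(best)
        uncovered -= best_cov
    return library, uncovered

# Toy one-atom "side chains": two tight clusters about 5 Å apart.
confs = [[(0.0, 0.0, 0.0)], [(0.1, 0.0, 0.0)], [(5.0, 0.0, 0.0)], [(5.2, 0.0, 0.0)]]
library, unrepresented = build_library(confs)
```

With a cap of 40 rotamers the second return value is non-empty exactly in the situation the text describes for methionine, lysine and the other long side chains, where some configurations are left unrepresented.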

Side-chain rotamer electron-density templates

The library of side-chain rotamers was used to cluster all conformations of each side chain from the 574 refined protein structures into rotamer groups. These groups contained from one to 18 543 conformations. For each rotamer group, a template was then constructed from the average electron density for the entire group, calculated using the coordinates and thermal factors from the refined structures (mapped into the standard orientation). In this way, the electron-density templates reflected both the variability in side-chain conformation within a rotamer group and the pattern of thermal factors from atom to atom in a rotamer. The electron-density templates were sampled on a grid with a spacing of 1 Å. All points within 3 Å of an atom in a side chain in one or more of the conformations present in the corresponding rotamer group were contained in the templates and all other points excluded. The templates for different amino acids and for different rotamers were therefore sampled at partially overlapping sets of lattice points. The region defining the template for glycine is not well defined by these criteria and as a special case the template for glycine was calculated in the same region as the template for alanine.
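A toy version of the template construction may make the sampling rule concrete. It assumes each atom contributes an isotropic Gaussian whose variance is B/(8π²) (thermal smearing only, ignoring scattering factors and resolution), which is a simplification introduced here and not the density calculation used by the authors. `build_template` is a hypothetical name.

```python
import math
from itertools import product

def build_template(conformations, spacing=1.0, radius=3.0):
    """conformations: list of conformations, each a list of (x, y, z, B) atoms
    already placed in the standard orientation. Returns {grid point: density}
    for grid points within `radius` Å of any atom in any conformation."""
    atoms = [atom for conf in conformations for atom in conf]
    lo = [math.floor(min(a[d] for a in atoms) - radius) for d in range(3)]
    hi = [math.ceil(max(a[d] for a in atoms) + radius) for d in range(3)]
    axes = [
        [lo[d] + spacing * n for n in range(int((hi[d] - lo[d]) / spacing) + 1)]
        for d in range(3)
    ]
    template = {}
    for point in product(*axes):
        if not any(math.dist(point, a[:3]) <= radius for a in atoms):
            continue  # point lies outside the template region
        total = 0.0
        for x, y, z, b in atoms:
            var = b / (8 * math.pi ** 2)  # assumed Gaussian width from B factor
            r2 = (point[0] - x) ** 2 + (point[1] - y) ** 2 + (point[2] - z) ** 2
            total += math.exp(-r2 / (2 * var)) / (2 * math.pi * var) ** 1.5
        template[point] = total / len(conformations)  # average over the group
    return template

# One conformation containing a single atom at the origin with B = 20 Å^2.
tmpl = build_template([[(0.0, 0.0, 0.0, 20.0)]])
```

Because the mask depends on the atoms present in each rotamer group, two templates built this way generally cover different but overlapping sets of grid points, as noted in the text.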

Estimation of the side-chain probability

The relative probability that each of the 20 amino-acid side chains was located at a side-chain position in a polypeptide chain was estimated in several steps. The overall strategy at each position in the chain was to find the rotamer of each side chain that fitted the density best, then to use the fits of these 20 side chains to the density to obtain the probability for each possible side chain at that position. Firstly, the correlation coefficient $cc_{jk}$ of each side-chain rotamer density template j with the electron density at each side-chain position k in the polypeptide chain was determined. This was accomplished after transforming the density in the map to the standard reference frame defined by the main-chain N, Cα and C positions. For each side-chain type, only the best-fitting rotamer was considered further. Next, a Z score was calculated for the fit of each side-chain template to each side-chain position. The Z score was based on the correlation $cc_{jk}$ for the fit and the correlations for the fits of this template j to all other side-chain positions. The Z score calculated in this way describes how likely it is that the value of the correlation $cc_{jk}$ would be obtained by chance. This was used to estimate the probability of measuring a value $cc_{jk}$ for a template that is incorrect. We further assumed that the correct template can have any value of the correlation. This is a useful assumption because although we expect that a correct template will have a high value of the correlation, it is difficult to specify how high this value should be. Using these assumptions, the mean correlation for the template j to all side-chain positions, $\langle cc_j \rangle$, was used as an estimate of the correlation to be expected for this side chain to arbitrary side-chain density (i.e. generally not associated with this rotamer). Similarly, the standard deviation of this correlation, $\sigma_j$, was used as an estimate of the variation of this correlation to be expected for arbitrary side chains.
Then the Z score, $Z_{jk} = (cc_{jk} - \langle cc_j \rangle)/\sigma_j$, was expected to be related to the probability of obtaining this value of the correlation by chance (that is, if side chain j is not the correct side chain at position k) using the relation $p(cc_{jk}) \simeq \exp(-Z_{jk}^2/2)$. In order to prevent very poor fits from being confused with very good ones, $p(cc_{jk})$ was taken to be unity for values of $Z_{jk} < 0$ in this calculation. Finally, the probability that amino-acid type i is the correct one at position k was calculated from Bayes' rule. The a priori probability for each amino acid j, $p_{0j}$, was estimated from the number of residues of each amino-acid type j in the protein, $n_j$. The probability of observing the set of correlations $\{cc_{jk}\}$ at position k if the correct amino acid at this position is i is the product over all of the other amino-acid types j (not including the correct one, i) of the probabilities of observing the correlations $cc_{jk}$ by chance,

$$p(\{cc_{jk}\} \mid i) = \prod_{j \neq i} \exp(-Z_{jk}^2/2).$$

Using Bayes' rule, we then estimate the probability, $p_{ik}$, that amino acid i is the correct one at position k, yielding, after some simplification,

$$p_{ik} = \frac{p_{0i} \exp(Z_{ik}^2/2)}{\sum_{l} p_{0l} \exp(Z_{lk}^2/2)}. \qquad (1)$$
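The Z-score and Bayes steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the RESOLVE implementation: the function name and data layout are hypothetical, the population standard deviation is assumed for the spread of each template's correlations, and priors are taken as residue counts normalized to sum to one.

```python
import math
from statistics import mean, pstdev

def side_chain_probabilities(cc, counts):
    """cc[j] is a list over positions k of the best-rotamer correlation of
    amino-acid type j with the density at k; counts[j] is the number of
    residues of type j in the protein. Returns p[j][k], the estimated
    probability that type j is the correct side chain at position k."""
    total = sum(counts[j] for j in cc)
    prior = {j: counts[j] / total for j in cc}
    z = {}
    for j, row in cc.items():
        mu, sigma = mean(row), pstdev(row)
        # Clipping Z at 0 is equivalent to taking p(cc) = 1 when Z < 0.
        z[j] = [max(0.0, (c - mu) / sigma) if sigma > 0 else 0.0 for c in row]
    n_pos = len(next(iter(cc.values())))
    p = {j: [0.0] * n_pos for j in cc}
    for k in range(n_pos):
        # Posterior weight: prior times exp(+Z^2/2), then normalize over types.
        weight = {j: prior[j] * math.exp(z[j][k] ** 2 / 2) for j in cc}
        norm = sum(weight.values())
        for j in cc:
            p[j][k] = weight[j] / norm
    return p

# Two-type toy example with three positions and equal priors.
cc = {"ALA": [0.9, 0.1, 0.1], "GLY": [0.1, 0.9, 0.1]}
probs = side_chain_probabilities(cc, {"ALA": 10, "GLY": 10})
```

In the toy input the alanine template stands out only at the first position, so its probability there rises above the prior, while at the third position neither template is distinctive and both types keep their prior of 0.5.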

Alignment of fragments of main-chain model to the protein sequence

A fragment of main-chain model containing n residues was matched to the protein sequence using the matrix of probability estimates $p_{ik}$ describing the probability that amino acid i is the correct type at position k in the model. As the main-chain model might sometimes be missing amino acids (commonly at loop positions) or might cross from one segment of chain to another incorrectly, a first step in the alignment was to identify the sub-fragment of the fragment which could be matched to the sequence with the highest probability. This was considered likely to be the longest segment that is contiguous in the protein chain. The match of this sub-fragment was identified and the remainder of the fragment was then considered independently as a separate fragment. To accomplish the identification of a contiguous sub-fragment and its match to the sequence, the alignment procedure described next was carried out n(n − 1)/2 times, once for each possible sub-fragment of the n-residue model fragment. Two different scoring algorithms were used in this process. For comparisons between sub-fragments of the same length, the probability of the alignment of each sub-fragment was used as the score. For comparisons between sub-fragments of different lengths, the probability estimates were not a good indicator of the relative qualities of the alignments and instead the score for each was $\langle Z \rangle N^{1/2}$, where N is the number of residues in the sub-fragment and $\langle Z \rangle$ is the mean Z score of the amino acids in this alignment. For a given sub-fragment or fragment with m residues, all possible alignments l of the model with m sequential amino acids in the protein were considered.
The relative probability $p_l$ that alignment l, with amino-acid type $t_{kl}$ matched to position k in the model, was correct was estimated assuming that the probability estimates for all the residues in the model were independent, leading to

$$p_l = \frac{\prod_{k} p_{t_{kl},k}}{\sum_{l'} \prod_{k} p_{t_{kl'},k}}. \qquad (2)$$

The sequence alignment was considered to be reliably identified if the probability for one match was at least 95% (that is, the combined probability of all other matches was 5% or less). Once a reliable sequence alignment was identified, the amino-acid sequence was mapped onto the model fragment. At each residue, the most probable rotamer corresponding to the amino acid assigned to that residue was built into the model.
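As a sketch of the alignment step, the code below scores every offset of an m-residue fragment along the sequence by the product of per-residue probabilities, normalizes over offsets, and applies the 95% acceptance threshold from the text. Function names and the dictionary layout are hypothetical, and the plain product is kept for readability; for long fragments a sum of logarithms would be less prone to underflow.

```python
import math

def alignment_probabilities(p, sequence):
    """p[k][aa]: estimated probability that amino acid `aa` is correct at model
    position k. Returns the relative probability of each sequence offset."""
    m = len(p)
    scores = []
    for offset in range(len(sequence) - m + 1):
        prod = 1.0
        for k in range(m):
            prod *= p[k][sequence[offset + k]]  # residues treated as independent
        scores.append(prod)
    total = sum(scores)
    return [s / total for s in scores]

def reliable_alignment(p, sequence, threshold=0.95):
    """Return the best offset if its relative probability reaches `threshold`,
    otherwise None (no reliable alignment)."""
    probs = alignment_probabilities(p, sequence)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else None

def weighted_z(z_scores):
    """Score <Z> * sqrt(N) for comparing sub-fragments of different lengths."""
    n = len(z_scores)
    return sum(z_scores) / n * math.sqrt(n)

# Three-residue fragment whose density favours the pattern A-G-A.
p = [{"A": 0.9, "G": 0.1}, {"A": 0.1, "G": 0.9}, {"A": 0.9, "G": 0.1}]
sequence = "GGAGAG"
rel = alignment_probabilities(p, sequence)
```

In this toy case the best offset (2) has a relative probability of about 0.90, so it would be rejected at the 95% threshold. This mirrors the conservative design in the text, where only high-confidence matches are kept.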

Results and discussion

The reliability of the probabilistic method described here for side-chain assignment and sequence alignment was examined using the density-modified versions of eight different experimental electron-density maps with varying resolution, quality (figure of merit) and number of residues in the asymmetric unit. The main chain was built into each map as described in Terwilliger (2003). The coordinates of the main-chain atoms were then used to place side chains. We first evaluated the utility of (1) by determining how well it actually predicts the probability that a particular side chain is present at a particular position in the model. Fig. 1 shows a histogram of the fraction of correct amino-acid side-chain assignments as a function of the probability assessed using (1). A total of 4349 side-chain densities from the eight experimental electron-density maps were compared with 20 side-chain templates to generate the histogram. The correct side chain at each position was identified from the refined model of the corresponding protein. Overall, Fig. 1 shows that the probability estimates obtained from (1) give a very good indication of the actual probability that the assignments are correct for these test cases.

Figure 1

Fraction of correct amino-acid side-chain assignments as a function of the probability estimated from (1). For each residue in the main-chain models for the eight structures listed in Table 1, the relative probabilities for each of the possible side chains were obtained using (1). The correct side chains were identified as the nearest amino acid in the refined model of each structure. The fraction of correct amino-acid side-chain assignments is tabulated as a function of the probability estimates.
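The tabulation behind a histogram of this kind can be sketched as a simple calibration check: bin the estimated probabilities and report the fraction of assignments in each bin that were actually correct. The function name, the choice of ten equal-width bins and the toy inputs are assumptions made for illustration.

```python
def calibration(estimates, correct, n_bins=10):
    """Group (estimated probability, was-it-correct) pairs into equal-width
    probability bins and return a list of (count, fraction correct) per bin.
    The fraction is None for empty bins."""
    counts = [0] * n_bins
    hits = [0] * n_bins
    for prob, ok in zip(estimates, correct):
        b = min(int(prob * n_bins), n_bins - 1)  # prob == 1.0 goes in last bin
        counts[b] += 1
        hits[b] += bool(ok)
    return [(c, h / c if c else None) for c, h in zip(counts, hits)]

# Hypothetical inputs, not data from the paper.
table = calibration([0.05, 0.95, 0.95, 1.0], [False, True, False, True])
```

If the probability estimates are well calibrated, the fraction correct in each bin should be close to the probability range that the bin covers.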

We next evaluated how well the sequence-alignment probabilities estimated with (2) relate to the actual probabilities. Fig. 2 shows a histogram of the fraction of correct sequence alignments as a function of the probabilities of correct assignments estimated using (2). Assignments were considered correct if 90% or more of the residues in the alignment match the closest residue in the refined structure. A total of 85 very strong predictions with confidence >97% were obtained and all but one of these were correct assignments. Overall, the probability estimates for sequence alignment using (2) are fairly accurate, though there is less of a clear discrimination among alignments with moderate probability than in the case of the side-chain assignments.

Figure 2

Fraction of correct fragment alignments as a function of the probabilities estimated from (2). For each main-chain fragment built, the sub-fragment with the highest weighted Z score was identified as described in §2. All alignments of this sub-fragment with the protein sequence were considered and the relative probabilities of each alignment were estimated with (2). An alignment was considered correct if the residue numbers of 90% of the residues in the fragment matched those of the nearest amino acid in the refined model.

The overall side-chain modeling results are summarized in Table 1. The side-chain modeling procedure identified and aligned the protein sequence to 71% of the 5131 residues in eight proteins. These eight proteins included two that were used in the development of the method (NDP kinase and gene 5 protein). They also included two that were among those in the database of proteins used to construct side-chain templates (gene 5 protein, PDB code 1vqb; β-catenin, PDB code 1dow). These three may therefore be slightly better fitted than a typical protein. The remaining five are likely to be relatively good indicators of the utility of the method. For the set of eight proteins as a whole, 99% of the sequence assignments were correct. The mean difference in coordinates between the side-chain atoms of the model and those of the corresponding refined structures was 1.3 Å and the r.m.s. difference was 1.8 Å, including lysine and arginine residues, where the positions of atoms in even the refined structures are often somewhat uncertain, but excluding atoms more than 10 Å away from any atom in the refined structures.

Table 1

Test structures for which side-chain models have been built with RESOLVE

Structure	Resolution (Å)	Figure of merit m	Residues in refined model	Main chain built (%)	Side chains built (%)	Correct alignment (%)	Side-chain mean coordinate error (Å)	Side-chain r.m.s. coordinate error (Å)
Gene 5 protein (Skinner et al., 1994)	2.6	0.62	87	61	11	100	1.2	1.4
Granulocyte-stimulating factor (Rozwarski et al., 1996)	3.5	0.70	242	50	0	N/A	0	0
Initiation factor 5A (Peat et al., 1998)	2.1	0.85	136	84	84	99	1.3	1.8
β-Catenin (Huber et al., 1997)	2.7	0.72	455	81	62	100	1.2	1.7
NDP kinase (Pédelacq et al., 2002)	2.6	0.56	556 (3 × 186)	56	37	98	1.2	1.6
Hypothetical (P. aerophilum ORF, NCBI accession No. AAL64711; Fitz-Gibbon et al., 2002)	2.6	0.58	494 (2 × 247)	79	75	98	1.3	2.0
Red fluorescent protein (Yarbrough et al., 2001)	2.5	0.91	936 (4 × 234)	88	88	99	1.2	1.8
2-Aminoethylphosphonate (AEP) transaminase (Chen et al., 2000)	2.6	0.84	2232 (6 × 372)	85	81	99	1.3	1.8

To test the resolution-dependence of side-chain model building, the IF5A structure was built at a variety of resolutions, truncating the data in each case. This of course does not fully simulate the ability to build a model for a poorly diffracting crystal, as the data and phases are very good to the resolution cutoff in this test. Nevertheless, it can give an idea of what is possible with very good data. Fig. 3(a) shows the number of main-chain residues and side chains built as a function of the high-resolution cutoff. Fig. 3(b) shows the r.m.s. coordinate error in main-chain and side-chain atoms for the same models. Fig. 3 illustrates that in the presence of very good data, as much as 75% of the main chain and 50% of the side chains can be built at a resolution as low as 3.4 Å and with an r.m.s. coordinate error that is only slightly higher than at a resolution of 2.1 Å.

Figure 3

Effect of resolution on model building of IF5A. The phases and amplitudes for IF5A (Peat et al., 1998) after density modification were truncated at varying resolutions and the resulting maps were used for automated main-chain and side-chain model building. (a) Percentage of the main chain (closed circles) and the side chains (open circles) in the refined structure that were built. (b) R.m.s. coordinate error for main-chain (closed circles) and side-chain (open circles) atoms. Side-chain atoms include Cβ.

Conclusions

The probabilistic methods described here for identifying side chains and their rotamers in the electron density at positions derived from a main-chain tracing are found to be very effective. With a map of reasonable quality and a segment of ten residues or longer, the alignment to the sequence can often be identified with a confidence greater than 98%. As a side effect of aligning the model to the sequence, the quality of the main-chain protein tracing can be considerably improved. This is largely owing to the identification of errors in the main chain and elimination of the corresponding segments of main-chain model. For example, mis-tracings caused by tracing the chain through a loop region using fewer residues than are present in the actual loop are removed in this procedure. The procedure of finding the longest sub-fragment of a main-chain fragment that had a very strong match to the protein sequence was very useful for identifying these cases. This removal of residues is the reason for the differences between the number of main-chain residues in the models in this work compared with those obtained when using only the main-chain tracing algorithm (Terwilliger, 2003). The algorithm for side-chain fitting and alignment developed here does have significant limitations. The side-chain rotamer libraries used are not as complete or as accurate as others available (e.g. Dunbrack & Cohen, 1997; Lovell et al., 2000). This means that not all reasonable rotamers in proteins can be accurately represented by one of those in the library and that some side chains will therefore be poorly fitted. This could be improved by using a more complete library, but at a cost of examining a larger number of templates for a fit to the electron-density map. A possible compromise would be to use a more complete library only for those side chains that are not well fitted with a rotamer from the standard one.
It could also be improved by using a filtered library such as that of Lovell et al. (2000), which removes conformations that are unlikely to be correct, or by explicitly checking side chains for poor contacts. An additional limitation is that some common situations are not recognized by the rotamer libraries (and by the main-chain model building that precedes side-chain fitting). These include disulfide bonds in proteins, unusual amino acids and all non-protein electron density. In most of these situations, the model simply does not include the corresponding region. There could be cases where such density is misinterpreted in terms of main-chain and side-chain conformations that are in the corresponding libraries, however.

20 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. The penultimate rotamer library.

Authors: S C Lovell; J M Word; J S Richardson; D C Richardson
Journal: Proteins Date: 2000-08-15

3. Determining protein structure from electron-density maps using pattern matching.

Authors: T Holton; T R Ioerger; J A Christopher; J C Sacchettini
Journal: Acta Crystallogr D Biol Crystallogr Date: 2000-06

4. Pattern-recognition methods to identify secondary structure within X-ray crystallographic electron-density maps.

Authors: Thomas Oldfield
Journal: Acta Crystallogr D Biol Crystallogr Date: 2002-02-21

5. Engineering soluble proteins for structural genomics.

Authors: Jean-Denis Pédelacq; Emily Piltch; Elaine C Liong; Joel Berendzen; Chang-Yub Kim; Beom-Seop Rho; Min S Park; Thomas C Terwilliger; Geoffrey S Waldo
Journal: Nat Biotechnol Date: 2002-08-19 Impact factor: 54.908

6. Selection of representative protein data sets.

Authors: U Hobohm; M Scharf; R Schneider; C Sander
Journal: Protein Sci Date: 1992-03 Impact factor: 6.725

7. Tertiary templates for proteins. Use of packing criteria in the enumeration of allowed sequences for different structural classes.

Authors: J W Ponder; F M Richards
Journal: J Mol Biol Date: 1987-02-20 Impact factor: 5.469

8. Refined crystal structure of DsRed, a red fluorescent protein from coral, at 2.0-A resolution.

Authors: D Yarbrough; R M Wachter; K Kallio; M V Matz; S J Remington
Journal: Proc Natl Acad Sci U S A Date: 2001-01-16 Impact factor: 11.205

9. Structure of the gene V protein of bacteriophage f1 determined by multiwavelength x-ray diffraction on the selenomethionyl protein.

Authors: M M Skinner; H Zhang; D H Leschnitzer; Y Guan; H Bellamy; R M Sweet; C W Gray; R N Konings; A H Wang; T C Terwilliger
Journal: Proc Natl Acad Sci U S A Date: 1994-03-15 Impact factor: 11.205

10. Structure of translation initiation factor 5A from Pyrobaculum aerophilum at 1.75 A resolution.

Authors: T S Peat; J Newman; G S Waldo; J Berendzen; T C Terwilliger
Journal: Structure Date: 1998-09-15 Impact factor: 5.006
