Joe Dundas, T A Binkowski, Bhaskar DasGupta, Jie Liang.
Abstract
BACKGROUND: Identifying structurally similar proteins with different chain topologies can aid studies in homology modeling, protein folding, protein design, and protein evolution. These include circularly permuted protein structures, and the more general cases of non-cyclic permutations between similar structures, which are related by non-topological rearrangement beyond circular permutation. We present a method based on an approximation algorithm that finds sequence-order-independent structural alignments that are close to optimal. We formulate the structural alignment problem as a special case of the maximum-weight independent set problem, and solve this computationally intensive problem approximately by iteratively solving relaxations of a corresponding integer programming problem. The resulting structural alignment is sequence-order independent. Our method is also insensitive to insertions, deletions, and gaps.
Year: 2007 PMID: 17937816 PMCID: PMC2096629 DOI: 10.1186/1471-2105-8-388
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Circular permutation example. A cartoon illustration of three protein structures whose domains are similarly arranged in space but appear in a different order in the primary sequence. The locations of domains A, B, and C in the primary sequences are shown in a layout below each structure. Their orderings are related by circular permutation [2].
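The relationship between domain orderings described in Figure 1 can be checked mechanically. The following is a minimal sketch, assuming each protein is summarized as a sequence of domain labels; the function name `is_circular_permutation` is hypothetical and is not part of the published method, which works on three-dimensional structures rather than labels.

```python
def is_circular_permutation(order_a, order_b):
    """Return True if order_b is a rotation of order_a.

    Both arguments are sequences of domain labels, e.g. "ABC".
    This only compares label order; it is an illustration, not the
    structural alignment method of the paper.
    """
    a, b = list(order_a), list(order_b)
    if len(a) != len(b):
        return False
    if not a:
        return True
    doubled = a + a  # every rotation of a is a window of a + a
    n = len(b)
    return any(doubled[i:i + n] == b for i in range(n))


# "BCA" and "CAB" are rotations of "ABC"; "ACB" is not, so it would
# correspond to a non-cyclic permutation of the same three domains.
print(is_circular_permutation("ABC", "BCA"))
print(is_circular_permutation("ABC", "ACB"))
```

An ordering that fails this check but uses the same domains corresponds to the more general non-cyclic case mentioned in the abstract.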
Known circular permutation results
| Protein 1 | Protein 2 | Us | DaliLite | K2 | ||||
| PDB(Length) | PDB(Length) | RMSD | RMSD | RMSD | ||||
| 1rinA(180) | 2cna_(237) | 10⁻⁶ | 106 | 1.7 | 60 | 0.92 | ||
| 1glh_(214) | 1cpn_(208) | 10⁻⁵ | 156 | 0.4 | 156 | 0.41 | ||
| 1exg_(110) | 1tul_(102) | 10⁻⁴ | 63 | 4.0 | 34 | 2.26 | ||
| 1rhgA(145) | 1bcfA(158) | 10⁻⁴ | 94 | 2.3 | 81 | 1.51 | ||
| 1ihwA(52) | 1sso_(62) | 10⁻³ | 45 | 2.9 | 28 | 1.93 | ||
Comparison of results against DaliLite and K2. DaliLite is not expected to find sequence-order-independent alignments. K2 did not find the circular permutation even when the sequence-order-independent option was selected.
Known circular permutation results
| Protein 1 | Protein 2 | Us | MASS | OPAAS | SAMO | Topofit | ||||||
| PDB(Length) | PDB(Length) | |||||||||||
| 1rinA(180) | 2cna_(237) | 10⁻⁶ | 164 | 1.2 | 167 | 1.48 | 174 | 1.581 | 152 | 1.09 | ||
| 1glh_(214) | 1cpn_(208) | 10⁻⁵ | 206 | 0.49 | No solution | | 170 | 3.283 | 206 | 0.49 | ||
| 1exg_(110) | 1tul_(102) | 10⁻⁴ | 60 | 1.9 | No solution | | 93 | 2.88 | 52 | 1.79 | ||
| 1rhgA(145) | 1bcfA(158) | 10⁻⁴ | 106 | 1.7 | 63 | 2.12 | 126 | 2.309 | 109 | 1.4 | ||
| 1ihwA(52) | 1sso_(62) | 10⁻³ | 39 | 1.7 | No solution | | 48 | 2.713 | 35 | 1.47 | ||
Comparison of our alignment results with those of MASS, OPAAS, SAMO, and Topofit for known circular permutations. Each method detected the circular permutations. Our method normally returns more equivalent residues at a lower RMSD. N indicates the number of aligned residues. An * next to the number of aligned residues indicates that a circular permutation was found. R indicates the cRMSD of the alignment. p indicates the p-value of our alignment.
Figure 2. Nucleoplasmin-core and auxin binding protein 1. A new circular permutation discovered between nucleoplasmin-core (1k5j, chain E, top panel) and the fragment of residues 37–127 of auxin binding protein 1 (1lrh, chain A, bottom panel). a) These two proteins superimpose well spatially, with an RMSD value of 1.36 Å for an alignment length of 68 residues and a significant p-value of 2.7 × 10⁻⁵ after Bonferroni correction. b) These proteins are related by a circular permutation. The short loop connecting strand 4 and strand 5 of nucleoplasmin-core (in rectangle, top) becomes disconnected in auxin binding protein 1. The N- and C-termini of nucleoplasmin-core (in ellipse, top) become connected in auxin binding protein 1 (in ellipse, bottom). For visualization, residues in the N-to-C direction before the cut in the nucleoplasmin-core protein are colored red, and residues after the cut are colored blue. c) The topology diagram of these two proteins. In the original structure of nucleoplasmin-core, the electron density of the loop connecting strand 4 and strand 5 is missing.
Figure 3. Aspartate racemase and type II 3-dehydroquinate dehydratase. A new circular permutation discovered between aspartate racemase (1iu9, chain A, top) and type II 3-dehydroquinate dehydratase (1h0r, chain A, bottom). a) These two proteins superimpose well spatially, with an RMSD of 1.49 Å over 59 residues and a significant p-value of 4.7 × 10⁻⁴. b) These proteins are related by a circular permutation. The loop connecting helix 1 with strand 1 in aspartate racemase (in rectangle, top) becomes disconnected in type II 3-dehydroquinate dehydratase (in rectangle, bottom), but the N- and C-termini of aspartate racemase (in ellipse, top) become connected in 3-dehydroquinate dehydratase (in ellipse, bottom) with an insertion (shown in green). For visualization, residues of aspartate racemase in the N-to-C direction before the cut in 3-dehydroquinate dehydratase are colored red, and residues after the cut are colored blue. c) The topology diagram of these two proteins. Here an ellipse represents a helix and a block arrow represents a strand.
Figure 4. Macrophage migration inhibitory factor and C-terminal domain of arginine repressor. A new circular permutation discovered between the macrophage migration inhibitory factor (MIF, PDB ID 1uiz, chain A, top) and the C-terminal domain of arginine repressor (AR, 1xxa, chain C, bottom). a) These two proteins superimpose well spatially, with an RMSD of 1.74 Å for an alignment length of 24 residues and a p-value of 1.3 × 10⁻². b) These proteins are related by a circular permutation. The loop connecting helix 1 with strand 2 of MIF (in rectangle, top) becomes disconnected in arginine repressor, and the N- and C-termini of MIF (in ellipse, top) become connected in arginine repressor (in ellipse, bottom). The disconnection of helix 1 from strand 2 of MIF removes some spatial constraints, allowing strand 1' in AR to swap places with strand 4'. c) The topology diagram of these two proteins. d) The artificial topology diagram for arginine repressor, where strand 2' and strand 4' are spatially swapped back. The diagram for AR in (c) has the same topology as the diagram in (d).
Figure 5. A non-cyclic permutation. A novel non-cyclic permutation discovered between AML1/Core Binding Factor (AML1/CBF, PDB ID 1e50, chain F, top) and riboflavin synthase (PDB ID 1pkv, chain A, bottom). a) These two proteins superimpose well spatially, with an RMSD of 1.23 Å and an alignment length of 42 residues, with a significant p-value of 2.8 × 10⁻⁴ after Bonferroni correction. Aligned residues are colored blue. b) These proteins are related by multiple permutations. The steps to transform the topology of AML1/CBF (top) to riboflavin synthase (bottom) are as follows: c) Remove the loops connecting strand 1 to helix 2, strand 4 to strand 5, and strand 5 to helix 6; d) Connect the C-terminal end of strand 4 to the original N-terminus; e) Connect the C-terminal end of strand 5 to the N-terminal end of helix 2; f) Connect the original C-terminus to the N-terminal end of strand 5. The N-terminal end of strand 6 becomes the new N-terminus and the C-terminal end of strand 1 becomes the new C-terminus. This yields the topology diagram of riboflavin synthase.
Constraints
[Table body not recoverable: two groups of interval clique inequalities, each listing seven line-sweep positions, followed by the consistency inequalities. The expressions themselves are missing from this record.]
The constraints of the conflict graph for the set of fragments in Figure 6c.
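To make the role of the conflict graph concrete, here is a minimal sketch of the maximum-weight independent set objective on a toy instance. It uses exhaustive search, which is only feasible for a handful of nodes; it is not the iterative relaxation scheme described in the abstract. The function name, node names, and weights are hypothetical.

```python
from itertools import combinations


def max_weight_independent_set(weights, conflicts):
    """Brute-force maximum-weight independent set (illustration only).

    weights:   dict mapping node (an aligned fragment pair) -> similarity weight
    conflicts: iterable of (u, v) pairs that cannot both be selected
    Returns (best_subset, best_weight).
    """
    conflict_set = {frozenset(edge) for edge in conflicts}
    nodes = sorted(weights)
    best, best_weight = frozenset(), 0.0
    for r in range(1, len(nodes) + 1):
        for subset in combinations(nodes, r):
            # Skip any subset that contains two conflicting nodes.
            if any(frozenset(p) in conflict_set for p in combinations(subset, 2)):
                continue
            total = sum(weights[n] for n in subset)
            if total > best_weight:
                best, best_weight = frozenset(subset), total
    return best, best_weight


# Toy instance: four fragment pairs whose conflicts form a path x1-x2-x3-x4.
toy_weights = {"x1": 3.0, "x2": 2.0, "x3": 2.5, "x4": 1.0}
toy_conflicts = [("x1", "x2"), ("x2", "x3"), ("x3", "x4")]
chosen, total = max_weight_independent_set(toy_weights, toy_conflicts)
print(sorted(chosen), total)
```

In an integer programming formulation each node gets a 0/1 variable, and each conflict (or clique of mutually conflicting nodes) contributes an inequality bounding the sum of its variables by 1; the exhaustive search above enforces the same restriction directly.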
Alignment quality
| Proteins | HOMSTRAD | FAST | US | ||||||
| PDB(PDB) | PDB(PDB) | N | RMSD | N | M% | RMSD | N | M% | RMSD |
| 57 | 2.5 | 55 | 55% | 1.2 | |||||
| 258 | 1.1 | 255 | 99% | 1.1 | |||||
| 192 | 4.3 | 187 | 89% | 2.0 | |||||
| 105 | 2.2 | 98 | 99% | 2.0 | |||||
| 403 | 5.6 | 343 | 98% | 1.7 | |||||
| 330 | 3.6 | 284 | 97% | 2.3 | |||||
| 33 | 4.7 | 28 | 100% | 1.9 | |||||
| 43 | 2.4 | 40 | 93% | 2.2 | |||||
| 220 | 1.7 | 214 | 97% | 1.5 | |||||
| 377 | 4.6 | 323 | 97% | 1.8 | |||||
| 582 | 2.3 | 546 | 97% | 1.7 | |||||
Table IV from Zhu et al. (2005) with the addition of our alignment results. Zhu et al. chose the following alignment examples to cover a broad range of structural classes. For each alignment, our method returned sequence-ordered alignments. N is the number of aligned residues found by each method, and M% is the percentage of aligned residues generated by the corresponding algorithm that are equivalent to HOMSTRAD's aligned residues.
Figure 6. Implementation example with vertex sweep. An illustration of the first iteration of our algorithmic approach for BSSI: a) The cartoon representation of two circularly permuted proteins; b) The problem represented as a graph where each node χ ∈ Λ represents an aligned fragment pair and each edge connects two inconsistent pairs; c) An illustration of how sweep lines (dashed) can identify inconsistent aligned pairs as required to generate the interval clique inequalities. A rectangle is an ordered fragment pair (e.g., the solid green rectangle is the pair χ5).
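The sweep-line idea in panel (c) can be sketched as follows. This is a minimal illustration under a stated assumption: two fragment pairs are taken to be inconsistent on one protein when their fragments on that protein share at least one residue, so all pairs crossed by the same sweep line form a clique. The function name and the interval data are hypothetical, and the sketch handles a single protein axis only.

```python
def interval_cliques(intervals):
    """Enumerate maximal sets of mutually overlapping intervals by a line sweep.

    intervals: dict mapping a fragment-pair name -> (start, end), inclusive
               residue indices of its fragment on ONE protein (assumption).
    Returns a list of frozensets with at least two members; each set would
    give one inequality of the form sum(x_i for i in clique) <= 1.
    """
    events = []
    for name, (start, end) in intervals.items():
        events.append((start, 0, name))  # 0 = open; sorts before a close at the same position
        events.append((end, 1, name))    # 1 = close
    events.sort()

    active = set()
    cliques = []
    grew = False  # True if an interval was opened since the last emitted clique
    for _pos, kind, name in events:
        if kind == 0:
            active.add(name)
            grew = True
        else:
            # The active set is maximal just before the first close that follows an open.
            if grew and len(active) > 1:
                cliques.append(frozenset(active))
            grew = False
            active.discard(name)
    return cliques


fragments = {"x1": (1, 5), "x2": (4, 8), "x3": (7, 10), "x4": (20, 25)}
for clique in interval_cliques(fragments):
    print(" + ".join(sorted(clique)), "<= 1")
```

Running the same sweep over the fragments on the second protein would produce the second group of clique inequalities; singletons are skipped because they only restate that each variable is at most 1.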
Figure 7. Secondary structure cRMSD distributions. The cRMSD distributions of a) helices of length 4, b) helices of length 5, c) helices of length 6, d) helices of length 7, e) strands of length 4, f) strands of length 5, g) strands of length 6, and h) strands of length 7.
Figure 8. Similarity score versus length. a) Linear fit of the raw similarity score σ(X) (equation 8) as a function of the geometric mean of the lengths of the two aligned proteins, where each length is the number of residues in the corresponding protein structure. The linear regression line (grey line) has a slope of 0.314. b) Linear fit of the normalized similarity score (equation 9) as a function of the geometric mean of the lengths of the two aligned proteins. The linear regression line (grey line) has a slope of -0.0004.
Figure 9. Similarity score distribution. The distribution of the normalized similarity scores obtained by aligning 200,000 pairs of proteins randomly selected from PDBSELECT 25% [19]. The distribution can be fit to an Extreme Value Distribution, with parameters α = 14.98 and β = 3.89.
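A fitted extreme value distribution of this kind is typically used to turn a score into a p-value. The sketch below rests on assumptions that this record does not confirm: it treats the distribution as a Gumbel (type I) distribution with α as the location and β as the scale, and it treats larger normalized scores as more significant, so the p-value is the upper-tail probability. The function names are hypothetical. The Bonferroni helper reflects the correction mentioned in the captions of Figures 2 and 5, where the number of tests is not stated here.

```python
import math

ALPHA = 14.98  # reported EVD parameter; treated as the location (assumption)
BETA = 3.89    # reported EVD parameter; treated as the scale (assumption)


def evd_p_value(score, alpha=ALPHA, beta=BETA):
    """Upper-tail probability P(S >= score) under a Gumbel(alpha, beta) model.

    CDF: F(s) = exp(-exp(-(s - alpha) / beta)); the p-value is 1 - F(s).
    expm1 keeps precision when the p-value is very small.
    """
    z = (score - alpha) / beta
    return -math.expm1(-math.exp(-z))


def bonferroni(p_value, n_tests):
    """Bonferroni correction: multiply by the number of tests, capped at 1."""
    return min(1.0, p_value * n_tests)


# Under these assumptions, p-values shrink as the score moves above alpha.
for s in (15.0, 30.0, 45.0):
    print(s, evd_p_value(s))
```

If the authors' parametrization or tail direction differs, only the expression for `z` and the tail would change; the structure of the calculation stays the same.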