Douglas D Axe, Brendan W Dixon, Philip Lu.
Abstract
The study of protein evolution is complicated by the vast size of protein sequence space, the huge number of possible protein folds, and the extraordinary complexity of the causal relationships between protein sequence, structure, and function. Much simpler model constructs may therefore provide an attractive complement to experimental studies in this area. Lattice models, which have long been useful in studies of protein folding, have found increasing use here. However, while these models incorporate actual sequences and structures (albeit non-biological ones), they incorporate no actual functions--relying instead on largely arbitrary structural criteria as a proxy for function. In view of the central importance of function to evolution, and the impossibility of incorporating real functional constraints without real function, it is important that protein-like models be developed around real structure-function relationships. Here we describe such a model and introduce open-source software that implements it. The model is based on the structure-function relationship in written language, where structures are two-dimensional ink paths and functions are the meanings that result when these paths form legible characters. To capture something like the hierarchical complexity of protein structure, we use the traditional characters of Chinese origin. Twenty coplanar vectors, encoded by base triplets, act like amino acids in building the character forms. This vector-world model captures many aspects of real proteins, including life-size sequences, a life-size structural repertoire, a realistic genetic code, secondary, tertiary, and quaternary structure, structural domains and motifs, operon-like genetic structures, and layered functional complexity up to a level resembling bacterial genomes and proteomes. Stylus is a full-featured implementation of the vector world for Unix systems. 
To demonstrate the utility of Stylus, we generated a sample set of homologous vector proteins by evolving successive lines from a single starting gene. These homologues show sequence and structure divergence resembling those of natural homologues in many respects, suggesting that the system may be sufficiently life-like for informative comparison to biology.
Year: 2008 PMID: 18523658 PMCID: PMC2405935 DOI: 10.1371/journal.pone.0002246
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Structural analogy between Han characters and protein folds.
This two-part character (identified by its hexadecimal Unicode number, U+8C58) is reminiscent of two-part protein folds like the one shown (PDB 1CQD).
Figure 2. Hierarchical structure of Han characters.
Single strokes, like that shown at the bottom, are combined to form successively more complex structures (shown as ascending layers). Characters range in complexity from a single stroke to dozens of strokes.
Figure 3. Monomers and genetic code for construction of model proteins.
A) The set of vector monomers, named according to compass direction and length (e.g., Nem indicates a northeast vector of medium length). To ensure that vector addition produces different results with different vector combinations, small vectors are of length 1, medium vectors of length e^(1/2) (≈1.65), and long vectors of length e (≈2.72). B) A standard genetic code for specifying the monomers with nucleotide triplets. Like the natural code [15], this code incorporates several features that reduce the impact of point mutations. These include extensive use of third-position degeneracy, strong correlation of second position with a key physical property (direction), and underrepresentation of vectors that would be most disruptive as substitutes (long vectors).
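The monomer set in panel A can be sketched in a few lines of Python. This is an illustration only, not code from Stylus: the coordinate convention (north as +y, east as +x) and the exact 45-degree diagonals are assumptions, and the names `monomer_vector` and `path_end` are hypothetical. The 20 monomer names are taken from the Figure 17 caption.

```python
import math

# Assumed conventions (not stated in the caption): north is +y, east is +x,
# and diagonal directions lie at exactly 45 degrees.
DIRECTIONS = {
    "No": (0, 1), "Ne": (1, 1), "Ea": (1, 0), "Se": (1, -1),
    "So": (0, -1), "Sw": (-1, -1), "We": (-1, 0), "Nw": (-1, 1),
}

# Lengths from the caption: small = 1, medium = e^(1/2), long = e.
LENGTHS = {"s": 1.0, "m": math.exp(0.5), "l": math.e}

# The 20 monomer names as listed in the Figure 17 caption.
MONOMERS = [
    "Nos", "Nom", "Nol", "Nes", "Nem", "Eas", "Eam", "Eal", "Ses", "Sem",
    "Sos", "Som", "Sol", "Sws", "Swm", "Wes", "Wem", "Wel", "Nws", "Nwm",
]


def monomer_vector(name):
    """Return the (x, y) displacement for a monomer name such as 'Nem'."""
    dx, dy = DIRECTIONS[name[:2]]
    norm = math.hypot(dx, dy)          # 1 for cardinal, sqrt(2) for diagonal
    length = LENGTHS[name[2]]
    return (length * dx / norm, length * dy / norm)


def path_end(names):
    """Add the monomers head to tail and return the end point of the path."""
    x = y = 0.0
    for name in names:
        vx, vy = monomer_vector(name)
        x += vx
        y += vy
    return (x, y)
```

With these lengths, two small north vectors end at a different point than one medium or one long north vector, which is the property the caption attributes to the choice of 1, e^(1/2), and e.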
Figure 4. Parallels between vector-world and real-world protein synthesis.
Steps are illustrated for a vector protein (U+8C58) on the left, with analogous aspects of a real protein (PDB 1CQD) on the right. A) Codons in an open reading frame specify monomers (vectors or amino acids) that may form regular local structure (green) or irregular local structure (grey). In the vector world a simple rule determines which is the case: A vector becomes part of regular structure if and only if it forms a coherent vector triplet (indicated by green tiles below the sequence; see text). B) Vectors are joined to form paths with head and tail termini, just as amino acids are joined to form chains with amino and carboxyl termini (right panel derived from public domain images by Yassine Mrabet). C) Vector proteins consist of strokes (formed by runs of coherent vectors) joined by moves (formed by runs of incoherent vectors), in much the same way that real proteins consist of units of secondary structure joined by turns or loops. D) Final working forms, highlighting the segments shown above.
Figure 5. Layered 2D representation of vector proteins.
Strokes (green) are placed on successively higher planes by rendering moves (blue) with a vertical component added to each vector.
Figure 6. Chain topology in real and vector proteins.
A) Sheet regions of 1VHR (left) and 1D1Q (right) with color running from blue to red in the amino-to-carboxyl direction. B) Vector proteins that perform the function of (U+5DDE) by means of different topologies, colored blue to red in the tail-to-head direction.
Figure 7. Domains as sub-structures with sub-functions.
A) Two proteins that use similar NAD-binding domains (orange). Left: α-glucosidase monomer from Thermotoga maritima (PDB 1OBB). Right: L-lactate dehydrogenase monomer from Bacillus stearothermophilus (PDB 1LDN). B) Two vector proteins that use similar domains (purple) as described in text.
Figure 8. The operon-like structure of vector-world genes encoding a sentence.
Gene names shown in white, with functional notation above or below. A) The genetic structure of the histidine operon of Escherichia coli (adapted from EcoCyc, http://ecocyc.org). B) The genetic structure of a vector-world gene suite encoding a sentence-level function (see text). Genes are named according to the Unicode number of their function.
Figure 9. Functional specificity of real proteins depends upon atomic-level details.
The products of the bioF and kbl genes of E. coli are virtually indistinguishable at the fold level, but the structural differences produce different functions. Left: BioF monomer (PDB 1DJE), which functions as a dimer in biotin biosynthesis. Right: Kbl monomer (PDB 1FC4), which functions as a dimer in threonine degradation.
Figure 10. Building archetypes for Han characters.
Left: U+8FF4 shown in fonts STFangSong, LiSong Pro, and MS Mincho (top to bottom). Arial Unicode (center) is the chosen standard for archetypes, which are scalable geometric specifications (right; see text).
Figure 11. Assessing shape distortion of vector strokes by comparing with ideal forms.
Colors differentiate the three strokes forming a component of (U+5F35). Dots show vector boundaries. Left: strokes from a vector protein with a proficiency score of 0.4 (shown in Figure 12). Middle: the ideal structure specified by the archetype. Right: scaled vector strokes laid over their archetype forms, with bounding rectangles shaded. Shape distortion is assessed for each stroke individually, the top stroke in this example having no distortion.
Figure 12. Qualitative correlation between functional proficiency of vector proteins and their legibility.
Initial genes encoding (U+4EAB), (U+5F35), and (U+684C) were generated with Inscribe and processed with Stylus (see Software). Initial proficiencies were above 0.6. Selection thresholds ran from 0.60 to 0.10 in steps of −0.05, followed by thresholds of 0.07, 0.05, 0.03, and 0.01. At each threshold, 1000 non-synonymous base substitutions were accumulated. Vector proteins shown are representative of the distortion seen at the indicated scores.
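The threshold schedule described in this caption can be written out explicitly. The sketch below only reproduces the arithmetic of the schedule; the function names are hypothetical and nothing here reflects how Stylus itself is configured.

```python
SUBSTITUTIONS_PER_THRESHOLD = 1000  # from the caption


def selection_thresholds():
    """Thresholds from 0.60 to 0.10 in steps of -0.05, then 0.07, 0.05, 0.03, 0.01."""
    # Build the main run from integers to avoid floating-point drift.
    main = [n / 100 for n in range(60, 5, -5)]
    tail = [0.07, 0.05, 0.03, 0.01]
    return main + tail


def total_substitutions():
    """Total non-synonymous substitutions accumulated over the whole schedule."""
    return len(selection_thresholds()) * SUBSTITUTIONS_PER_THRESHOLD
```

Read this way, the schedule has 15 thresholds, so each line of descent would accumulate 15,000 non-synonymous substitutions in total.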
Figure 13. Inscribe screenshot, showing archetype construction for (U+59D1).
Figure 14. Stylus process overview.
The system design accommodates either one-at-a-time processing or batch processing on a grid.
Figure 15. Stylus report screenshots.
A) Summary page, showing vector proteins at regular intervals along a line of descent. B) Structural section of a detail page for a single trial. Other sections on the same page provide details of fitness and proficiency scores, mutation history, and gene/protein sequences.
Figure 16. Structural superposition of ten homologous vector proteins having the function of (U+72D7).
Structures were aligned by translation in the three coordinate directions without rotation (thereby preserving vector directions).
Figure 17. Sequence alignment of ten homologous vector proteins.
The vector proteins shown in Figure 16 were aligned with ClustalW [26] without using amino-acid based information (i.e., using the identity matrix for substitution scoring and no protein-based gap penalties). Vectors were assigned single-letter amino acid codes (L = Nos; M = Nom; W = Nol; F = Nes; Y = Nem; A = Eas; S = Eam; T = Eal; D = Ses; E = Sem; H = Sos; K = Som; R = Sol; N = Sws; Q = Swm; G = Wes; C = Wem; P = Wel; V = Nws; I = Nwm) and colored according to direction (No = grey; Ne = gold; Ea = deep green; Se = teal; So = cyan; Sw = bright green; We = yellow; Nw = red). Asterisk indicates position of complete vector conservation (dots are not meaningful, being based on amino-acid similarities). Wavy underlines show stroke locations for sequence 10 (locations vary somewhat among sequences).
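The letter assignment above amounts to a lookup table, which makes it possible to pass vector sequences to a standard alignment tool as if they were protein sequences. The following sketch uses the mapping and colors exactly as given in the caption; the helper names (`encode`, `decode`, `direction_color`, `to_fasta`) are hypothetical, and the FASTA output is an assumed convenience, not a description of the authors' pipeline.

```python
# Vector-to-letter mapping as given in the caption.
VECTOR_TO_LETTER = {
    "Nos": "L", "Nom": "M", "Nol": "W", "Nes": "F", "Nem": "Y",
    "Eas": "A", "Eam": "S", "Eal": "T", "Ses": "D", "Sem": "E",
    "Sos": "H", "Som": "K", "Sol": "R", "Sws": "N", "Swm": "Q",
    "Wes": "G", "Wem": "C", "Wel": "P", "Nws": "V", "Nwm": "I",
}
LETTER_TO_VECTOR = {letter: vec for vec, letter in VECTOR_TO_LETTER.items()}

# Direction colors as given in the caption.
DIRECTION_COLOR = {
    "No": "grey", "Ne": "gold", "Ea": "deep green", "Se": "teal",
    "So": "cyan", "Sw": "bright green", "We": "yellow", "Nw": "red",
}


def encode(vectors):
    """Convert a list of vector names into a single-letter string."""
    return "".join(VECTOR_TO_LETTER[v] for v in vectors)


def decode(letters):
    """Convert a single-letter string back into a list of vector names."""
    return [LETTER_TO_VECTOR[c] for c in letters]


def direction_color(vector):
    """Color for a vector, determined by its two-letter direction prefix."""
    return DIRECTION_COLOR[vector[:2]]


def to_fasta(name, vectors):
    """Format one encoded sequence as a FASTA record (assumed input format)."""
    return f">{name}\n{encode(vectors)}\n"
```

Because the letters carry no chemical meaning here, any alignment run on such records should use identity scoring, as the caption notes.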
Figure 18. Effects of non-synonymous point mutations on vector protein function.
Relative mutant proficiencies were calculated by dividing mutant proficiencies by the pre-mutation proficiency, with the resulting values binned in increments of 0.0025. Points show how many mutations fall into each bin, the vertical scale running from zero to 10,000. The line shows the proportion of mutations (zero to one) with relative proficiencies less than or equal to the value on the horizontal scale. The point representing true neutral substitutions (30,203 mutations with relative proficiency = 1) is above the range shown.
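The quantities plotted here can be computed as follows. This is a minimal sketch of the calculation described in the caption, not the authors' analysis code: the function names are hypothetical, and the choice of half-open bins with a small tolerance at bin edges is an assumption.

```python
import math
from collections import Counter

BIN_WIDTH = 0.0025  # from the caption


def relative_proficiencies(mutant_scores, baseline):
    """Divide each mutant proficiency by the pre-mutation proficiency."""
    return [m / baseline for m in mutant_scores]


def bin_counts(values, width=BIN_WIDTH):
    """Count values per bin; bin k covers [k*width, (k+1)*width).

    The small tolerance keeps values that sit on a bin edge (such as 1.0)
    from falling into the bin below because of floating-point rounding.
    This edge handling is an assumption.
    """
    return Counter(math.floor(v / width + 1e-9) for v in values)


def cumulative_fraction(values, x):
    """Proportion of values less than or equal to x (the line in the plot)."""
    return sum(1 for v in values if v <= x) / len(values)
```

For example, `bin_counts` places a relative proficiency of exactly 1 (a true neutral substitution) in bin 400, and `cumulative_fraction(values, 1.0)` gives the share of mutations that do not improve proficiency.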
Summary of Correspondence Between Real World and Vector World
| Corresponding pairs (Vector Proteins / Natural Proteins) | Primary similarities | Primary differences | Implications |
| Existing analog: | Real-world mappings of structure to function. Real natural histories. Similar set sizes. | Characters are 2D; proteins are 3D. Characters are geometric; proteins are physical. | Opens possibility of constructing an artificial protein model around a real and tractable structure–function relationship. Static nature of written forms precludes dynamic folding model. |
| Existing analog: | As real-world phenomena, both carry real, complex constraints. | Legibility is observer dependent. | Opens possibility of evolutionary simulation under realistic functional constraints, with the limitation that numerical approximation will be required. |
| Constructed analog: | Multiple structural aspects. Each monomer is distinct. | Vectors have only two properties, whereas amino acids have many. | Space of structural possibilities for protein-length polymers is vast for both worlds. |
| Constructed analog: | 64 codons mapped to 20 monomers (plus start and stop). Third-position degeneracy. | More uniform representation of vectors (2 to 4 codons). Synonymous vector codons are precisely equivalent. | Synonymous substitution rate is precisely proportional to incident mutation rate in vector world. |
| Constructed analog: | Identical open-reading-frame structure. Similar typical lengths. | Vector gene expression has no dynamic aspect. | Full analogy to static aspects of bacterial genetics, though not suitable for studying genetic regulation. |
| Constructed analog: | Polymers of similar length that perform specific functions by means of well-defined structures. | Vector proteins have no folding process; no analog of active sites. | Rich sequence-to-structure analogy, strength being static structure not structural dynamics or enzyme kinetics. |
| Constructed analog: | Like real proteins, vector proteins have fold-like structural hierarchy, topological complexity, and a highly many-to-one mapping of structure to function. | Vector paths are static 2D structures. Protein backbones form dynamic 3D structures. | Rich structure-to-function analogy, though limited to static aspects. |
| Constructed analog: | Aspects of local chain geometry. More critical (than incoherent/irregular) for forming whole structure. Small breaks in regular structure may be tolerated. | Local vector structure is autonomous, whereas local protein structure forms cooperatively. | Autonomy of local vector structure may simplify modular assembly of genes encoding new structures. |
| Constructed analog: | Aspects of local chain geometry. Both connect units of regular structure. Single substitutions can induce regular structure. | Irregular ‘loops’ are often involved in forming natural active sites. | Both worlds show interplay between regular and irregular structure, where boundaries tend to shift upon mutation. |
| Mainly existing analog: | Basic structural components found in all structures. Many show topological variation. | Vector components are structurally autonomous, whereas protein components form cooperatively. | Autonomy of sub-structures in vector world may simplify modular assembly of genes encoding new structures. |
| Existing analog: | In both worlds, functionally significant parts combine to produce compound functions. | Vector domains are structurally autonomous, which may or may not be the case for protein domains. | Autonomy of functional sub-structures in vector world may simplify modular assembly of genes encoding new functions. |
| Mainly existing analog: | Genome organization affects high-level function in both worlds. | Vector gene order is constrained by rules of syntax. Bacterial gene interactions are less dependent on gene order. | Opens possibility of evolutionary simulation within genome-level functional constraints, though the form of these constraints is simpler in the vector world. |
| Mainly existing analog: | In both worlds, many functions require two or more proteins to come together. | Natural protein complexes are compound structures, whereas vector protein complexes follow from gene order. | Combining proteins to form a compound function may be simpler in the vector world, requiring juxtaposition of genes rather than construction of specific binding interfaces. |
| Mainly existing analog: | Protein-level functions may be combined to produce higher levels of function in both worlds. | High-level biological functions often require regulated expression and transport, in addition to complex formation. | Construction of protein systems with high-level function may be simpler in the vector world, requiring only correct gene order. |