Niv Sabath, Andreas Wagner, David Karlin.
Abstract
New protein-coding genes can originate either through modification of existing genes or de novo. Recently, the importance of de novo origination has been recognized in eukaryotes, although eukaryotic genes that originated de novo are relatively rare and difficult to identify. In contrast, viruses contain many de novo genes, namely those in which an existing gene has been "overprinted" by a new open reading frame, a process that generates a new protein-coding gene overlapping the ancestral gene. We analyzed the evolution of 12 experimentally validated viral genes that originated de novo and estimated their relative ages. We found that young de novo genes have a different codon usage from the rest of the genome. They evolve rapidly and are under positive or weak purifying selection. Thus, young de novo genes might have strain-specific functions, or no function, and would be difficult to detect using current genome annotation methods that rely on the sequence signature of purifying selection. In contrast to young de novo genes, older de novo genes have a codon usage that is similar to the rest of the genome. They evolve slowly and are under stronger purifying selection. Some of the oldest de novo genes evolve under stronger selection pressure than the ancestral gene they overlap, suggesting an evolutionary tug of war between the ancestral and the de novo gene.
Year: 2012 PMID: 22821011 PMCID: PMC3494269 DOI: 10.1093/molbev/mss179
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
Fig. 1. Monophyletic distribution of genes originated de novo. (a) A gene that originated de novo (blue arrows) will exhibit a monophyletic distribution among related taxa. However, this distribution could also be the result of divergence of the gene beyond recognition or of acquisition of the gene through horizontal gene transfer (HGT). (b) For a gene that originated de novo (blue arrows) by overprinting an ancestral reading frame (red arrows), these confounding factors can be excluded (see Introduction). Colors are displayed in the electronic version of the article.
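The idea of overprinting in panel (b) can be made concrete with a toy example: one nucleotide stretch read in two different frames yields two unrelated protein sequences. The sketch below is illustrative only. The sequence `TOY_SEQUENCE` and the helper `translate` are hypothetical names introduced here, not taken from the study, and the code assumes a DNA alphabet (A, C, G, T) and the standard genetic code.

```python
# Standard genetic code, built from the conventional T, C, A, G base ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(nt: str, frame: int = 0) -> str:
    """Translate a DNA string starting at offset `frame` (0, 1, or 2).

    Trailing bases that do not fill a complete codon are ignored.
    Stop codons are rendered as '*'.
    """
    nt = nt.upper()
    codons = (nt[i:i + 3] for i in range(frame, len(nt) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


# Hypothetical 12-nt sequence, chosen only to illustrate two overlapping frames.
TOY_SEQUENCE = "ATGGCCAAATGA"

ancestral_like = translate(TOY_SEQUENCE, frame=0)  # reads ATG GCC AAA TGA
overprinted_like = translate(TOY_SEQUENCE, frame=1)  # reads TGG CCA AAT

print(ancestral_like)    # MAK*
print(overprinted_like)  # WPN
```

Because both products are encoded by the same nucleotides, a substitution that is synonymous in one frame is often nonsynonymous in the other, which is why selection on the two genes cannot be treated as independent.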
Overlapping Genes in the Study.
| Clade | Genome Accession Number | Family | Genus | Species | Taxonomic Distribution of the Overlap | Ancestral Frame | De Novo Frame | Number of Sequences (or Sequence Pairs) in the Analyses (divergence, | Length of the Overlapping Region (nt) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | NC_007358 | | | | Single species | PB1 | PB1-F2 | 10, 9, and 5 | 273 |
| 2 | NC_001366 | | | | 3 species in same genus | Polyprotein | Protein L* | 3, 2, and 3 | 468 |
| 3 | NC_003045 | | | | 2 genotype groups | Nucleocapsid (N) | Protein I | 70, 33, and 11 | 624 |
| 4 | NC_005899 | | | | Whole genus | Capsid protein | p17 | 3, 1, and 3 | 382 |
| 5 | NC_004063 | | | | Whole genus | Replicase | ORF69 (movement protein) | 120, 1, and 16 | 1,880 |
| 6 | NC_004366 | | | | Whole genus | ORF3 (movement protein) | ORF3 (long distance movement protein) | 10, 0, and 5 | 698 |
| 7 | NC_003809 | | | | Whole genus | Polymerase | Protein 2b | 15, 6, and 6 | 308 |
| 8 | NC_003627 | | | | Whole genus (contains a single species) | Coat protein | p31 | 3, 0, and 3 | 451 |
| 9 | NC_009025 | | | | Whole genus | Structural polyprotein | Pog | 6, 3, and 4 | 312 |
| 10 | NC_001498 | | | | 2 genera in same family | Phosphoprotein (P) | Protein C | 10, 1, and 5 | 561 |
| 11 | NC_003977 | | | | Whole family (contains 2 genera) | Polymerase (P) | Large envelope protein (L) | 28, 13, and 8 | 834 |
| 12 | NC_003448 | | | | Whole genus | Protein A | Protein B | 10, 9, and 6 | 228 |
Fig. 2. Structural and functional organization of the overlapping genes we studied. Proteins encoded by overlapping genes are shown to scale. For each protein pair, the ancestral protein is shown on the bottom and the de novo protein on top. B1, base domain 1; cc, coiled coil; Le, Leader region; PA2, phospholipase A2 domain; RdRP, RNA-dependent RNA polymerase domain; tm, transmembrane segment; z, zinc-binding region.
Mean Values of Three Evolutionary Properties for Ancestral and De Novo Genes.
| Property | Ancestral | De Novo | P value |
|---|---|---|---|
| CSI | 0.66 (0.09) | 0.62 (0.11) | 5.4 × 10⁻⁵ |
| Relative divergence | 1.06 (0.52) | 1.85 (1.20) | 7.1 × 10⁻³⁴ |
| Selection intensity | 0.40 (0.40) | 0.75 (0.55) | 2.9 × 10⁻⁴ |

Numbers in parentheses are standard deviations.
Fig. 3. Evolutionary dynamics of ancestral (red) and de novo (blue) genes. The vertical axes show (a) relative divergence and (b) selective constraint (dN/dS) for the 12 taxa. The horizontal axis represents the evolutionary distance from the origin of each de novo gene (i.e., the estimated age of genes within the clade). Regression lines are plotted for visualization of general trends. Low dN/dS values represent strong selective constraints (see text). Note that dN/dS in (b) could only be calculated for gene pairs with less than 50% divergence at the amino acid level (see Materials and Methods). No selective constraint data could be calculated for cases 6 and 8 (bottom panel), as the sequence pairs in these clades have all diverged beyond 50%. Where neighboring groups had similar ages (groups 5 and 6), we shifted their positions slightly for visual clarity.
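The 50% amino acid divergence cutoff mentioned in the caption can be sketched as a simple filter on aligned protein pairs. This is a minimal illustration under stated assumptions: divergence is taken here as the plain proportion of differing ungapped alignment columns (a p-distance), which may differ from the measure used in the study, and the function names `aa_divergence` and `usable_for_dnds` are hypothetical.

```python
def aa_divergence(seq_a: str, seq_b: str) -> float:
    """Proportion of ungapped aligned positions where two proteins differ.

    Assumes both sequences come from the same alignment (equal length),
    with '-' marking gaps. Columns with a gap in either sequence are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return differing / compared


def usable_for_dnds(seq_a: str, seq_b: str, max_divergence: float = 0.5) -> bool:
    """Return True if the pair is below the divergence cutoff (assumed 50%)."""
    return aa_divergence(seq_a, seq_b) < max_divergence


# Hypothetical aligned fragments, for illustration only.
print(aa_divergence("MKTA", "MKTV"))    # 0.25
print(usable_for_dnds("MKTA", "MKTV"))  # True
print(usable_for_dnds("AAAA", "AATT"))  # False (exactly 50% divergent)
```

A clade in which every pair fails this filter contributes no points to panel (b), which is consistent with the absence of dN/dS estimates for cases 6 and 8.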
Fig. 4. Codon Similarity Index (CSI) of ancestral (red) and de novo genes (blue). The horizontal axis represents the evolutionary distance from the origin of each de novo gene (as in fig. 3). Regression lines are plotted for visualization of general trends. High CSI values indicate high similarity between the codon usage of a gene and the codon usage of the rest of a genome. Colors are displayed in the electronic version of the paper.
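The intuition behind comparing a gene's codon usage with that of the rest of the genome can be sketched as follows. This is not the CSI formula, which is not given in this record. As a stand-in, the sketch scores similarity as one minus the total variation distance between two codon frequency distributions, so that 1 means identical usage and 0 means no shared codons. The function names and the toy sequences are assumptions made for illustration.

```python
from collections import Counter


def codon_frequencies(cds: str) -> dict:
    """Relative frequency of each codon in an in-frame coding sequence.

    Trailing bases that do not complete a codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    if usable == 0:
        raise ValueError("sequence contains no complete codon")
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def codon_usage_similarity(gene: str, background: str) -> float:
    """Stand-in similarity score in [0, 1]; NOT the CSI used in the study.

    Computed as 1 minus the total variation distance between the codon
    frequency distributions of `gene` and `background`.
    """
    f = codon_frequencies(gene)
    g = codon_frequencies(background)
    codons = set(f) | set(g)
    distance = 0.5 * sum(abs(f.get(c, 0.0) - g.get(c, 0.0)) for c in codons)
    return 1.0 - distance


# Toy sequences, for illustration only.
print(codon_usage_similarity("AAAGGG", "GGGAAA"))  # 1.0 (same codon usage)
print(codon_usage_similarity("AAACCC", "AAAAAA"))  # 0.5
print(codon_usage_similarity("AAAAAA", "CCCCCC"))  # 0.0 (no shared codons)
```

For an overlapping gene pair, `gene` would be read in its own frame and `background` would be the coding sequences of the rest of the genome, so a freshly overprinted frame is expected to score lower than a long-established one.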
Evidence of Expression, Function, and Fitness Effect of the De Novo Genes in the Study.
| Group | Genus | Evidence for Expression | Function(s) | Fitness Effect When the Novel Gene Is Suppressed | Description of Effect and References |
|---|---|---|---|---|---|
| 1 | | | Virulence factor | Little or no effect | Suppression of PB1-F2 affected neither viral replication nor virus loads in the lungs of mice |
| 2 | | | Involved in the establishment of persistent infections of the central nervous system | Moderate effect | Suppression of L* decreases the ability of Theiler’s virus to induce a chronic infection of the central nervous system |
| 3 | | | Unknown | Little or no effect | Suppression of Protein I expression led only to a reduced plaque size, suggesting a minor effect on fitness |
| 4 | | | Unknown | Unknown | Unknown |
| 5 | | | Viral movement through the plant | Severe effect | A knock-out mutant of the movement protein replicates only at low levels in protoplasts |
| 6 | | | Long-distance (systemic) movement in plants | Severe effect | Long-distance movement is abolished in the absence of ORF4 plants |
| 7 | | | Unknown | Unknown | Unknown |
| 8 | | | Unknown | Unknown | Unknown |
| 9 | | | Unknown | Unknown | Unknown |
| 10 | | | Virulence factor | Severe effect | Suppression of C results in much milder symptoms and lower mortality in mice |
| 11 | | | Viral envelope glycoprotein | Severe effect | Deletions within the S domain of the envelope protein drastically reduce infectivity |
| 12 | | | Blocks RNA interference | Severe effect | Suppression of B2 causes a severe impairment in the intracellular accumulation of viral RNA in cell culture |