Matthew E Andreatta, Joshua A Levine, Scott G Foy, Lynette D Guzman, Luke J Kosinski, Matthew H J Cordes, Joanna Masel.
Abstract
Protein-coding sequences can arise either from duplication and divergence of existing sequences, or de novo from noncoding DNA. Unfortunately, recently evolved de novo genes can be hard to distinguish from false positives, making their study difficult. Here, we study a more tractable version of the process of conversion of noncoding sequence into coding: the co-option of short segments of noncoding sequence into the C-termini of existing proteins via the loss of a stop codon. Because we study recent additions to potentially old genes, we are able to apply a variety of stringent quality filters to our annotations of what is a true protein-coding gene, discarding the putative proteins of unknown function that are typical of recent fully de novo genes. We identify 54 examples of C-terminal extensions in Saccharomyces and 28 in Drosophila, all of them recent enough to still be polymorphic. We find one putative gene fusion that turns out, on close inspection, to be the product of replicated assembly errors, further highlighting the issue of false positives in the study of rare events. Four of the Saccharomyces C-terminal extensions (to ADH1, ARP8, TPM2, and PIS1) that survived our quality filters are predicted to lead to significant modification of a protein domain structure.
Keywords: gene birth; origin of novelty; protein structure; stop codon readthrough
Year: 2015 PMID: 26002864 PMCID: PMC4494051 DOI: 10.1093/gbe/evv098
Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416
Figure 1. (A) The stop codon position in the sister species was used as an outgroup to determine whether a SCP was caused by an addition to or a subtraction from the ancestral coding sequence. Additions can result either from point mutations eliminating the stop codon or from indels that knock the stop codon out of frame. In the latter case, we distinguish between added amino acids that increase the total length of the protein, and new amino acids that include all novel amino acids following the indel. (B) When the SCP involves more than two stop codon positions, inference is more complicated. Here, at least one addition took place, plus one event that could have been either an addition or a subtraction. (C) At least one addition and one subtraction must have occurred to explain this phylogeny. More complex cases with more than three stop codon positions were classified using the same logic. While it is in principle possible to use the strain phylogeny (Liti et al. 2009) to distinguish the order of events in these cases, there is enough outcrossing between strains (Ruderfer et al. 2006) that the gene tree may not match the strain tree, and so this was not done.
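The counting logic in this caption can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name, the use of integer codon indices for stop positions, and the assumption that the outgroup position is ancestral are all introduced here for illustration only.

```python
from typing import Dict, Iterable


def summarize_stop_events(outgroup_stop: int, strain_stops: Iterable[int]) -> Dict[str, int]:
    """Count the events needed to explain the stop codon positions at one gene.

    Positions are codon indices of the stop codon (a hypothetical encoding).
    The outgroup position is assumed to represent the ancestral state.
    """
    # Only positions that differ from the outgroup require an event.
    derived = set(strain_stops) - {outgroup_stop}
    longer = sum(1 for pos in derived if pos > outgroup_stop)
    shorter = sum(1 for pos in derived if pos < outgroup_stop)
    return {
        # Any position beyond the outgroup stop implies at least one addition.
        "min_additions": 1 if longer else 0,
        # Any position before the outgroup stop implies at least one subtraction.
        "min_subtractions": 1 if shorter else 0,
        # Further derived positions on the same side could have arisen by
        # either kind of event, so their direction is left unresolved.
        "ambiguous_events": max(longer - 1, 0) + max(shorter - 1, 0),
    }


# One derived position, longer than the outgroup (as in panel A).
print(summarize_stop_events(100, [100, 105]))
# Two derived positions, both longer (one configuration consistent with panel B).
print(summarize_stop_events(100, [100, 105, 112]))
# One longer and one shorter derived position (consistent with panel C).
print(summarize_stop_events(100, [95, 100, 105]))
```

The sketch deliberately ignores the strain phylogeny, in line with the caption's point that gene trees may not match the strain tree.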
Characteristics of All 55 Genes that Have Undergone Addition via Stop Codon Loss in Either S. cerevisiae or S. paradoxus, Including One Putative Gene Fusion
| Systematic Name | Standard Name | Outgroup Allele Length (aa) | “Added” Sequence (aa) | “New” Sequence (aa) | Type of Addition | Number of Nonsingleton Alleles | Ribosomal Profile Evidence | Total Evidence | Gene Notes |
|---|---|---|---|---|---|---|---|---|---|
| YAL005C | SSA1 | 643 | 18 | 49 | Frameshift | 2 | Strong | Strong | ATPase activity | |
| YBR014C | GRX7 | 204 | 15 | 22 | Frameshift | 3 | Strong | Strong | Glutathione-disulfide reductase activity, involved in oxidative stress response | |
| YBR046C | ZTA1 | 335 | 26 | 30 | Frameshift | 2 | Strong | Strong | NADPH-dependent quinone reductase | |
| YBR194W | AIM4 | 124 | 4 | 5 | Frameshift | 2 | Moderate | Strong | Unknown functional activity; protein proposed to be associated with the nuclear pore complex | |
| YBR264C | YPT10 | 200 | 10 | 15 | Frameshift | 2 | Strong | Strong | Rab family, GTPase activity | |
| YCR076C | FUB1 | 243 | 11 | 22 | Frameshift | 2 | Strong | Strong | Protein of unknown function; interacts with subunits of the 20 S proteasome | |
| YDL027C | NA | 421 | 3 | 4 | Frameshift | 2 | Moderate | Strong | Protein of unknown function; nontagged protein is detected in highly purified mitochondria | |
| YDL056W | MBP1 | 834 | 3 | 3 | Point | 3 | Strong | Strong | DNA binding transcription factor activity; involved in regulation of transcription | |
| YDL175C | AIR2 | 343 | 3 | 15 | Frameshift | 2 | Strong | Strong | RNA-binding activity | |
| YDR062W | LCB2 | 562 | 7 | 7 | Frameshift | 2 | Strong | Strong | Serine C-palmitoyltransferase activity | |
| YFR037C | RSC8 | 558 | 2 | 21 | Frameshift | 2 | Strong | Strong | Component of the RSC chromatin remodeling complex involved in DNA binding activity | |
| YGL058W | RAD6 | 171 | 26 | 29 | Frameshift | 2 | Strong | Strong | Ubiquitin-protein ligase activity | |
| YGR004W | PEX31 | 463 | 1 | 1 | Point | 2 | Moderate | Strong | Peroxisomal integral membrane protein; involved in negative regulation of peroxisome size | |
| YGR059W | SPR3 | 512 | 3 | 9 | Frameshift | 2 | None | Strong | Structural septin protein activity; involved in sporulation | |
| YGR136W | LSB1 | 242 | 19 | 48 | Frameshift | 2 | Strong | Strong | Negative regulator of actin nucleation-promoting factor activity | |
| YGR152C | RSR1 | 273 | 1 | 56 | Frameshift | 2 | Strong | Strong | GTPase activity | |
| YGR188C | BUB1 | 1,022 | 3 | 5 | Frameshift | 2 | Low | Strong | Protein kinase activity; involved in cell-cycle checkpoint mechanism | |
| YHR034C | PIH1 | 341 | 1 | 1 | Frameshift | 3 | Strong | Strong | Unknown functional activity; involved in RNA processing | |
| YHR043C | DOG2 | 247 | 5 | 7 | Frameshift | 3 | Strong | Strong | 2-deoxyglucose-6-phosphatase activity | |
| YHR200W | RPN10 | 268 | 18 | 24 | Frameshift | 2 | Strong | Strong | Polyubiquitin binding activity | |
| YHR206W | SKN7 | 625 | 10 | 35 | Frameshift | 2 | Strong | Strong | DNA binding transcription factor activity; oxidative stress response and osmoregulation | |
| YIL110W | HPM1 | 378 | 1 | 4 | Frameshift | 2 | Strong | Strong | S-adenosylmethionine-dependent methyltransferase activity; ribosomal protein modification | |
| YIL138C | TPM2 | 162 | 3 | 45 | Frameshift | 2 | Strong | Strong | Actin binding activity; involved in cell growth | |
| YJL035C | TAD2 | 251 | 1 | 1 | Point | 2 | Moderate | Strong | tRNA-specific adenosine deaminase activity; involved in tRNA modification | |
| YJL186W | MNN5 | 587 | 21 | 22 | Frameshift | 2 | Strong | Strong | Alpha-1,2-mannosyltransferase activity; involved in cell wall mannan biosynthesis | |
| YJR075W | HOC1 | 397 | 24 | 63 | Frameshift | 2 | Strong | Strong | Alpha-1,6-mannosyltransferase activity; involved in cell wall mannan biosynthesis | |
| YKL040C | NFU1 | 257 | 34 | 48 | Frameshift | 2 | Strong | Strong | Involved in iron metabolism in mitochondria | |
| YKL212W | SAC1 | 624 | 5 | 7 | Frameshift | 2 | Strong | Strong | Phosphatidylinositol phosphate phosphatase activity | |
| YKR006C | MRPL13 | 265 | 3 | 10 | Frameshift | 2 | Strong | Strong | Mitochondrial ribosomal protein of the large subunit | |
| YKR069W | MET1 | 591 | 2 | 2 | Point | 2 | Low | Strong | Uroporphyrinogen III transmethylase activity; sulfate assimilation and methionine biosynthesis | |
| YLR095C | IOC2 | 816 | 11 | 38 | Frameshift | 2 | Strong | Strong | Nucleosome-stimulated ATPase activity; involved in chromatin remodeling | |
| YLR142W | PUT1 | 481 | 5 | 8 | Frameshift | 2 | None | Strong | Proline oxidase activity; involved in utilization of proline as sole nitrogen source | |
| YLR313C | SPH1 | 650 | 13 | 13 | Point | 2 | Low | Strong | Protein involved in shmoo formation and bipolar bud site selection | |
| YLR318W | EST2 | 877 | 2 | 5 | Frameshift | 4 | None | Strong | Telomerase catalytic activity | |
| YLR357W | RSC2 | 890 | 5 | 5 | Point | 3 | Strong | Strong | ATP-dependent chromatin remodeling activity; part of the RSC chromatin remodeling complex | |
| YLR359W | ADE13 | 483 | 35 | 35 | Frameshift | 3 | Strong | Strong | Adenylosuccinate lyase activity; involved in the nucleotide biosynthetic pathway | |
| YLR407W | NA | 229 | 1 | 4 | Frameshift | 2 | Strong | Strong | Putative protein of unknown function; null mutant displays elongated buds | |
| YML047C | PRM6 | 353 | 3 | 6 | Frameshift | 2 | None | Strong | Potassium ion transmembrane transporter activity; Pheromone-regulated protein | |
| YMR011W | HXT2 | 542 | 6 | 15 | Frameshift | 2 | Strong | Strong | High-affinity glucose transmembrane transporter activity | |
| YMR240C | CUS1 | 437 | 28 | 41 | Frameshift | 2 | Strong | Strong | Unknown function; required for assembly of U2 snRNP into the spliceosome | |
| YNL234W | NA | 426 | 70 | 86 | Frameshift | 2 | Low | Moderate | Protein of unknown function; may be involved in glucose signaling or metabolism | |
| YNL251C | NRD1 | 576 | 15 | 21 | Frameshift | 3 | Strong | Strong | RNA-binding protein activity; involved in the Nrd1 complex | |
| YNL294C | RIM21 | 534 | 25 | 25 | Frameshift | 2 | Low | Strong | pH sensor; involved in cell wall biosynthesis and alkaline pH response | |
| YOL058W | ARG1 | 420 | 714 | 722 | Deletion | 3 | Strong | Strong | Arginosuccinate synthetase activity; involved in the arginine biosynthesis pathway | |
| YOL086C | ADH1 | 349 | 7 | 18 | Frameshift | 2 | Strong | Strong | Alcohol dehydrogenase activity; involved with the reduction of acetaldehyde to ethanol | |
| YOL100W | PKH2 | 1,082 | 1 | 10 | Frameshift | 3 | Moderate | Strong | Serine/threonine protein kinase; involved in sphingolipid-mediated signaling pathway | |
| YOR141C | ARP8 | 882 | 14 | 46 | Frameshift | 2 | Strong | Strong | mRNA binding activity; involved in chromatin remodeling | |
| YOR260W | GCD1 | 579 | 11 | 52 | Frameshift | 2 | Strong | Strong | Translation initiation factor activity; Gamma subunit of the translation initiation factor eIF2B | |
| YOR387C | NA | 207 | 12 | 12 | Point | 3 | None | Moderate | Unknown function; regulated by Aft1p transcription factor; highly inducible in zinc-depleted conditions | |
| YPL183C | RTT10 | 1,014 | 4 | 7 | Frameshift | 2 | Strong | Strong | Cytoplasmic protein with a role in regulation of Ty1 transposition | |
| YPL204W | HRR25 | 494 | 51 | 66 | Frameshift | 2 | Strong | Strong | Protein kinase activity; regulation of vesicular trafficking and DNA repair | |
| YPL248C | GAL4 | 882 | 4 | 4 | Point | 2 | None | Strong | DNA-binding transcription factor; involved in GAL genes activation in response to galactose | |
| YPR068C | HOS1 | 471 | 5 | 5 | Point | 3 | Low | Strong | Histone deacetylase activity | |
| YPR113W | PIS1 | 221 | 59 | 61 | Frameshift | 2 | Strong | Strong | Phosphatidylinositol synthase activity | |
| YPR192W | AQY1 | 306 | 22 | 34 | Frameshift | 4 | None | Strong | Spore-specific water channel that mediates the transport of water across cell membranes |
Note.—Evidence of translation and protein function is summarized; see main text for details.
Figure 2. Distribution of addition alleles across strains in S. paradoxus (A) and S. cerevisiae (B). Unrooted phylogenetic trees were taken from Liti et al. (2009). As is well known, S. paradoxus shows more population structure (appearing here as dark blue monophyletic blocks or pale blue “coherent” near-monophyletic blocks) than S. cerevisiae. The strong phylogenetic pattern further demonstrates that these additions are not mere sequencing errors. A number of additions have risen to high frequency.
Figure 3. Transcription patterns do not support a gene fusion producing a single transcript. mRNA read counts from the Y12 strain of S. cerevisiae (Skelly et al. 2013) were generated by sequence alignment using BFAST version 0.7a (Homer et al. 2009a, 2009b) and SAMtools version 0.1.19 (Li et al. 2009). ORF regions for YOL058W and YOL057W are shown by black bars. UTR regions (yellow bars) are based on the annotation of David et al. (2006) for the reference S. cerevisiae strain. The putative 288 bp deletion, which is expected to cause a fusion between two S. cerevisiae genes, is indicated by the red bar, whereas a smaller 33 bp deletion is indicated by the purple bar.
Figure 4. The frequencies of C-terminal extension lengths per gene within S. cerevisiae and S. paradoxus. See figure 1 for the distinction between added (A) and new (B) amino acids. The “readthrough” histogram (C) is based on the number of amino acids that would be added to a gene if the stop codon were removed and translation were to read through to the next in-frame stop codon. Genes that did not reach a stop codon prior to the end of their UTR boundary as predicted by Nagalakshmi et al. (2008) were excluded. (D) The geometric mean and 95% confidence interval for added, new, and readthrough amino acid distributions. Data were approximately normal or truncated normal following a log transformation, so this transformation was used for statistics, with figure 4D generated through a back transformation. Added sequences are shorter than readthrough controls (P = 0.035; two-tailed t-test on transformed data). The still greater length of new amino acid sequences results from a statistical artifact; for many frameshifts that created smaller numbers of new amino acids, an early stop codon, earlier than the ancestral stop codon, would have prevented inclusion in our data set.
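The readthrough control and the back-transformed statistics described in this caption can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, the 3' UTR is assumed to be supplied as a DNA string starting immediately after the original stop codon, and the critical value `t_crit` is left to the caller (1.96 is a large-sample normal approximation, not the exact t quantile).

```python
import math
from statistics import mean, stdev
from typing import Optional, Sequence, Tuple

# Stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def readthrough_length(utr3: str) -> Optional[int]:
    """Amino acids added if translation continued to the next in-frame stop.

    `utr3` is assumed to begin right after the original stop codon.
    Returns None when no in-frame stop codon occurs within the sequence,
    mirroring the exclusion of genes that run past their UTR boundary.
    """
    seq = utr3.upper()
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i // 3
    return None


def geometric_mean_ci(values: Sequence[float], t_crit: float = 1.96) -> Tuple[float, float, float]:
    """Geometric mean and confidence interval via a log transformation.

    Statistics are computed on log-transformed values and then
    back-transformed with exp(). All values must be positive.
    """
    logs = [math.log(v) for v in values]
    centre = mean(logs)
    std_err = stdev(logs) / math.sqrt(len(logs))
    return (
        math.exp(centre),
        math.exp(centre - t_crit * std_err),
        math.exp(centre + t_crit * std_err),
    )


# Two sense codons (GCT, GCA) precede the first in-frame stop (TAA).
print(readthrough_length("GCTGCATAAGG"))
# Geometric mean of 2 and 8 is 4; the interval is asymmetric around it.
print(geometric_mean_ci([2, 8]))
```

Because the interval is built on the log scale, the back-transformed bounds are asymmetric around the geometric mean, which is the expected behaviour for right-skewed length distributions.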
Figure 5. Three proteins with additions that may impact protein structure. (A) Alcohol dehydrogenase I from S. cerevisiae S288C, PDB ID 2HCY, chain A, residues 1–347; (B) Actin-related protein 8 from S. cerevisiae S288C, PDB ID 4AM6, chain A, residues 248–881; (C) Tropomyosin 2, homolog from O. cuniculus shown, PDB ID 2W49, chains A and B, residues 39–200. The ribbon diagram in each panel shows the portion of the protein altered by frameshift in orange, with the length of the altered region as well as the increase in sequence length indicated. Below each structure the C-terminal sequences of the reference strain and the longest version are shown, preceded by five residues of the unaltered region of sequence, shown in italics. Sequences are annotated with actual or predicted locations of α-helix (red) and β-strand (blue) secondary structures. These locations are inferred from the S. cerevisiae S288C or homologous structure in the case of the reference strain, or predicted by Jpred 3 in the case of the longest version.