Literature DB >> 26109356

Gene Model Annotations for Drosophila melanogaster: The Rule-Benders.

Madeline A Crosby¹, L Sian Gramates², Gilberto Dos Santos², Beverley B Matthews², Susan E St Pierre², Pinglei Zhou², Andrew J Schroeder², Kathleen Falls², David B Emmert², Susan M Russo², William M Gelbart².

Abstract

In the context of the FlyBase annotated gene models in Drosophila melanogaster, we describe the many exceptional cases we have curated from the literature or identified in the course of FlyBase analysis. These range from atypical but common examples such as dicistronic and polycistronic transcripts, noncanonical splices, trans-spliced transcripts, noncanonical translation starts, and stop-codon readthroughs, to single exceptional cases such as ribosomal frameshifting and HAC1-type intron processing. In FlyBase, exceptional genes and transcripts are flagged with Sequence Ontology terms and/or standardized comments. Because some of the rule-benders create problems for handlers of high-throughput data, we discuss plans for flagging these cases in bulk data downloads.

Entities: Chemical Disease Gene Species

Keywords: bicistronic; multiphasic exon; non-AUG translation start; shared promoter; stop-codon suppression

Mesh：

Substances：

Year: 2015 PMID： 26109356 PMCID： PMC4528330 DOI： 10.1534/g3.115.018937

Source DB: PubMed Journal: G3 (Bethesda) ISSN： 2160-1836 Impact factor: 3.154

The D. melanogaster genomic sequence assembly is of exceptionally high quality (Celniker ; Hoskins , 2015) and is one of the few for which gene models have been manually annotated and assessed for all protein-coding and lncRNA genes (Matthews , which is the companion to this article). This has allowed FlyBase (dos Santos ) to more easily identify and handle the rule-benders: gene models that incorporate exceptional or atypical transcription, splicing, or translation events. We summarize our current catalog of such exceptional gene models and events, including polycistronic transcripts, noncanonical splices, trans-spliced transcripts, noncanonical translation starts, stop-codon readthroughs, multiphasic coding exons, and ribosomal frameshifting. It is hoped that this extensively vetted compilation will support opportunities for further investigations into some of these rule-bending phenomena, including their biological impact, mechanistic bases, regulation, evolutionary development, and phylogenetic distribution.

Materials and Methods

The FlyBase gene model set consists entirely of manually annotated transcripts, with the exception of some classes of small noncoding RNAs. Gene model annotation guidelines and datasets informing the gene annotation process are described in Matthews . All data and gene models are available at FlyBase (http://flybase.org).

Results and Discussion

Exceptional cases are flagged at the gene and transcript levels

Atypical gene models, those that do not follow the canonical rules, can create confusion among biologists accustomed to better-behaved genes and can cause unexpected errors in bulk data assessments. Some categories, notably trans-spliced transcripts, are frequently assumed to be errors. Overlapping genes complicate the design of sequence-based reagents (Hu ) and global assessments of levels of gene expression. To flag these exceptions for users, FlyBase identifies known cases of rule-bending gene models at several levels. Appropriate Sequence Ontology (SO) terms (Eilbeck ) are associated with the gene records (Table 1). These may be found in the gene reports in the “Sequence Ontology: Class of Gene” subsection under the “Gene Model and Products” section. This is a controlled field with links from the terms to the FlyBase Vocabularies tool. A comment in the “Comments on Gene Models” field includes the relevant SO term and often additional information regarding the nature and attribution of the exception. Standardized comments have been added to individual rule-bending transcripts when appropriate (Supporting Information, Table S1); FlyBase is in the process of adding similar comments to the transcript headers in our bulk data files (Table 2). Flags of the type “translation exception” are particularly important because they break the rules by which a predicted protein is derived from an annotated transcript. Exceptional cases are also flagged in GenBank RefSeq transcript and protein entries (Table S2). Each of the following sections concludes with a description of the SO, transcript and GenBank flags used for that type of exceptional gene model.

Table 1

Gene-associated Sequence Ontology terms

SO Term	SO ID Number
gene_with_dicistronic_mRNA	SO:0000722
gene_with_polycistronic_transcript	SO:0000690
gene_with_trans_spliced_transcript	SO:0000459
gene_with_unconventional_translation_start_codon	SO:0001739
gene_with_translation_start_codon_CUG	SO:0001740
gene_with_stop_codon_redefined_as_selenocysteine	SO:0000710
gene_with_stop_codon_read_through	SO:0000697
gene_with_transcript_with_translational_frameshift	SO:0000712

Table 2

Proposed transcript-associated flags to be included in FASTA files

Proposed Flag	Type
dicistronic_mRNA	Transcript exception
polycistronic_transcript	Transcript exception
non_canonical_splice_site	Transcript exception
endonuclease_spliced_intron	Transcript exception
trans_spliced_transcript	Transcript exception
non-canonical_start_codon	Translation exception
stop_codon_redefined_as_selenocysteine	Translation exception
stop_codon_read_through	Translation exception
transcript_with_translational_frameshift	Translation exception
mitochondrial_genetic_code	Translation exception
mitochondrial_incomplete_stop_codon	Translation exception
start_codon_not_determined	Translation exception
mutation in strain	Sequence alteration
genomic sequence error or gap	Sequence alteration

Exceptional transcript structure (1): polycistronic (primarily dicistronic) transcripts are not uncommon

One of the surprising results of the first manual gene model annotation sweep of D. melanogaster performed by FlyBase (Misra ) was the number of new dicistronic transcripts. At that time nine dicistronic loci had been previously described (Pauli ; Schulz ; Andrews ; Brogna and Ashburner 1997; Ibnsouda ; Niimi ; Gray and Nicholls 2000; Liu ; Walker ). Misra expanded the number to 31 dicistronic gene pairs that were confidently identified and another 17 that were tentatively identified. Dicistronic loci are no longer a surprise: there are 159 dicistronic gene pairs in annotation release 6.04. In addition, there are nine sets with transcripts encoding more than two genes (seven tricistronic and two tetracistronic) and one complex case involving overlapping dicistronics (of , , and ). In total, 350 genes are annotated as sharing one or more transcripts with a neighboring protein-coding gene. Thus, 2.5% of protein-coding gene models include at least one polycistronic transcript. Note that FlyBase annotates each member of a dicistronic or polycistronic locus as a separate gene; this allows unambiguous association of gene-specific information, such as protein domains, molecular function, and mutant phenotypes. A listing of all polycistronic genes with additional information is found in File S1. [The polycistronic trans-splicing precursors are excluded from these totals and from this discussion; see description of trans-splicing below.] In 2002, annotation of dicistronic loci was dependent on isolation of cDNA clones that spanned both genes. Now, RNA-Seq coverage data (Graveley ; Brown ) and whole genome assessments for transcription start sites (Hoskins ; Batut ) provide independent evidence of transcript extent and structure. Originally, some polycistronic genes were missed because it was difficult to determine which of many small open reading frames (ORFs) might correspond to functional protein-coding genes. Since then, the availability of genomic sequence information for multiple Drosophila species has allowed identification of some conserved ORFs as small as 20–30 amino acids (Lin , 2011; Hayden and Bosco 2008). Polycistronic loci often encode alternative transcripts that are monocistronic. Of the 159 currently annotated dicistronic loci, there are only 45 for which polypeptides from both genes of the pair appear to be produced exclusively from dicistronic transcripts. There are also 45 loci for the opposite case, with support for alternative monocistronic transcripts for both genes. In some of these cases, the dicistronic transcript appears to be produced at a relatively low frequency or only in specific circumstances. For the remaining 69 loci, one gene has both monocistronic and dicistronic transcripts and the other is encoded only by dicistronic transcripts (example in Figure 1). The nine loci that include tricistronic and tetracistronic transcripts may encode a mix of polycistronic types and several also include an alternative monocistronic transcript (example in Figure S1). Although we cannot be certain that we have identified all alternative monocistronic transcripts, currently 181 gene models include only polycistronic transcript isoforms. Even if this is a significant overestimate, it is likely that more than 1% of protein-coding genes in D. melanogaster are encoded exclusively by polycistronic transcripts.

Figure 1

A dicistronic transcript isoform for and is produced from a stage- and tissue-specific promoter. A GBrowse view showing (top to bottom): the gene extents and the gene models; cDNAs and ESTs; transcription start site(s); unstranded RNA-Seq coverage data corresponding to a developmental series (early embryos, top, to adults, bottom); and stranded RNA-Seq coverage data (plus strand top, minus strand bottom) corresponding to testis (red), male accessory gland (magenta), ovary from virgin females (orange), and ovaries from mated females (tan). More information on data presented in GBrowse may be found at http://flybase.org/wiki/FlyBase:GBrowse_Tracks#General. Genes that encode small polypeptides are over-represented among polycistronic genes. There are 10 genes for which all annotated polypeptides are less than 25 amino acids; all are polycistronic, although one is also annotated with an alternative monocistronic transcript. Of the 65 genes for which all annotated polypeptides are between 25 and 45 amino acids, 15 (23%) are polycistronic. A number of genes in this category correspond to small conserved ORFs found in the 5′ or 3′ untranslated regions (UTRs) of longer coding genes (Hayden and Bosco 2008); these are defined as separate genes in FlyBase. There are undoubtedly additional translated ORFs less than 50 amino acids that will be identified by new techniques, such as ribosomal profiling combined with proteomic validation (see Gawron for discussion of emerging approaches), so the list of polycistronic genes encoding small polypeptides is likely to grow. Polycistronic loci are not a homogenous group (File S1). The distance between the upstream stop codon and the downstream start codon ranges from negative (overlapping in different ORFs) to over a kilobase. For loci that produce only polycistronic transcripts (no alternative monocistronics), the range between upstream stop and downstream start is −34 nt to 319 nt. Alternative monocistronic transcripts for the 3′ coding region are usually transcribed from an alternative downstream (or external) promoter; monocistronic transcripts for the 5′ coding region usually terminate at an upstream polyadenylation site. However, a number of alternative monocistronic transcripts are the result of alternative splices that disrupt one of the two coding regions. Approximately 18% of polycistronic loci include similar genes; many of these may have been created by tandem duplication. Among the other 82%, a few are known to be functionally related (Liu ; Phillips ), including the highly conserved example of the large and small subunits of Molybdopterin synthase 2, and (Hayden and Bosco 2008). Ben-Shahar present preliminary evidence that the and genes are transcribed as a dicistronic pre-mRNA; they postulate that this pre-mRNA undergoes a unique cleavage event to produce two monocistronic mRNAs. This proposed mechanism differs from the polycistronic loci we define, for which polycistronic processed mRNAs are supported, and for which each cistron’s product is postulated to be translated from a single mRNA isoform. The gene models we describe as polycistronic necessitate internal initiation of translation for the downstream gene by use of an internal ribosome entry site (IRES) (reviewed in Hellen and Sarnow 2001), resumption of ribosomal scanning (Wall ), or some other mechanism. Use of an IRES may be common, because although some dicistronic pairs appear to meet the requirements of a ribosomal scanning mechanism, most do not [see the discussion in Misra and additional information tabulated in File S1].

FlyBase and GenBank flags:

Genes with dicistronic or polycistronic transcripts are identified by specific SO terms (Table 1). At the transcript level, comments are included in FlyBase transcript reports (Table S1), and a “dicistronic gene” exception is included in the GenBank RefSeq transcript entry (Table S2). The proposed bulk data flag is of the “transcript exception” type (Table 2).

Exceptional transcript structure (2): adjacent genes may share exons, noncoding and coding

There are many cases in which genes on the same strand overlap each other. The most extreme cases are the polycistronic gene models, as described above. Two categories we have not viewed as sufficiently unruly that they merit special treatment or comments: a gene that is nested in the intron of a gene on the same strand or a gene with a 3′ UTR that overlaps the 5′ UTR of the downstream gene. Intermediate cases in which two genes share an exon, usually including a shared splice site, are described below. See File S2 for a complete listing of the genes discussed in this section. The first example in the intermediate category is that of shared exons, but not shared coding DNA sequence (162 gene models flagged). Most common are cases in which two genes share a promoter and 5′ noncoding exons (130 genes); this includes 13 pairs for which one gene is coding and the other is noncoding. In some cases, the shared promoter is used by all transcripts of both genes, for example, and . In other cases, for example, and , one or both gene models include transcripts derived from unshared promoters. Genes may also share 3′ UTRs (14 genes); an extreme example is the set comprising , , , and (Cheng and Nash 2007). There are eight sets (18 genes) described as complex or atypical cases, usually in which both 5′ and 3′ UTRs overlap. The second example in the intermediate category consists of genes that share exons including a short extent of the coding sequence (in the same open reading frame), usually at the amino terminus. Historically, FlyBase categorized any transcripts that shared any amount of coding sequence as belonging to one gene. This policy has been changed for a small number of cases in which the genes in question are clearly functionally and evolutionarily distinct. Of the cases currently annotated (10 pairs and one triplet), most of the genes share a promoter, the first exon, and a translation start for at least one pair of transcript isoforms; some of these genes also have alternative transcript isoforms with different promoters that do not share coding sequences with the second gene. A well-studied case of such coding sequence (CDS) overlap is that of and (Krauss ), which may have resulted from the transposition of into an intron of . The two genes share a promoter, translation start, and 80 N-terminal residues that are similar to the N-terminus of eIF-2γ-like proteins in other species. The two genes, however, encode polypeptides with very different functions: histone methylation and translation initiation. This gene fusion encodes the only D. melanogaster orthologs for two different highly conserved eukaryotic genes. There is no transcript isoform for either gene that includes the characteristic domains of both and . The gene fusion appears to be insect-specific, with instances of re-fission in aphids (Krauss ).

FlyBase flags:

Several comments are used to flag FlyBase gene models with shared noncoding exons: “Shares 5′ UTR,” “Shares 3′ UTR,” “Shares 5′ exon(s),” and “Complex/atypical overlap,” followed by additional explanatory information. Genes that contain overlapping coding extents are flagged with gene model comments that begin “Genes with CDS overlap,” followed by additional specific information. The affected transcripts are not flagged; an exceptional translation flag is not necessary.

Exceptional transcript structure (3): overlapping genes or alternative transcripts may share multiphasic or bidirectional regions of coding sequence

Coding regions for which more than one overlapping ORF on the same strand appears to be used are described as “multiphasic.” There are 269 gene models flagged with a comment indicating that the current annotation includes transcripts that share a multiphasic region; a complete listing is available in File S3. These annotations are based on data supporting different transcript isoforms; in most cases, there are no data addressing whether both protein isoforms are biologically relevant. The majority of cases correspond to transcripts that include a variably spliced intron that results in a frameshift in the next exon, a short multiphasic extent, and an alternative stop codon. If the multiphasic extent is less than 40 nucleotides (N = 133), we flag the gene model with the comment “Alternative translation stop created by use of multiphasic reading frames within coding region.” Because the nucleotide extent includes the stop codon, this corresponds to less than 13 amino acids. In some instances, the frameshift produces a significantly truncated protein, for example, and ; in others, the resulting change in carboxy sequence appears to be minor, for example, . Transcripts with a trunctated CDS are flagged with an additional comment that concludes “results in premature stop codon and/or downstream start; may or may not produce functional polypeptide” (see Matthews ). The multiphasic extent may be longer than 40 nucleotides (N = 118) or involve the initial coding exon (N = 18); and examples include and . These are flagged with the comments “Multiphase exon postulated: exon reading frame differs in alternative transcripts” or “Multiphase exon postulated: reading frame of first coding exon differs in alternative transcripts.” For 83 genes, the annotated multiphasic overlap exceeds 63 nucleotides (>20 amino acids); the longest is the 704-nucleotide multiphasic overlap found for transcripts of the gene (Bainton ). For several genes, production and function of both alternative protein isoforms derived from a multiphasic exon have been demonstrated (including , Pankotai ; , Bainton ; and , a unique case described below). Multiphasic regions may involve two genes. There are 36 genes (18 pairs) flagged with the comment “Multiphase exon postulated: this gene shares a region of coding sequence with an overlapping gene, but different reading frames are utilized in the overlapping coding region.” Of these, 26 are encoded on polycistronic transcripts. Most overlap by less than 40 nucleotides; however, for five pairs the multiphasic overlap exceeds 63 nucleotides. The most dramatic example is the and pair, which overlaps for 553 nucleotides across two exons and for which there is evidence that both protein products are produced (using an in vitro translation assay) (Madigan ). Currently, there is a single pair of genes annotated with overlapping coding extents on opposite strands: and . We describe such a shared region as “bidirectional” and flag these genes with the comment “Bidirectional region of coding sequence postulated: a portion of the CDS of this gene overlaps a portion of the CDS of a gene on opposite strand.” Criteria for annotating bidirectional genes are discussed in Matthews . Genes with multiphasic exons or bidirectional regions are flagged with the comments described above. The affected transcripts are not flagged; an exceptional translation flag is not necessary.

Exceptional splicing (1): noncanonical splices are rare

For 99% of the ∼60,000 annotated introns, the primary canonical splice donor-acceptor pair GT-AG is used; for most of the remaining 1%, the secondary canonical splice donor-acceptor pair GC-AG is used (563 GC-AG out of 60,223 total in release 6.03). Thus, the number of introns for which other splice donor-acceptor pairs are used is very low: only 79 are annotated in current gene models (a complete listing is available in File S4). Considering the high levels of cDNA and RNA-Seq junction data that have been incorporated into the current gene model annotations, it is likely that most typically used noncanonical splices have been identified. Introns supported only by very low-frequency data are not usually included in FlyBase gene models (Matthews ); there are some additional noncanonical splices in this category, primarily representing potential alternatively spliced introns within 5′ UTRs. Introns processed by the U12 spliceosome have been well characterized (Schneider ; Alioto 2007; Lin ); remarkably, many are conserved between flies and humans (Lin ). Most of the AT-AC introns in the current annotation set are of the U12 type (Table 3) and had been previously identified.

Table 3

Introns with noncanonical splice sites and/or U12-type 5′ consensus sequence

Splice Donor-Acceptor Pair	Number in Release 6.04	Number with RNA-Seq Junction Support	Number with Similar Alternative Splice	Within Coding	Within 5′ UTR	Within lncRNA
AT-AC (U12)	9	9	1	9	0	0
AT-AC (U2)	4	4	2	4	0	0
GT-TG	23	19	22	13	9	1
GT-GG	6	5	5	3	2	1
GT-CG	8	8	8	6	2	0
GT-AT	14	11	14	11	3	0
GT-AA	3	3	3	2	1	0
GA-AG	12	12	5	8	3	1
GG-AG	0	—	—	—	—	—
GT-AC	0	—	—	—	—	—
Total	79	71	60	56	20	3
GT-AG (U12)	10	9		9	1	0
GC-AG (U12)	1	1		1	0	0

With the exception of AT-AC (and the mechanistically different HAC1-type intron splice sites, see below), the noncanonical splice sites currently annotated in D. melanogaster differ from GT-AG by only one base, with the G of the donor site being invariant. Of the eight possible pairs (excluding GT-AG and GC-AG) that fit these criteria, six are observed (Table 3). The distribution among genes is nonrandom: 10 genes are annotated with more than one intron defined by a noncanonical splice pair, despite the very low frequency of such introns (File S4). For the majority of cases there is a similar alternative canonical splice (Table 3); this has also been observed in humans (Szafranski ; Parada ). Although in some cases this may be a mechanism to increase protein variation (example in Figure 2A), similar alternative splices are also observed for splices that occur in 5′ UTRs. The key exceptions to this pattern of similar alternative splices are U12 spliceosome introns and GA-AG introns.

Figure 2

Noncanonical splices supported by RNA-Seq junction data. (A) Of three alternative splice acceptors for intron 6 of the gene, two are noncanonical TGs, including the splice acceptor used at the highest frequency (first highlighted junction). A GBrowse view showing (top to bottom): nucleotide sequence; region of the gene model showing one intron/exon boundary; EST data; RNA-Seq junction data; and unstranded RNA-Seq coverage data corresponding to a developmental series (early embryos, top, to adults, bottom). More information on data presented in GBrowse may be found at http://flybase.org/wiki/FlyBase:GBrowse_Tracks#General. (B) Report for an RNA-Seq junction that corresponds to a noncanonical splice but is aligned to incorrect noncanonical sites, one of several cases that were slightly misaligned. Noncanonical splices frequently foil gene model prediction programs and complicate cDNA alignments and algorithms for deriving RNA-Seq junction data. Currently, 71 of the 79 annotated noncanonical splices have a supporting RNA-Seq junction (Table 3). For most (57 of 71), the alignment is impressively accurate; however, a number of the junctions are slightly misaligned and thus do not appear to map precisely to the annotated intron. A particular problem for unstranded RNA-Seq data are GT-AT splices, which are usually called as AT-AC splices on the opposite strand (eight of 10 cases). For exceptional junctions that correspond to annotated introns, the noncanonical acceptor/donor and any alignment inconsistencies are identified in explanatory comments in the FlyBase RNA-Seq junction reports (Figure 2B). A more significant problem for identification of noncanonical splices by RNA-Seq junction analysis is false positives; including AT-AC calls, there are more than 500 exceptional junctions from the combined Baylor (Daines ) and modENCODE datasets (Graveley ). Some of these can be culled by eliminating low-frequency junctions and those with weak confidence scores; however, exceptional junctions within the middle ranges of frequency or confidence scores need to be manually assessed. Common problems (shared with nonexceptional junctions) include spurious calls within coding sequences comprising repeated motifs (for example, and ), mismatch of acceptor and donor calls across related tandem genes (for example, the Jon99C cluster and the trypsin cluster at 47F), atypical calls within very highly expressed genes (for example, and ), and RNA-Seq data mapping to unmasked repeat elements. A number of exceptional junction calls are one nucleotide off from a well-supported canonical splice; these are assumed to be artifactual. Currently, there are approximately 40 transcripts in FlyBase that are annotated with noncanonical splices as a matter of convenience; most of these splices are unlikely to occur in vivo and are not included in the description of supported noncanonical splices above. They are used to allow representation of an intact gene model when the gene extent is interrupted by a transposable element insertion or a genomic sequence gap or is within a heterochromatic region that may vary from strain to strain. These cases are flagged with specific explanatory comments (Table S2). Genes with transcripts annotated with noncanonical splices are not flagged; there is no appropriate gene-level SO term. Comments at the transcript level are included in FlyBase transcript reports (Table S1) and as a “nonconsensus splice site” exception in the GenBank RefSeq transcript and protein entries (Table S2). The proposed bulk data flag is of the “transcript exception” type (Table 2).

Exceptional splicing (2): a single HAC1-type intron splice is highly conserved

Several rule-defying exceptional gene models are, in fact, examples of uncommon, yet very highly conserved, phenomena. One notable example is that of (Ryoo ), a gene with an alternatively spliced isoform that has a very short intron (23 nt) flanked by noncanonical splice junctions (CA-TG). This alternative isoform is translated into a protein with the carboxy terminus in a different reading frame with respect to the unspliced isoform. Although this looks like an especially egregious annotation error, it is actually an example of Ire1-mediated unconventional splicing, a process conserved from yeast to mammals as part of the unfolded protein response (UPR). In this response to ER degradation stress, the protein Ire1 acts as a sensor and mediates a signaling cascade; it also acts as an endonuclease with a single target: the HAC1 ortholog . The unconventionally spliced transcript is translated into a bzip transcription factor that upregulates genes responsive to ER stress. Like most organisms, D. melanogaster has exactly one example of the HAC1-type intron splice mechanism (Sidrauski and Walter 1997; Plongthongkum ; Hooks and Griffiths-Jones 2011). The gene report includes an explanatory comment in the gene model comment section. The report for the transcript that results from Ire1-mediated splicing includes a similar explanatory comment (Table S1) plus the flag “endonuclease_spliced_intron” (Table 2). A “nonconsensus splice site” exception appears in the GenBank RefSeq transcript and protein entries (Table S2). The proposed bulk data flag is of the “transcript exception” type (Table 2).

Exceptional splicing (3): trans-splicing is well supported for two genes

At least two genes in D. melanogaster undergo trans-splicing, a process by which a mature mRNA is created by a bimolecular splice between two independently transcribed pre-mRNAs. In both known cases, the gene encodes multiple DNA-binding proteins with a common amino BTB/POZ domain and variable carboxy zinc-finger domains. The initial and more dramatic example is (Labrador ; Dorn ), which encodes more than 30 protein isoforms, at least 18 of which appear to be trans-spliced. The second example is , a gene that encodes at least 20 protein isoforms, one of which shows evidence of being trans-spliced (Horiuchi ). In recent work, Gao have begun to elucidate the molecular mechanisms of trans-splicing in the and systems. These two examples of trans-splicing are in the category of intragenic trans-splicing, in contrast to the better-characterized category of spliced leader (SL) trans-splicing events originally observed in trypanosomes and nematodes (reviewed in Lasda and Blumenthal 2011) and in contrast to intergenic trans-splicing, which generates chimeric mRNAs. Both intragenic and intergenic trans-splicing have been observed in mammalian cells, but these events appear to be rare and in some cases are associated with neoplastic cells (reviewed in Horiuchi and Aigaki 2006). In the case of , five clusters of polycistronic transcripts encoding multiple 3′ alternative termini are supported by cDNA/EST and transcriptional start site data (Nechaev ; Batut ). These have been annotated as separate genes by FlyBase, with symbols such as for the precursor of mod(mdg4)-RT. Two of these clusters are transcribed from the strand opposite to the rest of the exons, and are what precipitated the discovery of trans-splicing in D. melanogaster. It appears likely that this locus has undergone multiple small inversions since the Drosophilidae diverged from other dipterans: the genomic pattern of trans-encoded exons is observed in other Drosophila species (Gabler ), but in mosquitoes it appears that all the 3′ alternative exons are located on the same strand (http://vectorbase.org, AGAP003439) (Krauss and Dorn 2004; Megy ). For there is clear support for at least one trans-spliced precursor, including EST and transcription start site data. It appears to be monocistronic and corresponds to the 3′-most alternative exon; it has been annotated as a separate gene in FlyBase, . Additional candidate genes subject to trans-splicing have been suggested but await definitive evidence (McManus ). For , another gene encoding multiple isoforms with a constant BTB/POZ domain and variable zinc-finger domains, there are data supporting a transcription start site 5′ of the last two variable exons (Batut ), suggesting that this gene may undergo trans-splicing. A number of complex gene model annotations include short 3′ isoforms with supported alternative downstream transcription starts; in some cases, there are also 5′ isoforms that do not overlap the 3′ isoforms (for example, , , ). It is interesting to speculate that some of these loci may also be subject to trans-splicing. Genes with trans-spliced transcripts, including the 3′ trans-splicing precursors, are identified by a SO term (Table 1). For the nine spliced transcripts derived from exons on opposite strands, a comment is included in FlyBase transcript reports (Table S1) and a “trans_splicing” comment appears in the GenBank RefSeq transcript and protein entries (Table S2). The proposed bulk data flag is of the “transcript exception” type (Table 2).

Exceptional transcript modification: A-to-I RNA editing is noted at the genome level only

Post-transcriptional modifications of pre-mRNAs that result in changes to the nucleotide sequence are described as RNA editing. A-to-I RNA editing results from modification of an adenosine to inosine, which subsequently acts as would a guanine in terms of translation, splicing, and in the formation of secondary structures (reviewed in Mallela and Nishikura 2012). In D. melanogaster, 2005 putative A-to-I RNA editing sites have been identified and mapped on the genome, overlapping the transcripts of 1307 genes (Graveley ; Rodriguez ). A more recent study (St. Laurent ) has not yet been incorporated into FlyBase. Most genes subject to A-to-I editing have at least three associated editing sites, and two genes have more than 20 overlapping editing sites ( is associated with 23, with 21). Because editing is developmentally regulated, and because the efficiency of each editing site varies, different combinations of editing are also possible for a given transcript, further adding to the vast number of possible permutations. FlyBase does not attempt to represent alternative transcript or polypeptide sequences that may result from RNA editing. Identified A-to-I editing sites are annotated on the genome and are viewable in relation to gene model annotations on GBrowse. A Sequence Feature Report for each editing site details the observed editing frequency at various developmental stages, as well as the potential impact of a given editing site on the coding sequence of the related transcript (see, for example, A-I_edit_000244, FBsf0000383608). Each of these identified genomic sites is associated with the SO term “modified_RNA_base_feature” (SO:0000250). The SO term “gene_with_edited_transcript” (SO:0000548) is used to flag genes subject to A-to-I RNA editing.

Exceptional translation (1): noncanonical translation starts have been difficult to identify

Thus far, no systematic or definitive study of non-AUG translation initiation in Drosophila has been performed. Individual cases have been discovered more or less by chance; some of the more thoroughly characterized include (Sugihara ), (de Simone and White 1993), (Boyd and Thummel 1993), (Andjelkovic ), and (Figure 3A) (Beerman and Jongens 2011). The majority of the currently annotated cases, including all based on FlyBase analysis (Table S3), are postulated due to the lack of an appropriately placed AUG codon; most are supported by assessment of conservation among Drosophila species or a favorable sequence context for translation initiation, but no additional experimental data.

Figure 3

Noncanonical terminal extensions of the CDS. (A) CUG start codon in results in a 48-aa N-terminal extension; a GBrowse view showing amino acid sequence and amino ends of annotated polypeptides. Use of this alternative start codon has been confirmed by Western blot, mutagenesis of reported constructs, and rescue constructs (Beerman and Jongens 2011). (B) For the gene model, a stop-codon readthrough annotated for dan-RB is supported by PhyloCSF analysis (conservation of protein signatures). A GBrowse view showing (top to bottom): the gene model; stop codons on the plus strand in each of the three open reading frames; and regions of protein conservation among the Drosophila species (tan extents at the bottom). More information on data presented in GBrowse may be found at http://flybase.org/wiki/FlyBase:GBrowse_Tracks#General. Recently, several systematic studies have allowed a more comprehensive characterization of noncanonical translation initiation in human and mouse. Searching for evolutionary signatures of protein-coding sequences within predicted 5′ UTRs, Ivanov found 42 new and confirmed 17 previously reported non-AUG starts in humans. All are near-cognates (differing by one base) of AUG; CUG is by far the most common (42%). It may be a requirement that the second base is a pyrimidine, because AAG and AGG starts were not found. Ribosome profiling in mouse embryonic stem cells (Ingolia ) indicates that for many transcripts multiple translation starts may be used, primarily additional AUG sites, but also near-cognate codons. Again, CUG was found to be the most common noncanonical start codon. It appears that during translation a non-AUG start codon is still paired with the usual initiating tRNA, Met-tRNAi, and that the protein initiates with a methionine (Peabody 1989; Menschaert ). Thus, in FlyBase, the predicted protein sequence from a non-AUG translation start initiates with a methionine, not the amino acid that is usually associated with that codon. In FlyBase, there are currently 27 genes with transcripts annotated with a non-AUG translation start codon; 11 of these use a CUG start codon (Table S3). However, it seems likely that there are, in fact, many-fold more, especially in the category of alternative translation starts. There are a significant number of gene models for which an alternative splice or an alternative 5′ exon removes all in-frame AUG translation starts within the amino end of the putative protein. Most of these are currently annotated as short protein isoforms (using a downstream AUG), but may in fact be instances in which a non-AUG start is used. It is interesting that the amorphic y mutation is caused by a replacement of the AUG start codon with a CUG codon (Matthews ). Despite the fact that CUG is the mostly widely used noncanonical translation start, it is unable to function as a start codon in this context. Genes with transcripts annotated with noncanonical translation starts are flagged with SO terms (Table 1) in the gene model comment section. Comments at the transcript level are included in FlyBase transcript reports (Table S1); a “non-AUG translation initiation” comment and a translation exception flag appear the GenBank RefSeq transcript and protein entries (Table S2). The proposed bulk data flag is of the “translation exception” type (Table 2).

Exceptional translation (2): few selenoproteins are found in flies

Selenocysteine stop-codon readthrough is another example of rule-bending on a translational level: a UGA codon is read not as a stop signal, but rather as a 21st amino acid, selenocysteine (Sec; “U” in the single-letter code) (Castellano ). Selenoprotein synthesis requires a specialized tRNA, Sec-tRNAsec, as well as the proteins involved in the synthesis of Sec-tRNAsec. The selenocysteine UGA codon is distinguished from a UGA stop codon by the presence of downstream stem-loop structure, the Sec insertion sequence (SECIS) in the 3′ UTR of the transcript, which is recognized by SECIS-binding protein (Martin-Romero ). Selenocysteine is not only a means to allow a stop-codon readthrough; it also has a catalytic advantage over cysteine in the active site of oxidoreductases (Driscoll and Chavatte 2004). D. melanogaster has three identified selenoprotein genes: , , and , which were identified by both in silico analysis of the D. melanogaster genome (Castellano ) and by metabolic labeling with 75Se (Martin-Romero ). and have the Sec insertion quite early in the final protein; , however, extends only two amino acids past the substituted stop codon. Selenoproteins are less well conserved than some of the more rare phenomena: flies have three, humans have 25, worms have only one, and yeast and higher plants lack them altogether. Many of the mammalian selenoproteins themselves are conserved, but the invertebrate nonselenoprotein orthologs have a cysteine in place of Sec (Driscoll and Chavatte 2004). In fact, at least two of the D. melanogaster selenoprotein genes have nonselenoprotein paralogs genomically nearby (Castellano ). In contrast, prokaryotes and archaebacteria have a completely different complement of selenoproteins (Driscoll and Chavatte 2004). Selenoprotein genes are identified by a specific SO term (Table 1). At the transcript level, comments are included in FlyBase transcript reports (Table S1) and a translation exception flag appears in the GenBank RefSeq transcript and protein entries (Table S2). The proposed bulk data flag is of the “translation exception” type (Table 2).

Exceptional translation (3): stop-codon readthrough appears to be common in flies

Stop-codon readthrough is a well-documented regulatory mechanism in viruses (reviewed in Bertram ) and has been investigated in S. cerevisiae (reviewed in von der Haar and Tuite 2007). Prior to 2007, a small number of specific cases had been identified in flies, including (Xue and Cooley 1993; Robinson and Cooley 1997), (Bergstrom ), (Klagges ) and (Steneberg ). In 2007, using a genome-wide comparative analysis designed to detect regions exhibiting evolutionary signatures specific to protein-coding regions (PhyloCSF) (Lin , 2011), a surprising number of such regions was found immediately beyond annotated stop codons. Because there are other possible explanations for this observation (alternative splicing, A-to-I RNA editing, or polycistronic transcripts, for example), a more thorough analysis was performed by Jungreis . Based primarily on these comparative evolutionary analyses, 328 genes are currently annotated in FlyBase with one or more transcripts subject to stop-codon readthrough (example in Figure 3B); a complete listing is available in File S5. In 22 cases a double readthrough is supported; there even appear to be two cases of triple readthrough (, ). Two genes exhibit readthrough at two independent sites within alternative exons ( and ). The gene model is a particularly enthusiastic example of stop-codon readthrough: it has two independent sites, one of which is a double readthrough. A number of predicted cases of stop-codon readthrough have been confirmed by mass spectrometry of wild-type proteins (seven cases) or using reporter constructs in which the readthrough extensions are epitope-tagged (Jungreis ). Additionally, an independent ribosome-profiling assay using early embryos and S2 cells has confirmed 43 of the previously predicted readthrough cases (Dunn ); it is unclear if unconfirmed cases occur at low levels or only at other developmental stages. Interestingly, this ribosome profiling study identified an additional 307 cases of translation readthrough not identified by Jungreis . These novel cases of translation readthrough exhibit lower conservation scores and lower readthrough rates than those predicted by conservation; the nucleotide character of these novel readthrough regions is intermediate between coding regions and 3′ UTRs. As such, it is unclear if these extensions represent truly functional isoforms of the proteins, or simply provide fodder for the evolution of novel C-terminal variants. For this reason, poorly conserved cases of readthrough are not currently annotated by FlyBase. With very few exceptions, the annotated readthroughs result in a C-terminal extension of an annotated polypeptide sequence (as opposed to the addition of upstream coding sequences to an annotated polypeptide); note that the analysis of Jungreis was biased to detect this type of event. This stop-codon readthough phenomenon appears to be distinct from the selenocysteine system and does not require a downstream SECIS motif (Jungreis ). The annotated stop-codon readthroughs include cases for all three stop codons, with UGA being the most common. The length of the amino acid extension ranges from four amino acids to more than 1000 amino acids; most are within the range of 8–300 amino acids. Comparisons across 12 Drosophila species for individual genes subject to stop codon readthrough revealed that 97% of readthrough codons were perfectly conserved; substitution of an alternative stop codon was rare and involved only UAA and UAG (Jungreis ). This suggests that the three codons are not functionally identical in the context of readthrough events, but that UAA and UAG may be similar. Many of the longer carboxy readthrough extensions have in common a distinct pattern of conservation: regions of low complexity, variable conservation, and variable length interspersed with regions of protein sequence conservation. They show characteristics of intrinsically disordered protein regions, which confer structural flexibility and may be characteristic of many members of protein complexes (reviewed in Mészáros ). For these cases, the carboxy extension due to stop-codon translational readthrough may result in an increased repertoire of protein-protein interactions. An interesting example is , which encodes a leucine-rich transmembrane protein and is similar to . The 240-aa extension of the Caps protein expands the similarity of the two proteins into a region with this pattern of conservation that is present in the unextended region of the Trn protein (Figure S2). Jungreis also used the PhyloCSF algorithm to identify stop-codon readthrough candidates in mammals, nematodes, fungi, and other insects. Although examples of predicted nonselenocysteine readthroughs were found in other phylogenetic groups, only in insects were they found to be common. Use of the three stop codons was not uniform: in many species, only UGA readthroughs were observed and UAA readthroughs were found only in Dipterans. Using a Z-curve reading-frame bias analysis that provided an estimation of the total frequency of readthroughs, Jungreis assessed 25 species in a broad phylogenetic range. They concluded that within the animal kingdom, abundant readthough may be confined to insects and crustacea. Genes with transcripts subject to stop-codon readthrough are identified by a specific SO term (Table 1). At the transcript level, comments are included in FlyBase transcript reports and double readthroughs are identified (Table S1). A translation exception flag appears in the GenBank RefSeq transcript and protein entries (Table S2). In the sequences of the predicted proteins, an “X” appears at the position(s) of the suppressed stop codon(s). The proposed bulk data flag is of the “translation exception” type (Table 2).

Exceptional translation (4): a single translational frameshift is highly conserved

D. melanogaster has a single example of translational frameshifting, involving the gene (Ornithine decarboxylase antizyme) (Ivanov ). This event is part of a mechanism of polyamine autoregulation that is conserved from yeast through mammals (Ivanov ; Olsen and Zetter 2011). The ornithine decarboxylase antizyme regulates the activity of the enzyme ornithine decarboxylase (ODC), a key enzyme in the synthesis of polyamines. The transcript does not code for a functional antizyme without an additional regulatory step: the reading frame changes in the second exon. High polyamine concentrations drive a +1 ribosomal frameshift, thus the production of functional Oda protein is sensitive to polyamine levels. The gene report includes an appropriate SO term (Table 1), and reports for the transcripts include an explanatory comment (Table S1). A “ribosomal_slippage” flag appears in the GenBank RefSeq transcript and protein entries (Table S2). The proposed bulk data flag is of the “translation exception” type (Table 2).

Exceptional translation (5): mitochondria play by their own rules

There are 13 protein-coding genes mapped to the D. melanogaster mitochondrial genome (mitochondrial genes are flagged by the SO term “mt_gene,” SO:0000088). Because mitochondria use an alternative genetic code, all mitochondrial protein-coding transcripts are flagged as translation exceptions (see http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG5 for nonstandard codon usage specific to invertebrate mitochondria). In addition, some mitochondrial genes have an incomplete stop codon. This is a stop codon that is not entirely encoded in the genome: it is completed upon post-transcriptional 3′ polyadenylation. The transcripts of three mitochondrial genes (, , ) have this type of structure and have an additional translation exception comment referring to the stop codon exception. For one mitochondrial gene () the start codon has not been identified. Mitochondrial translation is known to use noncanonical start codons, but the sequence preceding does not contain any of the identified alternative codons in the correct ORF. The transcript for this gene is flagged with the comment “start codon not determined."

Conclusion

The annotation of the D. melanogaster genome has been greatly facilitated by access to both extensive sets of high throughput data and a large body of gene-specific research. Combined with expert manual assessment of each gene model, this has allowed FlyBase to compile a uniquely detailed and nuanced gene model annotation set, which continues to improve. Unfortunately, some of the most interesting aspects of our knowledge of the D. melanogaster genome are difficult to leverage to inform the annotations of other species. The subjects of this article, the rule-benders or exceptional cases, largely identify biological phenomena that create problems for automated gene prediction algorithms. Although some cases among the rule-benders would present more difficulties than others, it may be feasible to incorporate automated second-pass steps to identify many exceptional cases. A number of highly conserved phenomena are well defined; they affect relatively few genes, but these may be straightforward to identify. A parallel approach would be a gene model pipeline that is largely automated, but that allows incorporation of manual corrections and additions. Ideally, this would be an ongoing process as more is learned about the variability and flexibility of the genome. If an efficient system for expert review of submissions were developed, then input from a variety of sources, such as researchers interested in a group of genes across many species, undergraduate annotation projects, and even specialist crowd-sourcing, could be encouraged.

83 in total

1. moody encodes two GPCRs that regulate cocaine behaviors and blood-brain barrier permeability in Drosophila.

Authors: Roland J Bainton; Linus T-Y Tsai; Tina Schwabe; Michael DeSalvo; Ulrike Gaul; Ulrike Heberlein
Journal: Cell Date: 2005-10-07 Impact factor: 41.582

Review 2. Alternative trans-splicing: a novel mode of pre-mRNA processing.

Authors: Takayuki Horiuchi; Toshiro Aigaki
Journal: Biol Cell Date: 2006-02 Impact factor: 4.458

Review 3. The proteome under translational control.

Authors: Daria Gawron; Kris Gevaert; Petra Van Damme
Journal: Proteomics Date: 2014-11-02 Impact factor: 3.984

4. The Drosophila gene for antizyme requires ribosomal frameshifting for expression and contains an intronic gene for snRNP Sm D3 on the opposite strand.

Authors: I P Ivanov; K Simin; A Letsou; J F Atkins; R F Gesteland
Journal: Mol Cell Biol Date: 1998-03 Impact factor: 4.272

5. Effective translation of the second cistron in two Drosophila dicistronic transcripts is determined by the absence of in-frame AUG codons in the first cistron.

Authors: Adam A Wall; A Marie Phillips; Leonard E Kelly
Journal: J Biol Chem Date: 2005-06-10 Impact factor: 5.157

6. Ribosome profiling reveals pervasive and regulated stop codon readthrough in Drosophila melanogaster.

Authors: Joshua G Dunn; Catherine K Foo; Nicolette G Belletier; Elizabeth R Gavis; Jonathan S Weissman
Journal: Elife Date: 2013-12-03 Impact factor: 8.140

7. Diversity and dynamics of the Drosophila transcriptome.

Authors: James B Brown; Nathan Boley; Robert Eisman; Gemma E May; Marcus H Stoiber; Michael O Duff; Ben W Booth; Jiayu Wen; Soo Park; Ana Maria Suzuki; Kenneth H Wan; Charles Yu; Dayu Zhang; Joseph W Carlson; Lucy Cherbas; Brian D Eads; David Miller; Keithanne Mockaitis; Johnny Roberts; Carrie A Davis; Erwin Frise; Ann S Hammonds; Sara Olson; Sol Shenker; David Sturgill; Anastasia A Samsonova; Richard Weiszmann; Garret Robinson; Juan Hernandez; Justen Andrews; Peter J Bickel; Piero Carninci; Peter Cherbas; Thomas R Gingeras; Roger A Hoskins; Thomas C Kaufman; Eric C Lai; Brian Oliver; Norbert Perrimon; Brenton R Graveley; Susan E Celniker
Journal: Nature Date: 2014-08-28 Impact factor: 49.962

8. A comprehensive survey of non-canonical splice sites in the human transcriptome.

Authors: Guillermo E Parada; Roberto Munita; Cledi A Cerda; Katia Gysling
Journal: Nucleic Acids Res Date: 2014-08-14 Impact factor: 16.971

9. FlyBase: introduction of the Drosophila melanogaster Release 6 reference genome assembly and large-scale migration of genome annotations.

Authors: Gilberto dos Santos; Andrew J Schroeder; Joshua L Goodman; Victor B Strelets; Madeline A Crosby; Jim Thurmond; David B Emmert; William M Gelbart
Journal: Nucleic Acids Res Date: 2014-11-14 Impact factor: 16.971

10. The evolution of the histone methyltransferase gene Su(var)3-9 in metazoans includes a fusion with and a re-fission from a functionally unrelated gene.

Authors: Veiko Krauss; Anne Fassl; Petra Fiebig; Ina Patties; Heinz Sass
Journal: BMC Evol Biol Date: 2006-03-02 Impact factor: 3.260

10 in total

1. UGGT1 retains proinsulin in the endoplasmic reticulum in an arginine dependent manner.

Authors: Jaeyong Cho; Masaki Hiramoto; Yuka Masaike; Satoshi Sakamoto; Yoichi Imai; Yumi Imai; Hiroshi Handa; Takeshi Imai
Journal: Biochem Biophys Res Commun Date: 2020-05-16 Impact factor: 3.575

Review 2. Using FlyBase, a Database of Drosophila Genes and Genomes.

Authors: Steven J Marygold; Madeline A Crosby; Joshua L Goodman
Journal: Methods Mol Biol Date: 2016

3. Gene Model Annotations for Drosophila melanogaster: Impact of High-Throughput Data.

Authors: Beverley B Matthews; Gilberto Dos Santos; Madeline A Crosby; David B Emmert; Susan E St Pierre; L Sian Gramates; Pinglei Zhou; Andrew J Schroeder; Kathleen Falls; Victor Strelets; Susan M Russo; William M Gelbart
Journal: G3 (Bethesda) Date: 2015-06-24 Impact factor: 3.154

4. Evolutionary Dynamics of Abundant Stop Codon Readthrough.

Authors: Irwin Jungreis; Clara S Chan; Robert M Waterhouse; Gabriel Fields; Michael F Lin; Manolis Kellis
Journal: Mol Biol Evol Date: 2016-09-07 Impact factor: 16.240

5. FlyBase at 25: looking to the future.

Authors: L Sian Gramates; Steven J Marygold; Gilberto Dos Santos; Jose-Maria Urbano; Giulia Antonazzo; Beverley B Matthews; Alix J Rey; Christopher J Tabone; Madeline A Crosby; David B Emmert; Kathleen Falls; Joshua L Goodman; Yanhui Hu; Laura Ponting; Andrew J Schroeder; Victor B Strelets; Jim Thurmond; Pinglei Zhou
Journal: Nucleic Acids Res Date: 2016-10-30 Impact factor: 16.971

Review 6. Non-AUG translation: a new start for protein synthesis in eukaryotes.

Authors: Michael G Kearse; Jeremy E Wilusz
Journal: Genes Dev Date: 2017-09-01 Impact factor: 11.361

7. Widespread polycistronic gene expression in green algae.

Authors: Sean D Gallaher; Rory J Craig; Iniyan Ganesan; Samuel O Purvine; Sean R McCorkle; Jane Grimwood; Daniela Strenkert; Lital Davidi; Melissa S Roth; Tim L Jeffers; Mary S Lipton; Krishna K Niyogi; Jeremy Schmutz; Steven M Theg; Crysten E Blaby-Haas; Sabeeha S Merchant
Journal: Proc Natl Acad Sci U S A Date: 2021-02-16 Impact factor: 11.205

Review 8. Deconstructing internal ribosome entry site elements: an update of structural motifs and functional divergences.

Authors: Gloria Lozano; Rosario Francisco-Velilla; Encarnacion Martinez-Salas
Journal: Open Biol Date: 2018-11-28 Impact factor: 6.411

9. Double triage to identify poorly annotated genes in maize: The missing link in community curation.

Authors: Marcela K Tello-Ruiz; Cristina F Marco; Fei-Man Hsu; Rajdeep S Khangura; Pengfei Qiao; Sirjan Sapkota; Michelle C Stitzer; Rachael Wasikowski; Hao Wu; Junpeng Zhan; Kapeel Chougule; Lindsay C Barone; Cornel Ghiban; Demitri Muna; Andrew C Olson; Liya Wang; Doreen Ware; David A Micklos
Journal: PLoS One Date: 2019-10-28 Impact factor: 3.240

10. Lipid Dynamics, Identification, and Expression Patterns of Fatty Acid Synthase Genes in an Endoparasitoid, Meteorus pulchricornis (Hymenoptera: Braconidae).

Authors: Jiao Wang; Li-Wei Shen; Xiao-Rong Xing; Yu-Qi Xie; Yi-Jiangcheng Li; Zhi-Xiang Liu; Jun Wang; Fu-An Wu; Sheng Sheng
Journal: Int J Mol Sci Date: 2020-08-28 Impact factor: 5.923

10 in total