Literature DB >> 24174546

The insect-phase gRNA transcriptome in Trypanosoma brucei.

Donna Koslowsky¹, Yanni Sun, Jordan Hindenach, Terence Theisen, Jasmin Lucas.

Abstract

One of the most striking examples of small RNA regulation of gene expression is the process of RNA editing in the mitochondria of trypanosomes. In these parasites, RNA editing involves extensive uridylate insertions and deletions within most of the mitochondrial messenger RNAs (mRNAs). Over 1200 small guide RNAs (gRNAs) are predicted to be responsible for directing the sequence changes that create start and stop codons, correct frameshifts and for many of the mRNAs generate most of the open reading frame. In addition, alternative editing creates the opportunity for unprecedented protein diversity. In Trypanosoma brucei, the vast majority of gRNAs are transcribed from minicircles, which are approximately one kilobase in size, and encode between three and four gRNAs. The large number (5000-10,000) and their concatenated structure make them difficult to sequence. To identify the complete set of gRNAs necessary for mRNA editing in T. brucei, we used Illumina deep sequencing of purified gRNAs from the procyclic stage. We report a near complete set of gRNAs needed to direct the editing of the mRNAs.

Entities: CellLine Chemical Disease Mutation Species

Mesh：

Substances：

Year: 2013 PMID： 24174546 PMCID： PMC3919587 DOI： 10.1093/nar/gkt973

Source DB: PubMed Journal: Nucleic Acids Res ISSN： 0305-1048 Impact factor: 16.971

INTRODUCTION

In Trypanosoma brucei, expression of the mitochondrial genome involves one of the most striking examples of small RNA directed regulation, RNA editing (1,2). In these parasites, hundreds of small RNAs direct the insertion and deletion of uridylate (U) residues needed to generate translatable mRNAs. The RNA editing process is developmentally regulated and alternative editing has been detected, creating the opportunity for unprecedented protein diversity (3,4). The small guide RNAs (gRNAs) are key components of RNA editing and are all encoded in the mitochondrial genome. This genome consists of several thousand, interlocked, circular DNA molecules organized into a disk-like structure called the kinetoplast or kDNA (5). Each cell has one mitochondrion and one kDNA network made up of maxicircles and minicircles. The kDNA maxicircle (∼22 kb) encodes 18 proteins, 2 ribosomal subunits and 2 gRNAs, and is present in ∼50 copies within the network. All other gRNAs are encoded on the minicircles that make up the bulk of the mitochondrial genome. Each network can contain from 5000–10 000 minicircles composed of ∼250 different minicircle sequence classes (6). With each minicircle encoding ∼3–5 gRNAs, this component of the genome has the capacity to encode over 1200 different gRNAs (7–9). Despite the importance of the gRNAs to the editing process, a full complement has only been described in one laboratory strain of Leishmania tarentolae (10). Editing in this strain is limited, and five of the normally pan-edited genes are not productively edited. In T. brucei, a number of studies using conventional sequencing methods have been done in the attempt to identify gRNAs (11–13). However, despite these attempts, large numbers of the gRNAs needed for the extensive editing of the protein coding genes were still unidentified. We report here the characterization of the gRNA transcriptome of the procyclic stage of T. brucei using deep sequencing of purified mitochondrial gRNAs. Within this transcriptome, we have identified the full complement of gRNAs needed to direct the editing of ATPase 6 (A6), cytochrome oxidase III (COIII), C-rich region 4 (CR4), cytochrome b (CYb) and ribosomal protein subunit 12 (RSP-12). In contrast, a full complement of gRNAs were not identified for C-rich region 3 (CR3), maxicircle unidentified reading frame II (Murf II), NADH dehydrogenase (ND) subunits 3, 7, 8 and 9. Most striking was the large variation in transcript copy number observed for the identified gRNAs.

MATERIALS AND METHODS

Parasites, isolation of mitochondria and RNA extraction

T. brucei clone IsTar from stock EATRO 164 was grown in SDM79 and harvested at a cell density of 1–3 × 107 cells/ml. Harvested trypanosomes were washed in sodium buffered glucose (SBG), resuspended in DTE buffer and disrupted using a sterile Dounce homogenizer as previously described (14). Cell lysate was then treated with RNAse-free DNAse I (5 µ/ml) and incubated on ice for 45 min. The reaction was stopped by addition of an equal volume of STE (250 mM Sucrose, 20 mM Tris (pH 7.9), 2 mM EDTA) and cells/organelles collected by centrifugation (16 000 g, 10 min). Mitochondrial vesicles were then collected using a series of differential spins. Briefly, initial pellets were resuspended in STE and cleared of large particles and cell debris using two low-speed spins (1500 g, 10 min). Mitochondrial vesicles were then collected by centrifugation at 16 000g (10 min). The enriched mitochondrial pellet was then lysed using an acidic phenol–CHCl3 extraction in the presence of 4 M guanidinium isothiocyanate and 2% (w/v) sodium-N-lauroyl sarcosinate (15). The mitochondrial RNA (mtRNA) was precipitated, washed and resuspended in RNAse-free water. Alternatively, collected parasites were immediately lysed using the acidic phenol–CHCl3 protocol and total RNA collected.

Library preparation and Illumina sequencing

Approximately 100 µg of mitochondrial or total RNA was treated with DNAse RQ1 (Promega) and then size-fractionated by denaturing 10% (w/v) polyacrylamide electrophoresis (8 M urea). A gRNA marker lane was generated by 5′ capping 10 µg of mtRNA using 32P αGTP and Vaccinia capping enzyme (BioLabs) according to manufacturer’s directions. RNAs in the gRNA size range (∼40–80 nt) were excised from the gel, passively eluted and ethanol precipitated. To preserve strand information, we used a modified Illumina ‘Small RNA’ sample preparation protocol. The gRNAs have a 5′ tri-phosphate and a 3′ hydroxyl group. This allows the direct ligation of the RNA 3′ adaptor. However, addition of the RNA 5′ adapter required phosphatase treatment followed by polynucleotide kinase (PNK) to add a single phosphate. Ligation of the 5′ adapter was followed by RT-PCR amplification and gel purification of the gRNA library. The gRNA library, as determined by an Agilent Bioanalyzer, had a narrow distribution, centered at ∼135 bp, consistent with the estimated size of the gRNAs (plus adapters). Each library (gRNAs isolated from mtRNA and gRNAs isolated from total RNA) was sequenced on a Illumina GAIIx (single read 75 base run). Approximately 30 million raw reads were obtained from each of the gRNA libraries. After removal of the Illumina adapter sequences, quality-based trimming was done with prinseq (stand alone lite version, http://prinseq.sourceforge.net/). Reads with two or more N's or an overall mean Q-score < 25 were discarded. The 3′ end was further trimmed of low quality bases (mean Q-score < 20 over a 5 base window); any reads < 20 nt after trimming were discarded. Only a small fraction of reads were discarded at this step. Input raw reads were further processed using the following criteria: (i) remove redundant reads. The number of redundant reads is kept for each unique sequence; (ii) remove reads without at least four consecutive Ts.

RESULTS

Identification of gRNAs

To identify gRNAs that direct the editing process of the known mRNAs, we aligned each transcript read to the conventionally edited mRNAs based on known base-pairing mechanisms. A legal alignment between gRNA and the edited mRNA mainly contains canonical Watson–Crick DNA base pairs and the G-U base pair. We note that a small number of other types of base pairs may also exist in the alignment; however, these were not allowed in our initial screen. In addition, we allowed no gaps in the alignment, allowing us to formulate the gRNA-mRNA alignment problem as an extended longest common substring (LCS) problem. The LCS problem outputs the LCS between two input sequences. To use LCS to identify the gRNA-mRNA alignments, we defined match and mismatch base pairs as follows: (i) match: canonical Watson–Crick and G-U base pairs; and (ii) mismatch: any other type of base pair. Based on this definition, we formulated the extended LCS problem as follows: given two sequences x and y, find the LCS between x and y with at most T mismatches. The problem was solved using dynamic programming. The sub-problem is denoted using function LCS(i, j, τ), representing the length of the LCS ending at position i and j in two input sequences x and y with exact τ mismatches (τ ≤ T). Thus, the length of the LCS between x and y with τ mismatches is max1≤i≤|x|,1≤j≤|y| [LCS(i, j, τ)]. When T = 0, the extended LCS problem is reduced to the original LCS problem. The following recursive functions are used to solve LCS(i, j, τ). xi is the ith character of x. yj is the jth character of y. Recursive functions: if xi and yj form a match LCS(i, j, τ) = LCS(i-1, j-1, τ) + 1 else LCS(i, j, τ) = LCS(i-1,j-1, τ-1) + 1 for 1 ≤ τ ≤ T We need to compute LCS(i, j, τ) for 1 ≤ i ≤ |x|, 1 ≤ j ≤ |y| and 0 ≤ τ ≤ T. The initialization is LCS(0, 0, τ) = 0. Once we record the maximum of LCS(i, j, τ) for all indexes in sequence x and y, we can easily recover the LCS itself. During our analyses, we did allow for at most three mismatches (i.e. T = 3). Thus, we have the alignments between gRNAs and edited mRNAs with 0 mismatches to 3 mismatches. However, the transcript reads that can be aligned with edited mRNAs with less number of mismatches have higher probability to be real gRNAs. Thus, when the edited sites in an mRNA can be aligned with gRNAs with τ mismatches, we did not use gRNA alignments containing τ+1 mismatches. Matched gRNAs were then scored as follows, two points for canonical Watson–Crick base pairs and one point for G-U base pairs. gRNAs with scores >45 were identified as guiding a specific region based on the identified mRNA fully edited sequence (numbered from the 5′ end). Using this criterion with the 0 mismatch data set, we found that all of the identified gRNAs had characteristics indicating that they were matched to the correct position. The matched gRNAs were sorted based on their guiding positions, and the populations analyzed and sorted into sub-populations (sequence variants that guide the same or nearly the same region). Using these data, we were able to generate two additional data files: (i) a best align file containing the highest scoring gRNA for any specific region and (ii) a coverage profile, containing the number of gRNAs that cover any specific nucleotide within the fully edited mRNA. Alignments of the identified gRNAs with the fully edited mRNA sequences indicated that we had identified a near full set of gRNA required for the editing of the mRNAs. Full coverage was obtained for A6, COIII, CR4, CYb and RSP-12. Full complements of gRNAs were not identified for CR3, Murf II or the ND subunits (ND3, ND7, ND8 and ND9). The identified gRNAs and the full gRNA-mRNA alignments for one fully edited mRNA, ATPase 6, are shown in Table 1 and in Figure 1. This mRNA is extensively edited and the data presented illustrate several key points. The identified gRNAs and the alignments for all other transedited mRNAs can be found in the supplemental data. gRNAs are designated by their guiding position on the fully edited mRNA. Both nucleotides and deletion sites in the fully edited mRNA were defined by a number, starting from the 5′ end (+1 = 0).

Table 1.

The major gRNA classes involved in the editing of ATPase 6

mRNA 5′	mRNA 3′	Copy no.	ATPase 6: Major gRNA classes
24	72	6^a	ATATAC AACGCAACCAGAGTAAATCATGAAGGGAAAGTGAAGGCATATTTGTTTT T₁₅
29	72	1630	ATATAC AACGCAACCAGAGTAAATCATGAAGGGAAAGTGAAGGCATATTT T₁₁
✰31	75	2044	AT ATAAACGTAACTGAAATGAATCACGAGAGAAAGATAAAGATATAT AT₁₂
✰31	75	143	AT ATAAACGTAACTGAAATGAATCGCGAGAGAAAGATAAAGATATAT ATTTTTGT₁₅

✰62	102	1435	ATACA ATCATACACAGTAGTACATATATAGTGATAGACGTGATTAA T₁₁

✰84	127	4^a	ATAT AAATACACAGTAGAATATGATCTAGGTTATGTATGATGATATAT T₁₄
✰86	127	2158	ATAT AAATACACAGTAGAATATGATCTAGGTTATGTATGATGATAT T₁₀

✰105	152	54^a	AC ATCAAAAATCGACATTAGATAATTGAGGTATGTGATAGAGTATAATTT T₅GT₅
✰113	152	743	ATAC ATCAAAAATCAACGTTAGACAGTTAAGATATGTGATAGAA GATAAT₁₂

✰135	183	1^a	ATATAAATCAAACAAACAGAATAGTAGAAAGTCAGAGATTGATGTTAA T₁₁
✰138	183	430	AT ATACAAATCAAACAGACAGAGTAATAGAAGGTTGAAGATTGATAT AGT₁₁
144	177	210	ATATC ATCAAACAAACAGAATAATAGAGAATCAGAGGT GAATGTTAAGT₁₅

✰158	208	14^a	ATAT ACAAACACAAACTGACGAATAGATACAGATTAAGTGAATGAAATAAT T₁₁
164	208	54	ATAT ATAAACACAAATCAACGAATAGATATAAGTCAGATAGATGG TGTATTAT₁₂A₁₁
✰176	210	36	AT AAACAAACACAAATCAGTAGACGAGTACAAGT GAGATGGACGTATAGAT₇
✰165	208	24	ATAAT ACAAACACAAACTGATAGACGAATACGAGTTAGATGGACG TAT₆

✰189	243	3^a	ATATAAATTAAACAGCATAAACTGTAGCAGTGAAGATAGATGTGAATTAATA T₁₄
✰192	243	172	ATATAAATTAAACAACATAGATTACAGTGATAGAAGTAAATGTGAATTA T₄

✰218	248	147	ATC AGACTATGTGAGTTAGATGACGTGAATTATA CTGTATAT₁₂

✰224	269	864^a	ACATAA TAATACAATAATACGAGATTAGACTATGTGAATTAAATGATATGA T₁₁G
✰226	269	808	ACATAA TAATACAATAATACGAGATTAGACTATGTGAATTAAATGATAT T₈GT₄

✰249	299	4^a	AAAT AAACAACAAATATGAGTTCGAATAAGTGATATAATGGTATAAAATT T₁₁
✰248	292	25 157	ATATA AAATACAAATTCGAGTAGGTAGTACAATGATATGAGATTA T₁₃
✰252	292	134	ATATA AAATACAAATTCGAGTAGGTAGTACAATGATATAGA TTATTAAT₇
✰255	292	244	ATATA AAATACAAATTCGAGTAGGTAGTACAATGATAT TATTATTAAT₁₅
253	298	5405	ATATAT AACAACAAATATAGATTCAAGTAAGTGATGTAGTAATATGA T₁₁

✰266	313	586	AAAAAA AAAAAAACAATACAAGATGACAGGTATAAGTTTGGATGAGTAAT T₁₂G

300	346	263^a	ATAT AAACAAAACAGAAATAGAAATGCAATATACGATAAGAAAATGGTATA T₁₂
✰301	345	647	ATAT AACAAAACAAAAGTAGAAGTGCAGTATATGATAGAAAAATGATGT CAAAT₁₁
✰301	335	125	ATAT ACAAAACAT AAATAAAAGTGCAGTATATGATAAAGAGATAATAT T₁₁

✰331	375	24 736	ATAT AATTATTAAACAAGAGAAAGTCACGTAAAAGGTAGAATGAAGATA TTTTTCT₆
✰331	375	712	ATAT AATTATTAAACAAGAGAAAGTCACGTAAAAAGTAGAATGAAGATA TTAT₅
✰332	378	8776	AT ATAAATTATTAAACAGAAAGAGATCATGTAGAAAGTGAGATAGAAAT T₁₂CT
331	371	3561	ATATAA ATTAAACAAAAAGAAATCACGTAGAAGACAGAATAGAGATA T₁₂G
331	374	302	ATAT ATTATTAAACAAAGAGAAATCATATAAGAGACAGAATGAGAATA T₉AT₅
✰332	378	387	AT ATAAATTATTAAACAGAAAAGAGTCATATAGAAAATAAGATAGAAAT T₁₂
✰332	378	144	AT ATAAATTATTAAACAGAAAGAGATCATGTAGAAAGTGAGATAAAAAT T₃

✰349	389	41	ATATAA ATCACCAACTAATAAGTTATTGAATGAGAGAAAGTTATATA T₁₂

✰360	407	181	ATATAT ACATCCATAAAATTATCATCAGTTAATAGATTGTTAAATGAAAA T₄

387	435	1428	ATATAT AACACAACAAGAAACGAATGAGAGAAGTATCTATGAGATTATT T₉CGT₃CTTCT
✰387	435	1049	ATATAT AACACAACAAGAGACGAATAGAAAAGATATCTGTGAAATTATT T₁₀ATT
✰387	437	934^a	ATATAT AAAACACAATAGAAAACGGATAAGAGAGATATTCATAGAGTTATT T₉GTTT

413	461	3^a	ATAT ATACAACAAAGAAAGACACTCTAGAAGATACAGTGAGAGATGAGTAA T₁₁
✰424	464	25 624	ATAT ATGACACAACGAGGGAAGATACTCTAAAGGACACAGTGAAA T₁₂
✰427	467	2307	ATAT ATAACGACACAATAGAGAAAGATGCTCTGAGAGATGTAATA T₁₂G
✰421	460	1864	ATAAAT TACAACAAAGAAAGATACTCTAGAAAGCACAGTGAGAAAT T₈CT₇
424	457	368	AAATTAACGACA AACAAAGAGAAATACTCTGAGAAATATGATGAAA T₁₂

✰455	491	368	ATATATAATTAC AAACAAACGCAGAGATGTCGGTAAATAATGATATAAT T₁₁
✰455	497	22^a	ATAT ATTACAAAACAGACGTAAAGATGTCGATGAATGGTGGTATAAT T₁₄

✰487	528	1^a	ATAC ACATCAACAATAGAAGATGGGATGATAATAGATTGTGAGATA T₂₇
✰487	526	8723	ATACAA ATCAACAATAGAAGATGGGATGATAATAGATTGTGAGATA T₁₆

✰521	567	232	AA AAAAAAAAAAAACAAAAATAGAATAAAGAAAGTCAGAGAATGTTAAT T₅

✰546	593	15^a	AATAAATCGATAACAAAGAACACTGTAAAAAAAGAGAATGAGAGTAAA TATAT₄
✰549	593	2587	AC AATAAATCAATAACAGAGAATATCATAGAGAGGAAAGATAGAAAT T₁₂GTTTGTACTT
✰549	592	181	ATAT ATAAATCAATGACAAGAAGCACTGTAGAAAAAGAGAGTGAAAAT TTTTAT₈
✰557	593	69 619	AATAAATCGATAACAAAGAACACTGTAAAAGAGAGAA TGAGAGTAAATAT₉

568	611	670	ATACT AAACACAAAAATGAATAAAATAAGTCAGTGATAGAAGATATTAT T₁₂

583	629	2^a	AT AAATAATAAACAGAAACAGAGCATAGAAGTAAGTAGAGTGAATTAAT T₁₁
589	629	854	AT AAATAATAAACAGAAACGGAATACGAGAATAAGTAAAGTGA TTTAAT₁₃

612	657	3^a	AT ATAAATCCAACAAGTATAAGAACATATAGAATAGTAGGTGAAAATA T₆A
613	654	618	ATATAT AATCCAACAGATATAAGAGCATGTAAAATAGTAAGTGAAAAT T₁₀AT
✰613	657	183	AT ATAAATCCAACAAGTATAAGAACATATAGAATAGTAGGTGAAAAT T₇CT₄

✰638	689	5^a	ATAT ATAAATAACTGTAGTATGGTGGTAGATGAGTTTGATAGATATAAA T₉
✰640	689	39 063	ATAT ATAAATAACTGTAGTATGGTGGTAGATGAGTTTGATAGATATA T1₂
✰647	689	678	ATAT ATAAATAACTGTAGTATGGTGGTAGATGAGTTTGAT T₁₁
✰640	689	131	ATAT ATAAATAACTGTAGTATGGCGGTAGATGAGTTTGATAGATATA T₁₁
✰654	689	234	ATAT ATAAATAACTGTAGTATGGTGGTAGATGA TTTTGATAGATATAT₁₂

✰672	716	119^a	ACACA ATCAACTGCAGAATTATATTACAGAGAGTGAGTAATTGTAA AAT₁₂
✰680	714	7581	AAAATA CAACTGCAAGATCGTGTTATAGAGGATAAGTGATT TAAT₁₃
✰680	714	105	AAATA CAACTGCAAGATCGTGTTGTAGAGGATAAGTGATT TAAT₁₁
✰680	719	1291	ATATAA ATTATCAACTGTGAGATTATATTACAAGGAATAAGTGATT T₁₁AT

✰686	728	12^a	ATATATT AAAATCCATTATCGATTGTAGAGTTATGTTATAGAGAATAA TAT₂₁
✰698	728	740	ATT AAAATCCATTATCGATTGTAGAGTTATGT GATAGAGAATAAT₁₁

✰715	760	2^a	AT ATATAAAACTAAACAAATAGCAAAGACAGTGAGAGATTCGTTAT AAAT₁₃
✰715	755	2272	ATATATAT AAACTAAACAAATAGCAGAGACAGTGAGAGATTCGTTAT AAT₁₃

✰720	767	4588	AT AAATCAAATACAGAACTGAATAGACGATAAAGATAGTGAGAAATTT T₁₀G
✰720	767	165	AT AAATCAAATACAGAACTAGATGAACAATAGAGATAGTGAGAAATTT TTTTTCT₆
✰728	765	920	AT ATCAAATACAAAACTGAGCAGATGACAGAGATAGTAAA TGATTTAT₁₁G

✰747	789	13	ATAAAT ACAACAATATAATAACTGTCGAAGGTTGAATATGAGATTAAAT T₁₁

770	822	1	GGA CTATAACTCCGATAACGAATCAGATTTTGACAGTGATATGATAATTATT TCCCT₃CTTCTC

✰774^b	822	8663	ATA CTATAACTCCAATGACGAAATCAGTTTTACAGTGATATGATAA T₁₄

The gRNA is identified by its complementarity to the 5′ (column 1) and 3’ (column 2) number of the fully edited mRNA (+1 = 0). The gRNAs were sorted based on both mRNA regions covered and on guiding sequence class. Sequence variations observed in both the 5′ non-complementary region and the 3′ U-tail were ignored in assigning sequence classes. Transcript copy numbers (column 3) were determined by adding all gRNAs of the same sequence class. Major sequence classes were defined as containing greater than 100 transcript copies. The exact sequence shown is of the most abundant transcript in each sequence class. In the case of rare gRNA transcripts, the identified gRNAs are shown regardless of copy number. The asterisk indicates the highest scoring gRNAs identified for each population. Starred (✰) gRNAs indicate novel gRNAs not found in the KISS database (http://splicer.unibe.ch/kiss). A total of 84% of the gRNAs identified in this study are ‘novel gRNAs’, not previously identified.

aIdentified highest scoring gRNA (‘Best align’).

bmRNA 5′ border based on alternative sequence.

Figure 1.

The gRNA-mRNA sequence alignment for fully edited ATPase 6. The cDNA sequence of the most abundant gRNA in its sequence class is shown aligned beneath the fully edited mRNA. Lowercase u’s indicate uridines added by editing, asterisks indicate encoded uridines deleted during editing. Nucleotides and deletion sites in the fully edited mRNA were numbered starting from the 5′ end (+1 = 0). gRNAs are colored (on-line version only) based on transcript abundance as follows: Blue < 100; Green < 1000; Purple < 10 000; Orange < 100 000; Red > 100 000; Black = not quantified. Watson-Crick (|) and G:U base pairs (:) are indicated. Mismatches are indicated by the number sign (#) and shown in a contrasting color. The potential mRNA sequence generated by a more abundant alternative A6 initiating gRNA is also shown in Figure 1. Although this gRNA would introduce a number of sequence changes (indicated in red online) downstream of the stop codon (underlined), the generated anchor sequence for the next gRNA is maintained. The major gRNA classes involved in the editing of ATPase 6 The gRNA is identified by its complementarity to the 5′ (column 1) and 3’ (column 2) number of the fully edited mRNA (+1 = 0). The gRNAs were sorted based on both mRNA regions covered and on guiding sequence class. Sequence variations observed in both the 5′ non-complementary region and the 3′ U-tail were ignored in assigning sequence classes. Transcript copy numbers (column 3) were determined by adding all gRNAs of the same sequence class. Major sequence classes were defined as containing greater than 100 transcript copies. The exact sequence shown is of the most abundant transcript in each sequence class. In the case of rare gRNA transcripts, the identified gRNAs are shown regardless of copy number. The asterisk indicates the highest scoring gRNAs identified for each population. Starred (✰) gRNAs indicate novel gRNAs not found in the KISS database (http://splicer.unibe.ch/kiss). A total of 84% of the gRNAs identified in this study are ‘novel gRNAs’, not previously identified. aIdentified highest scoring gRNA (‘Best align’). bmRNA 5′ border based on alternative sequence.

Overall characteristics of the gRNA populations

Analyses of the gRNA populations for the extensively edited (pan-edited) mRNAs indicate that editing involved a large number of gRNA populations for full editing. For example, sequence analyses identified 31 distinct gRNA populations involved in the editing of A6 and 40 involved in the editing of COIII (Table 1, Supplementary Table S3). In addition, most of the major populations (population defined as guiding the same or near same region of the mRNA) contained multiple sequence classes. In our initial sorting of the gRNA populations, it was often difficult to initially assign gRNAs to a specific population group, as the gRNA sequence classes have significant border variations at both the 5′ and 3′ ends. Variation at the 5′ end was often due to truncation of the sequence, suggestive of a distinct 5′–3′ exonuclease activity (Figure 2). Variation in the U-tail addition site was also often observed. For example, gA6 (224–269) and gA6 (226–269) differ only by the presence of a GA that may be due to differences in polyU site selection (Table 1). Similarly, although there are 14 major sequence classes that guide the COIII 699–753 region, they can be sorted into two distinct populations (Table 2). gCOIII (699–748) and gCOIII (701–748) have near identical guiding regions, differing by a single A residue that again may be due to differences in polyU site selection. The other main population guides the editing of a region within the mRNA that is shifted downstream by only 5 nt (706–753). In this population, there are six major sequence groups. Within each group, the different sequence classes are defined by differences at the 3′ polyU site. The different groups, however, are defined by distinct differences in the sequence of the guiding region. Although groups 2 and 3 differ from group 1 by single nucleotide change (bold and underlined), other groups show multiple differences in gene sequence, all R to R or Y to Y changes, allowing multiple gRNAs to guide the generation of the same mRNA sequence.

Figure 2.

Table 2.

Major gRNA classes for the COIII 699–753 region

mRNA 5′	mRNA 3′	Copy number	Major gRNA sequence classes for the COIII 699–753 region
699	748	977	ATATA TAATAAATCCAATGAAGATAAAGTAGAGTCAGAGATATTATGATTT TTTTTTTTTT
701	748	2726	ATA TAATAAATCCAATGAAGATAAAGTAGAGTCAGAGATATTATGAT ATTTTTTTTTTTTT

706¹	753	31 331	ATATAT AAATGTAATAGATCTGATGAAAGTGAGGTAGAATTGAGAATATT TTTTTTTTTT
707¹	753	8037	ATATAT AAATGTAATAGATCTGATGAAAGTGAGGTAGAATTGAGAATAT ATTTTTTTTTTTTTT
713¹	753	595	ATATAT AAATGTAATAGATCTGATGAAAGTGAGGTAGAATTGAGAAT TTTTTTTTTAACCC
715¹	753	182	ATATAT AAATGTAATAGATCTGATGAAAGTGAGGTAGAATTGAGA TTTATTTTTTTT
719²	753	131	ATATAT AAATGTAATAGATCTGATGAAAGTGAGGTAGAATT TAGAATATATTTTTTTT
707³	753	206	ATATAT AAATGTAATAGATCTGATAAAAGTGAGGTAGAATTGAGAATAT ATTTTTTATTTTTT
706⁴	753	6744	ATATAT AAATGTAATAGATCCAATGAAGGTAAGATAGAACTGAGAATATT TTTGTTTTT
707⁴	753	3791	ATATAT AAATGTAATAGATCCAATGAAGGTAAGATAGAACTGAGAATAT AATTATTTTTTT
713⁴	753	154	ATATAT AAATGTAATAGATCCAATGAAGGTAAGATAGAACTGAGAAT TTTTTTTGTTTTTT
706⁵	753	1214	ATATAT AAATGTAATAGATTCAATGAAGGTAAGATAGAACTGAGAATATT TTTTCTTTT
707⁵	753	1124	ATATAT AAATGTAATAGATTCAATGAAGGTAAGATAGAACTGAGAATAT AATTTTTTTTTTTTTT
707⁶	752	848	ATATAT AATGTAATAAATCTAATAGAGATAAGATAGAACTGAGGATAT ATTTTTTTTTTTT

Fourteen major sequence classes that fall into two distinct populations were identified that could guide the editing of the 699–753 region. The two populations are shifted by 5 nt in their guiding regions and have 13 nt differences in the overlap region (all R to R and Y to Y changes allowing them to guide the generation of the same sequence). The 706–753 population can be further divided into six major sequence groups as indicated. Within each group the different sequence classes differ in the position of the polyU tail (changing the mRNA 5′ border). The different groups, however, are defined by distinct differences in the sequence of the guiding region (nucleotide changes shown in bold and underlined).

Example of sequential 5′ end truncations suggestive of a 5′–3′ exonuclease activity. The gA6 (248–292) sequence class was large (∼25 000 transcripts, containing a large number of transcripts with both 5′ truncations and sequence variations in the U-tail (length of U-tail and U-tail punctuated with other nucleotides). The transcript numbers reported are for the specific sequence shown (i.e. 6788 cDNAs with a T-13 tail). Major gRNA classes for the COIII 699–753 region Fourteen major sequence classes that fall into two distinct populations were identified that could guide the editing of the 699–753 region. The two populations are shifted by 5 nt in their guiding regions and have 13 nt differences in the overlap region (all R to R and Y to Y changes allowing them to guide the generation of the same sequence). The 706–753 population can be further divided into six major sequence groups as indicated. Within each group the different sequence classes differ in the position of the polyU tail (changing the mRNA 5′ border). The different groups, however, are defined by distinct differences in the sequence of the guiding region (nucleotide changes shown in bold and underlined). Analyses of the gRNAs indicate a number of other interesting features. The shortest and longest gRNAs identified in our search had 24 nt and 61 nt of complementarity to their edited mRNA, respectively. Most of the gRNAs (64%), however, had 38–48 nt of complementarity (Figure 3). In addition, most of the gRNAs had few ‘extra’ nt 5′ or 3′ to the anchor and guiding regions (Figure 4A and B). This was most striking at the 3′ end, where over 50% of the sequence classes contained no non-guiding nucleotides prior to the post-transcriptionally added U-tail. At the 5′ end, 84% of the transcripts had six or fewer nucleotides 5′ to the anchor sequence. In addition, for most of the gRNAs with a large leader sequence, the end of the anchor match was defined by a point mutation, with nucleotides 5′ to the mutation able to base pair with the anchor binding site. Although for some of the gRNAs, a mismatch within a large anchor may be tolerated, for others it signals a possibility for editing anomalies. An example of this is gND7 (152–190), which directs editing of the same region as gND7 (147–199), the initiating gRNA for the 5′ editing domain. gND7 (152–190) has a single T insertion 14 nt from the 5′ end that defines the 5′ anchor border of the gRNA (Figure 5). Analyses of the nucleotides upstream of the T insertion indicate that they can pair with 11 consecutive nucleotides within the HR3 (homology region 3) and possibly also initiate editing of the 5′ domain, directing the generation of a substantially different sequence. The two gRNAs were found in approximately equal numbers.

Figure 3.

Figure 4.

gRNA characteristics: number of non-matched nucleotides found 5′ to the anchor (A) or 3′ to the guiding region, excluding the U-tail (B). Most of the identified gRNAs had few non-complementary nucleotides.

Figure 5.

Long 5′ non-matched extensions often signal editing anomalies. gND7 (147–199) and gND7 (152–190) were both identified as initiating gRNAs for the 5′ editing domain of ND7. (A) The sequence of the two gRNAs is shown aligned below the edited ND7 mRNA sequence. A single T insertion (bold and underlined) disrupts the anchor of gND7 (152–190). (B) The putative alternative sequence generated if editing initiates with gND7 (152–190). Watson–Crick (|) and G:U base pairs (:) are indicated. The number sign (#) indicates C:A base pairs required for generation of this possible sequence.

Length of gRNA complementarity to fully edited mRNAs. The shortest and longest gRNAs identified had 24 nt and 61 nt of complementarity to their fully edited mRNA. The bulk of the sequence classes (64%) had 38–48 nt of complementarity. gRNA characteristics: number of non-matched nucleotides found 5′ to the anchor (A) or 3′ to the guiding region, excluding the U-tail (B). Most of the identified gRNAs had few non-complementary nucleotides. Long 5′ non-matched extensions often signal editing anomalies. gND7 (147–199) and gND7 (152–190) were both identified as initiating gRNAs for the 5′ editing domain of ND7. (A) The sequence of the two gRNAs is shown aligned below the edited ND7 mRNA sequence. A single T insertion (bold and underlined) disrupts the anchor of gND7 (152–190). (B) The putative alternative sequence generated if editing initiates with gND7 (152–190). Watson–Crick (|) and G:U base pairs (:) are indicated. The number sign (#) indicates C:A base pairs required for generation of this possible sequence. In our initial alignments, we identified and aligned the highest scoring gRNAs for each editing region (2 points for canonical Watson–Crick base pairs and 1 point for G-U base pairs). However, population analyses indicate that these ‘best align’ gRNAs (indicated with superscript ‘a’ in Table 1) were often rare transcripts, with the most abundant transcripts having shorter guiding regions.

gRNAs show a strong ATATA initiation bias

Previous characterization of gRNAs suggested that transcript initiation tends to occur 31–32 bp from an imperfect 18 bp inverted repeat, initiating with a 5′ RYAYA motif (7). However, other transcription initiation sequences had been observed (16). This analysis of over 3.5 million transcripts indicates a strong ATATA initiation bias, with over 74% of the transcripts initiating with this sequence. Of the ∼600 major sequence classes identified, the two most common initiation sequences were ATATAT (35%) and ATATAA (21%). Sequence classes initiating with ATATAC and ATATAG were much less common (4.3 and 2.4%, respectively, see Table 3). Unexpectedly, the third most common initiation sequence class was 5′ AAAAAA, with these gRNAs often initiating with a long 5′ adenylate-run (Figure 6A). The number of 5′ A residues involved in anchoring the gRNA varied from 0 [gCR4(504–548)] to 12 [gA6(521-567)]. These gRNAs are interesting because it is difficult to understand how they are selective. For example, the anchor binding sites for both gA6 (521–567) and gCR4 (405–458) are almost identical (Figure 6B and C). For both of these gRNAs, only a single C:G pair is involved in the initial interaction. Interestingly, gA6 (521–567) can continue the editing of two alternative sequences (described fully in the Characteristics of gRNAs for specific mRNAs section, found later in the text).

Table 3.

Most common gRNA initiation sequences

Initiating Sequence	Number of sequence classes	%	Number of transcripts	%
5′ ATATAT	214	35.2%	1 320 726	37.4%
5′ ATATAA	128	21.1%	870 728	24.7%
5′ AAAAAA	28	4.6%	45 540	1.3%
5′ ATATAC	26	4.3%	134 862	3.8%
5′ ATACAA	17	2.8%	60 428	1.7%
5′ ATATTA	16	2.6%	31 779	0.9%
5′ ATATAG	15	2.4%	269 306	7.6%
5′ ATAAAT	15	2.6%	23 669	0.7%
5′ ATACAT	13	2.2%	80 453	2.2%
5′ ATAAAA	12	2.1%	38 907	1.1%
5′ ATAAAG	9	1.5%	176 446	5.0%
5′ ATACTA	8	1.4%	58 797	1.7%

The major sequence classes identified were grouped based on the first 6 nt and sorted based on both the number of sequence classes and the total number of transcripts found. Most transcripts (∼74%) initiated with ATATA (includes ATATAT (37.4%)), ATATAA (24.7%, ATATAC (3.8%)) and ATATAG (7.6%).

Figure 6.

Identification of gRNAs that initiate with long A-runs. (A) Examples of identified gRNAs initiating with long A-runs. Sequence complementary to the fully edited mRNA is underlined. (B) Alignment of gCR4(405–548) with its corresponding edited mRNA. (C) Alignment of gA6(521–567) with its corresponding edited mRNA. The gRNA cDNA sequence is shown aligned beneath the fully edited mRNA as described in Figure 1. The partial sequence of the downstream gRNA that directs the creation of the anchor-binding site is also shown. For both of the these gRNAs, the anchor interaction (bold font) involves a single G:C base pair. Two downstream gRNAs were identified that direct editing of the A6 550–570 region. gA6 (549–593) would direct the insertion of 12 U-residues, while gA6(557–593) would direct the insertion of 11 U-residues. gA6(557–593) is much more abundant (∼70 000 versus ∼2500 transcripts identified). (D) Comparison of the protein sequences generated by the conventional 12U-edited (top line) and alternatively 11U-edited A6 transcripts. The alternative protein sequence (double underlined) is 11 AA shorter.

gRNA population numbers

Analyses of the gRNA populations show significant differences in the number of identified gRNA transcripts that guide a specific region. Because the zero mismatch data contained only correctly matched gRNAs, we were able to quantify the total number of identified gRNA transcripts that covered any 1 nt in the fully edited sequence. The data for A6, CR3 and RSP12, which clearly show the large variation in identified gRNAs responsible for the editing of specific regions, are illustrated in Figure 7A–C. The data for all other transcripts can be found in the supplementary files. Although some editing sites are covered by a single gRNA (for example, the initiating gRNA for A6 (gA6-770–822) is represented by a single transcript), other editing sites are covered by hundreds of thousands of gRNA transcripts. The edited RSP12 sequence, nucleotides 203–246, had the highest gRNA transcript coverage, with over 350 000 identified transcripts (Figure 7C and supplementary data). Approximately 340 000 of those transcripts were found in a single sequence class. However, a total of 27 different sequence classes covered this region. In this analysis, the number of gRNAs in any sequence class was determined by sorting gRNAs based on sequence within only the anchor and guiding regions. The 5′ end differences (due to truncation) and differences in the polyU tail length and the interruption of the polyU tail with other nucleotides were ignored. All gRNAs with identical sequence within this region were then summed to determine the total number within the sequence class. In our initial analysis, identical sequences (redundant reads) were collapsed and the number of identical reads recorded. In the sorting of the raw data, it became clear that some gRNA sequences were abundant (highest copy number for any one sequence was 48 097). Because library preparation does include a PCR amplification step, we do note that the some gRNAs may be preferentially amplified. However, in determining population numbers, abundant gRNAs were most often represented by multiple reads with differences in both the 5′ and 3′ regions outside of the anchor/guiding region. In addition, the abundant gRNAs were most often found in large populations with multiple related sequence classes. This suggests that the most abundant gRNAs were associated with high copy number minicircles. More surprising was the number of edited regions covered by low gRNA numbers, especially for those transcripts that are constitutively edited. Interestingly, we note that the initiating gRNA for most of the transcripts were found in low copy numbers. The initiating gRNAs for A6 (1 transcript identified), ND7 (6 transcripts) and ND8 (27 transcripts) were rare (<100 identified transcripts). Although the initiating gRNAs for COIII (111 transcripts), CR4 (511 transcripts), ND3 (545), ND9 (274 transcripts) and RSP12 (128 transcripts) were slightly more abundant, they were still not found in the numbers expected.

Figure 7.

Abundance of gRNA transcripts (y-axis, note log scale) that align to a respective nucleotide in the fully edited mRNA. Both nucleotides and deletion sites in the fully edited mRNA were numbered starting from the 5′ end (+1 = 0). Shaded boxes indicate identified gRNAs that cover specific editing sites. Data include only the identified conventional gRNAs. (A) Identified gRNAs for the A6 edited mRNA. (B) Identified gRNAs for CR3. (C) Identified gRNAs for ribosomal protein subunit 12 (RSP12). All individual data points were designated with open circles. Close overlapping of individual data points generate solid black lines.

Characteristics of gRNA populations for specific mRNAs

ATPase 6

A total of 32 gRNA populations that could guide the editing of A6 were identified. Although the minimum overlap observed was 8 nt, the average overlap was 19 nt, indicative of the extensive overlap observed for many of the gRNA populations (Figure 1). Most of the gRNA populations were reasonably abundant, with two notable exceptions; gA6 (770–822), the initiating gRNA, and gA6 (747–789), which directs the editing just upstream of the initiating guide. The sequence for gA6 (770–822) has been previously identified as gA6-14 (17). A limited search of our mismatch databases did identify an alternative gRNA that could initiate editing for A6 (see Figure 1). Significantly, although the alternative sequence does introduce a number of sequence changes (all downstream of the stop codon), the generated anchor sequence for the next gRNA is maintained. A second region with two identified gRNAs that can generate a sequence anomaly was identified at position 556–567. Four different sequence classes were identified that direct the editing in this region. Three of the sequence classes (gA6 (546–593), gA6(549–593) and gA6(549–592), direct the ‘correct’ insertion of 12 U into this site. However, the most abundant gRNA, gA6 (557–593), would in fact direct the insertion of 11 U instead of the described 12 U in the 556–567 editing site, but correctly edit (insertion of 5 U) the next upstream site (550–554) (Figure 6C). The gRNA that initiates editing in the alternatively edited site is unusual, in that its’ anchor consists of an ‘A’ run with a single G-C pair (gA6 (521–567)). The 11 U sequence decreases the anchor for gA6 (521–567) by a single A-U base pair, suggesting that it could anchor and continue editing for both generated sequences. An analysis of the 11 U open reading frame indicates that the frame shift would generate a new carboxyl terminus that is only 11 AA shorter (Figure 6D).

Cytochrome oxidase III

Complete gRNA coverage for the conventionally edited COIII transcript was obtained with the identification of 40 gRNA populations. Similar to A6, the average overlap of the gRNAs was ∼19 nt (maximum overlap = 36 nt; minimum overlap = 8 nt). Although alternative editing of COIII has been reported, we did not identify a gRNA that could direct editing of the alternative sequence. In COIII, alternative editing involves a gRNA that directs the insertion of two U-residues instead of the conventional three between nucleotides G458 and A462 (18). The sequence changes directed by the alternative gRNA links the open reading frame of the edited 3′ end to an ORF found in the 5′ pre-edited sequence allowing the production of a different protein. In our sequence data, we identified a number of different sequence classes that direct editing of the transition site. gCOIII (456–499) matches the sequence of the gRNA previously identified as directing the conventional COIII editing (insertion of three U-residues). The three most abundant gRNAs in this region, however, all direct editing through nucleotide U461, and would direct the insertion of a single U-residue into the alternatively edited site. No gRNA that could direct the alternative sequence was identified. The previously identified alternative gRNA, which deviates from the conventional sequence only near its 3′ end, was not found in our library.

C-rich regions 3 and 4

The CR3 transcript is small, with extensive editing generating a transcript of ∼310 nt. Interestingly, we were only able to identify gRNAs that direct the editing of the 5′ end of this transcript. gRNAs that matched the published sequence downstream of nucleotide 196 were rare and no transcripts were identified that could direct the editing between nucleotides 275 and 292. We were able to identify a gRNA that could initiate editing of the CR3 transcript, but it would generate a sequence distinctly different from that published (see supplementary data). In contrast, a full complement of gRNAs (18 populations) was identified for the CR4 transcript (average overlap = 16 nt, maximum = 37, minimum = 8). In this study, we used the consensus-edited sequence found for the blood form stage of the parasite (19). In procyclic forms, although the 3′ portion of the transcript is identical to that found in blood forms, the consensus sequence diverges at nucleotide 312, and no consensus was determined upstream of nucleotide 256. It has been previously reported that gRNAs for the developmentally regulated mRNAs are present in both life cycle stages (16). We do note that most (10 of 18) of the populations identified were low abundant populations (<1000 transcripts identified).

Cytochrome B and Murf II

RNA editing of both the CYb and Murf II is limited, occurring only near the 5′ ends of the transcripts (20,21). These transcripts require only two gRNAs for complete editing and both have some interesting and unusual characteristics. In Murf II, one of the gRNAs is known to be maxicircle encoded. The gene is located near the 5′ border of the ND4 gene and appears to be independently transcribed (22). This gRNA guides almost all of the editing required. The search of our database did identify a single gRNA sequence class involved in editing this region [gMurf II (30–79)] that matches the identified maxicircle gene. We did not identify the initiating gRNA for Murf II, which must direct the insertion of a single U-residue and also, we presume, the first few editing site to generate the anchor binding site for gMurfII-2. Because the amount of editing by this gRNA is so limited, it may be that both our size selection and the stringency of our selection precluded our ability to detect this transcript. We did identify both gRNA populations involved in the editing of CYb. gRNAs that can initiate editing of CYb had been previously identified (gCYb-558 and gCYb560A and B) and are unusual because they are not flanked by the 18 bp inverted repeats characteristic of most gRNA genes (7,8,11,23). Interestingly, although the initiating gRNAs identified in our gRNA library are similar to those previously described, only one of the major sequence classes (gCYb (54–91) matches one previously published (gCYb-560A), and it does differ at its 5′ end, in that it initiates with a run of A-residues (Figure 8). Most of the initiating gRNA sequence classes did in fact initiate with a run of A’s. In contrast to the initiating gRNA (over 31 000 transcripts detected), the sequential gRNA (gCYb (32–64) was much less abundant (∼1000 total transcripts detected). It also initiates with a run of A-residues.

Figure 8.

gCYb (54–91) transcript alignment with minicircle MCP23. The minicircle sequence is shown on top with both gCYb(54–91) and the previously identified gCYb-560A aligned underneath.

NADH dehydrogenase subunits 3, 7, 8 and 9

Full complements of gRNAs were not identified for any of the ND subunits. For most of these transcripts, editing is developmentally regulated, with full editing only observed in the bloodsteam stages (24–27). gRNAs that covered all of the fully edited nucleotides for both ND7 and ND9 were identified. Some of the identified gRNAs, however, had distinct mismatches, and it is unclear if these gRNAs would generate the correct sequence (see supplementary data). The ND3 mRNA transcript is edited in two domains with only the large 5′ domain edited to a single consensus sequence (27). The much smaller 3′ domain (nucleotides 375–395) shows several editing patterns, and no gRNAs that span this variable region were identified. We did not, however, search for gRNAs using all of the reported sequence variations. Twenty-five gRNA populations that could direct the editing of ND8 were identified, with good coverage of the edited sequence except for nucleotides 540–554. In addition, a number of the identified gRNAs have mismatches. For example, the rare gND8 (161–198) transcript (<100 transcripts identified) is a perfect match to the published sequence. A near identical gRNA, gND8 (161–187) is abundant (>100 000 transcripts identified) and has a single A:T transversion that introduces a mismatch into the long (20 bp) anchor region. This region is also covered by another abundant gRNA (gND8(158–186)(>150 000 transcripts)). Although this gRNA has several mismatches to the published sequence, the edited sequence it would guide introduces a single amino acid change. Importantly, editing 5′ to the alternative sequence is not affected, maintaining the anchor-binding site for the sequential gRNA (Figure 9).

Figure 9.

Abundant mismatched gRNA generates edited sequence with a single AA change. (A) Two gRNAs that edited the same region as the rare gND8(161–198) were identified that contain mismatches to the conventional edited sequence of ND8 (mismatches in bold and underlined). The gRNAs are shown aligned beneath the fully edited mRNA as described in Figure 1. (B) gND8(158–186) is the most abundant of the three transcripts, and would generate an edited sequence with a single amino acid change.

Ribosomal Protein S12

A total of 12 gRNA sequence populations are involved in the editing of RSP12 (28). These include one of the most abundant populations [gRSP12(200–246)] identified, with ∼350 000 transcripts in 27 different sequence classes. In contrast to this region, few transcripts were identified that covered nucleotides 122–168. This region is interesting because it does contain a high percentage of C-residues, and the few gRNA transcripts identified all contained C:A mismatches. Because we did not allow for C:A basepairs, it may be that the bulk of the gRNAs that guide this region were not identified.

DISCUSSION

Deep sequencing of the gRNA transcriptome has allowed the identification of a near full complement of gRNAs needed for the extensive editing observed in T. brucei. A total of 642 different major sequence classes were identified, 84% of which are novel (not previously identified). Characterization of this population has identified a number of interesting and unusual features. These include the extreme population differences in the identified gRNAs and the identification of gRNAs that initiate with long A runs. Generation of the gRNA library did include a limited PCR that may exaggerate differences in population number. However, the association of abundant gRNAs with populations containing multiple sequence classes suggests that these represent gRNAs transcribed from amplified abundant minicircles. More surprising to us was the number of identified gRNAs found in low copy number. The high stringency of our initial screen suggests that the editing of these regions may be directed by gRNAs not identified in our initial screen. This postulate is supported by the identification of alternative gRNAs or gRNAs with internal mismatches for some of the low coverage regions. For example, alternative gRNAs were identified for the initiating gRNAs of both A6 and CR3 and a limited search of our mismatch databases did identify a number of abundant gRNAs with internal mismatches (see Figure 9). Unfortunately, the mismatch files are large and difficult to work with. Preliminary inspection of the files indicates that they identify hundreds of thousands of gRNAs with characteristics that suggest they are misaligned. These characteristics include gRNA/mRNA matched regions that contain mostly G:U base pairs, the presence of long 5′ and 3′ extensions outside of the matched regions and the lack of a defined 5′ anchor (the characteristic bias towards Watson-Crick base pairing in the anchor region). We are currently working to refine our search parameters to more heavily weight-specific gRNA characteristics, to sort these files into manageable units. The large numbers of low copy number gRNAs, however, are suggestive of high plasticity within the gRNA encoding minicircles. Previous studies in Leishmania have shown that homologous minicircle sequence class frequencies are extremely variable, even between different isolates from the same strain taken after several years of culture (29,30). In addition, previous studies have indicated that genes encoding highly edited RNAs accumulated mutations at a higher frequency than their unedited homologs in closely related species (31). The rapid evolution of the gRNA populations would explain the rapid changes observed in the protein coding genes. We have in fact identified a number of gRNAs that would generate an mRNA sequence that differs from the consensus sequence determined in the early 1990s. Currently, we are working to determine if we can detect these alternative sequences in the mRNA population. Deep sequencing of the gRNA population also detected other interesting characteristics. Although the bulk of the gRNAs did initiate with the strong ATATA initiation sequence bias previously reported, a new unexpected class of gRNAs that initiate with long runs of adenylate residues was identified. These gRNAs are interesting because it is difficult to understand how they are selective. The sequential dependence of each gRNA on downstream editing indicates that the overall efficiency is dependent on the efficiency of each gRNA-targeting event. For example, a 90% efficiency rate for each COIII gRNA (40 gRNAs required) would result in <2% fully edited transcripts. We had hypothesized that this evolutionary pressure would select for gRNAs with specific and efficient gRNA targeting characteristics. The identification of a potentially alternatively edited site in A6 (see Figure 6) does suggest that this class of gRNAs may play a significant biological role. The long adenylate anchor of gA6 (521–567) allows it to target both 11U and 12U transcripts, suggesting that both editing events can lead to fully mature and translatable mRNAs. The presence of distinct 5′ truncated gRNAs makes it impossible to determine if the large variability observed in the number of 5′ adenylate residues found in these transcripts is due to 5′ exonuclease activity or to its mechanism of synthesis. Minicircle encoded genes have not been identified for most of the A-run gRNAs. gCYb54-91, however, is a close match to a CYb gRNA encoded on minicircle MCP-23 (23). Interestingly, examination of the primer extension data in the original publication does suggest that this gRNA initiates with multiple adenylates in vivo. The A-run aligns with an A4 stretch on the minicircle, suggesting that it could be produced by a slippage (stuttering) mechanism (32,33). We do note that the existence of 5′–3′ RNA degradation activities in the mitochondria of trypanosomes has not previously been described (34). However, the large numbers of 5′ serial truncated transcripts is suggestive of 5′–3′ exonuclease activity. Although we cannot rule out a contaminating exonuclease from the cytoplasm, a 5′–3′ exonuclease for gRNA recycling may be evolutionarily favorable, as it would remove the anchor targeting sequence first, possibly preventing partially degraded gRNAs from initiating any mRNA editing. Because of the U4 requirement in our initial filter, we would not have identified gRNAs with 3′ truncations. This is the first study of trypanosomal gRNAs using high-throughput sequencing. In this work, we have defined a near complete set of the gRNAs required for the extensive editing found in T. brucei. The identification of this comprehensive set of gRNAs will allow the characterization of the sequence and structural features important for efficient targeting and should provide insight into the evolution of small RNA targeting strategies.

ACCESSION NUMBERS

SAMN02204165 NCBI's Sequence Read Archive.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

National Institutes of Health [R03AI0902 to D.K.]; all undergraduates funded on NSF 04-546 (to J.H., T.T. and J.L.); Interdisciplinary Training for Undergraduates in Biological and Mathematical Sciences. Funding for open access charge: NIH. Conflict of interest statement. None declared.

33 in total

1. Genomic organization of Trypanosoma brucei kinetoplast DNA minicircles.

Authors: Min Hong; Larry Simpson
Journal: Protist Date: 2003-07

2. Identification and characterization of trypanosome RNA-editing complex components.

Authors: Kenneth Stuart; Aswini K Panigrahi; Achim Schnaufer
Journal: Methods Mol Biol Date: 2004

3. An extensively edited mitochondrial transcript in kinetoplastids encodes a protein homologous to ATPase subunit 6.

Authors: G J Bhat; D J Koslowsky; J E Feagin; B L Smiley; K Stuart
Journal: Cell Date: 1990-06-01 Impact factor: 41.582

4. Creation of AUG initiation codons by addition of uridines within cytochrome b transcripts of kinetoplastids.

Authors: J E Feagin; J M Shaw; L Simpson; K Stuart
Journal: Proc Natl Acad Sci U S A Date: 1988-01 Impact factor: 11.205

5. Single-step method of RNA isolation by acid guanidinium thiocyanate-phenol-chloroform extraction.

Authors: P Chomczynski; N Sacchi
Journal: Anal Biochem Date: 1987-04 Impact factor: 3.365

6. Sequence heterogeneity in kinetoplast DNA: reassociation kinetics.

Authors: M Steinert; S Van Assel
Journal: Plasmid Date: 1980-01 Impact factor: 3.466

Review 7. Uridine insertion/deletion editing in trypanosomes: a playground for RNA-guided information transfer.

Authors: Ruslan Aphasizhev; Inna Aphasizheva
Journal: Wiley Interdiscip Rev RNA Date: 2011-03-23 Impact factor: 9.957

Review 8. Evolution of RNA editing in trypanosome mitochondria.

Authors: L Simpson; O H Thiemann; N J Savill; J D Alfonzo; D A Maslov
Journal: Proc Natl Acad Sci U S A Date: 2000-06-20 Impact factor: 11.205

9. An intragenic guide RNA location suggests a complex mechanism for mitochondrial gene expression in Trypanosoma brucei.

Authors: Sandra L Clement; Melissa K Mingler; Donna J Koslowsky
Journal: Eukaryot Cell Date: 2004-08

10. Editing of kinetoplastid mitochondrial mRNAs by uridine addition and deletion generates conserved amino acid sequences and AUG initiation codons.

Authors: J M Shaw; J E Feagin; K Stuart; L Simpson
Journal: Cell Date: 1988-05-06 Impact factor: 41.582

36 in total

Review 1. High throughput sequencing revolution reveals conserved fundamentals of U-indel editing.

Authors: Sara L Zimmer; Rachel M Simpson; Laurie K Read
Journal: Wiley Interdiscip Rev RNA Date: 2018-06-11 Impact factor: 9.957

Review 2. Constructive edge of uridylation-induced RNA degradation.

Authors: Ruslan Aphasizhev; Takuma Suematsu; Liye Zhang; Inna Aphasizheva
Journal: RNA Biol Date: 2016-10-07 Impact factor: 4.652

3. RNA binding and core complexes constitute the U-insertion/deletion editosome.

Authors: Inna Aphasizheva; Liye Zhang; Xiaorong Wang; Robyn M Kaake; Lan Huang; Stefano Monti; Ruslan Aphasizhev
Journal: Mol Cell Biol Date: 2014-09-15 Impact factor: 4.272

4. Trypanosomatid mitochondrial RNA editing: dramatically complex transcript repertoires revealed with a dedicated mapping tool.

Authors: Evgeny S Gerasimov; Anna A Gasparyan; Iosif Kaurov; Boris Tichý; Maria D Logacheva; Alexander A Kolesnikov; Julius Lukeš; Vyacheslav Yurchenko; Sara L Zimmer; Pavel Flegontov
Journal: Nucleic Acids Res Date: 2018-01-25 Impact factor: 16.971

5. Trypanosome RNA Editing Mediator Complex proteins have distinct functions in gRNA utilization.

Authors: Rachel M Simpson; Andrew E Bruno; Runpu Chen; Kaylen Lott; Brianna L Tylec; Jonathan E Bard; Yijun Sun; Michael J Buck; Laurie K Read
Journal: Nucleic Acids Res Date: 2017-07-27 Impact factor: 16.971

Review 6. U-Insertion/Deletion mRNA-Editing Holoenzyme: Definition in Sight.

Authors: Inna Aphasizheva; Ruslan Aphasizhev
Journal: Trends Parasitol Date: 2015-11-10

Review 7. Lexis and Grammar of Mitochondrial RNA Processing in Trypanosomes.

Authors: Inna Aphasizheva; Juan Alfonzo; Jason Carnes; Igor Cestari; Jorge Cruz-Reyes; H Ulrich Göringer; Stephen Hajduk; Julius Lukeš; Susan Madison-Antenucci; Dmitri A Maslov; Suzanne M McDermott; Torsten Ochsenreiter; Laurie K Read; Reza Salavati; Achim Schnaufer; André Schneider; Larry Simpson; Kenneth Stuart; Vyacheslav Yurchenko; Z Hong Zhou; Alena Zíková; Liye Zhang; Sara Zimmer; Ruslan Aphasizhev
Journal: Trends Parasitol Date: 2020-02-28

Review 8. Trypanosome RNA editing: the complexity of getting U in and taking U out.

Authors: Laurie K Read; Julius Lukeš; Hassan Hashimi
Journal: Wiley Interdiscip Rev RNA Date: 2015-11-02 Impact factor: 9.957

9. Assembly and annotation of the mitochondrial minicircle genome of a differentiation-competent strain of Trypanosoma brucei.

Authors: Sinclair Cooper; Elizabeth S Wadsworth; Torsten Ochsenreiter; Alasdair Ivens; Nicholas J Savill; Achim Schnaufer
Journal: Nucleic Acids Res Date: 2019-12-02 Impact factor: 16.971

Review 10. Dynamic RNA holo-editosomes with subcomplex variants: Insights into the control of trypanosome editing.

Authors: Jorge Cruz-Reyes; Blaine H M Mooers; Pawan K Doharey; Joshua Meehan; Shelly Gulati
Journal: Wiley Interdiscip Rev RNA Date: 2018-08-12 Impact factor: 9.957