Literature DB >> 25569156

The majority of transcripts in the squid nervous system are extensively recoded by A-to-I RNA editing.

Shahar Alon¹, Sandra C Garrett², Erez Y Levanon³, Sara Olson², Brenton R Graveley², Joshua J C Rosenthal⁴, Eli Eisenberg⁵.

Abstract

RNA editing by adenosine deamination alters genetic information from the genomic blueprint. When it recodes mRNAs, it gives organisms the option to express diverse, functionally distinct, protein isoforms. All eumetazoans, from cnidarians to humans, express RNA editing enzymes. However, transcriptome-wide screens have only uncovered about 25 transcripts harboring conserved recoding RNA editing sites in mammals and several hundred recoding sites in Drosophila. These studies on few established models have led to the general assumption that recoding by RNA editing is extremely rare. Here we employ a novel bioinformatic approach with extensive validation to show that the squid Doryteuthis pealeii recodes proteins by RNA editing to an unprecedented extent. We identify 57,108 recoding sites in the nervous system, affecting the majority of the proteins studied. Recoding is tissue-dependent, and enriched in genes with neuronal and cytoskeletal functions, suggesting it plays an important role in brain physiology.

Entities: Chemical

Keywords: Doryteuthis pealeii; RNA editing; evolutionary biology; genomics; neuroscience; recoding

Mesh：

Substances：

Year: 2015 PMID： 25569156 PMCID： PMC4384741 DOI： 10.7554/eLife.05198

Source DB: PubMed Journal: Elife ISSN： 2050-084X Impact factor: 8.140

Introduction

The central dogma of biology maintains that genetic information passes faithfully from DNA to RNA to proteins; however, with the help of tools such as alternative splicing, organisms use RNA as a canvas to modify and enrich this flow of information. RNA editing by deamination of adenosine to inosine (A-to-I) is another process used to alter genetic information (Nishikura, 2010). Unlike alternative splicing, which shuffles relatively large regions of RNA, editing targets single bases in order to fine-tune protein function. Because inosine is interpreted as guanosine by the cellular machinery, this process can recode codons (Basilio et al., 1962). A-to-I RNA editing is catalyzed by the ADAR (adenosine deaminase that acts on RNA) family of enzymes. All eumetazoans, from cnidarians to mammals, express ADARs but the extent to which they use them to recode has been explored in few representatives (Nishikura, 2010). Recent advances in DNA sequencing and the supporting computational analyses have permitted transcriptome-wide screens for RNA editing events. So far, such studies have been limited to organisms with a sequenced genome (Ramaswami et al., 2012, 2013). In general, these screens have looked for variation in RNA at positions that are invariant in the genome. In humans, inosine is abundant in RNA (Paul and Bass, 1998; Bazak et al., 2014), but almost all of it lies within transcribed repetitive elements in untranslated regions or introns (Nishikura, 2010). A compilation of recoding sites in human transcriptomes revealed 1183 events (Xu and Zhang, 2014), but most were observed in only a single sample. Individual searches (Danecek et al., 2012; Ramaswami et al., 2013) uncovered only 115 (non-repetitive) recoding events, and 53 in mice; 34 recoding sites are conserved across mammals (Pinto et al., 2014). In Drosophila, an order of magnitude more recoding sites have been identified, residing in about 3% of all messages (St Laurent et al., 2013). Although individual editing sites are clearly essential (Brusa et al., 1995), these data suggest that RNA editing is not a broadly used mechanism for proteome diversification. However, anecdotal data suggest this assumption might not apply across the animal kingdom. For example, using traditional cloning methods, scores of recoding sites have been uncovered in a small number of squid and octopus transcripts encoding potassium channels, ADARs, and ion pumps (Garrett and Rosenthal, 2012a). As for most organisms, there are no genomes available for cephalopods. Here we apply a novel approach for editing site detection in the absence of a sequenced genome. We use it to comprehensively identify editing sites in the squid giant axon system and other areas of the nervous system. Surprisingly, almost 60% of all mRNAs studied harbor recoding events, and most at multiple sites. These data show orders of magnitude more recoding in the squid proteome than in any other species studied to date. In squid, editing is so pervasive that the central dogma should be modified to include this process. Our results open the possibility that extensive recoding is common in many organisms, rivaling alternative splicing as a means of creating functional diversity.

Results and discussion

To detect RNA editing sites in the squid nervous system, we generated millions of RNA and genomic DNA reads from an individual squid. Our method differed from previous approaches by using a de novo transcriptome as the point of reference instead of a genome (Figure 1A). The transcriptome was assembled from RNA-seq reads, and each nucleotide within it represents the consensus of many reads. If the majority of RNA reads were edited (‘strong’ editing sites), the transcriptome would differ from the genomic DNA and read ‘G’ where gDNA reads would show ‘A’ (the sequencing process identifies inosines as guanosines). We detected such sites by aligning DNA-seq reads to the transcriptome (Figure 1B). At positions where editing occurred in the minority of RNA-seq reads (‘weak’ editing sites), however, the transcriptome and the genomic DNA would be identical. These sites were detected by identifying variability in RNA-seq, but not DNA-seq, reads (Figure 1B). This general approach is applicable to all organisms that lack a sequenced genome.

Figure 1.

A general approach to detect RNA editing sites in organisms that lack a sequenced genome.

DOI: http://dx.doi.org/10.7554/eLife.05198.003

A general approach to detect RNA editing sites in organisms that lack a sequenced genome.

(A) Squid RNA-seq data is used to create a de novo transcriptome followed by the detection of conserved ORFs. (B) ‘Weak’ and ‘strong’ editing sites are detected by comparing RNA and DNA reads from the same animal to the ORFs from the transcriptome. ‘Weak’ editing sites were detected by observing the minority of the RNA reads to differ from the consensus transcriptome nucleotide. ‘Strong’ editing sites, where the consensus transcriptome includes the edited nucleotide, were detected by observing all DNA reads to differ from the transcriptome nucleotide. DOI: http://dx.doi.org/10.7554/eLife.05198.003 We sequenced cDNA from the giant axon system (giant fiber lobe: GFL), the optic lobes (OL) and matching germline gDNA isolated from the same animal. cDNA was also sequenced from the vertical lobe (VL), buccal ganglion (BG) and the Stellate Ganglion (SG) from another animal (The SG and GFL are parts of the peripheral nervous system; all the rest are from the central nervous system). The GFL and OL RNA-seq reads were used to construct a transcriptome model (Grabherr et al., 2011). To focus on editing sites inside bona fide coding regions, we retained only transcript-fragments with open reading frames (ORFs) homologous to known proteins (UniProt Consortium, 2014) (Figure 1A) and used the editing detection procedures outlined in Figure 1B. Surprisingly, our pipeline identified 81,930 weak sites, and 5644 strong sites, due to A-to-G transitions (Figure 2A). Only 12,403 weak sites and 219 strong sites were identified for the other 11 possible types of modifications. These numbers suggest false-positive rates of 15% and 4% respectively, mainly due to transcriptome assembly problems, SNPs, somatic mutations and systematic mis-alignments. Note that these false-positive rates are considerably lower than those for similar searches for editing within human coding sequences, where a genome reference was employed (Ramaswami et al., 2013).

Figure 2.

High number of RNA editing sites in squid translates into an extraordinary number of recoding events.

(A) The number of nucleotide modifications observed in the squid nervous system for each possible substitution type (in blue, 87% of all detected modifications were A-to-G). A similar analysis of human and Rhesus macaque sequencing data (green and brown, respectively) shows low levels, and no enrichment, of A-to-I editing in coding regions, as reported previously. In the inset, the distribution of nucleotide modifications observed in squid mitochondria-encoded genes, used here as a negative control. The ADAR enzymes have no reported activity in the mitochondria and, accordingly, no A-to-G overrepresentation is observed. Also see Figure 2—figure supplement 1. (B) Scope of recoding due to RNA editing in squid, both in the total number of recoding events and the total number of genes affected, is orders of magnitude higher than human, mouse, and fly (numbers for other organisms are based on recent publications using RNA-seq datasets comparable to the one used here [Danecek et al., 2012; Ramaswami et al., 2013; St Laurent et al., 2013]).

DOI: http://dx.doi.org/10.7554/eLife.05198.004

(A) The number of modifications observed for each possible modification type, considering only modifications that appear in clusters. About half of the A-to-G modifications appear in clusters of at least three consecutive same-type mismatches, in accordance with the expected properties of A-to-I editing sites, found in other organisms (Morse et al., 2002; Levanon et al., 2004). (B) The number of reads with 3, 4, and 5 consecutive identical mismatches for each possible modification type. Most of these reads contained A-to-G modifications. (C) The sequence surrounding of the observed A-to-G modifications, compared with that surrounding random adenosines in our model transcriptome. The sequence surrounding the ‘weak’ sites and the ‘strong’ sites (Figure 1B) are similar to each other and to what is known for other species (Kleinberger and Eisenberg, 2010).

DOI: http://dx.doi.org/10.7554/eLife.05198.005

(A) For the A-to-G sites, different tissues show varying levels of editing, globally, as well as site-specific tissue-dependent regulation. For example, higher modification levels (red) are observed in the GFL tissue whereas low levels (green) are observed in the VL tissue. Yet, some sites (top rows) are edited more strongly in VL. (B) Hierarchical clustering of the non A-to-G modification levels in the five different neuronal tissues. Consistently with the modifications being due to genomic polymorphisms, data cluster according to the animal it was taken from: modification levels in the GFL and OL tissues, which were taken from one individual animal, form one cluster, as do the VL, BG and SG tissues, taken from another individual animal. Modification levels are, by and large, uniform across tissues coming from the same individual animal. Note that in both panels only sites with significantly variable modification levels are presented (binomial analysis was performed with Bonferroni-corrected p-value of 0.05 as a cutoff), each row represents one modification site. Abbreviations: Giant fiber lobe (GFL), Optic lobe (OL), Vertical lobe (VL), Buccal ganglia (BG), and Stellate ganglion (SG).

DOI: http://dx.doi.org/10.7554/eLife.05198.006

(A) An example of editing sites verified using Sanger sequencing in the squid protein piccolo. Arrowheads mark the locations of the editing sites. (B) Editing levels measured by Sanger sequencing for the 40 sites correlate with RNA sequencing results.

DOI: http://dx.doi.org/10.7554/eLife.05198.007

(A) The distribution of the quality scores for all the sites used (all the positions inside all the analyzed reads), A-to-G modifications, and non A-to-G modifications. No difference is observed between these three groups. Note that sites with Q < 30 were excluded. (B) The number of mismatches detected as a function of the position inside the read. Non A-to-G mismatches tend to occur at reads' ends, suggesting alignment artifacts (which tend to affect reads' ends) are responsible to some of these mismatches (Ramaswami et al., 2012). A-to-G mismatches do not show such tendency. (C) The distribution of modification levels for A-to-G and non A-to-G sites, for the GFL and OL tissues. The increased number of non A-to-G sites with ∼50% modification level hint at some genomic polymorphisms (SNPs), that were not represented in our DNA reads due to the limited coverage, are included among the non A-to-G mismatches. Consistently, 51% of the sites with non A-to-G modification levels between 40–60% recur in both tissues (coming from the same individual animal), compared to only 22% of the A-to-G modifications in the same range. Similarly, 50% of non A-to-G modification levels higher than 90% recur in both tissues (coming from the same individual animal), compared to only 21% for A-to-G modifications in the same range. These two ranges are the only ones in which such difference is observed. Abbreviations: Giant fiber lobe (GFL), Optic lobe (OL), quality score (Q).

DOI: http://dx.doi.org/10.7554/eLife.05198.008

DOI: http://dx.doi.org/10.7554/eLife.05198.009

High number of RNA editing sites in squid translates into an extraordinary number of recoding events.

Figure 2—figure supplement 1.

A-to-G modifications appear in clusters of consecutive identical mismatches and show distinctive 5′ and 3′ neighbor preferences.

DOI: http://dx.doi.org/10.7554/eLife.05198.005

DOI: http://dx.doi.org/10.7554/eLife.05198.004

A-to-G modifications appear in clusters of consecutive identical mismatches and show distinctive 5′ and 3′ neighbor preferences.

Hierarchical clustering reveals tissue selectivity in the modification levels of the A-to-G sites, but not in the non A-to-G sites.

Validation of editing using Sanger sequencing.

Quality controls for the A-to-G modifications and the non A-to-G modifications.

Number of modification sites detected as a function of the amount of DNA and RNA reads.

(RNA-seq from the GFL and OL tissues only, originating from the animal whose DNA was sequenced). (A) The number of A-to-G sites detected increases with the number of RNA reads, with no sign for saturation. Thus, we expect the number of editing sites to be much larger than the one reported here. (B) The number of A-to-G sites detected in each gene correlates with the gene's RNA coverage, demonstrating that with much larger RNA-seq data, the number of detected editing sites could be as high as ∼200,000 (expected number of 17 sites per protein, on average, for each of the ∼12K ORFs in our model transcriptome). (C) The number of modification sites detected as a function of the number of DNA reads. Detection of modification sites is based on mismatches between cDNA reads and the consensus. However, one of the main sources for such mismatches, which masks the signal due to RNA editing, is heterozygosity of the genome. The more DNA reads available, the better one can identify and exclude genomically heterozygous sites (SNPs) and improve signal-to-noise ratio. (i) ‘weak’ sites detection (ii) ‘strong’ sites detection - here exclusion of SNPs is part of the detection scheme itself (see Methods) and thus the number of detected sites (and not only the signal-to-noise ratio) increases with gDNA coverage (iii) ‘Weak’ and ‘strong’ sites, combined. Abbreviations: Giant fiber lobe (GFL), Optic lobe (OL), Standard Error (SE). DOI: http://dx.doi.org/10.7554/eLife.05198.009 Although the number of A-to-G discrepancies was unexpectedly large, subsequent analyses support the idea that they are caused by RNA editing rather than other sources of error. First, we applied our pipeline to similarly sized data sets from a human blood sample and from the rhesus macaque brain, each containing matching RNA and DNA sequence reads. As expected for mammals, the quantity of AG mismatches in coding regions were similar to those from non-AG mismatches, and both were quantitatively indistinguishable from the noise determined from the squid data (Figure 2A and Supplementary file 1A,B). These controls demonstrate that the enormous number of AG mismatches in the squid data is not an artefact of our analysis pipeline. Other features point to the biological origin of our AG mismatches. Similar to A-to-I editing sites in other organisms (Morse et al., 2002; Levanon et al., 2004; Kleinberger and Eisenberg, 2010), those identified here tend to cluster and show distinctive 5′ and 3′ neighbor preferences (Figure 2—figure supplement 1). In addition, hierarchical clustering of results from five tissues reveals that A-to-G modifications, but not other types, exhibit clear tissue-specificity, suggesting they do not result from genomic polymorphisms and mapping artifacts (Figure 2—figure supplement 2). No A-to-G overrepresentation is observed in mitochondria-encoded genes (Figure 2A), in agreement with the absence of ADARs, and by extension A-to-I editing, in the mitochondria. Finally, direct Sanger sequencing from a second individual confirmed editing at 40/40 A-to-G sites, and deep-sequencing validated 120/143 A-to-G sites but none of the 12 non A-to-G sites tested (Figure 2—figure supplement 3, Supplementary file 1C–G). Taken together, the overrepresentation of A-to-G modifications over all other types, the motifs surrounding the A-to-G sites, the tissue-specific modification levels, and the validation experiments, provide evidence that the majority of the A-to-G modifications are true editing events, while most non A-to-G modifications are likely technical artifacts or genomic variations (Zaranek et al., 2010; Ramaswami et al., 2012) (Figure 2—figure supplement 4).

Figure 2—figure supplement 2.

Hierarchical clustering reveals tissue selectivity in the modification levels of the A-to-G sites, but not in the non A-to-G sites.

DOI: http://dx.doi.org/10.7554/eLife.05198.006

Figure 2—figure supplement 3.

Validation of editing using Sanger sequencing.

DOI: http://dx.doi.org/10.7554/eLife.05198.007

Figure 2—figure supplement 4.

Quality controls for the A-to-G modifications and the non A-to-G modifications.

DOI: http://dx.doi.org/10.7554/eLife.05198.008

Unlike with humans, the large number of A-to-I editing events translates into a large number of recoding events: Overall, 57,108 recoding events were detected in 6991/12,039 ORFs. These numbers are orders of magnitude higher than any other species studied (Danecek et al., 2012; Ramaswami et al., 2013; St Laurent et al., 2013; Pinto et al., 2014) (Figure 2B). Moreover, a large fraction of the proteins are recoded at multiple sites (Figure 3A): about 1/3 harbor ≥3 sites and 10% harbor ≥10 sites. Even when focusing only on recoding sites with editing levels >10%, about 10% of the squid proteins harbor ≥5 sites (Figure 3A). On the extreme end of the spectrum, the ORFs encoding α Spectrin and Piccolo have 247 and 182 recoding sites, respectively (Figure 3B and Figure 3—figure supplement 1). It should be noted that only annotated ORFs were examined in our pipeline, and the number of editing sites did not saturate with respect to the number of sequence reads (Figure 2—figure supplement 5). Moreover, incompleteness of the de novo transcriptome, as well as incorrect assembly of paralogs and splice variants, may cause our pipeline to miss many additional sites (Supplementary file 1B). Therefore there are probably many more recoding sites in the squid transcriptome.

Figure 3.

Editing often recodes multiple amino acids in the same protein.

(A) The fraction of the squid genes that harbor multiple recoding events. About a third of the squid proteins harbor three or more recoding sites and more than 10% harbor 10 or more recoding sites. (B) Homology-modelling of the α Spectrin protein in which 10% of the amino acids (247/2412) are recoded by editing. Amino acids 1602 to 1918 of the squid α Spectrin protein are included in the 3-D model. Recoding sites are highlighted in green. Recoding sites with tissue-dependent levels are highlighted in red and the corresponding editing levels are indicated in the table. Also see Figure 3—figure supplement 1.

DOI: http://dx.doi.org/10.7554/eLife.05198.010

DOI: http://dx.doi.org/10.7554/eLife.05198.011

Figure 3—figure supplement 1.

Homology-modelling of the squid Piccolo protein in which 9% of the amino acids (182/2098) are recoded by editing.

DOI: http://dx.doi.org/10.7554/eLife.05198.011

Figure 2—figure supplement 5.

Number of modification sites detected as a function of the amount of DNA and RNA reads.

DOI: http://dx.doi.org/10.7554/eLife.05198.009

Editing often recodes multiple amino acids in the same protein.

Homology-modelling of the squid Piccolo protein in which 9% of the amino acids (182/2098) are recoded by editing.

Amino acids 1830 to 1966 of the squid Piccolo protein are included in the 3-D model. Recoding sites with tissue-dependent and -independent editing levels are highlighted in red and green, respectively. Aspartate residues involved in Ca2+ binding are highlighted in yellow. Abbreviations: Giant fiber lobe (GFL), Optic lobe (OL), Vertical lobe (VL), Buccal ganglia (BG), and Stellate ganglion (SG). DOI: http://dx.doi.org/10.7554/eLife.05198.011 Consistent with other organisms (Stapleton et al., 2006), recoding events are enriched in genes with neuronal and cytoskeletal functions (Figure 4—figure supplement 1A and Supplementary file 1H). To gain insight on the effected pathways, squid ORFs were mapped to all human KEGG pathways (Kanehisa and Goto, 2000). Editing has a global effect on most pathways (Supplementary file 1I), and those related to the nervous system are even more affected. For example, of the 27 proteins in the ‘Synaptic vesicle cycle’ pathway, 22 are edited and 14 heavily so (Figure 4A). Similarly, of the 39 proteins in the ‘Axon guidance’ pathway, 33 are edited and 19 heavily so. Other notable pathways are ‘Regulation of actin cytoskeleton’ and ‘Circadian rhythm’ (Figure 4—figure supplement 1B). By contrast, proteins in the pathways ‘Ribosome’ and ‘RNA polymerase’ are edited less than average (Supplementary file 1I), demonstrating that some pathways may be protected from editing. Consistently, editing levels observed in non-nervous system tissues are considerably lower (Alon et al., 2015).

Figure 4—figure supplement 1.

Recoding events are enriched in genes with neuronal and cytoskeletal functions and globally affect molecular pathways.

(A) The top-scoring Gene Ontology (GO) terms (rated by false discovery rate, FDR), enriched in a list of squid ORFs ranked by the cumulative recoding level, that is the editing level summed over all recoding sites (Eden et al., 2009). (B) All of the identifiable squid proteins present in the KEGG pathway ‘Circadian rhythm’ are edited, and many are heavily edited. We define ‘heavily edited proteins’ as those for which the cumulative recoding level exceeds unity (i.e., each copy of the protein is expected to have at least one modified amino acid, on average). These are marked red, other edited proteins in magenta, and proteins not identifiable in the squid transcriptome in green. This figure was created using the KEGG (Kanehisa and Goto, 2000) pathway database website (http://www.genome.jp/kegg/pathway.html). Editing levels were calculated using data from the Giant fiber lobe (GFL) and Optic lobe (OL) tissues combined.

DOI: http://dx.doi.org/10.7554/eLife.05198.013

Figure 4.

Recoding due to RNA editing affects complete molecular pathways and is likely to be more advantageous in sites with high editing levels.

(A) All the squid proteins present in the KEGG ‘Synaptic vesicle cycle’ pathway are edited, and most are heavily edited. We define ‘heavily edited proteins’ as those for which the cumulative recoding level, that is the editing level summed over all recoding sites, exceeds unity. These are marked red, other edited proteins yellow, and proteins not identifiable in the squid transcriptome are shown in green. Also see Figure 4—figure supplement 1. (B) The fraction of nonsynonymous codon changes as a function of the editing levels, using data from the GFL and OL tissues combined. The higher the editing level, the higher the fraction of nonsynonymous codon changes. The fraction expected by chance is shown in red. A similar relationship is also true for every tissue separately (Figure 4—figure supplement 2A). Asterisks mark p-value <0.001 estimated using 1000 bootstrap runs.

DOI: http://dx.doi.org/10.7554/eLife.05198.012

DOI: http://dx.doi.org/10.7554/eLife.05198.013

(A) For high editing levels, the fraction of nonsynonymous codon changes is significantly different from the fraction expected by chance for all the neuronal tissues examined. The fraction expected by chance (see ‘Materials and methods’) is shown in red. (B), The observed distribution of recoding types for sites with editing levels lower than 10%, between 10–50%, and higher than 50%. Highly edited sites favor the creation of glycine and arginine, mainly at the expense of lysine, in a statistically significant manner. Expected values and error bars were calculated by using the mean values and standard deviation of 100 bootstrap runs, respectively, generated by randomly modifying adenosine in a way that preserves the editing sequence preference and the number of events. (C) and (D): Amino acid targeted by the editing and created by the editing in both squid and Drosophila (St Laurent et al., 2013). The most frequent target for removal is lysine, and glycine and arginine are frequently created due to editing. Editing levels were calculated using data from the GFL and OL tissues combined. Abbreviations: Giant fiber lobe (GFL), Optic lobe (OL), Vertical lobe (VL), Buccal ganglia (BG), and Stellate ganglion (SG).

DOI: http://dx.doi.org/10.7554/eLife.05198.014

DOI: http://dx.doi.org/10.7554/eLife.05198.015

Recoding due to RNA editing affects complete molecular pathways and is likely to be more advantageous in sites with high editing levels.

Figure 4—figure supplement 2.

The fraction of nonsynonymous codon changes as a function of editing levels and the amino acid modifications due to editing.

DOI: http://dx.doi.org/10.7554/eLife.05198.014

DOI: http://dx.doi.org/10.7554/eLife.05198.012

Recoding events are enriched in genes with neuronal and cytoskeletal functions and globally affect molecular pathways.

The fraction of nonsynonymous codon changes as a function of editing levels and the amino acid modifications due to editing.

Editing tends to avoid potentially deleterious recoding events.

Each squid ORF was aligned against the conserved domains in the Conserved Domain Database (CDD) (Marchler-Bauer et al., 2013), and the score for substituting each amino acid by all other types of amino acids was calculated (Boratyn et al., 2012). The substitution score is a positive or negative integer, reflecting amino acid substitution which, compared to chance, occur frequently or infrequently in the alignment of the conserved domains, respectively. (A) The average editing levels, using data from the GFL and OL tissues combined, as a function of the amino acid substitution score. The average editing levels for negative substitution scores is significantly lower compared to what is expected by chance. (B) The distribution of the recoding sites as a function of the amino acid substitution score. Recoding sites tend to avoid large negative substitution scores compared with random changes. (C) The average substitution score as a function of the editing levels, using data from the GFL and OL tissues combined. The higher the editing levels, the higher the average substitution score, indicating that highly edited sites are more likely to recode to amino acids that occur frequently in other species. Expected values and error bars were calculated by using the mean values and standard deviation of 10,000 bootstrap runs, respectively. For A and C, the editing levels in all the sites with the same recoding type were randomly shuffled. For B, adenosines were randomly modified in a way that preserves the sequence preference and the total number of editing events. One asterisk mark p-value <0.05, two mark p-value<1e-4. Abbreviations: Giant fiber lobe (GFL), Optic lobe (OL). DOI: http://dx.doi.org/10.7554/eLife.05198.015 Recently, it was suggested that RNA editing is generally not advantageous in humans (Xu and Zhang, 2014), as nonsynonymous events are less frequent than expected by chance (Xu and Zhang, 2014). Strikingly, for sites with high editing levels in squid, the opposite is true (Figure 4B and Figure 4—figure supplement 2A). Recoding events favor creation of glycine and arginine, mainly at the expense of lysine (Figure 4—figure supplement 2B–D). Moreover, highly edited sites within conserved domains tend to recode to amino acids that occur frequently in other species at the same position (Figure 4—figure supplement 3), suggesting selection towards functional substitutions and against deleterious ones.

Figure 4—figure supplement 3.

Editing tends to avoid potentially deleterious recoding events.

DOI: http://dx.doi.org/10.7554/eLife.05198.015

The squid giant axon has been one of the most important models for neurophysiology. Studies using this preparation serve as a foundation for our current understanding of excitability, ion homeostasis, and axonal transport. Accordingly, we examined the extent to which RNA editing might affect these processes (Supplementary file 1J). In total, 87 GFL ORFs that encode voltage and neurotransmitter gated ion channels, ion transporters, synaptic release machinery and molecular motors were identified in our transcriptome. In agreement with the overall editing frequency in squid, 70% harbored editing sites. Unexpectedly, however, 54% were heavily edited, many more than the 24% expected. Thus even in a background of hyper RNA editing, squid, like other organisms, preferentially edit nervous system targets. These data suggest that editing in squid has fundamentally different underpinnings and consequences. Have squid ADARs evolved novel structures that account for the high-level editing? A past study showed that a squid ADAR2 ortholog can be expressed as two isoforms due to alternative splicing (Palavicini et al., 2009): one, having two double-stranded RNA (dsRNA) binding motifs, resembles vertebrate ADAR2s. A second, however, contains an ‘extra’ dsRNA binding motif at its N-terminus. This non-canonical isoform edits RNA more efficiently, edits more sites, and binds dsRNA with a far higher affinity (Palavicini et al., 2009, 2012). Further Squid ADAR2 messages themselves contain many editing sites, leading to many subtly different isoforms. An ADAR1 isoform is also present in our transcriptome and is the focus of an ongoing study. An equally intriguing question is why squid edit to this extent? The process clearly creates tremendous protein diversity, and this may in part explain the behavioral sophistication of these complex invertebrates. A recent study showed that editing can be used for temperature adaptation in octopus (Garrett and Rosenthal, 2012b) and this makes sense based on the codon changes that it catalyzes (Garrett and Rosenthal, 2012a) (Figure 4—figure supplement 2C–D). In Drosophila, editing can respond to acute temperature changes (Savva et al., 2012). The large number of sites in squid suggests that editing is well positioned to respond to environmental variation. Most model organisms studied so far are mammals which are homeotherms. Future studies of more diverse species are needed to reveal the extent to which cold-blooded organisms might utilize extensive editing to respond to temperature changes and other environmental variables.

Materials and methods

Specimen collection and dissection

Specimens of the squid Doryteuthis pealeii were collected by trawl in the Vineyard Sound by the animal collection department of the Marine Biological Laboratory in Woods Hole, Massachusetts during the month of July. The giant fiber lobe (GFL) tissue of the Stellate ganglion, the optic lobe (OL) tissue and a portion of the sperm sack were manually dissected from a single adult male immersed in chilled, filtered seawater. The Buccal ganglia (BG), Stellate ganglion with the giant fiber lobe removed (SG) and Vertical lobe (VL) tissues were dissected from a second adult male individual. Tissues were also dissected from non-neuronal regions: the branchial heart, the Gill, the ventral epithelial layer on the pen, the marginal epithelial layer on the pen, the iridophore layer of the skin, and the chromatophore layer of the skin. Each of these six tissues originated from a different animal. RNA from all tissues was extracted with the RNAqueous kit (Life Technologies, Carlsbad, CA), and genomic DNA was extracted from the sperm sack using Genomic Tip Columns (Qiagen, Venlo, Limburg, The Netherlands).

Library preparation for sequencing

The genomic DNA sequencing library was prepared using the TruSeq DNA Sample Prep kit, as described by the manufacturer (Illumina, San Diego, CA), and sequenced using three lanes of Illumina HiSeq 2000 instrument. The RNA-Seq libraries for all the samples were prepared using the TruSeq Stranded mRNA Sample Prep Kit, as described by the manufacturer (Illumina), and were sequenced using Illumina HiSeq 2000 instrument. The GFL and OL libraries were sequenced together on a single lane, the same for the VL and BG libraries (with one unrelated library) and the SG library was sequenced on a single lane (with one unrelated library).

Detection of editing sites in the squid nervous system

Illumina sequencing was utilized to generate ∼87 million paired-end, 100 nt reads, using RNA from GFL tissue, the same number of reads using RNA from OL tissue, and ∼440 million paired-end, 100 nt reads, using germline DNA. The Trinity transcript assembly tool (Grabherr et al., 2011) was used with the default parameters (except for ‘min_kmer_cov’ which was set to 2 instead of 1) to construct the genes sequences from the GFL and the OL sequencing data combined; giving 99,226 putative gene-fragments (termed ‘components’). Most of the generated components were short (median of 379 bases). As Trinity attempts to identify the different transcripts (isoforms) of the components, 14,643 components were represented by more than one sequence (putative transcript-fragment). For these components, the longest sequence was selected and further used. The average length of the representing sequences for all components was 589 bases, bringing the total size of the recovered squid transcriptome to about 58 million bases. DNA reads and RNA reads were separately aligned against the RNA components using Bowtie2 with local alignment configuration and default parameters (Langmead and Salzberg, 2012). Only uniquely aligned reads were used (taking only reads with the maximal ‘mapping quality’). As most of the components were short, we didn't demand that both mate pairs be aligned to the same component. Instead, each read was separately analyzed. Overall, 77% (∼268 million out of ∼350 million) of the GFL and OL RNA reads combined were uniquely aligned to the components. However, as expected, only 3.5% of the DNA reads (∼30 million out of ∼880 million) were uniquely aligned to the components. To focus on editing sites inside coding regions, and avoid repetitive elements that are prone to assembly and alignment errors, we retained only those components that were found to be significant similar (Blastx E-value<1e-6) to the Swiss-Prot proteins dataset (UniProt Consortium, 2014). In these components, the alignment to the homolog Swiss-Prot protein covered most of the squid sequence (63% on average). The detected ORFs were extended in both directions until a stop codon or the end of a component was reached. Overall, 12,039 ORFs, with average length of 1370 bases, were detected. These ORFs represent 8276 Swiss-Prot proteins with distinct names (two or more different ORFs may be two squid paralogs of the same Swiss-Prot protein, or fragments of the same squid gene aligned to different regions of the Swiss-Prot protein). The total size of the detected coding sequences in the squid was ∼16 million bases, 28% of the total transcriptome length recovered using Trinity (the constructed squid coding sequences are given in a text file in a fasta format, Alon et al. (2015), available via Dryad digital repository). The coverage of these ORFs was high; in the GFL and OL data combined, ∼199 million RNA reads were aligned to these Swiss-Prot ORFs, with an average RNA coverage of 1206×. The average DNA coverage was 106× as ∼21 million DNA reads, with average length of 83 bases were aligned to the coding sequences. The above alignment data and the ORFs information were used to find the locations of all DNA-RNA mismatches inside the coding regions. In the following, bases called with quality score Q < 30 were discarded. Note, however, that Trinity consensus sequence does take into account these bases, as well as reads that might have not been uniquely aligned to the transcriptome. It is often customary to filter out reads' ends when analyzing RNA-DNA mismatches. A main reason for that is the common mismatches at reads' ends due to alignment artifacts when a splicing junction occurs near the ends. In our case, as alignment is done to the transcriptome, we did not observe any increase in AG mismatch rate near reads' ends (Figure 2—figure supplement 4), and thus no such filter was used. Two procedures were used to detect editing sites: (A) ‘weak’ editing sites procedure: a binomial test was applied to find the significant modifications between RNA reads and the Swiss-Prot ORFs. The binomial statistics uses the number of successes (the number of reads with a mismatch of a given type in a given position), the number of trials (the total number of RNA reads aligned to the given position) and the error probability. The probability of having a sequencing error (the error probability) was estimated using the sequence quality score. We counted only mismatches with >=30 quality score, and therefore the expected error probability was set to 0.1%. The binomial test was applied to every position inside the Swiss-Prot ORFs. The p-value for each location was corrected for multiple testing using a Benjamini-Hochberg false-discovery rate of 10%. Furthermore, in order to exclude RNA variability due to genomic polymorphisms, we filtered out all modification sites in which any of the DNA reads aligned to the site does not agree with the transcriptome. This procedure assumes the RNA consensus in the site is identical to the gDNA reads, and thus is not suited to detect sites in which the editing appear in most of the RNA reads and therefore also in the Trinity-generated ORFs (‘strong’ editing sites). (B) ‘strong’ editing sites procedure: the locations in which all DNA reads showed a different base than the ORF. The probability of such sites to be not a result of editing but rather a single nucleotide polymorphism (SNP) was estimated by (1/2)(#DNA reads) (no allele-specific expression) multiplied by the prior probability of a SNP which was taken to be 0.001. Here too, the p-value for each location was corrected for multiple testing using a Benjamini-Hochberg false-discovery rate of 10%. Overall, 81,930 weak and 5649 strong A-to-G modification sites were detected in 7776 ORFs, and only 12,403 weak and 254 strong non A-to-G modifications sites (the modification sites and their number of supporting reads in all the tissues studied are tabulated in Alon et al. (2015) available via Dryad digital repository). Interestingly, 2905 of the A-to-G modification sites reside in 268 out of the 475 ORFs with only non-metazoans homologs. The number of weak A-to-G sites detected (but not strong ones) is likely to increase with RNA coverage, as we demonstrated by sampling parts of the sequencing data and re-calculating the number of A-to-G sites (Figure 2—figure supplement 5A). Another way to look at the dependence between the RNA coverage and the detected number of weak A-to-G sites is by recording the number of A-to-G sites detected in each ORF as a function of the ORF RNA coverage. The sorted RNA coverage was divided into ten equal bins and the mean number of A-to-G sites in each bin was calculated. As expected, higher RNA coverage is correlated with high number of A-to-G sites (Figure 2—figure supplement 5B). As with the RNA reads, the dependence of the detection procedures on the DNA reads was examined by sampling parts of the DNA sequencing data. Increasing the DNA coverage increased the precision of the ‘weak’ site detection procedure (as more SNPs are removed from the observed modifications) and strongly increased the number of detected ‘strong’ sites (Figure 2—figure supplement 5C). We did not apply additional read-number filters, as these seem to only marginally increase accuracy while reducing the number of detected sites. However, we did exclude five strong sites for which there were no uniquely-aligned, high-quality, supporting RNA reads. This brings the number of strong A-to-G and non A-to-G modifications to 5644 and 219, respectively. The full list of weak and strong sites (Alon et al. (2015) available via Dryad data repository) provides the number of DNA and RNA reads per site. We examined clusters of mismatches of the same type (several consecutive identical mismatches), revealing a high number of ORFs with A-to-G mismatch clusters (Figure 2—figure supplement 1A). For example, examining only ORFs with three (and above) consecutive identical mismatches, 45,199 A-to-G sites in 4265 ORFs were detected, and only 470 non A-to-G sites. Interestingly, the clusters of A-to-G sites also appear at the level of the individual reads; for example, examining only reads with four and above consecutive identical mismatches, 85% of the reads contained A-to-G modifications (Figure 2—figure supplement 1B). This data implies that single RNA molecules contain A-to-G clusters. We note that similar results were obtained using an algorithm which analyzes only reads with several consecutive mismatches, including reads that cannot be mapped by standard alignment tools (Porath et al., 2014). Using this algorithm 23,737 A-to-G modification sites were found in the squid coding regions, compared to only 2933 non A-to-G modifications. Importantly, only 728 A-to-G modifications were found in the coding regions of human using this algorithm (Porath et al., 2014), even though a much larger RNA-seq dataset was used (∼5 billion reads of Illumina Human BodyMap 2.0), thus supporting the dramatic difference in editing between squid and other organisms. The sequence surrounding the A-to-G modifications sites is similar for ‘weak’ sites and ‘strong’ sites, and both differ from what is expected by chance (Figure 2—figure supplement 1C). As expected for bona fide A-to-I editing sites (Kleinberger and Eisenberg, 2010), G is significantly underrepresented in the nucleotide before the editing site (6922 out of 87,574 sites detected) and over represented in the nucleotide that follows the editing site (38,564 out of 87,574 sites). A and T are over represented in the nucleotide before the editing site (42,475 and 25,061, respectively, out of 87,574 sites).

Characterization of the modifications sites detected

We characterized the differences between the A-to-G modifications and all the other types of modifications (non A-to-G). One possible source of apparent modifications is sequencing errors. We thus compared the quality score between A-to-G and non A-to-G modification sites. However, no significant difference in the distribution of the quality scores was observed (Figure 2—figure supplement 4A). Another technical explanation for the non A-to-G modifications can be read-end artifacts: the ends of the reads are more likely to be misaligned due to splicing, or to contain errors generated in the RNA sequencing protocol (Ramaswami et al., 2012). Indeed, non A-to-G modifications (unlike A-to-G ones) tend to be located in the reads-end (Figure 2—figure supplement 4B), suggesting a larger fraction of these sites is likely to be a result of technical artifacts. In addition, we studied the modification level distribution (the fraction of cDNA reads exhibiting the modification in a given position) for both types of modifications (Figure 2—figure supplement 4C). A higher fraction of non A-to-G sites show ∼50% modification levels (compared to the A-to-G sites), indicating a higher fraction of missed genomic polymorphisms. Consistently, 50% of the sites with non A-to-G modification levels between 40–60% or between 90–100% recur in both the GFL and the OL tissue (coming from the same individual animal), compared to only 21% of the A-to-G modifications in the same ranges. To find statistically significant differences in the modification levels between the GFL and the OL tissues, a binomial analysis was performed with Bonferroni-corrected p-value of 0.05 as a cutoff. Overall, 19% (16,425 out of 87,574) of the detected A-to-G modifications differ significantly between the GFL and OL tissues. Most of these sites (84%, 13,731 out of 16,425) have higher modification levels in the GFL tissue. In contrast, only 6% (783 out of 12,614) of the non A-to-G modifications are significantly different for the same two tissues, with no clear tissue preference: 45% and 55% of these sites have higher modification levels in the GFL and the OL tissue, respectively. As the same technical artifacts and genomic polymorphisms are expected to replicate in both tissues, these data increase our confidence that most of the A-to-G sites are bona fide editing sites. To further characterize the tissue-dependence of modifications levels, Illumina sequencing was again utilized to generate ∼54, ∼62 and ∼42 million paired-end, 100 nt reads, using RNA from Buccal ganglia (BG), Stellate ganglion with the giant fiber lobe removed (SG), and Vertical lobe (VL), respectively, collected from a different animal. The same alignment procedure was applied to quantify the modification level at the previously described sites for each of the additional samples. Statistically significant differences in the modification levels between all the five neuronal tissues were detected using a binomial analysis with Bonferroni-corrected p-value of 0.05 as a cutoff. The sites with significantly variable A-to-G modification levels were subjected for hierarchical clustering, revealing clear tissue selectivity with higher modification levels in the GFL tissue and low levels in the VL tissue (Figure 2—figure supplement 2A). The same analysis for the non A-to-G modification levels demonstrates consistency between the individual animals as the modification levels in the GFL and OL tissues form one cluster, and the VL, BG and SG tissues form a second cluster, again suggesting that many of the non A-to-G modifications are due to genomic polymorphisms (Figure 2—figure supplement 2B). To characterize the modification levels in non-neuronal tissues, Illumina sequencing was again utilized to generate ∼23, ∼23, ∼19, ∼26, ∼19 and ∼14 million paired-end, 150 nt reads, using RNA from the branchial heart, the Gill, the ventral epithelial layer on the pen, the marginal epithelial layer on the pen, the iridophore layer of the skin, and the chromatophore layer of the skin. The same alignment procedure was applied to quantify the modification level at the previously described sites for each of the additional samples, revealing considerably lower editing levels in the non-neuronal tissues (Alon et al. (2015), available via Dryad data repository).

A-to-I editing verified by Sanger sequencing and deep sequencing

Direct validation of editing was performed on a subset of the detected A-to-G sites using Sanger sequencing and deep sequencing. To reduce the chance for RNA contamination in the DNA or vice versa, primers were designed to differentially amplify the gDNA, by residing in introns, or cDNA, by spanning exon–exon boundaries. To identify intronic sequence and exon–exon junctions the following steps were performed: (a) we recorded all the cases in which the beginning or the end of the DNA read was trimmed during the local alignment against the Trinity sequences. (b) If the flanking region could be aligned against the Trinity sequences, even if aligned separately without the rest of the read, it was discarded. (c) If at least three DNA reads showed the same flanking sequence starting from the same position inside the ORFs, it was considered to be an intron fragment. This procedure revealed the positions of part of the exon–exon junctions and part of the intron sequences that correspond to the junctions. Overall, this procedure resulted in about 2100 regions, containing ∼5% of all the detected editing sites, in which the gDNA and the cDNA could potentially be differentially amplified. For the Sanger sequencing, primers were designed to differentially amplify the gDNA and cDNA of 19 ORFs (Supplementary file 1C). To allow better detection of editing, all the sites tested using Sanger sequencing were chosen to have >=20% modification levels (in our original GFL analysis). For the Sanger validation experiment the GFL tissue from a single animal was used (different from the animals used for the HiSeq sequencing experiments). All the sites tested (40 out of 40) were validated using Sanger sequencing (Figure 2—figure supplement 3 and Supplementary file 1D). For the deep sequencing validations, three groups of targets were tested (Supplementary file 1E,F): (a) twenty ‘interesting’ genes, that is, squid components with homology to genes known to be implicated in human diseases or in other important pathways, selected from the dataset of ∼2100 regions described above. (b) 20 regions randomly-selected from the above dataset of ∼2100 regions. (c) 20 regions randomly-selected from the set of putatively edited regions for which we could not design unique gDNA primers, and thus gDNA and the cDNA could not be differentially amplified. For this group the same primers were used to amplify the gDNA and the cDNA. As with the Sanger validation experiment, the deep sequencing validations were done using GFL tissue from a single animal (different from the animals used for the HiSeq and Sanger sequencing experiments). All primer sets were designed with an overhang so that sequencing and indexing primers could be added to the amplicons in a nested PCR reaction. Samples from gDNA and cDNA were distinguished by unique sequencing indexes. After nested PCR, amplicons were pooled, purified by Ampure XP (Beckman Coulter, Danvers, MA), and sequenced on an Illumina MiSeq instrument. All the cDNA targets were amplified, but only 49 of the 60 gDNA targets amplified well enough for analysis (Supplementary file 1E,F). For the other eleven targets, the presence of undetected introns could have disrupted amplification. The DNA and RNA reads were analyzed using the same detection procedure described above (Figure 1B), with the sole exception that sequence variations below 0.1% (the expected sequencing error rate) were allowed in the DNA reads (as mandated by the much larger DNA coverage in this validation study). Altogether, 84% (120 out of 143) of the A-to-G sites examined were validated (Supplementary file 1F,G). In contrast, none of the 12 non A-to-G sites examined were validated. Moreover, 170 additional A-to-G sites (86% of all novel detected sites, a similar fraction to the HiSeq data used in the original detection, Figure 2A) were identified, more than doubling the number of A-to-G sites detected. Similar results were observed for the three groups examined. Finally, for the validated A-to-G sites, high correlation was obtained between the editing levels (the fraction of cDNA reads exhibiting the editing in a given position) measured using the HiSeq data and the MiSeq data (Pearson's r of 0.86, p-value = 1e-35) (Supplementary file 1G).

The effect of the A-to-I editing on the proteome

To demonstrate the extent of massive recoding on two examples, homology-modelling of the squid proteins α Spectrin and Piccolo was performed with SWISS-MODEL using default parameters (Biasini et al., 2014) and visualized using UCSF Chimera package http://www.cgl.ucsf.edu/chimera (Figure 3B and Figure 3—figure supplement 1). To find if the recoding events are enriched in genes with specific functions, we have calculated the cumulative recoding level, that is, the editing level summed over all recoding sites within each squid ORF. This gives a single score representing the extent of recoding in the whole protein. The squid ORFs list was ranked using the cumulative recoding level and all the Swiss-Prot annotations were converted to human Swiss-Prot annotations (when possible) for consistency. The GO analysis tool GOrilla was used to find enriched GO annotations in the ranked list (Eden et al., 2009). In order to control for possible detection bias in highly expressed genes, the same list was ranked using the gene expression level (FPKM) and was also analyzed using GOrilla. As expected, the enriched GO annotations in the control list (that is, genes ranked by expression levels) were mainly connected to the ribosome (translational elongation, structural constituent of ribosome, RNA binding and so on). In contrast, the list ranked by the cumulative recoding level gave enriched GO annotations which are mainly connected to neuronal and cytoskeleton functions (Figure 4—figure supplement 1A and Supplementary file 1H). Trying to detect enriched GO annotations at the bottom of the list (ranked by the cumulative recoding level) did not produce any significant results. The extensive recoding due to RNA editing can affect many molecular pathways. In order to appreciate the extent of this phenomenon, squid ORFs were mapped to all human KEGG pathways (Kanehisa and Goto, 2000), revealing that RNA editing has a global effect on the majority of the squid pathways (Supplementary file 1I). In this analysis, the homologs to the human proteins can be: (a) ‘heavily edited’, defined as proteins for which the cumulative recoding level, that is the editing level summed over all recoding sites, exceeds unity, (b) ‘edited’, if the protein has at least one recoding site, (c) not edited, or (d) not identifiable in the squid transcriptome. Editing levels were calculated using data from the GFL and OL tissues combined. On average, 74% and 22% of the identifiable proteins in each pathway are edited or heavily edited, respectively. Pathways related to the nervous system are even more extensively edited: 75% and 35% of the identifiable proteins in these pathways are edited and heavily edited, respectively. On the flip side, in the pathways ‘Ribosome’ and ‘RNA polymerase’ only 50% and 33% of the identifiable proteins are edited and 0% and 4% of the identifiable proteins are heavily edited, respectively (Supplementary file 1I), demonstrating that some crucial pathways are protected from editing. We have also specifically examined the extent of recoding in squid ORFs that encode voltage and neurotransmitter gated ion channels, ion transporters, synaptic release machinery and molecular motors. Overall, we examined 87 ORFs that are homologous to the following proteins: Voltage-gated potassium channel alpha subunit, Voltage-dependent sodium channel alpha subunit, Voltage-dependent calcium channels, Ionotropic glutamate receptors, Synaptotagmin, Synaptobrevin, Syntaxin, Synapsin, Snap 25, Sodium/potassium-transporting ATPase alpha subunit, Sodium/calcium exchanger (SLC8), Sodium bicarbonate exchanger (SLC4), Sodium/hydrogen exchanger (SLC9), Sodium/potassium/calcium exchanger (SLC24), Dynein, and Kinesin (Supplementary file 1J). In the GFL tissue, 47 out of the 87 proteins (54%) are heavily edited, more than twofold higher than expected by chance (p-value<1e-6). In fact, in all the examined neuronal tissues, this group of 87 proteins is heavily edited, significantly higher than expected by chance (Supplementary file 1J). An important question is whether editing in the squid ORFs tends to avoid recoding by preferring synonymous modifications or, alternatively, tends to create more recoding sites than expected by chance. In squid, the expected fraction of nonsynonymous changes is 0.65, estimated by random changes of adenosines in the ORFs, accounting for the observed local sequence preference of the editing sites (Figure 2—figure supplement 1C), as follows: (a) we examined the sequence surrounding sites detected as edited (the base preceding and the following the site), and counted the number of times each one of the 16 possible combinations of upstream and downstream nucleotide appears. These numbers were normalized by the number of times each combination appears for all the A bases in all the ORFs, to produce the observed probability of being targeted by editing given a certain combination of 5′ and 3′ nucleotides. (b) all the locations inside the ORFs were screened in a random order until an A base was encountered. (c) a random number was generated, if it was below the normalized probability corresponding to the sequence surrounding the site in question, this site was selected for further use. (d) the potential effect on the codon due to changing the A base to G was examined, in particular, whether the change is nonsynonymous or synonymous. (e) steps (b–d) were repeated until the number of A bases changed matched the observed number of editing sites. The described randomization procedure was also used to calculate the expected codon changes due to editing (Figure 4—figure supplement 2B). We found that the higher the editing levels, the higher the fraction of nonsynonymous changes, and for editing levels >20% the nonsynonymous fraction is significantly higher than the expected fraction of 0.65 (Figure 4B and Figure 4—figure supplement 2A). A recoding event which creates an amino acid rarely found in homologous proteins may indicate that the editing is deleterious. Therefore, each squid ORF was aligned against the conserved domains in the Conserved Domain Database (CDD) using DELTA-BLAST (Boratyn et al., 2012; Marchler-Bauer et al., 2013). DELTA-BLAST calculates the position-specific score matrices (PSSM) for each possible amino acid substitution. Positive scores indicate that a certain amino acid substitution occurs more frequently in the alignment against the conserved domains than expected by chance, while negative scores indicate that the substitution occurs less frequently than expected. The substitution score of each editing event was recorded as well as the editing level in the same position, using data from the GFL and OL tissues combined. As a control, the editing levels in all the sites with the same recoding type were randomly shuffled. This accounts for the fact that editing creates specific amino acid changes (Figure 4—figure supplement 2B–D), and in turn, these specific changes may be correlated with average editing levels and with substitution scores. Thus, the shuffled dataset preserves both the distribution of recoding types and the distribution of editing levels for each recoding type. This analysis revealed that the average editing levels in sites with large negative substitution scores, which may be deleterious, are significantly lower than what is expected by chance (Figure 4—figure supplement 3A). The distribution of recoding sites for each substitution score was calculated and compared with random changes that preserve the sequence preference and the total number of editing events (Figure 2—figure supplement 1C). Consistent with the above finding, there are significantly less recoding sites with large negative substitution score than expected by chance (Figure 4—figure supplement 3B). Finally, on average, the higher the editing level in a given site, the higher the average substitution score, above what is expected by chance (Figure 4—figure supplement 3C). Thus, this analysis indicates that recoding by editing tends to avoid potentially deleterious sites and that editing sites with high editing levels might be more important functionally than editing sites with low editing levels.

Applying our detection procedure for human and rhesus macaque sequencing data

We searched publicly available datasets and found two datasets of matched DNA- and RNA-seq data which are comparable in size and read-lengths to our squid data: (i) Human RNA-seq and DNA-seq data of blood samples (Chen et al., 2012) (ii) Macaque RNA-seq and DNA-seq brain data (Chen et al., 2014). The same pipeline as described above was applied to the data, with one single exception: while comparing the transcriptome model with SwissProt we used only non-vertebrate SwissProt sequences, in order to mimic the situation for squid in which there are no SwissProt entries from closely related species. The analysis of the human and macaque data was done for two purposes: (A) to show that the enormous number of AG mismatches in the squid data is real and not an artefact of the analysis pipeline, and (B) to demonstrate that our pipeline could identify RNA editing sites established by previous studies. The results of the analysis are summarized in Figure 2A and Supplementary file 1A. For non-AG mismatches, we obtained roughly the same numbers as those for squid. However, as expected, the AG mismatches in the human and macaque samples did not show any enrichment, and their abundance (in coding regions) was similar to those for other mismatches. In other words, the sensitivity of our method allows the detection of the super-strong editing signal of squid, but cannot separate the rare editing sites in mammals (in coding regions) from noise. Note that the task of detecting recoding sites in mammals is highly non-trivial even when one takes advantage of all information available, including the accurate genome sequence. The best efforts, so far, have yielded enrichment of AG mismatches, but still most detected sites are non-AG (i.e., most-probably, false-positives) (Ramaswami et al., 2013). In order to assess the precision of our method in recalling true editing sites we have looked at the validated editing sites found in the two studies above, and checked whether they have been picked up by our algorithm as well. The full data is presented in Supplementary file 1B. Overall, about half of the sites were picked up by our pipeline. The sites that were not identified, mostly resided in regions poorly described by our Trinity transcriptome assembly (some were just very weakly edited). Thus, one may conclude that the recall rate of our method is lower than 0.5, and the true extent of squid recoding is even much larger than we report.

Data Deposit

The data reported in this paper was deposited to the Sequence Read Archive (SRA), under accession SRP044717. The constructed squid coding sequences and all the A-to-G modification sites detected in the coding regions of the squid are available via Dryad digital repository (Alon et al., 2015; http://dx.doi.org/10.5061/dryad.2hv7d). eLife posts the editorial decision letter and author response on a selection of the published articles (subject to the approval of the authors). An edited version of the letter sent to the authors after peer review is shown, indicating the substantive concerns or comments; minor concerns are not usually shown. Reviewers have the opportunity to discuss the decision before the letter is sent (see review process). Similarly, the author response typically shows only responses to the major concerns raised by the reviewers. Thank you for sending your work entitled “The majority of transcripts in the squid nervous system are extensively recoded by A-to-I RNA editing” for consideration at eLife. Your article has been favorably evaluated by Chris Ponting (Senior editor) and 2 reviewers, one of whom is a member of our Board of Reviewing Editors. The Reviewing editor and the other reviewer discussed their comments before we reached this decision, and the Reviewing editor has assembled the following comments to help you prepare a revised submission. The paper by Alon et al. is a well-performed, concise study that shows extensive RNA editing in the squid genome. Since the extent of editing is several orders of magnitude larger than that reported in the human genome, and actually in any other metazoan species, the results are obviously of biological relevance, and therefore appropriate for eLife. The manuscript also describes a novel bioinformatic method to identify editing sites. Overall, the manuscript is well written and both Methods and Results are clearly explained. Despite being unexpected, the results appear to be quite robust: the control in primates shows that they are not an artifact from novel pipeline employed to predict editing sites in absence of assembled genome sequences. The strong bias to AG mismatches, the clustering pattern in genes, the neighbor preferences, the tissue specificity (irrespective of biological origin), the resulting recoding events towards the common amino-acid, etc., all these strongly support the editing sites found by the authors. 1) The de novo transcriptome assembly is a very trivial computational issue. Many false positives are expected at least in complex mammalian transcriptomes. Paralogs could affect the reconstruction of real isoforms leading to a sort of chimeric transcripts. In addition, alternative splicing may complicate transcript reconstruction. Are there estimations about the impact of alternative splicing and paralogs in squid? Any impact of this on the results should be discussed in the text. Also, the text should clarify that this is not a completely de novo method since genomic sequences are generated. 2) The strategy is biased towards the RNA editing prediction in protein coding regions (CDS). Can RNA editing events be detected also in non-CDS regions by the method? If not, this should be clarified in the text. Related to this, evidence of RNA editing in repetitive regions in squid could potentially be interesting, probably revealing an opposite trend than mammals. 3) Regarding methodology, can the statistical binomial test detect any significant change in the non-AG positions? If yes, how do you explain this finding? The average RNA and DNA coverage is high but regarding RNA editing candidates, are there filters to exclude low covered sites? What is the minimal coverage for RNA and DNA? Did you apply filters to RNA and DNA reads? I mean reads with low quality and positions at read ends. 4) Have the authors considered the possibility that their results arise from somatic genomic editing, rather than RNA editing? While for the human and macaque control, the RNA and DNA samples are from the same tissues, in the case of squid, RNA samples are from the tissues from the nervous system, while DNA is from the sperm sack. To unequivocally conclude that the observations are indeed from RNA editing, I guess that DNA and RNA need to be from the same biological source. Maybe the investigation of the distribution of the relative proportion of reads supporting and not supporting the edit could help here. 5) Related to the above, the authors used RNA only from tissues from the nervous system. Therefore, it is not possible to assess whether the phenomenon observed is characteristic of this system, or it is actually systemic in the entire organism. I think that sequencing RNA from some other non-nervous tissue could help to distinguish between the two hypotheses. 6) Regarding the characterization of RNA editing events, events tend to be tissue specific. Are there events showing tissue specific levels? That is, cases in which the gene locus in expressed at the same level in all tissues but editing levels are different. 7) It is a little bit disappointing that there is limited investigation in the potential mechanisms behind the extensive editing observed. The authors could have at least investigated ADAR with some additional detail. The RNA (and DNA sequence) helps to delineate the ADAR sequence, and the RNA reads to estimate expression levels. Are there multiple copies of ADAR in the squid genome? Is ADAR expressed at comparatively higher expression levels than in organism with low editing levels (they can use the mouse and human samples to make this comparison? Has the ADAR sequence in squid diverged faster than expected? In specific domains? All these questions are quite simple to answer. 8) The authors also provide an adaptive explanation to the high levels of editing observed in the squid genome, and hypothesize that, in contrast to current assumptions, that extensive editing is common as a way to cope with temperature adaption, except in mammals that, as homeotherms, would not require such a process. This is, by the way, reminiscent of the isochore theory by Bernardi that would separate homeotherm vertebrates from “cold-blooded” (poikilotherm) vertebrates (to which, by the way, the authors may want to cite). If the authors were correct that would indeed be a quite relevant result. They could easily employ their pipeline in available vertebrate RNAseq data (for instance, http://www.sciencemag.org/content/338/6114/1587.full) to test this hypothesis. Before we address your report point-by-point, we would like to reiterate an important issue that is touched upon in many of your queries: our editing detection pipeline, relying on a de novo transcriptome assembly, is inferior to established methods which align RNA reads to a genomic reference. All tools for de novo transcript assembly have limited accuracy, particularly when they try to predict close paralogs and splice variants from short cDNA reads. These shortcomings may lead to false-positives in our pipeline, which add to errors caused by the more familiar somatic mutations, SNPs, and alignment errors that are common to all editing detection schemes. As is the case for all editing detection pipelines, the challenge is to minimize noise level to the extent that true editing sites are not buried in the noise. Here, we are able to apply our pipeline to squid despite the above described errors because the editing signal is enormous. We are confident that the vast majority of A-to-G discrepancies that we have identified reflect true RNA editing events for many reasons. First, all of the sources of error outlined above are not expected to be biased towards A-to-G mismatches. Even if they were, they should not be biased towards the expressed strand of the DNA (i.e. one should expect equal numbers of A-to-G and T-to-C mismatches on the expressed strand if they were true errors). Accordingly, we conclude that the level of non-AG mismatches detected, which we used to estimate the false-discovery rate, already includes the sum of all of the above types of error (and any additional ones of unknown source). The dramatic overrepresentation of A-to-G mismatches over non-AG ones (including T-to-C) is therefore strong evidence that most of the detected A-to-G modifications are due to bona fide ADAR editing. Our conclusions are further supported by the fact that edited adenosines are surrounded by ADAR’s preferred nucleotide motif, that, as with other organisms, editing sites are clustered, and through direct validation of a randomly selected subset of the sites. That being said, we do expect a false discovery rate of 10-15% in our list of ∼90,000 predicted sites, and these thousands of false-positives are probably due to inaccurate assembly, SNPs, somatic mutations and misalignments. 1) The de novo transcriptome assembly is a very trivial computational issue. Many false positives are expected at least in complex mammalian transcriptomes. Paralogs could affect the reconstruction of real isoforms leading to a sort of chimeric transcripts. In addition, alternative splicing may complicate transcript reconstruction. Are there estimations about the impact of alternative splicing and paralogs in squid? Any impact of this on the results should be discussed in the text. We completely agree that de novo assembly of complex transcriptomes is bound to result in many misidentifications of paralogs and splice variants. This may impact our results by leading to some false detection of “editing“ sites as well as by missing true sites. As mentioned above, the total false discovery rate can be estimated by non-AG mismatches, and is rather low. In fact, it is much lower than genome-aware methods in model organisms, which identify only ∼40% A-to-G mismatches in coding sequences (see e.g. Supplementary Table 3 of Ramaswami et al., 2013). The reason is not that our pipeline is better than the genome-aware methods; it's just that the squid recoding signal is very strong. It is difficult to estimate the number of false positives due to mis-assembly, but one may use the total number of T-to-C mismatches (∼5000) to approximate the total false discovery rate, which is an upper bound for false-discovery rate due to mis-assembly. As discussed in the text, it seems that SNPs are a major contributor to false discoveries, so not all of these ∼5000 sites are due to mis-assembly. Moreover, by applying our pipeline to primate datasets of similar size we identify ≤1000 sites for each mismatch type. Thus one might roughly estimate the number of false positives due to mis-assembly at a few thousands at most. As for false-negatives, an incomplete transcriptome may lead to missing true recoding sites. The scope of this may be demonstrated by our primate analysis (Table B), where the major reason for not identifying about half of the known recoding sites is the absence of these sites in the de novo transcriptome, or lack of resolution of different splice variants (see Table B and footnotes). Accordingly, the full recoding repertoire in the squid may be roughly twice what we report here. In summary, we estimate the number of false-positives due to mis-assembly to be up to a few thousands. This source of error is included in our false discovery rate, and is far lower than the true signal. As for false-negatives, we can only extrapolate from the primate data, and roughly estimate their number to be sizable, making the full recoding repertoire of the squid up to twice the size reported here. We have amended the text to the following: “This suggests false-positive rates of 15% and 4% respectively, mainly due to transcriptome assembly problems, SNPs, somatic mutations and systematic mis-alignments. Note that these false-positive rates are considerably lower than those for similar searches for editing within human coding sequences, where a genome reference was employed.” And also: “Moreover, incompleteness of the de novo transcriptome, as well as incorrect assembly of paralogs and splice variants, may cause our pipeline to miss many additional sites (Table B).” Also, the text should clarify that this is not a completely de novo method since genomic sequences are generated. Our transcriptome assembly does not use the DNA-reads in any way, and thus is indeed 'de novo'. Editing detection does use DNA-reads, and accordingly we do not refer to our pipeline as a de novo method for editing detection. 2) The strategy is biased towards the RNA editing prediction in protein coding regions (CDS). Can RNA editing events be detected also in non-CDS regions by the method? If not, this should be clarified in the text. Related to this, evidence of RNA editing in repetitive regions in squid could potentially be interesting, probably revealing an opposite trend than mammals. The question of editing in squid repeats is indeed interesting; however without a fully-assembled genome available, it is difficult to attack at this point. The main reason is that non-coding regions in the squid genome tend to be rich in repeats that dramatically affect the quality of the de novo transcriptome analysis. Thus, while the de novo transcriptome does include some non-coding sequences, and our method may, in principle, apply (hence it is not essentially biased towards CDS), we chose to focus only at CDS in order to improve our signal-to-noise ratio. We have amended the text to read: “To focus on editing sites inside coding regions, and avoid repetitive elements that are prone to assembly and alignment errors, we retained only those components that were found to be significant similar (Blastx E-value<1e-6) to the Swiss-Prot proteins dataset.” 3) Regarding methodology, can the statistical binomial test detect any significant change in the non-AG positions? If yes, how do you explain this finding? The statistical test is designed to filter out sequencing errors. Thus, many mismatches that are not editing sites pass these tests, such as (i) SNPs not identified due to insufficient DNA-seq reads coverage (ii) somatic mutations (iii) mis-alignments (of DNA and/or RNA reads) and (iv) errors in the underlying transcriptome, due to the issues mentioned in point #1. In addition, even for the sequencing errors, the multiple-correction scheme we use (Benjamini-Hochberg) does not guarantee zero false-positives. All of the above lead to the thousands of non-AG mismatches observed. As mentioned above, the data presented in Figure 2–figure supplement 2 suggests that SNPs are the major source of error. The average RNA and DNA coverage is high but regarding RNA editing candidates, are there filters to exclude low covered sites? What is the minimal coverage for RNA and DNA? We thank the referees for raising this important issue that was not properly discussed in our manuscript. In order to minimize the number of arbitrary parameters in our pipeline, we did not apply any additional DNA and RNA-coverage filter, except for the implicit requirement for enough read coverage to attain significance in the statistical tests. Such additional filters do improve the quality of the results (in terms of signal-to-noise), but we still feel they are not necessary (except for one minor modification, see below) as we now explain: Weak sites detection is based on a binomial test applied to the RNA reads mapped to the transcriptome. The minimal number of RNA reads to significance in this case is two reads, both of which are mismatched. There are only two such cases among the 81930 weak sites. One may wonder how come a site with 2 reads reading “G” could have an “A” in the consensus (as is the case for weak sites). The answer is that while our alignments and statistical tests take into account only uniquely aligned reads and quality scores Q≥30, Trinity does not apply these criteria. We chose not to fiddle with Trinity transcriptome assembly and use it as is, and thus there are quite a few cases where the consensus nucleotide of Trinity differs from the majority of the reads we considered in our downstream analysis. Another possible filter is the number of DNA reads. As mentioned above we did not apply any filter of this number. In fact, among our 81930 weak sites there are 1108 sites with no DNA reads at all, and 1148, 1581, 2110 and 2753 sites with 1,2,3,4 supporting DNA reads, respectively. At first, one could have thought that these sites should be discarded, as they might very well be just genomic SNPs. However, looking at the non-AG mismatches, and repeating our analysis with the additional filter of minimum-N-DNA-reads, we get the following results: *Note that the counts of AG sites are not exactly the same as the numbers quoted above for the current dataset. This is because each change of the filters influences the downstream false discovery rate significance calculation. So, while adding the DNA reads filter does marginally improve accuracy, the number of AG sites lost (roughly 1800 sites per one additional DNA read required) is an order of magnitude larger than the false-positive sites weeded out (quantified by the number of TC sites). We therefore choose not to imply the additional filter, and keep the thousands of sites with low DNA reads coverage. Notably, eight of the 143 sites tested in our MiSeq validation were taken from this group. All of these eight sites were confirmed to be editing sites (see Table G). In addition, most of these sites show evidence of editing in the additional tissues tested, supporting them being bona-fide editing sites (Supplemental Table 1, available via Dryad data repository). Lastly, these sites show a clear sequence motif resembling the known ADAR motif: depletion of G/C in 5' (only 8% and 12% of these sites have G and C in the 5’, respectively) and enrichment of G in 3' (41% of these sites have G in the 3’). Strong sites detection is based on a p-value calculated by the number of DNA reads. The minimal number of DNA reads to achieve significance in this case is five DNA reads. We did not apply another filter, which is the number of supporting RNA reads showing G. Trinity decision to call the consensus site G was considered strong enough evidence for G in the RNA reads. As explained above, Trinity might call a base even in the absence of uniquely aligned, high quality reads, so the low number of reads supporting the “G” call, or even the absence of any such reads, does not mean the site is not an editing sites. Looking more closely at the sites with low RNA-reads coverage, we find among our 5649 strong sites 5, 8 and 35 sites with 0, 1, 2 supporting RNA reads showing “G”, respectively. These numbers should be compared to the number of similar mismatches of the second most common modification type, which are 7(GA), 2(CT) and 5 (TC), respectively. We therefore agree that, despite our wish to minimize parameters and avoid arbitrary cutoffs, requiring at least one (uniquely aligned, high quality) RNA read to support editing is reasonable. We thus add this additional filter, and remove the five strong sites from the dataset. Sites with even a single read showing “G” are included (indeed, all of these eight sites, but only two out of the five we excluded above, show evidence for editing in the additional three nervous-system tissues tested). The omission of these five sites (three recoding sites) has no visible effect on the figures, and thus the figures need not be replaced. Numbers have been modified throughout the paper, when needed, but the changes are always insignificant. Note that the table of editing sites provides the full information on the number of DNA and RNA reads, enabling the user to gauge the confidence of the specific site of interest. We have added to the Methods section the following sentences: “We did not apply additional read-number filters, as these seem to only marginally increase accuracy while reducing the number of detected sites. However, we did exclude five strong sites for which there were no uniquely-aligned, high-quality, supporting RNA reads. This brings the number of strong A-to-G and non A-to-G modifications to 5,644 and 219, respectively. The full list of weak and strong sites (Supplementary Table 1, available via Dryad data repository) provides the number of DNA and RNA reads per site.” Did you apply filters to RNA and DNA reads? I mean reads with low quality and positions at read ends. As mentioned in the Methods section, our alignments considered only uniquely aligned reads and sites with quality-score Q≥30. However, Trinity de novo assembly takes into account all reads and treats the quality score differently. We did not remove read ends—our alignment was done to the transcriptome, and thus the splicing-junction-related misalignments that are known to introduce errors at the reads' ends are not expected. In addition, we used the local alignment configuration of Bowtie2, which allows reads’ end “trimming” to optimize alignment. Indeed, as shown in Figure 2–figure supplement 4, A-to-G mismatches are not overrepresented at reads' ends. The following text has been added to the Methods section: “In the following, bases called with quality score Q<30 were discarded. Note, however, that Trinity consensus sequence does take into account these bases, as well as reads that might have not been uniquely aligned to the transcriptome. It is often customary to filter out reads' ends when analyzing RNA-DNA mismatches. A main reason for that is the common mismatches at reads' ends due to alignment artifacts when a splicing junction occurs near the ends. In our case, as alignment is done to the transcriptome, we did not observe any increase in AG mismatch rate near reads' ends (Figure 2–figure supplement 4), and thus no such filter was used.” 4) Have the authors considered the possibility that their results arise from somatic genomic editing, rather than RNA editing? While for the human and macaque control, the RNA and DNA samples are from the same tissues, in the case of squid, RNA samples are from the tissues from the nervous system, while DNA is from the sperm sack. To unequivocally conclude that the observations are indeed from RNA editing, I guess that DNA and RNA need to be from the same biological source. Maybe the investigation of the distribution of the relative proportion of reads supporting and not supporting the edit could help here. As explained above, somatic mutations might indeed be a partial explanation of our false discovery rate, but, as they should not be biased towards A-to-G (and certainly should not introduce more A-to-G than T-to-C mismatches as they do not have strand preference), their contribution is already accounted for in our estimated false discovery rate (based on the non-AG mismatches), and they cannot account for a sizable fraction of our A-to-G sites. In principle, one might have speculated that an endogenous squid-specific DNA-editing process leads to specific A-to-G mismatches at a level much higher than the random mutations seen in other organisms. But: (i) we have no evidence for such a DNA deaminase enzyme in the squid (or any other organism) while we do know ADAR enzymes are present; (ii) even such a putative DNA deaminase is not expected to select the coding strand over the other strand, so there is no explanation for the enrichment of A-to-G over T-to-C. In addition, we can exclude somatic mutations as a major contributor to our putative editing sites list based on the following: The same sites re-occur in different tissues and different animals (see Figure 2–figure supplement 2). Randomly sampled sites were validated in additional different animals. Sites appear in clusters, as expected for editing sites but not somatic mutations. Nearby sites are not fully correlated (in the same read). Typically, for a pair of adjacent sites we find reads showing all four combinations (A in both, G in both, A and G, G and A), with only partial correlation. This is not expected for DNA-originated mutations. The sites show a clear sequence motif resembling the known ADAR motif (depletion of G/C in 5'; enrichment of G in 3'). 5) Related to the above, the authors used RNA only from tissues from the nervous system. Therefore, it is not possible to assess whether the phenomenon observed is characteristic of this system, or it is actually systemic in the entire organism. I think that sequencing RNA from some other non-nervous tissue could help to distinguish between the two hypotheses. We thank the referees for this important comment. We have now checked the editing levels in six additional non-nervous-system tissues at the sites already identified. Interestingly, the editing level in these tissues is an order of magnitude lower than in the five nervous system tissues previously studied. The data has been added to Supplementary Table 1 (available via Dryad digital repository). We have added to the main text the statement: “Consistently, editing levels observed in non-nervous-system tissues are considerably lower (Supplementary Table 1)”. We have also added to the Methods section information of the six additional tissues tested: “Tissues were also dissected from non-neuronal regions: the branchial heart, the Gill, the ventral epithelial layer on the pen, the marginal epithelial layer on the pen, the iridophore layer of the skin, and the chromatophore layer of the skin. Each of these six tissues originated from a different animal. RNA from all tissues was extracted with the RNAqueous kit (Life Technologies), and genomic DNA was extracted from the sperm sack using Genomic Tip Columns (Qiagen).” And: “To characterize the modifications levels in non-neuronal tissues, Illumina sequencing was again utilized to generate ∼23, ∼23, ∼19, ∼26, ∼19 and ∼14 million paired-end, 150 nt reads, using RNA from the branchial heart, the Gill, the ventral epithelial layer on the pen, the marginal epithelial layer on the pen, the iridophore layer of the skin, and the chromatophore layer of the skin. The same alignment procedure was applied to quantify the modification level at the previously described sites for each of the additional samples, revealing considerably lower editing levels in the non-neuronal tissues (Supplementary Table 1, available via Dryad data repository).” 6) Regarding the characterization of RNA editing events, events tend to be tissue specific. Are there events showing tissue specific levels? That is, cases in which the gene locus in expressed at the same level in all tissues but editing levels are different. Supplementary Table 1 (available via Dryad digital repository) provides all data required to search for such cases (A and G reads for each tissue). Indeed, one finds many cases in which the expression level is similar but the editing level is very different. For example, comparing OL and GFL tissues, there are 20,394 sites for which we obtained at least 100 reads per tissue, and the number of reads did not change more than 1.5-fold between tissues. Interestingly, the average editing level in these sites is 1.5-fold higher in the GFL tissue (9.4%) as compared to the OL tissue (6.1%). Looking for extreme tissue-specificity among these similar-expression sites, we find 13 cases in which the GFL sample exhibits editing levels >50% while the OL level at the same site are <5%, but only 3 cases of the opposite scenario. In summary, we do see tissue specificity even when expression levels are similar. The data provided enables the reader to find such sites using his/her own parameters. 7) It is a little bit disappointing that there is limited investigation in the potential mechanisms behind the extensive editing observed. The authors could have at least investigated ADAR with some additional detail. The RNA (and DNA sequence) helps to delineate the ADAR sequence, and the RNA reads to estimate expression levels. Are there multiple copies of ADAR in the squid genome? Is ADAR expressed at comparatively higher expression levels than in organism with low editing levels (they can use the mouse and human samples to make this comparison? Has the ADAR sequence in squid diverged faster than expected? In specific domains? All these questions are quite simple to answer. We agree that, in light of these findings, the structure and function of squid ADARs could be very intriguing. Our de novo transcriptome identified three ADAR-like enzymes. One which is similar to ADAR1, one similar to ADAR2, and a third one that we would predict to be inactive because it contains mutations at key residues which are involved in catalysis. Vertebrates also contain an inactive ADAR (ADAR3), however the squid’s “inactive” ADAR appears equally similar to all three vertebrate ADARs. The squid ADAR2 sequence has been well characterized and is the subject of two previous publications by the Rosenthal lab. The squid ADAR1 sequence is the subject of an ongoing investigation by the Rosenthal lab. It is reasonable to hypothesize that high level of editing in squid is due to comparatively high ADAR expression. However, it is difficult to compare the expression levels in terms of reads per kilobase of exon per million mapped reads (RPKM) in the absence of a full genome, because we can’t say whether unmapped reads are actually mappable to poorly expressed, or poorly resolved, transcripts. That being said, we did make a rough estimate of ADAR expression to answer this query. By ranking each component from our transcriptome we found that squid ADAR1 ranks at 0.92 and squid ADAR2 ranks at 0.48 for the GFL tissue, where 1 is the most expressed component, and 0 the least expressed component. These numbers are not very different from the situation in the human brain, where (based on the normal total brain tissue in Human Body Map dataset), ADAR1 is ranked at ∼0.9 and ADAR2 ∼0.5 (numbers are approximate, as different exons behave differently, and identifying the expression levels of the different splice variants is not trivial). This is only a rough estimate, and there are several delicate problems in this comparison; therefore we prefer not to make the claim that the expression levels are indeed similar in the text. However, we are working under the hypothesis that other differences between squid and mammalian ADARs (see below) are more likely to be of importance here. Much is known about squid ADAR2 and it does indeed have several interesting features that may explain, in part, high level editing. We’d like to stress, however, that it is only part of the puzzle. Vertebrate ADAR2’s have an invariant domain structure: they are composed of 2 N terminal double stranded RNA binding motifs (dsRBMs) followed by a conserved catalytic domain. Squid ADAR2 can have an additional dsRBM at its N terminus which is included in about half the transcripts through splicing (Palavicini et al., 2009). Thus, there is a canonical ADAR2 with 2 dsRBMS and a non-canonical one with three. When produced recombinantly in yeast and tested in vitro on a squid K channel mRNA substrate, the non-canonical version edits far more sites. This version also has a much higher affinity for dsRNA (Palavicini et al., 2012). Another interesting feature of the squid ADAR2 is that its own messages are abundantly edited, leading to multiple isoforms. The specificities of the individual isoforms have yet to be tested. We have also cloned and expressed Squid ADAR1. As with squid ADAR2, this enzyme has notable differences when compared with its vertebrate counterparts. Vertebrate ADAR1s are normally composed of three dsRBMs followed by a catalytic domain. Squid ADAR1 only has 1 dsRBM followed by the catalytic domain. At its N-terminus, however, it contains a highly basic domain that contains scores of phosphorylation sites. Squid ADAR1 messages are also highly edited. A graduate student in the Rosenthal lab has produced recombinant Squid ADAR1 and is currently characterizing its function for her doctoral project. Because it contains many differences from vertebrate orthologs, a full-functional characterization is a complex undertaking. Although its structure is clearly interesting, we prefer not to include Squid ADAR1 sequence data in this manuscript because only through structure-function studies can we assess whether its unique features might contribute to high-level editing. We would like to assure the reviewer, however, that these issues are very much on our minds and that we will be publishing detailed accountings shortly. We have added the following to the main text: “Have squid ADARs evolved novel structure that account for the high-level editing? A past study showed that a squid ADAR2 ortholog can be expressed as two isoforms due to alternative splicing: one, having 2 double-stranded RNA (dsRNA) binding motifs, resembles vertebrate ADAR2s. A second, however, contains an “extra” dsRNA binding motif at its N-terminus. This non-canonical isoform edits RNA more efficiently, edits more sites, and binds dsRNA with a far higher affinity. Further Squid ADAR2 messages themselves contain many editing sites, leading to many subtly different isoforms. An ADAR1 isoform is also present in our transcriptome and is the focus of an ongoing study.” 8) The authors also provide an adaptive explanation to the high levels of editing observed in the squid genome, and hypothesize that, in contrast to current assumptions, that extensive editing is common as a way to cope with temperature adaption, except in mammals that, as homeotherms, would not require such a process. This is, by the way, reminiscent of the isochore theory by Bernardi that would separate homeotherm vertebrates from “cold-blooded” (poikilotherm) vertebrates (to which, by the way, the authors may want to cite). If the authors were correct that would indeed be a quite relevant result. They could easily employ their pipeline in available vertebrate RNAseq data (for instance, ) to test this hypothesis. We thank the referees for this comment. The statement at the closing sentence of the main text was indeed a bit too strong, and we would like to rephrase and clarify our claim. We do not claim that all cold-blooded animals have extensive recoding. Published data for Drosophila, the leaf-cutting ant Acromyrmex echinatior, and C. elegans all show much less recoding activity than we report for the squid. What we do intend to suggest is that in those species where extensive editing does happen, it can be utilized for temperature adaptation. Thus, testing another cold-blooded species is not expected to help much, we presume most species will not reproduce the exceptional level found in squid. We do intend to explore, in future studies, the editing profile of species closely related to squid, in order to better understand how this extensive phenomenon has evolved. Parenthetically, the data in the reference suggested above cannot be used for our purposes, as no matching DNA-reads are provided. We have modified the closing sentence to read: “Most model organisms studied so far are mammals which are homeotherms. Future studies of more diverse species are needed to reveal the extent to which cold-blooded organisms might utilize extensive editing to respond to temperature changes and other environmental variables”.

Minimum DNA reads	#AG sites	%AG sites	#TC (2nd most abundant mismatch)
0 (No filter)	81930	86.9%	4793
1	80750	87.5%	4620
2	79410	87.9%	4460
3	77731	88.3%	4279
4	75390	88.6%	4064
5	72509	88.9%	3879

33 in total

1. An extra double-stranded RNA binding domain confers high activity to a squid RNA editing enzyme.

Authors: Juan Pablo Palavicini; Mary A O'Connell; Joshua J C Rosenthal
Journal: RNA Date: 2009-04-23 Impact factor: 4.942

2. Genome-wide analysis of A-to-I RNA editing by single-molecule sequencing in Drosophila.

Authors: Georges St Laurent; Michael R Tackett; Sergey Nechkin; Dmitry Shtokalo; Denis Antonets; Yiannis A Savva; Rachel Maloney; Philipp Kapranov; Charles E Lawrence; Robert A Reenan
Journal: Nat Struct Mol Biol Date: 2013-09-29 Impact factor: 15.369

3. Inosine exists in mRNA at tissue-specific levels and is most abundant in brain mRNA.

Authors: M S Paul; B L Bass
Journal: EMBO J Date: 1998-02-16 Impact factor: 11.598

Review 4. Functions and regulation of RNA editing by ADAR deaminases.

Authors: Kazuko Nishikura
Journal: Annu Rev Biochem Date: 2010 Impact factor: 23.643

5. RNA editing underlies temperature adaptation in K+ channels from polar octopuses.

Authors: Sandra Garrett; Joshua J C Rosenthal
Journal: Science Date: 2012-01-05 Impact factor: 47.728

6. Accurate identification of human Alu and non-Alu RNA editing sites.

Authors: Gokul Ramaswami; Wei Lin; Robert Piskol; Meng How Tan; Carrie Davis; Jin Billy Li
Journal: Nat Methods Date: 2012-04-04 Impact factor: 28.547

Review 7. A role for A-to-I RNA editing in temperature adaptation.

Authors: Sandra C Garrett; Joshua J C Rosenthal
Journal: Physiology (Bethesda) Date: 2012-12

8. Personal omics profiling reveals dynamic molecular and medical phenotypes.

Authors: Rui Chen; George I Mias; Jennifer Li-Pook-Than; Lihua Jiang; Hugo Y K Lam; Rong Chen; Elana Miriami; Konrad J Karczewski; Manoj Hariharan; Frederick E Dewey; Yong Cheng; Michael J Clark; Hogune Im; Lukas Habegger; Suganthi Balasubramanian; Maeve O'Huallachain; Joel T Dudley; Sara Hillenmeyer; Rajini Haraksingh; Donald Sharon; Ghia Euskirchen; Phil Lacroute; Keith Bettinger; Alan P Boyle; Maya Kasowski; Fabian Grubert; Scott Seki; Marco Garcia; Michelle Whirl-Carrillo; Mercedes Gallardo; Maria A Blasco; Peter L Greenberg; Phyllis Snyder; Teri E Klein; Russ B Altman; Atul J Butte; Euan A Ashley; Mark Gerstein; Kari C Nadeau; Hua Tang; Michael Snyder
Journal: Cell Date: 2012-03-16 Impact factor: 41.582