Shahar Alon1, Sandra C Garrett2, Erez Y Levanon3, Sara Olson2, Brenton R Graveley2, Joshua J C Rosenthal4, Eli Eisenberg5. 1. George S Wise Faculty of Life Sciences, Department of Neurobiology, Tel Aviv University, Tel Aviv, Israel. 2. Department of Genetics and Developmental Biology, Institute for Systems Genomics, University of Connecticut Health Center, Farmington, United States. 3. Mina and Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat Gan, Israel. 4. Institute of Neurobiology, University of Puerto Rico Medical Sciences Campus, San Juan, Puerto Rico. 5. Sagol School of Neuroscience, Tel Aviv University, Tel Aviv, Israel.
Abstract
RNA editing by adenosine deamination alters genetic information from the genomic blueprint. When it recodes mRNAs, it gives organisms the option to express diverse, functionally distinct, protein isoforms. All eumetazoans, from cnidarians to humans, express RNA editing enzymes. However, transcriptome-wide screens have only uncovered about 25 transcripts harboring conserved recoding RNA editing sites in mammals and several hundred recoding sites in Drosophila. These studies on few established models have led to the general assumption that recoding by RNA editing is extremely rare. Here we employ a novel bioinformatic approach with extensive validation to show that the squid Doryteuthis pealeii recodes proteins by RNA editing to an unprecedented extent. We identify 57,108 recoding sites in the nervous system, affecting the majority of the proteins studied. Recoding is tissue-dependent, and enriched in genes with neuronal and cytoskeletal functions, suggesting it plays an important role in brain physiology.
RNA editing by adenosine deamination alters genetic information from the genomic blueprint. When it recodes mRNAs, it gives organisms the option to express diverse, functionally distinct, protein isoforms. All eumetazoans, from cnidarians to humans, express RNA editing enzymes. However, transcriptome-wide screens have only uncovered about 25 transcripts harboring conserved recoding RNA editing sites in mammals and several hundred recoding sites in Drosophila. These studies on few established models have led to the general assumption that recoding by RNA editing is extremely rare. Here we employ a novel bioinformatic approach with extensive validation to show that the squid Doryteuthis pealeii recodes proteins by RNA editing to an unprecedented extent. We identify 57,108 recoding sites in the nervous system, affecting the majority of the proteins studied. Recoding is tissue-dependent, and enriched in genes with neuronal and cytoskeletal functions, suggesting it plays an important role in brain physiology.
The central dogma of biology maintains that genetic information passes faithfully from
DNA to RNA to proteins; however, with the help of tools such as alternative splicing,
organisms use RNA as a canvas to modify and enrich this flow of information. RNA editing
by deamination of adenosine to inosine (A-to-I) is another process used to alter genetic
information (Nishikura, 2010). Unlike
alternative splicing, which shuffles relatively large regions of RNA, editing targets
single bases in order to fine-tune protein function. Because inosine is interpreted as
guanosine by the cellular machinery, this process can recode codons (Basilio et al., 1962). A-to-I RNA editing is
catalyzed by the ADAR (adenosine deaminase that acts on RNA) family of enzymes. All
eumetazoans, from cnidarians to mammals, express ADARs but the extent to which they use
them to recode has been explored in few representatives (Nishikura, 2010).Recent advances in DNA sequencing and the supporting computational analyses have
permitted transcriptome-wide screens for RNA editing events. So far, such studies have
been limited to organisms with a sequenced genome (Ramaswami et al., 2012, 2013). In
general, these screens have looked for variation in RNA at positions that are invariant
in the genome. In humans, inosine is abundant in RNA (Paul and Bass, 1998; Bazak et al.,
2014), but almost all of it lies within transcribed repetitive elements in
untranslated regions or introns (Nishikura,
2010). A compilation of recoding sites in human transcriptomes revealed 1183
events (Xu and Zhang, 2014), but most were
observed in only a single sample. Individual searches (Danecek et al., 2012; Ramaswami et al.,
2013) uncovered only 115 (non-repetitive) recoding events, and 53 in mice; 34
recoding sites are conserved across mammals (Pinto et
al., 2014). In Drosophila, an order of magnitude more
recoding sites have been identified, residing in about 3% of all messages (St Laurent et al., 2013). Although individual
editing sites are clearly essential (Brusa et al.,
1995), these data suggest that RNA editing is not a broadly used mechanism for
proteome diversification.However, anecdotal data suggest this assumption might not apply across the animal
kingdom. For example, using traditional cloning methods, scores of recoding sites have
been uncovered in a small number of squid and octopus transcripts encoding potassium
channels, ADARs, and ion pumps (Garrett and Rosenthal,
2012a). As for most organisms, there are no genomes available for cephalopods.
Here we apply a novel approach for editing site detection in the absence of a sequenced
genome. We use it to comprehensively identify editing sites in the squid giant axon
system and other areas of the nervous system. Surprisingly, almost 60% of all mRNAs
studied harbor recoding events, and most at multiple sites. These data show orders of
magnitude more recoding in the squid proteome than in any other species studied to date.
In squid, editing is so pervasive that the central dogma should be modified to include
this process. Our results open the possibility that extensive recoding is common in many
organisms, rivaling alternative splicing as a means of creating functional
diversity.
Results and discussion
To detect RNA editing sites in the squid nervous system, we generated millions of RNA
and genomic DNA reads from an individual squid. Our method differed from previous
approaches by using a de novo transcriptome as the point of reference instead of a
genome (Figure 1A). The transcriptome was
assembled from RNA-seq reads, and each nucleotide within it represents the consensus of
many reads. If the majority of RNA reads were edited (‘strong’ editing
sites), the transcriptome would differ from the genomic DNA and read ‘G’
where gDNA reads would show ‘A’ (the sequencing process identifies
inosines as guanosines). We detected such sites by aligning DNA-seq reads to the
transcriptome (Figure 1B). At positions where
editing occurred in the minority of RNA-seq reads (‘weak’ editing sites),
however, the transcriptome and the genomic DNA would be identical. These sites were
detected by identifying variability in RNA-seq, but not DNA-seq, reads (Figure 1B). This general approach is applicable to
all organisms that lack a sequenced genome.
Figure 1.
A general approach to detect RNA editing sites in organisms that lack a
sequenced genome.
(A) Squid RNA-seq data is used to create a de novo transcriptome
followed by the detection of conserved ORFs. (B)
‘Weak’ and ‘strong’ editing sites are detected by
comparing RNA and DNA reads from the same animal to the ORFs from the
transcriptome. ‘Weak’ editing sites were detected by observing
the minority of the RNA reads to differ from the consensus transcriptome
nucleotide. ‘Strong’ editing sites, where the consensus
transcriptome includes the edited nucleotide, were detected by observing all
DNA reads to differ from the transcriptome nucleotide.
DOI:
http://dx.doi.org/10.7554/eLife.05198.003
A general approach to detect RNA editing sites in organisms that lack a
sequenced genome.
(A) Squid RNA-seq data is used to create a de novo transcriptome
followed by the detection of conserved ORFs. (B)
‘Weak’ and ‘strong’ editing sites are detected by
comparing RNA and DNA reads from the same animal to the ORFs from the
transcriptome. ‘Weak’ editing sites were detected by observing
the minority of the RNA reads to differ from the consensus transcriptome
nucleotide. ‘Strong’ editing sites, where the consensus
transcriptome includes the edited nucleotide, were detected by observing all
DNA reads to differ from the transcriptome nucleotide.DOI:
http://dx.doi.org/10.7554/eLife.05198.003We sequenced cDNA from the giant axon system (giant fiber lobe: GFL), the optic lobes
(OL) and matching germline gDNA isolated from the same animal. cDNA was also sequenced
from the vertical lobe (VL), buccal ganglion (BG) and the Stellate Ganglion (SG) from
another animal (The SG and GFL are parts of the peripheral nervous system; all the rest
are from the central nervous system). The GFL and OL RNA-seq reads were used to
construct a transcriptome model (Grabherr et al.,
2011). To focus on editing sites inside bona fide coding
regions, we retained only transcript-fragments with open reading frames (ORFs)
homologous to known proteins (UniProt Consortium,
2014) (Figure 1A) and used the editing
detection procedures outlined in Figure 1B.
Surprisingly, our pipeline identified 81,930 weak sites, and 5644 strong sites, due to
A-to-G transitions (Figure 2A). Only 12,403 weak
sites and 219 strong sites were identified for the other 11 possible types of
modifications. These numbers suggest false-positive rates of 15% and 4% respectively,
mainly due to transcriptome assembly problems, SNPs, somatic mutations and systematic
mis-alignments. Note that these false-positive rates are considerably lower than those
for similar searches for editing within human coding sequences, where a genome reference
was employed (Ramaswami et al., 2013).
Figure 2.
High number of RNA editing sites in squid translates into an
extraordinary number of recoding events.
(A) The number of nucleotide modifications observed in the
squid nervous system for each possible substitution type (in blue, 87% of
all detected modifications were A-to-G). A similar analysis of human and
Rhesus macaque sequencing data (green and brown, respectively) shows low
levels, and no enrichment, of A-to-I editing in coding regions, as reported
previously. In the inset, the distribution of nucleotide modifications
observed in squid mitochondria-encoded genes, used here as a negative
control. The ADAR enzymes have no reported activity in the mitochondria and,
accordingly, no A-to-G overrepresentation is observed. Also see Figure 2—figure supplement 1.
(B) Scope of recoding due to RNA editing in squid, both in
the total number of recoding events and the total number of genes affected,
is orders of magnitude higher than human, mouse, and fly (numbers for other
organisms are based on recent publications using RNA-seq datasets comparable
to the one used here [Danecek et al.,
2012; Ramaswami et al.,
2013; St Laurent et al.,
2013]).
DOI:
http://dx.doi.org/10.7554/eLife.05198.004
(A) The number of modifications observed for each possible
modification type, considering only modifications that appear in clusters.
About half of the A-to-G modifications appear in clusters of at least three
consecutive same-type mismatches, in accordance with the expected properties
of A-to-I editing sites, found in other organisms (Morse et al., 2002; Levanon et al., 2004). (B) The number of reads with
3, 4, and 5 consecutive identical mismatches for each possible modification
type. Most of these reads contained A-to-G modifications. (C)
The sequence surrounding of the observed A-to-G modifications, compared with
that surrounding random adenosines in our model transcriptome. The sequence
surrounding the ‘weak’ sites and the ‘strong’
sites (Figure 1B) are similar to each
other and to what is known for other species (Kleinberger and Eisenberg, 2010).
DOI:
http://dx.doi.org/10.7554/eLife.05198.005
(A) For the A-to-G sites, different tissues show varying levels
of editing, globally, as well as site-specific tissue-dependent regulation.
For example, higher modification levels (red) are observed in the GFL tissue
whereas low levels (green) are observed in the VL tissue. Yet, some sites
(top rows) are edited more strongly in VL. (B) Hierarchical
clustering of the non A-to-G modification levels in the five different
neuronal tissues. Consistently with the modifications being due to genomic
polymorphisms, data cluster according to the animal it was taken from:
modification levels in the GFL and OL tissues, which were taken from one
individual animal, form one cluster, as do the VL, BG and SG tissues, taken
from another individual animal. Modification levels are, by and large,
uniform across tissues coming from the same individual animal. Note that in
both panels only sites with significantly variable modification levels are
presented (binomial analysis was performed with Bonferroni-corrected p-value
of 0.05 as a cutoff), each row represents one modification site.
Abbreviations: Giant fiber lobe (GFL), Optic lobe (OL), Vertical lobe (VL),
Buccal ganglia (BG), and Stellate ganglion (SG).
DOI:
http://dx.doi.org/10.7554/eLife.05198.006
(A) An example of editing sites verified using Sanger
sequencing in the squid protein piccolo. Arrowheads mark the locations of
the editing sites. (B) Editing levels measured by Sanger
sequencing for the 40 sites correlate with RNA sequencing results.
DOI:
http://dx.doi.org/10.7554/eLife.05198.007
(A) The distribution of the quality scores for all the sites
used (all the positions inside all the analyzed reads), A-to-G
modifications, and non A-to-G modifications. No difference is observed
between these three groups. Note that sites with Q < 30 were excluded.
(B) The number of mismatches detected as a function of the
position inside the read. Non A-to-G mismatches tend to occur at reads'
ends, suggesting alignment artifacts (which tend to affect reads' ends)
are responsible to some of these mismatches (Ramaswami et al., 2012). A-to-G mismatches do not show
such tendency. (C) The distribution of modification levels for
A-to-G and non A-to-G sites, for the GFL and OL tissues. The increased
number of non A-to-G sites with ∼50% modification level hint at some
genomic polymorphisms (SNPs), that were not represented in our DNA reads due
to the limited coverage, are included among the non A-to-G mismatches.
Consistently, 51% of the sites with non A-to-G modification levels between
40–60% recur in both tissues (coming from the same individual
animal), compared to only 22% of the A-to-G modifications in the same range.
Similarly, 50% of non A-to-G modification levels higher than 90% recur in
both tissues (coming from the same individual animal), compared to only 21%
for A-to-G modifications in the same range. These two ranges are the only
ones in which such difference is observed. Abbreviations: Giant fiber lobe
(GFL), Optic lobe (OL), quality score (Q).
DOI:
http://dx.doi.org/10.7554/eLife.05198.008
(RNA-seq from the GFL and OL tissues only, originating from the animal whose
DNA was sequenced). (A) The number of A-to-G sites detected
increases with the number of RNA reads, with no sign for saturation. Thus,
we expect the number of editing sites to be much larger than the one
reported here. (B) The number of A-to-G sites detected in each
gene correlates with the gene's RNA coverage, demonstrating that with
much larger RNA-seq data, the number of detected editing sites could be as
high as ∼200,000 (expected number of 17 sites per protein, on
average, for each of the ∼12K ORFs in our model transcriptome).
(C) The number of modification sites detected as a function
of the number of DNA reads. Detection of modification sites is based on
mismatches between cDNA reads and the consensus. However, one of the main
sources for such mismatches, which masks the signal due to RNA editing, is
heterozygosity of the genome. The more DNA reads available, the better one
can identify and exclude genomically heterozygous sites (SNPs) and improve
signal-to-noise ratio. (i) ‘weak’ sites detection (ii)
‘strong’ sites detection - here exclusion of SNPs is part of
the detection scheme itself (see Methods) and thus the number of detected
sites (and not only the signal-to-noise ratio) increases with gDNA coverage
(iii) ‘Weak’ and ‘strong’ sites, combined.
Abbreviations: Giant fiber lobe (GFL), Optic lobe (OL), Standard Error
(SE).
DOI:
http://dx.doi.org/10.7554/eLife.05198.009
High number of RNA editing sites in squid translates into an
extraordinary number of recoding events.
(A) The number of nucleotide modifications observed in the
squid nervous system for each possible substitution type (in blue, 87% of
all detected modifications were A-to-G). A similar analysis of human and
Rhesus macaque sequencing data (green and brown, respectively) shows low
levels, and no enrichment, of A-to-I editing in coding regions, as reported
previously. In the inset, the distribution of nucleotide modifications
observed in squid mitochondria-encoded genes, used here as a negative
control. The ADAR enzymes have no reported activity in the mitochondria and,
accordingly, no A-to-G overrepresentation is observed. Also see Figure 2—figure supplement 1.
(B) Scope of recoding due to RNA editing in squid, both in
the total number of recoding events and the total number of genes affected,
is orders of magnitude higher than human, mouse, and fly (numbers for other
organisms are based on recent publications using RNA-seq datasets comparable
to the one used here [Danecek et al.,
2012; Ramaswami et al.,
2013; St Laurent et al.,
2013]).
Figure 2—figure supplement 1.
A-to-G modifications appear in clusters of consecutive identical
mismatches and show distinctive 5′ and 3′ neighbor
preferences.
(A) The number of modifications observed for each possible
modification type, considering only modifications that appear in clusters.
About half of the A-to-G modifications appear in clusters of at least three
consecutive same-type mismatches, in accordance with the expected properties
of A-to-I editing sites, found in other organisms (Morse et al., 2002; Levanon et al., 2004). (B) The number of reads with
3, 4, and 5 consecutive identical mismatches for each possible modification
type. Most of these reads contained A-to-G modifications. (C)
The sequence surrounding of the observed A-to-G modifications, compared with
that surrounding random adenosines in our model transcriptome. The sequence
surrounding the ‘weak’ sites and the ‘strong’
sites (Figure 1B) are similar to each
other and to what is known for other species (Kleinberger and Eisenberg, 2010).
DOI:
http://dx.doi.org/10.7554/eLife.05198.005
DOI:
http://dx.doi.org/10.7554/eLife.05198.004
A-to-G modifications appear in clusters of consecutive identical
mismatches and show distinctive 5′ and 3′ neighbor
preferences.
(A) The number of modifications observed for each possible
modification type, considering only modifications that appear in clusters.
About half of the A-to-G modifications appear in clusters of at least three
consecutive same-type mismatches, in accordance with the expected properties
of A-to-I editing sites, found in other organisms (Morse et al., 2002; Levanon et al., 2004). (B) The number of reads with
3, 4, and 5 consecutive identical mismatches for each possible modification
type. Most of these reads contained A-to-G modifications. (C)
The sequence surrounding of the observed A-to-G modifications, compared with
that surrounding random adenosines in our model transcriptome. The sequence
surrounding the ‘weak’ sites and the ‘strong’
sites (Figure 1B) are similar to each
other and to what is known for other species (Kleinberger and Eisenberg, 2010).DOI:
http://dx.doi.org/10.7554/eLife.05198.005
Hierarchical clustering reveals tissue selectivity in the modification
levels of the A-to-G sites, but not in the non A-to-G sites.
(A) For the A-to-G sites, different tissues show varying levels
of editing, globally, as well as site-specific tissue-dependent regulation.
For example, higher modification levels (red) are observed in the GFL tissue
whereas low levels (green) are observed in the VL tissue. Yet, some sites
(top rows) are edited more strongly in VL. (B) Hierarchical
clustering of the non A-to-G modification levels in the five different
neuronal tissues. Consistently with the modifications being due to genomic
polymorphisms, data cluster according to the animal it was taken from:
modification levels in the GFL and OL tissues, which were taken from one
individual animal, form one cluster, as do the VL, BG and SG tissues, taken
from another individual animal. Modification levels are, by and large,
uniform across tissues coming from the same individual animal. Note that in
both panels only sites with significantly variable modification levels are
presented (binomial analysis was performed with Bonferroni-corrected p-value
of 0.05 as a cutoff), each row represents one modification site.
Abbreviations: Giant fiber lobe (GFL), Optic lobe (OL), Vertical lobe (VL),
Buccal ganglia (BG), and Stellate ganglion (SG).DOI:
http://dx.doi.org/10.7554/eLife.05198.006
Validation of editing using Sanger sequencing.
(A) An example of editing sites verified using Sanger
sequencing in the squid protein piccolo. Arrowheads mark the locations of
the editing sites. (B) Editing levels measured by Sanger
sequencing for the 40 sites correlate with RNA sequencing results.DOI:
http://dx.doi.org/10.7554/eLife.05198.007
Quality controls for the A-to-G modifications and the non A-to-G
modifications.
(A) The distribution of the quality scores for all the sites
used (all the positions inside all the analyzed reads), A-to-G
modifications, and non A-to-G modifications. No difference is observed
between these three groups. Note that sites with Q < 30 were excluded.
(B) The number of mismatches detected as a function of the
position inside the read. Non A-to-G mismatches tend to occur at reads'
ends, suggesting alignment artifacts (which tend to affect reads' ends)
are responsible to some of these mismatches (Ramaswami et al., 2012). A-to-G mismatches do not show
such tendency. (C) The distribution of modification levels for
A-to-G and non A-to-G sites, for the GFL and OL tissues. The increased
number of non A-to-G sites with ∼50% modification level hint at some
genomic polymorphisms (SNPs), that were not represented in our DNA reads due
to the limited coverage, are included among the non A-to-G mismatches.
Consistently, 51% of the sites with non A-to-G modification levels between
40–60% recur in both tissues (coming from the same individual
animal), compared to only 22% of the A-to-G modifications in the same range.
Similarly, 50% of non A-to-G modification levels higher than 90% recur in
both tissues (coming from the same individual animal), compared to only 21%
for A-to-G modifications in the same range. These two ranges are the only
ones in which such difference is observed. Abbreviations: Giant fiber lobe
(GFL), Optic lobe (OL), quality score (Q).DOI:
http://dx.doi.org/10.7554/eLife.05198.008
Number of modification sites detected as a function of the amount of DNA
and RNA reads.
(RNA-seq from the GFL and OL tissues only, originating from the animal whose
DNA was sequenced). (A) The number of A-to-G sites detected
increases with the number of RNA reads, with no sign for saturation. Thus,
we expect the number of editing sites to be much larger than the one
reported here. (B) The number of A-to-G sites detected in each
gene correlates with the gene's RNA coverage, demonstrating that with
much larger RNA-seq data, the number of detected editing sites could be as
high as ∼200,000 (expected number of 17 sites per protein, on
average, for each of the ∼12K ORFs in our model transcriptome).
(C) The number of modification sites detected as a function
of the number of DNA reads. Detection of modification sites is based on
mismatches between cDNA reads and the consensus. However, one of the main
sources for such mismatches, which masks the signal due to RNA editing, is
heterozygosity of the genome. The more DNA reads available, the better one
can identify and exclude genomically heterozygous sites (SNPs) and improve
signal-to-noise ratio. (i) ‘weak’ sites detection (ii)
‘strong’ sites detection - here exclusion of SNPs is part of
the detection scheme itself (see Methods) and thus the number of detected
sites (and not only the signal-to-noise ratio) increases with gDNA coverage
(iii) ‘Weak’ and ‘strong’ sites, combined.
Abbreviations: Giant fiber lobe (GFL), Optic lobe (OL), Standard Error
(SE).DOI:
http://dx.doi.org/10.7554/eLife.05198.009Although the number of A-to-G discrepancies was unexpectedly large, subsequent analyses
support the idea that they are caused by RNA editing rather than other sources of error.
First, we applied our pipeline to similarly sized data sets from a human blood sample
and from the rhesus macaque brain, each containing matching RNA and DNA sequence reads.
As expected for mammals, the quantity of AG mismatches in coding regions were similar to
those from non-AG mismatches, and both were quantitatively indistinguishable from the
noise determined from the squid data (Figure 2A
and Supplementary file
1A,B). These controls demonstrate that the enormous number of AG mismatches in
the squid data is not an artefact of our analysis pipeline. Other features point to the
biological origin of our AG mismatches. Similar to A-to-I editing sites in other
organisms (Morse et al., 2002; Levanon et al., 2004; Kleinberger and Eisenberg, 2010), those identified here tend to
cluster and show distinctive 5′ and 3′ neighbor preferences (Figure 2—figure supplement 1). In addition,
hierarchical clustering of results from five tissues reveals that A-to-G modifications,
but not other types, exhibit clear tissue-specificity, suggesting they do not result
from genomic polymorphisms and mapping artifacts (Figure 2—figure supplement 2). No A-to-G overrepresentation is
observed in mitochondria-encoded genes (Figure
2A), in agreement with the absence of ADARs, and by extension A-to-I editing, in
the mitochondria. Finally, direct Sanger sequencing from a second individual confirmed
editing at 40/40 A-to-G sites, and deep-sequencing validated 120/143 A-to-G sites but
none of the 12 non A-to-G sites tested (Figure
2—figure supplement 3, Supplementary file 1C–G). Taken together, the
overrepresentation of A-to-G modifications over all other types, the motifs surrounding
the A-to-G sites, the tissue-specific modification levels, and the validation
experiments, provide evidence that the majority of the A-to-G modifications are true
editing events, while most non A-to-G modifications are likely technical artifacts or
genomic variations (Zaranek et al., 2010; Ramaswami et al., 2012) (Figure 2—figure supplement 4).
Figure 2—figure supplement 2.
Hierarchical clustering reveals tissue selectivity in the modification
levels of the A-to-G sites, but not in the non A-to-G sites.
(A) For the A-to-G sites, different tissues show varying levels
of editing, globally, as well as site-specific tissue-dependent regulation.
For example, higher modification levels (red) are observed in the GFL tissue
whereas low levels (green) are observed in the VL tissue. Yet, some sites
(top rows) are edited more strongly in VL. (B) Hierarchical
clustering of the non A-to-G modification levels in the five different
neuronal tissues. Consistently with the modifications being due to genomic
polymorphisms, data cluster according to the animal it was taken from:
modification levels in the GFL and OL tissues, which were taken from one
individual animal, form one cluster, as do the VL, BG and SG tissues, taken
from another individual animal. Modification levels are, by and large,
uniform across tissues coming from the same individual animal. Note that in
both panels only sites with significantly variable modification levels are
presented (binomial analysis was performed with Bonferroni-corrected p-value
of 0.05 as a cutoff), each row represents one modification site.
Abbreviations: Giant fiber lobe (GFL), Optic lobe (OL), Vertical lobe (VL),
Buccal ganglia (BG), and Stellate ganglion (SG).
DOI:
http://dx.doi.org/10.7554/eLife.05198.006
Figure 2—figure supplement 3.
Validation of editing using Sanger sequencing.
(A) An example of editing sites verified using Sanger
sequencing in the squid protein piccolo. Arrowheads mark the locations of
the editing sites. (B) Editing levels measured by Sanger
sequencing for the 40 sites correlate with RNA sequencing results.
DOI:
http://dx.doi.org/10.7554/eLife.05198.007
Figure 2—figure supplement 4.
Quality controls for the A-to-G modifications and the non A-to-G
modifications.
(A) The distribution of the quality scores for all the sites
used (all the positions inside all the analyzed reads), A-to-G
modifications, and non A-to-G modifications. No difference is observed
between these three groups. Note that sites with Q < 30 were excluded.
(B) The number of mismatches detected as a function of the
position inside the read. Non A-to-G mismatches tend to occur at reads'
ends, suggesting alignment artifacts (which tend to affect reads' ends)
are responsible to some of these mismatches (Ramaswami et al., 2012). A-to-G mismatches do not show
such tendency. (C) The distribution of modification levels for
A-to-G and non A-to-G sites, for the GFL and OL tissues. The increased
number of non A-to-G sites with ∼50% modification level hint at some
genomic polymorphisms (SNPs), that were not represented in our DNA reads due
to the limited coverage, are included among the non A-to-G mismatches.
Consistently, 51% of the sites with non A-to-G modification levels between
40–60% recur in both tissues (coming from the same individual
animal), compared to only 22% of the A-to-G modifications in the same range.
Similarly, 50% of non A-to-G modification levels higher than 90% recur in
both tissues (coming from the same individual animal), compared to only 21%
for A-to-G modifications in the same range. These two ranges are the only
ones in which such difference is observed. Abbreviations: Giant fiber lobe
(GFL), Optic lobe (OL), quality score (Q).
DOI:
http://dx.doi.org/10.7554/eLife.05198.008
Unlike with humans, the large number of A-to-I editing events translates into a large
number of recoding events: Overall, 57,108 recoding events were detected in 6991/12,039
ORFs. These numbers are orders of magnitude higher than any other species studied (Danecek et al., 2012; Ramaswami et al., 2013; St
Laurent et al., 2013; Pinto et al.,
2014) (Figure 2B). Moreover, a large
fraction of the proteins are recoded at multiple sites (Figure 3A): about 1/3 harbor ≥3 sites and 10% harbor ≥10 sites.
Even when focusing only on recoding sites with editing levels >10%, about 10% of
the squid proteins harbor ≥5 sites (Figure
3A). On the extreme end of the spectrum, the ORFs encoding α Spectrin
and Piccolo have 247 and 182 recoding sites, respectively (Figure 3B and Figure
3—figure supplement 1). It should be noted that only annotated ORFs
were examined in our pipeline, and the number of editing sites did not saturate with
respect to the number of sequence reads (Figure
2—figure supplement 5). Moreover, incompleteness of the de novo
transcriptome, as well as incorrect assembly of paralogs and splice variants, may cause
our pipeline to miss many additional sites (Supplementary file 1B). Therefore there are probably many more
recoding sites in the squid transcriptome.
Figure 3.
Editing often recodes multiple amino acids in the same protein.
(A) The fraction of the squid genes that harbor multiple
recoding events. About a third of the squid proteins harbor three or more
recoding sites and more than 10% harbor 10 or more recoding sites.
(B) Homology-modelling of the α Spectrin protein in
which 10% of the amino acids (247/2412) are recoded by editing. Amino acids
1602 to 1918 of the squid α Spectrin protein are included in the 3-D
model. Recoding sites are highlighted in green. Recoding sites with
tissue-dependent levels are highlighted in red and the corresponding editing
levels are indicated in the table. Also see Figure 3—figure supplement 1.
DOI:
http://dx.doi.org/10.7554/eLife.05198.010
Amino acids 1830 to 1966 of the squid Piccolo protein are included in the
3-D model. Recoding sites with tissue-dependent and -independent editing
levels are highlighted in red and green, respectively. Aspartate residues
involved in Ca2+ binding are highlighted in yellow.
Abbreviations: Giant fiber lobe (GFL), Optic lobe (OL), Vertical lobe (VL),
Buccal ganglia (BG), and Stellate ganglion (SG).
DOI:
http://dx.doi.org/10.7554/eLife.05198.011
Figure 3—figure supplement 1.
Homology-modelling of the squid Piccolo protein in which 9% of the amino
acids (182/2098) are recoded by editing.
Amino acids 1830 to 1966 of the squid Piccolo protein are included in the
3-D model. Recoding sites with tissue-dependent and -independent editing
levels are highlighted in red and green, respectively. Aspartate residues
involved in Ca2+ binding are highlighted in yellow.
Abbreviations: Giant fiber lobe (GFL), Optic lobe (OL), Vertical lobe (VL),
Buccal ganglia (BG), and Stellate ganglion (SG).
DOI:
http://dx.doi.org/10.7554/eLife.05198.011
Figure 2—figure supplement 5.
Number of modification sites detected as a function of the amount of DNA
and RNA reads.
(RNA-seq from the GFL and OL tissues only, originating from the animal whose
DNA was sequenced). (A) The number of A-to-G sites detected
increases with the number of RNA reads, with no sign for saturation. Thus,
we expect the number of editing sites to be much larger than the one
reported here. (B) The number of A-to-G sites detected in each
gene correlates with the gene's RNA coverage, demonstrating that with
much larger RNA-seq data, the number of detected editing sites could be as
high as ∼200,000 (expected number of 17 sites per protein, on
average, for each of the ∼12K ORFs in our model transcriptome).
(C) The number of modification sites detected as a function
of the number of DNA reads. Detection of modification sites is based on
mismatches between cDNA reads and the consensus. However, one of the main
sources for such mismatches, which masks the signal due to RNA editing, is
heterozygosity of the genome. The more DNA reads available, the better one
can identify and exclude genomically heterozygous sites (SNPs) and improve
signal-to-noise ratio. (i) ‘weak’ sites detection (ii)
‘strong’ sites detection - here exclusion of SNPs is part of
the detection scheme itself (see Methods) and thus the number of detected
sites (and not only the signal-to-noise ratio) increases with gDNA coverage
(iii) ‘Weak’ and ‘strong’ sites, combined.
Abbreviations: Giant fiber lobe (GFL), Optic lobe (OL), Standard Error
(SE).
DOI:
http://dx.doi.org/10.7554/eLife.05198.009
Editing often recodes multiple amino acids in the same protein.
(A) The fraction of the squid genes that harbor multiple
recoding events. About a third of the squid proteins harbor three or more
recoding sites and more than 10% harbor 10 or more recoding sites.
(B) Homology-modelling of the α Spectrin protein in
which 10% of the amino acids (247/2412) are recoded by editing. Amino acids
1602 to 1918 of the squid α Spectrin protein are included in the 3-D
model. Recoding sites are highlighted in green. Recoding sites with
tissue-dependent levels are highlighted in red and the corresponding editing
levels are indicated in the table. Also see Figure 3—figure supplement 1.DOI:
http://dx.doi.org/10.7554/eLife.05198.010
Homology-modelling of the squid Piccolo protein in which 9% of the amino
acids (182/2098) are recoded by editing.
Amino acids 1830 to 1966 of the squid Piccolo protein are included in the
3-D model. Recoding sites with tissue-dependent and -independent editing
levels are highlighted in red and green, respectively. Aspartate residues
involved in Ca2+ binding are highlighted in yellow.
Abbreviations: Giant fiber lobe (GFL), Optic lobe (OL), Vertical lobe (VL),
Buccal ganglia (BG), and Stellate ganglion (SG).DOI:
http://dx.doi.org/10.7554/eLife.05198.011Consistent with other organisms (Stapleton et al.,
2006), recoding events are enriched in genes with neuronal and cytoskeletal
functions (Figure 4—figure supplement 1A
and Supplementary file
1H). To gain insight on the effected pathways, squid ORFs were mapped to all
human KEGG pathways (Kanehisa and Goto, 2000).
Editing has a global effect on most pathways (Supplementary file 1I), and those related to the nervous system
are even more affected. For example, of the 27 proteins in the ‘Synaptic vesicle
cycle’ pathway, 22 are edited and 14 heavily so (Figure 4A). Similarly, of the 39 proteins in the ‘Axon
guidance’ pathway, 33 are edited and 19 heavily so. Other notable pathways are
‘Regulation of actin cytoskeleton’ and ‘Circadian rhythm’
(Figure 4—figure supplement 1B). By
contrast, proteins in the pathways ‘Ribosome’ and ‘RNA
polymerase’ are edited less than average (Supplementary file 1I), demonstrating that some pathways may be
protected from editing. Consistently, editing levels observed in non-nervous system
tissues are considerably lower (Alon et al.,
2015).
Figure 4—figure supplement 1.
Recoding events are enriched in genes with neuronal and cytoskeletal
functions and globally affect molecular pathways.
(A) The top-scoring Gene Ontology (GO) terms (rated by false
discovery rate, FDR), enriched in a list of squid ORFs ranked by the
cumulative recoding level, that is the editing level summed over all
recoding sites (Eden et al., 2009).
(B) All of the identifiable squid proteins present in the
KEGG pathway ‘Circadian rhythm’ are edited, and many are
heavily edited. We define ‘heavily edited proteins’ as those
for which the cumulative recoding level exceeds unity (i.e., each copy of
the protein is expected to have at least one modified amino acid, on
average). These are marked red, other edited proteins in magenta, and
proteins not identifiable in the squid transcriptome in green. This figure
was created using the KEGG (Kanehisa and
Goto, 2000) pathway database website (http://www.genome.jp/kegg/pathway.html). Editing levels were
calculated using data from the Giant fiber lobe (GFL) and Optic lobe (OL)
tissues combined.
DOI:
http://dx.doi.org/10.7554/eLife.05198.013
Figure 4.
Recoding due to RNA editing affects complete molecular pathways and is
likely to be more advantageous in sites with high editing levels.
(A) All the squid proteins present in the KEGG ‘Synaptic
vesicle cycle’ pathway are edited, and most are heavily edited. We
define ‘heavily edited proteins’ as those for which the
cumulative recoding level, that is the editing level summed over all
recoding sites, exceeds unity. These are marked red, other edited proteins
yellow, and proteins not identifiable in the squid transcriptome are shown
in green. Also see Figure 4—figure
supplement 1. (B) The fraction of nonsynonymous codon
changes as a function of the editing levels, using data from the GFL and OL
tissues combined. The higher the editing level, the higher the fraction of
nonsynonymous codon changes. The fraction expected by chance is shown in
red. A similar relationship is also true for every tissue separately (Figure 4—figure supplement 2A).
Asterisks mark p-value <0.001 estimated using 1000 bootstrap runs.
DOI:
http://dx.doi.org/10.7554/eLife.05198.012
(A) The top-scoring Gene Ontology (GO) terms (rated by false
discovery rate, FDR), enriched in a list of squid ORFs ranked by the
cumulative recoding level, that is the editing level summed over all
recoding sites (Eden et al., 2009).
(B) All of the identifiable squid proteins present in the
KEGG pathway ‘Circadian rhythm’ are edited, and many are
heavily edited. We define ‘heavily edited proteins’ as those
for which the cumulative recoding level exceeds unity (i.e., each copy of
the protein is expected to have at least one modified amino acid, on
average). These are marked red, other edited proteins in magenta, and
proteins not identifiable in the squid transcriptome in green. This figure
was created using the KEGG (Kanehisa and
Goto, 2000) pathway database website (http://www.genome.jp/kegg/pathway.html). Editing levels were
calculated using data from the Giant fiber lobe (GFL) and Optic lobe (OL)
tissues combined.
DOI:
http://dx.doi.org/10.7554/eLife.05198.013
(A) For high editing levels, the fraction of nonsynonymous
codon changes is significantly different from the fraction expected by
chance for all the neuronal tissues examined. The fraction expected by
chance (see ‘Materials and methods’) is shown in red.
(B), The observed distribution of recoding types for sites
with editing levels lower than 10%, between 10–50%, and higher than
50%. Highly edited sites favor the creation of glycine and arginine, mainly
at the expense of lysine, in a statistically significant manner. Expected
values and error bars were calculated by using the mean values and standard
deviation of 100 bootstrap runs, respectively, generated by randomly
modifying adenosine in a way that preserves the editing sequence preference
and the number of events. (C) and (D): Amino acid
targeted by the editing and created by the editing in both squid and
Drosophila (St Laurent
et al., 2013). The most frequent target for removal is lysine, and
glycine and arginine are frequently created due to editing. Editing levels
were calculated using data from the GFL and OL tissues combined.
Abbreviations: Giant fiber lobe (GFL), Optic lobe (OL), Vertical lobe (VL),
Buccal ganglia (BG), and Stellate ganglion (SG).
DOI:
http://dx.doi.org/10.7554/eLife.05198.014
Each squid ORF was aligned against the conserved domains in the Conserved
Domain Database (CDD) (Marchler-Bauer et
al., 2013), and the score for substituting each amino acid by all
other types of amino acids was calculated (Boratyn et al., 2012). The substitution score is a positive or
negative integer, reflecting amino acid substitution which, compared to
chance, occur frequently or infrequently in the alignment of the conserved
domains, respectively. (A) The average editing levels, using
data from the GFL and OL tissues combined, as a function of the amino acid
substitution score. The average editing levels for negative substitution
scores is significantly lower compared to what is expected by chance.
(B) The distribution of the recoding sites as a function of
the amino acid substitution score. Recoding sites tend to avoid large
negative substitution scores compared with random changes. (C)
The average substitution score as a function of the editing levels, using
data from the GFL and OL tissues combined. The higher the editing levels,
the higher the average substitution score, indicating that highly edited
sites are more likely to recode to amino acids that occur frequently in
other species. Expected values and error bars were calculated by using the
mean values and standard deviation of 10,000 bootstrap runs, respectively.
For A and C, the editing levels in all the sites
with the same recoding type were randomly shuffled. For B,
adenosines were randomly modified in a way that preserves the sequence
preference and the total number of editing events. One asterisk mark p-value
<0.05, two mark p-value<1e-4. Abbreviations: Giant fiber lobe
(GFL), Optic lobe (OL).
DOI:
http://dx.doi.org/10.7554/eLife.05198.015
Recoding due to RNA editing affects complete molecular pathways and is
likely to be more advantageous in sites with high editing levels.
(A) All the squid proteins present in the KEGG ‘Synaptic
vesicle cycle’ pathway are edited, and most are heavily edited. We
define ‘heavily edited proteins’ as those for which the
cumulative recoding level, that is the editing level summed over all
recoding sites, exceeds unity. These are marked red, other edited proteins
yellow, and proteins not identifiable in the squid transcriptome are shown
in green. Also see Figure 4—figure
supplement 1. (B) The fraction of nonsynonymous codon
changes as a function of the editing levels, using data from the GFL and OL
tissues combined. The higher the editing level, the higher the fraction of
nonsynonymous codon changes. The fraction expected by chance is shown in
red. A similar relationship is also true for every tissue separately (Figure 4—figure supplement 2A).
Asterisks mark p-value <0.001 estimated using 1000 bootstrap runs.
Figure 4—figure supplement 2.
The fraction of nonsynonymous codon changes as a function of editing
levels and the amino acid modifications due to editing.
(A) For high editing levels, the fraction of nonsynonymous
codon changes is significantly different from the fraction expected by
chance for all the neuronal tissues examined. The fraction expected by
chance (see ‘Materials and methods’) is shown in red.
(B), The observed distribution of recoding types for sites
with editing levels lower than 10%, between 10–50%, and higher than
50%. Highly edited sites favor the creation of glycine and arginine, mainly
at the expense of lysine, in a statistically significant manner. Expected
values and error bars were calculated by using the mean values and standard
deviation of 100 bootstrap runs, respectively, generated by randomly
modifying adenosine in a way that preserves the editing sequence preference
and the number of events. (C) and (D): Amino acid
targeted by the editing and created by the editing in both squid and
Drosophila (St Laurent
et al., 2013). The most frequent target for removal is lysine, and
glycine and arginine are frequently created due to editing. Editing levels
were calculated using data from the GFL and OL tissues combined.
Abbreviations: Giant fiber lobe (GFL), Optic lobe (OL), Vertical lobe (VL),
Buccal ganglia (BG), and Stellate ganglion (SG).
DOI:
http://dx.doi.org/10.7554/eLife.05198.014
DOI:
http://dx.doi.org/10.7554/eLife.05198.012
Recoding events are enriched in genes with neuronal and cytoskeletal
functions and globally affect molecular pathways.
(A) The top-scoring Gene Ontology (GO) terms (rated by false
discovery rate, FDR), enriched in a list of squid ORFs ranked by the
cumulative recoding level, that is the editing level summed over all
recoding sites (Eden et al., 2009).
(B) All of the identifiable squid proteins present in the
KEGG pathway ‘Circadian rhythm’ are edited, and many are
heavily edited. We define ‘heavily edited proteins’ as those
for which the cumulative recoding level exceeds unity (i.e., each copy of
the protein is expected to have at least one modified amino acid, on
average). These are marked red, other edited proteins in magenta, and
proteins not identifiable in the squid transcriptome in green. This figure
was created using the KEGG (Kanehisa and
Goto, 2000) pathway database website (http://www.genome.jp/kegg/pathway.html). Editing levels were
calculated using data from the Giant fiber lobe (GFL) and Optic lobe (OL)
tissues combined.DOI:
http://dx.doi.org/10.7554/eLife.05198.013
The fraction of nonsynonymous codon changes as a function of editing
levels and the amino acid modifications due to editing.
(A) For high editing levels, the fraction of nonsynonymous
codon changes is significantly different from the fraction expected by
chance for all the neuronal tissues examined. The fraction expected by
chance (see ‘Materials and methods’) is shown in red.
(B), The observed distribution of recoding types for sites
with editing levels lower than 10%, between 10–50%, and higher than
50%. Highly edited sites favor the creation of glycine and arginine, mainly
at the expense of lysine, in a statistically significant manner. Expected
values and error bars were calculated by using the mean values and standard
deviation of 100 bootstrap runs, respectively, generated by randomly
modifying adenosine in a way that preserves the editing sequence preference
and the number of events. (C) and (D): Amino acid
targeted by the editing and created by the editing in both squid and
Drosophila (St Laurent
et al., 2013). The most frequent target for removal is lysine, and
glycine and arginine are frequently created due to editing. Editing levels
were calculated using data from the GFL and OL tissues combined.
Abbreviations: Giant fiber lobe (GFL), Optic lobe (OL), Vertical lobe (VL),
Buccal ganglia (BG), and Stellate ganglion (SG).DOI:
http://dx.doi.org/10.7554/eLife.05198.014
Editing tends to avoid potentially deleterious recoding events.
Each squid ORF was aligned against the conserved domains in the Conserved
Domain Database (CDD) (Marchler-Bauer et
al., 2013), and the score for substituting each amino acid by all
other types of amino acids was calculated (Boratyn et al., 2012). The substitution score is a positive or
negative integer, reflecting amino acid substitution which, compared to
chance, occur frequently or infrequently in the alignment of the conserved
domains, respectively. (A) The average editing levels, using
data from the GFL and OL tissues combined, as a function of the amino acid
substitution score. The average editing levels for negative substitution
scores is significantly lower compared to what is expected by chance.
(B) The distribution of the recoding sites as a function of
the amino acid substitution score. Recoding sites tend to avoid large
negative substitution scores compared with random changes. (C)
The average substitution score as a function of the editing levels, using
data from the GFL and OL tissues combined. The higher the editing levels,
the higher the average substitution score, indicating that highly edited
sites are more likely to recode to amino acids that occur frequently in
other species. Expected values and error bars were calculated by using the
mean values and standard deviation of 10,000 bootstrap runs, respectively.
For A and C, the editing levels in all the sites
with the same recoding type were randomly shuffled. For B,
adenosines were randomly modified in a way that preserves the sequence
preference and the total number of editing events. One asterisk mark p-value
<0.05, two mark p-value<1e-4. Abbreviations: Giant fiber lobe
(GFL), Optic lobe (OL).DOI:
http://dx.doi.org/10.7554/eLife.05198.015Recently, it was suggested that RNA editing is generally not advantageous in humans
(Xu and Zhang, 2014), as nonsynonymous
events are less frequent than expected by chance (Xu
and Zhang, 2014). Strikingly, for sites with high editing levels in squid, the
opposite is true (Figure 4B and Figure 4—figure supplement 2A). Recoding
events favor creation of glycine and arginine, mainly at the expense of lysine (Figure 4—figure supplement 2B–D).
Moreover, highly edited sites within conserved domains tend to recode to amino acids
that occur frequently in other species at the same position (Figure 4—figure supplement 3), suggesting selection
towards functional substitutions and against deleterious ones.
Figure 4—figure supplement 3.
Editing tends to avoid potentially deleterious recoding events.
Each squid ORF was aligned against the conserved domains in the Conserved
Domain Database (CDD) (Marchler-Bauer et
al., 2013), and the score for substituting each amino acid by all
other types of amino acids was calculated (Boratyn et al., 2012). The substitution score is a positive or
negative integer, reflecting amino acid substitution which, compared to
chance, occur frequently or infrequently in the alignment of the conserved
domains, respectively. (A) The average editing levels, using
data from the GFL and OL tissues combined, as a function of the amino acid
substitution score. The average editing levels for negative substitution
scores is significantly lower compared to what is expected by chance.
(B) The distribution of the recoding sites as a function of
the amino acid substitution score. Recoding sites tend to avoid large
negative substitution scores compared with random changes. (C)
The average substitution score as a function of the editing levels, using
data from the GFL and OL tissues combined. The higher the editing levels,
the higher the average substitution score, indicating that highly edited
sites are more likely to recode to amino acids that occur frequently in
other species. Expected values and error bars were calculated by using the
mean values and standard deviation of 10,000 bootstrap runs, respectively.
For A and C, the editing levels in all the sites
with the same recoding type were randomly shuffled. For B,
adenosines were randomly modified in a way that preserves the sequence
preference and the total number of editing events. One asterisk mark p-value
<0.05, two mark p-value<1e-4. Abbreviations: Giant fiber lobe
(GFL), Optic lobe (OL).
DOI:
http://dx.doi.org/10.7554/eLife.05198.015
The squid giant axon has been one of the most important models for neurophysiology.
Studies using this preparation serve as a foundation for our current understanding of
excitability, ion homeostasis, and axonal transport. Accordingly, we examined the extent
to which RNA editing might affect these processes (Supplementary file 1J). In
total, 87 GFL ORFs that encode voltage and neurotransmitter gated ion channels, ion
transporters, synaptic release machinery and molecular motors were identified in our
transcriptome. In agreement with the overall editing frequency in squid, 70% harbored
editing sites. Unexpectedly, however, 54% were heavily edited, many more than the 24%
expected. Thus even in a background of hyper RNA editing, squid, like other organisms,
preferentially edit nervous system targets.These data suggest that editing in squid has fundamentally different underpinnings and
consequences. Have squid ADARs evolved novel structures that account for the high-level
editing? A past study showed that a squid ADAR2 ortholog can be expressed as two
isoforms due to alternative splicing (Palavicini et
al., 2009): one, having two double-stranded RNA (dsRNA) binding motifs,
resembles vertebrate ADAR2s. A second, however, contains an ‘extra’ dsRNA
binding motif at its N-terminus. This non-canonical isoform edits RNA more efficiently,
edits more sites, and binds dsRNA with a far higher affinity (Palavicini et al., 2009, 2012). Further Squid ADAR2 messages themselves contain many editing sites,
leading to many subtly different isoforms. An ADAR1 isoform is also present in our
transcriptome and is the focus of an ongoing study. An equally intriguing question is
why squid edit to this extent? The process clearly creates tremendous protein diversity,
and this may in part explain the behavioral sophistication of these complex
invertebrates. A recent study showed that editing can be used for temperature adaptation
in octopus (Garrett and Rosenthal, 2012b) and
this makes sense based on the codon changes that it catalyzes (Garrett and Rosenthal, 2012a) (Figure 4—figure supplement 2C–D). In
Drosophila, editing can respond to acute temperature changes (Savva et al., 2012). The large number of sites in
squid suggests that editing is well positioned to respond to environmental variation.
Most model organisms studied so far are mammals which are homeotherms. Future studies of
more diverse species are needed to reveal the extent to which cold-blooded organisms
might utilize extensive editing to respond to temperature changes and other
environmental variables.
Materials and methods
Specimen collection and dissection
Specimens of the squid Doryteuthis pealeii were collected by trawl
in the Vineyard Sound by the animal collection department of the Marine Biological
Laboratory in Woods Hole, Massachusetts during the month of July. The giant fiber
lobe (GFL) tissue of the Stellate ganglion, the optic lobe (OL) tissue and a portion
of the sperm sack were manually dissected from a single adult male immersed in
chilled, filtered seawater. The Buccal ganglia (BG), Stellate ganglion with the giant
fiber lobe removed (SG) and Vertical lobe (VL) tissues were dissected from a second
adult male individual. Tissues were also dissected from non-neuronal regions: the
branchial heart, the Gill, the ventral epithelial layer on the pen, the marginal
epithelial layer on the pen, the iridophore layer of the skin, and the chromatophore
layer of the skin. Each of these six tissues originated from a different animal. RNA
from all tissues was extracted with the RNAqueous kit (Life Technologies, Carlsbad,
CA), and genomic DNA was extracted from the sperm sack using Genomic Tip Columns
(Qiagen, Venlo, Limburg, The Netherlands).
Library preparation for sequencing
The genomic DNA sequencing library was prepared using the TruSeq DNA Sample Prep kit,
as described by the manufacturer (Illumina, San Diego, CA), and sequenced using three
lanes of Illumina HiSeq 2000 instrument. The RNA-Seq libraries for all the samples
were prepared using the TruSeq Stranded mRNA Sample Prep Kit, as described by the
manufacturer (Illumina), and were sequenced using Illumina HiSeq 2000 instrument. The
GFL and OL libraries were sequenced together on a single lane, the same for the VL
and BG libraries (with one unrelated library) and the SG library was sequenced on a
single lane (with one unrelated library).
Detection of editing sites in the squid nervous system
Illumina sequencing was utilized to generate ∼87 million paired-end, 100 nt
reads, using RNA from GFL tissue, the same number of reads using RNA from OL tissue,
and ∼440 million paired-end, 100 nt reads, using germline DNA. The Trinity
transcript assembly tool (Grabherr et al.,
2011) was used with the default parameters (except for
‘min_kmer_cov’ which was set to 2 instead of 1) to construct the genes
sequences from the GFL and the OL sequencing data combined; giving 99,226 putative
gene-fragments (termed ‘components’). Most of the generated components
were short (median of 379 bases). As Trinity attempts to identify the different
transcripts (isoforms) of the components, 14,643 components were represented by more
than one sequence (putative transcript-fragment). For these components, the longest
sequence was selected and further used. The average length of the representing
sequences for all components was 589 bases, bringing the total size of the recovered
squid transcriptome to about 58 million bases.DNA reads and RNA reads were separately aligned against the RNA components using
Bowtie2 with local alignment configuration and default parameters (Langmead and Salzberg, 2012). Only uniquely
aligned reads were used (taking only reads with the maximal ‘mapping
quality’). As most of the components were short, we didn't demand that both
mate pairs be aligned to the same component. Instead, each read was separately
analyzed. Overall, 77% (∼268 million out of ∼350 million) of the GFL
and OL RNA reads combined were uniquely aligned to the components. However, as
expected, only 3.5% of the DNA reads (∼30 million out of ∼880 million)
were uniquely aligned to the components.To focus on editing sites inside coding regions, and avoid repetitive elements that
are prone to assembly and alignment errors, we retained only those components that
were found to be significant similar (Blastx E-value<1e-6) to the Swiss-Prot
proteins dataset (UniProt Consortium, 2014).
In these components, the alignment to the homolog Swiss-Prot protein covered most of
the squid sequence (63% on average). The detected ORFs were extended in both
directions until a stop codon or the end of a component was reached. Overall, 12,039
ORFs, with average length of 1370 bases, were detected. These ORFs represent 8276
Swiss-Prot proteins with distinct names (two or more different ORFs may be two squid
paralogs of the same Swiss-Prot protein, or fragments of the same squid gene aligned
to different regions of the Swiss-Prot protein). The total size of the detected
coding sequences in the squid was ∼16 million bases, 28% of the total
transcriptome length recovered using Trinity (the constructed squid coding sequences
are given in a text file in a fasta format, Alon et
al. (2015), available via Dryad digital repository). The coverage of these
ORFs was high; in the GFL and OL data combined, ∼199 million RNA reads were
aligned to these Swiss-Prot ORFs, with an average RNA coverage of 1206×. The
average DNA coverage was 106× as ∼21 million DNA reads, with average
length of 83 bases were aligned to the coding sequences.The above alignment data and the ORFs information were used to find the locations of
all DNA-RNA mismatches inside the coding regions. In the following, bases called with
quality score Q < 30 were discarded. Note, however, that Trinity consensus
sequence does take into account these bases, as well as reads that might have not
been uniquely aligned to the transcriptome. It is often customary to filter out
reads' ends when analyzing RNA-DNA mismatches. A main reason for that is the common
mismatches at reads' ends due to alignment artifacts when a splicing junction occurs
near the ends. In our case, as alignment is done to the transcriptome, we did not
observe any increase in AG mismatch rate near reads' ends (Figure 2—figure supplement 4), and thus no such filter
was used. Two procedures were used to detect editing sites: (A)
‘weak’ editing sites procedure: a binomial test was applied to find the
significant modifications between RNA reads and the Swiss-Prot ORFs. The binomial
statistics uses the number of successes (the number of reads with a mismatch of a
given type in a given position), the number of trials (the total number of RNA reads
aligned to the given position) and the error probability. The probability of having a
sequencing error (the error probability) was estimated using the sequence quality
score. We counted only mismatches with >=30 quality score, and therefore
the expected error probability was set to 0.1%. The binomial test was applied to
every position inside the Swiss-Prot ORFs. The p-value for each location was
corrected for multiple testing using a Benjamini-Hochberg false-discovery rate of
10%. Furthermore, in order to exclude RNA variability due to genomic polymorphisms,
we filtered out all modification sites in which any of the DNA reads aligned to the
site does not agree with the transcriptome. This procedure assumes the RNA consensus
in the site is identical to the gDNA reads, and thus is not suited to detect sites in
which the editing appear in most of the RNA reads and therefore also in the
Trinity-generated ORFs (‘strong’ editing sites). (B)
‘strong’ editing sites procedure: the locations in which all DNA reads
showed a different base than the ORF. The probability of such sites to be not a
result of editing but rather a single nucleotide polymorphism (SNP) was estimated by
(1/2)(#DNA reads) (no allele-specific expression) multiplied by the
prior probability of a SNP which was taken to be 0.001. Here too, the p-value for
each location was corrected for multiple testing using a Benjamini-Hochberg
false-discovery rate of 10%.Overall, 81,930 weak and 5649 strong A-to-G modification sites were detected in 7776
ORFs, and only 12,403 weak and 254 strong non A-to-G modifications sites (the
modification sites and their number of supporting reads in all the tissues studied
are tabulated in Alon et al. (2015) available
via Dryad digital repository). Interestingly, 2905 of the A-to-G modification sites
reside in 268 out of the 475 ORFs with only non-metazoans homologs. The number of
weak A-to-G sites detected (but not strong ones) is likely to increase with RNA
coverage, as we demonstrated by sampling parts of the sequencing data and
re-calculating the number of A-to-G sites (Figure
2—figure supplement 5A). Another way to look at the dependence
between the RNA coverage and the detected number of weak A-to-G sites is by recording
the number of A-to-G sites detected in each ORF as a function of the ORF RNA
coverage. The sorted RNA coverage was divided into ten equal bins and the mean number
of A-to-G sites in each bin was calculated. As expected, higher RNA coverage is
correlated with high number of A-to-G sites (Figure
2—figure supplement 5B). As with the RNA reads, the dependence of
the detection procedures on the DNA reads was examined by sampling parts of the DNA
sequencing data. Increasing the DNA coverage increased the precision of the
‘weak’ site detection procedure (as more SNPs are removed from the
observed modifications) and strongly increased the number of detected
‘strong’ sites (Figure
2—figure supplement 5C).We did not apply additional read-number filters, as these seem to only marginally
increase accuracy while reducing the number of detected sites. However, we did
exclude five strong sites for which there were no uniquely-aligned, high-quality,
supporting RNA reads. This brings the number of strong A-to-G and non A-to-G
modifications to 5644 and 219, respectively. The full list of weak and strong sites
(Alon et al. (2015) available via Dryad
data repository) provides the number of DNA and RNA reads per site.We examined clusters of mismatches of the same type (several consecutive identical
mismatches), revealing a high number of ORFs with A-to-G mismatch clusters (Figure 2—figure supplement 1A). For
example, examining only ORFs with three (and above) consecutive identical mismatches,
45,199 A-to-G sites in 4265 ORFs were detected, and only 470 non A-to-G sites.
Interestingly, the clusters of A-to-G sites also appear at the level of the
individual reads; for example, examining only reads with four and above consecutive
identical mismatches, 85% of the reads contained A-to-G modifications (Figure 2—figure supplement 1B). This
data implies that single RNA molecules contain A-to-G clusters. We note that similar
results were obtained using an algorithm which analyzes only reads with several
consecutive mismatches, including reads that cannot be mapped by standard alignment
tools (Porath et al., 2014). Using this
algorithm 23,737 A-to-G modification sites were found in the squid coding regions,
compared to only 2933 non A-to-G modifications. Importantly, only 728 A-to-G
modifications were found in the coding regions of human using this algorithm (Porath et al., 2014), even though a much larger
RNA-seq dataset was used (∼5 billion reads of Illumina Human BodyMap 2.0),
thus supporting the dramatic difference in editing between squid and other
organisms.The sequence surrounding the A-to-G modifications sites is similar for
‘weak’ sites and ‘strong’ sites, and both differ from
what is expected by chance (Figure 2—figure
supplement 1C). As expected for bona fide A-to-I editing
sites (Kleinberger and Eisenberg, 2010), G
is significantly underrepresented in the nucleotide before the editing site (6922 out
of 87,574 sites detected) and over represented in the nucleotide that follows the
editing site (38,564 out of 87,574 sites). A and T are over represented in the
nucleotide before the editing site (42,475 and 25,061, respectively, out of 87,574
sites).
Characterization of the modifications sites detected
We characterized the differences between the A-to-G modifications and all the other
types of modifications (non A-to-G). One possible source of apparent modifications is
sequencing errors. We thus compared the quality score between A-to-G and non A-to-G
modification sites. However, no significant difference in the distribution of the
quality scores was observed (Figure 2—figure
supplement 4A). Another technical explanation for the non A-to-G
modifications can be read-end artifacts: the ends of the reads are more likely to be
misaligned due to splicing, or to contain errors generated in the RNA sequencing
protocol (Ramaswami et al., 2012). Indeed,
non A-to-G modifications (unlike A-to-G ones) tend to be located in the reads-end
(Figure 2—figure supplement 4B),
suggesting a larger fraction of these sites is likely to be a result of technical
artifacts. In addition, we studied the modification level distribution (the fraction
of cDNA reads exhibiting the modification in a given position) for both types of
modifications (Figure 2—figure supplement
4C). A higher fraction of non A-to-G sites show ∼50% modification
levels (compared to the A-to-G sites), indicating a higher fraction of missed genomic
polymorphisms. Consistently, 50% of the sites with non A-to-G modification levels
between 40–60% or between 90–100% recur in both the GFL and the OL
tissue (coming from the same individual animal), compared to only 21% of the A-to-G
modifications in the same ranges.To find statistically significant differences in the modification levels between the
GFL and the OL tissues, a binomial analysis was performed with Bonferroni-corrected
p-value of 0.05 as a cutoff. Overall, 19% (16,425 out of 87,574) of the detected
A-to-G modifications differ significantly between the GFL and OL tissues. Most of
these sites (84%, 13,731 out of 16,425) have higher modification levels in the GFL
tissue. In contrast, only 6% (783 out of 12,614) of the non A-to-G modifications are
significantly different for the same two tissues, with no clear tissue preference:
45% and 55% of these sites have higher modification levels in the GFL and the OL
tissue, respectively. As the same technical artifacts and genomic polymorphisms are
expected to replicate in both tissues, these data increase our confidence that most
of the A-to-G sites are bona fide editing sites. To further
characterize the tissue-dependence of modifications levels, Illumina sequencing was
again utilized to generate ∼54, ∼62 and ∼42 million paired-end,
100 nt reads, using RNA from Buccal ganglia (BG), Stellate ganglion with the giant
fiber lobe removed (SG), and Vertical lobe (VL), respectively, collected from a
different animal. The same alignment procedure was applied to quantify the
modification level at the previously described sites for each of the additional
samples. Statistically significant differences in the modification levels between all
the five neuronal tissues were detected using a binomial analysis with
Bonferroni-corrected p-value of 0.05 as a cutoff. The sites with significantly
variable A-to-G modification levels were subjected for hierarchical clustering,
revealing clear tissue selectivity with higher modification levels in the GFL tissue
and low levels in the VL tissue (Figure
2—figure supplement 2A). The same analysis for the non A-to-G
modification levels demonstrates consistency between the individual animals as the
modification levels in the GFL and OL tissues form one cluster, and the VL, BG and SG
tissues form a second cluster, again suggesting that many of the non A-to-G
modifications are due to genomic polymorphisms (Figure 2—figure supplement 2B).To characterize the modification levels in non-neuronal tissues, Illumina sequencing
was again utilized to generate ∼23, ∼23, ∼19, ∼26,
∼19 and ∼14 million paired-end, 150 nt reads, using RNA from the
branchial heart, the Gill, the ventral epithelial layer on the pen, the marginal
epithelial layer on the pen, the iridophore layer of the skin, and the chromatophore
layer of the skin. The same alignment procedure was applied to quantify the
modification level at the previously described sites for each of the additional
samples, revealing considerably lower editing levels in the non-neuronal tissues
(Alon et al. (2015), available via Dryad
data repository).
A-to-I editing verified by Sanger sequencing and deep sequencing
Direct validation of editing was performed on a subset of the detected A-to-G sites
using Sanger sequencing and deep sequencing. To reduce the chance for RNA
contamination in the DNA or vice versa, primers were designed to differentially
amplify the gDNA, by residing in introns, or cDNA, by spanning exon–exon
boundaries. To identify intronic sequence and exon–exon junctions the
following steps were performed: (a) we recorded all the cases in which the beginning
or the end of the DNA read was trimmed during the local alignment against the Trinity
sequences. (b) If the flanking region could be aligned against the Trinity sequences,
even if aligned separately without the rest of the read, it was discarded. (c) If at
least three DNA reads showed the same flanking sequence starting from the same
position inside the ORFs, it was considered to be an intron fragment. This procedure
revealed the positions of part of the exon–exon junctions and part of the
intron sequences that correspond to the junctions. Overall, this procedure resulted
in about 2100 regions, containing ∼5% of all the detected editing sites, in
which the gDNA and the cDNA could potentially be differentially amplified.For the Sanger sequencing, primers were designed to differentially amplify the gDNA
and cDNA of 19 ORFs (Supplementary file 1C). To allow better detection of editing, all the
sites tested using Sanger sequencing were chosen to have >=20% modification
levels (in our original GFL analysis). For the Sanger validation experiment the GFL
tissue from a single animal was used (different from the animals used for the HiSeq
sequencing experiments). All the sites tested (40 out of 40) were validated using
Sanger sequencing (Figure 2—figure
supplement 3 and Supplementary file 1D).For the deep sequencing validations, three groups of targets were tested (Supplementary file 1E,F):
(a) twenty ‘interesting’ genes, that is, squid components with homology
to genes known to be implicated in human diseases or in other important pathways,
selected from the dataset of ∼2100 regions described above. (b) 20 regions
randomly-selected from the above dataset of ∼2100 regions. (c) 20 regions
randomly-selected from the set of putatively edited regions for which we could not
design unique gDNA primers, and thus gDNA and the cDNA could not be differentially
amplified. For this group the same primers were used to amplify the gDNA and the
cDNA. As with the Sanger validation experiment, the deep sequencing validations were
done using GFL tissue from a single animal (different from the animals used for the
HiSeq and Sanger sequencing experiments). All primer sets were designed with an
overhang so that sequencing and indexing primers could be added to the amplicons in a
nested PCR reaction. Samples from gDNA and cDNA were distinguished by unique
sequencing indexes. After nested PCR, amplicons were pooled, purified by Ampure XP
(Beckman Coulter, Danvers, MA), and sequenced on an Illumina MiSeq instrument. All
the cDNA targets were amplified, but only 49 of the 60 gDNA targets amplified well
enough for analysis (Supplementary file 1E,F). For the other eleven targets, the presence of
undetected introns could have disrupted amplification. The DNA and RNA reads were
analyzed using the same detection procedure described above (Figure 1B), with the sole exception that sequence variations
below 0.1% (the expected sequencing error rate) were allowed in the DNA reads (as
mandated by the much larger DNA coverage in this validation study). Altogether, 84%
(120 out of 143) of the A-to-G sites examined were validated (Supplementary file 1F,G).
In contrast, none of the 12 non A-to-G sites examined were validated. Moreover, 170
additional A-to-G sites (86% of all novel detected sites, a similar fraction to the
HiSeq data used in the original detection, Figure
2A) were identified, more than doubling the number of A-to-G sites
detected. Similar results were observed for the three groups examined. Finally, for
the validated A-to-G sites, high correlation was obtained between the editing levels
(the fraction of cDNA reads exhibiting the editing in a given position) measured
using the HiSeq data and the MiSeq data (Pearson's r of 0.86, p-value = 1e-35)
(Supplementary file
1G).
The effect of the A-to-I editing on the proteome
To demonstrate the extent of massive recoding on two examples, homology-modelling of
the squid proteins α Spectrin and Piccolo was performed with SWISS-MODEL using
default parameters (Biasini et al., 2014) and
visualized using UCSF Chimera package http://www.cgl.ucsf.edu/chimera (Figure 3B and Figure 3—figure
supplement 1).To find if the recoding events are enriched in genes with specific functions, we have
calculated the cumulative recoding level, that is, the editing level summed over all
recoding sites within each squid ORF. This gives a single score representing the
extent of recoding in the whole protein. The squid ORFs list was ranked using the
cumulative recoding level and all the Swiss-Prot annotations were converted to human
Swiss-Prot annotations (when possible) for consistency. The GO analysis tool GOrilla
was used to find enriched GO annotations in the ranked list (Eden et al., 2009). In order to control for possible detection
bias in highly expressed genes, the same list was ranked using the gene expression
level (FPKM) and was also analyzed using GOrilla. As expected, the enriched GO
annotations in the control list (that is, genes ranked by expression levels) were
mainly connected to the ribosome (translational elongation, structural constituent of
ribosome, RNA binding and so on). In contrast, the list ranked by the cumulative
recoding level gave enriched GO annotations which are mainly connected to neuronal
and cytoskeleton functions (Figure 4—figure
supplement 1A and Supplementary file 1H). Trying to detect enriched GO annotations at the
bottom of the list (ranked by the cumulative recoding level) did not produce any
significant results.The extensive recoding due to RNA editing can affect many molecular pathways. In
order to appreciate the extent of this phenomenon, squid ORFs were mapped to all
human KEGG pathways (Kanehisa and Goto,
2000), revealing that RNA editing has a global effect on the majority of the
squid pathways (Supplementary
file 1I). In this analysis, the homologs to the human proteins can be: (a)
‘heavily edited’, defined as proteins for which the cumulative recoding
level, that is the editing level summed over all recoding sites, exceeds unity, (b)
‘edited’, if the protein has at least one recoding site, (c) not
edited, or (d) not identifiable in the squid transcriptome. Editing levels were
calculated using data from the GFL and OL tissues combined. On average, 74% and 22%
of the identifiable proteins in each pathway are edited or heavily edited,
respectively. Pathways related to the nervous system are even more extensively
edited: 75% and 35% of the identifiable proteins in these pathways are edited and
heavily edited, respectively. On the flip side, in the pathways
‘Ribosome’ and ‘RNA polymerase’ only 50% and 33% of the
identifiable proteins are edited and 0% and 4% of the identifiable proteins are
heavily edited, respectively (Supplementary file 1I), demonstrating that some crucial pathways are
protected from editing. We have also specifically examined the extent of recoding in
squid ORFs that encode voltage and neurotransmitter gated ion channels, ion
transporters, synaptic release machinery and molecular motors. Overall, we examined
87 ORFs that are homologous to the following proteins: Voltage-gated potassium
channel alpha subunit, Voltage-dependent sodium channel alpha subunit,
Voltage-dependent calcium channels, Ionotropic glutamate receptors, Synaptotagmin,
Synaptobrevin, Syntaxin, Synapsin, Snap 25, Sodium/potassium-transporting ATPase
alpha subunit, Sodium/calcium exchanger (SLC8), Sodium bicarbonate exchanger (SLC4),
Sodium/hydrogen exchanger (SLC9), Sodium/potassium/calcium exchanger (SLC24), Dynein,
and Kinesin (Supplementary
file 1J). In the GFL tissue, 47 out of the 87 proteins (54%) are heavily
edited, more than twofold higher than expected by chance (p-value<1e-6). In
fact, in all the examined neuronal tissues, this group of 87 proteins is heavily
edited, significantly higher than expected by chance (Supplementary file
1J).An important question is whether editing in the squid ORFs tends to avoid recoding by
preferring synonymous modifications or, alternatively, tends to create more recoding
sites than expected by chance. In squid, the expected fraction of nonsynonymous
changes is 0.65, estimated by random changes of adenosines in the ORFs, accounting
for the observed local sequence preference of the editing sites (Figure 2—figure supplement 1C), as follows: (a) we
examined the sequence surrounding sites detected as edited (the base preceding and
the following the site), and counted the number of times each one of the 16 possible
combinations of upstream and downstream nucleotide appears. These numbers were
normalized by the number of times each combination appears for all the A bases in all
the ORFs, to produce the observed probability of being targeted by editing given a
certain combination of 5′ and 3′ nucleotides. (b) all the locations
inside the ORFs were screened in a random order until an A base was encountered. (c)
a random number was generated, if it was below the normalized probability
corresponding to the sequence surrounding the site in question, this site was
selected for further use. (d) the potential effect on the codon due to changing the A
base to G was examined, in particular, whether the change is nonsynonymous or
synonymous. (e) steps (b–d) were repeated until the number of A bases changed
matched the observed number of editing sites. The described randomization procedure
was also used to calculate the expected codon changes due to editing (Figure 4—figure supplement 2B). We found
that the higher the editing levels, the higher the fraction of nonsynonymous changes,
and for editing levels >20% the nonsynonymous fraction is significantly higher
than the expected fraction of 0.65 (Figure 4B
and Figure 4—figure supplement
2A).A recoding event which creates an amino acid rarely found in homologous proteins may
indicate that the editing is deleterious. Therefore, each squid ORF was aligned
against the conserved domains in the Conserved Domain Database (CDD) using
DELTA-BLAST (Boratyn et al., 2012; Marchler-Bauer et al., 2013). DELTA-BLAST
calculates the position-specific score matrices (PSSM) for each possible amino acid
substitution. Positive scores indicate that a certain amino acid substitution occurs
more frequently in the alignment against the conserved domains than expected by
chance, while negative scores indicate that the substitution occurs less frequently
than expected. The substitution score of each editing event was recorded as well as
the editing level in the same position, using data from the GFL and OL tissues
combined. As a control, the editing levels in all the sites with the same recoding
type were randomly shuffled. This accounts for the fact that editing creates specific
amino acid changes (Figure 4—figure
supplement 2B–D), and in turn, these specific changes may be
correlated with average editing levels and with substitution scores. Thus, the
shuffled dataset preserves both the distribution of recoding types and the
distribution of editing levels for each recoding type. This analysis revealed that
the average editing levels in sites with large negative substitution scores, which
may be deleterious, are significantly lower than what is expected by chance (Figure 4—figure supplement 3A). The
distribution of recoding sites for each substitution score was calculated and
compared with random changes that preserve the sequence preference and the total
number of editing events (Figure 2—figure
supplement 1C). Consistent with the above finding, there are significantly
less recoding sites with large negative substitution score than expected by chance
(Figure 4—figure supplement 3B).
Finally, on average, the higher the editing level in a given site, the higher the
average substitution score, above what is expected by chance (Figure 4—figure supplement 3C). Thus, this analysis
indicates that recoding by editing tends to avoid potentially deleterious sites and
that editing sites with high editing levels might be more important functionally than
editing sites with low editing levels.
Applying our detection procedure for human and rhesus macaque sequencing
data
We searched publicly available datasets and found two datasets of matched DNA- and
RNA-seq data which are comparable in size and read-lengths to our squid data: (i)
Human RNA-seq and DNA-seq data of blood samples (Chen et al., 2012) (ii) Macaque RNA-seq and DNA-seq brain data (Chen et al., 2014). The same pipeline as
described above was applied to the data, with one single exception: while comparing
the transcriptome model with SwissProt we used only non-vertebrate SwissProt
sequences, in order to mimic the situation for squid in which there are no SwissProt
entries from closely related species. The analysis of the human and macaque data was
done for two purposes: (A) to show that the enormous number of AG
mismatches in the squid data is real and not an artefact of the analysis pipeline,
and (B) to demonstrate that our pipeline could identify RNA editing
sites established by previous studies. The results of the analysis are summarized in
Figure 2A and Supplementary file 1A. For
non-AG mismatches, we obtained roughly the same numbers as those for squid. However,
as expected, the AG mismatches in the human and macaque samples did not show any
enrichment, and their abundance (in coding regions) was similar to those for other
mismatches. In other words, the sensitivity of our method allows the detection of the
super-strong editing signal of squid, but cannot separate the rare editing sites in
mammals (in coding regions) from noise. Note that the task of detecting recoding
sites in mammals is highly non-trivial even when one takes advantage of all
information available, including the accurate genome sequence. The best efforts, so
far, have yielded enrichment of AG mismatches, but still most detected sites are
non-AG (i.e., most-probably, false-positives) (Ramaswami et al., 2013).In order to assess the precision of our method in recalling true editing sites we
have looked at the validated editing sites found in the two studies above, and
checked whether they have been picked up by our algorithm as well. The full data is
presented in Supplementary
file 1B. Overall, about half of the sites were picked up by our pipeline.
The sites that were not identified, mostly resided in regions poorly described by our
Trinity transcriptome assembly (some were just very weakly edited). Thus, one may
conclude that the recall rate of our method is lower than 0.5, and the true extent of
squid recoding is even much larger than we report.
Data Deposit
The data reported in this paper was deposited to the Sequence Read Archive (SRA),
under accession SRP044717. The constructed squid coding sequences and all the A-to-G
modification sites detected in the coding regions of the squid are available via
Dryad digital repository (Alon et al., 2015;
http://dx.doi.org/10.5061/dryad.2hv7d).eLife posts the editorial decision letter and author response on a selection of the
published articles (subject to the approval of the authors). An edited version of the
letter sent to the authors after peer review is shown, indicating the substantive
concerns or comments; minor concerns are not usually shown. Reviewers have the
opportunity to discuss the decision before the letter is sent (see review
process). Similarly, the author response typically shows only responses
to the major concerns raised by the reviewers.Thank you for sending your work entitled “The majority of transcripts in the
squid nervous system are extensively recoded by A-to-I RNA editing” for
consideration at eLife. Your article has been favorably evaluated by
Chris Ponting (Senior editor) and 2 reviewers, one of whom is a member of our Board of
Reviewing Editors.The Reviewing editor and the other reviewer discussed their comments before we reached
this decision, and the Reviewing editor has assembled the following comments to help you
prepare a revised submission.The paper by Alon et al. is a well-performed, concise study that shows extensive RNA
editing in the squid genome. Since the extent of editing is several orders of magnitude
larger than that reported in the human genome, and actually in any other metazoan
species, the results are obviously of biological relevance, and therefore appropriate
for eLife. The manuscript also describes a novel bioinformatic method
to identify editing sites. Overall, the manuscript is well written and both Methods and
Results are clearly explained.Despite being unexpected, the results appear to be quite robust: the control in primates
shows that they are not an artifact from novel pipeline employed to predict editing
sites in absence of assembled genome sequences. The strong bias to AG mismatches, the
clustering pattern in genes, the neighbor preferences, the tissue specificity
(irrespective of biological origin), the resulting recoding events towards the common
amino-acid, etc., all these strongly support the editing sites found by the authors.1) The de novo transcriptome assembly is a very trivial computational issue. Many false
positives are expected at least in complex mammalian transcriptomes. Paralogs could
affect the reconstruction of real isoforms leading to a sort of chimeric transcripts. In
addition, alternative splicing may complicate transcript reconstruction. Are there
estimations about the impact of alternative splicing and paralogs in squid? Any impact
of this on the results should be discussed in the text. Also, the text should clarify
that this is not a completely de novo method since genomic sequences are generated.2) The strategy is biased towards the RNA editing prediction in protein coding regions
(CDS). Can RNA editing events be detected also in non-CDS regions by the method? If not,
this should be clarified in the text. Related to this, evidence of RNA editing in
repetitive regions in squid could potentially be interesting, probably revealing an
opposite trend than mammals.3) Regarding methodology, can the statistical binomial test detect any significant
change in the non-AG positions? If yes, how do you explain this finding?The average RNA and DNA coverage is high but regarding RNA editing candidates, are there
filters to exclude low covered sites? What is the minimal coverage for RNA and DNA?Did you apply filters to RNA and DNA reads? I mean reads with low quality and positions
at read ends.4) Have the authors considered the possibility that their results arise from somatic
genomic editing, rather than RNA editing? While for the human and macaque control, the
RNA and DNA samples are from the same tissues, in the case of squid, RNA samples are
from the tissues from the nervous system, while DNA is from the sperm sack. To
unequivocally conclude that the observations are indeed from RNA editing, I guess that
DNA and RNA need to be from the same biological source. Maybe the investigation of the
distribution of the relative proportion of reads supporting and not supporting the edit
could help here.5) Related to the above, the authors used RNA only from tissues from the nervous system.
Therefore, it is not possible to assess whether the phenomenon observed is
characteristic of this system, or it is actually systemic in the entire organism. I
think that sequencing RNA from some other non-nervous tissue could help to distinguish
between the two hypotheses.6) Regarding the characterization of RNA editing events, events tend to be tissue
specific. Are there events showing tissue specific levels? That is, cases in which the
gene locus in expressed at the same level in all tissues but editing levels are
different.7) It is a little bit disappointing that there is limited investigation in the potential
mechanisms behind the extensive editing observed. The authors could have at least
investigated ADAR with some additional detail. The RNA (and DNA sequence) helps to
delineate the ADAR sequence, and the RNA reads to estimate expression levels. Are there
multiple copies of ADAR in the squid genome? Is ADAR expressed at comparatively higher
expression levels than in organism with low editing levels (they can use the mouse and
human samples to make this comparison? Has the ADAR sequence in squid diverged faster
than expected? In specific domains? All these questions are quite simple to answer.8) The authors also provide an adaptive explanation to the high levels of editing
observed in the squid genome, and hypothesize that, in contrast to current assumptions,
that extensive editing is common as a way to cope with temperature adaption, except in
mammals that, as homeotherms, would not require such a process. This is, by the way,
reminiscent of the isochore theory by Bernardi that would separate homeotherm
vertebrates from “cold-blooded” (poikilotherm) vertebrates (to which, by
the way, the authors may want to cite). If the authors were correct that would indeed be
a quite relevant result. They could easily employ their pipeline in available vertebrate
RNAseq data (for instance, http://www.sciencemag.org/content/338/6114/1587.full) to test this
hypothesis.Before we address your report point-by-point, we would like to reiterate an important
issue that is touched upon in many of your queries: our editing detection pipeline,
relying on a de novo transcriptome assembly, is inferior to established methods which
align RNA reads to a genomic reference. All tools for de novo transcript assembly have
limited accuracy, particularly when they try to predict close paralogs and splice
variants from short cDNA reads. These shortcomings may lead to false-positives in our
pipeline, which add to errors caused by the more familiar somatic mutations, SNPs, and
alignment errors that are common to all editing detection schemes. As is the case for
all editing detection pipelines, the challenge is to minimize noise level to the extent
that true editing sites are not buried in the noise. Here, we are able to apply our
pipeline to squid despite the above described errors because the editing signal is
enormous.We are confident that the vast majority of A-to-G discrepancies that we have identified
reflect true RNA editing events for many reasons. First, all of the sources of error
outlined above are not expected to be biased towards A-to-G mismatches. Even if they
were, they should not be biased towards the expressed strand of the DNA (i.e. one should
expect equal numbers of A-to-G and T-to-C mismatches on the expressed strand if they
were true errors). Accordingly, we conclude that the level of non-AG mismatches
detected, which we used to estimate the false-discovery rate, already includes the sum
of all of the above types of error (and any additional ones of unknown source). The
dramatic overrepresentation of A-to-G mismatches over non-AG ones (including T-to-C) is
therefore strong evidence that most of the detected A-to-G modifications are due to bona
fide ADAR editing. Our conclusions are further supported by the fact that edited
adenosines are surrounded by ADAR’s preferred nucleotide motif, that, as with
other organisms, editing sites are clustered, and through direct validation of a
randomly selected subset of the sites. That being said, we do expect a false discovery
rate of 10-15% in our list of ∼90,000 predicted sites, and these thousands of
false-positives are probably due to inaccurate assembly, SNPs, somatic mutations and
misalignments.1) The de novo transcriptome assembly is a very trivial computational issue.
Many false positives are expected at least in complex mammalian transcriptomes.
Paralogs could affect the reconstruction of real isoforms leading to a sort of
chimeric transcripts. In addition, alternative splicing may complicate transcript
reconstruction. Are there estimations about the impact of alternative splicing and
paralogs in squid? Any impact of this on the results should be discussed in the
text.We completely agree that de novo assembly of complex transcriptomes is bound to result
in many misidentifications of paralogs and splice variants. This may impact our results
by leading to some false detection of “editing“ sites as well as by
missing true sites. As mentioned above, the total false discovery rate can be estimated
by non-AG mismatches, and is rather low. In fact, it is much lower than genome-aware
methods in model organisms, which identify only ∼40% A-to-G mismatches in coding
sequences (see e.g. Supplementary
Table 3 of Ramaswami et al., 2013). The reason is not that our pipeline is
better than the genome-aware methods; it's just that the squid recoding signal is
very strong.It is difficult to estimate the number of false positives due to mis-assembly, but one
may use the total number of T-to-C mismatches (∼5000) to approximate the total
false discovery rate, which is an upper bound for false-discovery rate due to
mis-assembly. As discussed in the text, it seems that SNPs are a major contributor to
false discoveries, so not all of these ∼5000 sites are due to mis-assembly.
Moreover, by applying our pipeline to primate datasets of similar size we identify
≤1000 sites for each mismatch type. Thus one might roughly estimate the number of
false positives due to mis-assembly at a few thousands at most.As for false-negatives, an incomplete transcriptome may lead to missing true recoding
sites. The scope of this may be demonstrated by our primate analysis (Table B), where
the major reason for not identifying about half of the known recoding sites is the
absence of these sites in the de novo transcriptome, or lack of resolution of different
splice variants (see Table B and footnotes). Accordingly, the full recoding repertoire
in the squid may be roughly twice what we report here.In summary, we estimate the number of false-positives due to mis-assembly to be up to a
few thousands. This source of error is included in our false discovery rate, and is far
lower than the true signal. As for false-negatives, we can only extrapolate from the
primate data, and roughly estimate their number to be sizable, making the full recoding
repertoire of the squid up to twice the size reported here.We have amended the text to the following: “This suggests false-positive rates of
15% and 4% respectively, mainly due to transcriptome assembly problems, SNPs, somatic
mutations and systematic mis-alignments. Note that these false-positive rates are
considerably lower than those for similar searches for editing within human coding
sequences, where a genome reference was employed.”And also: “Moreover, incompleteness of the de novo transcriptome, as well as
incorrect assembly of paralogs and splice variants, may cause our pipeline to miss many
additional sites (Table B).”Also, the text should clarify that this is not a completely de novo method since
genomic sequences are generated.Our transcriptome assembly does not use the DNA-reads in any way, and thus is indeed
'de novo'. Editing detection does use DNA-reads, and accordingly we do not
refer to our pipeline as a de novo method for editing detection.2) The strategy is biased towards the RNA editing prediction in protein coding
regions (CDS). Can RNA editing events be detected also in non-CDS regions by the
method? If not, this should be clarified in the text. Related to this, evidence of
RNA editing in repetitive regions in squid could potentially be interesting, probably
revealing an opposite trend than mammals.The question of editing in squid repeats is indeed interesting; however without a
fully-assembled genome available, it is difficult to attack at this point. The main
reason is that non-coding regions in the squid genome tend to be rich in repeats that
dramatically affect the quality of the de novo transcriptome analysis. Thus, while the
de novo transcriptome does include some non-coding sequences, and our method may, in
principle, apply (hence it is not essentially biased towards CDS), we chose to focus
only at CDS in order to improve our signal-to-noise ratio.We have amended the text to read: “To focus on editing sites inside coding
regions, and avoid repetitive elements that are prone to assembly and alignment errors,
we retained only those components that were found to be significant similar (Blastx
E-value<1e-6) to the Swiss-Prot proteins dataset.”3) Regarding methodology, can the statistical binomial test detect any
significant change in the non-AG positions? If yes, how do you explain this
finding?The statistical test is designed to filter out sequencing errors. Thus, many mismatches
that are not editing sites pass these tests, such as (i) SNPs not identified due to
insufficient DNA-seq reads coverage (ii) somatic mutations (iii) mis-alignments (of DNA
and/or RNA reads) and (iv) errors in the underlying transcriptome, due to the issues
mentioned in point #1. In addition, even for the sequencing errors, the
multiple-correction scheme we use (Benjamini-Hochberg) does not guarantee zero
false-positives.All of the above lead to the thousands of non-AG mismatches observed. As mentioned
above, the data presented in Figure 2–figure
supplement 2 suggests that SNPs are the major source of error.The average RNA and DNA coverage is high but regarding RNA editing candidates,
are there filters to exclude low covered sites? What is the minimal coverage for RNA
and DNA?We thank the referees for raising this important issue that was not properly discussed
in our manuscript. In order to minimize the number of arbitrary parameters in our
pipeline, we did not apply any additional DNA and RNA-coverage filter, except for the
implicit requirement for enough read coverage to attain significance in the statistical
tests. Such additional filters do improve the quality of the results (in terms of
signal-to-noise), but we still feel they are not necessary (except for one minor
modification, see below) as we now explain:Weak sites detection is based on a binomial test applied to the RNA reads mapped to the
transcriptome. The minimal number of RNA reads to significance in this case is two
reads, both of which are mismatched. There are only two such cases among the 81930 weak
sites.One may wonder how come a site with 2 reads reading “G” could have an
“A” in the consensus (as is the case for weak sites). The answer is that
while our alignments and statistical tests take into account only uniquely aligned reads
and quality scores Q≥30, Trinity does not apply these criteria. We chose not to
fiddle with Trinity transcriptome assembly and use it as is, and thus there are quite a
few cases where the consensus nucleotide of Trinity differs from the majority of the
reads we considered in our downstream analysis.Another possible filter is the number of DNA reads. As mentioned above we did not apply
any filter of this number. In fact, among our 81930 weak sites there are 1108 sites with
no DNA reads at all, and 1148, 1581, 2110 and 2753 sites with 1,2,3,4 supporting DNA
reads, respectively. At first, one could have thought that these sites should be
discarded, as they might very well be just genomic SNPs. However, looking at the non-AG
mismatches, and repeating our analysis with the additional filter of
minimum-N-DNA-reads, we get the following results:*Note that the counts of AG sites are not exactly the same as the numbers quoted
above for the current dataset. This is because each change of the filters influences the
downstream false discovery rate significance calculation.So, while adding the DNA reads filter does marginally improve accuracy, the number of AG
sites lost (roughly 1800 sites per one additional DNA read required) is an order of
magnitude larger than the false-positive sites weeded out (quantified by the number of
TC sites). We therefore choose not to imply the additional filter, and keep the
thousands of sites with low DNA reads coverage.Notably, eight of the 143 sites tested in our MiSeq validation were taken from this
group. All of these eight sites were confirmed to be editing sites (see Table G). In
addition, most of these sites show evidence of editing in the additional tissues tested,
supporting them being bona-fide editing sites (Supplemental Table 1, available via Dryad data repository).
Lastly, these sites show a clear sequence motif resembling the known ADAR motif:
depletion of G/C in 5' (only 8% and 12% of these sites have G and C in the
5’, respectively) and enrichment of G in 3' (41% of these sites have G in
the 3’).Strong sites detection is based on a p-value calculated by the number of DNA reads. The
minimal number of DNA reads to achieve significance in this case is five DNA reads.We did not apply another filter, which is the number of supporting RNA reads showing G.
Trinity decision to call the consensus site G was considered strong enough evidence for
G in the RNA reads. As explained above, Trinity might call a base even in the absence of
uniquely aligned, high quality reads, so the low number of reads supporting the
“G” call, or even the absence of any such reads, does not mean the site is
not an editing sites. Looking more closely at the sites with low RNA-reads coverage, we
find among our 5649 strong sites 5, 8 and 35 sites with 0, 1, 2 supporting RNA reads
showing “G”, respectively. These numbers should be compared to the number
of similar mismatches of the second most common modification type, which are 7(GA),
2(CT) and 5 (TC), respectively. We therefore agree that, despite our wish to minimize
parameters and avoid arbitrary cutoffs, requiring at least one (uniquely aligned, high
quality) RNA read to support editing is reasonable. We thus add this additional filter,
and remove the five strong sites from the dataset. Sites with even a single read showing
“G” are included (indeed, all of these eight sites, but only two out of
the five we excluded above, show evidence for editing in the additional three
nervous-system tissues tested).The omission of these five sites (three recoding sites) has no visible effect on the
figures, and thus the figures need not be replaced. Numbers have been modified
throughout the paper, when needed, but the changes are always insignificant. Note that
the table of editing sites provides the full information on the number of DNA and RNA
reads, enabling the user to gauge the confidence of the specific site of interest.We have added to the Methods section the following sentences: “We did not apply
additional read-number filters, as these seem to only marginally increase accuracy while
reducing the number of detected sites. However, we did exclude five strong sites for
which there were no uniquely-aligned, high-quality, supporting RNA reads. This brings
the number of strong A-to-G and non A-to-G modifications to 5,644 and 219, respectively.
The full list of weak and strong sites (Supplementary Table 1, available via Dryad data repository)
provides the number of DNA and RNA reads per site.”Did you apply filters to RNA and DNA reads? I mean reads with low quality and
positions at read ends.As mentioned in the Methods section, our alignments considered only uniquely aligned
reads and sites with quality-score Q≥30. However, Trinity de novo assembly takes
into account all reads and treats the quality score differently.We did not remove read ends—our alignment was done to the transcriptome, and thus
the splicing-junction-related misalignments that are known to introduce errors at the
reads' ends are not expected. In addition, we used the local alignment
configuration of Bowtie2, which allows reads’ end “trimming” to
optimize alignment. Indeed, as shown in Figure
2–figure supplement 4, A-to-G mismatches are not overrepresented at
reads' ends.The following text has been added to the Methods section: “In the following,
bases called with quality score Q<30 were discarded. Note, however, that Trinity
consensus sequence does take into account these bases, as well as reads that might have
not been uniquely aligned to the transcriptome. It is often customary to filter out
reads' ends when analyzing RNA-DNA mismatches. A main reason for that is the common
mismatches at reads' ends due to alignment artifacts when a splicing junction
occurs near the ends. In our case, as alignment is done to the transcriptome, we did not
observe any increase in AG mismatch rate near reads' ends (Figure 2–figure supplement 4), and thus no such filter was
used.”4) Have the authors considered the possibility that their results arise from
somatic genomic editing, rather than RNA editing? While for the human and macaque
control, the RNA and DNA samples are from the same tissues, in the case of squid, RNA
samples are from the tissues from the nervous system, while DNA is from the sperm
sack. To unequivocally conclude that the observations are indeed from RNA editing, I
guess that DNA and RNA need to be from the same biological source. Maybe the
investigation of the distribution of the relative proportion of reads supporting and
not supporting the edit could help here.As explained above, somatic mutations might indeed be a partial explanation of our false
discovery rate, but, as they should not be biased towards A-to-G (and certainly should
not introduce more A-to-G than T-to-C mismatches as they do not have strand preference),
their contribution is already accounted for in our estimated false discovery rate (based
on the non-AG mismatches), and they cannot account for a sizable fraction of our A-to-G
sites.In principle, one might have speculated that an endogenous squid-specific DNA-editing
process leads to specific A-to-G mismatches at a level much higher than the random
mutations seen in other organisms. But: (i) we have no evidence for such a DNA deaminase
enzyme in the squid (or any other organism) while we do know ADAR enzymes are present;
(ii) even such a putative DNA deaminase is not expected to select the coding strand over
the other strand, so there is no explanation for the enrichment of A-to-G over
T-to-C.In addition, we can exclude somatic mutations as a major contributor to our putative
editing sites list based on the following:The same sites re-occur in different tissues and different animals (see Figure 2–figure supplement 2).Randomly sampled sites were validated in additional different animals.Sites appear in clusters, as expected for editing sites but not somatic mutations.Nearby sites are not fully correlated (in the same read). Typically, for a pair of
adjacent sites we find reads showing all four combinations (A in both, G in both, A and
G, G and A), with only partial correlation. This is not expected for DNA-originated
mutations.The sites show a clear sequence motif resembling the known ADAR motif (depletion of G/C
in 5'; enrichment of G in 3').5) Related to the above, the authors used RNA only from tissues from the nervous
system. Therefore, it is not possible to assess whether the phenomenon observed is
characteristic of this system, or it is actually systemic in the entire organism. I
think that sequencing RNA from some other non-nervous tissue could help to
distinguish between the two hypotheses.We thank the referees for this important comment. We have now checked the editing levels
in six additional non-nervous-system tissues at the sites already identified.
Interestingly, the editing level in these tissues is an order of magnitude lower than in
the five nervous system tissues previously studied. The data has been added to Supplementary Table 1
(available via Dryad digital repository).We have added to the main text the statement: “Consistently, editing levels
observed in non-nervous-system tissues are considerably lower (Supplementary Table
1)”.We have also added to the Methods section information of the six additional tissues
tested: “Tissues were also dissected from non-neuronal regions: the branchial
heart, the Gill, the ventral epithelial layer on the pen, the marginal epithelial layer
on the pen, the iridophore layer of the skin, and the chromatophore layer of the skin.
Each of these six tissues originated from a different animal. RNA from all tissues was
extracted with the RNAqueous kit (Life Technologies), and genomic DNA was extracted from
the sperm sack using Genomic Tip Columns (Qiagen).”And: “To characterize the modifications levels in non-neuronal tissues, Illumina
sequencing was again utilized to generate ∼23, ∼23, ∼19,
∼26, ∼19 and ∼14 million paired-end, 150 nt reads, using RNA from
the branchial heart, the Gill, the ventral epithelial layer on the pen, the marginal
epithelial layer on the pen, the iridophore layer of the skin, and the chromatophore
layer of the skin. The same alignment procedure was applied to quantify the modification
level at the previously described sites for each of the additional samples, revealing
considerably lower editing levels in the non-neuronal tissues (Supplementary Table 1,
available via Dryad data repository).”6) Regarding the characterization of RNA editing events, events tend to be
tissue specific. Are there events showing tissue specific levels? That is, cases in
which the gene locus in expressed at the same level in all tissues but editing levels
are different.Supplementary Table 1
(available via Dryad digital repository) provides all data required to search for such
cases (A and G reads for each tissue). Indeed, one finds many cases in which the
expression level is similar but the editing level is very different. For example,
comparing OL and GFL tissues, there are 20,394 sites for which we obtained at least 100
reads per tissue, and the number of reads did not change more than 1.5-fold between
tissues. Interestingly, the average editing level in these sites is 1.5-fold higher in
the GFL tissue (9.4%) as compared to the OL tissue (6.1%). Looking for extreme
tissue-specificity among these similar-expression sites, we find 13 cases in which the
GFL sample exhibits editing levels >50% while the OL level at the same site are
<5%, but only 3 cases of the opposite scenario.In summary, we do see tissue specificity even when expression levels are similar. The
data provided enables the reader to find such sites using his/her own parameters.7) It is a little bit disappointing that there is limited investigation in the
potential mechanisms behind the extensive editing observed. The authors could have at
least investigated ADAR with some additional detail. The RNA (and DNA sequence) helps
to delineate the ADAR sequence, and the RNA reads to estimate expression levels. Are
there multiple copies of ADAR in the squid genome? Is ADAR expressed at comparatively
higher expression levels than in organism with low editing levels (they can use the
mouse and human samples to make this comparison? Has the ADAR sequence in squid
diverged faster than expected? In specific domains? All these questions are quite
simple to answer.We agree that, in light of these findings, the structure and function of squid ADARs
could be very intriguing. Our de novo transcriptome identified three ADAR-like enzymes.
One which is similar to ADAR1, one similar to ADAR2, and a third one that we would
predict to be inactive because it contains mutations at key residues which are involved
in catalysis. Vertebrates also contain an inactive ADAR (ADAR3), however the
squid’s “inactive” ADAR appears equally similar to all three
vertebrate ADARs. The squid ADAR2 sequence has been well characterized and is the
subject of two previous publications by the Rosenthal lab. The squid ADAR1 sequence is
the subject of an ongoing investigation by the Rosenthal lab.It is reasonable to hypothesize that high level of editing in squid is due to
comparatively high ADAR expression. However, it is difficult to compare the expression
levels in terms of reads per kilobase of exon per million mapped reads (RPKM) in the
absence of a full genome, because we can’t say whether unmapped reads are
actually mappable to poorly expressed, or poorly resolved, transcripts. That being said,
we did make a rough estimate of ADAR expression to answer this query. By ranking each
component from our transcriptome we found that squid ADAR1 ranks at 0.92 and squid ADAR2
ranks at 0.48 for the GFL tissue, where 1 is the most expressed component, and 0 the
least expressed component. These numbers are not very different from the situation in
the human brain, where (based on the normal total brain tissue in Human Body Map
dataset), ADAR1 is ranked at ∼0.9 and ADAR2 ∼0.5 (numbers are approximate,
as different exons behave differently, and identifying the expression levels of the
different splice variants is not trivial). This is only a rough estimate, and there are
several delicate problems in this comparison; therefore we prefer not to make the claim
that the expression levels are indeed similar in the text. However, we are working under
the hypothesis that other differences between squid and mammalian ADARs (see below) are
more likely to be of importance here.Much is known about squid ADAR2 and it does indeed have several interesting features
that may explain, in part, high level editing. We’d like to stress, however, that
it is only part of the puzzle. Vertebrate ADAR2’s have an invariant domain
structure: they are composed of 2 N terminal double stranded RNA binding motifs (dsRBMs)
followed by a conserved catalytic domain. Squid ADAR2 can have an additional dsRBM at
its N terminus which is included in about half the transcripts through splicing
(Palavicini et al., 2009). Thus, there is a canonical ADAR2 with 2 dsRBMS and a
non-canonical one with three. When produced recombinantly in yeast and tested in vitro
on a squid K channel mRNA substrate, the non-canonical version edits far more sites.
This version also has a much higher affinity for dsRNA (Palavicini et al., 2012).
Another interesting feature of the squid ADAR2 is that its own messages are abundantly
edited, leading to multiple isoforms. The specificities of the individual isoforms have
yet to be tested.We have also cloned and expressed Squid ADAR1. As with squid ADAR2, this enzyme has
notable differences when compared with its vertebrate counterparts. Vertebrate ADAR1s
are normally composed of three dsRBMs followed by a catalytic domain. Squid ADAR1 only
has 1 dsRBM followed by the catalytic domain. At its N-terminus, however, it contains a
highly basic domain that contains scores of phosphorylation sites. Squid ADAR1 messages
are also highly edited. A graduate student in the Rosenthal lab has produced recombinant
Squid ADAR1 and is currently characterizing its function for her doctoral project.
Because it contains many differences from vertebrate orthologs, a full-functional
characterization is a complex undertaking. Although its structure is clearly
interesting, we prefer not to include Squid ADAR1 sequence data in this manuscript
because only through structure-function studies can we assess whether its unique
features might contribute to high-level editing. We would like to assure the reviewer,
however, that these issues are very much on our minds and that we will be publishing
detailed accountings shortly.We have added the following to the main text: “Have squid ADARs evolved novel
structure that account for the high-level editing? A past study showed that a squid
ADAR2 ortholog can be expressed as two isoforms due to alternative splicing: one, having
2 double-stranded RNA (dsRNA) binding motifs, resembles vertebrate ADAR2s. A second,
however, contains an “extra” dsRNA binding motif at its N-terminus. This
non-canonical isoform edits RNA more efficiently, edits more sites, and binds dsRNA with
a far higher affinity. Further Squid ADAR2 messages themselves contain many editing
sites, leading to many subtly different isoforms. An ADAR1 isoform is also present in
our transcriptome and is the focus of an ongoing study.”8) The authors also provide an adaptive explanation to the high levels of
editing observed in the squid genome, and hypothesize that, in contrast to current
assumptions, that extensive editing is common as a way to cope with temperature
adaption, except in mammals that, as homeotherms, would not require such a process.
This is, by the way, reminiscent of the isochore theory by Bernardi that would
separate homeotherm vertebrates from “cold-blooded” (poikilotherm)
vertebrates (to which, by the way, the authors may want to cite). If the authors were
correct that would indeed be a quite relevant result. They could easily employ their
pipeline in available vertebrate RNAseq data (for instance, )
to test this hypothesis.We thank the referees for this comment. The statement at the closing sentence of the
main text was indeed a bit too strong, and we would like to rephrase and clarify our
claim. We do not claim that all cold-blooded animals have extensive
recoding. Published data for Drosophila, the leaf-cutting ant
Acromyrmex echinatior, and C. elegans all show much
less recoding activity than we report for the squid. What we do intend to suggest is
that in those species where extensive editing does happen, it can be utilized for
temperature adaptation.Thus, testing another cold-blooded species is not expected to help much, we presume most
species will not reproduce the exceptional level found in squid. We do intend to
explore, in future studies, the editing profile of species closely related to squid, in
order to better understand how this extensive phenomenon has evolved.Parenthetically, the data in the reference suggested above cannot be used for our
purposes, as no matching DNA-reads are provided.We have modified the closing sentence to read: “Most model organisms studied so
far are mammals which are homeotherms. Future studies of more diverse species are needed
to reveal the extent to which cold-blooded organisms might utilize extensive editing to
respond to temperature changes and other environmental variables”.
Authors: Georges St Laurent; Michael R Tackett; Sergey Nechkin; Dmitry Shtokalo; Denis Antonets; Yiannis A Savva; Rachel Maloney; Philipp Kapranov; Charles E Lawrence; Robert A Reenan Journal: Nat Struct Mol Biol Date: 2013-09-29 Impact factor: 15.369
Authors: Rui Chen; George I Mias; Jennifer Li-Pook-Than; Lihua Jiang; Hugo Y K Lam; Rong Chen; Elana Miriami; Konrad J Karczewski; Manoj Hariharan; Frederick E Dewey; Yong Cheng; Michael J Clark; Hogune Im; Lukas Habegger; Suganthi Balasubramanian; Maeve O'Huallachain; Joel T Dudley; Sara Hillenmeyer; Rajini Haraksingh; Donald Sharon; Ghia Euskirchen; Phil Lacroute; Keith Bettinger; Alan P Boyle; Maya Kasowski; Fabian Grubert; Scott Seki; Marco Garcia; Michelle Whirl-Carrillo; Mercedes Gallardo; Maria A Blasco; Peter L Greenberg; Phyllis Snyder; Teri E Klein; Russ B Altman; Atul J Butte; Euan A Ashley; Mark Gerstein; Kari C Nadeau; Hua Tang; Michael Snyder Journal: Cell Date: 2012-03-16 Impact factor: 41.582
Authors: Gokul Ramaswami; Rui Zhang; Robert Piskol; Liam P Keegan; Patricia Deng; Mary A O'Connell; Jin Billy Li Journal: Nat Methods Date: 2013-01-06 Impact factor: 28.547
Authors: Grzegorz M Boratyn; Alejandro A Schäffer; Richa Agarwala; Stephen F Altschul; David J Lipman; Thomas L Madden Journal: Biol Direct Date: 2012-04-17 Impact factor: 4.540
Authors: Karen Crawford; Juan F Diaz Quiroz; Kristen M Koenig; Namrata Ahuja; Caroline B Albertin; Joshua J C Rosenthal Journal: Curr Biol Date: 2020-07-30 Impact factor: 10.834
Authors: Silvia Moriano-Gutierrez; Eric J Koch; Hailey Bussan; Kymberleigh Romano; Mahdi Belcaid; Federico E Rey; Edward G Ruby; Margaret J McFall-Ngai Journal: Proc Natl Acad Sci U S A Date: 2019-03-04 Impact factor: 11.205
Authors: Virginia Savova; Esther J Pearl; Elvan Boke; Anwesha Nag; Ivan Adzhubei; Marko E Horb; Leonid Peshkin Journal: Dev Biol Date: 2017-03-07 Impact factor: 3.582