Stephen Frenk1,2, Evan H Lister-Shimauchi1,2, Shawn Ahmed1,2,3. 1. Department of Genetics, University of North Carolina, Chapel Hill, North Carolina 27599-3280, USA. 2. Department of Biology, University of North Carolina, Chapel Hill, North Carolina 27599-3280, USA. 3. Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina 27599-3280, USA.
Abstract
Telomeric DNA is composed of simple tandem repeat sequences and has a G-rich strand that runs 5' to 3' toward the chromosome terminus. Small RNAs with homology to telomeres have been observed in several organisms and could originate from telomeres or from interstitial telomere sequences (ITSs), which are composites of degenerate and perfect telomere repeat sequences found on chromosome arms. We identified Caenorhabditis elegans small RNAs composed of the Caenorhabditis telomere sequence (TTAGGC)n with up to three mismatches, which might interact with telomeres. We rigorously defined ITSs for genomes of C. elegans and for two closely related nematodes, Caenorhabditis briggsae and Caenorhabditis remanei Most telomeric small RNAs with mismatches originated from ITSs, which were depleted from mRNAs but were enriched in introns whose genes often displayed hallmarks of genomic silencing. C. elegans small RNAs composed of perfect telomere repeats were very rare but their levels increased by several orders of magnitude in C. briggsae and C. remanei Major small RNA species in C. elegans begin with a 5' guanine nucleotide, which was strongly depleted from perfect telomeric small RNAs of all three Caenorhabditis species. Perfect G-rich or C-rich telomeric small RNAs commonly began with 5' UAGGCU and 5' UUAGGC or 5' CUAAGC, respectively. In contrast, telomeric small RNAs with mismatches had a mixture of all four 5' nucleotides. We suggest that perfect telomeric small RNAs have a mechanism of biogenesis that is distinct from known classes of small RNAs and that a dramatic change in their regulation occurred during recent Caenorhabditis evolution.
Telomeric DNA is composed of simple tandem repeat sequences and has a G-rich strand that runs 5' to 3' toward the chromosome terminus. Small RNAs with homology to telomeres have been observed in several organisms and could originate from telomeres or from interstitial telomere sequences (ITSs), which are composites of degenerate and perfect telomere repeat sequences found on chromosome arms. We identified Caenorhabditis elegans small RNAs composed of the Caenorhabditis telomere sequence (TTAGGC)n with up to three mismatches, which might interact with telomeres. We rigorously defined ITSs for genomes of C. elegans and for two closely related nematodes, Caenorhabditis briggsae and Caenorhabditis remanei Most telomeric small RNAs with mismatches originated from ITSs, which were depleted from mRNAs but were enriched in introns whose genes often displayed hallmarks of genomic silencing. C. elegans small RNAs composed of perfect telomere repeats were very rare but their levels increased by several orders of magnitude in C. briggsae and C. remanei Major small RNA species in C. elegans begin with a 5' guanine nucleotide, which was strongly depleted from perfect telomeric small RNAs of all three Caenorhabditis species. Perfect G-rich or C-rich telomeric small RNAs commonly began with 5' UAGGCU and 5' UUAGGC or 5' CUAAGC, respectively. In contrast, telomeric small RNAs with mismatches had a mixture of all four 5' nucleotides. We suggest that perfect telomeric small RNAs have a mechanism of biogenesis that is distinct from known classes of small RNAs and that a dramatic change in their regulation occurred during recent Caenorhabditis evolution.
Telomeres are repetitive sequences at the end of linear eukaryotic chromosomes that possess a G-rich strand that runs 5′ to 3′ toward chromosome termini (Smogorzewska and de Lange 2004). These sequences play an essential role in the maintenance of genome stability by protecting chromosome ends from recognition by DNA damage response pathways, thereby preventing chromosome fusions (Palm and de Lange 2008). Due to the end replication problem, telomeres are eroded with each cell division. Persistent telomere erosion triggers DNA damage response activation and results in senescence. To counteract telomere erosion, the reverse transcriptase telomerase uses an RNA template to add de novo telomeric repeats to chromosome termini. Although telomerase is essential for maintenance of proper germline function across generations, this enzyme is absent from most human somatic cells. Somatic silencing of telomerase and associated telomere shortening represents a barrier to tumorigenesis in large mammals (Seluanov et al. 2007; Gomes et al. 2011). Telomere maintenance in around 90% of cancers is achieved through activation of telomerase, while the remaining 10% utilize a telomerase-independent mechanism that is less well understood known as Alternative Lengthening of Telomeres (ALT) (Pickett and Reddel 2015).The protective function of telomeres depends in part on binding of a protein complex known as Shelterin, composed of POT1, TRF1, TRF2, TIN2, TPP1, and RAP1, that interacts with single- (POT1) and double-stranded (TRF1 and TRF2) telomeric DNA to promote telomere capping as well as the maintenance of a repressive chromatin environment (Dyer et al. 2017; de Lange 2018). Heterochromatin associated histone modifications such as H3K9me3, H3k20me3 and hypoacetylation of H3 and H4 are a common feature of telomeres (Schoeftner and Blasco 2010). Somewhat paradoxically, transcription by RNA polymerase II has also been observed at telomeres, leading to formation of the G-rich telomere repeat containing long noncoding RNA TERRA, which is composed of subtelomeric DNA sequences and some perfect telomere repeats, as well as the less abundant C-rich antisense telomeric transcript ARRET (Azzalin et al. 2007; Bah et al. 2012). TERRA expression is up-regulated at critically short and damaged telomeres, where it may recruit telomerase and chromatin modifiers in order to promote telomere elongation and protect telomeres (Cusanelli et al. 2013; Azzalin and Lingner 2015). It has been observed that TERRA prominently coats telomeres of the silent X chromosome in mammals and may promote X chromosome inactivation, but also localizes to autosomal telomeres and to nontelomeric regions of the genome (Schoeftner and Blasco 2008; Zhang et al. 2009; Chu et al. 2017).Silencing of certain repetitive regions of the genome such as pericentromeres has been shown to require transcription of long noncoding transcripts that interact with homologous small interfering RNAs (siRNAs) that are bound to Argonaute proteins (Reinhart and Bartel 2002; Holoch and Moazed 2015). siRNAs can also promote mRNA destruction or translational repression in the cytoplasm.Endogenous small RNA pathways have been well characterized in the nematode worm Caenorhabditis elegans (C. elegans) (Billi et al. 2014). Primary siRNAs can be created by DCR-1, the C. elegans homolog of the highly conserved RNAse class III enzyme Dicer, which cleaves double-stranded RNA transcripts to create 26 nt long siRNAs beginning with a 5′ guanine nucleotide (26G RNAs) (Bernstein et al. 2001; Ketting et al. 2001; Knight and Bass 2001). These primary siRNAs are loaded onto Argonaute proteins that bind to a target transcript through complementary base-pairing, and then recruit RNA dependent RNA polymerases to promote formation of 22 nt long secondary RNAs with a 5′ guanine (22G RNAs), which bear perfect complementarity to their targets. A distinct class of germline primary siRNAs termed piRNAs are 21 nt small RNAs with a 5′ Uracil that are created from thousands of loci in the genome. piRNAs contribute to genome defense by silencing non-self transcripts such as transposons (Batista et al. 2008; Das et al. 2008; Wang and Reinke 2008; Shi et al. 2013). piRNAs interact with the conserved Argonaute PIWI and can bind target RNAs with up to three mismatches, thereby triggering by recruitment of RDRPs and formation of 22G RNAs that are perfectly homologous to their targets and promote genomic silencing, similar to secondary siRNA formation in response to Dicer-dependent 26G primary siRNAs (Ruby et al. 2006; Shen et al. 2018; Zhang et al. 2018).Consistent with a potential role of TERRA in telomere maintenance, siRNAs with homology to telomeres have been observed in a variety of organisms, where they may contribute to telomere maintenance. Small RNAs produced from telomere repeat-containing transcripts were found to promote DNA methylation at Arabidopsis telomeres (Vrbsky et al. 2010). Telomeric small RNAs have also been detected in mouse embryonic stem cells (Cao et al. 2009). Telomere dysfunction can induce production of telomeric siRNAs with perfect homology to the vertebrate telomere repeat sequence (TTAGGG)n, which are created by dsRNAs that promote the DNA damage response (Rossiello et al. 2017). However, the origin of small RNAs with homology to telomeres that are present under standard growth conditions of wild-type metazoan cells is not well understood.Interstitial telomere sequences (ITSs) are degenerate telomeric sequences that are scattered along chromosome arms and are distinct from chromosome termini. A pioneering study by Meyne and coworkers identified cytologically visible ITSs in 55 different vertebrate species using fluorescence in situ hybridization (FISH) (Meyne et al. 1990; Lin and Yan 2008). Azzalin et al. (2001) identified a number of ITS sites in the human genome by cloning methods and found that ITSs could be divided into three categories representing different putative mechanisms of origin: short, subtelomeric, and fusion. Short ITSs are 24–130 bp in length and display few mismatches with perfect telomere sequence. The authors suggested that these ITSs may be formed during DNA double-strand break repair. Fusion ITSs are likely a consequence of ancestral head-to-head telomere fusion events. Notably, only a single fusion-type ITS was identified in the human genome at 2q13, which led to the creation of chromosome 2 from two ancestral Hominid chromosomes (IJdo et al. 1991).A number of studies have suggested a link between ITS sites and genome instability (Ashley and Ward 1993; Kilburn et al. 2001; Lin and Yan 2008; Bosco and de Lange 2012). High mutation rates were observed at a region of telomeric sequence that was inserted into a yeast chromosome close to the left telomere (Aksenova et al. 2013; Moore et al. 2018). Intriguingly, the types of mutation observed depended on the orientation of the telomeric sequence, which can lead to gross chromosomal translocations and inversions or repeat loss or gain (Aksenova et al. 2015). Although ITSs can promote genomic instability, Shelterin components in mammals bind ITSs where they may act to stabilize these sequences by preventing the formation of secondary structures that interfere with DNA replication (Yang et al. 2011).Inactivation of telomerase in C. elegans causes progressive sterility in the vast majority of animals. However, a small number manage to reproduce indefinitely in the absence of telomerase due to activation of ALT. In an analysis of stable C. elegans ALT strains, several strains were shown to have amplified a region of DNA between a telomere and an ITS sequence, which had then been trans-duplicated to the ends of the other chromosomes, where it acted as a surrogate telomere (Seo et al. 2015). This event is believed to have initiated from recombination between a critically shortened telomere and the ITS site, indicating that the high degree of similarity of ITS sites and telomeres has the potential to contribute to tumorogenesis and genome evolution by promoting a form of ALT-mediated telomere maintenance. However, despite the presence of ITSs in diverse species, the evolutionary origin and biological significance of these sequences remains mysterious.Due to the combination of a wide availability of sequencing data and a genome that has been fully sequenced and assembled end-to-end, C. elegans provides a highly convenient model system in which to study small RNAs and genome biology. In this study, we analyzed published C. elegans small RNA-seq data sets and found numerous examples of small RNA species that map to telomere repeats with up to three mismatches. We found that the majority of these small RNAs map to ITS sites, which are enriched for repressive chromatin marks and found mainly in the introns of protein-coding genes. We show that the closely related nematodes Caenorhabditis briggsae and Caenorhabditis remanei possess less abundant and generally more degenerate ITS sites than those in C. elegans, but display a substantially higher abundance of small RNAs that map perfectly to telomeres. We speculate that telomeric small RNAs may represent part of a functional pathway that has been significantly altered in C. elegans but retained in closely related species.
RESULTS
Identification of small RNAs with high levels of telomere homology
We interrogated 75 publicly available small RNA sequencing libraries created from RNA isolated from various C. elegans strains and found rare reads that mapped perfectly to the C. elegans telomere repeat (TTAGGC)n, occurring at a frequency of one per 10,000,000 reads (Supplemental Table S9; Gu et al. 2009; Gent et al. 2010; Fischer et al. 2011; Seth et al. 2013; Phillips et al. 2014, 2015; Sapetschnig et al. 2015). As piRNAs with up to three mismatches can target regions of the C. elegans genome for silencing (Bagijn et al. 2012), we counted telomeric siRNAs with up to three mismatches and found that these were substantially more abundant than those composed of perfect telomere repeat sequence (1 per 100,000) (Fig. 1A). In total, 2352 distinct telomeric small RNA species were detected. Telomeric siRNAs with zero to two mismatches were significantly more likely to be composed of the telomeric G-strand sequence than would be expected by chance (P < 2.2 × 10−16, binomial test) (Fig. 1D). The proportion of telomeric siRNAs with one to three mismatches beginning with a G nucleotide was significantly greater than that for all siRNAs that could be mapped to the genome (P < 1 × 10−15, Fisher's exact test for all comparisons) and the median length for telomeric siRNAs with 1–3 mismatches RNAs was 22 nt (Fig. 1B,C). Therefore, telomeric siRNAs with 1–3 mismatches are enriched for the effector 22G RNA species, which has been established to promote either posttranscriptional or genomic silencing depending on the Argonaute they associate with (Billi et al. 2014). In contrast, rare telomeric siRNAs with 0 mismatches (perfect telomeric siRNAs) were not enriched for 22G RNAs but displayed peak lengths of 18 and 23 nt.
FIGURE 1.
Identification of telomeric small RNAs in C. elegans. (A) Quantification of small RNA species in published data sets that map to telomere sequence with different numbers of mismatches. (B) Proportion of telomere-mapping small RNA reads in published data beginning with a particular nucleotide. (C) Distribution of telomeric small RNA length. (D) Proportion of telomeric small RNA reads mapping to the G-rich (TTAGGC) or C-rich (GCCTAA) telomeric strand.
Identification of telomeric small RNAs in C. elegans. (A) Quantification of small RNA species in published data sets that map to telomere sequence with different numbers of mismatches. (B) Proportion of telomere-mapping small RNA reads in published data beginning with a particular nucleotide. (C) Distribution of telomeric small RNA length. (D) Proportion of telomeric small RNA reads mapping to the G-rich (TTAGGC) or C-rich (GCCTAA) telomeric strand.Although the frequency of perfect telomeric siRNAs that we observed in C. elegans was rare, only one read in 10 million, the number of possible sequences for a 20mer is 420, or approximately 1 × 1012. Given that there are 12 possible perfect telomeric 20mers (six possible starting bases, multiplied by two for sense/antisense), the expected frequency of a 20mer being a perfect telomere sequence comes to approximately 1 in 1 × 1011, which is considerably lower than the frequency of 1 in 107 that we observed. To further investigate the prevalence of perfect telomeric small RNAs, we ran 1000 simulations in which 10,000,000 random nucleotide sequences were generated. The length of each sequence was determined probabilistically based on the length distribution of all the reads in the libraries we analyzed. We did not detect any perfect telomere sequences in any of the simulations. We therefore conclude that the low numbers of perfect telomeric reads we observed are not due to chance.Regarding telomeric siRNAs with mismatches, there are 1140 combinations of three positions in a 20mer. At each mismatch, there are four possible bases, so we multiplied 1140 by 4 to get 4560. Multiplying that value by our probability of 1/(420) for finding a specific 20mer comes to 1/2.41 × 108. Dividing this probability by 12, as above, resulted in a probability of finding a telomeric siRNA with a mismatch to be approximately 1 in 20,000,000. We ran a similar simulation to the one above but allowed up to three mismatches and only simulated 100,000 reads each time. Out of 1000 simulations, three showed a frequency of at least one telomeric sequence per 100,000 sequences, giving a P-value of 0.003. We therefore conclude that the observed number of telomeric small RNAs with up to three mismatches is greater than expected by chance. Given the high number of possible telomeric sequences with up to three mismatches allowed (656,640), it is interesting that we only found 2352 distinct species. This is however consistent with the hypothesis that telomeric small RNAs are largely produced from a relatively small number of genomic loci (see below).
ITS sites explain the origin of most telomeric small RNAs
Telomeric small RNAs could be divided into those that map to unique locations in the genome, which amounted to 10% and 57% of telomeric small RNAs with two or three mismatches, respectively, and those that map to multiple locations, which amounted to 100% of telomeric small RNAs with zero or one mismatches and 90% and 33% of telomeric small RNAs with two or three mismatches, respectively. The overall mapping frequency of telomeric small RNAs compares favorably with mapping frequency of all reads in a small RNA library, which generally amounts to 75%–80% in our experience.As unique telomeric small RNA reads mapped to loci that were scattered along chromosome arms, we reasoned that these loci may represent ITS sites that were previously shown to be concentrated on C. elegans chromosomes arms, based on the Ce000094 repeat sites that were automatically detected using the RECON algorithm (Bao and Eddy 2002; Lowden et al. 2011) (https://wormbase.org/). Although the Ce000094 motif corresponds to telomeric DNA, we later noticed that there were many ITS sites that were covered only partially by this motif or not at all. This is likely a consequence of the degeneracy of ITS sites, which makes them challenging to detect de novo using automated algorithms. In order to more thoroughly map ITS sites genome-wide, we located all instances of the telomeric hexanucleotide “TTAGGC” in the genome. We sought to capture perfect runs of telomere repeats as well as more degenerate sequences consisting of clusters of perfect repeats interspersed with nontelomeric sequence. TTAGGC hexanucleotides located on the same strand and within 100 nt of each other were merged to form ITS contigs, and contigs that contained at least four perfect telomeric hexanucleotides were retained, resulting in 1229 ITS sites (Supplemental File 1).ITS lengths ranged from 24 to 1204 nt with a median length of 134 nt (Fig. 2F). ITS sites were found to be largely localized to the autosome arms, which have been shown to be largely heterochromatic and enriched for repetitive sequences (Fig. 2A; Garrigues et al. 2015). There was a fivefold depletion of ITS sites on the X chromosome (Fig. 2B), consistent with lower levels of constitutive heterochromatin on this chromosome and an enrichment for Polycomb marks that silence the X chromosome in germ cells (Gaydos et al. 2012).
FIGURE 2.
Characterization of ITSs in C. elegans. (A) Localization of ITS sites on chromosome I regions in the sense orientation (TTAGGC) are shown in black, antisense in red. (B) Localization of ITS sites on the X chromosome. (C) Proportion of ITS sites overlapping different genome features (blue bars) and amount of genomic space taken up by genomic features (pink). (D) Average proportion of features overlapped by hexanucleotide repeat sequences from 1000 simulations. Error bars show standard deviation. (E) Percentage of real or permuted (“simulation”) ITS sites that are covered by a particular chromatin state. Distributions of ITS length (F) and homology score (G).
Characterization of ITSs in C. elegans. (A) Localization of ITS sites on chromosome I regions in the sense orientation (TTAGGC) are shown in black, antisense in red. (B) Localization of ITS sites on the X chromosome. (C) Proportion of ITS sites overlapping different genome features (blue bars) and amount of genomic space taken up by genomic features (pink). (D) Average proportion of features overlapped by hexanucleotide repeat sequences from 1000 simulations. Error bars show standard deviation. (E) Percentage of real or permuted (“simulation”) ITS sites that are covered by a particular chromatin state. Distributions of ITS length (F) and homology score (G).Although the mechanism by which ITS sites are created is unclear, these could be created by translocation of a telomere to an internal segment of the genome or by telomerase activity that acts at a double-strand break. Therefore, ITS sites are likely to be initially composed of perfect telomere repeats that degenerate over evolutionary time scales, such that they accumulate mutations in the absence of selective pressure. To obtain an estimate of the relative ages of ITS sites, we assigned a “homology score” to each site by performing local alignment between the ITS site and perfect telomere sequence of the same strand and length. The homology score was defined as the percent sequence identity between the two sequences. Homology scores ranged from 28.6 to 100, with a median score of 84.2 (Fig. 2G; Supplemental File 1). Just over half (650) of ITSs had two or fewer consecutive perfect telomere repeats, although 25 ITS sites contained 7–15 consecutive perfect telomere repeats (Table 1). We found no obvious genome localization patterns in terms of homology score or sequence length. ITS sites therefore represent abundant repetitive sequences displaying a broad range of degeneracy (and therefore age) and genome localization patterns consistent with other types of repetitive sequence (Ikegami et al. 2010; Garrigues et al. 2015).
TABLE 1.
Tally of the maximum number of consecutive perfect telomere repeats in ITS sites
Tally of the maximum number of consecutive perfect telomere repeats in ITS sitesThe origin of ITS sites is poorly understood. One proposed mechanism for ITS formation is the fusion of two chromosomes, although there currently exists evidence for only one such event in the human genome (IJdo et al. 1991). We sought to determine whether ancestral chromosome fusion events could account for the presence of ITSs in the C. elegans genome. However, we could find no example of an ITS site arranged head-to-head, and therefore conclude that ITS sites in the C. elegans genome are unlikely to arise as a result of chromosome fusions.To determine whether ITS sites are a source of telomeric small RNAs, we mapped all small RNAs to the genome using a pipeline that assigns multimapping reads probabilistically based on the local density of uniquely mapped reads (Axtell 2014). 76%, 91%, 83%, and 68% of small RNAs mapping to telomeres with zero, one, two, and three mismatches, respectively were assigned to ITS sites (Table 2), whereas 92% and 75% of uniquely mapping small RNA reads with two or three mismatches, respectively, could be assigned to ITS sites. Although it is impossible to determine whether perfect telomeric small RNAs originate from telomeres or internal regions with certainty, these results suggest that such small RNAs are more likely to be produced from internal genomic regions. This is because many ITS sites are in heterochromatic segments of the genome and are surrounded by high densities of siRNAs, whereas few siRNAs map closely to subtelomeres. Furthermore, we looked for uniquely mapping reads that overlapped with regions of perfect telomere repeat sequence of at least 30 bases in length, of which there were 95 in ITSs, and found 18 reads within internal regions, while only three reads could be found that overlapped telomere–subtelomere boundaries. As each ITS tract has two borders, this means that 18 small RNAs were present at these 190 ITS tract borders (1%), whereas three siRNAs were present at 12 telomere–subtelomere boundaries (25%). These results suggest that perfect telomeric small RNAs could originate from either telomeres or ITS sites.
TABLE 2.
Summary of unique telomeric small RNA reads
Summary of unique telomeric small RNA readsWe asked whether telomeric small RNA abundance varied across ITS sites. Due to the uncertainties associated with multimapping reads, we decided to focus on the uniquely mapping subset of telomeric small RNAs with two or three mismatches. We could find no strong correlation between telomere homology score and small RNA mapping frequency for ITS (Spearman's ρ = 0.068), although the ITS sites with the highest telomeric small RNA abundance had homology scores of around 80% and no telomeric small RNAs were found at sites with scores <53.8% (Fig. 3A). For example, all ITS tracts with >10 telomeric small RNAs had homology scores of >76% (Table 3). The abundance of uniquely mapping telomeric small RNAs correlated well with total small RNA abundance at an ITS tract (Spearman's ρ = 0.70), indicating that ITS sites that produce high numbers of telomeric small RNAs are likely to be sites of high small RNA production in general. Of the top 10 ITSs with the highest overall small RNA abundance, seven also displayed high telomeric small RNA abundance. The remaining three ITSs that had no telomeric small RNAs assigned had homology scores of <56%. We found one ITS site (ITS_60) which displayed an unusually high frequency of small RNA mapping that was five- to 13-fold greater than other ITS tracts with abundant siRNAs and accounted for 43% and 18% of total telomeric small RNAs with two and three mismatches, respectively (Table 3; Fig. 3). This ITS is intergenic and fully overlaps two annotated converging noncoding RNAs, whose transcripts could anneal to produce dsRNA that can be processed into abundant ITS siRNAs. (Fig. 3B). As telomeric small RNAs did not map exclusively to ITS sites, we also looked at the genome-wide distribution of telomeric small RNAs (Fig. 3C). We found three peaks of high coverage that did not overlap any ITS sites. Two of these, located on the left arms of chromosomes III and V, contained two consecutive perfect telomeric repeats and were assigned a number of reads with two or three mismatches to telomeres, but none with zero or one. The other peak, located on the right arm of chromosome I, contained three consecutive perfect telomere repeats next to a degenerate one, narrowly missing the criteria for being detected as an ITS site. Therefore, telomeric small RNAs that do not originate from ITS sites are likely to originate from non-ITS regions that happen to have short stretches of telomeric DNA.
FIGURE 3.
Telomeric small RNAs map to ITS sites. (A) Scatter plots showing ITS score versus uniquely mapping small telomeric small RNAs count as a percentage of total abundance for each category for ITS sites with at least one small RNA read. ITS_60 is shown in red. (B) Genome track showing the top four ITSs with the highest uniquely mapping telomeric small RNA count. Total small RNA coverage and telomeric small RNA reads are shown as separate tracks (C) Genome-wide distribution of ITS sites. Y-axis shows uniquely binding telomeric small RNA count.
TABLE 3.
ITS sites with more than 10 telomeric small RNAs
Telomeric small RNAs map to ITS sites. (A) Scatter plots showing ITS score versus uniquely mapping small telomeric small RNAs count as a percentage of total abundance for each category for ITS sites with at least one small RNA read. ITS_60 is shown in red. (B) Genome track showing the top four ITSs with the highest uniquely mapping telomeric small RNA count. Total small RNA coverage and telomeric small RNA reads are shown as separate tracks (C) Genome-wide distribution of ITS sites. Y-axis shows uniquely binding telomeric small RNA count.ITS sites with more than 10 telomeric small RNAs
ITS sites are mainly found in introns and are enriched for repressive chromatin marks
We looked for overlap between ITS sites and annotated genomic features and found that they were enriched in introns and in intergenic sequences. Surprisingly, 796 ITS sites (63% of total) were located in introns of protein-coding genes (Fig. 2C). In terms of sequence coverage, this represents almost a twofold enrichment for ITS sequence at introns based on the proportion of the genome that is comprised of intronic sequence. In contrast, only two ITS sites (0.16%) were found in protein-coding exons, indicating that ITS sites are depleted from exome, possibly because they are deleterious for mRNA expression. Consistently, ITS sites were also significantly depleted from both 5′- and 3′-UTRs. Such a pattern could be a general feature of blocks of DNA comprised largely of degenerate hexanucleotide repeats. To test this, we repeated the ITS detection pipeline with 100 random hexanucleotide combinations. For each hexamer, we defined degenerate repeat sites using the same method used to detect ITS sites and looked for overlap between these sites and annotated genomic features. The simulated repeat sites did not show an overall enrichment at introns or depletion at exons (Fig. 2D), indicating that this pattern of localization is specific to ITS sites. Genes containing intronic ITS sites were not significantly enriched for any GO terms.Features of eukaryotic chromatin such as histone modifications and nucleosome occupancy drive spatio-temporal regulation of genomic regions. Ho et al. (2014) used a hidden Markov model based algorithm to classify chromatin into a series of “states” based on the integration of various CHIP-seq and microarray data sets. The 16 states were grouped into six categories: promoter, enhancer, gene body, Polycomb-repressed, heterochromatin, and weak or low signal. We assigned chromatin states to each ITS sequence using the C. elegans data set. To determine statistically significant enrichment for chromatin states, we performed 1000 simulations in which a new set of 1219 sequences with the same length distribution as the ITS sequences was randomly selected from the genome and assigned chromatin states. Both intronic and intergenic ITS sites were enriched two- to fourfold for Polycomb-repressed chromatin states in both the early embryo and at the L3 larval stage of development relative to the corresponding simulated sites (P < 0.0003 for all classes), with the highest enrichment in the L3 stage (Fig. 2E; Supplemental Table S1). Intronic but not intergenic ITS sites were enriched twofold for heterochromatin chromatin states (P = 0.008 and 0.02 for early embryo and L3, respectively) and were depleted for enhancer chromatin states. The chromatin state of intronic ITSs matched that of the nearest exon in over 99% of cases for both early embryo and L3, indicating that the chromatin environment of intronic ITSs is almost always shared by the entire gene. Intergenic ITS sites were enriched 1.5-fold and twofold for enhancer states in early embryo and L3 larvae, respectively (P < 0.00048, P = 0.0003 for L3 larvae and early embryo, respectively). We looked at the genes most proximal to intergenic ITSs with enhancer chromatin states but found no significantly enriched GO terms among these genes. Both intronic and intergenic ITS sites were strongly depleted for promoter chromatin states at both stages (P < 0.001 in each case). We found no significant relationship between chromatin state and proportion of ITSs that produce at least five small RNAs (P = 0.1, χ2 test for both early embryo and L3 stages). We conclude that ITS sites are generally associated with repressive chromatin, that intronic ITS sites may promote repressive chromatin formation via heterochromatin or Polycomb pathways, but that some intergenic ITS sites may possess enhancer function.
C. briggsae and C. remanei ITS sites are less common and more degenerate than those of C. elegans
We sought to gain insights into the evolutionary conservation of ITS sites and the relationship between ITS sites and small RNA production in nematodes. C. briggsae is one of the closest known living relatives of C. elegans, and diverged from C. elegans between 18–100 million years ago (Stein et al. 2003; Cutter 2008). It is physically virtually indistinguishable from C. elegans. C. remanei is another close relative of C. elegans. Although C. briggsae and C. remanei may have shared a recent common ancestor that diverged from C. elegans, C. remanei reproduces by gonochoristic male–female sexual reproduction, whereas C. elegans and C. briggsae are both hermaphroditic selfers (Fierst et al. 2015).We applied the same method for detecting ITS sites in C. briggsae and C. remanei and found only 372 and 292 sites, respectively, less than one third of the number found in C. elegans in each case. ITS sites in C. briggsae and C. remanei were significantly more divergent than those from C. elegans, with median homology scores of 49.5 and 57.1 (P < 2.2 × 10−16, Mann–Whitney U-test for both comparisons) (Fig. 4A), while the median length for ITS sites was slightly higher for C. briggsae and C. remanei at 206 (P < 2.2 × 10−16, Mann–Whitney U-test) and 178 nt (P = 6.058 × 10−09, Mann–Whitney U-test), respectively (Fig. 4B). Interestingly, although 93% and 57% of ITSs in C. briggsae and C. remanei contained no consecutive telomere repeats, both species possessed a handful of 36–228 nt ITSs with perfect telomere homology (Table 1; Supplemental File 1). Although C. remanei had the longest perfect telomere tracts of the three species, C. remanei and C. briggsae possessed few ITS tracts with long stretches of perfect telomere repeats in comparison to C. elegans (Table 1). All three nematode genomes showed an intron bias for ITS sites, which in comparison to C. elegans was stronger for C. briggase and weaker for C. remanei. (Fig. 4C,D). It was not possible to determine an enhancer bias, because the other genomes have not been annotated for this feature.
FIGURE 4.
ITS sites in C. briggsae and C. remanei. (A) Distribution of ITS scores. (B) Distribution of ITS lengths. (C) Proportion of ITS sites overlapping different genome features (blue bars) and amount of genomic space taken up by genomic features (pink). (D) As (C) but for C. remanei. (E) Number of genes for which orthologous genes in two species containing intronic ITS sites in both species. (F) Domain structures of C. elegans, C. briggsae, and C. remanei human POT1 homologs.
ITS sites in C. briggsae and C. remanei. (A) Distribution of ITS scores. (B) Distribution of ITS lengths. (C) Proportion of ITS sites overlapping different genome features (blue bars) and amount of genomic space taken up by genomic features (pink). (D) As (C) but for C. remanei. (E) Number of genes for which orthologous genes in two species containing intronic ITS sites in both species. (F) Domain structures of C. elegans, C. briggsae, and C. remanei human POT1 homologs.To gain insights into the evolutionary history of ITSs across different species, we looked for intronic ITSs that are shared within orthologous genes. We could not identify genes for which orthologues between all three species contained ITS sites. We performed pairwise comparisons between the three nemotode species and found a handful of genes in each case for which orthologous genes between the two species contained ITS sites in both species, which we termed “shared” ITS sites as opposed to “unique” sites (Fig. 4E; Supplemental Table S2). We found that the median score was slightly but significantly lower for shared sites compared with unique sites in C. elegans (78.7 versus 84, P = 0.008443, Mann–Whitney U-test), although no significant difference was noted for C. briggsae or C. remanei (Supplemental Fig. S1). Notably, there were no shared ITS sites with scores >94.4%, consistent with unique ITS sites having been formed more recently in evolution. We performed GO term analysis on C. elegans genes orthologous to C. briggsae and C. remanei genes that contain ITSs. The C. briggsae gene set was enriched for GO terms related to reproduction cell cycle and organelle organization. No significant GO terms were enriched in the C. remanei gene set. Therefore, the mechanism of ITS formation is ancient in origin but ITSs themselves are divergent among nematode species, suggesting that a possible function of ITS tracts in regulation of chromatin if present diverges rapidly between species.
Telomeric small RNAs are more abundant in C. elegans relatives
We then interrogated small RNA libraries in C. briggsae and C. elegans (Shi et al. 2013) and looked for small RNAs that map to telomeres. The same data set contained a C. elegans library, which we also analyzed as a control. Consistent with our previous analysis, reads mapping perfectly to telomeres were exceedingly rare in the C. elegans libraries; three were detected in hermaphrodites, one in embryos and none in males, corresponding to a total of around one read per 10,000,000 small RNA reads. Strikingly, libraries from C. briggsae and C. remanei contained several orders of magnitude more perfect telomeric small RNAs compared with C. elegans (Fig. 5A). Perfect telomeric small RNAs were detected at an abundance of more than 22 per million in C. briggsae embryos and mixed male/female C. remanei adults. The C. elegans perfect telomeric small RNAs detected in control libraries from the briggsae and remanei data sets showed the same G-rich strand bias as found in the libraries described earlier. However, the C. briggsae and C. remanei libraries showed a strong bias for the C-rich strand bias for siRNAs composed of perfect telomere repeats (Fig. 5B). Despite this strong distinction in strand bias, perfect telomeric small RNAs from all three species showed a striking depletion of perfect telomeric small RNAs species that begin with a 5′ G, whereas abundant levels of small RNAs with a 5′ guanine were observed for telomeric siRNAs with mismatches.
FIGURE 5.
Comparison of telomeric small RNAs between three Caenorhabditis species. (A) Telomeric small RNA abundance in small RNA sequencing libraries (Shi et al. 2013) from three Caenorhabditis species. (B) Proportion of telomeric small RNA reads mapping to the G-rich (TTAGGC) or C-rich (GCCTAA) telomeric strand. (C) Proportion of telomere-mapping small RNA reads in published data beginning with a particular nucleotide. (D) Distribution of telomeric small RNA length.
Comparison of telomeric small RNAs between three Caenorhabditis species. (A) Telomeric small RNA abundance in small RNA sequencing libraries (Shi et al. 2013) from three Caenorhabditis species. (B) Proportion of telomeric small RNA reads mapping to the G-rich (TTAGGC) or C-rich (GCCTAA) telomeric strand. (C) Proportion of telomere-mapping small RNA reads in published data beginning with a particular nucleotide. (D) Distribution of telomeric small RNA length.If the perfect telomeric siRNAs are split into G- and C-strand categories, the G-rich RNAs from all three species were enriched for RNAs beginning with 5′ UAGGCU or 5′ UUAGGC, ranging in size from 18–29 nt. The most abundant sRNAs from all three species were 23 nt RNA UUAGGCUUAGGCUUAGGCUUAGG and 22 nt RNA UAGGCUUAGGCUUAGGCUUAGG (Fig. 6). On the other hand, C-rich telomeric RNAs of C. briggsae and C. remanei species showed enrichment for 22 nt perfect telomeric small RNAs beginning with a C (Fig. 5C,D) of 30% and 32%, respectively. Most perfect C-rich telomeric RNAs from both C. briggsae and C. remanei began with 5′ CUAAGC and ranged from 18–24 nt in length, with the most abundant C-rich RNA being the 22 nt sequence CUAAGCCUAAGCCUAAGCCUAA, which comprised 22% and 29% of all sequences for C. briggsae and C. remanei, respectively (Supplemental Table S2). This sequence reflects the majority of 22 nt perfect telomeric small RNAs for each species (52% and 57%, respectively). Given that there are only six permutations of a perfect telomeric sequence of a given length, each permutation would be expected to occur at a probability of 1/6 if the sequence distribution was random. We conclude that the general depletion of 5′G nucleotides from perfect telomeric siRNAs can be largely explained by the fact that G-rich RNAs commonly begin with 5′ UAGGCU and 5′ UUAGGC, whereas C-rich RNAs commonly begin with 5′ CUAAGC (Fig. 6; Supplemental Table S2).
FIGURE 6.
Perfect telomeric small RNAs of the genus Caenorhabditis. Total observed G-rich (A) and C-rich (B) telomeric small RNAs from RNA-seq and IP data sets. (C) Abundant telomeric sRNAs can anneal to form duplexes.
Perfect telomeric small RNAs of the genus Caenorhabditis. Total observed G-rich (A) and C-rich (B) telomeric small RNAs from RNA-seq and IP data sets. (C) Abundant telomeric sRNAs can anneal to form duplexes.Due to the relatively small number of available small RNA libraries for C. briggsae and C. remanei and the fact that telomeres have not been assembled in the reference genomes for these species, we could not make inferences about whether telomeric small RNAs were more likely to originate from telomeres or ITSs based on siRNAs that map to borders of perfect telomere repeat tracts. If perfect telomeric small RNAs originate from ITS sites, an increase in abundance for these RNAs could be due to an increase in the amount of ITS sequence that is capable of producing them. However, the C. elegans genome contains a combined total of 3666 bases of perfect telomere existing in stretches of at least 30 nt in length, compared with 1596 for C. briggsae and 708 for C. remanei, indicating that increased levels of perfect telomeric ITS sequence cannot account for the increase in perfect telomeric small RNA abundance. We conclude that the composition of telomeric small RNAs in C. elegans has diverged substantially from its close relatives, but that both classes of small RNAs are dramatically depleted for 5′ G RNA species, which could reflect either a toxicity of this species or its mechanism of biogenesis.Argonaute proteins can promote the biogenesis of specific classes of small RNAs. For example, ALG-3 and ALG-4 associate with 26G RNAs and PRG-1/Piwi associate with 21U RNAs, whose levels are dramatically reduced in response to deficiency for prg-1 (Ruby et al. 2006; Batista et al. 2008; Das et al. 2008; Han et al. 2009; Vasale et al. 2010). We studied small RNAs that immunoprecipitate with Argonaute proteins PRG-1, WAGO-1, HRDE-1, and CSR-1 (Batista et al. 2008; Das et al. 2008; Claycomb et al. 2009; Gu et al. 2009; Ashe et al. 2012; Buckley et al. 2012). Perfect telomeric small RNAs were exceedingly rare in the input RNA data sets, as expected, and we found no enrichment for C- or G-strand telomeric RNAs in any IP data set (Supplemental Table S3). However, these data sets revealed the only three C. elegans examples that we observed of the major C-strand RNA species found in C. briggsae and C. remanei, 23 nt 5′ CUAAGC RNA. The IP data sets brought the total number of C. elegans C-rich telomeric RNAs to 14 (Fig. 6; Supplemental Tables S2, S3). Significant numbers of telomeric small RNAs with mismatches were associated with PRG-1, WAGO-1, HRDE-1, and CSR-1 Argonaute proteins.
Evolution of telomeric small RNAs
Perfect telomeric siRNAs have been previously reported to associate with the Tetrahymena Argonaute protein Twi10, were of lengths 23–24 nt and 33–36 nt and all possessed the sequence 5′ UGGGGU (Couvillion et al. 2009), potentially similar to the 5′ UAGGCU G-rich RNAs that were present in Caenorhabditis species. Telomeric siRNAs have been previously reported to be rare in mammals, where they may promote the response to DNA damage at telomeres. We therefore examined ∼650,000,000 reads in 77 small RNA libraries from mouse and found 156, 56, 88, and 860 telomeric reads with 0, 1, 2, and 3 mismatches, respectively (Supplemental Tables S5, S6, S9; Pandey et al. 2013; Manakov et al. 2015; Watanabe et al. 2015; Wenda et al. 2017). Perfect telomeric siRNAs occurred at a frequency of ∼1 in 4,170,000 reads and were therefore almost as rare as C. elegans perfect telomeric siRNAs. G-rich perfect telomeric siRNA were also more common in mammalian telomeric siRNAs, the most abundant of which began with 5′ GUUAGG and were 19 and 23 nt in length (Supplemental Tables S7). The most common C-rich telomeric siRNA species from mammals began with 5′ UAACCC and was 24, 28, and 30 nt in length (Supplemental Table S8). Other permutations of 5′ ends were represented for both G- and C-rich perfect telomeric RNAs from mammals.We speculated that the striking difference in telomeric small RNA composition between C. elegans and its relatives could occur in response to a functional difference in telomere biology between these species. The C. elegans shelterin complex contains the OB-fold containing proteins POT-1 and POT-2, and mutation of either pot-1 or pot-2 results in successive telomere elongation over multiple generations (Raices et al. 2008; Shtessel et al. 2013). Variants in the pot-2 gene have also been found to have a prominent effect on telomere length in wild C. elegans strains, suggesting that this protein may affect fitness in the wild (Cook et al. 2016). A third C. elegans homolog of human Pot1, MRT-1, is required for telomerase activity (Meier et al. 2009). We performed tBLASTX searches for orthologues of the POT-2 protein in C. briggsae and C. remanei genomes, respectively. Surprisingly, we found that the reference genomes of C. briggsae and C. remanei do not contain pot-2 homologs. Instead, these species each possess two copies of MRT-1, which has an SNM1 interstrand crosslink repair nuclease fused to the POT-2 OB-fold protein (Fig. 4F; Meier et al. 2009). It is possible that the absence of pot-2 or presence of an extra mrt-1 paralog in C. briggsae and C. remanei could represent a fundamental difference in telomere biology that is related to the increased abundance of perfect telomeric small RNAs in these species.We examined telomeres of several wild isolates of C. briggsae and C. remanei and found that the AF16 strain that has high levels of perfect telomeric siRNAs had elongated telomeres, whereas the other strains examined had generally short telomeres, similar in size to those of C. elegans. However, several C. briggsae and C. remanei strains possessed weak telomeric DNA signals that failed to migrate into the gel and that ran at limit mobility, which are not observed in the genomic DNA from the Bristol N2 wild-type C. elegans strain, but commonly occur in genomic DNA from C. elegans strains of trt-1 mutants whose telomeres are maintained by the telomerase-independent telomere maintenance pathway ALT (Fig. 7; Cheng et al. 2012).
FIGURE 7.
Telomere length of C. elegans, C. briggsae, and C. remanei strains. (Left) Southern blot analysis of Hinf1-digested genomic DNA isolated from C. elegans, C. briggsae and C. remanei strains using a (TTAGGC)n probe. (Right) Southern blot analysis of two C. elegans trt-1 telomerase mutant strains whose telomeres are maintained via ALT. Positions of telomeric bands located in the well or at limit mobility are noted by arrows.
Telomere length of C. elegans, C. briggsae, and C. remanei strains. (Left) Southern blot analysis of Hinf1-digested genomic DNA isolated from C. elegans, C. briggsae and C. remanei strains using a (TTAGGC)n probe. (Right) Southern blot analysis of two C. elegans trt-1 telomerase mutant strains whose telomeres are maintained via ALT. Positions of telomeric bands located in the well or at limit mobility are noted by arrows.
DISCUSSION
Small RNAs composed of telomeric repeats have been shown to occur in several different organisms (Couvillion et al. 2009; Vrbsky et al. 2010; Rossiello et al. 2017). We interrogated a number of published C. elegans small RNA sequencing data sets and found rare examples of small RNAs mapping perfectly to telomeres as well as more abundant species mapping to telomeres with up to three mismatches. Telomeric small RNAs with an increasing number of mismatches correlated with an increasing enrichment for 22G small RNA species, which represent secondary “effector” siRNA molecules that are produced de novo by RNA-dependent RNA polymerases during both exogenous and endogenous RNA interference. All telomeric small RNAs mapping to telomeres with zero or one mismatches and the majority of species with two or three mismatches could be mapped to ITS sites using a method that randomly distributes nonunique reads, suggesting that ITS sites could be responsible for the formation of perfect telomeric small RNAs. However, mapping of siRNAs to borders of perfect telomere repeats throughout the genome suggested that while long stretches of perfect telomere repeats at either ITS tracts or at telomeres can give rise to siRNAs, telomeres could represent the more abundant source of perfect telomeric small RNAs. Perfect telomeric small RNAs that are created from ITS tracts might possess characteristics of small RNAs derived from ITS tracts, such as beginning with any 5′ nucleotide.Strikingly, perfect telomeric small RNAs were substantially more abundant in C. briggsae and C. remanei, whereas telomeric small RNAs with mismatches were generally of comparable abundance to C. elegans. However, perfect telomere repeat sequences within ITSs were depleted in C. briggsae and C. remanei relative to C. elegans (Table 1), suggesting that telomeres rather than perfect telomere sequence within ITS tracts are likely to be the source of the majority of perfect telomeric small RNAs. In parallel with their altered level of abundance, perfect telomeric small RNAs of C. briggsae and C. remanei displayed a substantial bias for being composed of the C-rich strand, which contrasts from the G-rich bias of small RNAs from C. elegans. The overrepresentation of C-rich telomeric small RNAs in C. briggsae and C. remanei is puzzling, possibly reflecting a change to the mechanism of telomeric small RNA biogenesis. In this regard, the C-rich telomere small RNA species would be complementary to the telomeric noncoding RNA TERRA, which is considered to be the most abundant telomeric RNA (Azzalin and Lingner 2015). We note that mammalian telomeric small RNAs were more frequently composed of the G-rich telomeric sequence and that their levels were very low, as observed for C. elegans. Together, our results argue that small RNA classes that are very rare may be functionally significant, consistent with a proposed role for telomeric small RNAs in responding to DNA damage at mammalian telomeres (Rossiello et al. 2017).Perfect telomeric G-rich RNAs from all three Caenorhabditis species commonly began with 5′ UAGGCU and 5′ UUAGGC. C-rich telomeric small RNAs of C. briggsae and C. remanei were enriched for 5′ ends that began with 5′CUAAGC (Fig. 6). Despite having prominent 5′ end sequences, both G- and C-rich telomeric RNAs displayed a wide range of lengths, suggesting that they may have a common mechanism of biogenesis that may be distinct from other known C. elegans pathways that often display pronounced enrichment for a discrete RNA length. Perhaps consistently, Tetrahymena possesses telomeric small RNAs with 5′ UGGGGU ends that also displayed a range of sizes (Couvillion et al. 2009). We did not observe consistent 5′ sequences for mammalian perfect telomeric small RNAs, but found that 3′ ends for major species of G-rich small RNAs for Caenorhabditis species and mammals often aligned perfectly with the 5′ end of the C-rich DNA base at the chromosome terminus (Fig. 8).
FIGURE 8.
Models of telomere RNAs and chromosome termini. Comparisons of prominent species of small RNAs (orange) and chromosome termini (lavender) for (A) Caenorhabditis (Raices et al. 2008), (B) Tetrahymena (Jacob et al. 2001, 2003), and (C) mammals (Sfeir et al. 2005). The Caenorhabditis terminal (TA) dinucleotide indicates uncertain nucleotides at the end of the 3′ overhang, whereas the Tetrahymena and mammalian 3′ nucleotides have been confirmed experimentally, the most frequent indicated based on highest vertical position.
Models of telomere RNAs and chromosome termini. Comparisons of prominent species of small RNAs (orange) and chromosome termini (lavender) for (A) Caenorhabditis (Raices et al. 2008), (B) Tetrahymena (Jacob et al. 2001, 2003), and (C) mammals (Sfeir et al. 2005). The Caenorhabditis terminal (TA) dinucleotide indicates uncertain nucleotides at the end of the 3′ overhang, whereas the Tetrahymena and mammalian 3′ nucleotides have been confirmed experimentally, the most frequent indicated based on highest vertical position.Given that C. briggsae and C. remanei are more closely related to each other than to C. elegans, we examined their genomes and found consistent loss of the POT-2 protein that represses telomerase activity and ALT in C. elegans. Some wild isolates of C. elegans have long telomeres (Raices et al. 2005), which can be due to loss-of-function pot-2 mutations (Cook et al. 2016). We examined telomeric DNA from several wild isolates of C. briggsae and C. remanei and found it to be distinct from Bristol N2 C. elegans DNA in that a minor fraction of their telomeres were long or unusually structured, similar to what is observed in telomerase-negative strains where ALT is active (Pickett and Reddel 2015). Thus, loss of pot-2 or high levels of telomeric small RNAs may create ALT-like telomeric DNA in C. briggsae and C. remanei. It is possible that C. elegans POT-2 promotes turnover of telomeric small RNAs or that the second MRT-1 homolog of C. briggsae and C. remanei promotes biogenesis of telomeric siRNAs. Both POT-2 and MRT-1 have OB2-folds that are predicted to interact with 3′ ends of single-stranded telomeric DNA (Lei et al. 2003), and Argonaute proteins possesses a PAZ domain that uses a variant OB fold to interact with the 3′ termini of single-stranded RNA (Song et al. 2003).ITS sites have been observed in number of different species but have not been comprehensively characterized within a fully assembled metazoan genome. We located over one thousand ITS sites in the C. elegans genome and found a conspicuous depletion of ITS sites at exons and UTRs and enrichment at introns. This localization pattern is not a general feature of hexanucleotide repeats, and is suggestive of specific formation mechanisms and/or selection pressures involved in ITS formation. These data suggest that ITS sites may be toxic to mRNAs, possibly because RNA composed of telomere repeats may form high order structures such as G-quadruplexes (Fay et al. 2017), and may also contribute to intron evolution. Although the origin of ITSs is uncertain, our data suggest that these genome sequences may function to promote gene silencing when in introns and to promote gene expression during development when in enhancers. Alternatively, it is possible that new ITS sites are created in regions of the genome that are silenced or heterochromatic, and that their association with silencing marks is a consequence of their mechanism of biogenesis. Given the abundance of telomeric siRNAs with mismatches that map to ITS tracts and the depletion of ITSs from 3′UTRs, we favor the possibility that ITS tracts may actively promote transcriptional and posttranscriptional gene silencing, a topic that could be addressed by nuclease-mediated genome engineering. Although telomerase generally functions at telomeres, there is evidence to suggest that its reverse transcriptase activity can result in the insertion of telomeric sequence at double-strand breaks, although this generally results in the creation of a new telomere (Churikov and Geli 2017). Therefore, ITSs may represent a contribution to intron and enhancer evolution by the telomerase enzyme.We found that ITS sites are a shared feature among close relatives of C. elegans, although their compositions show some striking differences. ITSs in both C. briggsae and C. remanei are rarer and generally more degenerate than in C. elegans, although both species contain examples of ITS sites with perfect telomere homology. It is intriguing that there should be such a discrepancy in ITS divergence between these species, and this observation suggests differing selective pressures. Although it may be a coincidence that perfect telomeric small RNAs are far more abundant in C. briggsae and C. remanei, it is also possible that these RNAs are functionally linked to ITS formation and/or evolution. Telomeric small RNAs may actively inhibit formation or expansion of ITS sites, meaning that their very low levels in C. elegans could have led to an increasing number of ITSs in the C. elegans genome, many of which have been formed relatively recently.
MATERIALS AND METHODS
Analysis of high-throughput sequencing data
Publicly available RNA-seq data sets were downloaded from the Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/) (see Supplemental Table S9 for library information). Adapter trimming was performed as required using the bbduk.sh script from the bbmap suite3 and custom scripts. Reads were then converted to fasta format and mapped to the C. elegans genome (WS251) using Butter (https://www.biorxiv.org/content/10.1101/007427v1). To detect telomere-mapping sRNAs, reads were mapped to seven repeats of perfect telomere sequence (TTAGGC) with Bowtie (Langmead et al. 2009), allowing for up to three mismatches. Sequence pipelines were built using the Snakemake framework (Koster and Rahmann 2012). As C. elegans small RNA extraction protocols feature a size selection step that generally excludes species outside the size range of around 18–30 bases, we only considered this size range during our analysis. Note, however, that perfect telomeric RNAs >30 nt have been previously detected in Tetrahymena (Couvillion et al. 2009).
Genome-wide mapping of ITS sites
Bowtie was used to find all instances of the TTAGGC hexamer sequence in the genome. Hexamers located within 100 nt of each other and situated on the same strand were merged into clusters using bedtools (Quinlan and Hall 2010) and clusters containing four TTAGGC hexamers were retained. Telomeres or potential telomeres, defined as any ITS that begins or ends within 50 nt of the end of a contig, were removed. In the cases of clashes in which ITSs on opposite strands overlap with one another, the ITS site with the most repeats was retained. Homology scores were assigned to the final set of ITS sites by performing local sequence alignment between each ITS and a stretch of telomere sequence of the same length using the EMBOSS water algorithm http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/water.html.The method was performed on the C. elegans (WS251), C. briggsae (WS263), and C. remanei (WS263) genomes.
Analysis of genomic features at ITS sites
Genomic features such as introns, exons, etc. were assigned to ITS sites using the GenomicRanges R package [23950696]. Assignment of a particular genomic feature required that at least 50% of the length of the ITS site overlapped with this feature.Chromatin state data was obtained from http://compbio.med.harvard.edu/modencode/webpage/hihmm/. Chromatin states were assigned to ITS sites using bedtools, requiring that at least 50% of the length of the ITS overlap with the particular chromatin state. To determine enrichment of particular chromatin states in the ITS set, 10,000 simulations were performed in which the ITS set was permuted using bedtools shuffle, resulting in a set of random genomic regions with the same lengths as the real ITS set. The regions were divided into intergenic and intronic regions and were assigned chromatin states. Enrichment for a particular chromatin state was calculated by dividing the number of ITS sites with that chromatin state by the average number of sites with the same state in the simulated data. P-values were obtained by determining the percentage of simulations in which the number of sites that were assigned a particular chromatin state was equal to or more (for enrichment) or equal to or fewer (for depletion) than the real ITS set. P-values were adjusted using Benjamini–Hochberg procedure (FDR). Given that 10,000 simulations were performed, the rarest possible event that can be detected is one that occurs one in 10,000 simulations, corresponding to a rate of 0.0001. After Benjamini–Hochberg correction, this value becomes 0.00048. Therefore, “P < 0.00048” was assigned in the cases in which the number of ITS sites with a particular chromatin state was more extreme than any of the 10,000 simulations.Lists of orthologous genes between C. elegans, C. briggsae, and C. remanei were obtained from Parasite.
Statistics and data visualization
Analysis of sequencing data was performed using the R statistical computing environment. Genome regions were visualized using the Gviz R package (Hahne and Ivanek 2016). GO term enrichment analysis was performed using the RDAVIDWebService R package (Fresno and Fernandez 2013).
Southern blotting
Genomic DNA was prepared from C. elegans strain Bristol N2, C. remanei strains PB4641, EM464, and SB146, and C. briggsae strains MK104 and AF16, which were grown at 20°C on NGM plates. Southern blotting was performed using genomic DNA prepared from a Gentra kit and a digoxygenin-labeled TTAGGC probe, as previously described (Meier et al. 2009).
SUPPLEMENTAL MATERIAL
Supplemental material is available for this article.
Authors: Jessica J Vasale; Weifeng Gu; Caroline Thivierge; Pedro J Batista; Julie M Claycomb; Elaine M Youngman; Thomas F Duchaine; Craig C Mello; Darryl Conte Journal: Proc Natl Acad Sci U S A Date: 2010-02-02 Impact factor: 11.205
Authors: Alyson Ashe; Alexandra Sapetschnig; Eva-Maria Weick; Jacinth Mitchell; Marloes P Bagijn; Amy C Cording; Anna-Lisa Doebley; Leonard D Goldstein; Nicolas J Lehrbach; Jérémie Le Pen; Greta Pintacuda; Aisa Sakaguchi; Peter Sarkies; Shawn Ahmed; Eric A Miska Journal: Cell Date: 2012-06-25 Impact factor: 41.582
Authors: Joanna M Wenda; David Homolka; Zhaolin Yang; Pietro Spinelli; Ravi Sachidanandam; Radha Raman Pandey; Ramesh S Pillai Journal: Dev Cell Date: 2017-06-19 Impact factor: 12.270
Authors: Julie M Claycomb; Pedro J Batista; Ka Ming Pang; Weifeng Gu; Jessica J Vasale; Josien C van Wolfswinkel; Daniel A Chaves; Masaki Shirayama; Shohei Mitani; René F Ketting; Darryl Conte; Craig C Mello Journal: Cell Date: 2009-10-02 Impact factor: 41.582
Authors: Maxim V Zagoskin; Jianbin Wang; Ashley T Neff; Giovana M B Veronezi; Richard E Davis Journal: Nat Commun Date: 2022-02-11 Impact factor: 14.919