Anne Nitsche1, Dominic Rose2, Mario Fasold3, Kristin Reiche4, Peter F Stadler5. 1. Bioinformatics Group, Department of Computer Science, University of Leipzig, D-04107 Leipzig, Germany Interdisciplinary Center for Bioinformatics, University of Leipzig, D-04107 Leipzig, Germany. 2. Bioinformatics Group, Department of Computer Science, University of Freiburg, D-79110 Freiburg, Germany MML, Munich Leukemia Laboratory GmbH, D-81377 München, Germany. 3. Interdisciplinary Center for Bioinformatics, University of Leipzig, D-04107 Leipzig, Germany ecSeq Bioinformatics, D-04275 Leipzig, Germany. 4. Young Investigators Group Bioinformatics and Transcriptomics, Department of Proteomics, Helmholtz Centre for Environmental Research-UFZ, D-04318 Leipzig, Germany Department of Diagnostics, Fraunhofer Institute for Cell Therapy and Immunology-IZI, D-04103 Leipzig, Germany. 5. Bioinformatics Group, Department of Computer Science, University of Leipzig, D-04107 Leipzig, Germany Interdisciplinary Center for Bioinformatics, University of Leipzig, D-04107 Leipzig, Germany Department of Diagnostics, Fraunhofer Institute for Cell Therapy and Immunology-IZI, D-04103 Leipzig, Germany Max Planck Institute for Mathematics in the Sciences, D-04103 Leipzig, Germany Department of Theoretical Chemistry, University of Vienna, A-1090 Wien, Austria Center for non-coding RNA in Technology and Health, University of Copenhagen, DK-1870 Frederiksberg C, Denmark Santa Fe Institute, Santa Fe, New Mexico 87501, USA.
Abstract
Large-scale RNA sequencing has revealed a large number of long mRNA-like transcripts (lncRNAs) that do not code for proteins. The evolutionary history of these lncRNAs has been notoriously hard to study systematically due to their low level of sequence conservation that precludes comprehensive homology-based surveys and makes them nearly impossible to align. An increasing number of special cases, however, has been shown to be at least as old as the vertebrate lineage. Here we use the conservation of splice sites to trace the evolution of lncRNAs. We show that >85% of the human GENCODE lncRNAs were already present at the divergence of placental mammals and many hundreds of these RNAs date back even further. Nevertheless, we observe a fast turnover of intron/exon structures. We conclude that lncRNA genes are evolutionary ancient components of vertebrate genomes that show an unexpected and unprecedented evolutionary plasticity. We offer a public web service (http://splicemap.bioinf.uni-leipzig.de) that allows to retrieve sets of orthologous splice sites and to produce overview maps of evolutionarily conserved splice sites for visualization and further analysis. An electronic supplement containing the ncRNA data sets used in this study is available at http://www.bioinf.uni-leipzig.de/publications/supplements/12-001.
Large-scale RNA sequencing has revealed a large number of long mRNA-like transcripts (lncRNAs) that do not code for proteins. The evolutionary history of these lncRNAs has been notoriously hard to study systematically due to their low level of sequence conservation that precludes comprehensive homology-based surveys and makes them nearly impossible to align. An increasing number of special cases, however, has been shown to be at least as old as the vertebrate lineage. Here we use the conservation of splice sites to trace the evolution of lncRNAs. We show that >85% of the human GENCODE lncRNAs were already present at the divergence of placental mammals and many hundreds of these RNAs date back even further. Nevertheless, we observe a fast turnover of intron/exon structures. We conclude that lncRNA genes are evolutionary ancient components of vertebrate genomes that show an unexpected and unprecedented evolutionary plasticity. We offer a public web service (http://splicemap.bioinf.uni-leipzig.de) that allows to retrieve sets of orthologous splice sites and to produce overview maps of evolutionarily conserved splice sites for visualization and further analysis. An electronic supplement containing the ncRNA data sets used in this study is available at http://www.bioinf.uni-leipzig.de/publications/supplements/12-001.
The large genomes of higher eukaryotes are pervasively transcribed, although protein-coding sequence forms only a tiny fraction of the genome (Kapranov et al. 2007). A substantial portion of the transcriptome appears as mRNA-like nonprotein-coding transcripts (Maeda et al. 2006; The ENCODE Project Consortium 2007), although there is ample evidence for the existence of many other classes of transcripts, ranging from small structured ncRNAs (Washietl et al. 2005) to intronic transcripts (Louro et al. 2009), independently transcribed UTRs (Mercer et al. 2011), and giant “macroRNAs” (Kapranov et al. 2010; Hackermüller et al. 2014). Despite their abundance, the evolutionary history of these transcripts is still poorly understood. Apart from a few detailed case studies, global statistical analyses showed that, as a group, the mRNA-like ncRNAs are under stabilizing selection (Ponjavic et al. 2007; Guttman et al. 2009; Marques and Ponting 2009). The level of sequence conservation, however, is very low compared with other functional transcripts (Pang et al. 2006; Marques and Ponting 2009). As a consequence it is hard to identify homologs in genome-wide searches based on sequence similarity. The low levels of sequence conservation provide only very limited contrast between intronic and exonic parts, so that it is difficult at best to infer complete gene structures for orthologs. More importantly, stabilizing selection maintaining the small, well-conserved sequence elements often located within the gene boundaries of long noncoding RNAs (lncRNAs) cannot be unambiguously attributed to the RNA product. Instead, such “phylogenetic footprints” may just as well be functional at the DNA level, e.g., as enhancer. The HOX clusters may serve as a particularly impressive example. On the one hand, many well-characterized mRNA-like lncRNAs (Mainguy et al. 2007; Rinn et al. 2007), including HOTAIR (Rinn et al. 2007; Tsai et al. 2010), HOTTIP (Wang et al. 2011), and several microRNA precursors (Tanzer et al. 2005), are transcribed from the intergenic regions, on the other hand, the region is packed with conserved functional DNA elements (Lee et al. 2006; Punnamoottil et al. 2010; Natale et al. 2011). The observable conservation of genomic sequence thus does not in itself provide sufficient information to disentangle the evolutionary history of lncRNAs.Beyond global sequence conservation, however, we can also utilize the conservation of gene structure, in particular the conservation of splice sites, to establish homology. Indeed, novel transcripts can be predicted successfully from multiple genome alignments based exclusively on conserved splice-site patterns (Hiller et al. 2009; Rose et al. 2011). A considerable fraction of the transcripts detected in this manner shows very little sequence conservation and resembles lncRNAs. Probably they would not have been detected based on sequence homology alone.The rapid development of sequencing technology has made it feasible to obtain high coverage transcriptome data sets for a wide variety of cell and tissue types. In addition to the systematic efforts to exhaustively catalog the human transcriptome in the ENCODE project and large cDNA resource amassed by the FANTOM project (Suzuki and Hayashizaki 2004), rapidly growing resources are also becoming available for a diversity of model organisms. As a consequence, comparative transcriptomics approaches become feasible (see, e.g., Baldo et al. [2011] and Bräutigam et al. [2011] and the review Hashimshony and Yanai [2010]). A recent study demonstrated that 30%–40% of nearly 2000 human lncRNAs show conserved expression in rodents or ungulates (Washietl et al. 2014), based on direct comparison of transcriptome sequencing data for six mammalian species. A similar approach investigating 11 tetrapod species reported 11,000 primate-specific lncRNAs contrasted by 2500 highly conserved ones (Necsulea et al. 2014). These numbers are somewhat lower (19% of lncRNAs are older than primates), presumably because only one nonprimate mammal was included and direct, BLAST-based homology search was used in this study. A maximum likelihood approach to estimate the number of lncRNAs from publicly available data resulted in an estimate of 40,000–50,000 lncRNAs of which ∼60%–70% are conserved between man and mouse (Managadze et al. 2013).In this contribution we make use of genome-wide multiple sequence alignments together with transcriptomics data to construct a comparative map of splice sites. We then use this resource to systematically study the conservation patterns of lncRNAs and their gene structures.
MATERIALS AND METHODS
Genome and transcriptome data
We use two reference alignments: (1) the MULTIZ-based alignment (Blanchette et al. 2004) of 46 vertebrate genomes provided through the UCSC genome browser and (2) the EPO (Paten et al. 2008) multiple alignment of 12 eutherian mammals downloaded from ENSEMBL (Release 63). We reduced the latter alignment to those eight species for which ENSEMBL and UCSC utilize the same genome versions: Homo sapiens, Pan troglodytes, Pongo abelii, Mus musculus, Rattus norvegicus, Bos taurus, Equus caballus, and Canis familiaris. In the following we will refer to these two multiple sequence alignments as the UCSC and the ENSEMBL alignment, respectively. (3) In order to investigate the conservation of zebrafish lncRNAs we use the eight-way zebrafish MULTIZ alignment (containing five teleosts, frog, mouse, and human) since the 46-way vertebrate alignment contains only sequences that are alignable to the human genome.As basis set for transcripts we use a recent RefSeq track (10/2012, 40,373 transcripts) obtained from UCSC as well as the GENCODE v.14 collection of transcripts (Harrow et al. 2006). In addition we extracted all splice sites supported by at least one expressed sequence tag (EST) in the data collection of the UCSC genome browser (downloaded 08/2012). MicroRNA and snoRNA annotations were taken from ENSEMBL.
Comparative map of splice sites
The exon annotations from RefSeq and the EST collection define the coordinates of validated splice sites. Alignment blocks with fewer than 20 nt on intronic side of the splice site are omitted. This excludes too short introns, which are likely artifacts (Hong et al. 2006), and allows us to score splice-site quality.For each validated splice site, we use a multiple sequence alignment to determine the corresponding (homologous) position in all other genomic sequences. This results in a collection of genomic positions that are known to be a functional splice site in at least one of the aligned species. For each splice-site position, we store for each of the aligned sequences, whether it is a donor or an acceptor, its MaxEntScan splice-site score (Yeo and Burge 2004) and information whether the potential splice site has been experimentally validated in a taxon that is included in the sequence alignment. A graphical representation of all splice sites in an interval of the genomic sequence of any of the included taxa can be generated using our web server (http://splicemap.bioinf.uni-leipzig.de); see Figure 1 for an example. We computed and evaluated splice-site maps separately from the UCSC and the ENSEMBL alignments.
FIGURE 1.
Splice-site map of the GAS5 locus. Each line represents a splice site, each column a vertebrate genome arranged in increasing phylogenetic distance from human; MaxEntScan scores for splice-site quality are shown as grayscale; missing data are indicated as gray background.
Splice-site map of the GAS5 locus. Each line represents a splice site, each column a vertebrate genome arranged in increasing phylogenetic distance from human; MaxEntScan scores for splice-site quality are shown as grayscale; missing data are indicated as gray background.
Assessment of conservation rate via MaxEntScan scoring
The evolution of splice sites cannot be studied meaningfully based only on the annotated splice sites because the transcriptomes of many species are poorly covered in current databases, in particular in their noncoding regions. Therefore we use MaxEntScan scoring by Yeo and Burge (2004) to draw conclusions on the conservation rates, in the following way. In brief, MaxEntScan models short sequence motifs—here the donor and acceptor motifs of splice junctions—using a probabilistic model based on the “Maximum Entropy Principle.” In contrast to position weight matrices or (inhomogeneous) Markov models, this makes it possible to account for both adjacent and non-adjacent dependencies between positions. The resulting gain in accuracy has been shown reliably predict missplicing mutations (Eng et al. 2004).A splice site is “predicted” to have a functional ortholog, if there is an orthologous site in the relevant genome with a MaxEntScan score >3.0. This value is estimated from score distributions in Figure 2. It shows the distribution of donor and acceptor scores of all splice sites in the human lncRNA set as well as the scores of all aligned and all validated orthologous splice-site candidates in the UCSC human–mouse alignment. While the majority of known splice sites has a score >3, we observe a clearly bimodal distribution composed of a large peak conforming to functional splice sites and a second broad distribution of scores ≤3 belonging to positions that most likely have lost their capability of acting as splice donors or splice acceptors. We emphasize that the score cutoff >3 is restrictive and will tend to underestimate the number of conserved splice sites since the MaxEntScan scores are gauged so that sites with positive scores are more likely to be functional than not (Yeo and Burge 2004). This is also consistent with the comparison of predicted and validated splice sites of human and mouse coding regions in Table 1 below. A splice site counts as “validated” if it is confirmed by RefSeq or EST annotation. It is considered as “conserved” if it has a predicted and/or validated functional ortholog in the concerned genome. We refer to the supplement for more details on this method.
FIGURE 2.
Conservation of splice sites of human lncRNAs in the mouse. Filled curves designate the distributions of MaxEntScan scores for human splice sites (dark gray) and orthologous positions that are known to be splice sites in mouse (light gray). The score distribution of all orthologous positions in mouse (gray) is a superposition of conserved functional splice sites and positions that have been destroyed by substitutions. The cut-off value of 3.0 is indicated by a thick light gray line.
TABLE 1.
Conservation of splice sites between human and mouse
Conservation of splice sites of human lncRNAs in the mouse. Filled curves designate the distributions of MaxEntScan scores for human splice sites (dark gray) and orthologous positions that are known to be splice sites in mouse (light gray). The score distribution of all orthologous positions in mouse (gray) is a superposition of conserved functional splice sites and positions that have been destroyed by substitutions. The cut-off value of 3.0 is indicated by a thick light gray line.Conservation of splice sites between human and mouseConservation rates on the transcript level are derived from its splice sites. A transcript is considered conserved if at least one splice site of the human transcript corresponds to a predicted or validated splice site.
lncRNA transcripts
Since many RefSeq noncoding transcripts are associated with coding loci, we focus our analysis on a restrictively filtered subset of the GENCODE data to ensure conservative estimates of lncRNA conservation. In order to prepare a high-quality set of human lncRNAs we started from the 21,271 well-characterized “Gencode v14 lncRNA” transcripts and applied a series of filtering steps.We discarded transcripts that overlapped within annotated protein-coding sequences or pseudogenes in sense or antisense direction annotated by at least one of GENCODE, ENSEMBL, UCSC, or RefSeq. For GENCODE, we could rely on the annotation with biotype classification for transcripts and genes. In the case of ENSEMBL, RefSeq, and UCSC we used the annotation of coding exons. Since some of the transcripts overlapping in sense-direction might just be noncoding isoforms of protein-coding transcripts, we opted to remove them. We also excluded transcripts located in antisense direction of these coding sequences since conservation of the coding sequence also constrains the sequence of the opposing transcripts, even though they are annotated as noncoding. We used RNAcode (Washietl et al. 2011), a tool that efficiently detects conserved open reading frames in multiple sequence alignments, and TBLASTN (Altschul et al. 1990) to remove transcripts with putative coding regions. We only kept those transcripts that did not contain exons overlapping with significant RNAcode hits (P-value <0.05) or, if an exon could not be scored by RNAcode due to low sequence conservation, TBLASTN hits (E-value <0.05). We also removed all unspliced entries. At this stage we retained 5703 transcripts. The last filtering step included the application of PhyloCSF (Lin et al. 2011). All remaining transcripts with a PhyloCSF score >100 and a possible ORF of length ≥30 were sorted out. These cutoffs were chosen accordingly to Cabili et al. (2011). This affected another 290 transcripts. Our final data set thus contains 5413 transcripts with 17,163 splice sites.We note that this lncRNA data set exhibits substantial overlap with the integrative compilation of 14,274 spliced human noncoding transcripts from different sources covering 24 tissues and cell types by Cabili et al. (2011). Three thousand one hundred forty-five of them are identically (99% reciprocal strand-specific overlap) represented in our set; the agreement increases to 3924 loci when a sequence overlap of at least 70% is required. We will refer to this collection of lncRNAs as the Cabili data set.An important subclass of spliced lncRNAs with a well-understood function are host genes of microRNAs and snoRNAs. We thus identified lncRNAs that overlapped known microRNAs and snoRNAs as annotated by ENSEMBL. This resulted in 128 transcripts hosting microRNAs (containing 602 unique splice sites) and 73 transcripts hosting snoRNAs (335 unique splice sites). Interestingly, snoRNA host genes and, to a lesser extent also microRNA host genes, on average have more introns than other lncRNAs (3.7 versus 2.9 versus 2.0 introns/transcript in all lncRNAs).A set of mouse lncRNAs involved in the circuitry controlling pluripotency and differentiation is described in Guttman et al. (2011). It consists of 2076 spliced transcripts with 6975 splice sites, of which 77% are also validated by EST or RefSeq data.A conservative set of 1133 lncRNAs expressed in zebrafish embryos was recently reported in Pauli et al. (2012). A second, smaller set of 691 zebrafish lincRNAs is expressed in brain development (Ulitsky et al. 2011); only 449 of them are spliced. Since the overlap of the two sets is small, we consider their union consisting of 1508 spliced transcripts with 5415 splice sites.
RESULTS
Predicted conservation of protein-coding splice sites shows specificity of the method
Table 1 and Figure 3 summarize the conservation of splice sites between human and mouse, calculated in the described way. Similar data are observed for other mammalian species (see Supplemental Table S4). The overwhelming majority of splice sites in the RefSeq data set delimits coding exons. More than 95% of these are alignable, and nearly 92% have experimentally validated orthologous splice sites in the mouse. A comparison of conserved splice sites that are experimentally confirmed with computationally predicted ones shows that both sets almost perfectly coincide; in fact, the predicted set is even slightly smaller than the validated one, emphasizing that the score cutoff of 3.0 is highly specific.
FIGURE 3.
Conservation of splice sites between human and mouse in different contexts. In nonwhite the fraction of all alignable splice sites is shown. Dark gray scales display the estimated conservation rate. Consequently, the fraction of alignable but likely nonconserved splice sites is shown in light gray. In protein-coding RNAs 95% of the splice sites are at least alignable to mouse, and of those almost all are conserved, while in lncRNAs the rate of alignable sites drops to ∼40%. The fraction of validated splice sites among the predicted ones turned from nearly 98% to only 13%, indicating that there are a lot of unannotated splice sites.
Conservation of splice sites between human and mouse in different contexts. In nonwhite the fraction of all alignable splice sites is shown. Dark gray scales display the estimated conservation rate. Consequently, the fraction of alignable but likely nonconserved splice sites is shown in light gray. In protein-coding RNAs 95% of the splice sites are at least alignable to mouse, and of those almost all are conserved, while in lncRNAs the rate of alignable sites drops to ∼40%. The fraction of validated splice sites among the predicted ones turned from nearly 98% to only 13%, indicating that there are a lot of unannotated splice sites.Only a small fraction of the RefSeq splice sites falls into UTRs, with >14-fold difference between 5′ and 3′ UTRs. Only about three quarters of these regions are aligned between human and mouse in the UCSC alignments. Still, most of the predicted splice sites are backed up by experimental data. The strong depletion of introns in the 3′ UTRs has been described previously and can be explained as a consequence of nonsense-mediated decay (NMD) or a larger tolerance for intron retention (see, e.g., Scofield et al. [2007]).
Conservation of splice sites provides lower bounds on the number of conserved lncRNAs
Only a moderate fraction of ∼3% of the splice sites of human lncRNAs are orthologous to known splice sites of annotated transcripts in other nonprimate Eutheria. This estimate is consistent with the observation that ∼12% of the lincRNAs compiled in Cabili et al. (2011) are syntenically paired with a corresponding transcript in another mammalian species as detectable by TransMap (Zhu et al. 2007). Furthermore noncoding transcripts are typically expressed at lower levels than their coding counterparts and are often restricted to specific cell-lines or tissues (The ENCODE Project Consortium 2007).Clearly, the poor sequence conservation of the lncRNAs (Marques and Ponting 2009) limits the number of human splice sites for which sequences from other eutherian families can be aligned. As a consequence, we can only determine a lower bound on the numbers of evolutionarily conserved splice sites in lncRNAs. The estimates therefore are limited by alignment coverage and quality. We refer to the Supplemental Material for a more detailed comparison of UCSC and ENSEMBL alignment.The small fraction of conserved lncRNAs, however, is mainly the result of the incompleteness of the transcript catalogs in nonhuman species. We therefore use the conservation of splice sites as measured by MaxEntScan scores to obtain more accurate estimates. As detailed in the Materials and Methods, a cutoff of 3.0 for the MaxEntScan scores is sufficiently specific that we already tend to underestimate the number of conserved splice sites.lncRNAs with many introns, such as GAS5 in Figure 6 below, tend to enrich poorly conserved splice sites with only marginal support by low MaxEntScan scores. At least some of these are probably mapping artifacts that artefactually reduce the estimates of splice-site conservation. Since we consider an lncRNA as conserved if at least one splice site of the human transcript corresponds to a predicted or experimentally known splice site, the high-scoring splice sites are sufficient to establish the ancient origin of lncRNAs. The biases introduced by spurious and low-scoring splice sites in the GENCODE data thus have little impact on the results at transcript level. Furthermore, we observe no strong dependence of splice-site conservation on the number of exons, although the average splice-site score slightly increases with the number of exons; see Supplemental Table S5. The Cabili data set (Cabili et al. 2011) yields very similar results as the filtered GENCODE data; see Supplemental Figure S6.
FIGURE 6.
Conserved splice sites of the GAS5 lncRNA. The GAS5 snoRNA host gene is among the most highly conserved lncRNAs. Its homologs are easily identifiable via the well-conserved snoRNAs (circles) located within its introns. Members of the SNORD80/Z15 family are shown in light gray. Black boxes indicate the major exons supported by RefSeq and/or EST data. Thin gray lines indicate splice sites that can be traced manually in at least one of the genome-wide alignments available in the UCSC browser. Note that only a subset of these is represented in any individual alignment (cf. Fig. 1). The transcript structure as well as its snoRNA payload has changed also by means of duplications and deletions.
The nearly constant conservation rate of ∼30% suggests that there is a population of highly conserved splice sites in ancient lncRNAs. On the other hand, it also indicates that sequence conservation in the remaining ∼70% of these highly conserved loci is unrelated to splicing and may not be conserved because of a function at the transcript level.
More than half of the GENCODE lncRNAs are conserved across the Eutheria
Table 2 summarizes our data for several mammalian species for which larger sets of transcriptome data are available.
TABLE 2.
Conservation of GENCODE lncRNAs in the UCSC alignment
Conservation of GENCODE lncRNAs in the UCSC alignmentThese data indicate that >38% (6541/17,163) of the individual splice sites and 71% (3862/5413) of the transcripts are conserved across the major eutherian families. When we include 15 available nonprimate vertebrate genomes, this number increases further to 4511 transcripts (83%) and 53% of the splice sites. This reveals the massive gap to an estimation of only 3% (506/17,163) conservation of splice sites and 9% (462/5413) of transcripts, where only orthologs in annotated transcripts are considered as conserved.Most recently, a subset of 1898 GENCODE lncRNAs expressed in a certain collection of human tissues was investigated for conserved expression in five other mammalian species (chimp, rhesus, cow, mouse, and rat) (Washietl et al. 2014). Expression from orthologous loci was observed for 35% (rat) to 80% (chimp) of the human transcripts. In these RNAs, conservation of between 20% and 60% of the observed human splice junctions were directly confirmed as conserved by dedicated transcriptome sequencing data. This is in good agreement with the estimated conservation of mouse splice sites in Table 1. Our numbers, furthermore, are in agreement with the estimate that 60%–70% of the intergenic lncRNAs are conserved between human and mouse (Managadze et al. 2013). This estimate is based on the comparison of lncRNA expression from syntenically conserved loci, without regard to gene structure. Thus we do expect our estimate to appreciably more conservative.A surprisingly large number of lncRNAs can be traced even further: 784 transcripts (14.5%) are conserved in at least one of the two marsupials (opossum, wallaby) and 446 can be found in the platypus genome.
Nearly 80% of the human lncRNAs may be older than the primates
At least a crude upper bound on the conservation of lncRNAs can be estimated by discarding all missing data and considering only the conservation of splice sites in those sequences that are present in the multiple sequence alignments. As expected, the estimates for individual species in Supplemental Table S3 are substantially larger than the conservative estimates of Table 2, which interprets all missing data as nonconservation (for GENCODE transcripts conserved in mouse, 50.7% compared with 35.3%). Surprisingly, the discrepancy, however, is rather small for the number of transcripts that are conserved in at least one of the four species: 79.6% versus 71.3% (see Fig. 4).
FIGURE 4.
Conservation of lncRNAs across 46 vertebrates. Indicated in light gray is the fraction of aligned splice sites, in dark gray the fraction of splice sites that are validated and/or predicted to be a functional splice site in the regarding species. In black the upper bounds on the fraction of conserved splice sites are shown. The numbers are estimated from the fraction of conserved splice sites within aligned sequence blocks only. A shows the conservation rate of 17,163 single splice sites, while B illustrates the conservation on the level of transcripts for 5413 lncRNAs.
Conservation of lncRNAs across 46 vertebrates. Indicated in light gray is the fraction of aligned splice sites, in dark gray the fraction of splice sites that are validated and/or predicted to be a functional splice site in the regarding species. In black the upper bounds on the fraction of conserved splice sites are shown. The numbers are estimated from the fraction of conserved splice sites within aligned sequence blocks only. A shows the conservation rate of 17,163 single splice sites, while B illustrates the conservation on the level of transcripts for 5413 lncRNAs.
Most human lncRNAs are either primate-specific or they date back to the origin of the Eutheria
Figure 5 summarizes the gains and losses of human GENCODE lncRNAs. The primate subtree is left unresolved in this analysis because the evolutionary distances within this clade are too small to distinguish splice sites under stabilizing selection from fortuitous conservation due to insufficient divergence time. Only 6.3% (343/5413) of the transcripts are primate specific, while >54% (2905/5413) arose with the Eutheria and another 21% (1114/5413) can be traced back to the origins of the Theria.
FIGURE 5.
Gains and losses of human GENCODE lncRNAs across the vertebrates. Events are inferred by the parsimony criterion: A gene is deemed lost along the edge leading to a maximal subtree for which it is not observed at any leaf; a gain event is placed on the edge leading to the last common ancestor of all observed occurrences. The vertebrate phylogeny is the phyloFit tree provided by the UCSC browser. The primate subtree is omitted.
Gains and losses of human GENCODE lncRNAs across the vertebrates. Events are inferred by the parsimony criterion: A gene is deemed lost along the edge leading to a maximal subtree for which it is not observed at any leaf; a gain event is placed on the edge leading to the last common ancestor of all observed occurrences. The vertebrate phylogeny is the phyloFit tree provided by the UCSC browser. The primate subtree is omitted.
Lineage-specific losses of lncRNAs are common
In contrast to the 71% of transcripts that are conserved between human and at least one of the other four eutherian species (see Table 2), there are few transcripts that are ubiquitously present. Rose et al. (2011) recently introduced a method to detect novel evolutionarily conserved splice sites and provided a collection of predicted splice sites that are well-conserved across the Eutheria. 2061 GENCODE lncRNAs have at least one splice site that is contained in this set of predictions. This fits well with only 814 transcripts that are conserved between human and “all” four eutherian species listed in Table 2. This suggests that lineage-specific losses are frequent.Indeed, we miss 12.2% (660/5413) of the ancestral lncRNAs in mouse and >19% (1047/5413) in armadillo. These numbers have to be taken with caution, however. Our conservative cutoffs tend to over-emphasize losses and misplace origination events toward the tips of the tree.
Gene structures of conserved lncRNAs evolve rapidly
Conserved lncRNAs exhibit a rapid evolution of their gene structure. To estimate the turnover of individual splice sites in conserved transcripts we consider the 814 transcripts present in human, mouse, rat, cow, and dog. The human transcripts comprise 3080 splice sites. Of these, 87% were ancestrally present. Most of the novel splice sites were gained throughout primate evolution. Complementarily, a comparable number of donors and acceptors have been lost in Glires (Supplemental Fig. S7). In some examples the changes of transcript structure are quite dramatic. In the ANRIL isoforms, entire groups of exons are primate specific, while only a few splice sites, mostly located at the 5′ and the 3′ ends are at least as old as the Eutheria (see Fig. 7A below). The visibly higher conservation until marmoset, is consistent with the finding that ANRIL is first fully developed in simians, after it went through a two-stage evolution (He et al. 2013). Another famous example is HOTAIR, where the 5′-most exons appear to be lacking in the mouse (Schorderet and Duboule 2011).
FIGURE 7.
Variation of splice-site conservation. The patterns of splice-site conservation vary substantially between different lncRNAs, even when their evolutionary age is comparable. The main panel refers to the UCSC 46-way alignment. In the case of ANRIL, only a few splice sites are conserved outside the primates. Although the mouse ortholog shares at least some functions with human ANRIL (Pasmant et al. 2010), there are only four shared conserved splice sites. HOTTIP, with few exons that are partially conserved, is also a rather typical chromatin-related lincRNA. In contrast, the overwhelming majority of splice sites is conserved in Rmst. MEG3 shows an intermediate pattern, with more lineage-specific losses. The snoRNA host gene SNHG1 contains several splice sites that are deeply conserved among vertebrates. Some are even found in teleosts. Experimentally known splice sites from zebrafish SNHG1 were searched also in the six-way zebrafish MULTIZ alignment (inset). Additional homologous splice sites in two teleosts demonstrate once more the limitations arising from alignment quality. The grayscales are explained in Figure 1. Thick vertical bars on the right mark splice sites that belong to a specific transcript (black: plus strand, gray: minus strand). Thin lines between these bars indicate conserved splice sites, which are not part of the annotated transcripts.
Alternative data sets lead to consistent results
Host genes of microRNAs and snoRNAs form subgroups with well-defined functions. Both groups of small structured RNAs are typically rather well conserved at least across the Eutheria. This is also true for their host genes, Table 3. There is little difference in the conservation of snoRNA and microRNA host genes, even though microRNAs can be processed from both exonic and intronic parts of a primary transcript (Kim 2005), while snoRNAs are obligatorily intronic at least in mammals (Maxwell and Fournier 1995). Interestingly, a much larger fraction of snoRNA host genes has experimentally validated conserved splice sites compared with microRNAs. This is probably due to their different expression patterns: MicroRNAs are often tissue or cell-type specific, while the snoRNAs are required ubiquitously.
TABLE 3.
Conservation of special subsets
Conservation of special subsetsThe fractions of alignable positions and predicted splice sites among the mouse pluripotency lncRNAs (Guttman et al. 2011) is comparable with the GENCODE data. At the level of transcripts we again find substantial conservation across the Eutheria: More than half of the transcripts are predicted to be conserved in human, and for 40% of these experimental evidence is available.For the zebrafish lncRNAs, a much lower conservation level of only 34% is observed among the other teleosts. The divergence of zebrafish and the Euteleostei is much older than the divergence of major eutherian groups (150 My versus 95 My from paleontological data (Benton and Donoghue 2007), or 230–333 My (Yamanoue et al. 2006) versus ∼100–120 My (Hasegawa et al. 2003) estimated from molecular data). This readily explains the smaller fraction and the lower conservation of alignable splice sites. Interestingly, >11% of transcripts are conserved also in Tetrapoda.
Hundreds of lncRNAs are conserved throughout the vertebrates
Only three GENCODE splice sites in three lncRNAs show conservation to the lamprey, namely AC011995.1-001, RP11-423H2.3-003, and RP11-123M21.1-001. These are neither microRNA nor snoRNA host genes. We find 87 conserved transcripts (including one snoRNA host genes) in at least one of the teleosts. 26% of them are even experimentally validated. Two hundred seventy-one transcripts (including 14 microRNA, 10 snoRNA) are conserved in at least one of the Sauropsida. The deep conservation of host genes does not come as a surprise since in many cases their payload is conserved at least throughout the vertebrates (Hertel et al. 2006; Lestrade and Weber 2006; Sempere et al. 2006; Marz et al. 2011).GAS5 is probably the best-studied snoRNA host gene, harboring ∼10 distinct snoRNAs in its introns (Smith and Steitz 1998). It has recently attracted considerable attention since its in general poorly conserved exonic product acts as a riborepressor that binds to the DNA-binding domain of the glucocorticoidreceptor (Kino et al. 2010; Williams et al. 2011). Its chicken homolog is described in detail in Shao et al. (2009). Large clusters of ESTs are easily identified as GAS5 homologs in frog (xenTro2 scaffold 1:6,870,168–6,878,818) and zebrafish (ENSDARG00000092337). The example of GAS5 clearly shows the limitations of genome-wide alignments. Although GAS5 is clearly conserved and functional (at least) across the gnathostomes (Fig. 6) the 46-way MULTIZ alignment does not contain the regions around the splice sites outside the Amniota; even in Sauropsida most parts are missing. Other well-studied examples of deeply conserved snoRNA host genes include UHG (SNHG1) (Fig. 7) and U87HG (Makarova and Kramerov 2009).Conserved splice sites of the GAS5 lncRNA. The GAS5 snoRNA host gene is among the most highly conserved lncRNAs. Its homologs are easily identifiable via the well-conserved snoRNAs (circles) located within its introns. Members of the SNORD80/Z15 family are shown in light gray. Black boxes indicate the major exons supported by RefSeq and/or EST data. Thin gray lines indicate splice sites that can be traced manually in at least one of the genome-wide alignments available in the UCSC browser. Note that only a subset of these is represented in any individual alignment (cf. Fig. 1). The transcript structure as well as its snoRNA payload has changed also by means of duplications and deletions.Variation of splice-site conservation. The patterns of splice-site conservation vary substantially between different lncRNAs, even when their evolutionary age is comparable. The main panel refers to the UCSC 46-way alignment. In the case of ANRIL, only a few splice sites are conserved outside the primates. Although the mouse ortholog shares at least some functions with human ANRIL (Pasmant et al. 2010), there are only four shared conserved splice sites. HOTTIP, with few exons that are partially conserved, is also a rather typical chromatin-related lincRNA. In contrast, the overwhelming majority of splice sites is conserved in Rmst. MEG3 shows an intermediate pattern, with more lineage-specific losses. The snoRNA host gene SNHG1 contains several splice sites that are deeply conserved among vertebrates. Some are even found in teleosts. Experimentally known splice sites from zebrafish SNHG1 were searched also in the six-way zebrafish MULTIZ alignment (inset). Additional homologous splice sites in two teleosts demonstrate once more the limitations arising from alignment quality. The grayscales are explained in Figure 1. Thick vertical bars on the right mark splice sites that belong to a specific transcript (black: plus strand, gray: minus strand). Thin lines between these bars indicate conserved splice sites, which are not part of the annotated transcripts.Not surprisingly, primary precursors of microRNAs are found among the best conserved lncRNAs. A well-studied case is Rmst, which harbors mir-1251. The human ortholog was described as differentially expressed in rhabdomyosarcoma subtypes (Chan et al. 2002). The mouse ortholog appeared as Pax-2 related gene in early hind-brain development (Bouchard et al. 2005). Its evolution was investigated in detail in Chodroff et al. (2010), demonstrating conservation of both the transcript and its expression patterns in opossum and chicken brains. The comparative splice-site map shows that Rmst is conserved also in Xenopus (Fig. 7). The imprinted MEG3 lncRNA exhibits a large number of differentially expressed isoforms (Zhang et al. 2010). It is an eutherian innovation apparently associated with the emergence of imprinting at the Dlk1 locus (Weidman et al. 2006). Indeed, only a single splice site close to the 3′ end of the transcripts is shared with a putative evolutionary precursor in the marsupials (Fig. 7). It hosts the snoRNA SNORD112 as well as the microRNA mir-770.The majority of the lncRNAs implicated in chromatin-based regulation can be traced throughout the Eutheria, although it is very likely that many of them are evolutionarily even older. A good example is HOTTIP (Fig. 7; Wang et al. 2011), where we lose the sequence conservation in most parts of the locus outside of the placental mammals. Although there are a few deeply conserved elements these do not include one of the splice-site sequences. Nevertheless, the transcript functions also in chick limb-buds (Wang et al. 2011), suggesting that the gene is considerably older than the Eutheria.Two zebrafish lncRNAs that are conserved across vertebrates were investigated in detail (Ulitsky et al. 2011). cyrano (oip5 antisense transcript) is required for proper embryonic development. Our splice-site map identifies conservation of splice sites across mammals. The sequence is not conserved enough, however, to support an alignment between teleosts and tetrapods. megamind (located antisense in an intron of birc6) regulates brain morphogenesis and eye development. The last acceptor site is conserved across gnathostomes in the eight-way zebrafish centered alignment.
SpliceMap web service
The splice-site maps based on several multiple sequence alignments are available as a web service. It can be used to produce overview maps such as those in Figures 1 and 7 and to export text files of predicted and validate splice sites. Either a list of splice-site coordinates or a genomic interval can be used as input.The website and the computation results are served by a set of Python scripts and rendered into static HTML using the Mako template engine. The jobs are scheduled in a queued fashion. Upon completion, the results are available under a personalized link for 2 wk. The service can be accessed at http://splicemap.bioinf.uni-leipzig.de.
DISCUSSION
The majority of the human long noncoding RNAs dates back at least to the radiation of the Eutheria, and thousands of these transcripts arose even earlier. The conservation of parts of their transcript structure constitutes compelling evidence for stabilizing selection, despite the often negligible constraints on the sequence itself. Utilizing the conservation of splice sites rather than measures of sequence similarity, furthermore, disentangles for a given locus the selective pressures on DNA elements from those that refer to the transcript. Our analysis, which suggests that some 70% of human lncRNAs date back to the eutherian ancestor is in agreement with an independent estimate of the conservation of lincRNAs conservation between man and mouse (Managadze et al. 2013) and with a direct comparison of lncRNA expression in six diverse mammals (Washietl et al. 2014).Despite the conservation at transcript level we observed a surprising amount of turnover at the level of individual splice sites, again in agreement with Washietl et al. (2014). We observe that many of the lncRNA loci exhibit a large number of splicing isoforms. As a consequence of the lack of detailed transcriptomics data for most species, it is currently impossible to trace the evolution of individual isoforms. The discrepancies among individual splice sites, however, leads us to hypothesize that differential selection of isoforms caused the observed rapid divergence of transcript structures. Together with a prolific innovation of new splice sites this process can quickly obscure the evolutionary relationships. Our analysis may still drastically underestimate the evolutionary age of lncRNAs.We suspect that, as in the case of HOTAIR or ANRIL, major changes of transcript structure go hand in hand with functional changes. This view is supported by major differences between isoforms, e.g., in the association of their expression levels with disease phenotypes (Burd et al. 2010; Holdt et al. 2010, 2013) or the change of function of HOTAIR in mouse that correlates with the loss of several exons (Schorderet and Duboule 2011). If our hypothesis is true, lncRNAs are likely to be the root cause for rapid phenotypic evolution, as their often chromatin-associated mode of action is subject to large functional changes by easy-to-achieve changes in gene structure. The selective inclusion or exclusion of protein binding sites would affect the composition of complexes of enhancers and chromatin modifiers (see, e.g., Mercer et al. [2009]), and thus rapidly alter the rules of transcriptional regulation without affecting the proteins machinery. A similar scenario can be drawn for the post-transcriptional regulation of the pool of microRNA composition by sponges such as HULC (Wang et al. 2010). We conclude that lncRNAs are an ancient component of vertebrate genomes with an unexpected and unprecedented evolutionary plasticity.
SUPPLEMENTAL MATERIAL
Supplemental material is available for this article.
Authors: Moran N Cabili; Cole Trapnell; Loyal Goff; Magdalena Koziol; Barbara Tazon-Vega; Aviv Regev; John L Rinn Journal: Genes Dev Date: 2011-09-02 Impact factor: 11.361
Authors: Kevin C Wang; Yul W Yang; Bo Liu; Amartya Sanyal; Ryan Corces-Zimmerman; Yong Chen; Bryan R Lajoie; Angeline Protacio; Ryan A Flynn; Rajnish A Gupta; Joanna Wysocka; Ming Lei; Job Dekker; Jill A Helms; Howard Y Chang Journal: Nature Date: 2011-03-20 Impact factor: 49.962
Authors: Ewan Birney; John A Stamatoyannopoulos; Anindya Dutta; Roderic Guigó; Thomas R Gingeras; Elliott H Margulies; Zhiping Weng; Michael Snyder; Emmanouil T Dermitzakis; Robert E Thurman; Michael S Kuehn; Christopher M Taylor; Shane Neph; Christoph M Koch; Saurabh Asthana; Ankit Malhotra; Ivan Adzhubei; Jason A Greenbaum; Robert M Andrews; Paul Flicek; Patrick J Boyle; Hua Cao; Nigel P Carter; Gayle K Clelland; Sean Davis; Nathan Day; Pawandeep Dhami; Shane C Dillon; Michael O Dorschner; Heike Fiegler; Paul G Giresi; Jeff Goldy; Michael Hawrylycz; Andrew Haydock; Richard Humbert; Keith D James; Brett E Johnson; Ericka M Johnson; Tristan T Frum; Elizabeth R Rosenzweig; Neerja Karnani; Kirsten Lee; Gregory C Lefebvre; Patrick A Navas; Fidencio Neri; Stephen C J Parker; Peter J Sabo; Richard Sandstrom; Anthony Shafer; David Vetrie; Molly Weaver; Sarah Wilcox; Man Yu; Francis S Collins; Job Dekker; Jason D Lieb; Thomas D Tullius; Gregory E Crawford; Shamil Sunyaev; William S Noble; Ian Dunham; France Denoeud; Alexandre Reymond; Philipp Kapranov; Joel Rozowsky; Deyou Zheng; Robert Castelo; Adam Frankish; Jennifer Harrow; Srinka Ghosh; Albin Sandelin; Ivo L Hofacker; Robert Baertsch; Damian Keefe; Sujit Dike; Jill Cheng; Heather A Hirsch; Edward A Sekinger; Julien Lagarde; Josep F Abril; Atif Shahab; Christoph Flamm; Claudia Fried; Jörg Hackermüller; Jana Hertel; Manja Lindemeyer; Kristin Missal; Andrea Tanzer; Stefan Washietl; Jan Korbel; Olof Emanuelsson; Jakob S Pedersen; Nancy Holroyd; Ruth Taylor; David Swarbreck; Nicholas Matthews; Mark C Dickson; Daryl J Thomas; Matthew T Weirauch; James Gilbert; Jorg Drenkow; Ian Bell; XiaoDong Zhao; K G Srinivasan; Wing-Kin Sung; Hong Sain Ooi; Kuo Ping Chiu; Sylvain Foissac; Tyler Alioto; Michael Brent; Lior Pachter; Michael L Tress; Alfonso Valencia; Siew Woh Choo; Chiou Yu Choo; Catherine Ucla; Caroline Manzano; Carine Wyss; Evelyn Cheung; Taane G Clark; James B Brown; Madhavan Ganesh; Sandeep Patel; Hari Tammana; Jacqueline Chrast; Charlotte N Henrichsen; Chikatoshi Kai; Jun Kawai; Ugrappa Nagalakshmi; Jiaqian Wu; Zheng Lian; Jin Lian; Peter Newburger; Xueqing Zhang; Peter Bickel; John S Mattick; Piero Carninci; Yoshihide Hayashizaki; Sherman Weissman; Tim Hubbard; Richard M Myers; Jane Rogers; Peter F Stadler; Todd M Lowe; Chia-Lin Wei; Yijun Ruan; Kevin Struhl; Mark Gerstein; Stylianos E Antonarakis; Yutao Fu; Eric D Green; Ulaş Karaöz; Adam Siepel; James Taylor; Laura A Liefer; Kris A Wetterstrand; Peter J Good; Elise A Feingold; Mark S Guyer; Gregory M Cooper; George Asimenos; Colin N Dewey; Minmei Hou; Sergey Nikolaev; Juan I Montoya-Burgos; Ari Löytynoja; Simon Whelan; Fabio Pardi; Tim Massingham; Haiyan Huang; Nancy R Zhang; Ian Holmes; James C Mullikin; Abel Ureta-Vidal; Benedict Paten; Michael Seringhaus; Deanna Church; Kate Rosenbloom; W James Kent; Eric A Stone; Serafim Batzoglou; Nick Goldman; Ross C Hardison; David Haussler; Webb Miller; Arend Sidow; Nathan D Trinklein; Zhengdong D Zhang; Leah Barrera; Rhona Stuart; David C King; Adam Ameur; Stefan Enroth; Mark C Bieda; Jonghwan Kim; Akshay A Bhinge; Nan Jiang; Jun Liu; Fei Yao; Vinsensius B Vega; Charlie W H Lee; Patrick Ng; Atif Shahab; Annie Yang; Zarmik Moqtaderi; Zhou Zhu; Xiaoqin Xu; Sharon Squazzo; Matthew J Oberley; David Inman; Michael A Singer; Todd A Richmond; Kyle J Munn; Alvaro Rada-Iglesias; Ola Wallerman; Jan Komorowski; Joanna C Fowler; Phillippe Couttet; Alexander W Bruce; Oliver M Dovey; Peter D Ellis; Cordelia F Langford; David A Nix; Ghia Euskirchen; Stephen Hartman; Alexander E Urban; Peter Kraus; Sara Van Calcar; Nate Heintzman; Tae Hoon Kim; Kun Wang; Chunxu Qu; Gary Hon; Rosa Luna; Christopher K Glass; M Geoff Rosenfeld; Shelley Force Aldred; Sara J Cooper; Anason Halees; Jane M Lin; Hennady P Shulha; Xiaoling Zhang; Mousheng Xu; Jaafar N S Haidar; Yong Yu; Yijun Ruan; Vishwanath R Iyer; Roland D Green; Claes Wadelius; Peggy J Farnham; Bing Ren; Rachel A Harte; Angie S Hinrichs; Heather Trumbower; Hiram Clawson; Jennifer Hillman-Jackson; Ann S Zweig; Kayla Smith; Archana Thakkapallayil; Galt Barber; Robert M Kuhn; Donna Karolchik; Lluis Armengol; Christine P Bird; Paul I W de Bakker; Andrew D Kern; Nuria Lopez-Bigas; Joel D Martin; Barbara E Stranger; Abigail Woodroffe; Eugene Davydov; Antigone Dimas; Eduardo Eyras; Ingileif B Hallgrímsdóttir; Julian Huppert; Michael C Zody; Gonçalo R Abecasis; Xavier Estivill; Gerard G Bouffard; Xiaobin Guan; Nancy F Hansen; Jacquelyn R Idol; Valerie V B Maduro; Baishali Maskeri; Jennifer C McDowell; Morgan Park; Pamela J Thomas; Alice C Young; Robert W Blakesley; Donna M Muzny; Erica Sodergren; David A Wheeler; Kim C Worley; Huaiyang Jiang; George M Weinstock; Richard A Gibbs; Tina Graves; Robert Fulton; Elaine R Mardis; Richard K Wilson; Michele Clamp; James Cuff; Sante Gnerre; David B Jaffe; Jean L Chang; Kerstin Lindblad-Toh; Eric S Lander; Maxim Koriabine; Mikhail Nefedov; Kazutoyo Osoegawa; Yuko Yoshinaga; Baoli Zhu; Pieter J de Jong Journal: Nature Date: 2007-06-14 Impact factor: 49.962
Authors: Jana Hertel; Manuela Lindemeyer; Kristin Missal; Claudia Fried; Andrea Tanzer; Christoph Flamm; Ivo L Hofacker; Peter F Stadler Journal: BMC Genomics Date: 2006-02-15 Impact factor: 3.969
Authors: Jose Antonio Corona-Gomez; Irving Jair Garcia-Lopez; Peter F Stadler; Selene L Fernandez-Valverde Journal: RNA Date: 2020-04-02 Impact factor: 4.942