Mammalian circular RNAs result largely from splicing errors.

Abstract

Ubiquitous in eukaryotes, circular RNAs (circRNAs) comprise a large class of mostly non-coding RNAs produced by back-splicing. Although some circRNAs have demonstrated biochemical activities, whether most circRNAs are functional is unknown. Here, we test the hypothesis that circRNA production primarily results from splicing error and so is deleterious instead of beneficial. In support of the error hypothesis, our analysis of RNA sequencing data from 11 shared tissues of humans, macaques, and mice finds that (1) back-splicing is much rarer than linear-splicing, (2) the rate of back-splicing diminishes with the splicing amount, (3) the overall prevalence of back-splicing in a species declines with its effective population size, and (4) circRNAs are overall evolutionarily unconserved. We estimate that more than 97% of the observed circRNA production is deleterious. We identify a small number of functional circRNA candidates, and the genome-wide trend strongly suggests that circRNAs are largely non-functional products of splicing errors.

Keywords: back-splicing; circRNA; evolution; molecular error; natural selection

Substances: RNA, Circular

Year: 2021 PMID: 34320353 PMCID: PMC8365531 DOI: 10.1016/j.celrep.2021.109439

Source DB: PubMed Journal: Cell Rep Impact factor: 9.423

INTRODUCTION

Circular RNAs (circRNAs) are a class of eukaryotic, endogenous, single-stranded, mostly non-coding RNA; unlike the regular RNAs formed by canonical linear-splicing, circRNAs are generated by back-splicing that covalently links a downstream splice-donor site to an upstream splice-acceptor site (Chen, 2016; Kristensen et al., 2019; Vicens and Westhof, 2014). Back-splicing requires canonical splicing signals (Starke et al., 2015), uses the canonical splicing machinery (Kristensen et al., 2019), and competes with canonical pre-mRNA splicing (Ashwal-Fluss et al., 2014). The length and location of circularized exons and the sequence content and length of the flanking introns of the back-spliced sites have been shown to impact circRNA biogenesis (Jeck et al., 2013; Memczak et al., 2013; Salzman et al., 2012; Zhang et al., 2014). circRNAs generally include canonical exons (Zhang et al., 2016), are predominantly cytoplasmic (Huang et al., 2018; Salzman et al., 2012), and are exceptionally stable (Enuka et al., 2016; Memczak et al., 2013). High-throughput RNA sequencing (RNA-seq) coupled with circRNA-specific bioinformatics has discovered numerous circRNAs (Glažar et al., 2014; Guo et al., 2014; Ivanov et al., 2015; Jeck et al., 2013; Ji et al., 2019; Salzman et al., 2012; Wang et al., 2014; Westholm et al., 2014). For example, over 50% of human protein-coding genes have been found to produce circRNAs (Ji et al., 2019). circRNAs are specific to tissue (Ji et al., 2019; Xia et al., 2017), cell type (Guo et al., 2014; Salzman et al., 2013), developmental stage (Szabo et al., 2015; Tan et al., 2017), and even subcellular location (Zhang et al., 2019). For instance, many circRNAs are dynamically expressed in the mammalian brain and are enriched in synapses (Ji et al., 2019; Rybak-Wolf et al., 2015; Xia et al., 2017). Some circRNAs act as microRNA sponges (Kristensen et al., 2019; Patop et al., 2019). 
The best known example is CDR1as/CiRS-7, which carries over 70 binding sites for miR-7, efficiently tethers miR-7, and drastically suppresses miR-7’s activity in binding its mRNA targets (Hansen et al., 2013; Memczak et al., 2013). Some circRNAs bind to and titrate out RNA-binding proteins (RBPs) (Abdelmohsen et al., 2017; Ashwal-Fluss et al., 2014). For instance, circMbl, derived from muscleblind (MBL/MBNL1), can titrate out extra MBL proteins (Ashwal-Fluss et al., 2014). Additionally, some circRNAs act as scaffolds to mediate the formation of complexes between specific enzymes and substrates (Du et al., 2016) and recruit proteins to particular locations (Chen et al., 2018). Furthermore, a small subset of circRNAs may take effect through their protein products resulting from cap-independent translation (Pamudurti et al., 2017). These demonstrated biochemical activities can be important, although they have been found in only a tiny fraction of all circRNAs. In fact, a genome-wide analysis suggested that most circRNAs are neither microRNA sponges nor translated (Guo et al., 2014). In the early days after the discovery of circRNAs (Hsu and Coca-Prados, 1979), these molecules were thought to be the product of erroneous splicing (Cocquerelle et al., 1993), a view that we refer to as the error hypothesis of circRNA production. However, the high prevalence of circRNAs, along with the demonstrated biochemical activities of a small number of them, has led to an alternative view that circRNAs are a large group of functional RNAs widely used in gene regulation (Barrett and Salzman, 2016; Chen, 2016; Ebbesen et al., 2017; Kristensen et al., 2019; Li et al., 2018; Memczak et al., 2013; Meng et al., 2017; Patop et al., 2019; Qu et al., 2017; Salzman, 2016). 
The popularity of this view is reflected by a rapid growth in the interest in circRNAs; only 8 years after the report of circRNAs produced from hundreds of human genes (Salzman et al., 2012), the term circRNAs appeared in the title or abstract of over 2,900 papers in 2020 alone. We will name this now prevailing view the adaptive hypothesis because circRNA production is beneficial according to this view. The adaptive hypothesis includes the scenario of exaptation in which circRNAs originate as functionless molecular errors but have since been co-opted to become functional and beneficial today. Despite the popularity of the adaptive hypothesis, the error hypothesis is not out of the question for most circRNAs. Back-splicing that creates circRNAs is a type of alternative splicing, which is known to be error prone (Melamud and Moult, 2009; Pickrell et al., 2010; Saudemont et al., 2017). Hence, back-splicing as a splicing error could occur to the transcripts of many genes. Furthermore, the error hypothesis is not inconsistent with the fact that only a tiny fraction of circRNAs have demonstrated biochemical activities. Furthermore, it is unknown how many of these activities are selected and how many have no appreciable fitness effects (Doolittle et al., 2014; Graur et al., 2013). Distinguishing between the error and adaptive hypotheses of circRNA production is important because it will shed light on the origin, function, and biological significance of this large group of ubiquitous RNAs of eukaryotes and guide future circRNA research. Here, we make a series of distinct predictions of the error hypothesis about genomic patterns of back-splicing and circRNAs that are not expected a priori under the adaptive hypothesis. By analyzing high-throughput RNA-seq data from multiple tissues of humans, macaques, and mice, we provide comprehensive evidence that the production of most mammalian circRNAs is due to splicing error and is selectively disfavored.

RESULTS

Back-splicing rates are generally very low

Under the error hypothesis, back-splicing is a splicing error, which is expected to be generally detrimental. Thus, natural selection should have minimized the rate of back-splicing, which is defined as the probability that a splicing event leads to back-splicing instead of linear-splicing. In contrast, the adaptive hypothesis does not predict a priori a low rate of back-splicing because, under this hypothesis, back-splicing rates should be high enough to yield sufficient circRNAs for them to have functional impacts (Palazzo and Lee, 2015). To distinguish between the error and adaptive hypotheses, we investigated back-splicing rates by using a RiboMinus RNA-seq dataset from the human, macaque, and mouse (see STAR Methods). We focused on the 11 tissues in the dataset that are shared among the 3 mammals to facilitate among-species comparisons (Table S1). We then identified linearly spliced reads, which indicate linear-splicing, and back-spliced reads, which indicate back-splicing (see STAR Methods). We define the splicing amount of a gene by the total amount of back-splicing and linear-splicing of the gene. To ensure a certain level of accuracy in the estimation of back-splicing rates, we considered only those protein-coding genes for which the expression level is at least 1 transcript per kilobase million (TPM) and the splicing amount is at least 1 spliced read. As previously reported for this dataset (Ji et al., 2019), a relatively large fraction of genes show back-splicing. Among the 11 tissues, the median fraction of genes exhibiting back-splicing is 27.2%, 37.8%, and 25.5% in the human, macaque, and mouse, respectively (first column in Figure 1A). However, the median fraction of splice sites subject to back-splicing is only 3.9%, 6.2%, and 2.9% for human, macaque, and mouse, respectively (second column in Figure 1A). 
Most importantly, the median rate of back-splicing, measured by the median fraction of spliced reads that are back-spliced, is only 0.2%, 0.16%, and 0.04% in human, macaque, and mouse, respectively (third column in Figure 1A), indicating that the overall back-splicing rate is three to four orders of magnitude lower than the linear-splicing rate. In any tissue of any of the three species, even when only genes exhibiting back-splicing (i.e., back-spliced genes) in the tissue are considered, back-spliced reads constitute no more than 2% of all spliced reads (fourth column in Figure 1A). We further examined the distribution of the fraction of spliced reads that are back-spliced among back-spliced genes. Again, we found this fraction to be below 10% in human and below 5% in the other species in most genes (Figure 1B). Due to the exceptional stability of circRNAs relative to linear RNAs, the actual back-splicing rates are likely even lower than the above estimates. Together, these observations show that the back-splicing rate is orders of magnitude lower than the linear-splicing rate, as expected if back-splicing is a splicing error.
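
The rate definition used above can be written out as a minimal sketch. The function names and the example read counts are hypothetical and chosen only for illustration; the thresholds (at least 1 TPM and at least 1 spliced read) are the ones stated in the text.

```python
def back_splicing_rate(back_spliced_reads, linear_spliced_reads):
    """Return the fraction of spliced reads that are back-spliced."""
    # The splicing amount is the total of back-splicing and linear-splicing.
    splicing_amount = back_spliced_reads + linear_spliced_reads
    if splicing_amount < 1:
        raise ValueError("gene has no spliced reads; rate is undefined")
    return back_spliced_reads / splicing_amount


def passes_filters(tpm, back_spliced_reads, linear_spliced_reads):
    """Apply the two inclusion thresholds stated in the text."""
    return tpm >= 1 and (back_spliced_reads + linear_spliced_reads) >= 1


# Hypothetical counts for a single gene, for illustration only.
example_rate = back_splicing_rate(back_spliced_reads=2, linear_spliced_reads=998)
print(f"{example_rate:.2%}")
```

Because circRNAs are more stable than linear RNAs, a read-count ratio of this kind would tend to overstate the true per-event rate, as noted above.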

Figure 1.

Low rates of back-splicing in mammals, see also Table S1

(A) Various measures of the rate of back-splicing in 11 tissues from 3 mammals. Only expressed and spliced genes are considered. From the left to the right are percentage of genes with back-splicing, percentage of splice sites that show back-splicing, percentage of spliced reads that are back-spliced, and percentage of spliced reads that are back-spliced among back-spliced genes.

(B) Distribution of the percentage of spliced reads that are back-spliced among back-spliced genes. In each boxplot, the left and right edges of a box represent the first (qu1) and third (qu3) quartiles, respectively; the vertical line inside the box indicates the median (md); and the whiskers extend to the most extreme values inside inner fences, md ± 1.5(qu3 − qu1).

Back-splicing rates decrease with splicing amount

Under the error hypothesis, there are at least three reasons why back-splicing is likely detrimental and selected against. First, back-splicing lowers the fraction of functional mRNA molecules. Second, it wastes materials and energy in producing and degrading circRNAs and possibly their protein products. Third, it may result in circRNAs and/or their protein products that are toxic. Under a given rate of back-splicing, the harm of back-splicing due to the above first cause is independent of the total splicing amount but that due to the second and third causes increases with the total amount of splicing. Hence, natural selection against back-splicing at a splice site (or in a gene) should intensify with the amount of splicing at the splice site (or in the gene). As a result, the error hypothesis predicts that the back-splicing rate should decrease with the splicing amount. In contrast, the adaptive hypothesis does not predict this negative correlation a priori because, under this hypothesis, the back-splicing rate depends on the specific function and regulation of the gene and/or the circRNA produced. To distinguish between the error and adaptive hypotheses, for each (expressed and spliced) gene, we estimated its splicing amount by the total number of spliced reads in the gene, which rises with the transcript concentration of the gene as well as its number of introns. Because natural selection against splicing error in a gene depends on the product of the above two variables, we do not consider them separately. We estimated the back-splicing rate of a gene by its proportion of spliced reads that are back-spliced. We started by focusing on back-spliced genes in the human kidney. Consistent with the prediction of the error hypothesis, the rank correlation (ρ) between the splicing amount of a gene and its back-splicing rate is significantly negative (ρ = −0.63, p < 10^−300; Figure 2A). 
Qualitatively similar results were observed in all examined tissues of the human, macaque, and mouse (Figure 2B). For comparison, we marked in Figure 2A the host genes of two functional circRNAs, namely, circ-ZNF609 (Legnini et al., 2017) and circ-FBXW7 (Yang et al., 2018); the back-splicing rates are much greater in these two genes than in most other genes of comparable splicing amounts.
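
The rank correlation between splicing amount and back-splicing rate can be illustrated with a small standard-library sketch of Spearman's ρ (Pearson correlation of average ranks). The toy values are invented for illustration and are not data from the study; the helper names are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy per-gene values (hypothetical): spliced reads and back-splicing rates.
splicing_amount = [10, 100, 1000, 10000]
back_rate = [0.05, 0.01, 0.02, 0.001]
rho = spearman(splicing_amount, back_rate)
```

This sketch computes only the coefficient; the p values reported in the text would come from a separate significance test.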

Figure 2.

The back-splicing rate of a gene (or supergene) decreases with the splicing amount of the gene (or supergene)

The rate of back-splicing is measured by the fraction of spliced reads that are back-spliced, see also Table S2.

(A) The back-splicing rate of a gene decreases with the splicing amount of the gene in the human kidney. Each dot represents a back-spliced gene, and the solid line shows the linear least-squares regression. Spearman’s rank correlation (ρ) and associated p value are presented. The host genes of two functional circRNAs are marked in red.

(B) Spearman’s correlation between the back-splicing rate of a gene and its splicing amount among back-spliced genes in each tissue of each mammal examined. All correlations have p < 10^−126.

(C) The back-splicing rate of a supergene decreases with the median splicing amount of all genes belonging to the supergene in the human kidney. Each triangle represents a supergene. All supergenes have the same total splicing amount.

(D) Spearman’s correlation between the median splicing amount of a supergene and its back-splicing rate in each tissue of each mammal examined. All correlations have p < 0.01.

(E) The back-splicing rate in the human kidney of the paralog with the relatively low splicing amount tends to exceed that of the paralog with the relatively high splicing amount within a paralogous gene pair. The original data are used here. Each dot represents a paralogous gene pair. Dots above and below the diagonal are colored red and blue, respectively. Numbers of red and blue dots are indicated with the corresponding color. The p value is from a binomial test of the null hypothesis of equal numbers of red and blue dots.

(F) Proportion of paralogous gene pairs for which the back-splicing rate of the paralog with the relatively low splicing amount exceeds that of the paralog with the relatively high splicing amount in each tissue of each species examined. Both original (squares) and downsampled (triangles) data are used. All fractions are significantly greater than the chance expectation of 50% (p < 10^−4).

The above correlation analysis is subject to two potential statistical problems. First, because the detectability of back-splicing increases with the splicing amount, both low and high rates of back-splicing are observable in genes of high splicing amounts whereas only high rates of back-splicing may be observed in genes of low splicing amounts. Consequently, a negative correlation between the back-splicing rate and splicing amount could have resulted simply from this potential detection bias. Second, because the splicing amount is used as the denominator in the estimation of the back-splicing rate, any measurement error of splicing amount can cause a spurious correlation between splicing amount and back-splicing rate. To avoid these potential problems, we used a supergene approach (see STAR Methods). Briefly, we ranked all genes by the splicing amount and grouped them into 10 bins such that each bin had the same total splicing amount. We then computed the overall back-splicing rate of each bin by considering all genes in the bin together as a supergene. The uniformity of the total splicing amount among bins removes the two potential problems mentioned above. In the human kidney, the back-splicing rate of a bin decreases almost monotonically with the median splicing amount of all genes in the bin (ρ = −0.99, p < 10^−300; Figure 2C). Similar results were found in all tissues of the three mammals (Figure 2D). Note that this negative correlation cannot be caused by potentially false signals of back-splicing created by sequencing or other technical errors because such errors are random and so should not occur more frequently to genes of lower splicing amounts. 
Furthermore, the negative correlation could not have been caused by an impact of the amount of back-splicing on the total splicing amount (e.g., under the hypothesis that back-splicing is a functional regulation of gene expression) because the former is such a tiny fraction of the latter (Figure 1) that the variation of the former has effectively no influence on the variation of the latter. To exclude the possibility that the above results are statistical artifacts, we performed a computer simulation analogous to the analysis in Figure 2C, except that we randomly shuffled the back-splicing rates among genes before the analysis. As expected, no significant correlation was observed between the overall back-splicing rate of a bin and the median splicing amount of the genes belonging to the bin. Different genes differ in multiple aspects in addition to the splicing amount, and so they may not be comparable. To minimize the influences of potential confounding factors in the above analysis, we compared the back-splicing rates between paralogous genes because paralogs are similar in gene structure, DNA sequence, regulation, and function (Zhang, 2013). We required the splicing amounts of the two paralogs to differ by at least twofold to ensure sufficient power of the analysis. Consistent with the error hypothesis, for a paralogous pair, the back-splicing rate tends to be higher for the gene of a relatively low splicing amount. For example, in the human kidney, 75.3% of paralogous pairs show such a trend, which is significantly more than the random expectation of 50% (p = 1.89 × 10^−25, binomial test; Figure 2E). In the above analysis, we randomly chose two genes from each gene family annotated by Ensembl. To ensure that the observation in Figure 2E is robust, we repeated the above analysis 100 times; the significant trend in Figure 2E was confirmed in each of the 100 replications. 
Furthermore, the pattern in Figure 2E holds in all analyzed tissues of the three mammals (Figure 2F). Because the supergene approach cannot pair paralogous genes, we used a downsampling approach (see STAR Methods) to remove the potential statistical problems mentioned. Specifically, for each pair of paralogs, we randomly sampled the spliced reads from the paralog with a relatively high splicing amount to the number of spliced reads observed in the other paralog. We found the results from the downsampled data to be virtually identical to those from the original data (Figure 2F).
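
The supergene binning described earlier in this section can be sketched as follows. This is only one plausible way to form bins of equal total splicing amount: the exact procedure is given in the STAR Methods, which are not reproduced here, so the midpoint-based bin assignment, the function name, and the toy input are all assumptions.

```python
from statistics import median


def supergene_rates(genes, n_bins=10):
    """Group genes into bins of (approximately) equal total splicing amount.

    genes: list of (spliced_reads, back_spliced_reads) tuples, one per gene.
    Returns a list of (median splicing amount, pooled back-splicing rate),
    one entry per non-empty bin, ordered from low to high splicing amount.
    """
    ordered = sorted(genes, key=lambda g: g[0])
    total = sum(spliced for spliced, _ in ordered)
    target = total / n_bins  # total splicing amount each bin should hold

    bins = [[] for _ in range(n_bins)]
    cumulative = 0
    for spliced, back in ordered:
        # Assumption: place each gene by the midpoint of the interval it
        # occupies on the cumulative splicing-amount axis.
        idx = min(int((cumulative + spliced / 2) / target), n_bins - 1)
        bins[idx].append((spliced, back))
        cumulative += spliced

    results = []
    for members in bins:
        if not members:
            continue
        pooled_spliced = sum(s for s, _ in members)
        pooled_back = sum(b for _, b in members)
        results.append((median(s for s, _ in members),
                        pooled_back / pooled_spliced))
    return results


# Toy input (hypothetical): ten lowly spliced genes and one highly spliced gene.
toy_genes = [(10, 1)] * 10 + [(100, 1)]
toy_bins = supergene_rates(toy_genes, n_bins=2)
```

With real data a single gene can exceed the per-bin target, so bins will only be approximately equal; how such genes were handled in the study is not stated in this section.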

The back-splicing rate correlates negatively with splicing amount across tissues

The back-splicing rate of a gene can be influenced by cis-acting elements, which are present on the same DNA molecule as the gene, and trans-acting factors, which are not present on the same DNA as the gene. In the above comparison of back-splicing among different genes in the same tissue, all genes are in the same environment of trans-acting factors, so the among-gene variation must be caused by the variation in cis-acting elements that affect splicing. Because different tissues can provide qualitatively or quantitatively different trans-acting factors, the back-splicing rate of the same gene may differ among tissues. The error hypothesis predicts that, for a given gene, natural selection against splicing error should intensify in the tissue where the splicing amount of the gene is higher. This should result in a negative correlation between the back-splicing rate and splicing amount across tissues for individual genes. In contrast, no such prediction is made a priori by the adaptive hypothesis because, under the adaptive hypothesis, the back-splicing rate of a gene in a tissue would depend on the specific function of the circRNA produced in that tissue. Because the back-splicing rate is low or even zero for most genes (Figure 1B), sampling error would swamp the potential signal in among-tissue comparisons of individual genes. To circumvent this problem, we randomly grouped every 250 genes into a supergene, except for the last supergene that comprised the remainder of fewer than 250 genes after the grouping. The number 250 was chosen to ensure that each supergene contains sufficient genes with back-splicing and that sufficient supergenes are present to permit a meaningful statistical analysis. We examined the back-splicing rate and splicing amount of each supergene in each tissue. 
To allow for an among-tissue comparison of the splicing amount, we computed the splicing amount of a supergene in a tissue by the number of spliced reads for the supergene per million total reads (SRPM) in the tissue. As shown in Figure 3A for an example, the back-splicing rate of this particular human supergene in a tissue generally decreases with its splicing amount in the tissue. Indeed, in each of the three species examined, significantly more than 50% of supergenes show this negative correlation (numbers at the bottom of Figure 3B), and this trend is robust as long as supergenes are composed of at least 250 genes. Because each supergene does not have the same splicing amount across tissues, to avoid potential statistical artifacts, we downsampled the spliced reads of a supergene in a tissue to the lowest observed level among all tissues for the supergene and then recomputed the back-splicing rate of the supergene in each tissue. The final results still hold (Figure 3). Thus, the variation of the back-splicing rate among tissues supports the error hypothesis. Because the among-tissue variation of the back-splicing rate is not always concordant among (super) genes, we infer that the variation not only is due to differences in trans-acting factors among tissues but also contributed by interactions between trans-acting factors of individual tissues and cis-acting elements of individual genes.
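
The across-tissue downsampling step can be illustrated with a short sketch. It assumes that downsampling means drawing spliced reads uniformly at random without replacement, which the text does not spell out; the function names, the seed, and the read counts are hypothetical.

```python
import random


def downsample_back_rate(back_reads, spliced_reads, target_reads, rng):
    """Draw target_reads spliced reads without replacement and return the
    fraction of the drawn reads that are back-spliced."""
    if target_reads > spliced_reads:
        raise ValueError("cannot downsample to more reads than observed")
    # Reads are indexed 0..spliced_reads-1; the first back_reads indices
    # stand for the back-spliced reads.
    picked = rng.sample(range(spliced_reads), target_reads)
    back_kept = sum(1 for i in picked if i < back_reads)
    return back_kept / target_reads


def downsample_across_tissues(counts, seed=0):
    """counts: {tissue: (back_spliced_reads, spliced_reads)} for one supergene.

    Every tissue is downsampled to the lowest spliced-read count observed
    among the tissues, and the back-splicing rate is then recomputed.
    """
    rng = random.Random(seed)
    floor = min(spliced for _, spliced in counts.values())
    return {tissue: downsample_back_rate(back, spliced, floor, rng)
            for tissue, (back, spliced) in counts.items()}


# Hypothetical read counts for one supergene in two tissues.
toy_counts = {"kidney": (50, 10000), "liver": (10, 1000)}
toy_rates = downsample_across_tissues(toy_counts)
```

The tissue with the lowest read count keeps all of its reads, so its rate is unchanged; rates in the other tissues become noisier but are no longer estimated from a larger denominator.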

Figure 3.

Negative correlation between the back-splicing rate and splicing amount across tissues

The back-splicing rate is the number of back-spliced reads divided by the number of spliced reads in a supergene.

(A) The back-splicing rate of a particular supergene in a tissue decreases with the total splicing amount of the genes belonging to the supergene in the tissue. The virtually superimposed black and red lines are the linear least-squares regressions for the original and down-sampled data, respectively.

(B) Distribution of Spearman’s correlation coefficient between back-splicing rate and splicing amount across tissues for all supergenes. In each boxplot, the lower and upper edges of a box represent qu1 and qu3 quartiles, respectively; the horizontal line inside the box indicates the md; and the whiskers extend to the most extreme values inside inner fences, md ± 1.5(qu3 − qu1). Below each boxplot is the fraction of supergenes showing a negative correlation; all fractions significantly exceed 50% (p < 0.05, binomial test).

Back-splicing is not evolutionarily conserved

Back-splicing is expected to be evolutionarily conserved if it is beneficial; otherwise, it should be unconserved. Thus, a comparison between species that have been separated for a sufficiently long time allows differentiation between the adaptive and error hypotheses. To this end, we compared back-splicing between a primate and a rodent. If an orthologous splice-acceptor (or donor) is used in back-splicing in the same tissue of the two species, the acceptor (or donor) is considered shared between the two species (Figure 4A). For each tissue, we calculated the fraction of back-spliced acceptors in human or macaque that are shared with mouse. For example, in the kidney, human has 9,133 back-spliced acceptors, of which only 1,539 (or 16.9%) are shared with mouse. This fraction has a median value of 17.0% in human and 12.3% in macaque across the 11 tissues examined (Figure 4B).

Figure 4.

Fractions of human or macaque back-spliced acceptors or donors that are shared with mouse, see also Data S1 and S2

(A) A diagram illustrating various terms used in the analysis. Dotted curves show all potential back-splicing, while red and blue stars indicate realized back-splicing in observation and simulation, respectively.

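(B) Fraction of human or macaque back-spliced acceptors shared with mouse.

(C) Difference between the fraction of back-spliced acceptors shared with mouse and the chance expectation.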

(D) Fraction of human or macaque back-spliced donors shared with mouse.

(E) Difference between the fraction of back-spliced donors shared with mouse and the chance expectation.

In (B)–(E), the median value across the 11 tissues is provided in the parentheses after each species.

Nevertheless, sharing of a spliced acceptor or donor between species may not indicate functional back-splicing because non-functional back-splicing could be shared by chance. To estimate the amount of sharing expected by chance, we first examined patterns of back-splicing in human (of a given tissue) by computing the relative probabilities that a back-spliced read mapped to a splice donor is also mapped respectively to the first upstream acceptor, second upstream acceptor, and so on (Figure 4A). We then used the overall probability distribution (Data S1) of all back-spliced reads observed in the species and tissue to simulate back-splicing. If n back-spliced reads were observed for a donor, we would simulate n back-spliced reads for this donor, but the acceptors will be randomly decided based on the overall probability distribution determined above. The simulation allows the survey of the set of acceptors expected by chance when the donors are given. We repeated the analysis and surveyed the set of acceptors expected by chance in mouse and then computed the fraction of acceptors in human shared with mouse by chance. For example, this value is 14.6% for the kidney. Hence, the observed fraction of shared acceptors is only 16.9 − 14.6 = 2.3 percentage points above the chance expectation (Figure 4C), indicating that most acceptors shared between human and mouse are by chance. Similar patterns are observed for other human tissues and for all macaque tissues (Figure 4C). We similarly analyzed the sharing of back-spliced donors between a primate and a rodent (Data S2). Again, we found moderate sharing of donors given the acceptors (Figure 4D) but most are explainable by chance (Figure 4E). Note that some values in Figure 4C and Figure 4E are negative, which is likely due to sampling error caused by the stochasticity of evolution and/or simulation. 
Regardless, the small positive to small negative values suggest that there is little excess in between-species sharing of back-spliced acceptors (given donors) or donors (given acceptors) when compared with the chance expectation, supporting the error hypothesis and refuting the adaptive hypothesis. Note that some acceptors and donors are used more often than others for back-splicing (Data S1 and S2), but we do not know the molecular determinants of their relative usages, which await future mechanistic studies.
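
The chance-expectation simulation described above can be sketched roughly as follows. This is an illustrative reading of the procedure rather than the authors' implementation: the data structures, the function names, the truncation of the rank distribution to the acceptors available upstream of each donor, and the orthology lookup are all assumptions.

```python
import random


def simulate_acceptors(donor_reads, upstream_counts, rank_probs, rng):
    """Simulate which upstream acceptors each donor pairs with.

    donor_reads: {donor_id: number of observed back-spliced reads}
    upstream_counts: {donor_id: number of upstream acceptors available}
    rank_probs: rank_probs[k] is the relative probability that a
        back-spliced read pairs with the (k+1)-th upstream acceptor.
    Returns the set of (donor_id, acceptor_rank) pairs used in simulation.
    """
    used = set()
    for donor, n_reads in donor_reads.items():
        # Assumption: restrict the distribution to acceptors that exist.
        k = min(upstream_counts[donor], len(rank_probs))
        if k == 0 or n_reads == 0:
            continue
        ranks = rng.choices(range(1, k + 1), weights=rank_probs[:k], k=n_reads)
        used.update((donor, rank) for rank in ranks)
    return used


def shared_fraction(acceptors_a, acceptors_b, ortholog_of):
    """Fraction of acceptors in species A whose ortholog is used in species B."""
    if not acceptors_a:
        return 0.0
    shared = sum(1 for a in acceptors_a if ortholog_of.get(a) in acceptors_b)
    return shared / len(acceptors_a)


# Toy example (hypothetical identifiers and probabilities).
rng = random.Random(1)
sim = simulate_acceptors({"d1": 5, "d2": 3}, {"d1": 2, "d2": 1}, [1.0, 0.0], rng)
frac = shared_fraction(sim, {("m1", 1)}, {("d1", 1): ("m1", 1)})
```

Running the simulation in both species and passing the two simulated acceptor sets to `shared_fraction` gives a chance expectation that can be subtracted from the observed fraction, as in the kidney example (16.9 − 14.6 = 2.3 percentage points).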

Conservation of splicing signals is uncorrelated with the amount of back-splicing

Back-splicing depends on the canonical splicing machinery and splicing signal. Thus, with proper controls, intraspecific and interspecific variations of splicing motifs—GU as the donor and AG as the acceptor—can indicate whether back-splicing is protected by purifying selection. If back-splicing is functional, motifs associated with larger amounts of back-splicing should be subject to stronger purifying selection. In contrast, if back-splicing results mostly from molecular error and is not beneficial, no such correlation is expected. To this end, we used the number of back-spliced reads associated with a donor (or acceptor) in a human tissue as the measure of its back-splicing amount in that tissue. We used (1) the number of single-nucleotide polymorphisms (SNPs) per site (SNP density) in humans, (2) the mean derived allele frequency (DAF) in humans, and (3) the percent sequence divergence between human and macaque at a donor (or acceptor) splicing motif—the first (or last) two nucleotides of the relevant intron—as indicators of purifying selection. All three indicators should decline with the level of purifying selection but have different properties. The interspecific sequence divergence measures long-term average purifying selection and is insensitive to interferences from selections at linked nucleotide sites but would be powerless if the functionality of the associated back-splicing is limited to humans. The other two indicators are useful even if the functionality of the associated back-splicing is limited to humans but could be influenced by linked selection. In addition, SNP density could be affected by mutation rate variation among sites, whereas DAF is robust to this variation. We observed only trivial and mostly statistically non-significant correlations between the back-splicing amount and SNP density among back-spliced donors or acceptors (Figure 5A). 
Note that these correlations may not be due to selection on back-splicing because the donors and acceptors are also used by linear-splicing. Indeed, we observed significant, negative correlations between the linear-splicing amount and SNP density across donors and acceptors (Figure 5B). To remove the confounding factor of linear-splicing, we performed partial correlations between the back-splicing amount and SNP density by controlling for the corresponding linear-splicing amount. Interestingly, all the partial correlations are around zero and none of them are statistically significant (Figure 5C). Similar patterns were observed for DAF (Figures 5D–5F) and interspecific sequence divergence (Figures 5G–5I). Together, both intraspecific polymorphisms and interspecific divergences of donor and acceptor motifs suggest no purifying selection protecting back-splicing motifs, which is inconsistent with the adaptive hypothesis but supports the error hypothesis.
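
For reference, a first-order partial rank correlation of the kind described here is commonly computed from the three pairwise Spearman correlations. The formula below is the standard textbook expression; whether the authors used exactly this formulation is an assumption. Here B denotes the back-splicing amount, S the indicator of purifying selection (for example, SNP density), and L the linear-splicing amount.

```latex
\rho_{BS \cdot L} =
  \frac{\rho_{BS} - \rho_{BL}\,\rho_{SL}}
       {\sqrt{\left(1 - \rho_{BL}^{2}\right)\left(1 - \rho_{SL}^{2}\right)}}
```

If back-splicing amount and SNP density are correlated only through their shared dependence on linear-splicing, the numerator is close to zero, which matches the pattern reported in Figure 5C.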

Figure 5.

Back-splicing signals are not protected by purifying selection

(A) Spearman’s correlation between human SNP density at a splice-acceptor or donor site and the associated back-splicing amount.

(B) Spearman’s correlation between human SNP density at a splice-acceptor or donor site and the associated linear-splicing amount.

(C) Partial rank correlation between human SNP density at a splice-acceptor or donor site and the associated back-splicing amount, upon the control of the linear-splicing amount.

(D–F) Same as (A)–(C) except that human SNP density is replaced with human mean derived allele frequency (DAF).

(G–I) Same as (A)–(C) except that human SNP density is replaced with human-macaque divergence. Statistical significance of a correlation is indicated by a dash (non-significant) or star (significant at p = 0.05) at the bottom of each panel.

Overall rate of back-splicing declines with the effective population size

If back-splicing arises from splicing error and is detrimental, natural selection will lower its rate. Because the strength of the selection increases with the effective population size (Ne) of the species (Ohta, 1992), the rate of back-splicing upon selection is expected to be lower in species with larger Ne. That is, the error hypothesis predicts that the back-splicing rate decreases from the human to macaque to mouse, given that Ne increases substantially from the human to macaque to mouse (Phifer-Rixey et al., 2012; Xue et al., 2016). In contrast, no such prediction is made a priori by the adaptive hypothesis because the back-splicing rate in a species would depend on the function of back-splicing and the environment of the species under the adaptive hypothesis. To compare the overall rate of back-splicing among the three species, we grouped the splicing data from all 11 tissues of each species. We first calculated the overall back-splicing rate of all (expressed and spliced) genes in each species, which is the total number of back-spliced reads divided by the total number of spliced reads. This rate is 0.26% in humans, 0.19% in macaques, and 0.09% in mice, with all between-species differences being significant (p < 10^−15, Fisher’s exact test; Figure 6A). Because the number and types of genes vary among species, we also compared one-to-one orthologous genes among the three species. Now the back-splicing rate is 0.31% in humans, 0.22% in macaques, and 0.11% in mice (all p < 10^−15, Fisher’s exact test; Figure 6B). We confirmed that the above pattern of interspecific differences holds when orthologous genes are stratified into groups of low (<10 SRPM), intermediate (10–100 SRPM), and high (>100 SRPM) splicing amounts (all p < 10^−15, Fisher’s exact test; Figure 6C). These results are not caused by outliers, which is evident from the among-gene distribution of the back-splicing rate of each species (Figure 6D). 
Furthermore, a comparison of splicing rates of individual genes among the three species supports that the splicing rate generally reduces from human to macaque to mouse for orthologous genes (all p < 10−238, Wilcoxon signed-rank test; Figure 6D). Together, the relative overall back-splicing rates in the three mammals supports the error hypothesis. Nevertheless, because the above analyses were based on only three species, our finding should be further scrutinized in the future when circRNA data of the same tissues become available from additional species.

Figure 6.

The overall back-splicing rate in a species declines with its effective population size

(A) Overall back-splicing rates in the human, macaque, and mouse when all (expressed and spliced) genes are considered.

(B) Overall back-splicing rates in the three species when only one-to-one orthologous genes are considered. In (A) and (B), the difference between any two species is significant (p < 10^−15, Fisher’s exact test).

(C) Among-species comparison of back-splicing rates of one-to-one orthologous genes after the genes are stratified into bins of low (<10 SRPM), intermediate (10 to 100 SRPM), and high (>100 SRPM) splicing amounts according to the mean splicing amount across species. SRPM, number of spliced reads per million total reads in a sample. At each of the three levels of splicing amount, the difference in back-splicing rate between any two species is significant (p < 10^−15, Fisher’s exact test).

(D) Boxplot showing the distribution of the back-splicing rate among one-to-one orthologous genes in each species. The difference between any two species is significant (p < 10^−238, Wilcoxon signed-rank test). In each boxplot, the lower and upper edges of a box represent the first (qu1) and third (qu3) quartiles, respectively; the horizontal line inside the box indicates the median (md); and the whiskers extend to the most extreme values inside inner fences, md ± 1.5(qu3 − qu1).

Most back-splicing is deleterious

Together, the above analyses strongly suggest that most back-splicing events are deleterious. Below, we use an established method to estimate the fraction of back-splicing that is deleterious (Li and Zhang, 2019; Saudemont et al., 2017; Xu and Zhang, 2020a). This estimation is based on the reasonable assumption that the fitness effect of a back-splicing event at a splice site before the action of natural selection is independent of the splicing amount at the site. Because the strength of natural selection against back-splicing increases with the splicing amount, we assume that all deleterious back-splicing has been selectively removed in genes of the highest splicing amounts. In other words, the observed back-splicing rate in these genes is the non-deleterious back-splicing rate (ND). Similarly, we assume that none of the deleterious back-splicing has been selectively purged in genes of the lowest splicing amounts. That is, the observed back-splicing rate in these genes reflects the total back-splicing rate (T). Thus, the fraction of deleterious back-splicing is Fdel = (T − ND)/T = 1 − ND/T. We defined genes of the lowest and highest splicing amounts by using a variety of cutoffs. In theory, using more stringent cutoffs makes the estimate of Fdel more accurate but less precise due to the reduction in sample size. When the data from all tissues were combined, we found Fdel to be greater than 96% for each species under any combination of cutoffs (Figure 7A). For example, when the cutoffs of <1 SRPM and >500 SRPM were adopted in defining genes of the lowest and highest splicing amounts, respectively, human T = 5.96 × 10^−3 and ND = 6.46 × 10^−5, so Fdel = 98.9%. Similarly, under these cutoffs, Fdel = 99.8% for the macaque and 99.1% for the mouse. 
Note that the above Fdel values of the three species are not directly comparable because the same SRPM cutoffs mean different degrees of validity of the above two assumptions for different species as a result of their different Ne values.
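
The estimator above reduces to a one-line calculation. The following is a minimal Python sketch, not code from this study; the function name `deleterious_fraction` is hypothetical, and the inputs are the human T and ND values quoted in the text for the <1 SRPM and >500 SRPM cutoffs.

```python
def deleterious_fraction(total_rate: float, non_deleterious_rate: float) -> float:
    """Return Fdel = 1 - ND/T, the estimated fraction of deleterious back-splicing.

    total_rate (T): back-splicing rate in genes of the lowest splicing amounts.
    non_deleterious_rate (ND): back-splicing rate in genes of the highest splicing amounts.
    """
    if total_rate <= 0:
        raise ValueError("T must be positive")
    if not 0 <= non_deleterious_rate <= total_rate:
        raise ValueError("ND must lie between 0 and T")
    return 1.0 - non_deleterious_rate / total_rate


# Human values quoted in the text (all tissues combined).
human_T = 5.96e-3
human_ND = 6.46e-5
human_fdel = deleterious_fraction(human_T, human_ND)
print(f"Human Fdel = {human_fdel:.1%}")  # prints "Human Fdel = 98.9%"
```

Because Fdel depends only on the ratio ND/T, it is insensitive to any scaling that affects both rates equally, but it does depend on how well the two cutoff classes satisfy the stated assumptions.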

Figure 7.

Fraction of deleterious back-splicing

(A) Estimated fractions of deleterious back-splicing for each species when data from all tissues are combined, under different cutoffs for genes of the lowest and highest splicing amounts.

(B) Estimated fractions of deleterious back-splicing in each tissue of each species examined under the cutoffs boxed in (A), either when all genes are considered or when RIMS1 is excluded. Back-splicing of RIMS1 in the human brain is unusually abundant and may be beneficial.

Under these same cutoffs, we also estimated Fdel for each tissue in each species. All Fdel values are >73%; the lowest, 73.2%, is in the human brain (Figure 7B). Upon examination of the 29 genes of >500 SRPM in the human brain, we found a gene (RIMS1) with an unusually high back-splicing rate of 8.7%. Because the circRNAs produced from RIMS1 were reported to be potentially functional in neurons (Chen et al., 2019; Ji et al., 2019; You et al., 2015), we re-estimated Fdel after removing RIMS1. Now, Fdel = 91.6% in the human brain and remains virtually unchanged in the other tissues or species (Figure 7B). Although Fdel varies among tissues, we observed a median Fdel of 98.8%, 98.4%, and 98.0% in the human, macaque, and mouse, respectively (Figure 7B), which are similar to the Fdel estimates from all tissues together (Figure 7A). Note that our Fdel estimates are conservative because very slightly deleterious back-splicing may not have been fully removed by selection in the genes of the highest splicing amounts and because some strongly deleterious back-splicing may have been removed by selection even in the genes of the lowest splicing amounts. Note also that Fdel measures the fraction of back-splicing that is deleterious before the action of purifying selection. Because some deleterious back-splicing has been removed by selection, the fraction of observed back-splicing that is deleterious should be lower. The deleterious fraction of observed back-splicing (Odel) can be estimated by regarding the overall back-splicing rate from all genes (Figure 6A) as T and the back-splicing rate from genes of >500 SRPM as ND. When the data from all tissues are merged, Odel is 97.5% for human, 99.7% for macaque, and 98.9% for mouse. The Odel values are only slightly lower than the corresponding Fdel values because apparently most deleterious back-splicing has not been selectively purged due to the preponderance of genes of relatively low splicing amounts. 
It has been estimated that the fraction of linear-splicing that arises from error is about 70% in humans (Saudemont et al., 2017), suggesting that error accounts for a much greater proportion of back-splicing than linear-splicing. Taken together, our estimation demonstrates that the vast majority of all or observed back-splicing is deleterious, which is broadly consistent with the finding of virtually no excess in between-species sharing of back-spliced acceptors or donors over the chance expectation and the finding of no purifying selection protecting back-splicing signals.
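
As a check on the arithmetic, the human Odel can be reproduced from numbers quoted in the text. The worked equation below assumes that ND for Odel is the same value quoted for the Fdel example (6.46 × 10^−5, genes of >500 SRPM, all tissues combined) and takes T as the rounded overall human back-splicing rate of 0.26% (Figure 6A).

```latex
O_{\mathrm{del}} \;=\; 1 - \frac{ND}{T}
\;=\; 1 - \frac{6.46 \times 10^{-5}}{2.6 \times 10^{-3}}
\;\approx\; 1 - 0.025
\;=\; 0.975
```

Under these assumptions the result matches the reported human Odel of 97.5%.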

DISCUSSION

The discovery of a large number of circRNAs from many eukaryotes and the demonstration that some of them possess biochemical activities have led to the prevailing view that circRNA production is generally beneficial. In this work, we challenged this adaptive view by providing comprehensive evidence for an alternative view that back-splicing that leads to circRNA production arises mostly from splicing error and is detrimental. Our evidence, based on the transcriptomes of 11 tissues from each of the human, macaque, and mouse, comprises the findings that (1) the back-splicing rate is orders of magnitude lower than the linear-splicing rate, (2) the back-splicing rate in a gene decreases with the splicing amount of the gene, (3) the back-splicing rate of a gene in a tissue tends to decrease with the splicing amount of the gene in the tissue when multiple tissues are compared, (4) there is little between-species sharing of back-spliced acceptors or donors beyond the chance expectation, (5) purifying selection protecting the motifs for back-splicing is lacking, and (6) the overall rate of back-splicing in a species declines with its effective population size. None of these observations are predicted a priori by the adaptive hypothesis, but all fit the predictions of the error hypothesis. Although most of the evidence is derived from the 11 tissues analyzed, the above fifth line of evidence is based on the polymorphism and divergence of genome sequences and so is not limited to the specific tissues analyzed. Together, the empirical evidence strongly suggests that most mammalian back-splicing events, and by inference most circRNA production, are detrimental rather than beneficial. Aside from the above evidence, there are several observations reported in the literature that are consistent with the error hypothesis or inconsistent with the adaptive hypothesis. 
First, the proposal that binding to microRNAs is a general function of circRNAs has been challenged (Enuka et al., 2016; Guo et al., 2014; Ragan et al., 2019). For example, Ragan et al. (2019) reported that only about 12% of circRNAs contain microRNA binding sites. Guo et al. (2014) found only two circRNAs with more microRNA binding sites than expected by chance, and Enuka et al. (2016) found no enrichment of microRNA binding sites in circRNAs in general. Second, although some circRNAs can act as the sponge of RBPs (Zang et al., 2020), this activity is not generalizable because a genome-wide analysis did not find enriched binding sites of RBPs in circRNAs compared with their corresponding linear mRNAs (You et al., 2015). Third, researchers failed to detect a significant association of circRNAs with polysomes (Guo et al., 2014; You et al., 2015), arguing against the notion that circRNAs generally function through their protein products (Kristensen et al., 2019; Li et al., 2018; Pamudurti et al., 2017). circRNA translation depends on having internal ribosome entry sites (IRESs) (Pamudurti et al., 2017), but <1.5% of circRNAs contain IRESs (Fan et al., 2019). Recently, however, IRES-like elements were found in many circRNAs, and hundreds of circRNA-encoded peptides were identified from mass spectrometry data (Fan et al., 2019). Notwithstanding, the translation of a circRNA does not prove that it confers an advantage because the translation itself could be an error due to spurious translational initiation and the protein product could be functionless or even toxic. In fact, the sequence similarity of circRNA orthologs between human and mouse is no higher than that of their neighboring linear exons, suggesting a lack of circRNA-specific purifying selection (Guo et al., 2014). 
Finally, it is worth stressing that the existence of regulation or regulatory mechanisms of circRNA biogenesis or degradation (Conn et al., 2015; Liang et al., 2017; Zhang et al., 2014) is not evidence for circRNA functionality because this phenomenon can arise as a byproduct of other biological processes. As an analogy, simply because trash is removed once a week on a particular day, the amount of trash in a house would exhibit a cyclic pattern resembling regulation, but the regulation does not prove that the trash is useful. It is worth noting that the brain shows more back-spliced genes and a higher back-splicing rate than the other 10 tissues studied here (Figure 1). Nevertheless, for the following four reasons, this observation does not necessarily support the proposal that circRNAs play important roles in the brain (Rybak-Wolf et al., 2015). First, even in the brain, the back-splicing rate is still very low (Figure 1), and other analyses (Figures 2, 3, 4, 5, and 6) do not suggest that the brain is an outlier of the general patterns of circRNAs observed across tissues. Second, there is no between-species sharing of back-spliced acceptors or donors beyond the chance expectation in the brain (Figure 4C and 4E), and the fraction of deleterious back-splicing is at least 73.2% (Figure 7B) in the brain. Third, alternative splicing and its regulation are more complex in the brain than in other tissues (Raj and Blencowe, 2015), which might increase the chance of splicing error. Fourth, because brain cells typically have longer lifespans than other cells (Magrassi et al., 2013), circRNAs, which are exceptionally stable, might accumulate to a higher level in the brain than in other tissues, explaining why the back-splicing rate looks higher (Figure 1A) and the fraction of deleterious back-splicing (Figure 7B) looks lower in the brain than in other tissues. 
Taken together, our findings and those discussed above provide strong evidence that most back-splicing events are detrimental splicing errors and that most circRNAs do not have beneficial functions. Thus, circRNA is generally a class of junk RNA (Brosius, 2005; Palazzo and Lee, 2015). This conclusion requires a paradigm shift in circRNA research. Instead of assuming that all or most circRNAs are functional, as is common in today’s research, our conclusion requires that we treat all circRNAs as non-functional until proven otherwise. Nevertheless, our conclusion does not preclude the occasional observation of circRNAs that possess beneficial functions, as has been suggested for CDR1as/CiRS-7 and several other circRNAs mentioned in the Introduction. In this context, it is highly valuable to identify functional circRNAs even though they account for only a small fraction of all circRNAs. Although identifying functional circRNAs is beyond the scope of the present study, we explored the possibility that highly expressed genes with exceptionally high back-splicing rates host functional circRNAs. Specifically, we performed a regression between the back-splicing rate and splicing amount across genes in Figure 2A and calculated Cook’s distance for each gene (see STAR Methods). We then defined a gene as an outlier if its Cook’s distance is more than 4 times the mean Cook’s distance of all genes. Among these outliers, 15 genes have a back-splicing rate of at least 10% and an expression level of at least 10 TPM (Table S2). Interestingly, these 15 genes include the host genes of 2 of the aforementioned known functional circRNAs, namely, circ-ZNF609 and circ-FBXW7 (Figure 2A). We suggest that the circRNAs from the remaining 13 genes be studied in the future as candidates of functional circRNAs and that the approach proposed above for identifying potentially functional circRNAs be systematically evaluated. 
Our findings on back-splicing, along with previous studies on linear-splicing (Melamud and Moult, 2009; Pickrell et al., 2010; Saudemont et al., 2017), demonstrate that splicing is generally error prone. Together, they echo other recent findings that a number of steps in transcription and translation are fallible, including, for example, transcriptional initiation, RNA synthesis, polyadenylation, posttranscriptional modification, translational initiation, translational elongation or decoding, translational termination, and posttranslational modification (Gout et al., 2017; Landry et al., 2009; Li and Zhang, 2019; Liu and Zhang, 2018a, 2018b; Park and Zhang, 2011; Ribas de Pouplana et al., 2014; Xu et al., 2019; Xu and Zhang, 2014, 2018, 2020b). These findings indicate that cellular life is far less perfected than is commonly portrayed, which has broad and profound implications for biology (Lynch, 2007, 2014; Warnecke and Hurst, 2011; Zhang and Yang, 2015).

STAR★METHODS

RESOURCE AVAILABILITY

Lead contact

Further information and requests for resources and reagents should be directed to and will be fulfilled by the Lead Contact, Jianzhi Zhang (jianzhi@umich.edu).

Materials availability

This study did not generate new unique reagents. This paper analyzes existing, publicly available data. The accession numbers for the datasets are listed in the Key resources table.

KEY RESOURCES TABLE

REAGENT or RESOURCE	SOURCE	IDENTIFIER
Deposited data
RiboMinus RNA-seq data	Ji et al., 2019	https://ngdc.cncb.ac.cn/bioproject/browse/PRJCA000751
RNA-seq data	Ji et al., 2019	https://ngdc.cncb.ac.cn/bioproject/browse/PRJCA000751
RNase R+ data	Ji et al., 2019	https://ngdc.cncb.ac.cn/bioproject/browse/PRJCA000751
Polymorphism data	IGSR	ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/
Genome assembly and annotation data	ENSEMBL	http://may2017.archive.ensembl.org/index.html
Software and algorithms
R	The R Foundation	https://www.r-project.org/
Perl	The Perl Foundation	https://www.perl.org/
Python	Python Software Foundation	https://www.python.org/
LiftOver	UCSC	https://genome.ucsc.edu/util.html
BioMart	ENSEMBL	http://useast.ensembl.org/biomart/martview//38de6b77bceb11d76acd2a1d1b231382
BWA	Li and Durbin, 2009	http://bio-bwa.sourceforge.net/
TopHat2	Kim et al., 2013	https://ccb.jhu.edu/software/tophat/index.shtml
STAR	Dobin et al., 2013	https://github.com/alexdobin/STAR
CIRCexplorer2	Zhang et al., 2016	https://circexplorer2.readthedocs.io/en/latest/
CIRI2	Gao et al., 2018	http://159.226.67.237:8080/new/download_file.php

This paper does not report original code. Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

EXPERIMENTAL MODEL AND SUBJECT DETAILS

The original datasets analyzed in the present study are provided in the Key resources table.

METHOD DETAILS

Linear-splicing and back-splicing

Because the regular RNA-seq captures poly(A)-enriched linear RNAs, many other RNA species that are not linear or do not have poly(A) tails are lost. RNase R treatment during the RNA library construction can efficiently enrich back-spliced circRNAs but will filter out linear RNAs. Thus, neither type of datasets is appropriate for our study. To identify back-splicing and linear-splicing simultaneously, we chose data from RNA-seq experiments that only remove rRNAs by RiboMinus treatment during the library construction. The recently published RNA-seq data by Ji et al. (2019), comprising deeply sequenced transcriptomes from 11 tissues each from humans, macaques, and mice (Table S1), fulfill our requirements. We downloaded the data from NGDC (https://ngdc.cncb.ac.cn/). Back-splicing is inferred from RNA-seq reads that span back-spliced junctions and therefore map non-linearly to the genome. Several pipelines and algorithms have been developed to identify specifically non-linear reads and predict the landscape of circRNAs (Szabo and Salzman, 2016). According to comparisons of many circRNA prediction pipelines (Hansen et al., 2016; Zeng et al., 2017), we chose CIRCexplorer2 (Zhang et al., 2016) and CIRI2 (Gao et al., 2018) to infer back-splicing. The former tool uses STAR (Dobin et al., 2013) as the mapper and is dependent on gene annotations, while the latter is based on the mapper BWA (Li and Durbin, 2009) and predicts back-splicing de novo. We first used CIRCexplorer2 under default parameters to identify back-spliced sites and associated back-spliced reads. We then added back-spliced sites identified by CIRI2 under default parameters that were not reported by CIRCexplorer2 and added the associated back-spliced reads. Linear-splicing was retrieved by Tophat2 (Kim et al., 2013) using default parameters, including annotated and newly identified splicing junctions. 
The genome assemblies used were GRCh38 for human, Mmul 8.0.1 for macaque, and GRCm38 for mouse, all downloaded with gene annotations from Ensembl release 89. Although the genomic annotation is less extensive for the macaque than for the human and mouse, this variation should not bias our analyses because (1) we used de novo identification of back- and linear-splicing sites in addition to annotations and (2) most of our analyses were within-species comparisons. Because the start and end of a gene are variable due to alternative transcriptional initiation and alternative polyadenylation, we defined a gene as extending from the nucleotide that is 500 bp upstream of the annotated 5′-most transcriptional start site (Forrest et al., 2014) to the nucleotide that is 1000 bp downstream of the annotated 3′-most polyadenylation site (Derti et al., 2012). Any splicing junction located in a defined gene region was considered to belong to the gene. Only splicing junctions uniquely mapped to a gene were considered in the study.
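
The gene-region definition and junction assignment described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' code: the names `GeneRegion`, `define_region`, and `assign_junction` are hypothetical, coordinates are taken to be 1-based and inclusive, junction strand is ignored, and "uniquely mapped" is interpreted as falling inside exactly one padded gene region.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

UPSTREAM_PAD = 500      # bp upstream of the 5'-most annotated TSS
DOWNSTREAM_PAD = 1000   # bp downstream of the 3'-most annotated poly(A) site


@dataclass(frozen=True)
class GeneRegion:
    gene_id: str
    chrom: str
    strand: str
    start: int  # inclusive genomic coordinate
    end: int    # inclusive genomic coordinate


def define_region(gene_id: str, chrom: str, strand: str,
                  tss_positions: Sequence[int],
                  polya_positions: Sequence[int]) -> GeneRegion:
    """Pad a gene on both sides, respecting transcription direction."""
    if strand == "+":
        start = min(tss_positions) - UPSTREAM_PAD
        end = max(polya_positions) + DOWNSTREAM_PAD
    elif strand == "-":
        # On the minus strand the 5' end has the larger genomic coordinate.
        start = min(polya_positions) - DOWNSTREAM_PAD
        end = max(tss_positions) + UPSTREAM_PAD
    else:
        raise ValueError("strand must be '+' or '-'")
    return GeneRegion(gene_id, chrom, strand, max(start, 1), end)


def assign_junction(chrom: str, donor: int, acceptor: int,
                    regions: List[GeneRegion]) -> Optional[str]:
    """Return the gene id if the junction lies in exactly one region, else None."""
    low, high = sorted((donor, acceptor))
    hits = [r.gene_id for r in regions
            if r.chrom == chrom and r.start <= low and high <= r.end]
    return hits[0] if len(hits) == 1 else None


# Hypothetical annotations for two overlapping genes on opposite strands.
regions = [
    define_region("geneA", "chr1", "+", [1000, 1200], [5000, 5400]),
    define_region("geneB", "chr1", "-", [9000], [6000, 6300]),
]
only_a = assign_junction("chr1", 3000, 2000, regions)     # donor downstream of acceptor
ambiguous = assign_junction("chr1", 5500, 6000, regions)  # lies in both padded regions
only_b = assign_junction("chr1", 7000, 8000, regions)
```

Sorting the donor and acceptor coordinates lets the same function handle linear junctions and back-spliced junctions, in which the donor lies downstream of the acceptor.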

Splicing amount

The total number of linear- and back-spliced reads mapped to a splice site is the splicing amount of the splice site. The total number of linear- and back-spliced reads mapped to all splice sites of a gene is the splicing amount of the gene. To allow the splicing amount of a gene to be compared among samples, we computed the number of spliced reads per million total reads in the sample (SRPM). The back-splicing rate at a splice site, in a gene, or in a supergene was calculated as the number of back-spliced reads mapped to the site, gene, or supergene, relative to the total number of spliced reads mapped to the site, gene, or supergene. We used RNA-seq data downloaded from NGDC to measure gene expression levels (Table S1). The reads were mapped to the human (GRCh38), macaque (Mmul 8.0.1), or mouse (GRCm38) genome using TopHat2 (Kim et al., 2013). Fragments per kilobase of transcript per million mapped reads (FPKM) of a gene was first calculated by Cufflinks (Trapnell et al., 2012) and then converted to TPM using the formula TPM = (FPKM × 10^6)/(sum of FPKM) (Li and Dewey, 2011). Only genes expressed (TPM ≥ 1) and spliced (number of spliced reads ≥ 1) were considered in our study.
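
The quantities defined in this section are simple ratios. The Python sketch below illustrates them under the definitions given in the text; all function names and the example counts are hypothetical and are not taken from the study's data or code.

```python
from typing import Dict


def srpm(spliced_reads: int, total_reads: int) -> float:
    """Spliced reads per million total reads in a sample."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return spliced_reads * 1e6 / total_reads


def back_splicing_rate(back_spliced: int, linear_spliced: int) -> float:
    """Back-spliced reads divided by all spliced reads (back + linear)."""
    spliced = back_spliced + linear_spliced
    if spliced == 0:
        raise ValueError("no spliced reads: rate is undefined")
    return back_spliced / spliced


def fpkm_to_tpm(fpkm: Dict[str, float]) -> Dict[str, float]:
    """TPM = FPKM * 10^6 / (sum of FPKM over all genes in the sample)."""
    total = sum(fpkm.values())
    if total <= 0:
        raise ValueError("sum of FPKM must be positive")
    return {gene: value * 1e6 / total for gene, value in fpkm.items()}


def is_considered(tpm: float, spliced_reads: int) -> bool:
    """Keep only genes that are expressed (TPM >= 1) and spliced (>= 1 read)."""
    return tpm >= 1 and spliced_reads >= 1


# Hypothetical example values.
example_srpm = srpm(1_000, 50_000_000)       # 1,000 spliced reads in 50 million reads
example_rate = back_splicing_rate(3, 997)    # 3 back-spliced among 1,000 spliced reads
example_tpm = fpkm_to_tpm({"geneA": 10.0, "geneB": 30.0})
```

The same `back_splicing_rate` function applies at the level of a splice site, a gene, or a supergene; only the set of reads that is summed changes.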

Corrections of unequal surveys of splicing events among genes

Due to the variation of the splicing amount among genes, splicing is surveyed more extensively for some genes than for others by RiboMinus RNA-seq. To remove the potential influence of this unequal survey, we used two different approaches. The first is the supergene approach. Unless otherwise noted, we first ranked all genes by their splicing amounts. We then grouped the genes into 10 bins representing 10 supergenes, requiring the total splicing amount per bin to be the same for all bins. Numbers of back-spliced, linear-spliced, and spliced reads were respectively summed up across all genes in the bin. The supergene approach cannot be used under certain circumstances. Under these circumstances, we used downsampling. For example, when comparing a pair of paralogous genes in a sample, we randomly sampled spliced reads from the gene with the higher splicing amount down to the number observed in the gene with the lower splicing amount. This downsampling equalized the survey depth of splicing between the two genes. The supergene approach is preferred over downsampling when both can be used, because the former uses all data while the latter uses only part of the data. In the analysis among 11 tissues, because many back-spliced genes would have no back-spliced reads in multiple tissues upon downsampling, causing a loss of statistical power, we combined the supergene approach with downsampling, as described in Results.
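
Both corrections can be sketched in a few lines of Python. The sketch below is one possible reading of the description, not the authors' implementation: `build_supergenes` places each ranked gene into a bin according to the midpoint of its cumulative splicing amount, so bin totals are only approximately equal because genes are indivisible, and `downsample` draws reads without replacement. The function names and the toy data are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Gene = Tuple[str, int, int]  # (gene_id, back_spliced_reads, linear_spliced_reads)


def build_supergenes(genes: List[Gene], n_bins: int = 10) -> List[Dict[str, float]]:
    """Rank genes by splicing amount and pool them into bins of similar total amount."""
    ranked = sorted(genes, key=lambda g: g[1] + g[2])
    total = sum(back + linear for _, back, linear in ranked)
    if total == 0:
        raise ValueError("no spliced reads in any gene")
    bins = [{"back": 0, "linear": 0, "n_genes": 0} for _ in range(n_bins)]
    cumulative = 0
    for _, back, linear in ranked:
        # Assign the gene by the midpoint of its cumulative read interval.
        midpoint = cumulative + (back + linear) / 2
        index = min(int(midpoint * n_bins / total), n_bins - 1)
        bins[index]["back"] += back
        bins[index]["linear"] += linear
        bins[index]["n_genes"] += 1
        cumulative += back + linear
    for b in bins:
        spliced = b["back"] + b["linear"]
        b["rate"] = b["back"] / spliced if spliced else float("nan")
    return bins


def downsample(back: int, linear: int, target: int,
               rng: random.Random) -> Tuple[int, int]:
    """Randomly keep `target` spliced reads; return (back, linear) counts kept."""
    total = back + linear
    if target >= total:
        return back, linear
    picked = rng.sample(range(total), target)   # sampling without replacement
    kept_back = sum(1 for i in picked if i < back)  # indices below `back` are back-spliced
    return kept_back, target - kept_back


# Toy data: 20 genes, each with 1 back-spliced and 99 linear-spliced reads.
toy_genes = [(f"gene{i}", 1, 99) for i in range(20)]
supergenes = build_supergenes(toy_genes, n_bins=10)

# Equalize a paralog with 5,000 spliced reads to a partner with 200 spliced reads.
rng = random.Random(0)
kept_back, kept_linear = downsample(back=50, linear=4950, target=200, rng=rng)
```

Passing an explicit `random.Random` instance makes the downsampling reproducible, which is useful when the procedure is repeated across many gene pairs.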

Cook’s distance

Cook’s distance (D) is used in regression analysis to find influential outliers in a set of predictor variables (Cook, 1977). An observation with Cook’s distance larger than four times the mean Cook’s distance was deemed an outlier in this study. Cook’s distance of gene i is

$$D_i = \frac{\sum_{j=1}^{n} \left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{p \cdot \mathrm{MSE}},$$

where $\hat{y}_j$ is the jth gene’s fitted response value, $\hat{y}_{j(i)}$ is the jth gene’s fitted response value when gene i is removed, n is the number of genes, MSE is the mean squared error, and p is the number of coefficients in the regression model.
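
The definition above can be computed directly by refitting the regression with each gene left out. The following Python sketch assumes a simple linear regression with one predictor (so p = 2) and takes MSE as the residual sum of squares divided by n − p; both are assumptions, because the exact model and any data transformation are not specified here. The function names and the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def cooks_distances(x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Cook's distance of each observation, by explicit leave-one-out refits."""
    x, y = list(x), list(y)
    n, p = len(x), 2  # p: intercept and slope
    intercept, slope = fit_line(x, y)
    fitted = [intercept + slope * xi for xi in x]
    mse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - p)
    distances = []
    for i in range(n):
        # Refit without observation i, then predict all n observations.
        a, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        shift = sum((fitted[j] - (a + b * x[j])) ** 2 for j in range(n))
        distances.append(shift / (p * mse))
    return distances


def find_outliers(distances: Sequence[float], factor: float = 4.0) -> List[int]:
    """Indices whose Cook's distance exceeds `factor` times the mean distance."""
    mean_d = sum(distances) / len(distances)
    return [i for i, d in enumerate(distances) if d > factor * mean_d]


# Toy data: points on the line y = 2x, except the last one, which is far above it.
xs = [float(v) for v in range(1, 11)]
ys = [2.0 * v for v in xs]
ys[-1] = 40.0
distances = cooks_distances(xs, ys)
outlier_indices = find_outliers(distances)
```

The leave-one-out loop costs one refit per gene, which is transparent but slow for many genes; closed-form expressions based on leverage give the same values for ordinary least squares.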

Paralogs and orthologs

Paralogous genes were downloaded from Ensembl (release 89; May 2017) for the three species. We obtained 3,678 human protein-coding gene families with 51,657 pairs of paralogs, 3,912 macaque gene families with 46,718 pairs of paralogs, and 3,856 mouse gene families with 79,968 pairs of paralogs, respectively. We randomly selected from each gene family only one paralogous pair that exhibits a two-fold or greater difference in splicing amount to allow a sufficient statistical power. Orthologous genes among human, macaque, and mouse were downloaded from Ensembl (release 89; May 2017), and only one-to-one orthologous genes were considered in our analysis. The numbers of orthologs between human and macaque, between human and mouse, and between macaque and mouse are 19,754, 16,797, and 15,170, respectively. From these data, we obtained 14,882 one-to-one orthologs among the three mammals. To identify orthologous splice sites between species, we used the UCSC liftOver tool (https://genome.ucsc.edu/util.html) to align the splice sites of one species with the genome of another species.

Polymorphism and divergence

Human polymorphism data, including allele frequencies, from Interim Phase 3 of the 1000 Genomes project (Sudmant et al., 2015), were downloaded from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ (last accessed Feb. 28, 2020). This dataset comprises the genotypes of 2,504 individuals from 26 populations and includes a total of 78,136,341 autosomal SNPs. Only SNPs were included in the analysis. The nucleotide observed at a SNP was categorized as ancestral if it is the same as the nucleotide of the ‘AA’ field in the polymorphism VCF file; other nucleotides at the SNP are derived. The derived allele frequency at a SNP is the frequency of the derived allele at the SNP. Nucleotide differences at splice-acceptors and donors were based on a comparison between human (GRCh38) and macaque (Mmul 8.0.1) genomes through liftOver from UCSC.

QUANTIFICATION AND STATISTICAL ANALYSIS

Statistical analyses are described in Results, figure legends, and the above Method details section. R was used in statistical analysis (see Key resources table).

85 in total

1. Human coding RNA editing is generally nonadaptive.

Authors: Guixia Xu; Jianzhi Zhang
Journal: Proc Natl Acad Sci U S A Date: 2014-02-24 Impact factor: 11.205

2. Analysis of intron sequences reveals hallmarks of circular RNA biogenesis in animals.

Authors: Andranik Ivanov; Sebastian Memczak; Emanuel Wyler; Francesca Torti; Hagit T Porath; Marta R Orejuela; Michael Piechotta; Erez Y Levanon; Markus Landthaler; Christoph Dieterich; Nikolaus Rajewsky
Journal: Cell Rep Date: 2014-12-31 Impact factor: 9.423

3. The Output of Protein-Coding Genes Shifts to Circular RNAs When the Pre-mRNA Processing Machinery Is Limiting.

Authors: Dongming Liang; Deirdre C Tatomer; Zheng Luo; Huang Wu; Li Yang; Ling-Ling Chen; Sara Cherry; Jeremy E Wilusz
Journal: Mol Cell Date: 2017-11-22 Impact factor: 17.970

Review 4. Circular RNA: an emerging key player in RNA world.

Authors: Xianwen Meng; Xue Li; Peijing Zhang; Jingjing Wang; Yincong Zhou; Ming Chen
Journal: Brief Bioinform Date: 2017-07-01 Impact factor: 11.622

5. Human C-to-U Coding RNA Editing Is Largely Nonadaptive.

Authors: Zhen Liu; Jianzhi Zhang
Journal: Mol Biol Evol Date: 2018-04-01 Impact factor: 16.240

6. A promoter-level mammalian expression atlas.

Authors: Alistair R R Forrest; Hideya Kawaji; Michael Rehli; J Kenneth Baillie; Michiel J L de Hoon; Vanja Haberle; Timo Lassmann; Ivan V Kulakovskiy; Marina Lizio; Masayoshi Itoh; Robin Andersson; Christopher J Mungall; Terrence F Meehan; Sebastian Schmeier; Nicolas Bertin; Mette Jørgensen; Emmanuel Dimont; Erik Arner; Christian Schmidl; Ulf Schaefer; Yulia A Medvedeva; Charles Plessy; Morana Vitezic; Jessica Severin; Colin A Semple; Yuri Ishizu; Robert S Young; Margherita Francescatto; Intikhab Alam; Davide Albanese; Gabriel M Altschuler; Takahiro Arakawa; John A C Archer; Peter Arner; Magda Babina; Sarah Rennie; Piotr J Balwierz; Anthony G Beckhouse; Swati Pradhan-Bhatt; Judith A Blake; Antje Blumenthal; Beatrice Bodega; Alessandro Bonetti; James Briggs; Frank Brombacher; A Maxwell Burroughs; Andrea Califano; Carlo V Cannistraci; Daniel Carbajo; Yun Chen; Marco Chierici; Yari Ciani; Hans C Clevers; Emiliano Dalla; Carrie A Davis; Michael Detmar; Alexander D Diehl; Taeko Dohi; Finn Drabløs; Albert S B Edge; Matthias Edinger; Karl Ekwall; Mitsuhiro Endoh; Hideki Enomoto; Michela Fagiolini; Lynsey Fairbairn; Hai Fang; Mary C Farach-Carson; Geoffrey J Faulkner; Alexander V Favorov; Malcolm E Fisher; Martin C Frith; Rie Fujita; Shiro Fukuda; Cesare Furlanello; Masaaki Furino; Jun-ichi Furusawa; Teunis B Geijtenbeek; Andrew P Gibson; Thomas Gingeras; Daniel Goldowitz; Julian Gough; Sven Guhl; Reto Guler; Stefano Gustincich; Thomas J Ha; Masahide Hamaguchi; Mitsuko Hara; Matthias Harbers; Jayson Harshbarger; Akira Hasegawa; Yuki Hasegawa; Takehiro Hashimoto; Meenhard Herlyn; Kelly J Hitchens; Shannan J Ho Sui; Oliver M Hofmann; Ilka Hoof; Furni Hori; Lukasz Huminiecki; Kei Iida; Tomokatsu Ikawa; Boris R Jankovic; Hui Jia; Anagha Joshi; Giuseppe Jurman; Bogumil Kaczkowski; Chieko Kai; Kaoru Kaida; Ai Kaiho; Kazuhiro Kajiyama; Mutsumi Kanamori-Katayama; Artem S Kasianov; Takeya Kasukawa; Shintaro Katayama; Sachi Kato; Shuji Kawaguchi; Hiroshi Kawamoto; Yuki I Kawamura; 
Tsugumi Kawashima; Judith S Kempfle; Tony J Kenna; Juha Kere; Levon M Khachigian; Toshio Kitamura; S Peter Klinken; Alan J Knox; Miki Kojima; Soichi Kojima; Naoto Kondo; Haruhiko Koseki; Shigeo Koyasu; Sarah Krampitz; Atsutaka Kubosaki; Andrew T Kwon; Jeroen F J Laros; Weonju Lee; Andreas Lennartsson; Kang Li; Berit Lilje; Leonard Lipovich; Alan Mackay-Sim; Ri-ichiroh Manabe; Jessica C Mar; Benoit Marchand; Anthony Mathelier; Niklas Mejhert; Alison Meynert; Yosuke Mizuno; David A de Lima Morais; Hiromasa Morikawa; Mitsuru Morimoto; Kazuyo Moro; Efthymios Motakis; Hozumi Motohashi; Christine L Mummery; Mitsuyoshi Murata; Sayaka Nagao-Sato; Yutaka Nakachi; Fumio Nakahara; Toshiyuki Nakamura; Yukio Nakamura; Kenichi Nakazato; Erik van Nimwegen; Noriko Ninomiya; Hiromi Nishiyori; Shohei Noma; Shohei Noma; Tadasuke Noazaki; Soichi Ogishima; Naganari Ohkura; Hiroko Ohimiya; Hiroshi Ohno; Mitsuhiro Ohshima; Mariko Okada-Hatakeyama; Yasushi Okazaki; Valerio Orlando; Dmitry A Ovchinnikov; Arnab Pain; Robert Passier; Margaret Patrikakis; Helena Persson; Silvano Piazza; James G D Prendergast; Owen J L Rackham; Jordan A Ramilowski; Mamoon Rashid; Timothy Ravasi; Patrizia Rizzu; Marco Roncador; Sugata Roy; Morten B Rye; Eri Saijyo; Antti Sajantila; Akiko Saka; Shimon Sakaguchi; Mizuho Sakai; Hiroki Sato; Suzana Savvi; Alka Saxena; Claudio Schneider; Erik A Schultes; Gundula G Schulze-Tanzil; Anita Schwegmann; Thierry Sengstag; Guojun Sheng; Hisashi Shimoji; Yishai Shimoni; Jay W Shin; Christophe Simon; Daisuke Sugiyama; Takaai Sugiyama; Masanori Suzuki; Naoko Suzuki; Rolf K Swoboda; Peter A C 't Hoen; Michihira Tagami; Naoko Takahashi; Jun Takai; Hiroshi Tanaka; Hideki Tatsukawa; Zuotian Tatum; Mark Thompson; Hiroo Toyodo; Tetsuro Toyoda; Elvind Valen; Marc van de Wetering; Linda M van den Berg; Roberto Verado; Dipti Vijayan; Ilya E Vorontsov; Wyeth W Wasserman; Shoko Watanabe; Christine A Wells; Louise N Winteringham; Ernst Wolvetang; Emily J Wood; Yoko Yamaguchi; Masayuki 
Yamamoto; Misako Yoneda; Yohei Yonekura; Shigehiro Yoshida; Susan E Zabierowski; Peter G Zhang; Xiaobei Zhao; Silvia Zucchelli; Kim M Summers; Harukazu Suzuki; Carsten O Daub; Jun Kawai; Peter Heutink; Winston Hide; Tom C Freeman; Boris Lenhard; Vladimir B Bajic; Martin S Taylor; Vsevolod J Makeev; Albin Sandelin; David A Hume; Piero Carninci; Yoshihide Hayashizaki
Journal: Nature Date: 2014-03-27 Impact factor: 49.962

7. On the immortality of television sets: "function" in the human genome according to the evolution-free gospel of ENCODE.

Authors: Dan Graur; Yichen Zheng; Nicholas Price; Ricardo B R Azevedo; Rebecca A Zufall; Eran Elhaik
Journal: Genome Biol Evol Date: 2013 Impact factor: 3.416

8. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome.

Authors: Bo Li; Colin N Dewey
Journal: BMC Bioinformatics Date: 2011-08-04 Impact factor: 3.307

9. Insights into the biogenesis and potential functions of exonic circular RNA.

Authors: Chikako Ragan; Gregory J Goodall; Nikolay E Shirokikh; Thomas Preiss
Journal: Sci Rep Date: 2019-02-14 Impact factor: 4.379

Review 10. The emerging landscape of circular RNA in life processes.

Authors: Shibin Qu; Yue Zhong; Runze Shang; Xuan Zhang; Wenjie Song; Jørgen Kjems; Haimin Li
Journal: RNA Biol Date: 2016-08-11 Impact factor: 4.652

Cited by: 12 in total

Review 1. The emerging roles of circRNAs in cancer and oncology.

Authors: Lasse S Kristensen; Theresa Jakobsen; Henrik Hager; Jørgen Kjems
Journal: Nat Rev Clin Oncol Date: 2021-12-15 Impact factor: 66.675

Review 2. Gene product diversity: adaptive or not?

Authors: Jianzhi Zhang; Chuan Xu
Journal: Trends Genet Date: 2022-05-28 Impact factor: 11.821

3. Exploring the cellular landscape of circular RNAs using full-length single-cell RNA sequencing.

Authors: Wanying Wu; Jinyang Zhang; Xiaofei Cao; Zhengyi Cai; Fangqing Zhao
Journal: Nat Commun Date: 2022-06-10 Impact factor: 17.694

Review 4. Accentuating CircRNA-miRNA-Transcription Factors Axis: A Conundrum in Cancer Research.

Authors: Deepti Singh; Prashant Kesharwani; Nabil A Alhakamy; Hifzur R Siddique
Journal: Front Pharmacol Date: 2022-01-11 Impact factor: 5.810

5. CSCD2: an integrated interactional database of cancer-specific circular RNAs.

Authors: Jing Feng; Wenbo Chen; Xin Dong; Jun Wang; Xiangfei Mei; Jin Deng; Siqi Yang; Chenjian Zhuo; Xiaoyu Huang; Lin Shao; Rongyu Zhang; Jing Guo; Ronghui Ma; Juan Liu; Feng Li; Ying Wu; Leng Han; Chunjiang He
Journal: Nucleic Acids Res Date: 2022-01-07 Impact factor: 16.971

Review 6. Non-Darwinian Molecular Biology.

Authors: Alexander F Palazzo; Nevraj S Kejiou
Journal: Front Genet Date: 2022-02-16 Impact factor: 4.599

Review 7. Role of Circular RNAs in the Regulation of Immune Cells in Response to Cancer Therapies.

Authors: Ángeles Carlos-Reyes; Susana Romero-Garcia; Estefania Contreras-Sanzón; Víctor Ruiz; Heriberto Prado-Garcia
Journal: Front Genet Date: 2022-02-02 Impact factor: 4.599

8. FMRP ligand circZNF609 destabilizes RAC1 mRNA to reduce metastasis in acral melanoma and cutaneous melanoma.

Authors: Qingfeng Shang; Haizhen Du; Xiaowen Wu; Qian Guo; Fenghao Zhang; Ziqi Gong; Tao Jiao; Jun Guo; Yan Kong
Journal: J Exp Clin Cancer Res Date: 2022-05-10

9. Context-specific effects of sequence elements on subcellular localization of linear and circular RNAs.

Authors: Maya Ron; Igor Ulitsky
Journal: Nat Commun Date: 2022-05-05 Impact factor: 17.694

Review 10. Non-coding RNAs and epithelial mesenchymal transition in cancer: molecular mechanisms and clinical implications.

Authors: Hashem Khanbabaei; Saeedeh Ebrahimi; Juan Luis García-Rodríguez; Zahra Ghasemi; Hossein Pourghadamyari; Milad Mohammadi; Lasse Sommer Kristensen
Journal: J Exp Clin Cancer Res Date: 2022-09-16