
The vast, conserved mammalian lincRNome.

David Managadze, Alexander E Lobkovsky, Yuri I Wolf, Svetlana A Shabalina, Igor B Rogozin, Eugene V Koonin.

Abstract

We compare the sets of experimentally validated long intergenic non-coding (linc)RNAs from human and mouse and apply a maximum likelihood approach to estimate the total number of lincRNA genes as well as the size of the conserved part of the lincRNome. Under the assumption that the sets of experimentally validated lincRNAs are random samples of the lincRNomes of the corresponding species, we estimate the total lincRNome size at approximately 40,000 to 50,000 species, at least twice the number of protein-coding genes. We further estimate that the fraction of the human and mouse euchromatic genomes encoding lincRNAs is more than twofold greater than the fraction of protein-coding sequences. Although the sequences of most lincRNAs are much less strongly conserved than protein sequences, the extent of orthology between the lincRNomes is unexpectedly high, with 60 to 70% of the lincRNA genes shared between human and mouse. The orthologous mammalian lincRNAs can be predicted to perform equivalent functions; accordingly, it appears likely that thousands of evolutionarily conserved functional roles of lincRNAs remain to be characterized.

Substances: RNA, Long Noncoding

Year: 2013 PMID: 23468607 PMCID: PMC3585383 DOI: 10.1371/journal.pcbi.1002917

Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475

Introduction

The great majority of mammalian genome sequences are transcribed, at least occasionally, a phenomenon known as pervasive transcription [1]–[4]. More specifically, tiling array analyses of several human chromosomes have shown that over 90% of the bases are transcribed in at least one cell type [1], [5]–[8]. The analogous analysis in mouse has demonstrated transcription for over 60% of the genome [9]–[11]. Among the transcripts there are numerous long intergenic non-coding RNAs (lincRNAs), i.e. RNA molecules greater than 200 nucleotides in length that are encoded outside other identified genes. Some of the lincRNAs have been shown to perform various regulatory roles but the majority remain functionally uncharacterized [7], [12]–[17]. Furthermore, the fraction of the genome allotted to lincRNAs remains unknown.

A popular view that the vast majority of lincRNAs are by-products of background transcription, “simply the noise emitted by a busy machine” [18], [19], is rooted in their typically low abundance and poor evolutionary conservation compared to protein-coding sequences and small RNAs such as miRNAs and snoRNAs [20]. However, some of the lincRNAs do contain strongly conserved regions [21], and most lincRNAs show reduced substitution and insertion/deletion rates suggestive of purifying selection [12], [22], [23].

Given the general lack of strong sequence conservation, identification of lincRNAs on a genome scale relies on expression analysis, which makes comprehensive characterization of the mammalian lincRNome an elusive goal. The combination of different experimental approaches applied to transcriptomes of several species has resulted in continuous discovery of new transcripts [24], with the FANTOM project alone cataloguing more than 30,000 putative long non-coding transcripts in mouse tissues by full-length cDNA cloning [11], [25]. The Support Vector Machine method has been applied to classify transcripts from the FANTOM3 project into coding and non-coding ones and accordingly estimate the number of long non-coding RNAs in mouse. This analysis has led to the identification of 14,000 long non-coding RNAs and an estimate of the total number of such RNAs in the FANTOM3 data at approximately 28,000 [26].

Here we re-analyze the most reliable available sets of human and mouse lincRNAs using the latest next generation sequencing (RNAseq) data and apply a maximum likelihood approach to obtain a robust estimate of the size of the mammalian lincRNome. The results suggest that mammalian genomes are likely to encode at least twice as many lincRNAs as proteins.

Results

Estimation of the sizes of human and mouse lincRNomes

We performed comparative analysis of the recently reported validated sets of 4662 human lincRNAs [27] and 4156 mouse lincRNAs [12], [20], [23] (see Methods for details) in an attempt to produce robust estimates of the human and mouse lincRNome sizes, and to measure the turnover of lincRNA genes in mammalian evolution. The validated sets consist of lincRNA species for which a specific profile of expression across tissues – and hence distinct functionality – are supported by multiple lines of evidence. Assuming that these sets of lincRNAs are random samples from human and mouse lincRNomes, comparison of the validated sets should yield robust estimates of the lincRNome size for each species. For this analysis, we deliberately chose to employ the validated sets only rather than the available larger sets of reported putative lincRNAs in order to reduce the effect of transcriptional noise and other artifacts. A substantial fraction of the vast mammalian transcriptome, most likely the lower expressed transcripts, is expected to be non-functional. Therefore, to minimize the contribution of transcriptional noise, cut-off values were imposed on expression levels of lincRNA genes and their putative orthologs that were used for the lincRNome size estimation. Similarly, a series of cut-off values was applied for the fraction of indels in pairwise genomic alignments (see Methods for details). A computational pipeline was developed to compare the sets of validated lincRNAs from human and mouse and to identify expressed orthologs by mapping the sequences to the respective counterpart genome and searching the available RNAseq data [28] (Figure 1). We then applied a maximum likelihood (ML) technique to estimate the total number of lincRNA genes in the human and mouse genomes as well as the number of orthologous lincRNA genes (see Online Methods). The following simplifying assumptions were made:

Figure 1

Computational pipeline to characterize the lincRNome.

The subset of orthologous lincRNAs (Kb) was obtained by comparing genomic positions of mouse and human lincRNA genes (minimal overlap 100 nucleotides), with further manual inspection of the genomic alignments. This comparison yielded 196 unique orthologous pairs of human and mouse lincRNA genes (Kb). Of the 4662 human lincRNAs (Lh), corresponding alignable regions in mouse were detected for 3529. These sequences were designated putative orthologs and checked for evidence of expression using RNAseq data for mouse tissues. Of the 3369 putative lincRNAs for which the exon models could be determined, 2872 showed expression level greater than zero (Kh). Similarly, the subset of mouse lincRNAs with expressed putative orthologs (Km) was identified by searching for evidence of expression in human tissues. Of the 4156 mouse lincRNAs (Lm), corresponding alignable regions with expression level greater than zero were identified in human for 3157. After applying ORF (<120 nucleotides), indel and expression thresholds (see Methods for details), final results (Figure 2 and Table 1) were obtained using a Maximum Likelihood Model (see Methods for details) with Lm, Lh, Km, Kh and Kb as input parameters (shown by dashed arrows) to estimate the size of the human lincRNome (Nh), the mouse lincRNome (Nm) and the orthologous subset of the two lincRNomes (Nb).


Figure 2

The human and mouse lincRNomes.

Table 1

RPKM-based estimates of the numbers of all and orthologous lincRNAs.

	ORF thresholds
	<90 nt	<120 nt	<150 nt	None
Lh	2,961	3,603	3,989	4,662
Lm	2,806	3,332	3,644	4,156
Kh	1,655	2,030	2,249	2,641
Km	1,947	2,308	2,531	2,888
Kb	130	155	170	196
Nh	44,346±7308	53,649±8099	59,389±8563	68,693±9224
Nm	35,722±5817	43,638±6516	48,207±6876	55,999±7444
Nb	24,786±4007	30,227±4481	33,483±4741	38,914±5135
%conservation(human/mouse)	56/69	56/69	56/69	57/69

Two expression and four indel thresholds were applied to putative orthologous lincRNA genes (Kh and Km).

(1) A lincRNA sequence in one species has at most one ortholog in the other species (that is, lineage-specific duplications are disregarded).
(2) The sets of experimentally validated lincRNAs are random samples from the complete sets of lincRNAs (lincRNomes) of the corresponding species.
(3) The experimentally validated lincRNA sets for human and mouse are uncorrelated with each other.

Let Lh and Lm be the sizes of the experimentally validated sets of lincRNAs for human and mouse, respectively. Also let Kh be the number of confirmed human lincRNAs that have an expressed orthologous sequence in mouse and Km be the corresponding number of mouse lincRNAs. Finally, Kb is the number of confirmed, expressed human lincRNAs whose orthologs in mouse are also confirmed lincRNAs. If the orthology relations between the human and mouse lincRNAs are strictly one-to-one, the number of confirmed mouse lincRNAs for which the human ortholog is also a confirmed lincRNA should be Kb as well. This is indeed the case in practice, with a few exceptions.

Given assumption (1), the lincRNAs can be partitioned into three pools: i) those present in both species, pool size Nb; ii) those unique to human, Nh − Nb; and iii) those unique to mouse, Nm − Nb; here Nh and Nm are the total sizes of the complete human and mouse lincRNomes, respectively. Assumption (2) allows us to compute the probability P of observing a particular set of Kh, Km and Kb simply by counting the number of possible samples of Lh and Lm lincRNAs drawn at random from the respective pools of Nh and Nm that result in the given set of Kh, Km and Kb values (Eq. 1). Maximizing the probability P in Eq. (1) with respect to Nh, Nm and Nb yields the estimates of Nh, Nm and Nb given in Eq. (2) (see Methods for details).

To assess the robustness of the estimates, ranges of open reading frame size thresholds used to eliminate putative protein-coding genes and RPKM (reads per kilobase of exon model per million mapped reads [29]) thresholds used to gauge the expression level were employed (Tables 1 and 2).
The ML estimates converged at approximately 50,000 lincRNAs encoded in the human genome and approximately 40,000 lincRNAs encoded in the mouse genome (Table 1 and Figure 2). These are conservative estimates given the use of strict thresholds on predicted open reading frame size and expression level (Table 1), so the actual numbers of lincRNAs are expected to be even greater.
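As a hedged illustration, the point estimates in Table 1 can be reproduced to within rounding by simple capture-recapture style ratios: Nh ≈ Lh·Km/Kb, Nm ≈ Lm·Kh/Kb and Nb ≈ Kh·Km/Kb. These closed forms are a reconstruction inferred from the sampling assumptions and the tabulated values, not a formula quoted from the text, and the function name `estimate_lincrnome` is hypothetical.

```python
def estimate_lincrnome(Lh, Lm, Kh, Km, Kb):
    """Capture-recapture style point estimates (a reconstruction, see lead-in).

    Lh, Lm: sizes of the validated human and mouse sets.
    Kh, Km: validated lincRNAs with an expressed putative ortholog.
    Kb:     validated lincRNAs whose ortholog is also validated.
    """
    if Kb <= 0:
        raise ValueError("Kb must be positive")
    Nh = Lh * Km / Kb  # estimated size of the human lincRNome
    Nm = Lm * Kh / Kb  # estimated size of the mouse lincRNome
    Nb = Kh * Km / Kb  # estimated size of the shared (orthologous) pool
    return Nh, Nm, Nb


# Input counts copied from Table 1, column "None" (no ORF threshold).
Nh, Nm, Nb = estimate_lincrnome(Lh=4662, Lm=4156, Kh=2641, Km=2888, Kb=196)
print(f"Nh ~ {Nh:.0f}, Nm ~ {Nm:.0f}, Nb ~ {Nb:.0f}")
print(f"conserved: {100 * Nb / Nh:.0f}% of human, {100 * Nb / Nm:.0f}% of mouse")
```

Under these formulas the conserved fractions reduce to Kh/Lh and Km/Lm, which is consistent with the conservation percentages in Table 1 being nearly constant across ORF thresholds.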

Table 2

Estimates of the numbers of all and orthologous lincRNAs with varying expression thresholds (a).

	Expression thresholds (RPKM)
	90%	80%	70%	60%	50%	40%	30%	20%	10%
Lh	3141	2792	2443	2094	1745	1396	1047	698	349
Lm	2530	2249	1967	1686	1405	1124	844	562	281
Kh	1819	1638	1433	1242	1042	842	640	427	217
Km	1838	1653	1451	1242	1050	847	646	437	226
Kb	145	139	130	117	96	80	72	49	25
Nh	42,458	39,814	33,202	27,267	22,228	19,085	14,780	9,393	6,225
Nm	33,719	31,738	26,502	21,682	17,897	15,250	11,830	7,502	4,897
Nb	24,424	23,057	19,479	15,994	13,184	11,396	8,914	5,742	3,808
%	58/72	58/73	59/74	59/74	59/74	60/75	60/75	61/77	61/78

(a) Indel threshold: 95%, ORF threshold: 120 nt (see Methods). Expression thresholds were applied to lincRNA genes (Lh, Lm, and Kb) and putative orthologous lincRNA genes (Kh and Km).

% stands for conservation percentage as in Table 1.

The human and mouse lincRNomes.

The figure shows the estimated numbers of lincRNA genes in human and mouse, and the estimate for the size of the set of orthologous lincRNAs. The circles show the estimated sizes of the human and mouse lincRNomes (Nh and Nm, respectively), and the overlap shows the estimated number of orthologous lincRNAs (Nb). For each of these values, the analytically determined 95% confidence intervals are indicated; bootstrap analysis yielded narrower confidence intervals (see Methods for details). The small, filled circles (blue and green for human and mouse, respectively) show the validated sets of lincRNAs (Lh and Lm, respectively), and the overlap area between these circles shows orthologous expressed, validated lincRNAs (Kb).

Approximately two-thirds of the lincRNA genes were estimated to share orthologous relationships (Figure 2 and Table 1). The subsets of lincRNAs with increasing expression levels were found to be smaller and slightly but consistently more conserved (Table 2), a result that is compatible with our previous observation of positive correlation between sequence conservation and expression level among lincRNAs [23]. We next used the length distributions of human and mouse lincRNAs in the validated sets to estimate the total lengths of the lincRNomes and the fraction of the genome occupied by the lincRNA-encoding sequences, once again under the assumption that the validated sets are representative of the entire lincRNomes. Strikingly, the fraction of the human and mouse euchromatic genome sequence dedicated to encoding lincRNAs was found to be more than twofold greater than the fraction allotted to protein-coding sequences and greater even than the total fraction encoding mRNAs (including untranslated regions) (Table 3).

Table 3

The fractions of the human and mouse genomes allotted to protein-coding and lincRNA-coding sequences (a).

	Human	Mouse
Protein-coding genes, number	19,042	20,210
Protein-coding sequences, total length (Mb)/% genome	32.6/1.23	33.5/1.27
UTRs, total length (Mb)/% genome	21/0.79	19/0.72
lincRNAs, validated set, number	4,662	4,156
lincRNAs, validated set, total length (Mb)	6.2 (b)	7.8
lincRNAs, genomic estimate, number	53,649	43,638
lincRNAs, genomic estimate, total length (Mb)/% genome	72.5/2.7	81.9/3.1

(a) The data on protein-coding genes and the total size of the euchromatic genomes are from [31], [63].

(b) The total length of the human lincRNome is likely to be an underestimate caused by the use of RNAseq data to calculate the lengths of lincRNAs in the human validated set [27].
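The comparison drawn from Table 3 is simple arithmetic and can be checked directly. The sketch below uses only values copied from the table; the proportional scaling in `scale_total_length` (validated total length multiplied by the ratio of estimated to validated gene numbers) is an assumption about how the genomic total was extrapolated, and both function names are hypothetical.

```python
# Percent of the euchromatic genome, copied from Table 3.
table3 = {
    "human": {"cds_pct": 1.23, "utr_pct": 0.79, "linc_pct": 2.7},
    "mouse": {"cds_pct": 1.27, "utr_pct": 0.72, "linc_pct": 3.1},
}


def compare_fractions(row):
    """Return (lincRNA/CDS ratio, total mRNA percent = CDS + UTRs)."""
    ratio = row["linc_pct"] / row["cds_pct"]
    mrna_pct = row["cds_pct"] + row["utr_pct"]
    return ratio, mrna_pct


def scale_total_length(validated_mb, n_validated, n_estimated):
    """Assumed extrapolation: the validated set has a representative mean length."""
    return validated_mb * n_estimated / n_validated


for species, row in table3.items():
    ratio, mrna_pct = compare_fractions(row)
    print(f"{species}: lincRNA/CDS = {ratio:.2f}, "
          f"mRNA = {mrna_pct:.2f}%, lincRNA = {row['linc_pct']}%")

# Mouse validated set: 7.8 Mb over 4,156 genes, extrapolated to 43,638 genes.
mouse_total_mb = scale_total_length(7.8, 4156, 43638)
print(f"mouse lincRNome ~ {mouse_total_mb:.1f} Mb")
```

With the mouse inputs this scaling agrees with the tabulated genomic estimate; for human the same scaling gives a value close to, but not identical with, the tabulated one, so the exact procedure used there may differ from this sketch.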

Discussion

The relatively poor sequence conservation and often low expression of lincRNAs hamper robust estimation of the size of the lincRNome from expression data alone and render comparative-genomic estimation an essential complementary approach. Strikingly, the estimates obtained here by combining comparative genomic and expression analysis suggest that the mammalian lincRNome is at least twice the size of the proteome [30], [31]. Given that intron-encoded long non-coding RNAs and non-coding RNAs encoded in complementary strands of protein-coding genes (long antisense RNAs) [32] are disregarded in these estimates, the total set of lncRNAs and the fraction of the genome dedicated to the lincRNA genes are likely to exceed the respective values for protein-coding genes several-fold.

In order to assess the reliability and robustness of the model with respect to parameters, we produced series of estimates of the total size of the human and mouse lincRNomes and their conserved subset with varying thresholds on expression level, extent of sequence similarity and the maximum allowed open reading frame size. Nevertheless, it is impossible to rule out some sources of bias that might have affected the estimates. For example, some orthologous lincRNA genes might remain undetected because they were not included in the UCSC genome alignments due to high divergence or synteny breaks (for example, inversions or translocations). Such under-detection of orthologs could cause an underestimate of the number of evolutionarily conserved lincRNA genes, although it has been reported that the number of breakpoints is not large (<250) for the human/mouse genomic comparison [33], so this type of bias is likely to be negligible. Another, potentially more serious source of bias could be a correlation between the two lists of lincRNA genes, which again would result in biased estimates of the number of evolutionarily conserved lincRNA genes. However, because the human and mouse lincRNA sets were obtained using quite different approaches [12], [20], [23], [27], there is no reason to expect that any strong correlation between the two lists would be caused by the employed experimental and/or computational procedures. An underestimate of the number of orthologous lincRNAs as well as the total size of the mouse lincRNome also might be caused by the smaller RNAseq dataset for mouse (10 tissue/cell types, see Methods for details) compared to human (16 tissue/cell types). This difference could explain the systematically smaller predicted numbers of mouse lincRNA genes (Tables 1 and 2). More generally, given that expression of a large fraction of lincRNAs appears to be tissue-specific, the availability of sufficient data for only relatively small numbers of tissue/cell types could cause a substantial underestimate of the size of both lincRNomes and their conserved fraction. Thus, the estimates obtained here should be regarded as highly conservative, essentially lower bounds on the lincRNome size and the set of orthologous lincRNA genes.

Some of the transcripts identified as lincRNAs potentially might represent fragments generated from long (alternative) 5′UTRs or 3′UTRs of protein-coding genes. Such transcripts could result from utilization of alternative poly(A) addition signals and/or could represent alternative splice forms separated by long introns [3], [18], [19], [34]. If many purported lincRNAs actually are fragments of protein-coding genes, one would expect a strong correlation to exist between the expression of lincRNAs and neighboring protein-coding genes. Cabili and co-workers analyzed this correlation for the set of validated human lincRNA genes [27]. Their analysis focused on those protein-coding genes that had a lincRNA neighbor on one side and a coding neighbor on the other side, and used a paired test to compare the correlation between each protein-coding gene and its lincRNA neighbor with that between the same protein-coding gene and its protein-coding gene neighbor. This comparison showed a weak opposite trend, namely that expression of pairs of coding gene neighbors was, on average, slightly but significantly more strongly correlated than the expression of neighboring lincRNA/protein-coding gene pairs. The results of this analysis appear to be best compatible with the hypothesis that any co-expression between lincRNAs and their protein-coding neighbors results from proximal transcriptional activity in the surrounding open chromatin [35]. These findings effectively rule out the possibility that the majority of lincRNAs are fragments of neighboring protein-coding genes, although there are anecdotal observations that 3′UTR-derived RNAs can function not only in cis to regulate protein expression but also intrinsically and independently in trans, likely as non-coding RNAs [36].

The possibility that some lincRNA genes encode short peptides that are translated, perhaps in a tissue-specific manner, is the subject of an ongoing debate [13], [37]–[40]. It is extremely hard to rule out such a role for a fraction of purported lincRNAs, as becomes obvious from the long-standing attempts to investigate potential functions of the thousands of upstream open reading frames (uORFs) that are present in the 5′UTRs of protein-coding genes in eukaryotes [41]–[44]. Although some of the uORFs are translated, the functions of the produced peptides, if any, remain unclear [45]. Even application of modern high-throughput techniques in simple eukaryotic model systems has so far failed to clarify this issue. For example, analysis of 1048 uORFs in yeast genes has supported translation of 153 uORFs [46]. Furthermore, numerous uORF translation start sites were found at non-AUG codons; the frequency of these events was even higher than for uAUG codons even though the frequency of non-AUG starting codons is extremely low for protein-coding genes [46]. Another intriguing recent discovery is the potential presence, in the yeast genome, of hundreds of transiently expressed ‘proto-genes’ that are suspected to reflect the process of de novo gene birth [40]. However, the functionality of these peptides remains an open question. Establishing functionality of short ORFs in mammalian genomes is an even more difficult task. For example, analysis of translation in mouse embryonic stem cells revealed thousands of currently unannotated translation products. These include amino-terminal extensions and truncations and uORFs with regulatory potential, initiated at both AUG and non-AUG codons, whose translation changes after differentiation [47]. However, contrary to these emerging indications of abundant production of short peptides, a recent genome-wide study has reported very limited translation of lincRNAs in two human cell lines [48]. In general, at present it appears virtually impossible to annotate an RNA unequivocally as protein-coding or noncoding, with overlapping protein-coding and noncoding transcripts further confounding the issue. Indeed, it has been suggested that because some transcripts can function both intrinsically at the RNA level and to encode proteins, the very dichotomy between mRNAs and ncRNAs is false [38].

Taking all these problems into account, here we adopted a simple, conservative approach by excluding from the analysis lincRNAs containing relatively long ORFs, under a series of ORF length thresholds. However, it should be noted that the human and mouse lincRNAs used in this study had been previously filtered for the presence of evolutionarily conserved ORFs and the presence of protein domains, and the most questionable transcripts were removed at this stage [12], [20], [23], [27]. For example, 2305 human transcripts were excluded from the stringent human lincRNA set [27] under the coding potential criteria (the presence of a Pfam domain, a positive PhyloCSF score, or previous annotation as pseudogenes). The majority of these discarded transcripts (1533) were previously annotated as pseudogenes [27]. Similar to the stringent set of lincRNAs, these transcripts are expressed at lower levels and in more tissue-specific patterns than bona fide protein-coding genes, suggesting that these effectively are non-coding transcripts. Nevertheless, Cabili and co-workers employed a conservative approach and excluded them from the stringent lincRNA set [27].

Questions about functional roles of lincRNAs and the fraction of the lincRNAs that are functional loom large. For a long time, the prevailing view appeared to be that, apart from a few molecular fossils such as rRNA, tRNA and snRNAs, RNAs did not play an important role in extant cells. More recently, the opposite position has become popular, namely that (almost) every detectable RNA molecule is functional. It has been repeatedly pointed out that this view is likely to be too extreme [49], [50]. Although it has been shown that many lincRNA genes are evolutionarily conserved and perform various functions [7], [12]–[17], an unknown fraction of lincRNAs should be expected to result from functionally irrelevant background transcription [19]. In the present work, phylogenetic conservation is the principal support of functional relevance of lincRNAs. Given that neutrally evolving sequences in human and mouse genomes are effectively saturated with mutations and show no significant sequence conservation [51]–[53], expression of non-coding RNAs at orthologous genomic regions in human and mouse should be construed as strong evidence of functionality. It should be noted, however, that sequence conservation gives a lower bound for the number of functional lincRNAs, and the lack of conservation is not a reliable indication of lack of function. First, the possibility exists that orthologous genes diverge to the point of being undetectable by sequence comparison, e.g. because short conserved, functionally important stretches are interspersed with longer non-conserved regions, as is the case in Xist, H19, and similar lincRNAs [20], [54], [55]. The results of this work predict that, despite the fact that on average sequence conservation between orthologous lincRNAs is much lower than the conservation between protein-coding genes [12], [23], 60 to 70% of the lincRNAs appear to share orthologous relationships between human and mouse, which is only slightly lower than the fraction of protein-coding genes with orthologs, approximately 80% [51]. These findings imply that, even if many of the species-specific lincRNAs are non-functional, mammalian lincRNAs perform thousands of evolutionarily conserved functional roles, most of which remain to be identified.

Methods

The human and mouse validated lincRNA sets

As the human lincRNA data set, we used the ‘stringent set’ of 4662 lincRNAs, which is a subset of the over 8000 human lincRNAs described in a recent comprehensive study [27]. The validated set of mouse lincRNA genes was constructed by merging our previously published set of 2390 lincRNA transcripts with the set of 3051 transcripts produced by Ponting and coworkers [12]. After the merge, a unique list of 4989 GenBank transcript IDs was generated, coordinates of the newest mouse assembly, mm9, were downloaded in BED format from the UCSC Table Browser [56], and entries shorter than 200 nt were discarded. Overlapping chromosomal coordinates were merged using the mergeBed utility from the BEDtools package [57], with the command line option -s (“force strandedness”, i.e. merge overlapping features only if they are on the same strand), and unique IDs were assigned to the resulting 4156 mouse lincRNA clusters (format: mlclust_N, where mlclust stands for mouse lincRNA cluster and N is a unique integer; see Supporting Table S1).
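The filtering and strand-aware merging steps described above can be sketched in a few lines. This is a simplified stand-in for what mergeBed does, not a reimplementation of it: intervals are plain tuples rather than BED records, touching intervals are merged as a choice of this sketch, and the function names `merge_stranded` and `cluster_lincrnas` are hypothetical.

```python
from collections import defaultdict


def merge_stranded(intervals):
    """Merge overlapping (chrom, start, end, strand) intervals on the same strand."""
    groups = defaultdict(list)
    for chrom, start, end, strand in intervals:
        groups[(chrom, strand)].append((start, end))
    merged = []
    for (chrom, strand), spans in sorted(groups.items()):
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:  # overlapping or touching: extend the cluster
                cur_end = max(cur_end, end)
            else:
                merged.append((chrom, cur_start, cur_end, strand))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end, strand))
    return merged


def cluster_lincrnas(transcripts, min_length=200):
    """Drop entries shorter than min_length, merge, and assign mlclust_N IDs."""
    kept = [t for t in transcripts if t[2] - t[1] >= min_length]
    return [(f"mlclust_{i}", *iv) for i, iv in enumerate(merge_stranded(kept), 1)]


# Toy coordinates, invented for illustration only.
transcripts = [
    ("chr1", 100, 500, "+"),
    ("chr1", 400, 900, "+"),  # overlaps the first on the same strand
    ("chr1", 450, 800, "-"),  # overlaps in position but on the other strand
    ("chr2", 10, 150, "+"),   # shorter than 200 nt, discarded
]
for cluster in cluster_lincrnas(transcripts):
    print(cluster)
```

Grouping by chromosome and strand before sorting is what keeps the two overlapping chr1 features on opposite strands in separate clusters.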

Expression of lincRNAs

Expression of the lincRNAs was assessed by analysis of the available RNAseq data. For human, the run files of the Illumina Human Body Map 2.0 project for adipose, adrenal, brain, breast, colon, heart, kidney, liver, lung, lymph node, ovary, prostate, skeletal muscle, testis, thyroid and white blood cells were downloaded from the NCBI Sequence Read Archive (SRA, http://www.ncbi.nlm.nih.gov/Traces/sra; Study ERP000546; runs ERR030888 to ERR030903). For mouse, RNAseq data of the ENCODE project [58] for bone marrow, cerebellum, cortex, ES-Bruce4, heart, kidney, liver, lung, mouse embryonic fibroblast cells (MEF) and spleen were downloaded from the UCSC Table Browser [56] FTP site (ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/encodeDCC/wgEncodeLicrRNAseq/). Pre-built Bowtie indices of human and mouse, based on UCSC hg19 and mm9, were downloaded from the Bowtie FTP site (ftp://ftp.cbcb.umd.edu/pub/data/bowtie_indexes/hg19.ebwt.zip and ftp://ftp.cbcb.umd.edu/pub/data/bowtie_indexes/mm9.ebwt.zip, respectively). The reads were aligned with the cognate genomic sequences using TopHat [59]. The TopHat-generated alignments were analyzed using an ad hoc Python script that accepts alignments and genomic coordinates in SAM and BED formats, respectively, and uses the HTSeq Python package (http://www-huber.embl.de/users/anders/HTSeq) to calculate the number of aligned reads (“counts”). The RPKM (i.e. reads per kilobase of exon model per million mapped reads [29]) values were calculated from the counts. Because we were interested in determining whether particular regions are expressed in any of the analyzed tissues, the maximum value among all tissues was assigned as the expression level of lincRNA genes and putative orthologous lincRNA genes.
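The last two steps, converting counts to RPKM and taking the maximum over tissues, can be illustrated as follows. This is a minimal sketch using the standard RPKM definition; it is not the ad hoc script mentioned above, the function names are hypothetical, and the counts and library sizes are invented for illustration.

```python
def rpkm(count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads."""
    return count * 1e9 / (exon_length_bp * total_mapped_reads)


def max_rpkm(counts_by_tissue, exon_length_bp, mapped_by_tissue):
    """Expression level of a gene: the maximum RPKM over all tissues."""
    return max(
        rpkm(counts_by_tissue[t], exon_length_bp, mapped_by_tissue[t])
        for t in counts_by_tissue
    )


# Invented example: one gene with 2,000 nt of exons, measured in two tissues.
counts = {"brain": 50, "liver": 10}
mapped = {"brain": 20_000_000, "liver": 40_000_000}
level = max_rpkm(counts, exon_length_bp=2000, mapped_by_tissue=mapped)
print(level)
```

Taking the maximum rather than the mean means a gene counts as expressed if it is detected in any single tissue, which suits tissue-specific transcripts.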

Identification of open reading frames (ORFs)

An ORF was defined as a continuous stretch of codons starting from an ATG codon or the beginning of the cDNA (to take into account potentially truncated cDNAs) and ending with a stop codon. The ORFs were identified by using the ATG_EVALUATOR program [60] combined with the ORF predictor from the GeneBuilder package [61] with relaxed parameters (the program was required to correctly predict 95% of the human and mouse cDNA training sets [61]). Control experiments with independent human and mouse cDNA data sets [61] showed a 94–98% true positive rate depending on the ORF length threshold (90, 120 or 150 nucleotides). However, a high rate of false positives is expected for such relaxed parameters. Analysis of human and mouse intron and UTR data sets showed false-positive rates of 10–20% depending on the threshold [60], [61]. For the purpose of the present analysis, false positives in ORF identification represent random removal of lincRNA sequences from the samples, resulting in conservative estimates of the total lincRNA number. Thus, we used the ORF cut-off values of 90, 120 or 150 nucleotides to remove putative mRNAs for short proteins separately from the human and mouse sets of lincRNAs.
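The ORF definition given above can be turned into a naive scanner for illustration. This sketch is not ATG_EVALUATOR or the GeneBuilder predictor: it only scans the three forward frames, counts the stop codon in the ORF length (an assumption), and uses hypothetical function names.

```python
STOPS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Length in nt of the longest forward-strand ORF, stop codon included.

    An ORF starts at an ATG or at the 5' end of the sequence (to allow for
    truncated cDNAs) and must end with an in-frame stop codon.
    """
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = frame  # open from the 5' end in this frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOPS:
                if start is not None:
                    best = max(best, i + 3 - start)
                start = None  # wait for the next ATG
            elif start is None and codon == "ATG":
                start = i
    return best


def passes_orf_filter(seq, threshold_nt=120):
    """Keep a transcript only if its longest ORF is below the threshold."""
    return longest_orf(seq) < threshold_nt


short = "ATGAAATAG"                   # 9-nt ORF
long_ = "ATG" + "AAA" * 40 + "TAA"    # 126-nt ORF
print(longest_orf(short), passes_orf_filter(short))
print(longest_orf(long_), passes_orf_filter(long_))
```

Running the filter with `threshold_nt` set to 90, 120 or 150 corresponds to the three ORF cut-offs used in Table 1.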

Comparative genomic analysis of the lincRNA sets

To obtain the subset of human lincRNAs with expressed orthologs in mouse (Kh), human lincRNA gene coordinates of assembly hg19 were converted to mouse mm9 using the liftOver tool of the UCSC Genome Browser [62]. Out of the 4662 human lincRNAs (Lh), putative orthologous regions were identified in the mouse genome for 3529. These sequences were checked for evidence of expression in mouse tissues using the RNAseq data. Exon coordinates of putative lincRNAs were obtained by mapping their coordinates onto exons of all known genes of the mm9 assembly of the UCSC Genome Browser. The sums of exon lengths were then used in the expression level calculation to normalize for sequence length. Out of the 3369 putative lincRNAs for which the exon models could be determined, 2872 had expression level greater than zero. Similarly, the subset of mouse lincRNAs with expressed putative orthologs in human (Km) was found by converting the coordinates of the initial 4156 mouse lincRNAs (Lm) from mm9 to hg19 and searching for evidence of expression in human tissues. The exon models could be determined for 3656 of the 3677 putative lincRNAs, out of which 3157 had expression level greater than zero.

The subset of orthologous lincRNAs (Kb) was obtained by selecting those lincRNAs whose putative orthologs in the other species overlap with the validated lincRNAs of that species. That is, we searched for the overlap of putative orthologs of human lincRNAs (in hg19 coordinates) with the mouse lincRNAs (in mm9 coordinates, minimal overlap 100 nucleotides). The overlap was determined using intersectBed from the BEDtools package with the command line option -s (“force strandedness”). This resulted in 196 pairs of unique human and mouse lincRNAs.

Approximate indel values (INDEL) were estimated from the sequence length differences between the lincRNAs and their orthologs, using a formula based on L_lincRNA, the total length of lincRNA exons, and L_ortholog, the total length of the exons of the lincRNA ortholog. Manual examination of orthologous lincRNA alignments and putative orthologs suggested that approximately 5% of the alignments with the largest INDEL values were unreliable. Thus, all lincRNA alignments with INDEL >95% were removed from further analysis. Similarly, a cut-off was imposed on the expression level of putative human and mouse orthologs of lincRNAs. This cut-off was set at the lowest 5% of the expression levels of the 196 orthologous validated lincRNA genes (Supporting Table S1). All putative orthologs of lincRNA genes with lower expression values were discarded under the premise that these low values could represent experimental noise, i.e. the top 95% of the expression values (EXP95%) was used for all analyses (Table 1 and Supporting Table S1). In addition, EXP90%, 80%, 70%, 60%, 50%, 40%, 30%, 20% and 10% were calculated to compare subsets of lincRNAs expressed at different levels (Table 2). We also used different sets of expression/indel filters combined with the 5 input parameters (see Results) in different experiments (Tables 1 and 2); no substantial differences between results were found (see Discussion for details). For calculating the 5 input parameters (see Results), all the collected information was stored in an SQLite database, and after applying ORF, indel and expression thresholds, final data sets were assembled (Tables 1, 2 and Supporting Table S1).
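The expression cut-off described above, derived from the lowest 5% of a reference set and then applied to the putative orthologs, might be sketched as follows. The nearest-rank percentile rule, the function names and the toy values are assumptions made for illustration; the text does not specify how the 5% boundary was computed.

```python
def percentile_cutoff(reference_values, discard_percent=5):
    """Smallest reference value that survives discarding the lowest N percent."""
    ordered = sorted(reference_values)
    n_discarded = (len(ordered) * discard_percent) // 100  # integer arithmetic
    return ordered[n_discarded]


def filter_expressed(orthologs, cutoff):
    """Keep putative orthologs whose expression is at or above the cut-off."""
    return {name: value for name, value in orthologs.items() if value >= cutoff}


# Toy reference set standing in for the RPKM values of the validated pairs.
reference = list(range(1, 101))            # 100 invented expression values
cutoff = percentile_cutoff(reference, 5)   # EXP95%: drop the lowest 5%
putative = {"orth_a": 0.4, "orth_b": 6, "orth_c": 37.5}  # hypothetical IDs
print(cutoff, sorted(filter_expressed(putative, cutoff)))
```

Raising `discard_percent` to 10, 20 and so on gives the progressively stricter subsets compared in Table 2.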

Maximum likelihood estimates

Using the experimentally validated sets of human and mouse lincRNAs and the assumptions described in the main text, the probability of observing a particular set of Kh, Km and Kb for the given values of Lh and Lm is given by Equation (1) in the main text. Using Stirling's approximation for the factorial, we obtain a system of nonlinear equations (3–5) for the sizes Nh and Nm of the pools and their overlap Nb that maximize the likelihood P in Eq. (1). Solving the system (3–5) for Nh, Nm and Nb, we obtain Equation (2) (see main text). The confidence region around the maximum likelihood estimate Eq. (5) is an ellipsoid in the {Nh,Nm,Nb} space. The directions of its axes are given by the eigenvectors of the matrix J of second derivatives of log P (the Hessian), and the magnitudes of the ellipsoid's axes are given by the inverse square roots of the negatives of the eigenvalues. Computing the second derivatives of log P and evaluating them at the maximum likelihood point gives the entries of J. We found that the confidence ellipsoid is highly elongated, and therefore the estimates for the pool sizes are strongly correlated with each other. The analytically estimated 95% confidence intervals are shown in Table 1. In addition, a bootstrap analysis of the lincRNA numbers was performed. For this purpose, the initial sets of human and mouse lincRNAs were randomly resampled 1000 times, and the final numbers were calculated using the 95% indel and expression (RPKM) levels and all ORF thresholds. The results of the bootstrap analysis are given in Supporting Table S1. The 95% confidence intervals estimated using the bootstrapping procedure (Supporting Table S1) were narrower than the analytically obtained 95% confidence intervals (Table 1); thus, we used the latter as conservative estimates of the 95% confidence intervals.
Supporting information: Comprehensive information on the human and mouse lincRNA sets. (XLS)
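The resampling step of the bootstrap analysis can be illustrated with a short Python sketch. The maximum likelihood estimator itself (Equation 2 of the main text) is not reproduced in this section, so the sketch substitutes a much simpler statistic as a stand-in: the fraction of lifted-over human lincRNAs with exon models that show nonzero expression in mouse (2872 of 3369, from the numbers above). The function name, the percentile-interval method and the fixed seed are assumptions made for illustration only.

```python
import random


def bootstrap_interval(sample, statistic, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for `statistic` computed on `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement, same size as the original sample.
    estimates = sorted(
        statistic(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    lo_idx = round((1 - level) / 2 * n_resamples)
    hi_idx = round((1 + level) / 2 * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


def fraction_expressed(flags):
    """Stand-in statistic: share of lincRNAs whose ortholog is expressed."""
    return sum(flags) / len(flags)


# 1 = putative ortholog expressed, 0 = not expressed (counts from the text).
flags = [1] * 2872 + [0] * (3369 - 2872)
point = fraction_expressed(flags)
low, high = bootstrap_interval(flags, fraction_expressed)
print(f"point estimate {point:.4f}, 95% bootstrap interval [{low:.4f}, {high:.4f}]")
```

Replacing `fraction_expressed` with a function that recomputes Kh, Km and Kb from the resampled sets and evaluates the maximum likelihood estimates would give the analogue of the analysis described in the text.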

63 in total

Review 1. Non-coding RNAs: hope or hype?

Authors: Alexander Hüttenhofer; Peter Schattner; Norbert Polacek
Journal: Trends Genet Date: 2005-05 Impact factor: 11.639

2. The eukaryotic genome as an RNA machine.

Authors: Paulo P Amaral; Marcel E Dinger; Tim R Mercer; John S Mattick
Journal: Science Date: 2008-03-28 Impact factor: 47.728

3. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project.

Authors: Ewan Birney; John A Stamatoyannopoulos; Anindya Dutta; Roderic Guigó; Thomas R Gingeras; Elliott H Margulies; Zhiping Weng; Michael Snyder; Emmanouil T Dermitzakis; Robert E Thurman; Michael S Kuehn; Christopher M Taylor; Shane Neph; Christoph M Koch; Saurabh Asthana; Ankit Malhotra; Ivan Adzhubei; Jason A Greenbaum; Robert M Andrews; Paul Flicek; Patrick J Boyle; Hua Cao; Nigel P Carter; Gayle K Clelland; Sean Davis; Nathan Day; Pawandeep Dhami; Shane C Dillon; Michael O Dorschner; Heike Fiegler; Paul G Giresi; Jeff Goldy; Michael Hawrylycz; Andrew Haydock; Richard Humbert; Keith D James; Brett E Johnson; Ericka M Johnson; Tristan T Frum; Elizabeth R Rosenzweig; Neerja Karnani; Kirsten Lee; Gregory C Lefebvre; Patrick A Navas; Fidencio Neri; Stephen C J Parker; Peter J Sabo; Richard Sandstrom; Anthony Shafer; David Vetrie; Molly Weaver; Sarah Wilcox; Man Yu; Francis S Collins; Job Dekker; Jason D Lieb; Thomas D Tullius; Gregory E Crawford; Shamil Sunyaev; William S Noble; Ian Dunham; France Denoeud; Alexandre Reymond; Philipp Kapranov; Joel Rozowsky; Deyou Zheng; Robert Castelo; Adam Frankish; Jennifer Harrow; Srinka Ghosh; Albin Sandelin; Ivo L Hofacker; Robert Baertsch; Damian Keefe; Sujit Dike; Jill Cheng; Heather A Hirsch; Edward A Sekinger; Julien Lagarde; Josep F Abril; Atif Shahab; Christoph Flamm; Claudia Fried; Jörg Hackermüller; Jana Hertel; Manja Lindemeyer; Kristin Missal; Andrea Tanzer; Stefan Washietl; Jan Korbel; Olof Emanuelsson; Jakob S Pedersen; Nancy Holroyd; Ruth Taylor; David Swarbreck; Nicholas Matthews; Mark C Dickson; Daryl J Thomas; Matthew T Weirauch; James Gilbert; Jorg Drenkow; Ian Bell; XiaoDong Zhao; K G Srinivasan; Wing-Kin Sung; Hong Sain Ooi; Kuo Ping Chiu; Sylvain Foissac; Tyler Alioto; Michael Brent; Lior Pachter; Michael L Tress; Alfonso Valencia; Siew Woh Choo; Chiou Yu Choo; Catherine Ucla; Caroline Manzano; Carine Wyss; Evelyn Cheung; Taane G Clark; James B Brown; Madhavan Ganesh; Sandeep Patel; Hari Tammana; 
Jacqueline Chrast; Charlotte N Henrichsen; Chikatoshi Kai; Jun Kawai; Ugrappa Nagalakshmi; Jiaqian Wu; Zheng Lian; Jin Lian; Peter Newburger; Xueqing Zhang; Peter Bickel; John S Mattick; Piero Carninci; Yoshihide Hayashizaki; Sherman Weissman; Tim Hubbard; Richard M Myers; Jane Rogers; Peter F Stadler; Todd M Lowe; Chia-Lin Wei; Yijun Ruan; Kevin Struhl; Mark Gerstein; Stylianos E Antonarakis; Yutao Fu; Eric D Green; Ulaş Karaöz; Adam Siepel; James Taylor; Laura A Liefer; Kris A Wetterstrand; Peter J Good; Elise A Feingold; Mark S Guyer; Gregory M Cooper; George Asimenos; Colin N Dewey; Minmei Hou; Sergey Nikolaev; Juan I Montoya-Burgos; Ari Löytynoja; Simon Whelan; Fabio Pardi; Tim Massingham; Haiyan Huang; Nancy R Zhang; Ian Holmes; James C Mullikin; Abel Ureta-Vidal; Benedict Paten; Michael Seringhaus; Deanna Church; Kate Rosenbloom; W James Kent; Eric A Stone; Serafim Batzoglou; Nick Goldman; Ross C Hardison; David Haussler; Webb Miller; Arend Sidow; Nathan D Trinklein; Zhengdong D Zhang; Leah Barrera; Rhona Stuart; David C King; Adam Ameur; Stefan Enroth; Mark C Bieda; Jonghwan Kim; Akshay A Bhinge; Nan Jiang; Jun Liu; Fei Yao; Vinsensius B Vega; Charlie W H Lee; Patrick Ng; Atif Shahab; Annie Yang; Zarmik Moqtaderi; Zhou Zhu; Xiaoqin Xu; Sharon Squazzo; Matthew J Oberley; David Inman; Michael A Singer; Todd A Richmond; Kyle J Munn; Alvaro Rada-Iglesias; Ola Wallerman; Jan Komorowski; Joanna C Fowler; Phillippe Couttet; Alexander W Bruce; Oliver M Dovey; Peter D Ellis; Cordelia F Langford; David A Nix; Ghia Euskirchen; Stephen Hartman; Alexander E Urban; Peter Kraus; Sara Van Calcar; Nate Heintzman; Tae Hoon Kim; Kun Wang; Chunxu Qu; Gary Hon; Rosa Luna; Christopher K Glass; M Geoff Rosenfeld; Shelley Force Aldred; Sara J Cooper; Anason Halees; Jane M Lin; Hennady P Shulha; Xiaoling Zhang; Mousheng Xu; Jaafar N S Haidar; Yong Yu; Yijun Ruan; Vishwanath R Iyer; Roland D Green; Claes Wadelius; Peggy J Farnham; Bing Ren; Rachel A Harte; Angie S Hinrichs; Heather 
Trumbower; Hiram Clawson; Jennifer Hillman-Jackson; Ann S Zweig; Kayla Smith; Archana Thakkapallayil; Galt Barber; Robert M Kuhn; Donna Karolchik; Lluis Armengol; Christine P Bird; Paul I W de Bakker; Andrew D Kern; Nuria Lopez-Bigas; Joel D Martin; Barbara E Stranger; Abigail Woodroffe; Eugene Davydov; Antigone Dimas; Eduardo Eyras; Ingileif B Hallgrímsdóttir; Julian Huppert; Michael C Zody; Gonçalo R Abecasis; Xavier Estivill; Gerard G Bouffard; Xiaobin Guan; Nancy F Hansen; Jacquelyn R Idol; Valerie V B Maduro; Baishali Maskeri; Jennifer C McDowell; Morgan Park; Pamela J Thomas; Alice C Young; Robert W Blakesley; Donna M Muzny; Erica Sodergren; David A Wheeler; Kim C Worley; Huaiyang Jiang; George M Weinstock; Richard A Gibbs; Tina Graves; Robert Fulton; Elaine R Mardis; Richard K Wilson; Michele Clamp; James Cuff; Sante Gnerre; David B Jaffe; Jean L Chang; Kerstin Lindblad-Toh; Eric S Lander; Maxim Koriabine; Mikhail Nefedov; Kazutoyo Osoegawa; Yuko Yoshinaga; Baoli Zhu; Pieter J de Jong
Journal: Nature Date: 2007-06-14 Impact factor: 49.962

4. Dark matter transcripts: sound and fury, signifying nothing?

Authors: Richard Robinson
Journal: PLoS Biol Date: 2010-05-18 Impact factor: 8.029

5. Most "dark matter" transcripts are associated with known genes.

Authors: Harm van Bakel; Corey Nislow; Benjamin J Blencowe; Timothy R Hughes
Journal: PLoS Biol Date: 2010-05-18 Impact factor: 8.029

Review 6. RNA-Seq: a revolutionary tool for transcriptomics.

Authors: Zhong Wang; Mark Gerstein; Michael Snyder
Journal: Nat Rev Genet Date: 2009-01 Impact factor: 53.242

7. The transcriptional landscape of the mammalian genome.

Authors: P Carninci; T Kasukawa; S Katayama; J Gough; M C Frith; N Maeda; R Oyama; T Ravasi; B Lenhard; C Wells; R Kodzius; K Shimokawa; V B Bajic; S E Brenner; S Batalov; A R R Forrest; M Zavolan; M J Davis; L G Wilming; V Aidinis; J E Allen; A Ambesi-Impiombato; R Apweiler; R N Aturaliya; T L Bailey; M Bansal; L Baxter; K W Beisel; T Bersano; H Bono; A M Chalk; K P Chiu; V Choudhary; A Christoffels; D R Clutterbuck; M L Crowe; E Dalla; B P Dalrymple; B de Bono; G Della Gatta; D di Bernardo; T Down; P Engstrom; M Fagiolini; G Faulkner; C F Fletcher; T Fukushima; M Furuno; S Futaki; M Gariboldi; P Georgii-Hemming; T R Gingeras; T Gojobori; R E Green; S Gustincich; M Harbers; Y Hayashi; T K Hensch; N Hirokawa; D Hill; L Huminiecki; M Iacono; K Ikeo; A Iwama; T Ishikawa; M Jakt; A Kanapin; M Katoh; Y Kawasawa; J Kelso; H Kitamura; H Kitano; G Kollias; S P T Krishnan; A Kruger; S K Kummerfeld; I V Kurochkin; L F Lareau; D Lazarevic; L Lipovich; J Liu; S Liuni; S McWilliam; M Madan Babu; M Madera; L Marchionni; H Matsuda; S Matsuzawa; H Miki; F Mignone; S Miyake; K Morris; S Mottagui-Tabar; N Mulder; N Nakano; H Nakauchi; P Ng; R Nilsson; S Nishiguchi; S Nishikawa; F Nori; O Ohara; Y Okazaki; V Orlando; K C Pang; W J Pavan; G Pavesi; G Pesole; N Petrovsky; S Piazza; J Reed; J F Reid; B Z Ring; M Ringwald; B Rost; Y Ruan; S L Salzberg; A Sandelin; C Schneider; C Schönbach; K Sekiguchi; C A M Semple; S Seno; L Sessa; Y Sheng; Y Shibata; H Shimada; K Shimada; D Silva; B Sinclair; S Sperling; E Stupka; K Sugiura; R Sultana; Y Takenaka; K Taki; K Tammoja; S L Tan; S Tang; M S Taylor; J Tegner; S A Teichmann; H R Ueda; E van Nimwegen; R Verardo; C L Wei; K Yagi; H Yamanishi; E Zabarovsky; S Zhu; A Zimmer; W Hide; C Bult; S M Grimmond; R D Teasdale; E T Liu; V Brusic; J Quackenbush; C Wahlestedt; J S Mattick; D A Hume; C Kai; D Sasaki; Y Tomaru; S Fukuda; M Kanamori-Katayama; M Suzuki; J Aoki; T Arakawa; J Iida; K Imamura; M Itoh; T Kato; H Kawaji; N Kawagashira; T Kawashima; 
M Kojima; S Kondo; H Konno; K Nakano; N Ninomiya; T Nishio; M Okada; C Plessy; K Shibata; T Shiraki; S Suzuki; M Tagami; K Waki; A Watahiki; Y Okamura-Oho; H Suzuki; J Kawai; Y Hayashizaki
Journal: Science Date: 2005-09-02 Impact factor: 47.728

8. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling.

Authors: Nicholas T Ingolia; Sina Ghaemmaghami; John R S Newman; Jonathan S Weissman
Journal: Science Date: 2009-02-12 Impact factor: 47.728

9. A user's guide to the encyclopedia of DNA elements (ENCODE).

Authors:
Journal: PLoS Biol Date: 2011-04-19 Impact factor: 8.029

10. The reality of pervasive transcription.

Authors: Michael B Clark; Paulo P Amaral; Felix J Schlesinger; Marcel E Dinger; Ryan J Taft; John L Rinn; Chris P Ponting; Peter F Stadler; Kevin V Morris; Antonin Morillon; Joel S Rozowsky; Mark B Gerstein; Claes Wahlestedt; Yoshihide Hayashizaki; Piero Carninci; Thomas R Gingeras; John S Mattick
Journal: PLoS Biol Date: 2011-07-12 Impact factor: 8.029
