Yurong Xin, Giulio Quarta, Hin Hark Gan, Tamar Schlick.
Abstract
Recent studies of mammalian transcriptomes have identified numerous RNA transcripts that do not code for proteins; their identity, however, is largely unknown. Here we explore an approach based on sequence randomness patterns to discern different RNA classes. The relative z-score we use helps identify the known ncRNA class from the genome, intergene and intron classes. This leads us to a fractional ncRNA measure of putative ncRNA datasets which we model as a mixture of genuine ncRNAs and other transcripts derived from genomic, intergenic and intronic sequences. We use this model to analyze six representative datasets identified by the FANTOM3 project and two computational approaches based on comparative analysis (RNAz and EvoFold). Our analysis suggests fewer ncRNAs than estimated by DNA sequencing and comparative analysis, but the verity of our approach and its prediction requires more extensive experimental RNA data.
Keywords: fraction model; putative non-coding RNA; randomness test
Year: 2008 PMID: 19812767 PMCID: PMC2735967 DOI: 10.4137/bbi.s443
Source DB: PubMed Journal: Bioinform Biol Insights ISSN: 1177-9322
Selected genomic sequences in our randomness analysis of the three phylogenetic domains. Acc. No. denotes the accession number in GenBank.
| Domain | Organism | Acc. No. | Length (nt) |
|---|---|---|---|
| Archaea | | NC_000917 | 2,178,400 |
| Archaea | | NC_002607 | 2,014,239 |
| Archaea | | NC_005791 | 1,661,137 |
| Archaea | | NC_003551 | 1,694,969 |
| Archaea | | NC_003552 | 5,751,492 |
| Archaea | | NC_003901 | 4,096,345 |
| Archaea | | NC_007796 | 3,544,738 |
| Archaea | | NC_000916 | 1,751,377 |
| Archaea | | NC_007426 | 2,595,221 |
| Archaea | | NC_005877 | 1,545,895 |
| Archaea | | NC_003364 | 2,222,430 |
| Archaea | | NC_000868 | 1,765,118 |
| Archaea | | NC_003413 | 1,908,256 |
| Archaea | | NC_007181 | 2,225,959 |
| Archaea | | NC_002754 | 2,992,245 |
| Archaea | | NC_003106 | 2,694,756 |
| Archaea | | NC_006624 | 2,088,737 |
| Archaea | | NC_002578 | 1,564,906 |
| Archaea | | NC_002689 | 1,584,804 |
| Bacteria | | NC_005966 | 3,598,621 |
| Bacteria | | NC_003062 | 2,841,581 |
| Bacteria | | NC_006513 | 4,296,230 |
| Bacteria | | NC_002570 | 4,202,352 |
| Bacteria | | NC_002927 | 5,339,179 |
| Bacteria | | NC_002696 | 4,016,947 |
| Bacteria | | NC_004369 | 3,147,090 |
| Bacteria | | NC_007514 | 2,572,079 |
| Bacteria | | NC_007907 | 5,727,534 |
| Bacteria | | NC_004668 | 3,218,031 |
| Bacteria | | NC_000913 | 4,639,675 |
| Bacteria | | NC_006814 | 1,993,564 |
| Bacteria | | NC_003210 | 2,944,528 |
| Bacteria | | NC_002946 | 2,153,922 |
| Bacteria | | NC_007577 | 1,709,204 |
| Bacteria | | NC_007761 | 4,381,608 |
| Bacteria | | NC_005027 | 7,145,576 |
| Bacteria | | NC_003923 | 2,820,462 |
| Bacteria | | NC_005835 | 1,894,877 |
| Bacteria | | NC_007086 | 5,148,708 |
| Eukarya | | NW_045800.1 | 6,709,423 |
| Eukarya | | NC_003071.3 | 19,705,359 |
| Eukarya | | NC_003281.4 | 13,783,316 |
| Eukarya | | NW_634459.1 | 2,669,025 |
| Eukarya | | NW_634120.1 | 2,112,237 |
| Eukarya | | NT_033779.3 | 22,407,834 |
| Eukarya | | NC_004354.2 | 22,224,390 |
| Eukarya | | NT_006316.15 | 22,487,426 |
| Eukarya | | NT_033903.7 | 14,395,596 |
| Eukarya | | NT_039305.5 | 37,613,096 |
| Eukarya | | NT_039474.5 | 26,734,816 |
| Eukarya | | NC_004316 | 2,271,477 |
| Eukarya | | NW_047692.2 | 2,154,120 |
| Eukarya | | NW_047511.1 | 2,865,177 |
| Eukarya | | NC_001136.6 | 1,531,916 |
| Eukarya | | NC_001147.4 | 1,091,287 |
| Eukarya | | NC_003424.2 | 5,572,983 |
The composition of the training ncRNA class and the two other constructs (Version 2 and Version 3).
| Group | Training set | Version 2 | Version 3 | Source |
|---|---|---|---|---|
| Noncode ncRNA | 4251 | 4251 | 4251 | Noncode |
| RNAdb ncRNA | 2204 | 2204 | 2204 | RNAdb |
| rRNA | 129 | 0 | 10 | EID |
| tRNA | 78 | 0 | 10 | Rfam |
| spliceosomal RNA | 28 | 0 | 10 | Rfam |
| Total sequence number | 6690 | 6455 | 6485 | |
| Relative z-score (mean) | 6.93 | 6.77 | 7.33 | |
| Relative z-score (std) | 0.33 | 0.11 | 0.17 | |
EID: European ribosomal RNA database.
The datasets used in the fraction model.
| Class | Relative z-score, mean (std) | Sequences | Total length (nt) | Source |
|---|---|---|---|---|
| Genomic sequence (mouse) | 2.47 (0.04) | 100 | 1,232,506,963 | GenBank (listed in Table S3) |
| Intergenic region (mouse) | 2.36 (0.04) | 14,615 | 840,729,376 | GenBank (listed in Table S3) |
| Intron (mouse) | 3.19 (0.10) | 85,672 | 74,985,163 | the Exon-Intron database |
| Non-coding RNA | 6.93 (0.33) | 7,698 | 2,451,312 | RNAdb, Noncode, Rfam, European ribosomal RNA database |
The six putative ncRNA datasets.
| Dataset | Sequences | Total length (nt) | Ave length (nt) |
|---|---|---|---|
| FANTOM3 putative ncRNA | 34,030 | 67,856,244 | 1,994 |
| FANTOM3 stringent putative ncRNA | 2,886 | 4,535,792 | 1,572 |
| RNAz set1.P0.5 | 91,676 | 12,474,689 | 136 |
| RNAz set1.P0.9 | 35,985 | 5,475,570 | 152 |
| RNAz set2.P0.5 | 20,391 | 2,798,941 | 137 |
| EvoFold | 48,479 | 1,869,205 | 39 |
Figure 1. Application of the DNA monkey test to biological sequence analysis. The DNA test counts 10-letter missing words from a 4-letter alphabet (A, C, G, T/U). Three overlapping 10-letter words are shown in blue, green, and red, respectively.
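The counting step described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the mean (141,909) and standard deviation (339) used for the z-score are the values quoted in Marsaglia's Diehard description of the DNA test for a text of 2^21 overlapping words, assumed here rather than taken from this record. The relative z-score used in the paper is a further normalization that is not reproduced.

```python
def count_missing_words(seq: str, k: int = 10) -> int:
    """Count k-letter words over {A, C, G, T/U} that never occur in seq.

    Words are read from overlapping windows that advance one letter at a
    time. Any other character (e.g. N) resets the current window.
    """
    code = {"A": 0, "C": 1, "G": 2, "T": 3, "U": 3}
    mask = (1 << (2 * k)) - 1   # keep only the last k letters (2 bits each)
    seen = set()
    word = 0
    valid = 0                   # consecutive valid letters read so far
    for ch in seq.upper():
        c = code.get(ch)
        if c is None:
            word, valid = 0, 0
            continue
        word = ((word << 2) | c) & mask
        valid += 1
        if valid >= k:
            seen.add(word)
    return 4 ** k - len(seen)


def dna_test_z(missing: int, mean: float = 141909.0, sigma: float = 339.0) -> float:
    """Plain z-score of a missing-word count (assumed Diehard constants)."""
    return (missing - mean) / sigma
```

With `k = 10` there are 4^10 = 2^20 possible words, so the set of observed words stays small enough to hold in memory for sequences of the lengths listed in the tables.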
Figure 2. Sequence manipulation scheme for randomness analysis.
Figure 3. The degree of randomness (relative z-scores) for different sequence classes in the three phylogenetic domains by the DNA test: (a) the three-domain collection (the intron class corresponds only to eukaryotes); (b) Archaeal dataset; (c) Bacterial dataset; (d) Eukaryotic dataset; (e) the repeat (control) dataset. The distribution of the relative z-score of each sequence class is estimated by the density function in R (R Development Core Team 2006). The color legend in the inset of (a) applies to (b), (c), and (d).
Relative z-scores of the six nucleotide sequence classes in the three domains.
| Class | Collection: Range | Collection: Mean (std) | Archaea: Range | Archaea: Mean (std) | Bacteria: Range | Bacteria: Mean (std) | Eukarya: Range | Eukarya: Mean (std) |
|---|---|---|---|---|---|---|---|---|
| Genome | 0.9–11.7 | 2.9 (2.2) | 1.2–11.5 | 3.6 (2.5) | 1.2–15.2 | 3.0 (2.3) | 1.0–5.8 | 2.2 (1.0) |
| Intergene | 1.3–2.2 | 1.6 (0.2) | 1.2–1.3 | 1.2 (0.02) | 4.1–5.1 | 4.6 (0.2) | 1.4–2.2 | 1.7 (0.2) |
| Intron | 1.6–2.0 | 1.8 (0.1) | / | / | / | / | 1.6–2.0 | 1.8 (0.1) |
| mRNA | 5.7–7.3 | 6.3 (0.2) | 2.8–3.5 | 3.1 (0.2) | 4.0–5.1 | 4.6 (0.3) | 3.7–5.1 | 4.4 (0.3) |
| ncRNA | 6.3–7.7 | 6.9 (0.3) | / | / | 8.7–9.5 | 9.1 (0.1) | 6.1–6.5 | 6.2 (0.1) |
| Repeat | 24.5–230.8 | 58.0 (32.9) | 24.5–230.8 | 58.0 (32.9) | 24.5–230.8 | 58.0 (32.9) | 24.5–230.8 | 58.0 (32.9) |
Figure 4. The degree of randomness of the six putative ncRNA datasets measured by the DNA test. The relative z-score distribution of the six datasets is denoted as follows: (a) EvoFold, (b) RNAz set2.P0.5, (c) FANTOM3 putative, (d) RNAz set1.P0.5, (e) FANTOM3 stringent and (f) RNAz set1.P0.9.
Figure S1. The pairwise sequence similarity of the ncRNA class. The ncRNA class is divided into subgroups with a window size of 50 nt. The sequence similarity within subgroups is analyzed by the EMBOSS program. The error bar shows the standard deviation of similarity scores in a subgroup.
Figure S4. The pairwise sequence similarity of selected sequences of the EvoFold dataset. The dataset is divided into subgroups with a window size of 50 nt. The sequence similarity within subgroups is analyzed by the EMBOSS program. The error bar shows the standard deviation of similarity scores in a subgroup.
Figure 5. The model used to assess the ncRNA fraction in the FANTOM3, RNAz and EvoFold datasets. The mean relative z-scores of the six datasets are shown as dashed lines in the same order as in Fig. 4. Error bars show standard deviations of relative z-scores. The four predictions for the FANTOM3, the FANTOM3 stringent, the RNAz set1.P0.5 and the RNAz set1.P0.9 datasets by f3(z) are highlighted with bullets.
The ncRNA fractions predicted by the model. Relative z-scores are shown in mean values and standard deviations.
| Dataset | Relative z-score, mean (std) | % | Sequence # | % | Sequence # | % | Sequence # |
|---|---|---|---|---|---|---|---|
| FANTOM3 putative | 2.63 (0.05) | <5% | <1,701 | 22% | 7,487 | 18% | 6,125 |
| FANTOM3 stringent | 3.32 (0.06) | 9.7% | 280 | 50% | 1,443 | 47% | 1,356 |
| RNAz set1.P0.5 | 3.08 (0.04) | <5% | <4,584 | 42% | 38,503 | 39% | 35,754 |
| RNAz set1.P0.9 | 3.49 (0.04) | 21% | 7,557 | 53% | 19,072 | 52% | 18,712 |
| RNAz set2.P0.5 | 2.21 (0.01) | <5% | <1,020 | <5% | <1,020 | <5% | <1,020 |
| EvoFold | ~1.44 (0.02) | <5% | <2,424 | <5% | <2,424 | <5% | <2,424 |
The relative z-score of the EvoFold dataset is estimated by concatenated sequences mixed with EvoFold predictions and known ncRNAs because the total length of the EvoFold dataset is shorter than the required length of the DNA test.
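The "Sequence #" entries in the table of predicted fractions appear to be the predicted percentage applied to the number of sequences in each dataset (taken from the table of six putative ncRNA datasets). A minimal sketch of that arithmetic, assuming simple rounding to the nearest integer; the function name is hypothetical, and because the published percentages are themselves rounded, a recomputed count can occasionally differ from the table by one:

```python
def predicted_count(fraction_percent: float, n_sequences: int) -> int:
    """Number of sequences implied by a predicted ncRNA fraction (in percent)."""
    return round(fraction_percent * n_sequences / 100)


# Dataset sizes from the table of six putative ncRNA datasets.
fantom_putative = 34_030
fantom_stringent = 2_886

print(predicted_count(22, fantom_putative))     # FANTOM3 putative at 22%
print(predicted_count(9.7, fantom_stringent))   # FANTOM3 stringent at 9.7%
```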
Figure 6. Thermodynamic analysis of selected sequences of the FANTOM3 putative ncRNA dataset (<400 nt) and ten known ncRNA families. The passing rate, tested sequence number and dataset name are shown above the passing rate bar.
Figure 7. The six submodels used to simulate systematic errors in our fraction model. The submodels f4 and f5 are constructed with 10% and 25% group I intron sequences in the ncRNA partition of the ncRNA/genome model, respectively; f6 and f7 are constructed with 10% and 25% rRNA sequences in the ncRNA partition, respectively. The submodels f8 and f9 are ncRNA/genome models using the version 2 and version 3 ncRNA reference datasets (Table 1), respectively. The mean relative z-scores of the six test datasets are shown as dashed lines in the same order as in Fig. 4.
The systematic errors caused by biased ncRNA training datasets. Four submodels, f4–f7, are created to simulate biased training data using 10% and 25% group I intron and rRNA sequences, respectively. The submodel labels are the same as in Fig. 7.
| Dataset | |||||
|---|---|---|---|---|---|
| FANTOM3 putative | 31% | 22% | 18% | 16% | 12% |
| FANTOM3 stringent | 72% | 56% | 47% | 42% | 34% |
| RNAz set1.P0.5 | 60% | 47% | 39% | 35% | 28% |
| RNAz set1.P0.9 | 78% | 61% | 52% | 47% | 38% |
| RNAz set2.P0.5 | <5% | <5% | <5% | <5% | <5% |
| EvoFold | <5% | <5% | <5% | <5% | <5% |
Same dataset as described in Table 4.
Figure S2. The pairwise sequence similarity of the FANTOM3 stringent dataset. The dataset is divided into subgroups with a window size of 50 nt. The sequence similarity within subgroups is analyzed by the EMBOSS program. The error bar shows the standard deviation of similarity scores in a subgroup.
100 mouse genomic RefSeqs serve as sources for the mouse genome and intergene classes in our fraction model. Acc. No. denotes the accession number in GenBank.
| Acc. No. | Acc. No. | Acc. No. | Acc. No. |
|---|---|---|---|
| NT_039173 | NT_039360 | NT_039548 | NT_039702 |
| NT_039185 | NT_039361 | NT_039563 | NT_039711 |
| NT_039186 | NT_039385 | NT_039573 | NT_039713 |
| NT_039189 | NT_039413 | NT_039578 | NT_078297 |
| NT_039190 | NT_039420 | NT_039580 | NT_078355 |
| NT_039202 | NT_039424 | NT_039586 | NT_078380 |
| NT_039206 | NT_039436 | NT_039589 | NT_078925 |
| NT_039212 | NT_039438 | NT_039590 | NT_080546 |
| NT_039229 | NT_039455 | NT_039595 | NT_081117 |
| NT_039230 | NT_039457 | NT_039596 | NT_082868 |
| NT_039234 | NT_039460 | NT_039609 | NT_095756 |
| NT_039238 | NT_039461 | NT_039617 | NT_108905 |
| NT_039240 | NT_039462 | NT_039618 | NT_108907 |
| NT_039260 | NT_039471 | NT_039625 | NT_109313 |
| NT_039267 | NT_039474 | NT_039636 | NT_109314 |
| NT_039268 | NT_039475 | NT_039638 | NT_109317 |
| NT_039301 | NT_039476 | NT_039641 | NT_109320 |
| NT_039302 | NT_039477 | NT_039649 | NT_110856 |
| NT_039314 | NT_039482 | NT_039650 | NT_111909 |
| NT_039340 | NT_039490 | NT_039655 | NT_111916 |
| NT_039343 | NT_039495 | NT_039657 | NT_161953 |
| NT_039350 | NT_039496 | NT_039676 | NT_162143 |
| NT_039353 | NT_039500 | NT_039678 | NT_162293 |
| NT_039356 | NT_039501 | NT_039699 | NT_162294 |
| NT_039359 | NT_039515 | NT_039700 | NT_163365 |