Christian Anthon, Hakim Tafer, Jakob H Havgaard, Bo Thomsen, Jakob Hedegaard, Stefan E Seemann, Sachin Pundhir, Stephanie Kehr, Sebastian Bartschat, Mathilde Nielsen, Rasmus O Nielsen, Merete Fredholm, Peter F Stadler, Jan Gorodkin.
Abstract
BACKGROUND: Annotating mammalian genomes for noncoding RNAs (ncRNAs) is nontrivial since far from all ncRNAs are known and the computational models are resource demanding. Currently, the human genome holds the best mammalian ncRNA annotation, a result of numerous efforts by several groups. However, a more direct strategy is desired for the increasing number of sequenced mammalian genomes of which some, such as the pig, are relevant as disease models and production animals.
Year: 2014 PMID: 24917120 PMCID: PMC4124155 DOI: 10.1186/1471-2164-15-459
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Pipeline. The modules of the RNA pipeline. Module 1 (annotation): The annotation pipeline takes as input any number of sequences and runs a number of external annotation tools on them (see text for details). This leads to the initial annotation of the RNA loci in the sequence. A naming and resolving tool decides on the final annotation of each locus. The 3,393 ncRNA genes cover 11 conflicts of annotation, 34 loci moved to the medium confident annotation during the curation step, 165 novel miRNA loci found exclusively by miRDeep, and 3,183 ncRNA genes found by homology. LncRNA loci and cis-regulatory elements are annotated separately. Module 2 (multiple alignments): The multiple alignment pipeline runs on a genomic scale and aligns the genomic sequence of the input genome against any number of other genomes, finally forming multiple alignments in MAF blocks. Module 3 (post-processing): The post-processing part of the pipeline adds context to the RNAs, which in many cases will allow for a curation of the structured RNA loci. The numbers in parentheses are those obtained after the removal of 34 annotations as part of the curation procedure: 12 tRNAs, 15 homology based miRNAs, and 7 de novo miRNAs.
Cutoffs for individual tools at different global confidence levels
| Comp. strategy | High | Medium | Low |
|---|---|---|---|
| BLAST | 95% id 95% length | 92.5% id 92.5% length | 90% id 90% length |
| Infernal | BLAST E=1e-3; Infernal E=1e-3 | BLAST E=0.1; Infernal E=1e-3 | No BLAST filter; Infernal E=1e-3 |
| Infernal miRNA | Not applied | BLAST E=1e-3; Infernal E=1e-9 | BLAST E=0.1; Infernal E=1e-6 |
| tRNAscan-SE | High | High | As is |
| RNAmmer | As is | As is | As is |
| snoStrip | As is | As is | As is |
| miRDeep | Hand cleaned | As is | As is |
The results from the individual tools are merged at 3 different cutoff levels: high, medium and low. This table shows the correspondence between the cutoff levels of the merged annotation and those of the tools. For each computational (comp.) strategy, we define these levels. Note that the Infernal screen has been divided into Infernal (without miRNAs) and Infernal miRNA, which is only for miRNA families. The Infernal results were filtered by the family-specific gathering scores as well. "As is" refers to the program's default values for the respective versions (see the Methods section for version numbers) without subsequent cleaning.
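As an illustration of how one row of this table could be applied, the following minimal Python sketch encodes the BLAST cutoffs and assigns a hit to the strictest level it satisfies. The names `BLAST_CUTOFFS` and `blast_confidence` are hypothetical, and treating the thresholds as inclusive is an assumption; this is not the pipeline's own code.

```python
# Hypothetical encoding of the BLAST row of the cutoff table:
# level -> (minimum % identity, minimum % of query length aligned).
BLAST_CUTOFFS = {
    "high": (95.0, 95.0),
    "medium": (92.5, 92.5),
    "low": (90.0, 90.0),
}


def blast_confidence(pct_identity, pct_length):
    """Return the strictest level whose cutoffs the hit meets, or None.

    Assumption: thresholds are inclusive (>=).
    """
    for level in ("high", "medium", "low"):  # strictest first
        min_identity, min_length = BLAST_CUTOFFS[level]
        if pct_identity >= min_identity and pct_length >= min_length:
            return level
    return None


print(blast_confidence(96.0, 95.0))  # high
print(blast_confidence(93.0, 99.0))  # medium
print(blast_confidence(89.0, 100.0))  # None
```

Because the levels are nested, a hit that passes the high cutoff also passes medium and low, so checking from strictest to loosest is sufficient.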
Results of the homology based pipeline
| RNA class | Families (High) | Loci (High) |
|---|---|---|
| cisreg-elements | 31 | 139 |
| lncRNA-loci | 58 | 58 |
| miRNA | 321 | 359 |
| ribozyme | 3 | 8 |
| rRNA | 5 | 185 |
| snoRNA | 211 | 638 |
| snRNA | 10 | 1,030 |
| tRNA | 51 | 810 |
| Other | 7 | 153 |
| Conflict | 9 | 11 |
| Sum | 706 | 3,391 |
The combined results of the sequence similarity search, structure homology search and class specific tools at the high confident cutoff level (see Table 1). The column RNA class contains cisreg-elements: cis-regulatory elements from Rfam/Infernal; lncRNA-loci: Infernal lncRNA structure loci, i.e. structural loci from larger genes (lncRNAs); the next 7 rows contain (full length) ncRNA genes, miRNA: BLAST from miRBase and miRDeep predictions; ribozyme: ribozymes from Rfam/Infernal; rRNA: ribosomal RNAs primarily from RNAmmer; snRNA and snoRNA: BLAST results and results from Infernal/Rfam; tRNA: tRNAs from BLAST, tRNAscan-SE and Infernal/Rfam; other: RNA families from Rfam not belonging to one of the other classes; conflict: conflicts of annotation. Loci is the number of RNA loci of a given class; Families is a subdivision of classes into RNAs with the same name. 12 tRNAs and 15 miRNAs were moved to the medium confident annotation as part of the curation procedure. See text for details. Note that for the final high confident annotation we add 165 RNA-seq based miRNA candidates, reaching the total of 3,556 high confident RNA loci.
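The totals quoted in this caption can be checked directly from the table rows. The short Python sketch below copies the per-class counts from the table and recomputes the sums; the variable names are illustrative only.

```python
# (families, loci) per RNA class at the high confident level,
# copied from the table above.
HIGH_CONFIDENT = {
    "cisreg-elements": (31, 139),
    "lncRNA-loci": (58, 58),
    "miRNA": (321, 359),
    "ribozyme": (3, 8),
    "rRNA": (5, 185),
    "snoRNA": (211, 638),
    "snRNA": (10, 1030),
    "tRNA": (51, 810),
    "other": (7, 153),
    "conflict": (9, 11),
}
MIRDEEP_CANDIDATES = 165  # RNA-seq based miRNA candidates added afterwards

total_families = sum(families for families, _ in HIGH_CONFIDENT.values())
total_loci = sum(loci for _, loci in HIGH_CONFIDENT.values())
final_high_confident = total_loci + MIRDEEP_CANDIDATES

print(total_families, total_loci, final_high_confident)  # 706 3391 3556
```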
Figure 2. Homology based annotation in overview. The Venn diagrams for counting the high confident structured RNA loci found with sequence homology search (BLAST), structure homology search (Infernal) and class specific tools (tRNAscan-SE, snoStrip, and RNAmmer). (a) The diagram includes all 3,418 high confident structured RNA loci obtained by homology. (b) The similarity search is confined to the Rfam seed sequences (excluding the miRNA families in Rfam). 19 loci found by high confident BLAST against the Rfam sequences are missed by high confident structure homology search (18 + 1 in the red and purple areas of the right hand side of the figure). The reason is our additional Infernal E-value cutoff of 1e-3 imposed on all families. See text for detailed discussion. The numbers in parentheses are after removal of the 12 tRNA and 15 miRNA loci removed in the cleanup procedure. A total of 3,391 RNA loci were found. Of these, 1,011 loci were found by sequence similarity, 2,314 were found by structure similarity, and 1,505 were detected by class specific methods.
Annotation of unannotated block groups
| | miRDeep-unannotated | miRDeep-miRNA | Sum |
|---|---|---|---|
| dba-unannotated | 417 | 4 | 421 |
| dba-miRNA | 46 | 31 | 77 |
| dba-rRNA | 1 | 0 | 1 |
| dba-snoRNA | 37 | 5 | 42 |
| dba-snRNA | 6 | 0 | 6 |
| dba-tRNA | 39 | 0 | 39 |
| dba-annotated | 129 | 36 | 165 |
| Sum | 546 | 40 | 586 |
The transcripts from the small RNA study without annotation by the homology based pipeline were analysed with blockbuster, resulting in 586 transcripts (block groups). The table shows a comparison of the de novo annotation of these 586 unannotated block groups by deepBlockAlign and by miRDeep. The second column contains the 546 block groups not annotated by miRDeep, the third column the 40 block groups annotated by miRDeep, and the fourth is the sum of the two previous columns, i.e. all 586 block groups. The rows contain the deepBlockAlign annotations: the second row contains the block groups without deepBlockAlign (dba) annotation, rows 3–7 contain the deepBlockAlign classifications, row 8 is the sum of rows 3–7, and finally row 9 is the sum of rows 2 and 8, that is, all 586 block groups and, depending on the column, their miRDeep annotation.
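The row and column relationships described in this caption can be reproduced with a few lines of Python. The sketch below copies the cell values from the table; the names `DBA_CLASSES` and `DBA_UNANNOTATED` are illustrative only.

```python
# deepBlockAlign class -> (miRDeep-unannotated, miRDeep-miRNA), from the table.
DBA_CLASSES = {
    "miRNA": (46, 31),
    "rRNA": (1, 0),
    "snoRNA": (37, 5),
    "snRNA": (6, 0),
    "tRNA": (39, 0),
}
DBA_UNANNOTATED = (417, 4)

# Row 8 (dba-annotated): column-wise sum of the five classification rows.
dba_annotated = tuple(sum(column) for column in zip(*DBA_CLASSES.values()))
# Row 9 (Sum): dba-unannotated plus dba-annotated, per column.
column_sums = tuple(a + u for a, u in zip(dba_annotated, DBA_UNANNOTATED))
grand_total = sum(column_sums)

print(dba_annotated)  # (129, 36)
print(column_sums)    # (546, 40)
print(grand_total)    # 586
```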
Figure 3. Phylogenetic tree. Phylogenetic tree for the pig genome multiple alignments. This phylogenetic tree is derived from the human phylogenetic tree from UCSC based on the 46-way alignment. The organisms have been reordered to put pig on top and the branch lengths have been ignored. A tree in this form is needed as a parameter for the TBA/MultiZ program. The tree is without branch lengths since these have not been recalculated for the pig genome.
Pairwise alignments
| Genome | Species | LASTZ/chaining options | Coverage | Alignment type |
|---|---|---|---|---|
| danRer7 | Zebrafish | Distant | 1.69 | rbest |
| xenTro2 | Frog | Distant | 1.94 | rbest |
| galGal3 | Chicken | Distant | 3.61 | rbest |
| ornAna1 | Platypus | Distant | 5.97 | rbest |
| monDom5 | Opossum | Distant | 10.70 | syntenic |
| eriEur1 | Hedgehog | Close | 19.08 | rbest |
| echTel1 | Tenrec | Close | 20.43 | rbest |
| rn4 | Rat | Close | 25.76 | syntenic |
| mm9 | Mouse | Close | 27.71 | syntenic |
| dasNov2 | Armadillo | Close | 30.10 | rbest |
| choHof1 | Sloth | Close | 30.94 | rbest |
| tarSyr1 | Tarsier | Close | 36.80 | rbest |
| oryCun2 | Rabbit | Close | 42.11 | syntenic |
| felCat4 | Cat | Close | 43.42 | rbest |
| loxAfr3 | Elephant | Close | 47.52 | syntenic |
| turTru1 | Dolphin | Close | 53.93 | rbest |
| hg19 | Human | Close | 55.93 | syntenic |
| bosTau5 | Cow | Close | 58.13 | syntenic |
| canFam2 | Dog | Close | 58.83 | syntenic |
| equCab2 | Horse | Close | 63.90 | syntenic |
| 21way | Multiple alignment | | 78.98 | |
Pairwise and multiple alignments of the pig genome. The other genomes were obtained from the UCSC genome browser website in their lower-case masked form. Masking was performed by UCSC with RepeatMasker and Tandem repeat masker. The UCSC genome designation is given in the first column. The options for the pairwise alignments are given in the third column as either closely or distantly related to the pig, in accordance with the choices made by UCSC for the human genome. The distance to pig has implications for the LASTZ and axtChain options as listed in Additional file 1: Table S19. The coverage of the pig genome is given in % in the fourth column, based on the number of non-Ns covered by the pairwise alignment after cleaning of the alignments as specified. The pairwise alignments are cleaned either by synteny (small alignment chains are deleted when they would otherwise break synteny) or by deleting all but the best alignments where the target genome is multiply covered. Both methods reduce the coverage of the pig genome by the alignment. The choice of cleanup method is given in the last column. A graphical representation of the coverage is given in Figure 4.
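To make the coverage definition concrete, the following Python sketch computes the percentage of non-N bases of a target sequence that fall inside at least one aligned interval. The function name `percent_coverage`, the 0-based half-open interval convention, and the toy sequence are assumptions made for illustration; the actual computation in the pipeline works on genome-scale alignment files and is not shown here.

```python
def percent_coverage(sequence, aligned_intervals):
    """Percent of non-N bases covered by at least one aligned interval.

    Assumption: intervals are (start, end) pairs, 0-based and half-open.
    Overlapping intervals are counted once per base.
    """
    covered = [False] * len(sequence)
    for start, end in aligned_intervals:
        for i in range(max(start, 0), min(end, len(sequence))):
            covered[i] = True
    # Only positions that are not N (in either case) enter the denominator.
    non_n = [i for i, base in enumerate(sequence) if base not in "Nn"]
    if not non_n:
        return 0.0
    return 100.0 * sum(1 for i in non_n if covered[i]) / len(non_n)


# Toy example: 8 non-N bases, 4 of them covered by the aligned intervals.
print(percent_coverage("ACGTNNNNACGT", [(0, 4), (2, 6)]))  # 50.0
```

A per-base boolean list is adequate for a toy sequence; for a whole genome one would more likely merge sorted intervals and subtract N runs instead.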
Figure 4. Coverage of the pig genome. Absolute coverage of the pig genome by the genomes used for the multiple alignment. The coverage is based on the cleaned alignments with single best coverage of the pig genome. However, depending on the method of cleaning, the coverage of the target genomes (x-axis) may be multiple in some locations. Far left, in green, is the result of the coverage of the pig genome by any genome in the multiple alignments.
Synteny of the RNAs of the homology based pipeline
| RNA class | # loci | hg19: syntenic | hg19: conserved | hg19: both | Conserved in ≥1 other organisms | Conserved in ≥5 other organisms | Conserved in ≥15 other organisms |
|---|---|---|---|---|---|---|---|
| cisreg-elements | 139 | 80 | 84 | 65 | 116 | 86 | 31 |
| lncRNA-loci | 58 | 57 | 53 | 53 | 58 | 57 | 7 |
| miRNA | 369 | 303 | 349 | 292 | 360 | 349 | 102 |
| putative-miRNA | 155 | 121 | 25 | 20 | 65 | 25 | 1 |
| ribozyme | 8 | 8 | 3 | 3 | 3 | 3 | 0 |
| rRNA | 185 | 143 | 0 | 0 | 6 | 1 | 0 |
| snoRNA | 638 | 473 | 266 | 221 | 400 | 282 | 43 |
| snRNA | 1,030 | 674 | 24 | 13 | 119 | 26 | 4 |
| tRNA | 810 | 549 | 274 | 199 | 389 | 284 | 14 |
| other | 153 | 111 | 10 | 10 | 17 | 11 | 0 |
| Conflict | 11 | 10 | 10 | 10 | 10 | 10 | 4 |
| Sum | 3,556 | 2,529 | 1,098 | 886 | 1,543 | 1,134 | 206 |
The columns are: RNA class; # RNA loci; # loci in human syntenic blocks; # loci conserved in human at 80% sequence identity; # loci both syntenic and conserved, that is, the number of ncRNAs in syntenic blocks where the ncRNA is actually conserved in human; and # loci conserved in N other organisms, grouped by the number of loci conserved in at least 1, 5, or 15 other organisms. Conservation is determined by the sequence identity in the pairwise alignments. The RNA loci are located in the pairwise alignments and the sequence identity is calculated when at least 80% of an RNA locus is covered. The RNA locus is counted as conserved in that organism if the locus has a sequence identity of at least 80%. The table gives the number of RNAs conserved in at least N (N=1, N=5 or N=15) of the other genomes: bosTau5, canFam2, choHof1, danRer7, dasNov2, echTel1, equCab2, eriEur1, felCat4, galGal3, hg19, loxAfr3, mm9, monDom5, ornAna1, rn4, oryCun2, tarSyr1, turTru1, xenTro2.
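The conservation rule in this caption (at least 80% of the locus covered, then at least 80% sequence identity) can be sketched in Python as follows. The function names and arguments are hypothetical, and computing identity over the aligned part of the locus rather than over its full length is an assumption, since the caption does not specify the denominator.

```python
def is_conserved(locus_length, aligned_length, identical_bases,
                 min_coverage=0.8, min_identity=0.8):
    """Apply the two-step rule: coverage first, then sequence identity.

    Assumption: identity = identical bases / aligned bases of the locus.
    """
    if locus_length <= 0 or aligned_length <= 0:
        return False
    if aligned_length / locus_length < min_coverage:
        return False  # identity is only evaluated for well-covered loci
    return identical_bases / aligned_length >= min_identity


def conserved_in_at_least(calls, n):
    """True if the locus is conserved in at least n of the other genomes.

    `calls` maps a genome name (e.g. "hg19") to a conservation call.
    """
    return sum(1 for conserved in calls.values() if conserved) >= n


# Toy calls for a single locus (illustrative numbers, not from the paper).
calls = {
    "hg19": is_conserved(100, 90, 80),     # covered 90%, identity ~89%
    "mm9": is_conserved(100, 70, 70),      # covered only 70%
    "bosTau5": is_conserved(100, 100, 85),
}
print(conserved_in_at_least(calls, 1), conserved_in_at_least(calls, 5))  # True False
```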
Overlap of the RNAz predictions with the high confident annotation
| RNA class | Annotation: families | Annotation: loci | RNAz overlap: families | RNAz overlap: loci |
|---|---|---|---|---|
| cisreg-elements | 31 | 139 | 4 | 10 |
| lncRNA-loci | 58 | 58 | 3 | 3 |
| miRNA | 330 | 369 | 222 | 241 |
| putative-miRNA | 135 | 155 | 14 | 16 |
| ribozyme | 3 | 8 | 0 | 0 |
| rRNA | 5 | 185 | 2 | 2 |
| snoRNA | 211 | 638 | 57 | 72 |
| snRNA | 10 | 1,030 | 9 | 20 |
| tRNA | 51 | 810 | 36 | 154 |
| other | 7 | 153 | 3 | 4 |
| conflict | 9 | 11 | 4 | 6 |
| sum | 850 | 3,556 | 354 | 528 |
Comparison of the strand specific RNAz results with the result of the automatic annotation pipeline. The columns are: RNA class; # RNA families in the high confident annotation; # RNA loci in the high confident annotation; # RNA families that overlap with the RNAz predictions; and # RNA loci that overlap with the RNAz predictions.
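A strand specific overlap count of this kind can be sketched in a few lines of Python. The `Locus` record, the 0-based half-open coordinates, and the choice to count any overlap of at least one base are assumptions for illustration; the caption does not state the minimum overlap that was required.

```python
from collections import namedtuple

# Hypothetical record type; coordinates are 0-based and half-open.
Locus = namedtuple("Locus", "chrom start end strand")


def overlaps(a, b):
    """True if a and b share at least one base on the same chromosome and strand."""
    return (a.chrom == b.chrom and a.strand == b.strand
            and a.start < b.end and b.start < a.end)


def count_overlapping(annotated, predictions):
    """Number of annotated loci overlapped by at least one prediction."""
    return sum(1 for locus in annotated
               if any(overlaps(locus, p) for p in predictions))


# Toy data (not from the paper).
annotated = [
    Locus("chr1", 100, 180, "+"),
    Locus("chr1", 500, 570, "-"),
    Locus("chr2", 100, 180, "+"),
]
predictions = [
    Locus("chr1", 150, 250, "+"),  # overlaps the first locus
    Locus("chr1", 500, 570, "+"),  # same position as the second, wrong strand
]
print(count_overlapping(annotated, predictions))  # 1
```

The nested loop is quadratic; for genome-scale input a sorted sweep or an interval tree would be the usual replacement.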
Curated annotation
| RNA class | # high confident loci | # curated loci | # pseudogenes | # loci, pseudogenes subtracted |
|---|---|---|---|---|
| cisreg-elements | 139 | 93 | 31 | 108 |
| lncRNA-loci | 58 | 0 | 0 | 58 |
| miRNA | 369 | 125 | 0 | 369 |
| putative-miRNA | 155 | 0 | 0 | 155 |
| ribozyme | 8 | 1 | 5 | 3 |
| rRNA | 185 | 3 | 182 | 3 |
| snoRNA | 638 | 269 | 278 | 360 |
| snRNA | 1,030 | 69 | 960 | 70 |
| tRNA | 810 | 0 | 0 | 810 |
| Other | 153 | 3 | 125 | 28 |
| Conflict | 11 | 8 | 0 | 11 |
| Sum | 3,556 | 571 | 1,581 | 1,975 |
The high confident annotation is a combination of the results of the high confident homology pipeline and the miRDeep results. See Table 2 for row labels. Column labels: high confident is the high confident annotation prior to curation; curated is the number of loci curated by methods explained in the text; pseudogenes are the loci expected to be PolII/PolIII transcripts, but failing to be so, ribosomal RNAs not part of the cluster on chromosome 6, and cis-regulatory elements without gene context. In the column overlaps to structured RNA loci annotated by homology as well as putative miRNAs are given. Curated annotation contains loci that are a) curated or b) not tested in the curation procedure, e.g., miRNA loci. High confident is the complete high confident annotation (homology + miRDeep). 8 miRNAs detected by miRDeep, but not by high confident BLAST, were re-annotated in the section miRNAs in the pig genome. A table with the 3,877 medium confident loci and 36,647 low confident loci is found in Additional file 1: Table S20.
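The last column of the table is the high confident count minus the pseudogene count for each class. The Python sketch below copies both columns from the table and recomputes the difference and the totals; the names are illustrative only.

```python
# RNA class -> (# high confident loci, # pseudogenes), copied from the table.
CURATION = {
    "cisreg-elements": (139, 31),
    "lncRNA-loci": (58, 0),
    "miRNA": (369, 0),
    "putative-miRNA": (155, 0),
    "ribozyme": (8, 5),
    "rRNA": (185, 182),
    "snoRNA": (638, 278),
    "snRNA": (1030, 960),
    "tRNA": (810, 0),
    "other": (153, 125),
    "conflict": (11, 0),
}

remaining = {rna_class: loci - pseudogenes
             for rna_class, (loci, pseudogenes) in CURATION.items()}
total_loci = sum(loci for loci, _ in CURATION.values())
total_pseudogenes = sum(pseudogenes for _, pseudogenes in CURATION.values())

print(remaining["snRNA"], remaining["snoRNA"])  # 70 360
print(total_loci, total_pseudogenes, sum(remaining.values()))  # 3556 1581 1975
```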