Alexey Stupnikov, Vitaly Bezuglov, Ivan Skakov, Victoria Shtratnikova, J Richard Pilsner, Alexander Suvorov, Oleg Sergeyev.
Abstract
Transcriptomics analysis of various small RNA (sRNA) biotypes is a new and rapidly developing field. Annotations for microRNAs, tRNAs, piRNAs and rRNAs contain information on transcript sequences and loci that is vital for downstream analyses. Several databases have been established to provide this type of data for specific RNA biotypes. However, these sources often provide data in different formats, which makes the bulk analysis of several sRNA biotypes in a single pipeline challenging. Information on some transcripts may be incomplete or may conflict with other entries. To overcome these challenges, we introduce ITAS, or Integrated Transcript Annotation for Small RNA, a filtered, corrected and integrated transcript annotation containing information on several types of small RNAs, including tRNA-derived small RNA, for several species (Homo sapiens, Rattus norvegicus, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans). ITAS is presented in a format applicable to the vast majority of bioinformatic transcriptomics analyses, and it was tested in several case studies for human-derived data against existing alternative databases.
Keywords: RNA-seq; differential gene expression; microRNA; piRNA; small RNA; small RNA fragments; tRNA-derived small RNA; transcript annotation; transcriptomics
Year: 2022 PMID: 35645337 PMCID: PMC9150019 DOI: 10.3390/ncrna8030030
Source DB: PubMed Journal: Noncoding RNA ISSN: 2311-553X
Statistics on completeness of human small RNA transcript entries in the databases prior to any correcting or filtering.
| RNA Type | Database | Loci and Sequence | Loci Only | Sequence Only | Genome Version |
|---|---|---|---|---|---|
| precursor microRNA | miRBase | 1002 | 913 | 2 | hg38 |
| mature microRNA | miRBase | 1477 | 1405 | 1 | hg38 |
| piRNA | piRNAdb | 812,343 | 0 | 2 | hg38 |
| tRNA | GtRNAdb | 430 | 187 | 2 | hg38 |
| rRNA | UCSC | 1752 | 0 | 186 | hg38 |
| tsRNA | MINTbase | 125,285 | 0 | 0 | hg19 |
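
The three completeness categories in the table above can be illustrated with a minimal Python sketch. It assumes each database entry has already been parsed into a dictionary with optional `locus` and `sequence` fields; that record layout, the function names and the toy entries are hypothetical and are not taken from the ITAS code.

```python
from collections import Counter


def completeness(entry):
    """Classify one annotation entry by which fields it carries.

    `entry` is a hypothetical dict with optional 'locus' and 'sequence' keys.
    """
    has_locus = bool(entry.get("locus"))
    has_seq = bool(entry.get("sequence"))
    if has_locus and has_seq:
        return "Loci and Sequence"
    if has_locus:
        return "Loci Only"
    if has_seq:
        return "Sequence Only"
    return "Empty"


def tally(entries):
    """Count entries per completeness category."""
    return Counter(completeness(e) for e in entries)


# Toy entries (invented for illustration, not real database records).
entries = [
    {"id": "a", "locus": ("chr1", 100, 122, "+"), "sequence": "ACGTACGTACGTACGTACGTAC"},
    {"id": "b", "locus": ("chr2", 5, 30, "-")},
    {"id": "c", "sequence": "TTGACCA"},
]
counts = tally(entries)
print(dict(counts))
```

Run over a full database dump, a tally of this kind would yield one row of the table per RNA type.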
Figure 1. Visualization of database completeness for both sequence data and loci data for various small RNA types. (A) For microRNA. (B) For piRNA. (C) For mature tRNA. (D) For rRNA.
Figure 2. Distribution of the number of loci per transcript in the databases for various small RNA types. (A) For microRNA. (B) For piRNA. (C) For mature tRNA. (D) For rRNA.
Figure 3. Statistics for the database-delivered sequence (fasta) and the genome locus-delivered sequence (getfasta) for MINTbase transcripts. (A) Difference in lengths between fasta and getfasta, as absolute frequencies. The bar for value = −1 was much larger than the others and was excluded from the visualization. (B) Difference in lengths between fasta and getfasta, as log-transformed frequencies. (C) Lengths of fasta versus getfasta sequences for transcripts. (D) Hamming distance distribution for transcripts where the fasta and getfasta sequences have mismatches.
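
The two quantities plotted in Figure 3 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the sign of the length difference (fasta minus getfasta) is an assumption because the caption does not state it, and converting U to T before comparison is likewise an assumption.

```python
def normalize(seq):
    """Uppercase a sequence and map RNA 'U' to DNA 'T' (assumed convention)."""
    return seq.upper().replace("U", "T")


def length_difference(fasta_seq, getfasta_seq):
    """Length of the database sequence minus length of the genome-derived one.

    The sign convention is an assumption; the source does not specify it.
    """
    return len(fasta_seq) - len(getfasta_seq)


def hamming_distance(fasta_seq, getfasta_seq):
    """Number of mismatching positions between two equal-length sequences."""
    a, b = normalize(fasta_seq), normalize(getfasta_seq)
    if len(a) != len(b):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


# Toy sequences (invented for illustration).
diff = length_difference("ACGTA", "ACGT")
dist = hamming_distance("ACGU", "acga")
print(diff, dist)
```

The Hamming distance is only defined for sequences of equal length, so in this sketch pairs with a non-zero length difference are handled by the first function alone.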
Statistics on correction events in human RNA transcript entries: cases when both locus and sequence were present (Loci & Seq, no correction), only locus or only sequence (Loci only, Seq only, sequence or locus retrieved from genome); cases that required extending entry’s locus, sequence or both (Ext Loci & Seq, Loci & Ext Seq, Ext Loci & Ext Seq); cases with transcript loci intersections within same database (Inbase conflicts) and intersections between different databases (Interbase conflicts). * No intersection events were considered for tsRNAs, as fragments of the same tRNA naturally have intersecting loci.
| RNA Type | Database | Loci & Seq | Loci Only | Seq Only | Ext Loci & Seq | Loci & Ext Seq | Ext Loci & Ext Seq | Inbase Conflict | Interbase Conflict |
|---|---|---|---|---|---|---|---|---|---|
| mature microRNA | miRBase | 1291 | 1227 | 1 | 0 | 0 | 0 | 140 | 21 |
| piRNA | piRNAdb | 422,017 | 0 | 1 | 0 | 0 | 0 | 388,826 | 1268 |
| tRNA | GtRNAdb | 225 | 176 | 2 | 0 | 0 | 0 | 0 | 214 |
| rRNA | UCSC | 1591 | 0 | 186 | 0 | 0 | 0 | 37 | 126 |
| tRNA-derived | MINTbase | 0 | 0 | 8120 | 115,040 | 33 | 8 | * | * |
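
The "Inbase conflict" and "Interbase conflict" columns count transcripts whose loci intersect. One way such intersections could be detected is a sort-and-sweep over the loci, sketched below in Python. The sketch assumes 0-based, half-open coordinates (BED style) and treats loci as conflicting only when they share chromosome and strand; both choices, together with the function name and the toy loci, are assumptions rather than the procedure used to build ITAS.

```python
from collections import defaultdict


def find_overlaps(loci):
    """Return the IDs of all loci that intersect at least one other locus.

    `loci` is an iterable of (id, chrom, start, end, strand) tuples with
    0-based, half-open coordinates (assumed convention).
    """
    groups = defaultdict(list)
    for tid, chrom, start, end, strand in loci:
        groups[(chrom, strand)].append((start, end, tid))

    conflicts = set()
    for intervals in groups.values():
        intervals.sort()  # sweep in order of start coordinate
        max_end, max_id = None, None  # furthest-reaching interval seen so far
        for start, end, tid in intervals:
            if max_end is not None and start < max_end:
                conflicts.add(tid)
                conflicts.add(max_id)
            if max_end is None or end > max_end:
                max_end, max_id = end, tid
    return conflicts


# Toy loci (invented for illustration).
loci = [
    ("a", "chr1", 100, 120, "+"),
    ("b", "chr1", 115, 130, "+"),
    ("c", "chr1", 200, 220, "+"),
    ("d", "chr1", 110, 125, "-"),  # overlaps a and b in position, but on the other strand
]
conflicts = find_overlaps(loci)
print(sorted(conflicts))
```

Passing the loci of a single database corresponds to the in-database case; pooling loci from several databases and checking which flagged IDs come from different sources would correspond to the between-database case.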
Statistics on correction events in tsRNA entries for various species: cases when both locus and sequence were present (Loci & Seq, no correction), only locus or only sequence (Loci only, Seq only), and cases that required extending entry’s locus, sequence or both (Ext Loci & Seq, Loci & Ext Seq, Ext Loci & Ext Seq).
| Species | Database | Loci & Seq | Loci Only | Seq Only | Ext Loci & Seq | Loci & Ext Seq | Ext Loci & Ext Seq |
|---|---|---|---|---|---|---|---|
| Homo sapiens | MINTbase | 0 | 0 | 8120 | 115,040 | 33 | 8 |
|  | tRFdb | 335 | 0 | 0 | 0 | 0 | 8 |
|  | tRFdb | 147 | 0 | 0 | 0 | 0 | 0 |
|  | tRFdb | 247 | 0 | 0 | 0 | 0 | 0 |
Statistics for unique transcript IDs for Integrated Transcript Annotation for Small RNA (ITAS), correction and filtration of intersection inside databases and between databases.
| Species | Genome Version | Mature microRNA | piRNA | tRNA | rRNA | tsRNA |
|---|---|---|---|---|---|---|
| Homo sapiens | hg38 | 2330 | 14,439 | 403 | 1776 | 18,948 |
| Mus musculus | mm39 | 1870 | 9715 | 1044 | 1376 | 13,105 |
| Rattus norvegicus | rn6 | 616 | 7976 | 966 | 239 | 9797 |
| Caenorhabditis elegans | ce11 | 397 | 8376 | 633 | 5 | 9411 |
| Drosophila melanogaster | dm6 | 435 | 8296 | 154 | 93 | 8978 |
Statistics for unique transcript IDs for ITAS, after filtration and correction. Intersections between sRNA types were not filtered.
| Species | Genome Version | Mature microRNA | piRNA | tRNA | rRNA | tsRNA |
|---|---|---|---|---|---|---|
| Homo sapiens | hg38 | 2543 | 14,605 | 619 | 1840 | 26,731 |
| Mus musculus | mm39 | 1870 | 9739 | 1135 | 1430 | 65 |
| Rattus norvegicus | rn6 | 647 | 8079 | 1173 | 274 | - |
| Caenorhabditis elegans | ce11 | 424 | 8654 | 721 | 5 | 18 |
| Drosophila melanogaster | dm6 | 481 | 500,536 | 295 | 165 | 22 |
Numbers of identified differentially expressed sRNA transcripts (sRNA) (p-value ) and tRNA-derived small RNA (tsRNA) (adjusted p-value ) for case studies processed with SPORTS pipeline with default annotation vs. genome alignment (bowtie + Rsubread/kallisto) pipeline based on ITAS.
| Data | SPORTS, sRNA | ITAS, sRNA | SPORTS, tsRNA | ITAS, tsRNA |
|---|---|---|---|---|
| Donkin et al. | 11 | 66 | 5 | 3 |
| Ingerslev et al. | 43 | 212 | 12 | 24 |
| Hua et al. | 26 | 242 | 0 | 12 |
Figure 4. Top 10 differentially expressed small RNA (sRNA) using Donkin et al.’s data, processed with SPORTS vs. Integrated Transcript Annotation for Small RNA (ITAS)-based genome alignment (bowtie + Rsubread) pipelines.
Figure 5. Top differentially expressed tRNA-derived small RNA (tsRNA) using Donkin et al.’s data, processed with SPORTS vs. ITAS-based genome alignment (bowtie + Rsubread + kallisto) pipelines.
Top 10 differentially expressed small RNA (sRNA) transcripts and tRNA-derived small RNA (tsRNA), with adjusted p-values, for the Donkin et al. case study processed with the SPORTS pipeline with default annotation vs. the genome alignment (bowtie + Rsubread/kallisto) pipeline based on ITAS.
| Transcript Name | Adjusted p-Value |
|---|---|
| SPORTS, sRNA | |
| tRNA-Ser-CGA | 3.23 |
| hsa-mir-155 | 0.001 |
| other-rRNA | 0.002 |
| mt-tRNA-Glu-TTC | 0.009 |
| mt-tRNA-Phe-GAA | 0.019 |
| mt-tRNA-Trp-TCA | 0.019 |
| mt-tRNA-Ala-TGC | 0.022 |
| mt-tRNA-Ser-TGA | 0.022 |
| tRNA-His-GTG | 0.028 |
| 16S-rRNA | 0.035 |
| SPORTS, tsRNA | |
| tRNA-Ile-AAT-5-end | 0.099 |
| mt-tRNA-Glu-TTC-CCA-end | 0.099 |
| mt-tRNA-Ala-TGC-5-end | 0.099 |
| tRNA-Ile-GAT-5-end | 0.099 |
| mt-tRNA-Phe-GAA-CCA-end | 0.099 |
| ITAS, sRNA | |
| hsa-piR-33029 | 1.84 |
| hsa-miR-155-5p | 0.002 |
| 5S-dup10-seq-371 | 0.003 |
| hsa-miR-497-5p | 0.006 |
| hsa-piR-8652 | 0.006 |
| hsa-miR-195-3p | 0.007 |
| hsa-piR-33047 | 0.008 |
| hsa-miR-6516-5p | 0.008 |
| hsa-miR-3663-5p | 0.012 |
| hsa-miR-518c-3p | 0.012 |
| ITAS, tsRNA | |
| tRF-38-P4R8YP9LON4VN18-799 | 1.09 |
| tRF-27-79MP9P9NH5N-6856 | 1.29 |
| tRF-41-PSQP4PW3FJIKE7UMD-431 | 0.01 |
Figure 6. An overview of all processing steps for the considered small RNA databases for five species.
Figure 7. An overview of all processing steps for the integration of tsRNA data for four species.