Junchao Shi1, Eun-A Ko2, Kenton M Sanders2, Qi Chen3, Tong Zhou4. 1. Department of Physiology and Cell Biology, University of Nevada, Reno School of Medicine, Reno, NV 89557, USA. Electronic address: junchaoshi@nevada.unr.edu. 2. Department of Physiology and Cell Biology, University of Nevada, Reno School of Medicine, Reno, NV 89557, USA. 3. Department of Physiology and Cell Biology, University of Nevada, Reno School of Medicine, Reno, NV 89557, USA. Electronic address: cqi@med.unr.edu. 4. Department of Physiology and Cell Biology, University of Nevada, Reno School of Medicine, Reno, NV 89557, USA. Electronic address: tongz@med.unr.edu.
Abstract
High-throughput RNA-seq has revolutionized the process of small RNA (sRNA) discovery, leading to a rapid expansion of sRNA categories. In addition to the previously well-characterized sRNAs such as microRNAs (miRNAs), piwi-interacting RNAs (piRNAs), and small nucleolar RNA (snoRNAs), recent emerging studies have spotlighted on tRNA-derived sRNAs (tsRNAs) and rRNA-derived sRNAs (rsRNAs) as new categories of sRNAs that bear versatile functions. Since existing software and pipelines for sRNA annotation are mostly focused on analyzing miRNAs or piRNAs, here we developed the sRNA annotation pipelineoptimized for rRNA- and tRNA-derived sRNAs (SPORTS1.0). SPORTS1.0 is optimized for analyzing tsRNAs and rsRNAs from sRNA-seq data, in addition to its capacity to annotate canonical sRNAs such as miRNAs and piRNAs. Moreover, SPORTS1.0 can predict potential RNA modification sites based on nucleotide mismatches within sRNAs. SPORTS1.0 is precompiled to annotate sRNAs for a wide range of 68 species across bacteria, yeast, plant, and animal kingdoms, while additional species for analyses could be readily expanded upon end users' input. For demonstration, by analyzing sRNA datasets using SPORTS1.0, we reveal that distinct signatures are present in tsRNAs and rsRNAs from different mouse cell types. We also find that compared to other sRNA species, tsRNAs bear the highest mismatch rate, which is consistent with their highly modified nature. SPORTS1.0 is an open-source software and can be publically accessed at https://github.com/junchaoshi/sports1.0.
High-throughput RNA-seq has revolutionized the process of small RNA (sRNA) discovery, leading to a rapid expansion of sRNA categories. In addition to the previously well-characterized sRNAs such as microRNAs (miRNAs), piwi-interacting RNAs (piRNAs), and small nucleolar RNA (snoRNAs), recent emerging studies have spotlighted on tRNA-derived sRNAs (tsRNAs) and rRNA-derived sRNAs (rsRNAs) as new categories of sRNAs that bear versatile functions. Since existing software and pipelines for sRNA annotation are mostly focused on analyzing miRNAs or piRNAs, here we developed the sRNA annotation pipelineoptimized for rRNA- and tRNA-derived sRNAs (SPORTS1.0). SPORTS1.0 is optimized for analyzing tsRNAs and rsRNAs from sRNA-seq data, in addition to its capacity to annotate canonical sRNAs such as miRNAs and piRNAs. Moreover, SPORTS1.0 can predict potential RNA modification sites based on nucleotide mismatches within sRNAs. SPORTS1.0 is precompiled to annotate sRNAs for a wide range of 68 species across bacteria, yeast, plant, and animal kingdoms, while additional species for analyses could be readily expanded upon end users' input. For demonstration, by analyzing sRNA datasets using SPORTS1.0, we reveal that distinct signatures are present in tsRNAs and rsRNAs from different mouse cell types. We also find that compared to other sRNA species, tsRNAs bear the highest mismatch rate, which is consistent with their highly modified nature. SPORTS1.0 is an open-source software and can be publically accessed at https://github.com/junchaoshi/sports1.0.
Expanding classes of small RNAs (sRNAs) have emerged as key regulators of gene expression, genome stability, and epigenetic regulation [1], [2]. In addition to the previously well-characterized sRNA classes such as microRNAs (miRNAs), Piwi-interacting RNA (piRNAs), and small nucleolar RNA (snoRNAs), recent analysis of sRNA-seq data has led to the identification of expanding novel sRNA families. These include tRNA-derived sRNAs (tsRNAs; also known as tRNA-derived fragments, tRFs) and rRNA-derived sRNAs (rsRNAs) [3]. tsRNAs and rsRNAs have been discovered in a wide range of species with evolutionary conservation, supposedly due, in part, to the highly conservative sequence of their respective precursors, i.e., tRNAs and rRNAs [3]. Interestingly, tsRNAs and rsRNAs have been abundantly found in unicellular organisms (e.g., protozoa), where canonical sRNA pathways such as miRNA, siRNA, and piRNAs are entirely lacking [4], [5], [6]. The dynamic regulation of tsRNAs and rsRNAs in these unicellular organisms suggests that they are among the most ancient classes of sRNAs for intra- and inter-cellular communications [7].Moreover, recent emerging evidence from mammalian species have highlighted the diverse biological functions mediated by tsRNAs, including regulating ribosome biogenesis, translation initiation, retrotransposon control, cancer metastasis, stem cell differentiation, neurological diseases, and epigenetic inheritance [3], [8], [9], [10], [11], [12], [13], [14], [15]. Although tsRNAs are known to be involved in regulating these processes at both post-transcriptional and translational levels [11], [14], [16], the exact molecular mechanisms of how tsRNAs exert their functions have not been fully understood. Compared to tsRNAs, rsRNAs are more recently discovered and also exhibit tissue-specific distribution. Dynamic expression of rsRNAs is associated with diseases such as metabolic disorders and inflammation [17], [18], [19]. The diverse biological functions of tsRNAs and rsRNAs and their strong disease associations are now pushing the new frontier of sRNA research.Currently, there are multiple existing general sRNA annotation software and pipelines [20], [21], [22], [23], [24], and some have been developed aiming to analyze tsRNAs [25], [26], [27]. However, there is still a lack of specialized tools that can simultaneously analyze both tsRNAs and rsRNAs in addition to other canonical sRNAs. Here, we provide the sRNA annotation pipeline optimized for rRNA- and tRNA-derived sRNAs (SPORTS1.0), which can annotate and profile canonical sRNAs such as miRNAs and piRNAs, and is also optimized to analyze tsRNAs and rsRNAs from sRNA-seq data. In addition, SPORTS1.0 can help predict potential RNA modification sites based on nucleotide mismatches within sRNAs.
Method
The source code of SPORTS1.0 is written in Perl and R. The whole package and installation instructions are available on Github (https://github.com/junchaoshi/sports1.0). SPORTS1.0 can apply to a wide-range of species and the annotation references of 68 species are precompiled for downloading (Table S1).The workflow of SPORTS1.0 consists of four main steps, i.e., pre-processing, mapping, annotation output, and annotation summary (Figure 1). SRA, FASTQ, and FASTA are the acceptable formats for data input. By calling Cutadapt [28] and Perl scripts extracted from miRDeep2 [29], SPORTS1.0 outputs clean reads by removing sequence adapters, and discarding sequences with length beyond the defined range and those with bases other than ATUCG. The clean reads obtained in pre-processing step are sequentially mapped against reference genome, miRBase [30], rRNA database (collected from NCBI), GtRNAdb [31], piRNA database [32], [33], Ensembl [34] and Rfam [35], upon users’ setting. sRNA sequences are first annotated by Bowtie [36]. Next, a Perl script precompiled in SPORTS1.0 is used to identify the locations of tsRNAs regarding whether they are derived from 5′ terminus, 3′ terminus, or 3′CCA end of tRNAs. Then an R script precompiled in SPORTS1.0 is applied to obtain rsRNA expression level and positional mapping information regarding their respective rRNA precursors (5.8S, 18S, 28S, etc.).
Figure 1
Workflow of SPORTS1.0
SPORTS1.0 contains four main steps, i.e., pre-processing, mapping, annotation output, and annotation summary, as outlined in the figure.
Workflow of SPORTS1.0SPORTS1.0 contains four main steps, i.e., pre-processing, mapping, annotation output, and annotation summary, as outlined in the figure.SPORTS1.0 can also be used to analyze sequence mismatch information if mismatches are allowed during alignment process. Such information can help predict potential modification sites that have caused nucleotide misincorporation during the reverse transcription (RT) process as previously reported [37]. In the current version, a mismatch site is designated using criteria as previously described [37]. Binomial distribution is used to address whether the observed mismatch enrichment is significantly higher than the base-calling error. Here, we define p as the base-calling error rate, n as the number of nucleotides perfectly fitted to the reference sites, n as the number of mismatched nucleotides, and n as the sum of n and n. The probability of observing not larger than k perfectly matched nucleotides out of n can be calculated as:SPORTS1.0 provides two methods to evaluate n number. The first option is to simply calculate n as the read number of sequences containing particular mismatches. Since some sequences may align to multiple reference loci, using this method may result in an increased false-positive rate. A second method is thus included, in which read number of sequences from multiple matching loci are uniformly distributed (based on the assumption that each of these multiple sites equally expresses RNAs) and consequently generates an adjusted n.SPORTS1.0 summary output includes annotation details for each sequence and length distribution along with other statistics. (See sample output Figures 2 and 3, Tables S2 and S3). User guideline is provided online (https://github.com/junchaoshi/sports1.0).
Figure 2
Exemplary annotation and profiling of sRNA-seq datasets generated by SPORTS1.0
Categorization and length distribution analysis of different sRNA types in mouse sperm (A), bone marrow cells (B), and intestinal epithelial cells (C). RPM, reads per million clean reads; Unanno, unannotated; MG, match genome; UMG, unmatch genome.
Figure 3
Cell-specific rsRNA profiles revealed by SPORTS1.0
Subtypes of rRNA precursors (5.8S, 18S, 28S, etc.) for rsRNAs from mouse sperm (A), bone marrow cells (B), and intestinal epithelial cells (C). Comparison of rsRNA-generating loci from different rRNA precursors reveals distinct pattern between sperm (D), bone marrow cells (E), and intestinal epithelial cells (F). RPM, reads per million clean reads.
Exemplary annotation and profiling of sRNA-seq datasets generated by SPORTS1.0Categorization and length distribution analysis of different sRNA types in mouse sperm (A), bone marrow cells (B), and intestinal epithelial cells (C). RPM, reads per million clean reads; Unanno, unannotated; MG, match genome; UMG, unmatch genome.Cell-specific rsRNA profiles revealed by SPORTS1.0Subtypes of rRNA precursors (5.8S, 18S, 28S, etc.) for rsRNAs from mouse sperm (A), bone marrow cells (B), and intestinal epithelial cells (C). Comparison of rsRNA-generating loci from different rRNA precursors reveals distinct pattern between sperm (D), bone marrow cells (E), and intestinal epithelial cells (F). RPM, reads per million clean reads.
Results
As an example, we used SPORTS1.0 to analyze sRNA-seq datasets from mouse sperm (GSM2304822 [38]), bone marrow cells (GSM1604100 [39]), and intestinal epithelial cells (GSM1975854 [40]) samples. Graphic output by SPORTS1.0 reveals distinct sRNA profiles in sperm (Figure 2A), bone marrow cells (Figure 2B), and intestinal epithelial cells (Figure 2C) samples. tsRNAs and rsRNAs are found equally or more abundantly than the well-known miRNAs or piRNAs (length distribution data for each sRNA type are exemplified in Table S2). In particular, tsRNAs are dominant in sperm, and rsRNAs are highest in bone marrow cells, whereas intestinal epithelial cells contain an appreciable amount of both tsRNAs and rsRNAs in addition to a miRNA peak.Importantly, SPORTS1.0 found an appreciable portion of rsRNAs annotated in sperm (48.7%), bone marrow cell (11.1%) and intestinal epithelial cell (61.1%) samples that are previously deemed as “unmatch genome” (UMG) (Figure 2A−C upper pie chart). This is because these newly annotated rsRNAs are derived from rRNA genes (rDNA), which were not assembled and shown in current mouse genome (mm10) [41], and thus were discarded before analysis by previous sRNA analyzing pipelines. SPORTS1.0 can now annotate and analyze these rsRNAs, including providing the subtypes of rRNA precursors (5.8S, 18S, 28S, etc.) from which they are derived from (Figure 3A−C), as well as the loci mapping information (Figure 3D−F). Interestingly, our analyses revealed that the specific loci that generate rsRNAs are completely distinct among sperm, bone marrow cell, and intestinal epithelial cell samples (Figure 3D−F), suggesting distinct biogenesis and functions of these rsRNAs. Similarly, SPORTS1.0 also revealed tissue-specific landscape of tsRNAs in terms of their relative abundance (Figure 2A−C lower pie chart) and the tRNA loci where they are derived from (5’ terminus, 3’ terminus, 3’CCA end, etc.) (Figure 4 and Figures S1–S3). Since tsRNAs from different loci bear distinct biological functions [3], the tissue-specific tsRNA composition may represent features that define the unique functions of respective tissue/cell types.
Figure 4
Cell-specific tsRNA profiles revealed by SPORTS1.0
Examples of 3 cell-specific tsRNA profiles revealed in mouse sperm (A), bone marrow cells (B), and intestinal epithelial cells (C). Full tsRNA mapping results against tRNA loci are included in Figures S1–S3 for sperm (Figure S1), bone marrow cells (Figure S2), and intestinal epithelial cells (Figure S3), respectively. RPM, reads per million clean reads.
Cell-specific tsRNA profiles revealed by SPORTS1.0Examples of 3 cell-specific tsRNA profiles revealed in mouse sperm (A), bone marrow cells (B), and intestinal epithelial cells (C). Full tsRNA mapping results against tRNA loci are included in Figures S1–S3 for sperm (Figure S1), bone marrow cells (Figure S2), and intestinal epithelial cells (Figure S3), respectively. RPM, reads per million clean reads.In addition, SPORTS1.0 also revealed distinct mismatch rates among different types of sRNAs (Figure 5 and Table S3), with tsRNAs showing the highest. The detected mismatch sites represent the modified nucleotides that might have caused misincorporation of nucleotides during the RT process. The relatively higher mismatch rate detected in tsRNA sequences is consistent with their highly modified nature. The mismatch sites detected by SPORTS1.0 could provide a potential source for further analyses of RNA modifications within sRNAs.
Figure 5
sRNA mismatch statistics by SPORTS1.0
The percentage of unique sequences that contain significantly-enriched mismatches out of total number of unique sequences from each subtype of sRNAs (miRNAs, piRNAs, tsRNAs, and rsRNAs) is provided for sperm (A), bone marrow cells (B), and intestinal epithelial cells (C). EMS, enrichment mismatch sequences; TUS, total unique sequences.
sRNA mismatch statistics by SPORTS1.0The percentage of unique sequences that contain significantly-enriched mismatches out of total number of unique sequences from each subtype of sRNAs (miRNAs, piRNAs, tsRNAs, and rsRNAs) is provided for sperm (A), bone marrow cells (B), and intestinal epithelial cells (C). EMS, enrichment mismatch sequences; TUS, total unique sequences.Finally, SPORTS1.0 can analyze sRNAs of a wide range of species, depending on the availability of their reference genome and sRNA sequences (Figure 6 and Table S1). The species to be analyzed and their associated sRNA references are subject to update in future versions, or can be customized by the end users.
Figure 6
Species recompiled for analysis by SPORTS1.0
The 68 species and their respective reference database included in SPORTS1.0 precompiled for analysis.
Species recompiled for analysis by SPORTS1.0The 68 species and their respective reference database included in SPORTS1.0 precompiled for analysis.
Conclusion
SPORTS1.0 is an easy-to-use and flexible pipeline for analyzing sRNA-seq data across a wide-range of species. Using mice as example, SPORTS1.0 provides a far more complicated sRNA landscape than having been previously seen, highlighting a tissue-specific dynamic regulation of tsRNAs and rsRNAs. SPORTS1.0 can also predict potential RNA modification sites based on nucleotide mismatches within sRNAs, and show a distinct pattern between different sRNA types. SPORTS1.0 may set the platform for potential new discoveries in biomedical and evolutionary studies that are related to sRNAs.The real voyage of discovery consists not in seeking new landscapes, but in looking with new eyes.
Authors’ contributions
JS, TZ, and QC conceived the idea and wrote the manuscript. JS and TZ developed the SPORTS1.0 software and analyzed the RNA-seq data. JS, EK, KMS, QC, and TZ contributed to the interpretation of the results. All authors read and approved the final manuscript.
Authors: Bailey C E Peck; Amanda T Mah; Wendy A Pitman; Shengli Ding; P Kay Lund; Praveen Sethupathy Journal: J Biol Chem Date: 2017-01-04 Impact factor: 5.157
Authors: Antonio Rueda; Guillermo Barturen; Ricardo Lebrón; Cristina Gómez-Martín; Ángel Alganza; José L Oliver; Michael Hackenberg Journal: Nucleic Acids Res Date: 2015-05-27 Impact factor: 16.971
Authors: Eric P Nawrocki; Sarah W Burge; Alex Bateman; Jennifer Daub; Ruth Y Eberhardt; Sean R Eddy; Evan W Floden; Paul P Gardner; Thomas A Jones; John Tate; Robert D Finn Journal: Nucleic Acids Res Date: 2014-11-11 Impact factor: 19.160
Authors: Andrew Yates; Wasiu Akanni; M Ridwan Amode; Daniel Barrell; Konstantinos Billis; Denise Carvalho-Silva; Carla Cummins; Peter Clapham; Stephen Fitzgerald; Laurent Gil; Carlos García Girón; Leo Gordon; Thibaut Hourlier; Sarah E Hunt; Sophie H Janacek; Nathan Johnson; Thomas Juettemann; Stephen Keenan; Ilias Lavidas; Fergal J Martin; Thomas Maurel; William McLaren; Daniel N Murphy; Rishi Nag; Michael Nuhn; Anne Parker; Mateus Patricio; Miguel Pignatelli; Matthew Rahtz; Harpreet Singh Riat; Daniel Sheppard; Kieron Taylor; Anja Thormann; Alessandro Vullo; Steven P Wilder; Amonida Zadissa; Ewan Birney; Jennifer Harrow; Matthieu Muffato; Emily Perry; Magali Ruffier; Giulietta Spudich; Stephen J Trevanion; Fiona Cunningham; Bronwen L Aken; Daniel R Zerbino; Paul Flicek Journal: Nucleic Acids Res Date: 2015-12-19 Impact factor: 16.971
Authors: Alessandro La Ferlita; Salvatore Alaimo; Sebastiano Di Bella; Emanuele Martorana; Georgios I Laliotis; Francesco Bertoni; Luciano Cascione; Philip N Tsichlis; Alfredo Ferro; Roberta Bosotti; Alfredo Pulvirenti Journal: BMC Bioinformatics Date: 2021-06-03 Impact factor: 3.169
Authors: Ruizhu Huang; Charlotte Soneson; Pierre-Luc Germain; Thomas S B Schmidt; Christian Von Mering; Mark D Robinson Journal: Genome Biol Date: 2021-05-17 Impact factor: 13.583