Literature DB >> 34316353

WIND (Workflow for pIRNAs aNd beyonD): a strategy for in-depth analysis of small RNA-seq data.

Konstantinos Geles^1,2, Domenico Palumbo^1,3, Assunta Sellitto¹, Giorgio Giurato^1,2,4, Eleonora Cianflone⁵, Fabiola Marino⁶, Daniele Torella⁶, Valeria Mirici Cappa^1,2, Giovanni Nassa^1,2,4, Roberta Tarallo^1,4, Alessandro Weisz^1,4, Francesca Rizzo^1,2,4.

Abstract

Current bioinformatics workflows for PIWI-interacting RNA (piRNA) analysis focus primarily on germline-derived piRNAs and piRNA-clusters. Frequently, they suffer from outdated piRNA databases, questionable quantification methods, and lack of reproducibility. Often, pipelines specific to miRNA analysis are used for the piRNA research in silico. Furthermore, the absence of a well-established database for piRNA annotation, as for miRNA, leads to uniformity issues between studies and generates confusion for data analysts and biologists. For these reasons, we have developed WIND ( Workflow for p IRNAs a Nd beyon D), a bioinformatics workflow that addresses the crucial issue of piRNA annotation, thereby allowing a reliable analysis of small RNA sequencing data for the identification of piRNAs and other small non-coding RNAs (sncRNAs) that in the past have been incorrectly classified as piRNAs. WIND allows the creation of a comprehensive annotation track of sncRNAs combining information available in RNAcentral, with piRNA sequences from piRNABank, the first database dedicated to piRNA annotation. WIND was built with Docker containers for reproducibility and integrates widely used bioinformatics tools for sequence alignment and quantification. In addition, it includes Bioconductor packages for exploratory data and differential expression analysis. Moreover, WIND implements a "dual" approach for the evaluation of sncRNAs expression level quantifying the aligned reads to the annotated genome and carrying out an alignment-free transcript quantification using reads mapped to the transcriptome. Therefore, a broader range of piRNAs can be annotated, improving their quantification and easing the subsequent downstream analysis. WIND performance has been tested with several small RNA-seq datasets, demonstrating how our approach can be a useful and comprehensive resource to analyse piRNAs and other classes of sncRNAs. Copyright:

Entities: CellLine Chemical Disease Gene Species

Keywords: ncRNA-expression; piRNA; small RNA sequencing; workflow

Mesh：

Substances：
RNA, Small Interfering

Year: 2021 PMID： 34316353 PMCID： PMC8276195 DOI： 10.12688/f1000research.27868.3

Source DB: PubMed Journal: F1000Res ISSN： 2046-1402

Introduction

Advances in the field of Next-Generation Sequencing and big data analysis have led to the identification of several small non-coding RNA (sncRNA) classes, some of which are still poorly characterised . Among others, the most investigated include microRNAs (miRNAs), small interfering RNAs (siRNAs), PIWI-interacting RNAs (piRNAs), small nuclear (snRNAs) and small nucleolar RNAs (snoRNAs). Increasing evidence demonstrates that the different sncRNAs constitute interconnected networks of molecules with key-regulatory functions in multiple biological processes, including physiological events, organism development or even disease . piRNAs represent an heterogeneous group, ubiquitous in most animal's germline cells, with lack of conserved sequences and few common structural features in the various species, due to the highly adaptive nature of the piRNA pathway . Germline piRNAs typically have a 21–35 nt length, a strong bias for 5'-end uridine signature and a 2'-O-methyl group at their 3'-end . Most of them are transcribed by either mono-directional or bidirectional genomic clusters, specific regions ranging from <1 kb to >100 kb, giving rise to a long, single-stranded precursor and further processed in multiple mature piRNAs through enzymatic cleavage. A subset of piRNAs present an adenosine bias at position 10, a feature indicating their biogenesis through the ping-pong cycle, a mechanism by which the cleavage of the target RNA is coupled with the production of a second population of target-specific piRNAs. They interact with PIWI proteins of the Argonaute (AGO) family, forming a silencing complex able to suppress transposable elements, regulate target’s gene expression at both epigenetic and post-transcriptional level and defend from viral infections . These piRNA functions are well studied in the animal germline, however in somatic cells, their role needs to be further elucidated. Additional studies have revealed that piRNA dysregulation can contribute to the onset of several diseases . Notably in cancer, the abnormal expression of piRNAs has been associated with tumour initiation, progression, and metastasis formation and these molecules have shown the potential to be useful diagnostic tools and therapeutic targets as well as biomarkers for cancer prognosis . A limitation in understanding their function and use in clinical practice is the lack of a comprehensive and reliable method for their identification in tissues others than germline. A common strategy for piRNAs identification is based on mapping the reads obtained from high-throughput small RNA libraries to the genome and then annotate to small RNA databases for quantification. Most of the piRNA sequences identified so far have been deposited in databases such as piRNABank , piRNAdb ( https://www.pirnadb.org/), piRNAclusterDB and others. However, data collected in these repositories mainly include germline piRNAs, while somatic piRNAs represent a minor fraction. In addition, piRNAs in somatic tissues and human cancers are less abundant than in germline, thereby leading to a more difficult identification and characterisation. Although piRNAs were initially confounded with fragments of longer RNAs, functional piRNAs have been identified to derive from fragments of rRNAs, tRNAs, snoRNAs, and post-transcriptionally processed mRNA . Another level of complexity is represented by their genomic origin(s) and their actual amount, since identical sequences of piRNA can be produced by multiple genomic loci, resulting in very low precision and sensitivity. For all the reasons stated above, and since existing workflows and tools for piRNA-analysis, usually, focus on the identification and quantification of piRNA clusters (PILFER , unitas ) or use outdated piRNA databases, we decided to implement a useful workflow for small RNA sequencing data analysis, able to analyse all classes of sncRNAs but especially designed for piRNA identification. We created a workflow that provides a quick method to integrate different piRNA databases in one annotation track, a two-method approach for small RNA identification, annotation and quantification, and an output with several ready-to-publish plots and statistics. Additionally, we packaged the entire workflow in several Docker containers avoiding the annoying problems related to the installation and libraries dependencies. Finally, we applied it to different small RNA datasets, highlighting that piRNAs are dysregulated in breast cancer tissues and may play an important role in maintaining the stemness of MCF7 spheroid-enriched cancer stem cells (CSCs).

Methods

In this study, we implemented a workflow for small RNA sequencing data analysis, defined WIND ( Workflow for p IRNAs a Nd beyon D) , designed for a comprehensive identification and quantification of small-RNAs and especially of piRNAs. We deployed it exploiting the Docker containerisation approach, allowing us to integrate multiple bioinformatics tools. In detail, we created two Docker images where we adopted broadly used tools for pre-processing, reads alignment, identification and quantification of sncRNAs, and all downstream analyses. We also integrated the already available container made for Salmon for transcriptome analysis. This solution takes into account best practices for reproducibility, versatility and ease of use, as the software deployment is fast and efficient. It can be used in various operating systems with only the requirements of the Docker engine and some minimum adjustments for processing power and RAM for the most memory demanding tools.

Workflow

The workflow consists of three significant steps ( Figure 1):

Figure 1.

Workflow schematic representation.

Workflow schematic representation.

The Annotation forging step, represented in blue, is the creation of a GTF file, where the two input databases (piRNABank and RNAcentral) are merged to produce the new small RNA annotation track, that together with the Fasta files constitute the inputs of the following step. In Pre-processing & Quantification step (light blue area), the user's fastq files undergo through the quality check, and the adapter removal followed by the two quantification approaches (completed by Salmon, and STAR with FeatureCounts software) that perform in parallel alignment and the quantification of reads. In the green box, representing the Exploratory data analysis phase, are displayed all the possible results produced by the workflow. The data analyst could also pursue differential expression analysis if that is the desirable outcome. Annotation forging: the generation of the annotation files for small RNA sequences used in the next quantification step. Pre-processing and quantification: pre-processing, alignment and quantification of the reads assigned to sncRNAs (using a dual approach: genomic and transcriptomic analysis). Exploratory data analysis: result exploration of both quantification methods in parallel and Differential Expression (DE) with two different methodologies (edgeR and limma-voom ). The first step of WIND, the Annotation forging (blue box in Figure 1) is the creation of the annotation track. In this step, we tried to reduce and potentially correct the issues regarding piRNA annotation, such as the presence of multi-mapped piRNA IDs, the inconsistencies among piRNA databases, and the misleading annotation of small RNA fragments. In particular, 667,944 human and 1,399,813 mouse sequences were acquired from piRNABank (02-May-2007, Version 1, hg19). Duplicates and multi-mapped sequences were collapsed, leading to 23,439 and 39,986 unique sequences for human and mouse, respectively. Afterwards, these unique sequences were realigned to the latest version of the reference genome (Gencode primary assembly, GRCh38.p13 for human, and GRCm38.p6 for mouse) using STAR aligner with the following parameters: --alignIntronMax 0, --outFilterMultimapNmax 100, --outFilterMatchNmin 16, and --outFilterMismatchNmax 0. Further on, 446,265 human and 180,780 mouse small RNA sequences from RNAcentral (v16, 28/09/2020) were utilised to complete the database (for details see Extended data: Supplementary Table 1). Then, the sequences from both databases were filtered with respect to their length, keeping only those with less than 100 bases in length, since our primary interests are piRNAs and sncRNAs, and keeping only those that correspond to standard chromosomes. Moreover, a re-classification of the piRNABank sequences was made. According to Tosar et al. , a small percentage of annotated piRNAs are probably piRNA-sized fragments of sncRNAs (rRNAs, tRNAs, YRNAs, snRNAs, and snoRNAs) or intermediates of miRNA biogenesis and potentially act as contaminants in the quantification step of the workflow. For this reason, piRNABank sequences matching sequences from RNAcentral with different sncRNA types (biotype) other than piRNA are re-categorised with the biotype from RNAcentral. Subsequently, as it is well established that mature piRNAs have a length of around 21~35 bases , before the final assembly of the sequences from both databases, the piRNABank sequences shorter than 69 bases (<69bp) are integrated with RNACentral sequences. Furthermore, we excluded the piRNA genomic ranges falling in regions annotated as protein coding, exons or CDS from the GENCODE annotation file. However, the user can choose to skip this filtering and obtain a “less stringent” GTF file. Moreover, inspired by Tosar et al. 2021 , we included, as metadata, those piRNAs sequences that are inside other sncRNAs and lncRNAs from GENCODE annotation file. The obtained sequences are finally exported to Fasta and GTF (Gene Transfer Format) file format. Eventually, we decided to provide the information available on piRNACLusterDB as metadata in the final GTF file. These tracks, for human and mouse species, have been included in the GitHub repository of the workflow and are available for users ( https://github.com/ConYel/wind); therefore, the annotation forging step can be skipped. The workflow can also be used for any other species, but in this case, it would be necessary for the user to run the Annotation forging step with the specific genome and small RNA sequences, respectively. The mapping of these piRNA sequences to the genome has revealed that the piRNAs can derive from two types of genomic locations: discrete genomic loci (the piRNA clusters) and protein-coding genes (e.g. UTRs, introns) . Using bumphunter package in the workflow, we were able to obtain piRNA origin information and provide it as the first additional file of the annotation. Since piRNAs are involved in the maintenance of genome stability through the silencing of transposable elements , in this step, we also report a GTF file with the intersection between the genomic positions of small RNAs, the various categories of Transposable Elements (TEs) and information about the TE class, family and gene. Briefly, the GTF file is created with the related Fasta file; then the Genomic_Region_info, Multimapping_piRNA_info and Transposable_Elements_info files are reported which carry information about the genomic topology for the sequences in the GTF file. Likewise, these files are available in the GitHub repository (GRCh38 and GRCm38), for future usage by the data analyst. The second step of the workflow, Pre-processing and quantification (light blue box in Figure 1), consists of a quality control check of the small RNA-seq data, carried out using the FastQC tool , followed by the adapter removal using Cutadapt and by another quality control check using FastQC again. After these initial steps, the workflow exploits two different approaches for the quantification of sncRNAs. In particular, one uses alignment to a reference genome with STAR and then quantification of aligned reads with FeatureCounts ; the other one uses Salmon (an aligner-free method) for the estimation of transcript-level abundance. We named the two approaches “genomic” and “transcriptomic” based on how the two methods work. Both approaches have positive and negative features. Undoubtedly, with STAR, the reads are aligned on a reference genome, Salmon instead is an alignment-free quantification method, able to prioritise the association of a feature with a specific site on a transcriptome. On the one hand, STAR could associate a read on multiple sites creating a complete list of identified regions, but this makes it more difficult to determine the genomic locations of origin, thus requiring more computational work. On the other hand, Salmon is a transcriptome quantifier able to correct for fragment GC-content and positional bias, which improves the accuracy of abundance estimates and potentially the sensitivity of subsequent DE analysis. To ensure the proper alignment of sncRNA reads to the genome, we used the following options for STAR aligner (as used in SPAR workflow ): --alignIntronMax 1, --outFilterMultimapNmax 100, --outFilterMismatchNmax 1 and --outFilterMatchNmin 14. For Salmon, to be suitable for small RNA reads, the following options were applied: --seqBias, --gcBias, --numBootstraps 100, and --validateMappings as was suggested from the work of Wu et al. . Resulting files, from the previous step, are imported to R using the Bioconductor packages: tximport (for Salmon) and Rsubread (for FeatureCounts), as DGEList objects (edgeR). After reads count, a FeatureCounts object is reported as an R object (.RDS) for an easy and fast way to import it in R. Moreover, we decided to record the assigned reads from both Salmon and FeatureCounts as BAM files. The last step of the workflow is the Exploratory data analysis (EDA), which includes the filtering of low expressed small RNAs, the normalisation procedures performed in parallel for both FeatureCounts and Salmon, and then the visualisation of the results according to the suggested EDA workflows from Bioconductor . Finally, the workflow provides several useful output files: text and RDS files with filtered or normalised reads, information about the filtering step, Multidimensional Scaling (MDS) plots, biodetection plots, expression per small RNA category plot (countsbio), distance-matrix plot, hierarchical clustering plots with various normalisation methods, Principal Components Analysis (PCA) plots, Relative Log Expression (RLE) plots , voom-derived plots, sequence length barplots, and piRNA sequence logos. Briefly, for each dataset analysed, 9 RDS files, 17 tab-delimited files with all the statistics from alignment, filtering and annotation plus the filtered and normalised reads in counts per million (CPM), and 24 PDF files with several exploratory data analysis plots for each of the two methods used are generated. Furthermore, we also provided a script for the creation of ping-pong and strand coverage plots exploiting ssviz and ggbio R packages from Bioconductor. Eventually, the gene expression data can be further compared using the DE analysis module, which allows calculating logarithmic fold change values using limma-voom or edgeR methods, and finally, both results (Salmon and FeatureCounts) can be merged and visualised together using heatmaps . Then, the data analyst can choose to use the union of the results, and either consider all the molecules identified by at least one of the two methods, or use the intersection of the results and consider only the molecules supported by two methods. The differentially expressed molecules can be further used for piRNA target prediction analysis (included in the code) which was inspired by the similar module of iSMART tool . This workflow is structured to provide maximum flexibility to the user, who can modify several elements. In each step, alterations can be made regarding the tools or the databases used according to the needs of the data analyst, while the workflow strategy remains the same. Specifically, in the first step, the GTF file can be enriched with more sequences of interest or a completely new GTF file could be created for any species. Currently, the first step has been performed on human and mouse sncRNA sequences for the generation of the GTF files (included in the GitHub repository), but the same approach could be utilised for any well-annotated genome that has enough small RNA sequences reported. In the second step, it is possible to use different tools for quality control, adapter trimming, aligning of the reads, e.g. Subread or HISAT2 or a different "alignment-free" RNA-seq quantification method, as Kallisto .

Operation

The workflow was run on CentOS Linux release 7.8.2003 (Core) with Docker Engine - Community v19.03.13 and in R v4.0.0, with Bioconductor v3.12.

Validation and datasets

The complete workflow has been tested on several datasets to evaluate whether this worked in the identification of known piRNAs, low abundant molecules and in different species. Specifically, we have evaluated the performance of the transcriptomic approach on sncRNA identification, and particularly on piRNAs, for which this method has been tested here for the first time. We created a small dataset where spike-in sequences of piRNA-like molecules were added to the input RNA. For this purpose, RNA of metastatic colon cancer cell line (COLO 205), where piRNA's population has been already characterised , was used. To mimic the behaviour of true piRNAs, a synthetic set of 4 piRNA-like molecules was used, including two non-methylated (SS-22 and SS-28) and two methylated (mSS-22 and mSS-28) of different lengths (22 nt and 28 nt). Spike-ins were chemically synthesised at Exiqon, adapting the sequences described in Locati et al. to our conditions and the pool of 4 molecules was used at three different concentrations, with a final amount of 0.3 × 10^9 (dil_A), 0.3 × 10^10 (dil_B) and 0.3 × 10^11 (dil_C) molecules/ug of RNA. Small RNA libraries were prepared using 1 μg of total RNA with a TruSeq small RNA Sample Prep Kit (Illumina, San Diego, Canada) and sequenced on the NextSeq 500 platform (Illumina, San Diego, CA, USA) as previously described in Sellitto et al. 2019 (samples are available on ArrayExpress, Accession number E-MTAB-9772: COLO205_Dil_A, COLO205_Dil_B, COLO205_Dil_C). Furthermore, we also exploited the samples processed with sodium periodate/β-elimination (samples are available on ArrayExpress, Accession number E-MTAB-8115: Treated_COLO205_1, Treated_COLO205_2, Treated_COLO205_3, Treated_testis_1) as an additional control for the quantification algorithms. Indeed, sodium periodate oxidation strongly reduces the non-methylated molecules allowing to see a drastic change in non-methylated spike-ins concentration. To test the performance of WIND, in both high-piRNA and low-piRNA expression conditions, we used Human Testis RNAs (BioChain Institute Inc, Newark, CA, USA) and COLO 205 cell line RNAs (samples are available on ArrayExpress. Accession number E-MTAB-8115: Non_treated_Testis_1 and Non_treated_COLO205_1, Non_treated_COLO205_2, Non_treated_COLO205_3, Treated_COLO205_1, Treated_COLO205_2, Treated_COLO205_3, Treated_testis_1; Accession number E-MTAB-9782: Non_treated_Testis_2 and Non_treated_Testis_3). To test the workflow on mouse data, we used two samples of mouse adult Cardiac Myocyte (samples are available on ArrayExpress, Accession number E-MTAB-9866: aCM1, aCM2) . Furthermore, we also exploited two public datasets to test our workflow thoroughly including the differential expression module dataset, consisting of two experimental conditions in triplicates, MCF-7 enriched CSCs spheroids and monolayer cultures (Accession number GSE68246 ); and a subset of 18 samples from TCGA-BRCA , using 9 Primary Solid Tumour versus 9 Solid Tissue Normal corresponding samples.

Results

The goal of this study was to create a robust workflow for the identification and quantification of piRNA sequences in small RNA sequencing data. It focuses on elucidating and solving one of the most challenging issues of this kind of analysis, the annotation controversies of piRNAs, thus providing relatively accurate detection of the piRNA expression patterns. As a first point, a unique GTF file was generated for human and mouse species, starting from sequences obtained from the two widely used databases (piRNABank and RNAcentral) for piRNAs and sncRNAs, respectively. The GTF file was created as described in the Methods obtaining 149,549 different genomic locations corresponding to 39,812 sequences in human and 925,759 distinct genomic locations corresponding to 95,205 sequences in mouse for all small RNA types (see Extended data: Supplementary Table 1). Furthermore, in humans, from the 39,812 sequences coding for small RNAs, 28,000 were classified as piRNA, and 19,203 of them were found in common between RNAcentral and piRNABank; instead, 8,444 were found only in RNAcentral and 353 only in piRNABank. Additionally, in the mouse genome, from 95,205 sequences of small RNAs, 65,632 were categorised as piRNA, 34,306 were in common between the two databases and on the contrary, 29,114 were unique to RNACentral, and 2,213 were exclusive to piRNABank. To test WIND thoroughly, we used several datasets with different characteristics: data produced in house, data available in a public repository, samples which include internal controls (spike-ins), datasets from different species (human and mouse), a dataset including different experimental conditions, and a dataset of tumour tissues (for more details see Methods and Data availability). First, we compared the quantification capability of the two methods implemented in the workflow. In particular, we evaluated the performance of the transcriptomic approach on piRNA quantification, as this method to our knowledge, has not yet been used to analyse this sncRNA class. For this reason, we decided to apply the workflow on an own-made dataset, in which spike-in sequences of four piRNA-like molecules, two non-methylated and two methylated at three different concentrations, were included (see Methods for details). Exploiting this feature, we were able to assess the high efficiency of both (genomic and transcriptomic) approaches in quantifying the spike-ins, as demonstrated by the very similar results obtained by the different methods. Supplementary Table 2 ( Extended data) summarises the results obtained for the 4 piRNA-like molecules calculated using three methods: iSMART, FeatureCounts, and Salmon. The results show that all approaches can identify and quantify all the types of piRNA-like sequences (methylated, non-methylated, treated and not treated, and of different length) correctly. For a long time, piRNAs have been considered exclusively expressed in germline cells, but recently, it has been reported by several studies their presence also in somatic and pathologic tissues . Germinal cells generally show the most significant number and a higher level of expression of piRNAs. Starting from this knowledge, we tested the workflow on small RNA data obtained from human testis samples and tumour cell line (COLO205) to assess the capability to detect piRNAs in high and low concentration. Using WIND, we analysed the dataset and represented the results as plots of biodetection ( Figure 2A and B) and countsBio ( Figure 2C and D) per sample from NOISeq package. Biodetection plots are made from raw data in order to explore: a) the percentages of each small RNA type (named "biotypes") in the genome (referred to the whole set of small RNAs provided); b) the proportion detected in each sample; c) the percentage of each biotype within the sample. The countsBio plots, instead, show the count distribution for each biotype displayed as box plots, and the number of sncRNAs detected per biotype. Here, the two biodetection plots show, as expected, the presence of higher percentage of piRNA in testis sample respect to the COLO 205 cells (~75% in testis and ~20% in COLO 205; Figure 2A and B). Moreover, considering the countsBio plots ( Figure 2C and D), it is also possible to assess piRNAs higher median expression in testis if compared to the COLO 205 cells. Finally, we also produced sequence logos for the expressed piRNAs in the two sample types. These plots indicate if the bias for uridine at first position base or the adenine at the 10 th position of the sequence exists and if there are other biases in the 15 first bases of the sequences . As expected, both groups showed a strong bias for uridine at the first position (drawn as thymine in the plots), in accordance with the preferential binding of PIWI proteins to transcripts starting with U. A bias, albeit with a much weaker signal, was also evident toward adenine at position 10 in testis group, a hallmark of piRNAs generated by the ping-pong cycle and typical feature of germinal cells ( Figure 2E and F).

Figure 2.

Example of plots generated by WIND.

Example of plots generated by WIND.

A) and B) Biodetection plots (genomic approach) from NOIseq reporting: percentages of each sncRNA type called "biotype" on the genome (grey bar) for one of the samples; the proportion detected in each sample (red stripes bar); the percentage of each biotype within the sample (red bar). The biotypes on the right side of the green dashed line are the least abundant, and the reference values are reported on the Y right axis. C) and D) CountsBio plots (genomic approach) from NOIseq showing the count distribution for each biotype displayed as boxplots. Numbers on top of the plot show how many sncRNAs are detected per biotype in the entire dataset analysed. Different colours indicate different sncRNA classes. E) and F) Sequence Logo (1-15 bps) extracted from the piRNA sequence of the expressed piRNAs found in each group of samples (transcriptomic approach). A, C, E represents the results obtained for one representative testis samples, while B, D, F represent one representative COLO 205 sample. For this analysis, we applied a stringent approach; thus, we considered as expressed only those molecules that were identified by both methods (genomic and transcriptomic). Then, we found that 7324 piRNAs were identified in testis and 223 in COLO 205 cells. Therefore, this workflow was able to efficiently identify a good number of piRNAs in somatic cells, where low levels of expression make the procedure more complicated, even when very stringent analysis parameters are used. It is worth mentioning that, as detailed in the Methods, this workflow operates using two methods in parallel, each of which is able to identify sncRNAs with different performance. Applying the two algorithms together (considering the union of the results) allows the identification of an enriched number of molecules. The final user can decide, based on specific interests, which results should take into consideration, the union of the two approaches, only one, or the intersection. We also evaluated the performance of the workflow for piRNA identification in the mouse. Specifically, we analysed small RNA-seq samples from mouse adult cardiac myocyte (aCM). In these samples, we were able to identify, considering the union of the genomic and transcriptomic approach, ∼290 different piRNAs per samples (see Extended data: Supplementary Table 3 for the details of the two analysis in comparison). We found that the piRNA population identified in aCM represents 12% of all reads assigned to small RNAs, and the top 100 expressed molecules are listed in Supplementary Table 3 ( Extended data). Moreover, to test the accuracy of the workflow across diverse sets of data, we moved to a public dataset. Recent findings have indicated that the role of piRNAs may not be only limited to germ cells, but may be extended to the regulation of cancer, promoting a stem-like state of tumour cells . Therefore, we selected a dataset (GSE68246) to compare the piRNA profiles of breast spheroid-enriched CSCs against parental MCF7 cells and also generated in this case files, statistics and plots with WIND that are all available on the GitHub repository. On the expression data, filtering for low-expressed features was first carried out, then two of the NOIseq filters (1 count per million, and proportional filtering) or the EdgeR were applied, filtering by group with and without the specific batch. The resulting objects were reported as RDS files and, for all the analysed sequences, a histogram ( Figure 3A and B) with the average log 2 CPM before and after filtering of the counts was made using the edgeR filtering. Finally, the normalisation of all the counts was carried out with multiple methods: TMM , TMMwsp (TMM with singleton pairing), RLE , limma-Voom, with and without quality weights quantile , Voom, with and without quality weights using the TMM normalisation, and Voom with and without quality weights exploiting the TMMwsp normalisation. To visualise the unforeseen sources of variation and to control whether the normalisation applied was correct, RLE plots ( Figure 3C and D) were generated for all the sequenced data, for each normalisation method and for the not normalised, filtered data. An RDS object was also exported with the list of all normalised objects, and hierarchical clustering ( Figure 3E and F) was then performed on previous data with various normalised methods. We applied the Euclidean distance and the methods of Ward's, complete and average linkage. Furthermore, a correlation plot ( Figure 4A) with sample-to-sample distances was made to show the similarities and dissimilarities between samples on all sncRNA data. In order to check for batch effects and get the summarised effects of the experimental categories, MDS plots ( Figure 4B) and the first two Principal Components on a PCA plot were reported ( Figure 4C). From the GTF file, sequences' lengths were extracted and combined with information about the expressed molecules to draw the barplots ( Figure 5), allowing to underline the differences between the two methods or between the groups of interest. Alongside, in this case, the sequences' logos for only the expressed piRNAs were generated. Moreover, we reported a tab-delimited file with the mean CPM per biological group, as it is useful to know these values for further studying or visualisation. All the files, plots and statistics are available in the GitHub repository.

Figure 3.

Exploratory data analysis plots generated by WIND.

A– B) Histograms of average log 2 Counts Per Million (CPM) among all samples before ( A) and after ( B) filtering with one of the selected methods (EdgeR filtering in this case) for sncRNA data. C– D) Relative Log Expression (RLE) plots for each normalisation method, made with the use of plotRLE function for all the sncRNA data. As an example, only the first two plots (with TMM ( D) and without normalisation ( C) for the filtered counts derived from FeatureCounts) are shown. E– F) Hierarchical Clustering plots, exploiting all the sequenced sncRNA data, with multiple clustering methods and different normalisation methods. As an example, only the first two plots (with TMM ( D) and without normalisation ( C) for the filtered counts derived from FeatureCounts). In black and brown are shown the two different groups (monolayer and spheroid).

Figure 4.

Sample group clustering plots.

A) Correlation plot showing samples' distances in GSE68246 dataset. The darker the colour, the more correlated they are. B) Multidimensional Scaling (MDS) plot using all the sequenced data and one of the normalisation methods applied in the workflow (in this case, TMM) made with plotMDS() function from EdgeR. In black and brown are shown the two different groups (monolayer and spheroid). C) Principal Components Analysis (PCA) plot displaying the first two Principal Components using all the sncRNA molecules data. Each sample is shown with different colours (depending on the group) and different symbols (depending on the batch).

Figure 5.

Barplots of the length of piRNA classes with respect to each experimental group (in this case monolayer and spheroid MCF7).

The colours indicate the two different methods of quantification (genomic and transcriptomic).

Exploratory data analysis plots generated by WIND.

Sample group clustering plots.

Barplots of the length of piRNA classes with respect to each experimental group (in this case monolayer and spheroid MCF7).

The colours indicate the two different methods of quantification (genomic and transcriptomic). Ultimately, we performed the differential expression analysis on the results of both methods (genomic and transcriptomic), and the union of the comparisons was reported ( Extended data: Supplementary Table 4). Our workflow identified 466 differentially expressed sncRNAs (p-value ≤ 0.05) using both methods and 352, considering the adjusted p-value ≤ 0.05. 63 miRNAs were found DE, in common with Boo et al. . Most importantly, we were able to identify 181 expressed piRNAs, 48 of which differentially expressed (adj. p-value < 0.05) between spheroids and parental cells, with 44 of them up- and 4 down-regulated ( Figure 6A). Their log-fold changes were varying from -2.60 to 8.05, and 20 of them derive from the sequences found in RNAcentral while 28 from piRNABank, thus showing the importance of including both databases in the final GTF file. Of these 48 DE piRNAs, three of them (DQ570940, DQ571550, DQ578783) have also been found DE in the work of Vella et al. in cardiosphere-derived cells. This suggests a possible functional role of this group of piRNAs in the stemness of cancer cells, independently from the tissue type.

Figure 6.

Differential expression analysis.

Differential expression analysis.

A) Heatmap of differentially expressed piRNAs in 3 MCF7 Spheroid samples versus 3 MCF7 Monolayer (GSE68246 public dataset) found in common with both approaches (genomic and transcriptomic). B) Heatmap of differentially expressed piRNAs among 9 Primary Solid Tumour versus 9 Solid Tissue Normal from TCGA found in common with both approaches (genomic and transcriptomic). As a final test, we exploited sncRNA data of 18 samples from TCGA-BRCA, 9 Primary Solid Tumor versus 9 Solid Tissue Normal ( Extended data: Supplementary Table 5). We identified 10 piRNAs DE out of 235 DE sncRNA molecules with both approaches (genomic and transcriptomic). In the heatmap ( Figure 6B), it is possible to note that the two approaches obtained equivalent results, and the clustering approach showed a good clustering between tumour and normal samples. Interestingly, some of the identified piRNAs have been previously described as related to cancer progression in tissues like kidney and lung (DQ581033 , DQ593398 - DQ592932 ). In addition, from the 235 DE sncRNA, 64 are reported as miRNA and most importantly, we found the cancer-specific MIR-8 (now reported as mir-141 and mir-200) upregulated as previously reported by Hoadley et al. . In order to acquire possible functional information about the DE piRNAs, we predicted their possible target RNAs (using the code included in the workflow), and we identified 11 protein-coding genes ( Extended data: Supplementary Table 5). Most of them were predicted to bind their targets at the 3' UTR and 4 at the 5' UTR. The functional enrichment analysis of the 11 predicted piRNA targets, using the EnrichR online tool , revealed that they might be involved in regulating "signal transduction that contributes to a DNA damage checkpoint" (GO:0072422), a biological process that has a vital role in cancer progression.

Conclusions

In this paper, we describe a novel bioinformatics workflow, WIND , for the identification and analysis of piRNA from small RNA sequencing data. The main innovations of WIND are: a Docker containerisation approach for the complete analysis, the integration of two databases for piRNA annotation, a dual-method for detection and quantification of piRNAs (named as "genomic" and "transcriptomic" in this article), and the creation of ready-to-use plots and statistics for the interpretation of the results. The idea was born in order to cope with the absence of a gold-standard pipeline for piRNA identification and annotation. We tried to solve many issues related to small RNA sequencing data analysis and, in particular, piRNA identification and quantification. For this reason, the first step was to deploy multiple Docker containers set up to run all the steps of the workflow without installing tools, software or libraries. After this, we focused on the creation of an easy method to integrate data from distinct databases (RNACentral and piRNABank). As described in methods, we were able to assess the deep diversity between the databases. Indeed, it was possible to notice not only differences in numbers of piRNAs annotated between the two databases (both in human and mouse genome) but also inconsistencies in the annotation or in the classification (e.g. the same molecule is classified as piRNA in one and as miRNA in the other). Actually, combining databases usually produces discrepancies and working with sncRNA sequences that have multiple annotations is troublesome. However, with this step, it is possible to obtain a unique GTF file that merges this information (all ids and genomic locations associated with that specific molecule) that can be used for piRNA identification and annotation. The main part of our workflow consists of two detection methods for piRNAs described above as "genomic" and "transcriptomic". For the genomic part, we decided to perform an alignment using STAR. STAR is a well-known genomic aligner that uses a reference genome to compute read alignments. For the transcriptomic part, we used Salmon to produce accurate transcript-level quantification estimates from sncRNA sequencing data. Salmon's main innovation is the use of quasi-mapping (accurate and very fast-to-compute read alignments). However, even if the transcriptomic approach proved to be working well, it has been demonstrated that for the identification of some sncRNAs might not be as efficient as the genomic approach . For this reason, we set the methods to improve sncRNAs identification, following the suggestions of previous works . Our idea was to combine the two approaches in order to evaluate the similarities between the results obtained and then ameliorate the identification of piRNAs. The last step was to create a Differential Expression module and, most importantly, the automatic creation of plots and statistics useful for the interpretation of data and results. To test WIND, we applied it to several sncRNA datasets. Working on the first dataset, were we used the spike-in approach, we found a good consistency between the different methods in the detection of piRNA-like molecules, highlighting the efficiency of both approaches in piRNA quantification. Furthermore, the test on germline and somatic tissues revealed that the two methods, even when a stringent filter is applied, are able to assess the presence of piRNAs also in tissues where they are not abundant. In addition, the workflow is also functional in different species, as shown by the results obtained on the mouse genome. Finally, we also tested WIND on two published datasets, comprising tumour cell lines and tissues. Our workflow, also in this instance, was able to identify efficiently piRNAs and find differentially expressed molecules (not previously investigated) and to recognise, in general, a significant number of sncRNAs. It is important to clarify that WIND deals with issues and discrepancies present in the original piRNA databases and it recategorizes piRNAs using sequences and genomic ranges information. However, it is still possible to find overlap between piRNAs and other ncRNAs, due to sequence similarity, ab initio categorization of a sequence as piRNA without further wet-lab validation, or real sncRNAs derived piRNAs. However, this information is stored in the final GTF. The purpose is to give complete information to allow studying piRNAs and other sncRNAs, highlighting possible ambiguities of the in silico analysis, which researchers will be able to evaluate and eventually resolve with wet-lab experiments, such as immunoprecipitation (IP) of PIWI-containing ribonucleoprotein complexes and/or the RNA periodate treatment . In conclusion, WIND is a complete dockerised workflow, usable by bioinformaticians and data analysts who want to explore small RNA sequencing data globally, but specifically designed and optimised for piRNAs. WIND allows going from raw data to plots and statistics ready for publication thanks to fast and efficient software implementation, making it very useful in the field of small RNA research.

Data availability

Underlying data

ArrayExpress: Monitor the efficiency of "WIND: A Workflow for pIRNAs aNd beyonD" for the identification of single-stranded (SS) spike-in piRNA-like molecules in smallRNA-seq, Accession number E-MTAB-9772: https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-9772/ ArrayExpress: Monitor the efficiency of "WIND: A Workflow for pIRNAs aNd beyonD" for the identification of piRNA molecules in small RNA-seq, Accession number E-MTAB-9782: https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-9782/ ArrayExpress: Monitor the efficiency of "WIND: A Workflow for pIRNAs aNd beyonD" for the identification of piRNA in mouse samples, Accession number E-MTAB-9866: https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-9866/ ArrayExpress: Analysis of the 3’-end of piRNAs in the COLO 205 cell line through sodium periodate (NaIO4) /β-Elimination treatment and small RNA-Seq, Accession number E-MTAB-8115: https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-8115/ NCBI Gene Expression Omnibus: miRNA transcriptome profiling of spheroid-enriched cells with cancer stem cell properties in human breast MCF-7 cell line, Accession number GSE68246: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE68246 Selected samples from the Genomic Data Commons Data Portal have been accessed and analysed from the TCGA-BRCA project: https://portal.gdc.cancer.gov/projects/TCGA-BRCA

Extended data

Zenodo: Supplementary tables, http://doi.org/10.5281/zenodo.4915906 . This project contains the following extended data: Supplementary Table 1. Statistics of GFT files obtained for human and mouse genome. The file reports the data of the filtering process and the final GTF data. Supplementary Table 2. Spike-in quantification. For each sample are shown the percentage of each piRNA-like molecules, respect to the raw reads count, using three quantification methods. Supplementary Table 3. Statistics of sncRNA data analysis for mouse cardiomyocytes. The file reports the results obtained using the two methods applied in the workflow and the list of top 100 expressed piRNAs. Supplementary Table 4. Differentially Expressed molecules found for GSE68246 dataset. In yellow are highlighted miRNA DE in common with Boo et al. , in red and green are highlighted the up- and down-expressed molecules respectively, in light blue the molecules with an adjusted p-value less than 0.05. Supplementary Table 5. Differentially Expressed molecules found for BRCA TCGA dataset. In red and green are highlighted the up- and down-expressed molecules respectively, in light blue the molecules with an adjusted p-value less than 0.05. For DE piRNA molecules, the predicted possible target RNAs are also provided. Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

Software availability

Workflow available from: https://github.com/ConYel/wind Archived workflow as at time of publication: http://doi.org/10.5281/zenodo.4289908 . License: MIT All software packages used throughout this workflow are publicly available through the Bioconductor project ( http://bioconductor.org), or the Comprehensive R Archive Network ( https://cran.r-project.org) and all bioinformatics tools are freely available as Docker containers on https://hub.docker.com/r/congelos/. I am satisfied with the author responses and with version 3 of this paper. Is the rationale for developing the new method (or application) clearly explained? Yes Is the description of the method technically sound? Yes Are the conclusions about the method and its performance adequately supported by the findings presented in the article? Partly If any results are presented, are all the source data underlying the results available to ensure full reproducibility? Yes Are sufficient details provided to allow replication of the method development and its use by others? Yes Reviewer Expertise: Molecular and cell biology I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard. The authors have addressed all my concerns. I have no further comments. Is the rationale for developing the new method (or application) clearly explained? Yes Is the description of the method technically sound? Yes Are the conclusions about the method and its performance adequately supported by the findings presented in the article? Yes If any results are presented, are all the source data underlying the results available to ensure full reproducibility? Yes Are sufficient details provided to allow replication of the method development and its use by others? Yes Reviewer Expertise: Genomics, Bioinformatics, Genetics. I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard. The authors have made important efforts to address my concerns raised in the previous version of this manuscript. This is a method article. As such, I think the method is designed in a thoughtful manner and tested with relevant controls, so it could constitute a useful resource to other researchers interested in piRNA annotation. However, although the abstract and discussion are method-oriented, the result section is strongly supportive of widespread somatic piRNA expression in mammals, which is controversial. I give the authors the credit to address this relevant research question by removing as much confounders from their GTF files as possible, and for their willingness to redo most of their analysis after my initial concerns. But an analysis of the sequence of many of the top 100 expressed piRNAs in mouse cardiomyocytes (Supplementary Table 3) shows that they still have 100% identity to rRNAs (e.g., mmu_piR_039147, mmu_piR_005109), snoRNAs (mmu_piR_032974, mmu_piR_034793, mmu_piR_025007), etc. Thus, although the design of the workflow intends to remove all piRNAs with overlap to mRNAs or ncRNA fragments, there seems to be a problem in the filtering step, and some of the article's claims are not fullfilled. My intention is not to further delay acceptance of this manuscript. But the fact that most remaining piRNAs are still non-coding RNA fragments should be at least analyzed and acknowledged. Supplementary tables should contain a list of the most abundant piRNAs in all the cases analyzed (including COLO 205 cells, etc), and sequences should be provided. Ideally, the authors could try to understand why many piRNAs with an overlap to ncRNA fragments are not being removed from their GTF files, fix the problem and redo the analysis. Minor: mSS28 seems to be resistant to periodate treatment, but mSS22 is not. Is my interpretation correct (Supp Table S2) Is the rationale for developing the new method (or application) clearly explained? Yes Is the description of the method technically sound? Yes Are the conclusions about the method and its performance adequately supported by the findings presented in the article? Partly If any results are presented, are all the source data underlying the results available to ensure full reproducibility? Yes Are sufficient details provided to allow replication of the method development and its use by others? Yes Reviewer Expertise: Molecular and cell biology I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above. Q1. This is a method article. As such, I think the method is designed in a thoughtful manner and tested with relevant controls, so it could constitute a useful resource to other researchers interested in piRNA annotation. However, although the abstract and discussion are method-oriented, the result section is strongly supportive of widespread somatic piRNA expression in mammals, which is controversial. A1. The aim of this article was to develop a workflow for sncRNA identification, focused on piRNAs, with the intention of detecting piRNAs (and not only them) independently of the tissue of origin. Indeed, we tested the WIND performance in detecting the presence of piRNAs in both somatic and germline tissues, but demonstrating the somatic expression of piRNAs is beyond this article’s aim. We used these types of data as proof of the sensing power of our workflow in identifying piRNAs even in the presence of low expressed molecules. Q2. I give the authors the credit to address this relevant research question by removing as much confounders from their GTF files as possible, and for their willingness to redo most of their analysis after my initial concerns. But an analysis of the sequence of many of the top 100 expressed piRNAs in mouse cardiomyocytes (Supplementary Table 3) shows that they still have 100% identity to rRNAs (e.g., mmu_piR_039147, mmu_piR_005109), snoRNAs (mmu_piR_032974, mmu_piR_034793, mmu_piR_025007), etc. Thus, although the design of the workflow intends to remove all piRNAs with overlap to mRNAs or ncRNA fragments, there seems to be a problem in the filtering step, and some of the article's claims are not fullfilled. My intention is not to further delay acceptance of this manuscript. But the fact that most remaining piRNAs are still non-coding RNA fragments should be at least analyzed and acknowledged. A2. The results obtained show that the workflow can detect piRNA sequences that have been annotated in various Databases as “piRNAs” until now. Although precautions have been taken for the removal of as many as possible sequences that could potentially not be piRNAs, in specific cases the removal of the sequences using only bioinformatic data is not possible, but would require a further study, possible only through wet-lab experiments (using various techniques), which can clarify the true nature of the molecule. As the workflow considers what in the databases exist ab initio, it is not possible to arbitrarily remove all piRNA sequences that have a high similarity or are included in the sequence of other sncRNAs, without the risk of eliminating real piRNAs from the analysis. Referring to the reviewer’s example, using BLASTN [Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb Miller (2000), "A greedy algorithm for aligning DNA sequences", J Comput Biol 2000; 7(1-2):203-14] on the sequence of “mmu_piR_039147”, we get as a result : “ The sequence of mmu_piR_039147 is: CCTCGACACAAGGGTTTGTAA and the sequence that is found to have a partial match with it is CCTCGACACAAGGGTTTGT, which is 2 nucleotides less. The genomic ranges of mmu_piR_039147 in the GTF file are chr6 87997228-87997248 - chr9 118674951-118674971 - In the GTF file (gencode.vM25.primary_assembly.annotation.gtf.gz) that we have used to search if these ranges are included in other sncRNAs, there are no overlaps with any sncRNAs for mmu_piR_039147. As we have seen from BLASTN, a part of the sequence is identical to “ We agree with the reviewer that some molecules can be fragments or other sncRNAs incorrectly annotated. However, it is not possible at present, without wet-lab experiments, to distinguish fragmentation products from molecules that can come from other ncRNAs such as rRNAs, tRNAs, and snoRNAs, as recently proposed by Barreñada We believe that WIND workflow must show all the possible reproducible results and all the useful information and then the data analyst, with the help of the biologist and eventually experimental approaches, will be able to deepen the study and possibly focus on the dubious sequences, to establish whether they are or no piRNA when ambiguity exists. As we already know that these issues exist, we have included as a metadata column on the GTF file, for all the piRNA genomic regions that fall inside other sncRNAs, this information, thus the data analyst can directly check if what identified as expressed or differentially expressed has multiple annotations. In other words, for example, if the piRNA identified by the workflow is inside a miRNA precursor or tRNA, this information can be found in the metadata column “GENCODE_annot” of the GTF. Moreover, it is possible to obtain a file with the detail of all mapped reads assigned to a specific piRNA and verify if they are multi-mapped or not. The data analyst can use this file to see if most of the reads are associated with a complete sequence of a piRNA or if some (for example shorter reads) also map to another sncRNA. In conclusion, the workflow tries to deal with the issues of piRNA annotation that the databases have until now; however, in our opinion, to completely solve these issues, it is not conceivable to use only an in-silico approach but it is necessary to combine it with wet laboratory techniques, as immunoprecipitation (IP) of PIWI-containing ribonucleoprotein complexes and/or the RNA periodate treatment, to define which sequence is truly a piRNA. Our contribution, with this workflow, is to mitigate the already existing problems and to use it as a backbone for potential future strategies for sncRNA analysis. In order to be clear about these concerns and to clarify to the users the type of results provided by WIND, we added these sentences in the conclusion section: Q3. Supplementary tables should contain a list of the most abundant piRNAs in all the cases analyzed (including COLO 205 cells, etc), and sequences should be provided. Ideally, the authors could try to understand why many piRNAs with an overlap to ncRNA fragments are not being removed from their GTF files, fix the problem and redo the analysis. A3. All resulting files (the top 100 expressed piRNAs, along with other 28 info files and 21 PDF files for each analyzed dataset) are published in the GitHub repository and can be found in this Furthermore, during the 2nd revision, our main point was to remove any incongruencies between piRNA genomic positions that fall inside gene coding regions. The existence of piRNAs with an overlap to sncRNA fragments is a well-known issue that our workflow tried to mitigate, however, despite our best efforts, it is not possible to completely remove all these molecules without further wet-lab investigation. For now, we prefer to keep the sequences that are inside sncRNAs and give that information to the users as metadata. In this way, the researchers have all the information and can decide if, for the molecules of interest, it is necessary to confirm the molecule's nature with the help of wet-lab experiments. Moreover, the workflow is under constant development and better ways of solving this kind of issue could be found in the future and integrated with the workflow. Any data analyst or biologist that has a specific request, about the workflow, is able to make a suggestion or even improve it by himself. Q4.Minor: mSS28 seems to be resistant to periodate treatment, but mSS22 is not. Is my interpretation correct (Supp Table S2) A4. Thanks to the reviewer, we noticed a missing part of the table in the submitted Supplementary table 2. We have fixed the error and we re-uploaded it. The authors developed a new computational pipeline for analysis of piRNAs in small RNA sequencing data. The methods were elaborated clearly in the manuscript. The pipeline was tested with real small RNA sequencing data of human and mouse. The manuscript is well written and the WIND pipeline will be a good tool facilitating the analysis of piRNAs. I have several comments and suggestions for the authors. Major: Minor: The title should be revised as “a strategy for in-depth analysis of small RNA-seq data” is confusing. The WIND pipeline mainly focuses on the analysis of piRNA, which is only a category of all small RNA species. For example, miRNA is ignored by WIND. I agree with the author that the key to piRNA analysis is the comprehensive and accurate identification of piRNAs and piRNA precursor. However, only piRNABank and RNAcentral were used in the “Annotation forging” step of WIND while updated databases of piRNA and piRNA cluster had been published. I suggest the authors integrating piRBase and piRNA cluster database in the WIND pipeline. Figure 1. “small RNA Fastq files” were not used in the “Annotation forging” step, and should be placed in the “Pre-processing and quantification” step. In the “Annotation forging” step, I noticed two length thresholds (100 bp and 69 bp) were used when building the “final Fasta file”. So why 100 and 69? Figure 5 is bar plot rather than histogram. Figure 6B - part of the legend at the bottom is obscured. Is the rationale for developing the new method (or application) clearly explained? Yes Is the description of the method technically sound? Yes Are the conclusions about the method and its performance adequately supported by the findings presented in the article? Yes If any results are presented, are all the source data underlying the results available to ensure full reproducibility? Yes Are sufficient details provided to allow replication of the method development and its use by others? Yes Reviewer Expertise: Genomics, Bioinformatics, Genetics. I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above. Q.1 The title should be revised as “a strategy for in-depth analysis of small RNA-seq data” is confusing. The WIND pipeline mainly focuses on the analysis of piRNA, which is only a category of all small RNA species. For example, miRNA is ignored by WIND. A.1 The most problematic part of this workflow was, without any doubt, creating a good GTF file and removing all the possible noise from piRNA sequences. We agree that, in this paper, we focused our attention mostly on piRNAs but our workflow is made for analysing all species of sncRNAs as shown in the figures. A user can easily focus his attention on another small RNA species and perform all the analyses exploiting the genomic and the transcriptomic approach. For this reason, we choose to write in the title “piRNAs and beyond”, to indicate that our workflow is suggested not only for the study of piRNAs but can also be used to analyse other sncRNAs molecules. Q.2 I agree with the author that the key to piRNA analysis is the comprehensive and accurate identification of piRNAs and piRNA precursor. However, only piRNABank and RNAcentral were used in the “Annotation forging” step of WIND while updated databases of piRNA and piRNA cluster had been published. I suggest the authors integrating piRBase and piRNA cluster database in the WIND pipeline. A.2 Thank you for your valuable comment. Initially, we have incorporated piRNAbank in our WIND pipeline as the first building block for the piRNAs annotation track. After that, we incorporated RNACentral to include also sncRNAs sequences in the annotation track in order to have a complete view of the sncRNAs species (known until now). However, regarding the additional piRBase database, we are facing a major bottleneck in including it. In detail, the database provides the largest and more comprehensive resource of piRNA sequences annotation, including more than 8 million sequences for human and about 60 million sequences for mouse. A major problem is that many of these sequences are located in the same genomic locus, and they differ only in one or a few nucleotides opening the possibility, not yet demonstrated, that they could be piRNA isoforms produced by variations on 5’ or 3’ end, including nucleotides extension, addition or trimming. This situation raises the question of how these molecules should be quantified and including all of them as independent molecules (“In piRBase, if a piRNA sequence is a subsequence of another piRNA, both of them were considered as different sequences and were assigned distinct piRBase names.” cited from Wang ) would significantly falsify the abundance ratio in the quantification step when measuring gene expression with Featurecounts or Salmon. Indeed, this could be an additional methodological constraint that would generate biased counts for any further downstream analysis (e.g. differential expression analysis). On this premise, we are considering including piRBase in a future version of WIND, but some issues about the quantification need to be solved. Nevertheless, about piRNAClusterDB (piRNAcldb), as suggested, we integrated the information included in piRNAcldb as metadata in the final GTF file produced by the annotation forging step. We updated the workflow shown in Figure 1 and we added the following sentence in the Method section: Q.3 Figure 1. “small RNA Fastq files” were not used in the “Annotation forging” step, and should be placed in the “Pre-processing and quantification” step. A.3 We modified the figure as suggested. Q.4 In the “Annotation forging” step, I noticed two-length thresholds (100 bp and 69 bp) were used when building the “final Fasta file”. So why 100 and 69? A.4 We have selected 100 nts as a filter for sncRNAs sequences on both databases to filter out all other sequences that were too long to be sncRNAs. Following, we added a new filter to the sequences deriving from the piRNABank alignments to the genome (hg38). Although the piRNABank sequences that we used are shorter than 34 nts, when these sequences were aligned to the genome, the genomic ranges were in some cases ≥ 69 nts due to gap-opening, specifically, we found few sequences aligning on genomic regions longer than 33nts (69, 75, 81, 87 and 99 nts). This could be due to the fact that the original piRNABank database was built on genome version hg18. In order to exclude these genomic ranges, which correspond to molecules that do not align correctly on the new version of the genome, we applied the filter at 69 nts as piRNAs are considered to be about 28-34 nts. However, the users can easily change this number or remove this filter as they prefer. Minor comments: Q.5 Figure 5 is bar plot rather than histogram. A.5 We corrected the error in the figure legend and in the text. Q.6 Figure 6B - part of the legend at the bottom is obscured. A.6 We corrected the issue. The authors developed a workflow for small RNA-seq data analysis, especially intended for the study of Piwi-interacting RNAs or piRNAs. The authors called their workflow WIND (for Workflow for pIrnas aNd beyonD) which is a nice name but should not be presented as an acronym because it is not. The manuscript is professionally written and reads very well, and I think the workflow is complete and can be a useful resource. However, I have major concerns in its design that I will try to explain below. One of the motivations of the authors was to develop a reliable package for piRNA analysis that can be also applied for piRNA identification in somatic cells. The authors cite our 2018 study (ref. 25) so they are aware that piRNA databases contain a small percentage of contaminating entries that are probably not piRNAs. They considered this information in the design of their workflow and removed all piRNA reads in piRNABank that also have an alternative annotation in RNA Central. However, it is not clear to me whether they did the same with the piRNA sequences in RNA Central that also have an alternative annotation in RNA Central. Figure 2A shows that roughly 80% of the sequences in testis are piRNAs, and also 80% of the sequences in testis are annotated as tRNAs. So this is the proof that, if the authors really intended to depurate their GTF file from piRNAs also having an annotation in RNA Central, they were not effective in doing so. And this completely alters the author’s conclusions regarding non-germinal piRNAs. Another concern is that RNA Central is explicitly a database of non-coding RNAs. Thus, by removing entries in piRNABank that have an alternative annotation in RNA Central, they are not removing those piRNAs in piRNABank that are mRNA-fragments. How to distinguish secondary piRNAs generated from cleavage of coding sequences from mRNA fragments contaminating piRNA databases? It is not surprising, therefore, that the authors found that the remaing piRNAs in their GTF file are either derived from piRNA clusters or from protein-coding genes. This is a methodological bias and does not seem to be a deliberate decision based on the biology of the piRNA pathway. Again, this can completely alter the authors’ results and conclusions when analyzing RNA-seq data obtained in somatic cells using WIND. The authors affirm that they focused in “solving on of the most challenging issues of (small RNA) analysis, the annotation controversies of piRNAs”. I’m afraid that, in my opinion, this controversy is still not solved. I would suggest the authors to read our last contribution in this topic (Tosar et al. 2020 ) and reconsider the design of their workflow based on what we discussed in that paper. My suggestion is to take the union of piRNAs from piRNABank and RNA Central, and remove those sequences that have alternative annotations in RNA Central. This can be used to construct GTF file 1 containing “canonical” piRNAs derived from piRNA clusters and also mRNA fragments (whether truly piRNAs or not). Then, remove sequences matching to human or mouse mRNAs from RefSeq to make GTF file 2, containing sequences that can only be classified as piRNAs. Repeat their analysis and compare the results shown in this manuscript with what they see based on my suggested approach. Minor comments: I hope the authors find this suggestions useful and my comments constructive. Consider adding “testis” and “COLO 205 cells” as a headline in Figure 2. The authors refer that the problem of detecting piRNAs in COLO 205 cells is their low expression. However, there are some sequences which are highly expressed according to Figure 2, D. Are these sequences really piRNAs? A brief description of the sequencing library preparation should be supplied. If the authors spiked in methylated RNAs, treating the samples with sodium periodate before NGS could have been an interesting control. Sequence logos are nice and can be informative, but the workflow could be more powerful if it included plots showing ping-pong signals. Why 69 nt as a cut-off? Is the rationale for developing the new method (or application) clearly explained? Yes Is the description of the method technically sound? Yes Are the conclusions about the method and its performance adequately supported by the findings presented in the article? Partly If any results are presented, are all the source data underlying the results available to ensure full reproducibility? Yes Are sufficient details provided to allow replication of the method development and its use by others? Yes Reviewer Expertise: Molecular and cell biology I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above. Q.1 One of the motivations of the authors was to develop a reliable package for piRNA analysis that can be also applied for piRNA identification in somatic cells. The authors cite our 2018 study (ref. 25) so they are aware that piRNA databases contain a small percentage of contaminating entries that are probably not piRNAs. They considered this information in the design of their workflow and removed all piRNA reads in piRNABank that also have an alternative annotation in RNA Central. However, it is not clear to me whether they did the same with the piRNA sequences in RNA Central that also have an alternative annotation in RNA Central. Figure 2A shows that roughly 80% of the sequences in testis are piRNAs, and also 80% of the sequences in testis are annotated as tRNAs. So this is the proof that, if the authors really intended to depurate their GTF file from piRNAs also having an annotation in RNA Central, they were not effective in doing so. And this completely alters the author’s conclusions regarding non-germinal piRNAs. A.1 We thank the reviewer for the helpful suggestion. We removed all piRNA sequences in piRNABank that also have an alternative annotation in RNA Central. However, checking the sequences in RNA Central that also have an alternative annotation in RNA Central, we found only six molecules and none of them is a piRNA. These sequences have been deduplicated in the final GTF, and corresponded to snoRNAs. We apologize to the reviewer for the misunderstanding, in Figure 2, the categories on the right of the green dashed line should be referred to the axis on the right. In this specific case, the piRNAs in the testis sample are ~75% while tRNA are ~4% as shown on the axis on the right of the plot. To clarify this point, we specified this in the legend as follows: << The biotypes on the right side of the green dashed line are the least abundant, and the reference values are reported on the Y right axis.>> Q.2 Another concern is that RNA Central is explicitly a database of non-coding RNAs. Thus, by removing entries in piRNABank that have an alternative annotation in RNA Central, they are not removing those piRNAs in piRNABank that are mRNA-fragments. How to distinguish secondary piRNAs generated from cleavage of coding sequences from mRNA fragments contaminating piRNA databases? It is not surprising, therefore, that the authors found that the remaining piRNAs in their GTF file are either derived from piRNA clusters or from protein-coding genes. This is a methodological bias and does not seem to be a deliberate decision based on the biology of the piRNA pathway. Again, this can completely alter the authors’ results and conclusions when analyzing RNA-seq data obtained in somatic cells using WIND. A.2 We want to thank the reviewer for this challenging question. As suggested in question 3, we have created a new GTF that takes this into account by removing the sequences matching to human or mouse mRNAs. For more details see “answer 3”. Q.3 The authors affirm that they focused in “solving one of the most challenging issues of (small RNA) analysis, the annotation controversies of piRNAs”. I’m afraid that, in my opinion, this controversy is still not solved. I would suggest the authors to read our last contribution in this topic (Tosar et al. 20201) and reconsider the design of their workflow based on what we discussed in that paper. My suggestion is to take the union of piRNAs from piRNABank and RNACentral, and remove those sequences that have alternative annotations in RNACentral. This can be used to construct GTF file 1 containing “canonical” piRNAs derived from piRNA clusters and also mRNA fragments (whether truly piRNAs or not). Then, remove sequences matching to human or mouse mRNAs from RefSeq to make GTF file 2, containing sequences that can only be classified as piRNAs. Repeat their analysis and compare the results shown in this manuscript with what they see based on my suggested approach. A.3 We want to thank the reviewer for this very illuminating article regarding the piRNA annotation challenges. We agree with the reviewer that the problem has not been completely resolved, but we are trying to move in that direction and above all, we are trying to highlight that this problem must be considered and addressed in order to study the piRNAs correctly. Using the suggestions proposed, we created a new GTF file removing the sequences matching to human or mouse mRNAs. Using this approach, it is possible to note that some differences exist between the previous and the new results. For this reason, we modified all the tables, figures and data in the text accordingly. We are confident that now our workflow is stronger and more robust. However, it is possible to obtain the previous GTF from GitHub or change the code in order to produce the preferred GTF. Finally, we have also added on the method section the details about the creation of this new GTF file: <> Minor comments: Q.4 Consider adding “testis” and “COLO 205 cells” as a headline in Figure 2. A.4 We modified the figure as suggested. Q.5 The authors refer that the problem of detecting piRNAs in COLO 205 cells is their low expression. However, there are some sequences which are highly expressed according to Figure 2, D. Are these sequences really piRNAs? A.5 When we talk about the low expression of piRNAs in COLO205 we are referring to an average expression of all identified molecules, however, among these, there are also molecules with a higher expression. In any case, after the creation of the new GTF file, as described before, we reanalysed all the datasets and we were still able to identify some molecules highly expressed and classified as piRNAs. Obviously, WIND exploits the previous knowledge about sncRNAs, this means that the obtained results, even if more accurate thanks to these new improvements, are still limited due to the primary databases used. However, as always suggested, a wet-lab validation should be necessary to confirm the in silico results and to really establish the correct identification and function of the discovered molecules, but this is beyond the scope of this article. Q.6 A brief description of the sequencing library preparation should be supplied. If the authors spiked in methylated RNAs, treating the samples with sodium periodate before NGS could have been an interesting control. A.6 As suggested by the reviewer, we added three samples treated with sodium periodate before library preparation and now, in Supplementary Table 2, is available the “Treated” table with the percentages of all the spike-ins used. Furthermore, we modified the method section accordingly: Q.7 Sequence logos are nice and can be informative, but the workflow could be more powerful if it included plots showing ping-pong signals. A.7 We added in the GitHub repository a code called “pinp_pong.Rmd” that creates a ping-pong plot or a coverage plot for both the strands from a BAM file selected for piRNAs. Finally, we modified the Methods section as following: Q.8 Why 69 nt as a cut-off? A.8 We have selected 100 nts as a filter for sncRNAs sequences on both databases to filter out all other sequences that were too long to be sncRNAs. Following, we added a new filter to the sequences deriving from the piRNABank alignments to the genome (hg38). Although the piRNABank sequences that we used are shorter than 34 nts, when these sequences were aligned to the genome, the genomic ranges were in some cases ≥ 69 nts due to gap-opening, specifically, we found few sequences aligning on genomic regions longer than 33nts (69, 75, 81, 87 and 99 nts). This could be due to the fact that the original piRNABank database was built on genome version hg18. In order to exclude these genomic ranges, which correspond to molecules that do not align correctly on the new version of the genome, we applied the filter at 69 nts as piRNAs are considered to be about 28-34 nts. However, the users can easily change this number or remove this filter as they prefer.

70 in total

Review 1. piRNA clusters as a main source of small RNAs in the animal germline.

Authors: I A Olovnikov; A I Kalmykova
Journal: Biochemistry (Mosc) Date: 2013-06 Impact factor: 2.487

Review 2. Small non-coding RNA and cancer.

Authors: Giulia Romano; Dario Veneziano; Mario Acunzo; Carlo M Croce
Journal: Carcinogenesis Date: 2017-05-01 Impact factor: 4.944

3. STAR: ultrafast universal RNA-seq aligner.

Authors: Alexander Dobin; Carrie A Davis; Felix Schlesinger; Jorg Drenkow; Chris Zaleski; Sonali Jha; Philippe Batut; Mark Chaisson; Thomas R Gingeras
Journal: Bioinformatics Date: 2012-10-25 Impact factor: 6.937

4. ggbio: an R package for extending the grammar of graphics for genomic data.

Authors: Tengfei Yin; Dianne Cook; Michael Lawrence
Journal: Genome Biol Date: 2012-08-31 Impact factor: 13.583

5. RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR.

Authors: Charity W Law; Monther Alhamdoosh; Shian Su; Gordon K Smyth; Matthew E Ritchie
Journal: F1000Res Date: 2016-06-17

6. RLE plots: Visualizing unwanted variation in high dimensional data.

Authors: Luke C Gandolfo; Terence P Speed
Journal: PLoS One Date: 2018-02-05 Impact factor: 3.240

7. Cell-of-Origin Patterns Dominate the Molecular Classification of 10,000 Tumors from 33 Types of Cancer.

Authors: Katherine A Hoadley; Christina Yau; Toshinori Hinoue; Denise M Wolf; Alexander J Lazar; Esther Drill; Ronglai Shen; Alison M Taylor; Andrew D Cherniack; Vésteinn Thorsson; Rehan Akbani; Reanne Bowlby; Christopher K Wong; Maciej Wiznerowicz; Francisco Sanchez-Vega; A Gordon Robertson; Barbara G Schneider; Michael S Lawrence; Houtan Noushmehr; Tathiane M Malta; Joshua M Stuart; Christopher C Benz; Peter W Laird
Journal: Cell Date: 2018-04-05 Impact factor: 41.582