Literature DB >> 30153801

miRge 2.0 for comprehensive analysis of microRNA sequencing data.

Yin Lu1, Alexander S Baras1, Marc K Halushka2.   

Abstract

BACKGROUND: miRNAs play important roles in the regulation of gene expression. The rapidly developing field of microRNA sequencing (miRNA-seq; small RNA-seq) needs comprehensive, robust, user-friendly and standardized bioinformatics tools to analyze these large datasets. We present miRge 2.0, in which multiple enhancements were made towards these goals.
RESULTS: miRge 2.0 has become more comprehensive with increased functionality including a novel miRNA detection method, A-to-I editing analysis, integrated standardized GFF3 isomiR reporting, and improved alignment to miRNAs. The novel miRNA detection method uniquely uses both miRNA hairpin sequence structure and composition of isomiRs resulting in higher specificity for potential miRNA identification. Using known miRNA data, our support vector machine (SVM) model predicted miRNAs with an average Matthews correlation coefficient (MCC) of 0.939 over 32 human cell datasets and outperformed miRDeep2 and miRAnalyzer regarding phylogenetic conservation. The A-to-I editing detection strongly correlated with a reference dataset with adjusted R2 = 0.96. miRge 2.0 is the most up-to-date aligner with custom libraries to both miRBase v22 and MirGeneDB v2.0 for 6 species: human, mouse, rat, fruit fly, nematode and zebrafish; and has a tool to create custom libraries. For user-friendliness, miRge 2.0 is incorporated into bcbio-nextgen and implementable through Bioconda.
CONCLUSIONS: miRge 2.0 is a redesigned, leading miRNA RNA-seq aligner with several improvements and novel utilities. miRge 2.0 is freely available at: https://github.com/mhalushka/miRge .

Entities:  

Keywords:  Alignment; Small RNA-seq; isomiR; miRNA

Mesh:

Substances:

Year:  2018        PMID: 30153801      PMCID: PMC6112139          DOI: 10.1186/s12859-018-2287-y

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.169


Background

MicroRNAs (miRNAs) are short, single-stranded RNAs that post-transcriptionally regulate gene expression via mRNA decay and/or translational repression [1, 2]. MiRNAs are transcribed by RNA polymerases II and III, generating precursors that undergo a series of cleavage events to form mature miRNAs [3]. Around 30 to 60% of all human protein coding genes are regulated by miRNAs [4], involved in almost all biological process ranging from development to metabolism to cancer [5-7]. With the continued popularity of small RNA sequencing to characterize miRNAs, much attention has been focused on miRNA alignment software. In 2015 we introduced miRge, a fast, multiplexing method to align miRNAs and other RNA species to expressed libraries [8]. Since that time, a number of developments in the field have occurred necessitating improvements to this alignment tool. The number and classification of true miRNAs has become controversial. miRBase, the central resource for miRNA curation, lists 2656 human miRNAs in their recently updated version (v22) [9]. Other manuscripts have listed thousands more putative novel miRNAs [10-12] including new passenger miRNA sequences of known miRNAs. However, the MirGeneDB group has indicated, using strict criteria, that only 586 human miRNA genes (1171 miRNA 5p and 3p strands) exist, calling into question the continued search for novel miRNAs and perhaps the loose methods employed to designate short RNAs as miRNAs from deep RNA-seq data [13]. In recent years, there has also been an increased awareness and value placed on isomiRs. IsomiRs are categorized into three main classes: 5′ isomiRs, 3′ isomiRs and polymorphic isomiRs, with 5′ and 3′ isomiRs subclassified into templated and nontemplated modifications [14]. The 5′ and 3′ isomiRs are the result of imprecise and alternative cleavage during the precursor miRNA (pre-miRNA) processing, post-transcriptional modifications, and/or editing by various post-transcriptional enzymes including exoribonucleases and nucleotidyl transferases [15-19]. IsomiRs are beginning to be considered as more selective than just miRNA expression levels and must become well-characterized [20] and taken into account for alignment strategies [21]. True internal modifications (not technical artifacts) are generally the result of adenosine deaminase (ADAR) acting on RNA to cause an A to I modification [22] as noted in a variety of RNA species. Recently, a call to develop a consistent nomenclature for isomiRs using a. GFF3 file format has been made. In response to these advancements, we now report major improvements in the 2.0 version of miRge. These include a highly-specific novel miRNA detector based on a machine learning algorithm, a standardized GFF3 isomiR reporting option, and an A-to-I (ADAR1) modification detector. Smaller revisions have been made to the algorithm and libraries to improve miRNA and tRNA calling, increase flexibility of reporting and unification of the code base to Python for ease of programming and allowing for the implementation of miRge 2.0 into the bcbio-nextgen framework. Bcbio-nextgen is a shared-community Python-based toolkit for pipelining and automated analysis of deep sequencing data (https://github.com/bcbio/bcbio-nextgen). We report the improvements and comparisons to other tools below.

Implementation

Sequence databases and software dependencies

miRNA libraries were obtained from both miRBase.org [9] and MirGeneDB [13, 23]. mRNA and noncoding libraries were obtained from Ensembl (www.ensembl.org) and other sources (See Additional file 1: Extended Materials and Methods). miRge 2.0 was written in Python (2.7.12) and utilizes a number of tools and libraries including Bowtie (v1.1.1) [24], RNAfold (v2.3.5) [25], SAMtools (v1.5) [26], cutadapt (v1.11) [27], biopython (v1.68), sklearn (v0.18.1), numPy (v1.11.0), SciPy (v0.17.0), pandas (v0.21.0), reportlab (v3.3.0) and forgi (v0.20). (See Additional file 1: Extended Materials and Methods), which are included in an installer. The entire package is available through Bioconda and https://github.com/mhalushka/miRge. miRge 2.0 runs on a Linux platform (Ubuntu 16.04.3).

miRge 2.0 workflow

Figure 1 shows the workflow of miRge 2.0. In Fig. 1, similar to the original miRge, the input FASTQ (or FASTQ.GZ) file(s) undergo prealignment steps of quality control, adaptor removal (cutadapt v1.11) and collapse into unique reads. Their observed counts are subsequently merged across all unique samples [8]. This file is then annotated against multiple search libraries: mature miRNAs, miRNA hairpins, mRNAs, mature & primary tRNAs, snoRNAs, rRNAs, other non-coding RNA, and (optional) known RNA spike-in sequences [28, 29]. A full rationale of the method was given previously [8] and additional modifications are described in “Improvements of miRge 2.0” and in the Additional file 1: Extended Material and Methods.
Fig. 1

Workflow of miRge 2.0. It illustrates the flow chart from input to output. The models of A-to-I editing sites for known miRNAs and novel miRNAs detection are newly added functions, while the original outputs are shown in dashed box

Workflow of miRge 2.0. It illustrates the flow chart from input to output. The models of A-to-I editing sites for known miRNAs and novel miRNAs detection are newly added functions, while the original outputs are shown in dashed box

Datasets to model novel miRNA detection

Sequencing datasets from 17 tissues in human and mouse (adrenal, bladder, blood, brain prefrontal cortex, colon, epididymis, heart, kidney, liver, lung, pancreas, placenta, retina, skeletal muscle, skin, testes and thyroid) were retrieved from the NCBI Sequence Read Archive (SRA) (Table 1).
Table 1

Data sets for constructing the predictive model in human and mouse

Tissue TypeSRA References in humanSRA References in mouse
AdrenalSRR944031, SRR944034SRR3653309, SRR3653310
BladderSRR333658, SRR333674SRR3652859, SRR3652860
BloodSRR837475, SRR837477SRR5241767, SRR5241768
Brain Prefrontal CortexSRR1635903, ERR409900SRR3540303, SRR3540304
ColonSRR837839, SRR837842SRR1973865
EpididymisSRR384894NA
HeartSRR553574, ERR038425SRR5832818, SRR5832819
KidneySRR553575, ERR038420SRR3652244, SRR3652245
LiverERR038413, ERR038410SRR5832837, SRR5832838
LungSRR372648, SRR372650SRR5059366, SRR5059367
PancreasERR852097, ERR852099SRR1973869
PlacentaSRR567637, SRR567638NA
RetinaERR973611, ERR973613SRR1427160, SRR1427161
Skeletal MuscleSRR1635908, SRR1820680SRR3651659, SRR3651660
SkinSRR2174513, SRR2174517SRR3402126, SRR3402132
TestesSRR333680, SRR553576SRR1647951, SRR1647953
ThyroidSRR1291267, SRR1291269NA
Data sets for constructing the predictive model in human and mouse These samples were processed through miRge 2.0 to identify the different RNA species for machine learning controls. MirGeneDB miRNAs were used to assemble positive clusters (known miRNAs). RNAs in the categories of tRNA, snoRNA, rRNA or mRNA were used to assemble negative clusters (known non-miRNAs). Sequences in repeat elements were excluded. The details regarding the final selection of RNA species used are listed in “Generation of read clusters” and in the Additional file 1: Extended Materials and Methods.

Clustering reads to determine features for model construction and novel miRNA detection

To build a predictive model and for novel miRNA detection, unmapped reads are aligned and clustered to the genome. Figure 2a illustrates this process, specifically during the development of the predictive model, where known miRNAs and known non-miRNAs are processed through multiple steps. During novel miRNA detection, all reads are processed as one, features are determined and these are then fed into the predictive model, rather than being used to build the predictive model. For both development and use of the predictive model, structural features (Additional file 2: Table S1) were generated from the clusters. These methods are described in further detail in the Additional file 1: Extended Materials and Methods.
Fig. 2

The process of construction of the predictive model. a The building of the predictive model composed of data preparation, feature calculation, feature selection and machine learning model training. (Key parameters are in parentheses.) b Schematic diagram of generating a stable range of clustered sequences in a cluster. The sequences in the cluster were aligned against the assembled sequence. The probability of the major nucleotide at each position was computed. A threshold of 0.8 was selected to determine the stable range of the cluster sequence

The process of construction of the predictive model. a The building of the predictive model composed of data preparation, feature calculation, feature selection and machine learning model training. (Key parameters are in parentheses.) b Schematic diagram of generating a stable range of clustered sequences in a cluster. The sequences in the cluster were aligned against the assembled sequence. The probability of the major nucleotide at each position was computed. A threshold of 0.8 was selected to determine the stable range of the cluster sequence

Calculating compositional features of read clusters

miRNAs have a characteristic processing pattern by DICER and DROSHA to create a unique, but complicated family of reads and isomiRs. The 5′ ends of the family of reads tends to consistently begin at the same nucleotide, the 3′ end tends to be variable and nucleotide additions of uracils (U/T) or adenines (A) are frequently seen here. Non-miRNAs tend to not share these features, so this difference can be exploited (Additional file 3: Figure S1). In order to codify these patterns, several features of read clusters were defined as follows: 1) the 5′ and 3′ unstable length of the cluster; 2) genome (DNA) nucleotide proportion at the positions − 3, − 2, − 1 of 5′ and + 1, + 2, + 3, + 4, + 5, + 6 of 3′ in the stable range of the cluster sequences; 3) A, T, C, and G percentages in the expressed RNA at the positions − 3, − 2, − 1 of 5′ and + 1, + 2, + 3, + 4, + 5, + 6 of 3′ in the stable range of the cluster sequences. In addition, sequence type count, total read count and the proportion of reads that are an exact match to the cluster sequences were calculated as well.

Prediction models for novel miRNA detection

We generated measurable features associated with read cluster composition and precursor miRNA structures. These features are listed in Additional file 2: Table S1. The discrimination power of each feature was ranked by its Minimum Redundancy Maximum Relevance (mRMR) score. We applied forward stepwise feature selection [30], to subselect the most informative features. To test the model for robustness, the dataset was randomly split into training and validation sets at the ratio of 4:1 in 10 replicates. Standardization was performed to scale the features into zero mean and unit variance. The parameters of the estimator were optimized by 10 fold cross-validated grid-search over a parameter grid. The searching space of C and gamma in radial basis function kernel of SVM [31] were {0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000}. The SVM model was implemented by scikit-learn Python package (http://scikit-learn.org). Matthews correlation coefficient (MCC) was used to evaluate the performance of the training model. The models were additionally tested on 12 rat samples (Additional file 4: Table S2).

A-to-I editing analysis

We utilized the mapped output file to identify all reads corresponding to each miRNA for A-to-I editing, as noted as an A to G change. Four exclusion criteria were made to reduce false positive A-to-I identifications based on possible and known sequence similarities and alignment problems. We excluded the putative A-to-I signals if: 1) the locations where similar miRNA families or miRNA SNPs that have A/G differences could be mistaken for A-to-I changes (i.e. nucleotide position 19 in let-7a-5p and let-7c-5p which differ only by an A/G variation and miR- 548al which has a SNP (A-to-G) at position 8 with the frequency of 0.18); 2) the 455 miRNAs found in repeat elements which could give false positives (i.e. miR-6503-3p is located in a MTL1D long terminal repeat.); 3) the miRNAs where the RPM of the canonical sequence is less than 1; 4) the miRNAs where the corresponding one nucleotide switched sequence (A to G) can be aligned to more than 1 location in the genome with trimming the last two nucleotides at 3′. Further description is present in the Additional file 1: Extended Materials and Methods.

GFF3 isomiR reporting

The increased awareness and interest in isomiRs is challenged by the lack of consistent isomiR reporting. As a result, a consensus standard has been developed by the miRTop consortium utilizing CIGAR values (https://samtools.github.io/hts-specs/SAMv1.pdf). This GFF3 formatted output reports on each isomiR sequence and its relationship to the miRNA precursor.

Comparison to other novel miRNA tools

Currently, miRDeep2 [32] and miRAnalyzer [33] are two prevailing tools for the prediction of novel miRNAs. In our annotation comparison study, default parameters were utilized except that the ‘-l’ was set to be 17 in the mapper.pl for miRDeep2 and default parameters were utilized in miRAnalyzer. In our prediction comparison study, new FASTQ files were generated from the unmapped read data of an original miRge run. Default parameters were utilized when running miRDeep2 and miRAnalyzer. Two metrics of novel miRNAs were used to compare three tools: PhyloP score and quality score. Basewise conservation scores across miRNAs were calculated from PhyloP data downloaded from http://hgdownload.cse.ucsc.edu/goldenPath/hg38/phyloP20way/ [34] using the PHAST package [35]. For each miRNA, the mean of PhyloP values across its length was calculated. The quality scores for each miRNA by each tool was defined by: 1 - (ranking percentile by the tool).

Hardware

All processing was performed on a workstation with 56 CPUs (dual Intel(R) Xeon(R) E5–2690 v4 at 2.60GHz) and 256GB DDR4-RAM. Novel miRNA modelling was performed using 32 CPUs. For speed testing, the number of CPUs in running original miRge, miRge 2.0, miRDeep2 and miRAnalyzer were 5, 5, 5 and 1, respectively. Due to a java incompatibility on the workstation, miRAnalyzer was run on a desktop with 4 CPUs (Intel(R) Core(TM) i7–6700 CPU at 3.40GHz) and 16GB DDR4-RAM.

Results

Improvements of miRge 2.0

The major improvements of miRge 2.0 consist of a novel miRNA detection method, improved alignment parameters, and the reporting of A-to-I changes in the sequence. These are described below, while smaller improvements are reported here. Utilizing updated search parameters, miRge 2.0 is able to annotate reads more precisely. In human data, using the miRBase v22 library, miRge 2.0 will align to 2817 miRNAs of which 149 are merged due to a similarity of their sequences. Although most miRNA alignment tools are agnostic to exact (canonical) or mismatched alignments (nontemplated isomiRs), miRge 2.0 sets a threshold (default value: 0.1; range 0–0.5) of the proportion of canonical reads to all reads for any given miRNAs. This can eliminate over reporting of miRNAs in which too high a percentage of sequences are nontemplated isomiRs, likely from other genomic loci or species contamination. miRge 2.0 also provides an optional GFF3 file report, which implements the miRTop guidelines for isomiR reporting utilizing CIGAR values. These can be used for isomiR-driven analyses. Additionally, the GFF3 data file is easily incorporated into other analysis pathway software including the bcbio-nextgen framework. miRge 2.0 also generates a .csv and .pdf file report of summary statistics; replacing a html report which was more difficult to process for tabular information. We also made several revisions to the search libraries. For the miRBase-based alignment search, we included additional SNP information in the miRNA library based on the updated miRNASNP database [36]. For the miRBase based searches, we have also included 161 5p or 3p miRNAs that are the complement of known miRBase miRNAs for which the passenger strand was detected recently [11]. Thus we have expanded our miRBase search library from 2656 miRNAs in our original method to 2817 miRNAs currently. We have also built a MirGeneDB 2.0-based alignment library that is corrected for SNP information for those investigators seeking this more specific set of miRNAs. We have also improved the tRNA alignments by adding “CCA” to the 3′-end of mature tRNA sequences and the precursor tRNA sequences at the 3′-end. For any alignment, we have added an optional spike-in RNA library search based on two popular sources of spike-in normalization [28, 29]. This search can easily be expanded to capture newer spike-in normalization methods as they appear. All options to call in miRge 2.0 are shown in Additional file 5: Table S3.

Speed and annotation comparison of original miRge and miRge 2.0

We performed tests of speed and annotation function of miRge 2.0 using six datasets. Both miRBase and MirGeneDB based libraries were analyzed although novel miRNA detection and A-to-I analysis were not performed. We found the processing time of miRge 2.0 was similar to the original miRge although bowtie searching libraries and searching parameters were adjusted (Table 2).
Table 2

Annotation comparison of the first version of miRge, miRge 2.0, miRDeep2 and miRAnalyzer

Tissue/CellSRA ReferencesAlignment ToolProcessing timemiRNA ReadsUnique miRNAsmiRNAs > 10 RPM
Human Adipose TissueSRR772563miRge - mb35 s2,041,433484240
miRge 2.0 - mb36 s2,039,835477238
miRge 2.0 - MDB35 s2,034,710390220
miRDeep29.3 min1,981,793598224
miRAnalyzer30 s1,752,855689243
Human Alpha CellSRR1028924miRge - mb14.6 min44,124,580920293
miRge 2.0 - mb15.6 min43,880,855911279
miRge 2.0 - MDB15.0 min43,752,598583261
miRDeep256.0 min42,326,135864267
miRAnalyzer18.4 min34,349,8161124281
Human Beta CellSRR873410miRge - mb6.5 min26,196,298896297
miRge 2.0 - mb6.6 min26,197,845889291
miRge 2.0 - MDB6.5 min26,130,904585274
miRDeep234.1 min23,280,604754273
miRAnalyzer8.0 min14,240,6691113289
Mouse Stomach TissueSRR3653378miRge - mb2.0 min7,063,128804457
miRge 2.0 - mb2.3 min7,175,534806420
miRge 2.0 - MDB2.2 min7,094,217578378
miRDeep218.5 min6,738,987748387
miRAnalyzer2.5 min6,818,2201086423
Mouse Epididymal Epithelial CellSRR2075702miRge - mb3.0 min1,394,193435364
miRge 2.0 - mb3.6 min1,387,591411290
miRge 2.0 - MDB3.4 min1,381,670360271
miRDeep224.4 min1,367,627402212
miRAnalyzer3.0 min925,019532270
Mouse B3 CellSRR2960463miRge - mb3.7 min9,515,760604322
miRge 2.0 - mb3.9 min9,612,571606282
miRge 2.0 - MDB3.9 min9,553,713359227
miRDeep232.6 min8,321,228487251
miRAnalyzer4.2 min6,856,264819289

Key: mb = miRBase; MDB = MirGeneDB. Starting read counts: SRR772563 = 2,373,604 reads; SRR1028924 = 82,497,527 reads; SRR873410 = 33,233,648 reads; SRR3653378 = 9,587,887 reads; SRR2075702 = 13,890,643 reads; SRR2960463 = 17,652,076 reads

Annotation comparison of the first version of miRge, miRge 2.0, miRDeep2 and miRAnalyzer Key: mb = miRBase; MDB = MirGeneDB. Starting read counts: SRR772563 = 2,373,604 reads; SRR1028924 = 82,497,527 reads; SRR873410 = 33,233,648 reads; SRR3653378 = 9,587,887 reads; SRR2075702 = 13,890,643 reads; SRR2960463 = 17,652,076 reads The number of detected miRNAs was slightly decreased as well. The alignment speed was essentially the same as miRAnalyzer and significantly faster than miRDeep2. The discovery of novel miRNAs is more time and memory intensive, as expected. For the dataset SRR553572 with 25.7 million reads, to identify novel miRNAs, the calculation time and maximum memory consumption were 17 mins and 6.7 GB RAM respectively. To evaluate the accuracy of A-to-I editing analysis, we performed A-to-I analysis using a pooled human brain sample (SRR095854) and compared the results to prior published data on this sample [37]. We identified 19 significant A-to-I modification sites compared to 16 reported in the reference paper. Comparing the two sets of results, the adjusted R2 of A-to-I proportion of these shared 16 sites was 0.96 and the slope of the linear regression was 0.99 indicating high reproducibility between our method and the established method (Fig. 3a). We then performed a new A-to-I editing analysis across colon tissue (Sequence Read Archive samples: SRR837842 and SRR837839), colon epithelial cells (SRR5127219), colon cancer (SRR1646473 and SRR1646493), and the colon cancer cell lines DKO1 (SRR1917324), DLD1 (SRR1917336) and DKS8 (SRR1917329). Significant miRNA editing sites with A-to-I percentage ≥ 1% in at least one sample were shown in Fig. 3b, with the data indicating differences between tumor and normal cells in ADAR activity [38].
Fig. 3

A-to-I analysis. a The A-to-I proportion of the sites is strongly correlated with a reference dataset analysis with adjusted R2 of 0.96 in the log-log plot. b The output of miRge 2.0 showing an illustrated heat map of miRNA A-to-I editing sites across colon tissue, primary colon cell, colon cancer tissue and colon cancer cells from multiple sources

A-to-I analysis. a The A-to-I proportion of the sites is strongly correlated with a reference dataset analysis with adjusted R2 of 0.96 in the log-log plot. b The output of miRge 2.0 showing an illustrated heat map of miRNA A-to-I editing sites across colon tissue, primary colon cell, colon cancer tissue and colon cancer cells from multiple sources

Validation of the predictive model

To determine the optimal number of features to use in the human and mouse predictive model, the MCC for the training and validation sets for the top 40 ranked features based on mRMR scores are shown in Fig. 4. For human data, when the number of features reached 21, the mean value of MCC of training and validation set approached the maximum and became stable. These top features are listed in Table 3.
Fig. 4

Model performance on top 40 features for training and validation sets for human a and mouse b miRNA discovery. Each dot stands for the mean value of Matthews correlation coefficient (MCC)

Table 3

Top 21 features in human predictive model. Hairpin structural features are labeled in italics, while read compositional features are not

RankFeature nameDescription of the feature
1 count_bindings_in_miRNA Number of bindings in the stable range of sequences
2exactMatchRatioThe proportion of reads that are an exact match to the cluster sequence in the cluster
3 pair_state_No Whether there is another stable range of sequences located at the other arm of precursor
4 mFE Minimum free energy (MFE) of the precursor
5head_minus3_TemplateNucleotide_percentageProportion of genomic templated nucleotide at position −3 relative to the 5′ end of the stable range of the cluster sequences
6 hairpin_count Number of hairpin loops in the precursor
7 stem_length Stem length of the precursor
8 distanceToloop Distance between the stable range of sequences and the terminal loop
9 percentage_PairedInMiRNA Number of bindings in the stable range of sequences divided by its length
10headUnstableLength5′ unstable length of the cluster
11 pair_state_Yes Whether there is another stable range of sequences located at the other arm of precursor
12tail_plus2_A_percentageProportion of non-templated adenine (A) at position + 2 relative to the 3′ end of the stable range of the cluster sequences
13head_minus2_TemplateNucleotide_percentageProportion of genomic templated nucleotide at position −2 relative to the 5′ end of the stable range of the cluster sequences
14 binding_count Number of bindings in the precursor hairpin
15tail_plus1_A_percentageProportion of non-templated adenine (A) at position + 1 relative to the 3′ end of the stable range of the cluster sequences
16 armType_loop Whether the stable range of sequences is located at the terminal loop if the precursor
17tail_plus3_A_percentageProportion of non-templated adenine (A) at position + 3 relative to the 3′ end of the stable range of the cluster sequences
18tail_plus5_TemplateNucleotide_percentageProportion of genomic templated nucleotide at position + 5 relative to the 3′ end of the stable range of the cluster sequences
19tail_plus1_TemplateNucleotide_percentageProportion of genomic templated nucleotide at position + 1 relative to the 3′ end of the stable range of the cluster sequences
20 interiorLoopCount Number of interior loops in the precursor
21head_minus1_TemplateNucleotide_percentageProportion of genomic templated nucleotide at position −1 relative to the 5′ end of the stable range of the cluster sequences
Model performance on top 40 features for training and validation sets for human a and mouse b miRNA discovery. Each dot stands for the mean value of Matthews correlation coefficient (MCC) Top 21 features in human predictive model. Hairpin structural features are labeled in italics, while read compositional features are not Among them, there are 11 precursor miRNA structural features and 10 compositional features. The ultimate model was constructed using these selected features. We used 32 human cell data sets to test the model. The positive and negative miRNAs were generated through the same process described above. The predictive result is shown in Table 4. The mean of MCC is 0.94, indicating that the performance of the model in the test set is good.
Table 4

Predictive results of 32 human cell data in a test set by the human model

Cell TypeSRA ReferencesAUCPrecisionRecallMCC
Fibroblast Aorta AdventitiaSRR51272060.9950.9830.9630.945
Smooth Muscle Cell AortaSRR51272170.9940.9810.9610.938
AstrocyteSRR51272140.9940.980.9680.949
Smooth Muscle Cell BladderSRR51272150.9920.9710.9630.936
Fibroblast Dermal (Adult)SRR51272050.9950.9830.9740.95
Fibroblast Dermal (Neonatal)SRR51272250.9950.9890.960.942
Epithelium Keratinocyte (Adult)SRR51272030.9940.9770.9620.934
Epithelium Keratinocyte (Neonatal)SRR51272080.9930.9750.9420.923
Endothelial AorticSRR51391210.9880.9750.9320.915
Endothelial Umbilical veinSRR51272130.9930.9810.9540.926
Epithelium BronchialSRR51272160.9880.9740.9510.935
ChondrocyteSRR51272290.9950.9850.9590.944
Endothelial MicrovascularSRR51272010.9910.9730.9570.945
Fibroblast CardiacSRR51272360.9920.9830.9450.94
MelanocyteSRR51272070.9950.990.9810.954
Epithelium MammarySRR51272240.990.9760.9410.927
Epithelium ProstateSRR51272120.9920.9750.9610.948
Epithelium Renal CortexSRR51272040.9880.9660.9480.927
Epithelium Renal ProximalSRR51272300.9920.9780.9490.936
Stromal cell ProstateSRR51272260.9910.9760.9630.94
Myoblast Skeletal MuscleSRR51272180.990.9740.9560.932
Epithelium IntestinalSRR51272230.9940.9850.9730.957
MyofibroblastSRR51272200.9910.9870.9650.95
Smooth Muscle Cell ProstateSRR51272220.9910.9780.9610.943
Neuron DopaminergicSRR51272340.9820.9630.9220.916
Neuron CorticalSRR51272090.9860.9680.9170.917
MesangialSRR51272210.9960.9860.9710.948
OsteoblastSRR51272330.9970.9860.9550.946
Fibroblast Periodontal ligamentSRR51272270.9940.9860.9620.946
Epithelium RenalSRR51272350.9920.9890.9540.936
Epithelium Retinal PigmentSRR51272100.9940.9880.9740.959
Skeletal Muscle CellSRR51272020.9950.9840.960.936
Mean0.9920.980.9560.939
Std dev0.0030.0070.0140.012
Predictive results of 32 human cell data in a test set by the human model Meanwhile, in the mouse predictive model, the optimal number of features are 12 which is shown in Additional file 6: Table S4. These 12 features are a subset of the 21 human features used. The performance of a mouse model towards 19 mouse cell datasets are shown in Additional file 7: Table S5 where the mean of MCC is 0.93, indicating that the mouse model performed well on the test dataset.

Comparison with other novel miRNA detection tools

Using miRge 2.0, we identified 302 RNA species that are putative novel miRNAs from 32 cell types [11]. Referring to these sequences as novel miRNAs, without further validation, may be incorrect terminology. However, without other terminology for these small “true miRNAs” or “miRNA-like RNA species,” we will refer to them as putative novel miRNAs. We then used the same unmapped reads generated from miRge 2.0 as input for miRDeep2 and miRAnalyzer. They predicted 1975 and 18,168 putative novel miRNAs respectively. After thresholding the data from those two tools to the same parameters as miRge 2.0 (≥10 total reads, ≥3 sequences, etc.), there were 312 and 391 putative novel miRNAs remaining. As shown in Fig. 5a, a Venn diagram depicts the overlap among miRge 2.0, miRDeep2 and miRAnalyzer, showing 129 novel miRNAs shared between the three methods. We then calculated the mean PhyloP scores as a measure of nucleotide conservation across primates for the novel miRNAs (Fig. 5b). More conservation was noted for the shared novel miRNAs (0.14) compared to miRAnalyzer (0.013) and miRDeep2 (− 0.036). Conservation was equivocal between the shared novel miRNAs and the miRge 2.0 novel miRNAs (0.15) As all three tools give a quality score to each novel prediction, we compared these values for miRNAs found shared vs. those unique to each method. As shown in Fig. 5c, the overlapped miRNAs ranked higher in quality for each method, further suggesting these 129 are the optimal putative novel miRNAs from the group. The full list of putative novel miRNAs generated by the three tools are available in Additional file 8: Table S6.
Fig. 5

Venn diagram for novel miRNAs predicted by miRge 2.0, miRDeep2, and miRAnalyzer. a Overlapped novel miRNAs among the three tools. b The average basewise conservation scores across novel miRNAs. c The average Quality score across novel miRNAs among the three tools

Venn diagram for novel miRNAs predicted by miRge 2.0, miRDeep2, and miRAnalyzer. a Overlapped novel miRNAs among the three tools. b The average basewise conservation scores across novel miRNAs. c The average Quality score across novel miRNAs among the three tools

Comparison between the human model and mouse model

We used our SVM model to create an optimal novel miRNA tool for both human and mouse. We also trained a model using combined human and mouse data. We questioned how well those tools could predict novel miRNAs in another mammalian species. We utilized these three (human-only, mouse-only and combined) models on the 12 rat miRNA samples shown in Additional file 4: Table S2. Using known miRNAs and known non-miRNAs, we found the average MCC for the rat samples to be essentially equivalent among the three models (0.95–0.96). The average MCC for testing the human model on the mouse data and mouse model on the human data are 0.93 and 0.89, respectively. Therefore, although same-species modeling might be ideal, the combined human and mouse SVM model is incorporated to be used for other mammalian species in miRge 2.0.

Discussion

In light of the positive and negative feedback we received for our original miRge tool, we generated an improved 2.0 version. miRge 2.0 has a more robust search, better overall output reporting, more run options, and new parameters for novel miRNA detection and A-to-I editing detection. It can be installed by bioconda and implemented within the bcbio-nextgen framework to better integrate with other software tools. It still remains one of the fastest options for alignment and can multiplex multiple samples in a single run. The new novel miRNA detection tool has reasonable requirements for RAM and can be used widely. Our data suggests the miRge 2.0 novel miRNA detection tool is more robust than the earlier tools miRDeep2 and miRAnalyzer. As much recent literature suggests, the novel miRNA detection tools have been too open in their parameters, allowing an explosion of novel miRNA reporting, that is likely inaccurate. For novel miRNA detection in established species, less is likely more. We believe that our more strict requirements and unique use of compositional features has improved miRNA discovery and is a better approach going forward. We caution though, that these are putative novel miRNAs and should not be thought of as bona fide miRNAs unless they meet additional parameters [13]. We are also wondering if a novel detection tool built for one mammalian species could be used to detect putative novel miRNAs in other species. Our human and mouse models assayed with the rat data indicates, that, indeed, at least among mammalia, our tool is robust. We have also tried to make miRge 2.0 more robust to current concerns of the community. Many authors have argued that miRBase—the online repository for miRNAs —is riddled with false positive miRNAs [39-42]. Therefore, we have built a MirGeneDB 2.0-based alignment library, incorporating SNPs, for six species to cater to those investigators seeking a better-defined set of miRNAs. We have reported concerns with using reads per million miRNA reads (RPM) as a normalization tool [43]. Therefore, we have added an optional spike-in RNA library search step for spike-in normalization. Spike-in for miRNA RNA-seq is still in its infancy, so this step can easily be expanded/modified to account for newer spike-in normalization methods. Currently, the sequence libraries of human, mouse, rat, nematode, fruitfly and zebrafish datasets are provided, but miRge 2.0 can be used by individual users to investigate any species by constructing the sequence libraries to incorporate in the miRge 2.0 workflow using our miRge_bowtie_build.py tool. In our original miRge tool, we accepted that reads could randomly align to highly similar miRNAs, e.g. miR-192-5p and miR-215-5p; thus we reported those together as miR-215-5p/192-5p reads. The cross-mapping of sequencing reads can create false alignments that may be interpreted as sequence or expression alterations which can occur in other alignment tools, as other tools have generally not hand-curated their alignment libraries. Our improvements in miRge 2.0 optimize the number of miRNAs that are clustered together to reduce these random alignment challenges. With the interest in ADAR activity and A-to-I changes in RNAs, we have added a feature to miRge 2.0 to capture this information. miRge 2.0 performs robustly in identifying these ADAR sites, comparable to other stand-alone programs.

Conclusion

In summary, miRge 2.0 is an update of our original miRNA alignment tool that more comprehensively and more robustly analyzes miRNA sequencing data. We believe the improvements in miRge 2.0 will be useful to a wide range of scientists who are interested in interpreting small RNA-seq data for miRNA expression patterns.

Availability and requirements

Project name: miRge 2.0. Project home page: https://github.com/mhalushka/miRge Operating System: Linux. Programming Language: Python. Other Requirements: Bowtie (v1.1.1), RNAfold (v2.3.5), SAMtools (v1.5), cutadapt (v1.11), biopython (v1.68), sklearn (v0.18.1), numPy (v1.11.0), SciPy (v0.17.0), pandas (v0.21.0), reportlab (v3.3.0) and forgi (v0.20). License: GNU GPL 3.0. Any restrictions to use by non-academics: none. Extended materials and methods on the development of miRge 2.0. (PDF 132 kb) Table S1. Features calculated for building the novel miRNA detection model. (PDF 46 kb) Figure S1. Distribution of non-templated nucleotides as a percentage of all 4 nucleotides at both 5′ and 3′ positions relative to the mature miRNA or equivalent non-miRNA sequence (PDF 876 kb) Table S2. 12 rat samples for evaluating the human and mouse novel miRNA predictive models. (PDF 19 kb) Table S3. All options to call in miRge 2.0. (PDF 55 kb) Table S4. Top 12 features in the mouse novel miRNA predictive model. (PDF 7 kb) Table S5. Predictive results for 19 mouse RNA-seq datasets for novel miRNA detection. (PDF 20 kb) Table S6. Novel miRNAs detected by miRge 2.0, miRDeep2 and miRAnalyzer. (XLSX 167 kb)
  40 in total

Review 1.  microRNA functions.

Authors:  Natascha Bushati; Stephen M Cohen
Journal:  Annu Rev Cell Dev Biol       Date:  2007       Impact factor: 13.827

2.  Analysis of 13 cell types reveals evidence for the expression of numerous novel primate- and tissue-specific microRNAs.

Authors:  Eric Londin; Phillipe Loher; Aristeidis G Telonis; Kevin Quann; Peter Clark; Yi Jing; Eleftheria Hatzimichael; Yohei Kirino; Shozo Honda; Michelle Lally; Bharat Ramratnam; Clay E S Comstock; Karen E Knudsen; Leonard Gomella; George L Spaeth; Lisa Hark; L Jay Katz; Agnieszka Witkiewicz; Abdolmohamad Rostami; Sergio A Jimenez; Michael A Hollingsworth; Jen Jen Yeh; Chad A Shaw; Steven E McKenzie; Paul Bray; Peter T Nelson; Simona Zupo; Katrien Van Roosbroeck; Michael J Keating; George A Calin; Charles Yeo; Masaya Jimbo; Joseph Cozzitorto; Jonathan R Brody; Kathleen Delgrosso; John S Mattick; Paolo Fortina; Isidore Rigoutsos
Journal:  Proc Natl Acad Sci U S A       Date:  2015-02-23       Impact factor: 11.205

Review 3.  IsomiRs--the overlooked repertoire in the dynamic microRNAome.

Authors:  Corine T Neilsen; Gregory J Goodall; Cameron P Bracken
Journal:  Trends Genet       Date:  2012-08-08       Impact factor: 11.639

Review 4.  Functions and regulation of RNA editing by ADAR deaminases.

Authors:  Kazuko Nishikura
Journal:  Annu Rev Biochem       Date:  2010       Impact factor: 23.643

5.  Reconstitution of an Argonaute-dependent small RNA biogenesis pathway reveals a handover mechanism involving the RNA exosome and the exonuclease QIP.

Authors:  Zhihong Xue; Haiyan Yuan; Jinhu Guo; Yi Liu
Journal:  Mol Cell       Date:  2012-04-17       Impact factor: 17.970

6.  A comprehensive survey of 3' animal miRNA modification events and a possible role for 3' adenylation in modulating miRNA targeting effectiveness.

Authors:  A Maxwell Burroughs; Yoshinari Ando; Michiel J L de Hoon; Yasuhiro Tomaru; Takahiro Nishibu; Ryo Ukekawa; Taku Funakoshi; Tsutomu Kurokawa; Harukazu Suzuki; Yoshihide Hayashizaki; Carsten O Daub
Journal:  Genome Res       Date:  2010-08-18       Impact factor: 9.043

7.  Systematic identification of edited microRNAs in the human brain.

Authors:  Shahar Alon; Eyal Mor; Francois Vigneault; George M Church; Franco Locatelli; Federica Galeano; Angela Gallo; Noam Shomron; Eli Eisenberg
Journal:  Genome Res       Date:  2012-04-12       Impact factor: 9.043

8.  Systematic Curation of miRBase Annotation Using Integrated Small RNA High-Throughput Sequencing Data for C. elegans and Drosophila.

Authors:  Xiangfeng Wang; X Shirley Liu
Journal:  Front Genet       Date:  2011-05-26       Impact factor: 4.599

9.  ViennaRNA Package 2.0.

Authors:  Ronny Lorenz; Stephan H Bernhart; Christian Höner Zu Siederdissen; Hakim Tafer; Christoph Flamm; Peter F Stadler; Ivo L Hofacker
Journal:  Algorithms Mol Biol       Date:  2011-11-24       Impact factor: 1.405

10.  miRBase: annotating high confidence microRNAs using deep sequencing data.

Authors:  Ana Kozomara; Sam Griffiths-Jones
Journal:  Nucleic Acids Res       Date:  2013-11-25       Impact factor: 16.971

View more
  25 in total

1.  The RNA landscape of the human placenta in health and disease.

Authors:  Gordon C S Smith; D Stephen Charnock-Jones; Sungsam Gong; Francesca Gaccioli; Justyna Dopierala; Ulla Sovio; Emma Cook; Pieter-Jan Volders; Lennart Martens; Paul D W Kirk; Sylvia Richardson
Journal:  Nat Commun       Date:  2021-05-11       Impact factor: 14.919

2.  In Silico Analysis of Micro-RNA Sequencing Data.

Authors:  Ernesto Aparicio-Puerta; Bastian Fromm; Michael Hackenberg; Marc K Halushka
Journal:  Methods Mol Biol       Date:  2021

3.  Genome-Wide Analysis of c-MYC-Regulated mRNAs and miRNAs and c-MYC DNA-Binding by Next-Generation Sequencing.

Authors:  Rene Jackstadt; Markus Kaller; Antje Menssen; Heiko Hermeking
Journal:  Methods Mol Biol       Date:  2021

4.  Unification of miRNA and isomiR research: the mirGFF3 format and the mirtop API.

Authors:  Thomas Desvignes; Phillipe Loher; Karen Eilbeck; Jeffery Ma; Gianvito Urgese; Bastian Fromm; Jason Sydes; Ernesto Aparicio-Puerta; Victor Barrera; Roderic Espín; Florian Thibord; Xavier Bofill-De Ros; Eric Londin; Aristeidis G Telonis; Elisa Ficarra; Marc R Friedländer; John H Postlethwait; Isidore Rigoutsos; Michael Hackenberg; Ioannis S Vlachos; Marc K Halushka; Lorena Pantano
Journal:  Bioinformatics       Date:  2020-02-01       Impact factor: 6.937

5.  Global MicroRNA Profiling of Vascular Endothelial Cells.

Authors:  Eloi Schmauch; Anna-Liisa Levonen; Suvi Linna-Kuosmanen
Journal:  Methods Mol Biol       Date:  2022

Review 6.  microRNA-based diagnostic and therapeutic applications in cancer medicine.

Authors:  Lorenzo F Sempere; Asfar S Azmi; Anna Moore
Journal:  Wiley Interdiscip Rev RNA       Date:  2021-05-17       Impact factor: 9.957

7.  Introduction to Bioinformatics Resources for Post-transcriptional Regulation of Gene Expression.

Authors:  Eliana Destefanis; Erik Dassi
Journal:  Methods Mol Biol       Date:  2022

8.  Mycobacterium tuberculosis Infection Drives Mitochondria-Biased Dysregulation of Host Transfer RNA-Derived Fragments.

Authors:  Monika M Looney; Yin Lu; Petros C Karakousis; Marc K Halushka
Journal:  J Infect Dis       Date:  2021-05-28       Impact factor: 5.226

9.  miRge3.0: a comprehensive microRNA and tRF sequencing analysis pipeline.

Authors:  Arun H Patil; Marc K Halushka
Journal:  NAR Genom Bioinform       Date:  2021-07-21

10.  mirnaQC: a webserver for comparative quality control of miRNA-seq data.

Authors:  Ernesto Aparicio-Puerta; Cristina Gómez-Martín; Stavros Giannoukakos; José María Medina; Juan Antonio Marchal; Michael Hackenberg
Journal:  Nucleic Acids Res       Date:  2020-07-02       Impact factor: 16.971

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.