Michael Hackenberg, Martin Sturm, David Langenberger, Juan Manuel Falcón-Pérez, Ana M Aransay.
Abstract
Next-generation sequencing now allows the sequencing of small RNA molecules and the estimation of their expression levels. Consequently, there will be a high demand for bioinformatics tools to cope with the several gigabytes of sequence data generated in each deep-sequencing experiment. Against this background, we developed miRanalyzer, a web server tool for the analysis of deep-sequencing experiments for small RNAs. The web server tool requires a simple input file containing a list of unique reads and their copy numbers (expression levels). Using these data, miRanalyzer (i) detects all known microRNA sequences annotated in miRBase, (ii) finds all perfect matches against other libraries of transcribed sequences and (iii) predicts new microRNAs. The prediction of new microRNAs is especially important because many species have very few known microRNAs. Therefore, we implemented a highly accurate machine learning algorithm for the prediction of new microRNAs that reaches AUC values of 97.9% and recall values of up to 75% on unseen data. The web tool summarizes all the described steps in a single output page, which provides a comprehensive overview of the analysis and links to more detailed output pages for each analysis module. miRanalyzer is available at http://web.bioinformatics.cicbiogune.es/microRNA/.
Year: 2009 PMID: 19433510 PMCID: PMC2703919 DOI: 10.1093/nar/gkp347
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
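The abstract describes the input as a list of unique reads with their copy numbers. As a minimal sketch of how such a list could be derived from raw reads, the Python below collapses identical sequences and counts them. The function names and the tab-separated output layout are assumptions made for illustration; the exact file format expected by the web server is not specified here.

```python
from collections import Counter


def collapse_reads(reads):
    """Collapse raw reads into (sequence, copy_number) pairs, most abundant first."""
    # Normalise case and skip empty entries before counting.
    counts = Counter(r.strip().upper() for r in reads if r.strip())
    # Sort by descending copy number, then alphabetically for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def format_read_count_lines(pairs):
    """Render pairs as 'sequence<TAB>count' lines (assumed layout)."""
    return [f"{seq}\t{count}" for seq, count in pairs]


raw_reads = [
    "TGAGGTAGTAGGTTGTATAGTT",
    "tgaggtagtaggttgtatagtt",
    "ACGT",
    "",
    "TGAGGTAGTAGGTTGTATAGTT",
]
pairs = collapse_reads(raw_reads)
lines = format_read_count_lines(pairs)
```

Collapsing before upload keeps the file small, since each distinct sequence appears once regardless of how often it was sequenced.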
Summary of miRanalyzer input options
| Input option | Description |
|---|---|
| Species | The species from which the input reads have been obtained |
| Number of mismatches | For the detection of known microRNAs the user can allow matches with up to two mismatches |
| Target gene method | Selection of the microRNA target gene prediction method for the ontological analysis. |
| Posterior probability threshold | The threshold for the posterior probability calculated by the classification model. |
| Considering adapter sequences | The read sequences frequently contain adapter sequences at their 3′ end. In this case, the user can take this into account by also aligning sub-sequences of a given minimum length (Data and methods section). |
| Detect just new microRNAs | This option skips the detection of known microRNAs. |
| Remove all mRNA matches | This option removes all reads that align perfectly with mRNA sequences. If this option is not set, the program removes all reads that match more than five mRNAs, as we observed that these reads are frequently poly-A-like sequences. |
| Remove RFam/RepBase | These options remove all reads that map to RFam or RepBase. |
| Just predict conserved microRNAs | This option limits the prediction of new microRNAs to regions which overlap with a Phylogenetically Conserved Element (PhastCons). |
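Two of the options above amount to simple rules, which can be sketched as follows. The first function mirrors the mRNA filter as described in the table (remove every perfect mRNA match when the option is set, otherwise remove only reads matching more than five mRNAs). The second illustrates the adapter option under the assumption that the sub-sequences to align are 5′-anchored prefixes of the read, since the adapter sits at the 3′ end; the actual procedure is given in the paper's Data and methods section and may differ. All names are hypothetical.

```python
def keep_read(mrna_hits, remove_all_mrna_matches, max_mrna_hits=5):
    """Return True if a read survives the mRNA filter.

    mrna_hits: number of mRNAs the read aligns to perfectly.
    """
    if remove_all_mrna_matches:
        # Option set: any perfect mRNA match removes the read.
        return mrna_hits == 0
    # Option not set: only reads matching more than five mRNAs are removed.
    return mrna_hits <= max_mrna_hits


def three_prime_trimmed_prefixes(read, min_length):
    """Yield the read and its 5'-anchored prefixes, longest first.

    Assumption: trimming from the 3' end approximates adapter removal.
    """
    for end in range(len(read), min_length - 1, -1):
        yield read[:end]


# Hypothetical reads with their number of perfect mRNA matches.
mrna_hit_counts = {"readA": 0, "readB": 3, "readC": 6}
kept_default = [r for r, hits in mrna_hit_counts.items() if keep_read(hits, False)]
kept_strict = [r for r, hits in mrna_hit_counts.items() if keep_read(hits, True)]
candidates = list(three_prime_trimmed_prefixes("ACGTAC", 4))
```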
The true positive rates (top part) and false positive rates (bottom part) for different classifiers at a posterior probability threshold of 0.9
The superscripted ‘CV’ denotes that this value was achieved in a standard 10-fold cross-validation approach. The highlighted false positive rates correspond to the true positive rates discussed in the text.
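For readers unfamiliar with the quantities in this table, the sketch below shows how a true positive rate and a false positive rate can be computed once candidates are classified at a posterior probability threshold of 0.9. The example scores and labels are invented for illustration and are not data from the paper; treating a posterior equal to the threshold as a positive call is also an assumption.

```python
def rates_at_threshold(posteriors, labels, threshold=0.9):
    """Return (true positive rate, false positive rate) at the given threshold.

    posteriors: classifier posterior probabilities, one per candidate.
    labels: True for real microRNAs, False otherwise.
    """
    tp = fp = fn = tn = 0
    for p, is_mirna in zip(posteriors, labels):
        predicted = p >= threshold  # assumption: ties count as positive
        if predicted and is_mirna:
            tp += 1
        elif predicted and not is_mirna:
            fp += 1
        elif not predicted and is_mirna:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return tpr, fpr


# Invented example: three known microRNAs and three negatives.
example_posteriors = [0.95, 0.92, 0.40, 0.91, 0.10, 0.85]
example_labels = [True, True, True, False, False, False]
tpr, fpr = rates_at_threshold(example_posteriors, example_labels)
```

Raising the threshold generally lowers both rates, which is the trade-off the posterior probability threshold option exposes to the user.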
Figure 1. Histogram of miRanalyzer scores. Known microRNAs are colored in red, all other data are colored in blue. The inset is a close-up for candidates with scores better than 0.65.
Figure 2. The summary page of miRanalyzer. Five boxes are shown, corresponding to (i) summary and state of the process, (ii) analysis of known microRNAs, (iii) matches against transcribed sequences, (iv) detection of new microRNAs and (v) summary of unmatched sequences.
Features calculated for the generation of the classifier
| Feature name | Description of the feature |
|---|---|
| Read count | Number of reads mapping to the pre-microRNA |
| Length | The length of the longest hairpin structure |
| Stem length | The length of the longest hairpin structure stem |
| MFE | The minimum free energy of the hairpin |
| Loop length | The number of bases in the loop of the hairpin |
| Loop GC | The GC-content of the loop |
| GC | The GC-content of the small hairpin |
| Asymmetric bulges | The number of asymmetric bulges and mismatches in the stem |
| Symmetric bulges | The number of symmetric bulges and mismatches in the stem |
| Bulges | The number of bulges in the stem |
| Longest bulge | The number of non-pairing nucleotides of the longest bulge |
| Mismatches pre-microRNA | The number of single mismatches in the hairpin |
| Mismatches microRNAs | The number of single mismatches in the mature microRNA region of the hairpin |
| Stability | The smallest hairpin harbouring the read is extended 10 times by 10 bp at both ends. The stability is the frequency with which the original structure is found in the elongated structures |
| Alternating stability | Reports whether a structure disappears when extending the sequence, but reappears again. |
| Triplet-SVM features | All features that were proposed by Xue et al. |
| Bindings | The number of bindings in the stem divided by the hairpin length |
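Several of the simpler features in this table can be read directly from a sequence and its predicted secondary structure. The sketch below assumes the structure is given in dot-bracket notation for a single hairpin (as produced by common RNA folding tools) and interprets "bindings" as base pairs; both are assumptions, and the function names and example hairpin are hypothetical rather than taken from miRanalyzer.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def hairpin_features(seq, structure):
    """Compute a few table features for a single hairpin in dot-bracket notation."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    last_open = structure.rfind("(")
    first_close = structure.find(")")
    # In a single hairpin, every '(' precedes every ')'.
    if last_open == -1 or first_close == -1 or last_open > first_close:
        raise ValueError("structure is not a single hairpin")
    # The terminal loop lies between the innermost base pair.
    loop = seq[last_open + 1:first_close]
    base_pairs = structure.count("(")
    return {
        "length": len(seq),
        "gc": gc_content(seq),
        "loop_length": len(loop),
        "loop_gc": gc_content(loop),
        # Bindings: base pairs in the stem divided by the hairpin length.
        "bindings": base_pairs / len(seq),
    }


# Toy hairpin: a 4 bp stem closing a 4 nt loop.
features = hairpin_features("GGCGAAUACGCC", "((((....))))")
```

Features such as bulge counts or stability would additionally require parsing unpaired positions within the stem and refolding extended sequences, which this sketch does not attempt.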