Nuno D Mendes, Steffen Heyne, Ana T Freitas, Marie-France Sagot, Rolf Backofen.
Abstract
MOTIVATION: The computational search for novel microRNA (miRNA) precursors often involves structural analysis aimed at identifying which types of structures are prone to being recognized and processed by the cellular miRNA-maturation machinery. A natural way to tackle this problem is to cluster the candidate structures together with known miRNA precursor structures. Mixed clusters then allow the identification of candidates that are similar to known precursors. Given the large number of pre-miRNA candidates that can be identified in single-genome approaches, even after applying several filters for precursor robustness and stability, a conventional structural clustering approach is unfeasible.
Year: 2012 PMID: 23052038 PMCID: PMC3516144 DOI: 10.1093/bioinformatics/bts574
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Example of a vectorial representation. (a) The characteristics of a single position are determined: the nucleotide, and whether the previous, current and following positions in the secondary structure are unpaired (0), left/right paired (1/2), or located in the terminal loop (3). (b) Portions of the final vector illustrating the counts. Each vector position refers to a particular nucleotide type and the neighboring pairing status, from (A, 0, 0, 0) to (G, 3, 3, 3). (c) Portions of the normalized vector obtained from (b); each position is divided by a constant such that the sum of all components is 1.
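The encoding described in the caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the nucleotide ordering `ACUG`, the dot-bracket input format, the restriction to a single hairpin, and the treatment of positions beyond the sequence ends as unpaired are all assumptions, and the function names are hypothetical.

```python
from itertools import product

NUCLEOTIDES = "ACUG"   # assumed order; the caption only fixes A first and G last
STATES = (0, 1, 2, 3)  # 0 unpaired, 1 left-paired, 2 right-paired, 3 terminal loop


def pairing_states(structure):
    """Map a dot-bracket string for a single hairpin to per-position states."""
    last_open = structure.rfind("(")
    first_close = structure.find(")")
    states = []
    for i, ch in enumerate(structure):
        if ch == "(":
            states.append(1)
        elif ch == ")":
            states.append(2)
        elif last_open != -1 and last_open < i < first_close:
            states.append(3)  # unpaired and enclosed by the innermost pair
        else:
            states.append(0)
    return states


def feature_vector(sequence, structure):
    """Count (nucleotide, previous, current, next) tuples and normalize to sum 1."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    keys = product(NUCLEOTIDES, STATES, STATES, STATES)
    index = {key: n for n, key in enumerate(keys)}  # 4 * 4**3 = 256 positions
    counts = [0] * len(index)
    # Assumption: neighbors beyond either end of the sequence count as unpaired.
    padded = [0] + pairing_states(structure) + [0]
    for i, nt in enumerate(sequence.upper().replace("T", "U")):
        key = (nt, padded[i], padded[i + 1], padded[i + 2])
        if key in index:  # skip ambiguous symbols such as N
            counts[index[key]] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


vector = feature_vector("GGGAAACCC", "(((...)))")
```

Because every vector has the same fixed length regardless of precursor size, candidates can be compared with ordinary vector distances instead of pairwise structural alignment.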
Evaluation of vectorial representations
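| k | Correctly assigned (%) (A. gambiae) | P-value (A. gambiae) | Average cluster size (A. gambiae) | Correctly assigned (%) (D. melanogaster) | P-value (D. melanogaster) | Average cluster size (D. melanogaster) |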
|---|---|---|---|---|---|---|
| 0.00 | 83.60 | 8.58 | 3.05 | 82.10 | 3.45 | 3.29 |
| 0.10 | 82.50 | 1.69 | 3.33 | 81.24 | 1.92 | 3.54 |
| 0.20 | 81.21 | 8.68 | 3.70 | 80.01 | 4.31 | 3.89 |
| 0.30 | 79.30 | 2.37 | 4.27 | 78.24 | 3.31 | 4.44 |
| 0.40 | 77.09 | 9.22 | 5.31 | 76.08 | 1.64 | 5.44 |
| 0.50 | 74.12 | 2.24 | 7.61 | 72.80 | 1.43 | 7.54 |
| 0.60 | 71.23 | 2.76 | 12.09 | 69.45 | 1.01 | 11.44 |
| 0.70 | 68.70 | 1.05 | 17.24 | 67.41 | 1.74 | 15.32 |
| 0.80 | 68.14 | 6.01 | 19.52 | 66.07 | 9.62 | 17.77 |
| 0.90 | 67.37 | 2.57 | 21.08 | 65.72 | 4.01 | 20.09 |
Note: For each k-level, the table shows the percentage of correct assignments in the datasets of A. gambiae and D. melanogaster, the P-value of Welch's two-sample t-test comparing the observed correct assignments with those from a randomized version of each dataset (obtained by shuffling the correspondence between candidates and their vectorial representations), and the average number of cluster members.
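For reference, the statistic behind Welch's two-sample t-test can be sketched as below. This excerpt does not say what counts as one observation in each sample, so the function takes two plain lists of numbers; the function name is hypothetical and the input values are made up purely to exercise the formula.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    va = variance(sample_a) / na  # squared standard error of each sample mean
    vb = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Made-up numbers, not data from the study.
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

Unlike Student's t-test, this form does not assume equal variances in the two samples, which suits a comparison between an observed dataset and a shuffled one. Turning `t_stat` and `dof` into a P-value requires the t distribution, which a statistics library would supply.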
Fig. 2. ROC curves for the minimum distance (MinDist) to pre-miRNAs method and the performance of TripletSVM over 4000 samples equally divided into four groups. Each group uses 5%, 10%, 20% or 50% of the known precursors of (a) A. gambiae and (b) D. melanogaster to set up the positive examples of the training set. The positive examples of the testing set are made up of the remaining precursors. In both sets, the negative examples are samples of the set of candidates. ROC curves for each individual sample are shown as dashed lines and the average curve across the range of cutoff values is shown as a solid line. The red dot represents the average performance of the MinDist method over all samples considering the optimal cutoff for each sample. The green dots represent the performance of TripletSVM on each sample, whereas the green diamond refers to its average performance.
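A minimum-distance classifier and its ROC curve, as described in the caption, might look like the sketch below. The caption does not name the distance, so Euclidean distance is an assumption here, as are the function names and the toy two-dimensional points; a candidate is called positive when its distance to the nearest known precursor is at or below the cutoff.

```python
from math import dist


def min_dist_scores(candidates, known):
    """Distance from each candidate vector to its nearest known precursor vector."""
    return [min(dist(c, k) for k in known) for c in candidates]


def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by sweeping the cutoff over all observed scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s <= cutoff and y)
        fp = sum(1 for s, y in zip(scores, labels) if s <= cutoff and not y)
        points.append((fp / negatives, tp / positives))
    return points


# Toy vectors for illustration only; labels mark true precursors with 1.
known = [(0.0, 0.0)]
candidates = [(0.0, 1.0), (3.0, 4.0), (1.0, 0.0), (6.0, 8.0)]
labels = [1, 0, 1, 0]
curve = roc_points(min_dist_scores(candidates, known), labels)
```

Each distinct cutoff yields one point on the curve, so choosing the optimal cutoff per sample amounts to picking one of these points.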
Sensitivity (TPR), Specificity (1 − FPR) and the F1 measure of TripletSVM and MinDist computed as the average performance across all samples for training sets whose positive examples consist of a fraction of known pre-miRNAs in Anopheles gambiae and Drosophila melanogaster
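| % known | A. gambiae, MinDist: Sensitivity | A. gambiae, MinDist: Specificity | A. gambiae, MinDist: F1 | A. gambiae, TripletSVM: Sensitivity | A. gambiae, TripletSVM: Specificity | A. gambiae, TripletSVM: F1 | D. melanogaster, MinDist: Sensitivity | D. melanogaster, MinDist: Specificity | D. melanogaster, MinDist: F1 | D. melanogaster, TripletSVM: Sensitivity | D. melanogaster, TripletSVM: Specificity | D. melanogaster, TripletSVM: F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|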
| 5 | 0.72 | 0.68 | 0.71 | 0.63 | 0.57 | 0.61 | 0.70 | 0.71 | 0.70 | 0.69 | 0.74 | 0.71 |
| 10 | 0.72 | 0.68 | 0.71 | 0.64 | 0.73 | 0.67 | 0.70 | 0.72 | 0.71 | 0.70 | 0.80 | 0.74 |
| 20 | 0.71 | 0.71 | 0.71 | 0.67 | 0.69 | 0.68 | 0.70 | 0.73 | 0.71 | 0.71 | 0.83 | 0.76 |
| 50 | 0.71 | 0.74 | 0.72 | 0.66 | 0.76 | 0.69 | 0.72 | 0.75 | 0.73 | 0.70 | 0.86 | 0.76 |
| 80 | 0.70 | 0.80 | 0.74 | 0.64 | 0.78 | 0.69 | 0.75 | 0.75 | 0.75 | 0.69 | 0.86 | 0.75 |
| 90 | 0.74 | 0.82 | 0.77 | 0.63 | 0.78 | 0.68 | 0.77 | 0.78 | 0.77 | 0.68 | 0.87 | 0.75 |
| 95 | 0.83 | 0.79 | 0.81 | 0.64 | 0.78 | 0.69 | 0.79 | 0.81 | 0.80 | 0.68 | 0.87 | 0.75 |
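The relation between the three measures in the table can be made explicit with a short sketch. F1 depends on class balance as well as on sensitivity and specificity, so the ratio of negative to positive test examples is a parameter here; its default of 1 (balanced classes) is an assumption, and the function name is hypothetical.

```python
def f1_from_rates(sensitivity, specificity, neg_per_pos=1.0):
    """F1 = 2TP / (2TP + FP + FN), expressed per positive example."""
    tp = sensitivity                         # true positives per positive
    fn = 1.0 - sensitivity                   # false negatives per positive
    fp = (1.0 - specificity) * neg_per_pos   # false positives per positive
    return 2 * tp / (2 * tp + fp + fn)


value = round(f1_from_rates(0.72, 0.68), 2)
```

With balanced classes, a sensitivity of 0.72 and a specificity of 0.68 give 0.71, the value that follows those two numbers in the 5% row. This is only a consistency check: the tabulated figures are averages over samples, and the excerpt does not state how the authors computed F1.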
Fig. 3. Genomic clusters of pre-miRNAs. Shown are the secondary structures of both stem-loops, the consensus structure along with the SCI (structure conservation index) and the MPI (mean pairwise identity), the LocARNA alignment and a representation of their genomic loci.