| Literature DB >> 23874353 |
Sachin Pundhir1, Jan Gorodkin.
Abstract
In silico generated search for microRNAs (miRNAs) has been driven by methods compiling structural features of the miRNA precursor hairpin, as well as to some degree combining this with the analysis of RNA-seq profiles for which the miRNA typically leave the drosha/dicer fingerprint of 1-2 ~22 nt blocks of reads corresponding to the mature and star miRNA. In complement to the previous methods, we present a study where we systematically exploit these patterns of read profiles. We created two datasets comprised of 2540 and 4795 read profiles obtained after preprocessing short RNA-seq data from miRBase and ENCODE, respectively. Out of 4795 ENCODE read profiles, 1361 are annotated as non-coding RNAs (ncRNAs) and of which 285 are further annotated as miRNAs. Using deepBlockAlign (dba), we align ncRNA read profiles from ENCODE against the miRBase read profiles (cleaned for "self-matches") and are able to separate ENCODE miRNAs from the other ncRNAs by a Matthews Correlation Coefficient (MCC) of 0.8 and obtain an area under the curve of 0.93. Based on the dba score cut-off of 0.7 at which we observed the maximum MCC of 0.8, we predict 523 novel miRNA candidates. An additional RNA secondary structure analysis reveal that 42 of the candidates overlap with predicted conserved secondary structure. Further analysis reveal that the 523 miRNA candidates are located in genomic regions with MAF block (UCSC) fragmentation and poor sequence conservation, which in part might explain why they have been overlooked in previous efforts. We further analyzed known human and mouse miRNA read profiles and found two distinct classes; the first containing two blocks and the second containing >2 blocks of reads. Also the latter class holds read profiles that have less well defined arrangement of reads in comparison to the former class. On comparison of miRNA read profiles from plants and animals, we observed kingdom specific read profiles that are distinct in terms of both length and distribution of reads within the read profiles to each other. All the data, as well as a server to search miRBase read profiles by uploading a BED file, is available at http://rth.dk/resources/mirdba.Entities:
Keywords: RNA-seq; alignment; deepBlockAlign; miRNA read profiles; microRNA; read profiles
Year: 2013 PMID: 23874353 PMCID: PMC3708161 DOI: 10.3389/fgene.2013.00133
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Major approaches for the computational prediction of micro-RNA.
| Evolutionary conservation and stem loop structure | miRseeker and miRscan | Lai et al., |
| Neighbor stem loop search | - | Ohler et al., |
| Sequence based homology and stem loop structure | microHARVESTER, MiRAlign | Wang et al., |
| Phylogenetic shadowing | - | Berezikov et al., |
| Minimum free energy index | - | Zhang et al., |
| Machine learning methods | ProMiR, mirCoS-a, MiPred | Nam et al., |
| RNA-seq based | miRanalyzer, miRDeep2, miRDeep* | Hackenberg et al., |
Figure 2Preprocessing of ENCODE and miRBase datasets. (A) In ENCODE dataset, reads mapped to human genome from each of the 18 RNA-seq experiments were subjected to preprocessing to obtain closely spaced set of reads termed here as “block group”. Block groups thus obtained were compiled so as to identify a set of distinct genomic loci that have a block group in at least one experiment. Next, for each locus, we retrieved one block group corresponding to the experiment in which the block group has maximum number of blocks leaving us with 58,161 block groups. (B) In miRBase dataset, reads mapped to microRNAs from each of the 20 organisms were subjected to preprocessing and block groups thus obtained were compiled as miRNA read profile database (miRRPdb). (C) Given a set of mapped reads in BED format, we derive closely spaced stack of reads termed here as “block group” using blockbuster (Langenberger et al., 2009). Each block group or read profile is composed of one or more blocks of reads.
Figure 1A characteristic read profile (block group) for microRNA in the human genome. It is dominated by two distinct clusters of reads (blocks) with almost similar start and/or end positions. These read profiles are in many cases influenced by secondary structures of the parent transcript and may convey information about the processing mechanism of the transcript like dicer cleavage in this case.
The ENCODE dataset is comprised of short-reads from 18 RNA-seq experiments performed on nine human tissues with each having two replicates.
| Bl | 49,280,641 | 16,437 | 56,439,584 | 19,609 |
| Br | 48,773,897 | 15,148 | 48,394,385 | 13,317 |
| Bt | 26,713,326 | 13,309 | 40,144,816 | 13,555 |
| Cx | 41,301,918 | 15,890 | 40,798,294 | 15,948 |
| Ep | 47,775,551 | 13,522 | 44,163,861 | 10,923 |
| Es | 35,965,377 | 12,692 | 33,651,242 | 14,697 |
| Li | 31,930,869 | 6158 | 33,939,724 | 10,684 |
| Lu | 38,877,787 | 14,511 | 43,732,746 | 15,873 |
| Sn | 37,649,014 | 6370 | 41,022,882 | 11,268 |
The dataset was downloaded from ENCODE (ENCODE Consortium, 2011) and is composed of a uniform read length of 36 nt.
Tissues with each having two biological replicates (Bl, Blood; Br, Brain; Bt, Breast; Cx, Cervix; Ep, Epithelium; Es, Embryonic stem cell; Li, Liver; Lu, Lung; Sn, Skin).
Number of mapped reads.
Total number of block groups retrieved after preprocessing.
miRBase dataset is comprised of short-reads mapped to 4862 distinct microRNAs from 20 organisms.
| Ame | 475,288 (1) | 159 | 109 | 96 |
| Ath | 472,5021 (8) | 275 | 303 | 173 |
| Bfl | 37,217 (1) | 113 | 71 | 34 |
| Bmo | 2,021,309 (3) | 384 | 264 | 194 |
| Cbr | 17,442 (1) | 115 | 81 | 25 |
| Cel | 1,048,509 (6) | 184 | 130 | 93 |
| Cqu | 379,978 (1) | 68 | 61 | 29 |
| Cre | 1082 (1) | 28 | 19 | 11 |
| Crm | 9988 (1) | 95 | 63 | 19 |
| Cte | 50,659 (1) | 118 | 72 | 8 |
| Dme | 35,664,132 (50) | 237 | 48 | 45 |
| Hsa | 81,138,802 (79) | 1279 | 801 | 550 |
| Mmu | 913,716,590 (82) | 749 | 688 | 624 |
| Nve | 2711 (1) | 34 | 17 | 2 |
| Osa | 1,506,288 (4) | 440 | 540 | 288 |
| Ppc | 11,176 (1) | 113 | 66 | 5 |
| Ppt | 503,573 (3) | 4224 | 285 | 148 |
| Rco | 147 (25) | 13 | 2 | 0 |
| Spu | 6458 (1) | 38 | 25 | 7 |
| Tca | 4,861,929 (2) | 196 | 193 | 189 |
| Total | 1,046,178,299 | 4862 | 3838 | 2540 |
Dataset was downloaded from miRBase (Kozomara and Griffiths-Jones, 2011).
Ame, A. mellifera; Ath, A. thaliana; Bfl, B. floridae; Bmo, B. mori; Cbr, C. briggsae; Cel, C. elegans; Cqu, C. quinquefasciatus; Cre, C. reinhardtii; Crm, C. remanei; Cte, C. teleta; Dme, D. melanogaster; Hsa, H. sapiens; Mmu, M. musculus; Nve, N. vectensis; Osa, O. sativa; Ppc, P. pacificus; Ppt, P. patens; Rco, R. communis; Spu, S. purpuratus; Tca, T. castaneum.
Number of mapped reads. Total number of GEO experiments are given in brackets. Some experiments are comprised of reads from multiple organisms.
Total number of distinct miRNAs with mapped reads.
Total number of block groups or miRNAs retrieved after preprocessing.
Block groups with >1 block and ≤200 nt in length that are compiled to form a database of miRNA read profiles (miRRPdb).
Annotation status of 58,161 block groups obtained after preprocessing of ENCODE dataset and their alignment to miRRPdb.
| miRNA | 571 | 285 | 223 (78) |
| snoRNA | 468 | 255 | 3 (1) |
| tRNA | 625 | 496 | 7 (1) |
| snRNA | 395 | 143 | 6 (4) |
| scRNA | 187 | 46 | 3 (7) |
| others | 277 | 136 | 8 (6) |
| unannotated | 55,638 | 3434 | 523 (15) |
| Total | 58,161 | 4795 | 773 (16) |
Block groups obtained after preprocessing and are overlapping to non-coding RNA annotation.
Block groups with >1 block and ≤200 nt in length.
Block groups that have significant alignment score (≥0.7) to miRNA read profile database (miRRPdb).
Figure 3Search strategy. (A) To predict putative miRNA candidates, we align 4795 read profiles from ENCODE dataset against miRNA read profile database (miRRPdb) that comprise of 2540 block groups using deepBlockAlign. We retrieve top scoring alignment for each query as putative miRNA, if the score is above the threshold of 0.7. (B) To predict distinct classes of miRNA based on their read profiles, all 2540 read profiles from miRRPdb were aligned against each other using deepBlockAlign to generate an alignment score matrix. Next, we perform hierarchical clustering of block groups based on their alignment score using pvclust. pvclust computes the p-value for each cluster in hierarchical clustering using multiscale bootstrap resampling and indicates how strong the cluster is supported by the data. We select all the clusters with p < 0.05 that represent families of microRNAs that share similar read profiles.
Figure 4Performance evaluation of the proposed method for the prediction of putative miRNAs using alignment of read profiles. (A) Density distribution of deepBlockAlign alignment scores for 285 miRNA (orange) and 1076 other ncRNA (black) read profiles from ENCODE dataset against miRNA read profile database (miRRPdb), respectively. We observed two distinct score distributions comprised of most miRNAs (223 out of 285) that have an alignment score of ≥0.7 and most of the other ncRNAs (1049 out of 1076) with score <0.7 against miRRPdb. (B) ROC curve analysis of the prediction performance. A high AUC of 0.93 was observed suggesting that miRNA read profiles have characteristic features that are distinct from read profiles of other ncRNAs and can be employed for confident miRNA prediction.
Figure 5A putative miRNA predicted by alignment of read profiles. (A) Alignment between an unannotated and miRNA read profile (block group) computed using deepBlockAlign. The alignment is obtained while aligning unannotated read profiles from ENCODE dataset against the miRRPdb. The unannotated region has a similar read profile to that of miRNA characterized by similar arrangement of reads as evident from block alignment and similar size and distance between the read blocks, leading to a high alignment score of 0.82. (B) The read profile for the unannotated region [chr17:20841720–20841781(+)] in UCSC's bigWig format. (C) Consensus RNA secondary structure computed using Multiz alignment with high conservation (Blanchette et al., 2004) from six vertebrate genomes corresponding to the unannotated region. The structure has been predicted using PETfold (Seemann et al., 2011) and has a characteristic hair-pin loop structure of miRNAs. (D) Multiz alignment (Blanchette et al., 2004) across six vertebrate genomes for the predicted consensus RNA secondary structure.
Figure 6The length and average pairwise identity for Multiz alignments corresponding to known and putative miRNAs. (A) In comparison to known miRNAs, most putative miRNAs have Multiz alignment [46-way downloaded from UCSC (Karolchik et al., 2004)] of length <50 nt and rest have a significantly low average pairwise identity (p-value = 2.9e-19, Kolmogorov–Smirnov test). (B) For the in house 13-way Multiz alignments (Anthon et al., in preparation), the same pattern was observed with putative miRNAs either have Multiz alignments that are short (<50 nt) or have a significantly lower average pairwise identity in comparison to known miRNAs (p-value = 1.2e-28, Kolmogorov–Smirnov test). This suggests that absence of well-defined Multiz alignments may contribute to many of these putative miRNAs not being identified through methods based on a set of aligned RNA sequences for ncRNA prediction (Gorodkin et al., 2010).
Figure 7Clusters of read profiles obtained after hierarchical clustering of 550 and 624 read profiles from Human and Mouse, respectively. (A) Two distinct and well-separated tree nodes were observed for both plant and animals (red and blue). For better representation, the count of read profiles within each cluster are shown on top of the corresponding branch. In comparison to read profiles in first node, read profiles in second node were characterized by a significantly high number of read blocks (p-value = 4.3e-115, Fisher's exact test) and entropy (p-value = 0.005, Fisher's exact test, blue shade bars). An earlier study has shown similar finding in Ciona intestinalis where half of the miRNA loci encode upto four distinct, stable small RNAs (Shi et al., 2009). (B) Three example read profiles with more than two read blocks, (i) beside expression in miR–miR*, loop region also showed expression even higher in comparison to miR*. (ii and iii) many reads are observed from region partially overlapping to miR–miR*, a pattern similar to those of miRNA-offset RNAs (moRs) albeit different in not being completely adjacent to the miR–miR* (Shi et al., 2009).
Figure 8Distinct clusters of read profiles observed for animals and plants. (A) Four clusters of read profiles (black box) are mostly observed in H. sapiens and M. musculus (animal specific). In contrast, three clusters (gray box) are mostly observed in O. sativa and A. thaliana (plant specific). (B) While almost all of the animal specific read profiles are of 60–90 nucleotides in length, plant specific read profiles were either short (<60 nt) or long (≥90 nt) in length. (C) Plant specific read profiles were also observed to have a significantly higher entropy in comparison to those from animals (p-value < 0.05, Kolmogorov–Smirnov test). This observation may be attributed to the different biogenesis mechanism of miRNAs in plants and animals (Lelandais-Briere et al., 2010).