Ali M Yazbeck, Kifah R Tout, Peter F Stadler, Jana Hertel.
Abstract
The miRBase currently reports more than 25,000 microRNAs in several hundred genomes that belong to more than 1000 families of homologous sequences. Quantitative investigations of miRNA gene evolution require the construction of data sets that are consistent in their coverage and include those genomes that are of interest in a given study. Given the size and structure of the data, this can be achieved only with the help of a fully automatic pipeline that improves the available seed alignments, extends the set of available sequences by homology search, and reliably identifies true positive homology search results. Here we describe the current progress towards such a system, emphasizing the task of improving and completing the initial seed alignment.
Keywords: Alignments; Homology Search; ascertainment biases; miRBase
Year: 2017 PMID: 28637930 PMCID: PMC6042801 DOI: 10.1515/jib-2016-0013
Source DB: PubMed Journal: J Integr Bioinform ISSN: 1613-4516
Figure 1: Distribution of mature microRNA sequence entries for miRBase (v. 21) families. The majority of families report a miR and miR* for all (red) or at least some (blue) of the family members. Families in which only a single mature product is reported for every member are shown in yellow.
Figure 2: Comparison of S(𝒜) for original and processed alignments. Each data point represents one miRNA family. Data are stratified into six groups of miRBase families depending on the number of members in the initial alignment (1: 2-10 pre-miRNAs, 139 families; 2: 11-20 pre-miRNAs, 92 families; 3: 21-40 pre-miRNAs, 57 families; 4: 41-100 pre-miRNAs, 52 families; 5: 101-200 pre-miRNAs, 15 families; 6: >200 pre-miRNAs, 4 families). The comparison indicates that the improvements in alignment quality depend much more strongly on the entropy of the input alignment than on the number of sequences in the miRNA family.
Figure 3: Distribution of miRNA homologs of miRNA families mir-3 (left) and mir-723 (right). Numbers at the nodes denote potential observations of the number of paralogs. The black triangle marks the last common ancestor (LCA) in the phylogenetic tree. '3R' denotes the third round of whole genome duplication events that is assumed to have happened during the evolution of vertebrates.
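
The caption of Figure 2 relates alignment quality to the entropy of the input alignment, but this record does not reproduce the definition of S(𝒜). As an illustration only, the following Python sketch computes one common entropy-style summary of an alignment: the mean Shannon entropy over its columns. The function names `column_entropy` and `mean_column_entropy`, and the choice to count the gap character as an ordinary symbol, are assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column.

    Assumption: gap characters such as '-' are counted like any other symbol.
    """
    counts = Counter(column)
    total = len(column)
    # p * log2(1/p) is the standard Shannon term; it is 0.0 for a conserved column.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def mean_column_entropy(alignment):
    """Mean per-column Shannon entropy of an alignment given as equal-length strings.

    Illustrative stand-in for an entropy-based alignment score; it is not
    claimed to be the S(A) used in the paper.
    """
    if not alignment or len(alignment[0]) == 0:
        raise ValueError("alignment must contain at least one non-empty row")
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all rows of an alignment must have the same length")
    # zip(*alignment) transposes rows into columns.
    return sum(column_entropy(col) for col in zip(*alignment)) / length


if __name__ == "__main__":
    toy_alignment = ["ACGU", "ACGU", "AC-U"]  # hypothetical toy data
    print(mean_column_entropy(toy_alignment))
```

Under this measure a perfectly conserved alignment scores 0, and the value grows as columns become more variable, so lower values correspond to more consistent alignments.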