Jiandong Ding, Shuigeng Zhou, Jihong Guan.
Abstract
BACKGROUND: MicroRNAs (miRNAs) are ~22 nt long integral elements responsible for post-transcriptional control of gene expression. After the identification of thousands of miRNAs, the challenge is now to explore their specific biological functions. To this end, it would be very helpful to organize these miRNAs according to their homologous relationships. Given an established miRNA family system (e.g. the miRBase family organization), this paper addresses the problem of automatically and accurately classifying newly found miRNAs into their corresponding families by supervised learning techniques. Concretely, we propose an effective method, miRFam, which uses only primary information of pre-miRNAs or mature miRNAs and a multiclass SVM to automatically classify miRNA genes.
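To make the idea of sequence-only features concrete, the following is a minimal sketch of n-gram frequency extraction for an RNA sequence. The function name `ngram_features`, the per-order frequency normalization, and the feature ordering (trigrams, then bigrams, then unigrams) are assumptions made for illustration; the concentration-factor weighting and the multiclass SVM used by miRFam are not reproduced here.

```python
from itertools import product

ALPHABET = "ACGU"

def ngram_features(seq, orders=(3, 2, 1)):
    """Return normalized n-gram frequencies for an RNA sequence.

    Hypothetical helper: each n-gram order is normalized separately so
    that the frequencies of one order sum to 1 (an assumption).
    """
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    features = []
    for n in orders:
        grams = ["".join(p) for p in product(ALPHABET, repeat=n)]
        counts = dict.fromkeys(grams, 0)
        total = 0
        for i in range(len(seq) - n + 1):
            gram = seq[i:i + n]
            if gram in counts:  # skip windows with ambiguous bases such as N
                counts[gram] += 1
                total += 1
        features.extend(counts[g] / total if total else 0.0 for g in grams)
    return features
```

With the default orders the vector has 64 + 16 + 4 = 84 entries, which could then be passed to any multiclass classifier.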
Year: 2011 PMID: 21619662 PMCID: PMC3120706 DOI: 10.1186/1471-2105-12-216
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The explosion of miRNA genes (Sep. 2004 - Apr. 2010). The number of miRNAs registered in miRBase has increased rapidly in recent years. Version 16 of miRBase was released on 10 September 2010, almost at the same time as this manuscript was finalized; that release is not shown here. Similar information is also presented in [10].
Figure 2. The experimental pipeline. To show the discriminative power of miRFam, we designed a series of experiments including single family tests, multi-family tests, and tests on large-scale miRBase families. All experiments were carried out by 5-fold cross validation. Details of the datasets and features are also shown.
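The 5-fold cross validation mentioned in the caption can be sketched as follows. This is a generic illustration: the function name `k_fold_indices`, the fixed random seed, and the plain (non-stratified) shuffling are assumptions, since the source does not say how the folds were drawn.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded shuffle for repeatability
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test
```

Each sample appears in exactly one test fold, so averaging the per-fold accuracy gives a single cross-validated estimate.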
Results of single family experiments
| Dataset | Experiment | SE(%) | SP(%) | Acc(%) |
|---|---|---|---|---|
| R* | let-7+R1 | 99.50 | 99.52 | 99.51 |
| R* | mir-17+R2 | 100.0 | 100.0 | 100.0 |
| R* | mir-9+R3 | 98.58 | 98.46 | 98.52 |
| S | let-7+S | 99.02 | 99.69 | 99.42 |
| S | mir-17+S | 99.33 | 99.69 | 99.57 |
| S | mir-9+S | 100.0 | 99.38 | 99.56 |
* Only trigram and bigram features are considered in these experiments.
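For reference, SE, SP and Acc can be computed from confusion-matrix counts as sketched below. The sketch assumes the standard definitions of sensitivity, specificity and accuracy, which the source does not spell out; the function name `binary_metrics` and the counts in the usage note are hypothetical and are not taken from the paper.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return (SE, SP, Acc) in percent from confusion-matrix counts."""
    se = 100.0 * tp / (tp + fn)    # sensitivity: positives correctly recovered
    sp = 100.0 * tn / (tn + fp)    # specificity: negatives correctly rejected
    acc = 100.0 * (tp + tn) / (tp + fn + tn + fp)  # overall accuracy
    return se, sp, acc
```

For example, with made-up counts of 90 true positives, 10 false negatives, 80 true negatives and 20 false positives, the call `binary_metrics(90, 10, 80, 20)` gives a sensitivity of 90%, a specificity of 80% and an accuracy of 85%.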
Results of different combinations of n-gram types
| Group | Acc(trigram) | Acc(tri-&bigram) | Acc(tri-, bi-&unigram), without concentration factor | Acc(tri-, bi-&unigram), with concentration factor |
|---|---|---|---|---|
| T20 | 90.67 | 96.21 | 68.90 | 96.76 |
| G1 | 93.61 | 98.40 | 87.63 | 98.86 |
| G2 | 87.62 | 99.01 | 87.74 | 99.01 |
| Total | 85.08 | 93.48 | 63.75 | 93.62 |
Total: combination of T20, G1 and G2. All results are percentages.
Figure 3. Comparison of center vectors among three miRNA families (let-7, mir-17 and mir-9) and the dataset S (snoRNAs). All horizontal axes show the n-grams, ordered from trigrams to bigrams to unigrams. (A) Family centers before weighting; (B) Family centers after weighting; (C) The variances of n-gram feature values among the four families before weighting; (D) The variances of n-gram feature values among the four families after weighting.
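The quantities plotted in Figure 3 can be illustrated with a small sketch. It assumes that a family center is the arithmetic mean of the members' feature vectors and that the variance is the population variance taken per feature across the family centers; neither detail is confirmed by the caption, and the function names are hypothetical.

```python
from statistics import mean, pvariance

def center_vector(vectors):
    """Mean feature vector of one family (assumed definition of a center)."""
    return [mean(column) for column in zip(*vectors)]

def variance_across_centers(centers):
    """Per-feature population variance across the family centers."""
    return [pvariance(column) for column in zip(*centers)]
```

A feature whose variance across centers grows after weighting separates the families more strongly, which is the effect panels (C) and (D) compare.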
Figure 4. Classification performance vs. the size of the training dataset. We used T20 to show the impact of training dataset size. At the beginning, only 10% of the 2198 sequences in T20 were used as training samples, while the remaining 90% were used to test miRFam. At each round, we increased the training set by one partition (10%) and reduced the testing set accordingly by one partition (10%). This process continued until half of T20 was used for training and the other half for testing. The result of normal 5-fold cross validation is also shown.
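The growing-training-set protocol described in this caption can be sketched as follows. The function name `growing_train_splits` is hypothetical, and the way samples are assigned to the ten partitions (here by simple striding, without shuffling or stratification by family) is an assumption, since the caption does not specify it.

```python
def growing_train_splits(indices, n_parts=10, max_train_parts=5):
    """Yield (train, test) splits with 1, 2, ..., max_train_parts partitions
    used for training and all remaining partitions used for testing."""
    parts = [indices[i::n_parts] for i in range(n_parts)]  # ~10% each
    for r in range(1, max_train_parts + 1):
        train = [j for part in parts[:r] for j in part]
        test = [j for part in parts[r:] for j in part]
        yield train, test
```

Applied to the 2198 indices of T20, this gives five rounds that go from roughly 10% training data up to half of the dataset.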
Results on mature miRNAs
| Group | Families | Members* | Acc(tri-, bi-&unigram, %) |
|---|---|---|---|
| T20 | 20 | 1529 | 96.80 |
| G1 | 10 | 351 | 97.71 |
| G2 | 10 | 162 | 99.38 |
| Total | 40 | 2042 | 95.03 |
* The numbers of mature sequences in the multi-family datasets are smaller than the numbers of hairpins for two reasons. First, different pre-miRNAs may generate similar mature miRNAs. Second, some pre-miRNAs contain several mature miRNAs, but only one is considered.
Performance of large-scale miRBase families test
| | miRBase14 (a) | miRBase15 (a) | miRBase15 (b) |
|---|---|---|---|
| Family number | 334 | 398 | 1056 |
| MiRNA number | 7797 | 9379 | 11115 |
| Accuracy (%) (c) | 89.21 | 88.91 | 85.09 |
| Accuracy (%) (d) | 98.18 | 97.97 | 90.66 |
(a) Families in miRBase with no fewer than 5 members.
(b) All families in miRBase 15 are used.
(c) Results with uni-, bi- and trigram features.
(d) Results with uni-, bi-, tri- and tetragram features.
Notations of datasets
| Group | Notation | Description |
|---|---|---|
| Single family | R | reverse sequences of the biggest three miRNA families |
| Single family | S | combination of SNORA26 and SNORA33 from Rfam10.0 |
| Multi families | T20 | 20 families with the largest members, ANM |
| Multi families | G1 | 10 families selected from miRBase14, ANM |
| Multi families | G2 | 10 families selected from miRBase14, ANM |
R1 - let-7; R2 - mir-17; R3 - mir-9.
ANM -- Average Number of Members.