Dong Wang, Ming Lu, Jing Miao, Tingting Li, Edwin Wang, Qinghua Cui.
Abstract
Identifying the tissues in which a microRNA is expressed could enhance the understanding of the functions, the biological processes, and the diseases associated with that microRNA. However, the mechanisms of microRNA biogenesis and expression remain largely unclear, and knowledge of the tissues in which a given microRNA is expressed is limited. Here, we present a machine-learning-based approach to predict whether an intronic microRNA shows high co-expression with its host gene. By doing so, we can infer the tissues in which a microRNA is highly expressed from the expression profile of its host gene. Our approach achieves an accuracy of 79% in leave-one-out cross-validation and 95% on an independent testing dataset. We further evaluated our method by comparing the predicted tissue-specific microRNAs with the tissue-specific microRNAs identified by biological experiments. This study presents a valuable tool for predicting the co-expression patterns between human intronic microRNAs and their host genes, which should also help in understanding microRNA expression and regulation mechanisms. Finally, this framework can be easily extended to other species.
Year: 2009 PMID: 19204784 PMCID: PMC2635472 DOI: 10.1371/journal.pone.0004421
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. The framework works as follows: a support vector machine (SVM) classifier is trained on the feature vectors extracted from the training samples.
The performance of this classifier is then evaluated on the testing samples. The classifier is further used to identify the human intronic miRNAs that are highly co-expressed with their host genes. Finally, the expression profiles of the identified human intronic miRNAs are predicted by inferring them from those of their host genes.
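The last step of the framework, borrowing the host gene's profile for a miRNA predicted as highly co-expressed, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy expression values, and the rule for calling a tissue "high" (at least `fold` times the median across tissues) are all assumptions made for the example.

```python
from statistics import median
from typing import Dict, List, Optional


def infer_mirna_profile(
    host_profile: Dict[str, float], high_coexpression: bool
) -> Optional[Dict[str, float]]:
    """Return the host gene's profile as the miRNA's predicted profile,
    but only when the classifier predicted high co-expression ('+')."""
    return dict(host_profile) if high_coexpression else None


def high_expression_tissues(profile: Dict[str, float], fold: float = 3.0) -> List[str]:
    """List tissues whose expression is at least `fold` times the median.
    The cutoff rule is a hypothetical choice for illustration."""
    cutoff = fold * median(profile.values())
    return sorted(tissue for tissue, value in profile.items() if value >= cutoff)


# Hypothetical host gene expression values (arbitrary units).
host_profile = {"brain": 900.0, "heart": 40.0, "liver": 35.0, "lung": 50.0, "kidney": 45.0}

predicted = infer_mirna_profile(host_profile, high_coexpression=True)
if predicted is not None:
    print(high_expression_tissues(predicted))
```

A miRNA predicted as "−" gets no inferred profile, which mirrors the idea that the host gene is informative only when co-expression is expected to be high.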
Table 1. The feature vector of samples used in training the SVM.
| Index | Description |
| 1 | Distance from the transcription start position of the host gene to the start point of the host intron |
| 2 | Distance from the transcription start position of the host gene to the start point of the microRNA |
| 3 | Length of the host intron |
| 4 | Length of the microRNA/length of the host gene |
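The four features can be derived from genomic coordinates alone. The sketch below is one plausible reading of the feature table, under stated assumptions: coordinates are 0-based half-open intervals, the slash in feature 4 is read as a ratio, and minus-strand genes are handled by measuring from the gene's end coordinate. The `Interval` type, the function name, and the example coordinates are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Interval:
    start: int  # 0-based, inclusive
    end: int    # exclusive

    @property
    def length(self) -> int:
        return self.end - self.start


def feature_vector(
    gene: Interval, intron: Interval, mirna: Interval, strand: str = "+"
) -> Tuple[int, int, int, float]:
    """Compute the four features from the coordinates of the host gene,
    the host intron, and the miRNA (all on the same chromosome)."""
    if strand == "+":
        tss = gene.start                   # transcription starts at the left end
        dist_intron = intron.start - tss   # feature 1
        dist_mirna = mirna.start - tss     # feature 2
    else:
        tss = gene.end                     # minus strand: transcription starts at the right end
        dist_intron = tss - intron.end
        dist_mirna = tss - mirna.end
    return (dist_intron, dist_mirna, intron.length, mirna.length / gene.length)


# Hypothetical coordinates, not taken from any real annotation.
gene = Interval(1_000, 51_000)
intron = Interval(3_000, 8_000)
mirna = Interval(5_000, 5_080)
print(feature_vector(gene, intron, mirna))
```

In practice the features span several orders of magnitude, so scaling them before training an SVM is a common precaution; whether the authors did so is not stated here.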
Table 2. The training dataset and the leave-one-out validation results on Baskerville et al.'s data [14].
| microRNAs | Host gene | Training dataset | | | | | | |
| | | 1 | 2 | 3 | 4 | co-expressions | ER | PR |
| hsa-mir-9-1 | C1orf61 | 9933 | 0.0035 | 182.716 | 143.205 | 0.999 | + | + |
| hsa-mir-139 | PDE2A | 33502 | 0.00068 | 580.925 | 486.716 | 0.99 | + | + |
| hsa-mir-1-1 | C20orf166 | 11323 | 0.00345 | 55.0571 | 45.9 | 0.968 | + | + |
| hsa-mir-95 | ABLIM2 | 24504 | 0.00041 | 499.888 | 228.05 | 0.96 | + | + |
| hsa-mir-338 | AATK | 1412 | 0.00398 | 138.167 | 125 | 0.921 | + | + |
| hsa-mir-126 | EGFL7 | 619 | 0.00862 | 91.3452 | 88.1071 | 0.888 | + | + |
| hsa-mir-25 | MCM7 | 771 | 0.0092 | 9.39759 | 7.48193 | 0.838 | + | + |
| hsa-mir-204 | TRPM3 | 43567 | 0.00128 | 239.523 | 3.77982 | 0.796 | + | + |
| hsa-mir-190 | TLN2 | 12893 | 0.00043 | 2100.1 | 2060.25 | 0.663 | + | + |
| hsa-mir-153-1 | PTPRN | 3451 | 0.0045 | 50.4157 | 21.3371 | 0.626 | + | + |
| hsa-mir-98 | HUWE1 | 2461 | 0.00097 | 198.585 | 187.542 | 0.503 | − | − |
| hsa-mir-153-2 | PTPRN2 | 5119 | 0.00008 | 410.163 | 377.105 | 0.499 | − | − |
| hsa-mir-26b | CTDSP1 | 629 | 0.01229 | 38.0526 | 34.8816 | 0.453 | − | − |
| hsa-mir-30c-1 | NFYC | 4874 | 0.0011 | 746.761 | 700.875 | 0.406 | − | − |
| hsa-let-7f-2 | HUWE1 | 2461 | 0.00068 | 297.585 | 269.878 | 0.379 | − | − |
| hsa-mir-99a | C21orf34 | 69554 | 0.00019 | 4308.89 | 4288.26 | 0.335 | − | − |
| hsa-mir-125b-2 | C21orf34 | 69554 | 0.00021 | 4498.4 | 3898.42 | 0.297 | − | − |
| hsa-mir-103-2 | PANK2 | 1620 | 0.00226 | 359.468 | 353.701 | 0.27 | − | − |
| hsa-mir-101-2 | RCL1 | 10574 | 0.00114 | 736.718 | 727.141 | 0.226 | − | − |
| hsa-let-7c | C21orf34 | 69554 | 0.0002 | 4162.05 | 4133.27 | 0.067 | − | − |
| hsa-mir-199a-1 | DNM2 | 7602 | 0.00061 | 1419.26 | 1347.13 | 0.02 | − | − |
| hsa-mir-32 | C9orf5 | 12165 | 0.00066 | 450.652 | 333.087 | −0.22 | − | − |
| hsa-mir-26a-1 | CTDSPL | 3524 | 0.00062 | 1410.88 | 1390.86 | −0.285 | − | − |
| hsa-mir-128a | R3HDM1 | 13938 | 0.00042 | 1652.9 | 1603.47 | 0.856 | + | − |
| hsa-mir-103-1 | PANK3 | 2235 | 0.00321 | 68.4675 | 45.8831 | 0.638 | + | − |
| hsa-mir-15b | SMC4 | 7415 | 0.00282 | 41.5464 | 40.6804 | 0.509 | − | + |
| hsa-mir-16-2 | SMC4 | 7415 | 0.00233 | 52.3375 | 49.325 | 0.504 | − | + |
| hsa-mir-128b | ARPP-21 | 48416 | 0.00055 | 1230.36 | 1224.2 | 0.444 | − | + |
| hsa-mir-335 | MEST | 1633 | 0.00655 | 43.172 | 36.8387 | 0.442 | − | + |
In each round of the leave-one-out validation, one sample is held out as the testing sample and the remaining 28 samples form the training dataset.
The four features (columns “1”, “2”, “3”, and “4”) of each miRNA are calculated based on the genomic coordinates of the miRNA, the miRNA hosting intron, and the host gene.
ER represents the experimental results and PR represents the prediction results. The symbol “+” means high co-expression and the symbol “−” means low co-expression.
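The leave-one-out procedure described in the notes above can be sketched in a few lines. To keep the example dependency-free, a nearest-centroid classifier is used as a stand-in for the SVM; it is not the classifier used in the study. The toy points and all function names are hypothetical, and "-" is written in ASCII in place of "−".

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]
Centroids = Dict[str, Point]


def fit_centroids(xs: Sequence[Point], ys: Sequence[str]) -> Centroids:
    """Stand-in 'training': the mean feature vector of each class."""
    groups: Dict[str, List[Point]] = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {
        label: tuple(sum(col) / len(pts) for col in zip(*pts))
        for label, pts in groups.items()
    }


def predict_centroid(model: Centroids, x: Point) -> str:
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    return min(model, key=lambda label: sum((a - b) ** 2 for a, b in zip(x, model[label])))


def leave_one_out(xs: List[Point], ys: List[str]) -> List[str]:
    """Hold out each sample once, train on the rest, and predict the held-out one."""
    predictions = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]   # with 29 samples this leaves 28 for training
        train_y = ys[:i] + ys[i + 1:]
        model = fit_centroids(train_x, train_y)
        predictions.append(predict_centroid(model, xs[i]))
    return predictions


# Toy, well-separated data for illustration only.
xs = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
ys = ["-", "-", "-", "+", "+", "+"]
print(leave_one_out(xs, ys))
```

Each prediction comes from a model that never saw the held-out sample, which is what makes the PR column an out-of-sample result.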
Table 3. Classification performance of the SVM classifier on two testing datasets.
| Testing datasets | Type | Size | Accuracy (%) | Sensitivity (%) | Specificity (%) |
| Baskerville et al.'s co-expression data | + | 12 | 79 | 83 | 71 |
| | − | 17 | | 76 | 87 |
| The combined data | + | 5 | 95 | 100 | 83 |
| | − | 16 | | 94 | 100 |
The combined data is the consistent part of two pairings: Baskerville et al.'s miRNA profile data with Su et al.'s mRNA profile data, and Barad et al.'s miRNA profile data with Su et al.'s mRNA profile data.
The symbol “+” means high co-expression and the symbol “−” means low co-expression.
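The performance figures for Baskerville et al.'s data can be recomputed from the (ER, PR) labels in the leave-one-out table. The column definitions are not given here, so the mapping below is an inference from the numbers: the "Sensitivity" values match per-class recall, and the "Specificity" values match per-class precision (correct calls divided by all calls of that class). The function name is hypothetical, and "-" is written in ASCII in place of "−".

```python
from typing import Dict, List


def per_class_report(er: List[str], pr: List[str]) -> Dict[str, int]:
    """Accuracy plus per-class recall and precision, as rounded percentages.
    Assumes every class occurs at least once in both ER and PR."""
    if len(er) != len(pr):
        raise ValueError("ER and PR must have the same length")
    correct = sum(e == p for e, p in zip(er, pr))
    report = {"accuracy": round(100 * correct / len(er))}
    for cls in ("+", "-"):
        hits = sum(e == cls and p == cls for e, p in zip(er, pr))
        actual = sum(e == cls for e in er)   # samples that truly belong to cls
        called = sum(p == cls for p in pr)   # samples predicted as cls
        report[f"recall{cls}"] = round(100 * hits / actual)
        report[f"precision{cls}"] = round(100 * hits / called)
    return report


# (ER, PR) pair counts read off the leave-one-out table: 29 samples in total.
pairs = [("+", "+")] * 10 + [("-", "-")] * 13 + [("+", "-")] * 2 + [("-", "+")] * 4
er = [e for e, _ in pairs]
pr = [p for _, p in pairs]
print(per_class_report(er, pr))
# {'accuracy': 79, 'recall+': 83, 'precision+': 71, 'recall-': 76, 'precision-': 87}
```

The same function applied to the 21 samples of the combined testing data (5 "+" all predicted "+", 15 of 16 "−" predicted "−") gives 95, 100, 83, 94, and 100, in line with the second block of the table.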
Table 4. The prediction results on the combined testing data.
| miRNAs | Host gene | Pearson's correlation coefficients | Calculated results | Predicted results |
| hsa-mir-28 | LPP | 0.2403 | − | − |
| hsa-mir-140 | WWP2 | −0.279 | − | − |
| hsa-mir-149 | GPC1 | 0.0334 | − | + |
| hsa-mir-23b | C9orf3 | 0.2401 | − | − |
| hsa-mir-194-1 | IARS2 | −0.4155 | − | − |
| hsa-let-7g | TMEM113 | 0.6539 | + | + |
| hsa-mir-152 | COPZ2 | −0.1915 | − | − |
| hsa-mir-93 | MCM7 | 0.7318 | + | + |
| hsa-mir-107 | PANK1 | −0.254 | − | − |
| hsa-mir-30e | NFYC | 0.1416 | − | − |
| hsa-mir-208 | MYH7 | 0.0741 | − | − |
| hsa-mir-218-1 | SLIT2 | 0.8282 | + | + |
| hsa-mir-106b | MCM7 | 0.7093 | + | + |
| hsa-mir-105-1 | GABRA3 | 0.1756 | − | − |
| hsa-mir-24-1 | C9orf3 | 0.0961 | − | − |
| hsa-mir-215 | IARS2 | −0.6153 | − | − |
| hsa-mir-214 | DNM3 | −0.3261 | − | − |
| hsa-mir-186 | ZRANB2 | 0.8033 | + | + |
| hsa-mir-211 | TRPM1 | 0.0692 | − | − |
| hsa-mir-199b | DNM1 | −0.3855 | − | − |
| hsa-mir-191 | C3orf60 | 0.1413 | − | − |
The combined data is described in Table 3.
The symbol “+” means high co-expression and the symbol “−” means low co-expression.
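The "Calculated results" column is derived from the Pearson correlation between a miRNA's expression profile and that of its host gene across tissues. A minimal sketch of that calculation follows. The cutoff separating "+" from "−" is not stated here, so it is left as a required argument; the value 0.6 used in the example is a placeholder that merely falls between the lowest "+" (0.626) and the highest "−" (0.509) coefficients appearing in the tables. The profiles and function names are hypothetical, and "-" is written in ASCII in place of "−".

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equally long profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (norm_x * norm_y)


def coexpression_label(r: float, cutoff: float) -> str:
    """Map a correlation coefficient to '+' (high) or '-' (low co-expression)."""
    return "+" if r >= cutoff else "-"


# Hypothetical expression values across the same four tissues.
mirna_profile = [1.0, 2.0, 8.0, 3.0]
host_profile = [10.0, 22.0, 79.0, 31.0]

r = pearson(mirna_profile, host_profile)
print(coexpression_label(r, cutoff=0.6))  # cutoff is a placeholder, not the paper's value
```

Both profiles must list the same tissues in the same order, which is why only the consistent part of the miRNA and mRNA datasets can be used.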
Table 5. Tissue-specific miRNAs that are also found in our predictions.
| miRNAs | Specific tissues | miRNAs | Predicted tissues |
| mir-1 | Heart | mir-1-1 | Skeletal Muscle |
| mir-488 | Nervous system | mir-488 | Nervous system |
| mir-218 | Nervous system | mir-218-2 | Nervous system |
| mir-449a | Reproductive System | mir-449a | Ovary |
| mir-9 | Nervous system | mir-9-1 | Nervous system |
| mir-128a | Nervous system | mir-128a | Nervous system |
| mir-153 | Nervous system | mir-153-1 | Nervous system |
| mir-302 cluster | Embryonic tissue and cell lines | mir-302 cluster | lymphoblasts |
The specific tissues listed in the second column were identified by biological experiments [23].
Figure 2. The predicted expression profile of mir-488, which indicates that mir-488 shows high expression mainly in the central nervous system.
Red and blue bars mark tissues with high expression: red bars represent central nervous system tissues, and blue bars represent adrenal gland and adrenal cortex.
Figure 3. The predicted expression profile of mir-208b, which indicates that mir-208b shows high expression mainly in heart, skeletal muscle, and tongue.
Red bars mark tissues with high expression. From left to right, the three red bars represent heart, skeletal muscle, and tongue, respectively.