Rongsheng Zhu, Zhanguo Zhang, Yang Li, Zhenbang Hu, Dawei Xin, Zhaoming Qi, Qingshan Chen.
Abstract
Previous studies have confirmed that there are many differences between animal and plant microRNAs (miRNAs), and that numerical features based on sequence and structure can be used to predict the function of individual miRNAs. However, there is little research on the numerical differences between animal and plant miRNAs, or on whether a single numerical feature or a combination of features can distinguish animal from plant miRNAs. Therefore, in the current study we aimed to identify numerical features that can make this distinction. We performed a large-scale analysis of 132 miRNA numerical features and identified 17 highly significant distinguishing features. However, no single feature could clearly differentiate animal and plant miRNAs. Through further analysis, we found a four-feature subset comprising helix number, stack number, length of pre-miRNA, and minimum free energy, and developed a logistic classifier that distinguishes animal and plant miRNAs effectively. The precision of the classifier was greater than 80%. Using this tool, we confirmed that there are universal differences between animal and plant miRNAs, and that a single feature is unable to distinguish them adequately. This feature set and classifier represent a valuable tool for identifying differences between animal and plant miRNAs at the molecular level.
Year: 2016 PMID: 27768749 PMCID: PMC5074594 DOI: 10.1371/journal.pone.0165152
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
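The abstract describes a logistic classifier built on four features (helix number, stack number, length of pre-miRNA, and minimum free energy). The record does not give the fitting procedure or the software configuration, so the following is only a minimal sketch of what such a classifier could look like: plain logistic regression fitted by batch gradient descent. The feature values are synthetic draws from made-up distributions, not data from the study, and every function name is hypothetical.

```python
import math
import random

# Feature order assumed for this sketch (taken from the abstract's four-feature subset).
FEATURES = ("helix_number", "stack_number", "pre_mirna_length", "mfe")


def standardize(rows):
    """Scale each column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
        for col, m in zip(columns, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 (plant) if the predicted probability is at least 0.5, else 0 (animal)."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


# SYNTHETIC data: the means and spreads below are invented for illustration only.
random.seed(0)
animals = [
    [random.gauss(4, 1), random.gauss(30, 5), random.gauss(85, 10), random.gauss(-35, 6)]
    for _ in range(200)
]
plants = [
    [random.gauss(7, 2), random.gauss(50, 12), random.gauss(160, 30), random.gauss(-70, 15)]
    for _ in range(200)
]

X = standardize(animals + plants)
y = [0] * len(animals) + [1] * len(plants)

w, b = fit_logistic(X, y)
accuracy = sum(predict(w, b, x) == t for x, t in zip(X, y)) / len(y)
print(dict(zip(FEATURES, (round(v, 2) for v in w))))
print(f"training accuracy on synthetic data: {accuracy:.3f}")
```

Standardizing the inputs first keeps the gradient steps on a comparable scale for features as different as a helix count and a free energy; the accuracy printed here says nothing about the performance reported in the abstract.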
Basic information on candidate species.
| Species Class | Species Name | Number of miRNA precursors |
|---|---|---|
| Animal | | 348 |
| Animal | | 189 |
| Animal | | 734 |
| Animal | | 324 |
| Animal | | 341 |
| Animal | | 460 |
| Animal | | 615 |
| Animal | | 1872 |
| Animal | | 656 |
| Animal | | 634 |
| Animal | | 396 |
| Animal | | 1186 |
| Animal | | 449 |
| Animal | | 798 |
| Animal | | 280 |
| Animal | | 346 |
| Animal | | 129 |
| Animal | | 489 |
| Animal | | 210 |
| Animal | | 223 |
| Animal | | 124 |
| Animal | | 148 |
| Plant | | 229 |
| Plant | | 298 |
| Plant | | 505 |
| Plant | | 672 |
| Plant | | 352 |
| Plant | | 163 |
| Plant | | 592 |
| Plant | | 205 |
| Plant | | 172 |
Note: All sequences come from the miRBase database.
Fig 1. Partial numerical features of miRNA.
Secondary structure of osa-mir156a as predicted by Mfold. H1~H7 denote helices, I1~I2 denote interior loops, T1 denotes the terminal (hairpin) loop, and B1~B3 denote bulge loops. ‘G++.’ describes a G base and its two neighbors: ‘+’ denotes a matching (paired) base and ‘.’ denotes a mismatching (unpaired) base. The first mark after G refers to the base to the left of G (matching), the second mark refers to G itself (matching), and the third mark refers to the base to the right of G (mismatching).
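Structural features such as helix and stack counts can be read off a predicted secondary structure. The sketch below is an illustration under stated assumptions, not the authors' procedure: it assumes the structure is available in dot-bracket notation (not Mfold's native output), counts a stack as two directly adjacent base pairs, and counts a helix as a maximal run of such adjacent pairs. The function names are hypothetical, and the study's exact feature definitions are in S1 Table.

```python
def base_pairs(dot_bracket):
    """Return the set of (i, j) index pairs encoded by a dot-bracket string."""
    open_positions, pairs = [], set()
    for j, ch in enumerate(dot_bracket):
        if ch == "(":
            open_positions.append(j)
        elif ch == ")":
            if not open_positions:
                raise ValueError("unbalanced structure: unmatched ')'")
            pairs.add((open_positions.pop(), j))
    if open_positions:
        raise ValueError("unbalanced structure: unmatched '('")
    return pairs


def helix_and_stack_counts(dot_bracket):
    """Count helices and stacks under the assumptions stated above.

    A pair (i, j) contributes a stack when (i + 1, j - 1) is also paired.
    A pair (i, j) starts a helix when (i - 1, j + 1) is not paired.
    """
    pairs = base_pairs(dot_bracket)
    stacks = sum((i + 1, j - 1) in pairs for i, j in pairs)
    helices = sum((i - 1, j + 1) not in pairs for i, j in pairs)
    return helices, stacks


# Toy hairpin with one interior loop: two helices of two base pairs each.
print(helix_and_stack_counts("((..((...))..))"))
```

Under these definitions a helix of n consecutive base pairs contributes n - 1 stacks, so the two counts carry related but not identical information.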
Fig 2. Statistical test results of differences between animal and plant miRNAs based on 132 numerical features and two test methods.
The upper panel shows results of the Kolmogorov-Smirnov test, while the bottom panel shows results of t-tests. The x-axis shows the serial number of the 132 numerical features. Descriptions of the numerical features and of classes A–H are given in S1 Table.
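For a single feature, the two tests compare the animal and plant samples in different ways: the Kolmogorov-Smirnov statistic is the largest gap between the two empirical distribution functions, while the t statistic compares the means. The sketch below computes both statistics with the standard library only. It is a rough illustration: the caption does not say which t-test variant was used, so Welch's unequal-variance form is an assumption here, and p-values are omitted.

```python
import math
from bisect import bisect_right
from statistics import mean, variance


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max |ECDF_a(x) - ECDF_b(x)|."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    # The supremum of the ECDF difference is attained at an observed value.
    for x in a_sorted + b_sorted:
        f_a = bisect_right(a_sorted, x) / len(a_sorted)
        f_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(f_a - f_b))
    return d


def welch_t(a, b):
    """Welch's t statistic (assumed variant; unequal variances allowed)."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if standard_error == 0.0:
        return 0.0
    return (mean(a) - mean(b)) / standard_error


# Hypothetical values of one feature for two groups (not data from the study).
group_a = [1.0, 2.0, 3.0]
group_b = [2.0, 3.0, 4.0]
print(ks_statistic(group_a, group_b), welch_t(group_a, group_b))
```

Running both statistics over each of the 132 features would give one pair of values per feature, which is the kind of per-feature summary the two panels plot.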
Fig 3. Distribution bar plot of lengths of pre-miRNAs, MFE, and number of stacks for animal and plant miRNAs.
Fig 3A shows the grouped distribution of pre-miRNA length for animal and plant miRNAs. Fig 3B shows the distribution of MFE, and Fig 3C shows the distribution of stack number.
Fig 4. Frequency distribution plot of four numerical features of miRNAs.
The C content, G content, MFE index, and length of miRNA were selected based on results of the Kolmogorov-Smirnov test statistic.
Results of feature selection.
| Attribute Evaluator | Search Method | Selected Features |
|---|---|---|
| CfsSubsetEval | BestFirst | 118,120,121,122 |
| CfsSubsetEval | ExhaustiveSearch | 118,120,121,122 |
| CfsSubsetEval | GeneticSearch | 51,118,120,121,122 |
| FilteredSubsetEval | GreedyStepwise | 118,120,121,122 |
| FilteredSubsetEval | LinearForwardSelection | 118,120,121,122 |
| FilteredSubsetEval | RandomSearch | 120,121,122 |
Note: Serial numbers of the selected features refer to S1 Table. Feature 51 is the GUC content frequency, 118 is the number of helices, 120 is the number of stacks, 121 is the length of the hairpin, and 122 is the minimum free energy of the pre-miRNA secondary structure. The attribute evaluators and search methods are described in [43, 44], and all details are recorded in S3 Table.
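Correlation-based feature subset evaluation scores a subset highly when its features correlate with the class but not with each other, which helps explain why different search methods can converge on nearly the same small subset. The sketch below implements the usual CFS merit heuristic, merit = k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature correlation. Using absolute Pearson correlation is a simplifying assumption for illustration; the evaluator actually used in the study may measure correlation differently. All names and data are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0
    return cov / (sx * sy)


def cfs_merit(features, labels):
    """CFS merit of a feature subset (list of equal-length value columns)."""
    k = len(features)
    r_cf = sum(abs(pearson(f, labels)) for f in features) / k
    pairs = list(combinations(features, 2))
    r_ff = sum(abs(pearson(f, g)) for f, g in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


# Toy example: one informative column and one column unrelated to the class.
labels = [0, 0, 1, 1]
informative = [0, 0, 1, 1]
unrelated = [0, 1, 0, 1]
print(cfs_merit([informative], labels))             # informative feature alone
print(cfs_merit([informative, unrelated], labels))  # adding noise lowers the merit
```

A subset search then amounts to proposing candidate subsets and keeping the one with the highest merit; the search strategies in the table differ only in how candidates are proposed.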
Results of evaluation based on different classifiers.
| Classifier | Sample Set | TP Rate | Precision | Recall | ROC Area |
|---|---|---|---|---|---|
| NaiveBayes | S1 | 0.849 | 0.843 | 0.849 | 0.773 |
| BayesNet | S1 | 0.843 | 0.833 | 0.843 | 0.801 |
| Logistic | S1 | 0.854 | 0.854 | 0.835 | 0.805 |
| FilteredClassifier | S1 | 0.856 | 0.856 | 0.856 | 0.789 |
| ZeroR | S1 | 0.772 | 0.595 | 0.772 | 0.500 |
| J48 | S1 | 0.855 | 0.851 | 0.855 | 0.764 |
| RandomForest | S1 | 0.815 | 0.804 | 0.815 | 0.744 |
| NaiveBayes | S2 | 0.844 | 0.835 | 0.844 | 0.795 |
| BayesNet | S2 | 0.840 | 0.833 | 0.840 | 0.807 |
| Logistic | S2 | 0.860 | 0.861 | 0.860 | 0.816 |
| FilteredClassifier | S2 | 0.861 | 0.860 | 0.861 | 0.778 |
| ZeroR | S2 | 0.772 | 0.595 | 0.772 | 0.500 |
| J48 | S2 | 0.862 | 0.857 | 0.862 | 0.759 |
| RandomForest | S2 | 0.836 | 0.827 | 0.836 | 0.770 |
Note: All results are from 10-fold cross-validation. S1 includes helix number, stack number, length of pre-miRNA, and MFE; S2 includes AU, GU, AUC, GAC, GAU, GUC, CUC, A…, U…, helix number, interior loop number, stack number, length of pre-miRNA, MFE, AMFE, MFEI and IESS.
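The ZeroR rows serve as a baseline: ZeroR always predicts the majority class, so its scores depend only on the class balance. The sketch below computes class-weighted precision and recall for such a baseline. Two points are assumptions: that the tabulated values are class-weighted averages (the ZeroR figures appear consistent with this), and the class counts themselves, which are hypothetical and were chosen only so that the rounded output resembles the ZeroR rows.

```python
from collections import Counter


def zero_r(train_labels):
    """Return the majority class, which ZeroR predicts for every sample."""
    return Counter(train_labels).most_common(1)[0][0]


def weighted_precision_recall(y_true, y_pred):
    """Class-weighted average precision and recall (weights = class support)."""
    n = len(y_true)
    weighted_precision = weighted_recall = 0.0
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        support = tp + fn
        precision = tp / (tp + fp) if tp + fp else 0.0  # never-predicted class scores 0
        recall = tp / support
        weighted_precision += support / n * precision
        weighted_recall += support / n * recall
    return weighted_precision, weighted_recall


# HYPOTHETICAL class counts, not taken from the study.
y_true = ["animal"] * 7716 + ["plant"] * 2284
majority = zero_r(y_true)
y_pred = [majority] * len(y_true)

precision, recall = weighted_precision_recall(y_true, y_pred)
print(majority, round(precision, 3), round(recall, 3))
```

With a majority fraction p, this baseline has weighted recall p and weighted precision p², so any classifier in the table is only informative to the extent that it exceeds those two numbers.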