Mohsen Sheikh Hassani, James R Green.
Abstract
BACKGROUND: MicroRNAs (miRNAs) are a family of short, non-coding RNAs that have been linked to critical cellular activities, most notably the regulation of gene expression. Identifying miRNAs is a cross-disciplinary task that requires both computational prediction methods and wet-lab validation experiments, making it resource-intensive. While numerous machine learning methods have been developed to increase classification accuracy and thus reduce validation costs, most use supervised learning and therefore require large labeled training data sets, which are often not available for less-sequenced species. On the other hand, there is now an abundance of unlabeled RNA sequence data due to the emergence of high-throughput wet-lab experimental procedures, such as next-generation sequencing.
Keywords: Active learning; Co-training; Machine learning; Next-generation sequencing; Semi-supervised learning; miRNA prediction
Year: 2019 PMID: 31639051 PMCID: PMC6805288 DOI: 10.1186/s40246-019-0221-7
Source DB: PubMed Journal: Hum Genomics ISSN: 1473-9542 Impact factor: 4.639
Fig. 1 Illustration of the proposed two-stage integrated semi-supervised ML classification pipeline, comprising both multi-view co-training (upper) and active learning (lower)
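The co-training stage of the figure can be illustrated with a minimal sketch of one generic multi-view co-training round. This is not the authors' implementation: the function name `cotrain_round`, the parameter `k`, and the rule that each view contributes its `k` most confident positive and negative predictions (with disagreements between views discarded) are assumptions made for illustration. The scores stand in for the outputs of whatever per-view classifiers are used.

```python
def cotrain_round(scores_a, scores_b, k=1):
    """Select pseudo-labels from two views (hypothetical helper).

    scores_a, scores_b: dicts mapping an unlabeled sample id to the
    predicted probability of being a true miRNA under each view.
    Returns a dict of sample id -> pseudo-label (1 or 0).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    votes = {}
    for scores in (scores_a, scores_b):
        ranked = sorted(scores, key=scores.get)  # ascending by score
        picks = [(sid, 0) for sid in ranked[:k]]    # most confident negatives
        picks += [(sid, 1) for sid in ranked[-k:]]  # most confident positives
        for sid, label in picks:
            votes.setdefault(sid, set()).add(label)
    # Assumption: drop any sample on which the two views disagree.
    return {sid: next(iter(labels))
            for sid, labels in votes.items() if len(labels) == 1}
```

The selected pseudo-labeled samples would then be added to the labeled training set before both view classifiers are retrained for the next iteration.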
Multi-view co-training (MVCT) performance results for all six data sets over 11 iterations of learning. Values are the average area under the precision-recall curve (AUPRC). Standard deviations were in the range of 0.001 to 0.003 for all experiments and are omitted from the table for clarity
| Iteration | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.596 | 0.344 | 0.714 | 0.822 | 0.810 | 0.864 | 0.778 | 0.357 | 0.925 | 0.893 | 0.921 | 0.875 |
| 1 | 0.681 | 0.448 | 0.795 | 0.881 | 0.854 | 0.920 | 0.866 | 0.478 | 0.932 | 0.909 | 0.918 | 0.881 |
| 2 | 0.705 | 0.568 | 0.797 | 0.906 | 0.884 | 0.920 | 0.860 | 0.566 | 0.930 | 0.905 | 0.926 | 0.893 |
| 3 | 0.721 | 0.678 | 0.813 | 0.903 | 0.893 | 0.920 | 0.865 | 0.585 | 0.927 | 0.909 | 0.934 | 0.886 |
| 4 | 0.752 | 0.735 | 0.872 | 0.912 | 0.893 | 0.919 | 0.850 | 0.778 | 0.920 | 0.912 | 0.939 | 0.941 |
| 5 | 0.748 | 0.734 | 0.879 | 0.911 | 0.886 | 0.925 | 0.863 | 0.726 | 0.931 | 0.917 | 0.952 | 0.946 |
| 6 | 0.781 | 0.739 | 0.921 | 0.920 | 0.883 | 0.912 | 0.849 | 0.783 | 0.923 | 0.915 | 0.947 | 0.947 |
| 7 | 0.771 | 0.747 | 0.917 | 0.910 | 0.887 | 0.922 | 0.871 | 0.773 | 0.930 | 0.911 | 0.954 | 0.952 |
| 8 | 0.791 | 0.744 | 0.937 | 0.912 | 0.882 | 0.920 | 0.855 | 0.734 | 0.951 | 0.916 | 0.943 | 0.949 |
| 9 | 0.772 | 0.738 | 0.928 | 0.911 | 0.920 | 0.932 | 0.860 | 0.744 | 0.957 | 0.918 | 0.956 | 0.955 |
| 10 | 0.773 | 0.761 | 0.941 | 0.908 | 0.903 | 0.923 | 0.865 | 0.765 | 0.961 | 0.917 | 0.952 | 0.961 |
| 11 | 0.779 | 0.761 | 0.955 | 0.912 | 0.901 | 0.921 | 0.865 | 0.809 | 0.964 | 0.927 | 0.959 | 0.961 |
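The metric reported in this table, the area under the precision-recall curve, can be estimated in several ways. The sketch below uses the step-wise "average precision" estimator; which estimator the study used is not stated here, so this choice and the function name `average_precision` are assumptions.

```python
def average_precision(labels, scores):
    """Step-wise estimate of the area under the precision-recall curve.

    labels: iterable of 0/1 ground-truth labels.
    scores: iterable of classifier scores (higher = more likely positive).
    Ties in score are broken by input order (a simplification).
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(label for _, label in ranked)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos
```

A classifier that ranks every true miRNA above every negative candidate scores 1.0 under this estimator.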
Fig. 2 Learning curve for MVCT for Bos taurus (bta) showing the AUPRC for the expression- and sequence-based views over 15 iterations. Results represent the mean AUPRC observed in 100 repetitions with randomized seed training sets (5 positive and 5 negative exemplars). Performance asymptotes after 11 iterations, justifying the selection of this parameter
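The repetition protocol described in this caption (randomized seed sets of 5 positive and 5 negative exemplars, averaged over 100 repetitions) can be sketched as follows. The helper names `sample_seed` and `repeat_experiment` and the `evaluate` callback are hypothetical; `evaluate` stands in for training the pipeline on a seed set and returning its AUPRC.

```python
import random
import statistics


def sample_seed(positives, negatives, n_each=5, rng=None):
    """Draw a random seed training set without replacement."""
    if rng is None:
        rng = random.Random()
    return rng.sample(positives, n_each), rng.sample(negatives, n_each)


def repeat_experiment(evaluate, positives, negatives, repetitions=100, seed=0):
    """Return (mean, standard deviation) of evaluate() over random seed sets."""
    rng = random.Random(seed)  # fixed seed so the repetitions are reproducible
    results = []
    for _ in range(repetitions):
        seed_pos, seed_neg = sample_seed(positives, negatives, rng=rng)
        results.append(evaluate(seed_pos, seed_neg))
    return statistics.mean(results), statistics.stdev(results)
```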
Active learning performance results for all six data sets over 11 iterations of learning, using the labeled set obtained from co-training. Values are the average area under the precision-recall curve (AUPRC). Standard deviations were in the range of 0.001 to 0.003 for all experiments and are omitted from the table for clarity
| Iteration | | | | | | |
|---|---|---|---|---|---|---|
| 0 | 0.779 | 0.955 | 0.921 | 0.865 | 0.964 | 0.961 |
| 1 | 0.812 | 0.951 | 0.918 | 0.877 | 0.960 | 0.962 |
| 2 | 0.856 | 0.959 | 0.921 | 0.890 | 0.963 | 0.965 |
| 3 | 0.875 | 0.963 | 0.925 | 0.894 | 0.963 | 0.965 |
| 4 | 0.881 | 0.963 | 0.928 | 0.916 | 0.964 | 0.968 |
| 5 | 0.888 | 0.968 | 0.930 | 0.932 | 0.965 | 0.970 |
| 6 | 0.891 | 0.970 | 0.929 | 0.939 | 0.965 | 0.971 |
| 7 | 0.891 | 0.972 | 0.931 | 0.939 | 0.965 | 0.971 |
| 8 | 0.896 | 0.972 | 0.937 | 0.941 | 0.964 | 0.971 |
| 9 | 0.898 | 0.972 | 0.940 | 0.948 | 0.964 | 0.970 |
| 10 | 0.901 | 0.972 | 0.941 | 0.947 | 0.965 | 0.971 |
| 11 | 0.903 | 0.972 | 0.941 | 0.948 | 0.965 | 0.971 |
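As an illustration of the query step that drives each active learning iteration, the sketch below uses uncertainty sampling: the candidates whose predicted probability is closest to 0.5 are sent for labeling. The query strategy actually used in the pipeline is not stated in this record, so uncertainty sampling, the function name `query_most_uncertain`, and the parameter `batch_size` are assumptions.

```python
def query_most_uncertain(probabilities, batch_size=1):
    """Pick the unlabeled samples the classifier is least sure about.

    probabilities: dict mapping sample id -> predicted probability of
    the positive class. Returns a list of ids, most uncertain first.
    """
    ranked = sorted(probabilities, key=lambda sid: abs(probabilities[sid] - 0.5))
    return ranked[:batch_size]
```

For example, with predicted probabilities of 0.98, 0.52, 0.05, and 0.40, the candidate scored 0.52 would be queried first, since it lies nearest the decision boundary.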
Fig. 3 Learning curves for the human (hsa) data set for active learning alone (seed training set of 5 positive and 5 negative exemplars) and active learning applied to the MVCT-augmented training set (i.e., the proposed two-stage integrated pipeline)
Fig. 4 Stacked bar graphs for six test species showing the relative contribution of the base classifier alone, the MVCT-augmented training set, and active learning applied after MVCT (i.e., the complete pipeline)
Comparison of average AUPRC across all six data sets for the following methods: the miPIE classification tool (restricted to 32 training exemplars), miRDeep2, active learning alone, and the dual-stage semi-supervised (SS) pipeline. Means ± standard deviations are shown, representing 100 repetitions of each experiment (except for miRDeep2)
| Data set | miPIE | miRDeep2 | Active learning alone | Proposed dual-stage SS pipeline |
|---|---|---|---|---|
| | 0.844 (± 0.01) | 0.736 | 0.875 (± 0.01) | 0.903 (± 0.02) |
| | 0.966 (± 0.01) | 0.915 | 0.972 (± 0.00) | 0.972 (± 0.00) |
| | 0.894 (± 0.01) | 0.914 | 0.924 (± 0.01) | 0.941 (± 0.01) |
| | 0.905 (± 0.02) | 0.869 | 0.935 (± 0.01) | 0.948 (± 0.01) |
| | 0.919 (± 0.01) | 0.923 | 0.944 (± 0.01) | 0.965 (± 0.00) |
| | 0.919 (± 0.01) | 0.843 | 0.971 (± 0.00) | 0.971 (± 0.00) |
| Average | 0.908 | 0.867 | 0.935 | 0.950 |
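The comparison in this table can be worked through numerically. The sketch below transcribes the mean AUPRC values in row order and computes per-method means and the per-data-set gain of the proposed pipeline over a chosen baseline; the names `auprc`, `gains_over`, and `column_means` are illustrative only.

```python
from statistics import mean

# Mean AUPRC values transcribed from the comparison table, in row order.
auprc = {
    "miPIE": [0.844, 0.966, 0.894, 0.905, 0.919, 0.919],
    "miRDeep2": [0.736, 0.915, 0.914, 0.869, 0.923, 0.843],
    "Active learning alone": [0.875, 0.972, 0.924, 0.935, 0.944, 0.971],
    "Proposed dual-stage SS pipeline": [0.903, 0.972, 0.941, 0.948, 0.965, 0.971],
}


def gains_over(baseline, method="Proposed dual-stage SS pipeline"):
    """Per-data-set AUPRC difference (method minus baseline)."""
    return [round(m - b, 3) for m, b in zip(auprc[method], auprc[baseline])]


# Means recomputed from the rounded table entries.
column_means = {name: round(mean(values), 3) for name, values in auprc.items()}
```

Because the inputs are already rounded to three decimals, a recomputed mean can differ from the printed Average row in the last digit. On these values, the proposed pipeline is never below active learning alone on any of the six data sets.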
NGS data sets examined in this study
| Data set | GEO | Organism | Reads | Labeled samples |
|---|---|---|---|---|
| | GSM1820470 | | 38,210,937 | 509+/842− |
| | GSM1528810 | | 54,947,527 | 367+/844− |
| | GSM1123781 | | 18,723,989 | 110+/97− |
| | GSE74879 | | 43,164,654 | 332+/650− |
| | GSM2095817 | | 27,937,224 | 193+/104− |
| | GSE100852 | | 42,178,766 | 364+/224− |
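The "Labeled samples" column gives the class balance of each data set, which matters when reading AUPRC values: a classifier that ranks candidates at random has an expected precision equal to the fraction of positives. A minimal sketch for parsing these cells is shown below; the helper names `parse_labeled` and `positive_fraction` are hypothetical.

```python
import re

# Cells transcribed from the "Labeled samples" column, in row order.
LABELED = ["509+/842−", "367+/844−", "110+/97−",
           "332+/650−", "193+/104−", "364+/224−"]


def parse_labeled(cell):
    """Parse a cell such as '509+/842−' into (positives, negatives).

    Accepts both the Unicode minus sign (U+2212) and an ASCII hyphen.
    """
    match = re.fullmatch(r"\s*(\d+)\+\s*/\s*(\d+)[\u2212-]\s*", cell)
    if match is None:
        raise ValueError(f"unrecognized cell: {cell!r}")
    return int(match.group(1)), int(match.group(2))


def positive_fraction(cell):
    """Fraction of labeled samples that are positive (known miRNAs)."""
    pos, neg = parse_labeled(cell)
    return pos / (pos + neg)


fractions = [positive_fraction(cell) for cell in LABELED]
```

The six data sets range from negative-heavy to positive-heavy, so the same AUPRC value does not reflect the same margin over chance in every row.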