Dingfang Li, Longqiang Luo, Wen Zhang, Feng Liu, Fei Luo.
Abstract
BACKGROUND: Predicting piwi-interacting RNAs (piRNAs) is an important topic in small non-coding RNA research, and it provides clues for understanding the mechanism of gamete generation. To the best of our knowledge, several machine learning approaches have been proposed for piRNA prediction, but there is still room for improvement.
Keywords: Ensemble learning; Feature; Genetic algorithm; piRNA
Year: 2016 PMID: 27578422 PMCID: PMC5006569 DOI: 10.1186/s12859-016-1206-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Raw data about three species
| Species | Raw real piRNAs | Raw non-piRNA ncRNAs | Transposons |
|---|---|---|---|
| Human | 32,152 (NONCODE v3.0) | 59,003 (NONCODE v3.0) | 4,679,772 (UCSC, hg38) |
| Mouse | 75,814 (NONCODE v3.0) | 43,855 (NONCODE v3.0) | 3,660,356 (UCSC, mm10) |
| Drosophila | 12,903 (NCBI, GSE9138) | 102,655 (NONCODE v3.0) | 37,326 (UCSC, dm6) |
Number of real piRNAs and pseudo piRNAs
| Species | Real piRNAs | Pseudo piRNAs |
|---|---|---|
| Human | 7,405 | 21,846 |
| Mouse | 13,998 | 40,712 |
| Drosophila | 9,214 | 22,855 |
Twenty-three sequence-derived features
| Index | Feature | Dimension | Parameter | Annotation |
|---|---|---|---|---|
| F1 | 1-Spectrum Profile | 4 | No parameters | Used in previous work |
| F2 | 2-Spectrum Profile | 16 | No parameters | Used in previous work |
| F3 | 3-Spectrum Profile | 64 | No parameters | Used in previous work |
| F4 | 4-Spectrum Profile | 256 | No parameters | Used in previous work |
| F5 | 5-Spectrum Profile | 1024 | No parameters | Used in previous work |
| F6 | (3, m)-Mismatch Profile | 64 | m | New features |
| F7 | (4, m)-Mismatch Profile | 256 | m | New features |
| F8 | (5, m)-Mismatch Profile | 1024 | m | New features |
| F9 | (3, w)-Subsequence Profile | 64 | w | New features |
| F10 | (4, w)-Subsequence Profile | 256 | w | New features |
| F11 | (5, w)-Subsequence Profile | 1024 | w | New features |
| F12 | 1-RevcKmer | 2 | No parameters | New features |
| F13 | 2-RevcKmer | 10 | No parameters | New features |
| F14 | 3-RevcKmer | 32 | No parameters | New features |
| F15 | 4-RevcKmer | 136 | No parameters | New features |
| F16 | 5-RevcKmer | 528 | No parameters | New features |
| F17 | PCPseDNC | 16 + λ | λ | New features |
| F18 | PCPseTNC | 64 + λ | λ | New features |
| F19 | SCPseDNC | 16 + 6 × λ | λ | New features |
| F20 | SCPseTNC | 64 + 12 × λ | λ | New features |
| F21 | Sparse Profile | 5 × d | d | New features |
| F22 | PSSM | | d | New features |
| F23 | LSSTE | 32 | No parameters | Used in previous work |
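The spectrum and RevcKmer dimensions in the table follow from counting k-mers over the alphabet {A, C, G, T}. The following is a minimal sketch, not the authors' implementation: it assumes normalized k-mer frequencies, and the function names `spectrum_profile` and `revc_kmer_profile` are illustrative.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of a DNA k-mer."""
    return kmer.translate(COMPLEMENT)[::-1]


def spectrum_profile(seq, k):
    """Normalized frequency of every k-mer over ACGT (4**k entries)."""
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows with ambiguous bases such as N
            counts[kmer] += 1
            total += 1
    return {kmer: (c / total if total else 0.0) for kmer, c in counts.items()}


def revc_kmer_profile(seq, k):
    """Spectrum profile with each k-mer merged with its reverse complement."""
    merged = {}
    for kmer, freq in spectrum_profile(seq, k).items():
        # Use the lexicographically smaller of the pair as the shared key.
        key = min(kmer, reverse_complement(kmer))
        merged[key] = merged.get(key, 0.0) + freq
    return merged


if __name__ == "__main__":
    seq = "ACGTACGTAC"  # toy sequence, not from the datasets
    for k in range(1, 6):
        print(k, len(spectrum_profile(seq, k)), len(revc_kmer_profile(seq, k)))
```

Under this merging rule the sketch reproduces the listed dimensions 2, 10, 32 and 136 for k = 1 to 4. For k = 5 it yields 512 rather than the 528 listed, so the authors' encoding may differ from this sketch at that point.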
Fig. 1 Flowchart of the GA-based weighted ensemble method
Fig. 2 The length distribution of piRNAs in three species (Human, Mouse and Drosophila)
Fig. 3 (a) AUC scores of the (k, w)-subsequence profile-based models as the parameter w varies on the balanced Human dataset; (b) AUC scores of the PCPseDNC, PCPseTNC, SCPseDNC and SCPseTNC-based models as the parameter λ varies on the balanced Human dataset; (c) AUC scores of the sparse profile and PSSM-based models as the parameter d varies on the balanced Human dataset
The performances of individual feature-based models on balanced Human dataset
| Index | Feature | AUC | ACC | SN | SP |
|---|---|---|---|---|---|
| F1 | 1-Spectrum Profile | 0.754 | 0.690 | 0.731 | 0.649 |
| F2 | 2-Spectrum Profile | 0.841 | 0.756 | 0.780 | 0.732 |
| F3 | 3-Spectrum Profile | 0.839 | 0.750 | 0.747 | 0.754 |
| F4 | 4-Spectrum Profile | 0.829 | 0.740 | 0.732 | 0.748 |
| F5 | 5-Spectrum Profile | 0.802 | 0.718 | 0.681 | 0.755 |
| F6 | (3,1)-Mismatch Profile | 0.862 | 0.772 | 0.819 | 0.725 |
| F7 | (4,1)-Mismatch Profile | 0.854 | 0.761 | 0.788 | 0.734 |
| F8 | (5,1)-Mismatch Profile | 0.842 | 0.750 | 0.754 | 0.747 |
| F9 | (3,1)-Subsequence Profile | 0.850 | 0.767 | 0.809 | 0.725 |
| F10 | (4,1)-Subsequence Profile | 0.866 | 0.782 | 0.821 | 0.743 |
| F11 | (5,1)-Subsequence Profile | 0.875 | 0.791 | 0.829 | 0.754 |
| F12 | 1-RevcKmer | 0.746 | 0.699 | 0.889 | 0.509 |
| F13 | 2-RevcKmer | 0.803 | 0.724 | 0.774 | 0.673 |
| F14 | 3-RevcKmer | 0.818 | 0.732 | 0.765 | 0.698 |
| F15 | 4-RevcKmer | 0.808 | 0.718 | 0.717 | 0.718 |
| F16 | 5-RevcKmer | 0.791 | 0.702 | 0.658 | 0.746 |
| F17 | PCPseDNC | 0.836 | 0.757 | 0.776 | 0.738 |
| F18 | PCPseTNC | 0.849 | 0.765 | 0.787 | 0.742 |
| F19 | SCPseDNC | 0.833 | 0.754 | 0.770 | 0.739 |
| F20 | SCPseTNC | 0.832 | 0.751 | 0.777 | 0.725 |
| F21 | Sparse Profile | 0.904 | 0.819 | 0.815 | 0.824 |
| F22 | PSSM | 0.880 | 0.807 | 0.815 | 0.799 |
| F23 | LSSTE | 0.688 | 0.631 | 0.664 | 0.598 |
The performances of individual feature-based models on imbalanced Human dataset
| Index | Feature | AUC | ACC | SN | SP |
|---|---|---|---|---|---|
| F1 | 1-Spectrum Profile | 0.748 | 0.739 | 0.398 | 0.854 |
| F2 | 2-Spectrum Profile | 0.841 | 0.808 | 0.416 | 0.940 |
| F3 | 3-Spectrum Profile | 0.850 | 0.814 | 0.321 | 0.982 |
| F4 | 4-Spectrum Profile | 0.844 | 0.811 | 0.284 | 0.989 |
| F5 | 5-Spectrum Profile | 0.836 | 0.813 | 0.305 | 0.986 |
| F6 | (3,1)-Mismatch Profile | 0.867 | 0.824 | 0.427 | 0.959 |
| F7 | (4,1)-Mismatch Profile | 0.856 | 0.814 | 0.328 | 0.979 |
| F8 | (5,1)-Mismatch Profile | 0.851 | 0.810 | 0.277 | 0.991 |
| F9 | (3,1)-Subsequence Profile | 0.850 | 0.808 | 0.443 | 0.932 |
| F10 | (4,1)-Subsequence Profile | 0.864 | 0.822 | 0.473 | 0.940 |
| F11 | (5,1)-Subsequence Profile | 0.871 | 0.829 | 0.492 | 0.944 |
| F12 | 1-RevcKmer | 0.745 | 0.746 | 0.005 | 0.997 |
| F13 | 2-RevcKmer | 0.803 | 0.778 | 0.411 | 0.902 |
| F14 | 3-RevcKmer | 0.823 | 0.800 | 0.265 | 0.981 |
| F15 | 4-RevcKmer | 0.823 | 0.803 | 0.241 | 0.993 |
| F16 | 5-RevcKmer | 0.818 | 0.806 | 0.255 | 0.992 |
| F17 | PCPseDNC | 0.841 | 0.806 | 0.374 | 0.952 |
| F18 | PCPseTNC | 0.857 | 0.813 | 0.337 | 0.975 |
| F19 | SCPseDNC | 0.836 | 0.803 | 0.346 | 0.958 |
| F20 | SCPseTNC | 0.842 | 0.808 | 0.312 | 0.977 |
| F21 | Sparse Profile | 0.905 | 0.856 | 0.634 | 0.932 |
| F22 | PSSM | 0.882 | 0.832 | 0.584 | 0.916 |
| F23 | LSSTE | 0.688 | 0.766 | 0.175 | 0.966 |
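On the imbalanced dataset, ACC is dominated by the majority (pseudo piRNA) class, which is why models with very low SN can still report ACC near 0.75. The sketch below recomputes ACC from SN and SP. It assumes that the Human dataset holds the 7,405 real and 21,846 pseudo piRNAs listed in the dataset table; the function name is illustrative.

```python
def accuracy_from_sn_sp(sn, sp, n_pos, n_neg):
    """ACC = (TP + TN) / (P + N), with TP = SN * P and TN = SP * N."""
    return (sn * n_pos + sp * n_neg) / (n_pos + n_neg)


# Assumed Human counts: real piRNAs (positives) and pseudo piRNAs (negatives).
N_REAL, N_PSEUDO = 7405, 21846

# F12 (1-RevcKmer) on the imbalanced Human dataset: SN = 0.005, SP = 0.997.
acc_f12 = accuracy_from_sn_sp(0.005, 0.997, N_REAL, N_PSEUDO)
print(round(acc_f12, 3))  # 0.746, matching the ACC reported for F12

# A classifier that labels every sequence as pseudo piRNA (SN = 0, SP = 1).
acc_all_negative = accuracy_from_sn_sp(0.0, 1.0, N_REAL, N_PSEUDO)
print(round(acc_all_negative, 3))  # 0.747
```

Because SN and SP are themselves rounded to three decimals in the table, values recomputed this way can differ from the reported ACC in the last digit for other features.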
The performances of the GA-WE model on three species (Human, Mouse and Drosophila)
| Dataset | Species | AUC | ACC | SN | SP |
|---|---|---|---|---|---|
| Balanced | Human | 0.932 | 0.839 | 0.858 | 0.820 |
| Balanced | Mouse | 0.937 | 0.838 | 0.824 | 0.852 |
| Balanced | Drosophila | 0.995 | 0.959 | 0.951 | 0.966 |
| Imbalanced | Human | 0.935 | 0.869 | 0.687 | 0.931 |
| Imbalanced | Mouse | 0.939 | 0.889 | 0.745 | 0.939 |
| Imbalanced | Drosophila | 0.996 | 0.958 | 0.897 | 0.983 |
Fig. 4 Optimal weights for the GA-WE model in each fold of 10-CV
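The idea of a GA-based weighted ensemble can be sketched as follows: each base model contributes a score, the ensemble score is their weighted average, and a genetic algorithm searches for the weights. This record does not give the GA details, so everything below is an assumption made for illustration: AUC as the fitness function, truncation selection with elitism, arithmetic crossover, Gaussian mutation, and the toy data and function names.

```python
import random


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def normalize(weights):
    """Rescale positive weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def ensemble(weights, model_scores):
    """Weighted average; model_scores[m][i] is model m's score on sample i."""
    n = len(model_scores[0])
    return [sum(w * s[i] for w, s in zip(weights, model_scores)) for i in range(n)]


def ga_weights(model_scores, labels, pop_size=20, generations=30, seed=0):
    """Search for ensemble weights with a small, hypothetical GA."""
    rng = random.Random(seed)
    m = len(model_scores)

    def fitness(w):
        return auc(ensemble(w, model_scores), labels)

    # Initial population: uniform weights plus random candidates.
    pop = [[1.0 / m] * m]
    pop += [normalize([rng.random() + 1e-9 for _ in range(m)])
            for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]  # truncation selection, keeps the elite
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [(x + y) / 2 for x, y in zip(a, b)]  # arithmetic crossover
            j = rng.randrange(m)
            child[j] = max(1e-9, child[j] + rng.gauss(0, 0.1))  # mutation
            children.append(normalize(child))
        pop = parents + children
    return max(pop, key=fitness)


# Toy data: three hypothetical base models of decreasing quality.
data_rng = random.Random(1)
labels = [1] * 50 + [0] * 50
model_scores = [[y * signal + data_rng.gauss(0, 1) for y in labels]
                for signal in (2.0, 1.0, 0.2)]
best = ga_weights(model_scores, labels)
print([round(w, 3) for w in best])
```

In practice the fitness would be evaluated on held-out validation data within each cross-validation fold rather than on the training scores, as this toy example does for brevity.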
The performances of cross-species prediction
| Dataset | Species | AUC | ACC | SN | SP |
|---|---|---|---|---|---|
| Balanced | Human | 0.863 | 0.788 | 0.796 | 0.781 |
| Balanced | Drosophila | 0.687 | 0.668 | 0.639 | 0.698 |
| Imbalanced | Human | 0.868 | 0.811 | 0.425 | 0.942 |
| Imbalanced | Drosophila | 0.746 | 0.774 | 0.370 | 0.936 |
Performances of GA-WE and the state-of-the-art methods on three species
| Dataset | Species | Method | AUC | ACC | SN | SP |
|---|---|---|---|---|---|---|
| Balanced | Human | Piano | 0.592 | 0.560 | 0.855 | 0.265 |
| Balanced | Human | piRNApredictor | 0.894 | 0.812 | 0.859 | 0.764 |
| Balanced | Human | Ensemble Learning | 0.920 | 0.807 | 0.815 | 0.800 |
| Balanced | Human | GA-WE | 0.932 | 0.839 | 0.858 | 0.820 |
| Balanced | Mouse | Piano | 0.445 | 0.5365 | 0.837 | 0.236 |
| Balanced | Mouse | piRNApredictor | 0.892 | 0.819 | 0.862 | 0.776 |
| Balanced | Mouse | Ensemble Learning | 0.924 | 0.810 | 0.863 | 0.756 |
| Balanced | Mouse | GA-WE | 0.937 | 0.838 | 0.826 | 0.850 |
| Balanced | Drosophila | Piano | 0.741 | 0.692 | 0.836 | 0.547 |
| Balanced | Drosophila | piRNApredictor | 0.983 | 0.952 | 0.927 | 0.977 |
| Balanced | Drosophila | Ensemble Learning | 0.994 | 0.958 | 0.952 | 0.965 |
| Balanced | Drosophila | GA-WE | 0.995 | 0.959 | 0.949 | 0.966 |
| Imbalanced | Human | Piano | 0.449 | 0.747 | 0.000 | 1.000 |
| Imbalanced | Human | piRNApredictor | 0.905 | 0.847 | 0.548 | 0.949 |
| Imbalanced | Human | Ensemble Learning | 0.922 | 0.836 | 0.589 | 0.919 |
| Imbalanced | Human | GA-WE | 0.935 | 0.869 | 0.687 | 0.931 |
| Imbalanced | Mouse | Piano | 0.441 | 0.744 | 0.000 | 1.000 |
| Imbalanced | Mouse | piRNApredictor | 0.892 | 0.848 | 0.568 | 0.944 |
| Imbalanced | Mouse | Ensemble Learning | 0.928 | 0.849 | 0.586 | 0.940 |
| Imbalanced | Mouse | GA-WE | 0.939 | 0.889 | 0.745 | 0.939 |
| Imbalanced | Drosophila | Piano | 0.804 | 0.712 | 0.000 | 1.000 |
| Imbalanced | Drosophila | piRNApredictor | 0.982 | 0.961 | 0.902 | 0.985 |
| Imbalanced | Drosophila | Ensemble Learning | 0.995 | 0.965 | 0.920 | 0.984 |
| Imbalanced | Drosophila | GA-WE | 0.996 | 0.964 | 0.940 | 0.973 |
Performances of GA-WE and the state-of-the-art methods in the cross-species prediction
| Dataset | Species | Method | AUC | ACC | SN | SP |
|---|---|---|---|---|---|---|
| Balanced | Human | Piano | 0.431 | 0.558 | 0.878 | 0.238 |
| Balanced | Human | piRNApredictor | 0.850 | 0.783 | 0.781 | 0.784 |
| Balanced | Human | Ensemble Learning | 0.845 | 0.774 | 0.764 | 0.784 |
| Balanced | Human | GA-WE | 0.863 | 0.788 | 0.796 | 0.781 |
| Balanced | Drosophila | Piano | 0.367 | 0.587 | 0.905 | 0.270 |
| Balanced | Drosophila | piRNApredictor | 0.728 | 0.650 | 0.630 | 0.669 |
| Balanced | Drosophila | Ensemble Learning | 0.682 | 0.628 | 0.512 | 0.745 |
| Balanced | Drosophila | GA-WE | 0.687 | 0.668 | 0.639 | 0.698 |
| Imbalanced | Human | Piano | 0.426 | 0.747 | 0.000 | 1.000 |
| Imbalanced | Human | piRNApredictor | 0.856 | 0.823 | 0.507 | 0.931 |
| Imbalanced | Human | Ensemble Learning | 0.856 | 0.783 | 0.300 | 0.946 |
| Imbalanced | Human | GA-WE | 0.868 | 0.811 | 0.425 | 0.942 |
| Imbalanced | Drosophila | Piano | 0.369 | 0.713 | 0.000 | 1.000 |
| Imbalanced | Drosophila | piRNApredictor | 0.783 | 0.773 | 0.422 | 0.915 |
| Imbalanced | Drosophila | Ensemble Learning | 0.750 | 0.736 | 0.275 | 0.921 |
| Imbalanced | Drosophila | GA-WE | 0.746 | 0.774 | 0.370 | 0.936 |