Guifeng Tang, Jingwen Shi, Wenjian Wu, Xiang Yue, Wen Zhang.
Abstract
BACKGROUND: Bacterial small non-coding RNAs (sRNAs) have emerged as important elements in diverse physiological processes, including growth, development, cell proliferation, differentiation, metabolic reactions and carbon metabolism, and have attracted great attention. Accurate prediction of sRNAs is important but challenging, and helps to explore the functions and mechanisms of sRNAs.
Keywords: Ensemble learning; Neural network; Sequence-derived feature; Small RNA prediction
Year: 2018 PMID: 30577759 PMCID: PMC6302447 DOI: 10.1186/s12859-018-2535-1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Benchmark datasets of SLT2
| Dataset | Ratio | #Positive instances | #Negative instances |
|---|---|---|---|
| Balanced | 1:1 | 182 | 182 |
| Imbalanced | 1:2 | 182 | 364 |
| Imbalanced | 1:3 | 182 | 546 |
| Imbalanced | 1:4 | 182 | 728 |
| Imbalanced | 1:5 | 182 | 910 |
Fig. 1 The length distribution of sRNAs in SLT2
Sequence-derived features of sRNA
| Feature group | Index | Feature | Dimension | Parameter |
|---|---|---|---|---|
| Spectrum profile | F1 | 1-spectrum profile | 4 | No parameter |
| Spectrum profile | F2 | 2-spectrum profile | 16 | No parameter |
| Spectrum profile | F3 | 3-spectrum profile | 64 | No parameter |
| Spectrum profile | F4 | 4-spectrum profile | 256 | No parameter |
| Spectrum profile | F5 | 5-spectrum profile | 1024 | No parameter |
| Mismatch profile | F6 | (3, | 64 | |
| Mismatch profile | F7 | (4 | 256 | |
| Mismatch profile | F8 | (5 | 1024 | |
| Reverse complement k-mer | F9 | 1-RevcKmer | 2 | No parameter |
| Reverse complement k-mer | F10 | 2-RevcKmer | 10 | No parameter |
| Reverse complement k-mer | F11 | 3-RevcKmer | 32 | No parameter |
| Reverse complement k-mer | F12 | 4-RevcKmer | 136 | No parameter |
| Reverse complement k-mer | F13 | 5-RevcKmer | 512 | No parameter |
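| Pseudo nucleotide composition | F14 | PCPseDNC | 16 + λ | λ |
| Pseudo nucleotide composition | F15 | PCPseTNC | 64 + λ | λ |
| Pseudo nucleotide composition | F16 | SCPseDNC | 16 + 6 × λ | λ |
| Pseudo nucleotide composition | F17 | SCPseTNC | 64 + 12 × λ | λ |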
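The dimensions listed for the spectrum profile and reverse complement k-mer features follow from counting k-mers. The following is a minimal sketch, not the authors' implementation: it assumes sequences are written over the DNA alphabet ACGT, that profiles are normalized frequencies, and that a k-mer and its reverse complement are merged under the lexicographically smaller of the two. The function names are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: sequences use the DNA alphabet
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of a k-mer."""
    return kmer.translate(COMPLEMENT)[::-1]


def spectrum_profile(seq, k):
    """Normalized frequency of each of the 4**k possible k-mers in seq."""
    counts = {"".join(p): 0 for p in product(ALPHABET, repeat=k)}
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # windows with non-ACGT symbols are skipped
            counts[kmer] += 1
            total += 1
    return {kmer: (c / total if total else 0.0) for kmer, c in counts.items()}


def revc_kmer_profile(seq, k):
    """Spectrum profile with each k-mer merged with its reverse complement."""
    merged = {}
    for kmer, freq in spectrum_profile(seq, k).items():
        canonical = min(kmer, reverse_complement(kmer))
        merged[canonical] = merged.get(canonical, 0.0) + freq
    return merged


if __name__ == "__main__":
    seq = "ACGTTGCAACGT"  # hypothetical toy sequence
    for k in range(1, 6):
        print(k, len(spectrum_profile(seq, k)), len(revc_kmer_profile(seq, k)))
```

Under these assumptions the profile lengths for k = 1 to 5 are 4, 16, 64, 256, 1024 for the spectrum profile and 2, 10, 32, 136, 512 for RevcKmer, which agrees with the Dimension column. Even values of k give slightly more than half of 4^k because palindromic k-mers are their own reverse complement.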
Fig. 2 The workflow of WAEM and NNEM
Fig. 3 (a) AUC scores of the PCPseDNC and SCPseDNC-based models with the variation of the parameter λ on the balanced dataset; (b) AUC scores of the PCPseTNC and SCPseTNC-based models with the variation of the parameter λ on the balanced dataset
Performances of individual feature-based models constructed by RF and SVM on the balanced dataset
| Index | Feature | AUC (RF) | AUC (SVM) | ACC (RF) | ACC (SVM) | SN (RF) | SN (SVM) | SP (RF) | SP (SVM) |
|---|---|---|---|---|---|---|---|---|---|
| F1 | 1-spectrum profile | 0.682 | 0.657 | 0.560 | 0.512 | 0.912 | 0.985 | 0.209 | 0.039 |
| F2 | 2-spectrum profile | 0.829 | 0.821 | 0.756 | 0.749 | 0.792 | 0.788 | 0.720 | 0.711 |
| F3 | 3-spectrum profile | 0.909 | 0.874 | 0.834 | 0.800 | 0.863 | 0.835 | 0.805 | 0.765 |
| F4 | 4-spectrum profile | 0.923 | 0.909 | 0.860 | 0.840 | 0.873 | 0.866 | 0.846 | 0.814 |
| F5 | 5-spectrum profile | 0.912 | 0.896 | 0.842 | 0.822 | 0.847 | 0.874 | 0.838 | 0.770 |
| F6 | (3, | 0.769 | 0.795 | 0.679 | 0.717 | 0.807 | 0.812 | 0.552 | 0.622 |
| F7 | (4 | 0.880 | 0.885 | 0.797 | 0.816 | 0.814 | 0.843 | 0.780 | 0.789 |
| F8 | (5 | 0.913 | 0.907 | 0.835 | 0.832 | 0.848 | 0.882 | 0.822 | 0.782 |
| F9 | 1-RevcKmer | 0.632 | 0.655 | 0.516 | 0.542 | 0.972 | 0.935 | 0.060 | 0.150 |
| F10 | 2-RevcKmer | 0.842 | 0.804 | 0.765 | 0.726 | 0.828 | 0.817 | 0.702 | 0.636 |
| F11 | 3-RevcKmer | 0.924 | 0.868 | 0.855 | 0.791 | 0.848 | 0.831 | 0.863 | 0.750 |
| F12 | 4-RevcKmer | 0.938 | 0.894 | 0.880 | 0.818 | 0.880 | 0.869 | 0.880 | 0.768 |
| F13 | 5-RevcKmer | 0.937 | 0.906 | 0.874 | 0.829 | 0.859 | 0.856 | 0.889 | 0.802 |
| F14 | PCPseDNC | 0.895 | 0.905 | 0.827 | 0.828 | 0.850 | 0.868 | 0.803 | 0.787 |
| F15 | PCPseTNC | 0.931 | 0.922 | 0.862 | 0.857 | 0.856 | 0.848 | 0.868 | 0.865 |
| F16 | SCPseDNC | 0.902 | 0.888 | 0.825 | 0.811 | 0.841 | 0.810 | 0.809 | 0.811 |
| F17 | SCPseTNC | 0.905 | 0.910 | 0.825 | 0.840 | 0.854 | 0.841 | 0.795 | 0.839 |
Performances of individual feature-based models constructed by RF on the benchmark datasets
| Index | AUC (1:1, balanced) | AUC (1:2) | AUC (1:3) | AUC (1:4) | AUC (1:5) | ACC (1:1, balanced) | ACC (1:2) | ACC (1:3) | ACC (1:4) | ACC (1:5) |
|---|---|---|---|---|---|---|---|---|---|---|
| F1 | 0.682 | 0.718 | 0.730 | 0.729 | 0.738 | 0.560 | 0.691 | 0.754 | 0.804 | 0.840 |
| F2 | 0.829 | 0.847 | 0.862 | 0.865 | 0.868 | 0.756 | 0.789 | 0.836 | 0.863 | 0.877 |
| F3 | 0.909 | 0.917 | 0.921 | 0.928 | 0.930 | 0.834 | 0.856 | 0.887 | 0.905 | 0.915 |
| F4 | 0.923 | 0.933 | 0.930 | 0.934 | 0.933 | 0.860 | 0.884 | 0.906 | 0.921 | 0.930 |
| F5 | 0.912 | 0.894 | 0.872 | 0.869 | 0.863 | 0.842 | 0.864 | 0.882 | 0.896 | 0.910 |
| F6 | 0.769 | 0.808 | 0.822 | 0.832 | 0.840 | 0.679 | 0.766 | 0.809 | 0.843 | 0.866 |
| F7 | 0.880 | 0.902 | 0.910 | 0.917 | 0.922 | 0.797 | 0.842 | 0.870 | 0.894 | 0.909 |
| F8 | 0.913 | 0.924 | 0.929 | 0.938 | 0.939 | 0.835 | 0.871 | 0.901 | 0.916 | 0.927 |
| F9 | 0.632 | 0.657 | 0.667 | 0.679 | 0.691 | 0.516 | 0.619 | 0.707 | 0.755 | 0.791 |
| F10 | 0.842 | 0.847 | 0.865 | 0.875 | 0.875 | 0.765 | 0.796 | 0.836 | 0.867 | 0.882 |
| F11 | 0.924 | 0.926 | 0.933 | 0.941 | 0.944 | 0.855 | 0.879 | 0.901 | 0.920 | 0.930 |
| F12 | 0.938 | 0.949 | 0.948 | 0.954 | 0.954 | 0.880 | 0.902 | 0.918 | 0.931 | 0.942 |
| F13 | 0.937 | 0.932 | 0.923 | 0.924 | 0.920 | 0.874 | 0.897 | 0.910 | 0.925 | 0.936 |
| F14 | 0.895 | 0.883 | 0.886 | 0.888 | 0.884 | 0.827 | 0.805 | 0.835 | 0.864 | 0.876 |
| F15 | 0.931 | 0.922 | 0.922 | 0.924 | 0.921 | 0.862 | 0.855 | 0.876 | 0.895 | 0.902 |
| F16 | 0.902 | 0.894 | 0.890 | 0.890 | 0.887 | 0.825 | 0.833 | 0.859 | 0.882 | 0.897 |
| F17 | 0.905 | 0.898 | 0.901 | 0.903 | 0.899 | 0.825 | 0.822 | 0.854 | 0.877 | 0.897 |
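In the table above, ACC rises steadily as the negative class grows (for F1, from 0.560 at 1:1 to 0.840 at 1:5), while AUC changes much less. One reason is that AUC depends only on how positives are ranked relative to negatives, not on the class ratio. The sketch below computes AUC directly from that ranking definition; it is a plain illustration with a hypothetical function name, not the evaluation code used in the study.

```python
def auc_score(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    labels: iterable of 0/1 class labels (1 = sRNA, 0 = non-sRNA)
    scores: predicted scores, higher meaning "more likely positive"
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical toy labels and scores, for illustration only.
    print(auc_score([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))
```

The pairwise loop is quadratic in the number of instances, which is acceptable at the dataset sizes listed here (at most 182 positives and 910 negatives).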
Performances of WAEM and NNEM on the balanced and imbalanced datasets
| Dataset | Ratio | Method | AUC | ACC | SN | SP |
|---|---|---|---|---|---|---|
| Balanced | 1:1 | WAEM | 0.942 | 0.887 | 0.888 | 0.868 |
| Balanced | 1:1 | NNEM | 0.958 | 0.901 | 0.903 | 0.899 |
| Imbalanced | 1:2 | WAEM | 0.952 | 0.901 | 0.853 | 0.925 |
| Imbalanced | 1:2 | NNEM | 0.962 | 0.909 | 0.872 | 0.927 |
| Imbalanced | 1:3 | WAEM | 0.951 | 0.915 | 0.818 | 0.948 |
| Imbalanced | 1:3 | NNEM | 0.961 | 0.920 | 0.819 | 0.954 |
| Imbalanced | 1:4 | WAEM | 0.957 | 0.929 | 0.817 | 0.956 |
| Imbalanced | 1:4 | NNEM | 0.962 | 0.931 | 0.810 | 0.961 |
| Imbalanced | 1:5 | WAEM | 0.957 | 0.934 | 0.808 | 0.959 |
| Imbalanced | 1:5 | NNEM | 0.961 | 0.940 | 0.782 | 0.972 |
Fig. 4 Optimal weights for the WAEM models on the benchmark datasets. dataset1 means balanced dataset 1:1, dataset2 means imbalanced dataset 1:2, dataset3 means imbalanced dataset 1:3, dataset4 means imbalanced dataset 1:4, dataset5 means imbalanced dataset 1:5
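The record names WAEM and reports optimal weights for it, but does not spell out how the weights are applied. The sketch below assumes the simplest reading: each base predictor outputs a score per sequence, and the ensemble score is the weighted average of those scores. The function name and the toy inputs are hypothetical, and how the optimal weights are found is outside this sketch.

```python
def weighted_average_ensemble(score_lists, weights):
    """Combine per-model scores into one score per instance.

    score_lists: one list of scores per base model, all of equal length
    weights: one non-negative weight per base model (need not sum to 1)
    """
    if len(score_lists) != len(weights):
        raise ValueError("need exactly one weight per base model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n = len(score_lists[0])
    if any(len(scores) != n for scores in score_lists):
        raise ValueError("all base models must score the same instances")
    return [
        sum(w * scores[i] for w, scores in zip(weights, score_lists)) / total
        for i in range(n)
    ]


if __name__ == "__main__":
    # Two hypothetical base models scoring two sequences.
    print(weighted_average_ensemble([[1.0, 0.0], [0.0, 1.0]], [3, 1]))
```

Dividing by the weight total keeps the combined score on the same scale as the base scores, so a single decision threshold can still be applied afterwards.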
P-values of paired t-test on the AUCs of WAEM and NNEM on benchmark datasets
| Dataset | Balanced 1:1 | Imbalanced 1:2 | Imbalanced 1:3 | Imbalanced 1:4 | Imbalanced 1:5 |
|---|---|---|---|---|---|
| P-value | 1.67E-09 | 3.07E-06 | 7.26E-09 | 1.12E-05 | 5.70E-03 |
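A paired t-test compares two methods on matched measurements, here the AUCs of WAEM and NNEM. The record does not say what forms a pair; the sketch below assumes the two methods were evaluated on the same repeated runs or data splits, so that each run contributes one AUC difference. It computes only the t statistic and degrees of freedom; converting these to a P-value needs the Student t distribution, which is left out to keep the example dependency-free. The function name is hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b.

    t = mean(d) / (stdev(d) / sqrt(n)), where d holds the pairwise
    differences a[i] - b[i] and stdev is the sample standard deviation.
    """
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    if len(a) < 2:
        raise ValueError("need at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    n = len(diffs)
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


if __name__ == "__main__":
    # Hypothetical toy values, not results from the study.
    t, df = paired_t_statistic([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
    print(t, df)
```

Because the test works on differences within each pair, it can detect a small but consistent AUC gap even when the variation between runs is larger than the gap itself.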
Performances of different methods on benchmark datasets
| Dataset | Ratio | Method | AUC | ACC | SN | SP |
|---|---|---|---|---|---|---|
| Balanced | 1:1 | Carter’s method | 0.566 | 0.511 | 0.264 | 0.758 |
| Balanced | 1:1 | Barman’s method | 0.938 | 0.882 | 0.846 | 0.918 |
| Balanced | 1:1 | WAEM | 0.942 | 0.887 | 0.888 | 0.868 |
| Balanced | 1:1 | NNEM | 0.958 | 0.901 | 0.903 | 0.899 |
| Imbalanced | 1:2 | Carter’s method | 0.602 | 0.678 | 0.033 | 1.000 |
| Imbalanced | 1:2 | Barman’s method | 0.937 | 0.884 | 0.851 | 0.916 |
| Imbalanced | 1:2 | WAEM | 0.952 | 0.901 | 0.853 | 0.925 |
| Imbalanced | 1:2 | NNEM | 0.962 | 0.909 | 0.872 | 0.927 |
| Imbalanced | 1:3 | Carter’s method | 0.619 | 0.757 | 0.030 | 1.000 |
| Imbalanced | 1:3 | Barman’s method | 0.944 | 0.873 | 0.818 | 0.927 |
| Imbalanced | 1:3 | WAEM | 0.951 | 0.915 | 0.818 | 0.948 |
| Imbalanced | 1:3 | NNEM | 0.961 | 0.920 | 0.819 | 0.954 |
| Imbalanced | 1:4 | Carter’s method | 0.627 | 0.805 | 0.025 | 1.000 |
| Imbalanced | 1:4 | Barman’s method | 0.944 | 0.874 | 0.818 | 0.929 |
| Imbalanced | 1:4 | WAEM | 0.957 | 0.929 | 0.817 | 0.956 |
| Imbalanced | 1:4 | NNEM | 0.962 | 0.931 | 0.810 | 0.961 |
| Imbalanced | 1:5 | Carter’s method | 0.636 | 0.835 | 0.011 | 1.000 |
| Imbalanced | 1:5 | Barman’s method | 0.943 | 0.875 | 0.884 | 0.865 |
| Imbalanced | 1:5 | WAEM | 0.957 | 0.934 | 0.808 | 0.959 |
| Imbalanced | 1:5 | NNEM | 0.961 | 0.940 | 0.782 | 0.972 |
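ACC, SN and SP are linked through the class sizes, which explains how Carter’s method can reach a high ACC on the imbalanced datasets while detecting almost no sRNAs. The sketch below uses the standard definitions (SN is the fraction of positives predicted positive, SP the fraction of negatives predicted negative) and the instance counts from the benchmark dataset table. The function names are hypothetical, and reported values may be averages over repeated evaluations, so the identity need not hold exactly for every row.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Standard ACC, SN (sensitivity) and SP (specificity) from raw counts."""
    return {
        "ACC": (tp + tn) / (tp + fn + tn + fp),
        "SN": tp / (tp + fn),
        "SP": tn / (tn + fp),
    }


def accuracy_from_rates(sn, sp, n_pos, n_neg):
    """Accuracy implied by SN and SP for given class sizes.

    ACC = (SN * n_pos + SP * n_neg) / (n_pos + n_neg)
    """
    return (sn * n_pos + sp * n_neg) / (n_pos + n_neg)


if __name__ == "__main__":
    # Carter's method on the 1:2 dataset: SN = 0.033, SP = 1.000,
    # with 182 positive and 364 negative instances.
    print(round(accuracy_from_rates(0.033, 1.000, 182, 364), 3))
    # Carter's method on the 1:5 dataset: SN = 0.011, SP = 1.000,
    # with 182 positive and 910 negative instances.
    print(round(accuracy_from_rates(0.011, 1.000, 182, 910), 3))
```

The two calls give 0.678 and 0.835, matching the ACC reported for Carter’s method at 1:2 and 1:5. With SP at 1.000 and SN near zero, nearly all of that accuracy comes from labeling almost everything negative, which is why ACC alone is a weak criterion on the imbalanced datasets.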