Ranjan Kumar Barman, Anirban Mukhopadhyay, Santasabuj Das.
Abstract
Bacterial small non-coding RNAs (sRNAs) are not translated into proteins, but act as functional RNAs. They are involved in diverse biological processes such as virulence, stress response and quorum sensing. Several high-throughput techniques have enabled identification of sRNAs in bacteria, but experimental detection remains challenging and is grossly incomplete for most species. Thus, there is a need to develop computational tools to predict bacterial sRNAs. Here, we propose a computational method to identify sRNAs in bacteria using a support vector machine (SVM) classifier. The primary sequence and secondary structure features of experimentally validated sRNAs of Salmonella Typhimurium LT2 (SLT2) were used to build the optimal SVM model. We found that a tri-nucleotide composition feature of sRNAs achieved an accuracy of 88.35% for SLT2. We also validated the SVM model on the experimentally detected sRNAs of E. coli and Salmonella Typhi. The proposed model attained an accuracy of 81.25% and 88.82% for E. coli K-12 and S. Typhi Ty2, respectively. We confirmed that this method significantly improved the identification of sRNAs in bacteria. Furthermore, we used a sliding window-based method and identified sRNAs from the complete genomes of SLT2, S. Typhi Ty2 and E. coli K-12 with sensitivities of 89.09%, 83.33% and 67.39%, respectively.
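To make the tri-nucleotide composition feature concrete, the following is a minimal sketch of how a 64-element feature vector could be computed from a sequence. It is an illustration, not the authors' implementation: the function name `trinucleotide_composition`, the use of overlapping windows, the normalisation to fractions, and the handling of ambiguous bases are all assumptions. The vector length of 64 (4^3) does agree with the "Vector length" column reported in the tables below.

```python
from itertools import product

# All 64 trinucleotides in a fixed lexicographic order: AAA, AAC, ..., TTT.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]


def trinucleotide_composition(seq):
    """Return a 64-element list of overlapping trinucleotide fractions.

    Assumptions (not taken from the paper): windows overlap with step 1,
    U is treated as T, and windows containing other symbols are skipped.
    """
    seq = seq.upper().replace("U", "T")
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    total = 0
    for i in range(len(seq) - 2):
        kmer = seq[i:i + 3]
        if kmer in counts:  # skip windows with ambiguous bases such as N
            counts[kmer] += 1
            total += 1
    if total == 0:
        # Sequence too short or fully ambiguous: return an all-zero vector.
        return [0.0] * len(TRINUCLEOTIDES)
    return [counts[k] / total for k in TRINUCLEOTIDES]


# Hypothetical example sequence, used only to show the call.
features = trinucleotide_composition("ATGCGATACGCTTGA")
print(len(features))
```

A vector of this kind could then be passed to any classifier; mono- and di-nucleotide compositions (lengths 4 and 16) follow the same pattern with `repeat=1` and `repeat=2`.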
Year: 2017 PMID: 28383059 PMCID: PMC5382675 DOI: 10.1038/srep46070
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Performance measures for different combinations of features in the SLT2 dataset, using the RBF kernel of the SVM.
| Features set | Vector length | P(+): N(−) | Threshold | Sensitivity (%) | Specificity (%) | Accuracy (%) | PPV (%) | MCC | F1 score (%) | AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| All nucleotides composition | ||||||||||
| Tri-nucleotide composition | ||||||||||
| Mono and di-nucleotide composition | 20 | 1:1 | 0.00 | 76.94 | 85.78 | 81.36 | 85.42 | 0.64 | 80.96 | 0.888 |
| Di-nucleotide composition | 16 | 1:1 | 0.00 | 78.06 | 86.83 | 82.44 | 87.02 | 0.66 | 82.29 | 0.884 |
| Mono-nucleotide composition | 4 | 1:1 | 0.10 | 78.50 | 64.39 | 71.44 | 68.82 | 0.44 | 73.34 | 0.754 |
| Best features out of all nucleotide composition features (84), selected using Welch two-sample t-test, P value < 0.05. | 38 | 1:1 | 0.00 | 88.89 | 82.50 | 85.69 | 83.95 | 0.72 | 86.35 | 0.906 |
| Best features out of all nucleotide composition features (84), selected using Welch two-sample t-test, P value < 0.05. Here, the compositions were significantly higher in the positive (+ve) set than in the negative (−ve) set. | 13 | 1:1 | 0.20 | 85.06 | 78.72 | 81.89 | 80.75 | 0.65 | 82.84 | 0.893 |
| Best features out of all nucleotide composition features (84), selected using Welch two-sample t-test, P value < 0.05. Here, the compositions were significantly higher in the negative (−ve) set than in the positive (+ve) set. | 25 | 1:1 | 0.10 | 82.61 | 87.28 | 84.94 | 87.27 | 0.70 | 84.88 | 0.890 |
| Stem, Loop and Minimum free energy (MFE) | 3 | 1:1 | 0.00 | 61.17 | 80.83 | 71.00 | 77.08 | 0.43 | 68.21 | 0.732 |
| Stem and Loop | 2 | 1:1 | −0.30 | 43.94 | 69.89 | 56.92 | 59.14 | 0.15 | 50.42 | 0.587 |
| Stem | 1 | 1:1 | 0.70 | 74.06 | 34.56 | 54.31 | 53.02 | 0.12 | 61.80 | 0.535 |
| Loop | 1 | 1:1 | 0.30 | 92.94 | 43.61 | 68.28 | 62.64 | 0.43 | 74.84 | 0.667 |
| MFE | 1 | 1:1 | −0.10 | 76.39 | 36.72 | 56.56 | 54.94 | 0.14 | 63.92 | 0.547 |
Optimal parameter sets were used for the respective combinations of features.
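The column headings in these tables (sensitivity, specificity, accuracy, PPV, MCC, F1 score) follow the standard confusion-matrix definitions. The sketch below shows those textbook formulas; it is not the authors' evaluation code, the function name `classification_metrics` is hypothetical, and the counts in the example are invented purely for illustration. How the paper averaged results across folds or class ratios is not stated here, so no attempt is made to reproduce the table values.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Compute standard binary-classification metrics from confusion counts.

    Assumes each of (tp + fn), (tn + fp) and (tp + fp) is non-zero.
    """
    sensitivity = tp / (tp + fn)          # true positive rate (recall)
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    ppv = tp / (tp + fp)                  # positive predictive value (precision)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "ppv": ppv,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts, not taken from the paper.
metrics = classification_metrics(tp=80, fn=20, tn=90, fp=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```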
SVM performance measures on balanced and imbalanced SLT2 datasets.
| Best feature sets | Vector length | P(+): N(−) | Threshold | Sensitivity (%) | Specificity (%) | Accuracy (%) | PPV (%) | MCC | F1 score (%) | AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| All nucleotides composition features | 84 | 1:1 | −0.20 | 89.00 | 86.83 | 87.92 | 87.75 | 0.76 | 88.37 | 0.929 |
| Tri-nucleotide composition features | 64 | 1:1 | 0.30 | 84.61 | 91.78 | 88.19 | 91.75 | 0.77 | 88.04 | 0.938 |
| All nucleotides composition features | 84 | 1:2 | 0.00 | 84.00 | 89.64 | 86.82 | 89.21 | 0.74 | 86.53 | 0.924 |
| Tri-nucleotide composition features | ||||||||||
| All nucleotides composition features | 84 | 1:3 | −0.10 | 82.89 | 90.30 | 86.59 | 89.79 | 0.74 | 86.20 | 0.931 |
| Tri-nucleotide composition features | 64 | 1:3 | 0.00 | 81.83 | 92.70 | 87.27 | 91.94 | 0.75 | 86.59 | 0.944 |
| All nucleotides composition features | 84 | 1:4 | −0.10 | 81.89 | 91.28 | 86.58 | 90.51 | 0.74 | 85.98 | 0.921 |
| Tri-nucleotide composition features | 64 | 1:4 | 0.00 | 81.83 | 92.92 | 87.38 | 92.24 | 0.76 | 86.73 | 0.944 |
| All nucleotides composition features | 84 | 1:5 | −0.30 | 81.89 | 91.14 | 86.52 | 90.32 | 0.74 | 85.90 | 0.928 |
| Tri-nucleotide composition features | 64 | 1:5 | −0.50 | 88.44 | 86.50 | 87.47 | 86.83 | 0.75 | 87.63 | 0.943 |
| All nucleotides composition features | 84 | 1:10 | −0.90 | 89.06 | 84.01 | 86.53 | 84.96 | 0.73 | 86.96 | 0.935 |
| Tri-nucleotide composition features | 64 | 1:10 | −0.50 | 86.22 | 89.56 | 87.89 | 89.27 | 0.76 | 87.72 | 0.946 |
Optimal parameter sets were used for the respective balanced and imbalanced SLT2 datasets.
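The "Threshold" column suggests that a cut-off was applied to the classifier's output score before assigning a class. The following sketch illustrates the general idea of how moving such a cut-off trades sensitivity against specificity. It assumes that a sample is called positive when its score is greater than or equal to the threshold, which is a common convention but is not confirmed by the text here; the scores and labels are invented.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Return (sensitivity, specificity) for a given score cut-off.

    Assumption: a sample is predicted positive when score >= threshold.
    Labels are 1 for positive (sRNA) and 0 for negative.
    """
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical decision scores for four positives and four negatives.
scores = [1.2, 0.4, -0.1, -0.6, 0.3, -0.2, -0.8, -1.5]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

for threshold in (0.0, -0.5):
    sens, spec = sensitivity_specificity(scores, labels, threshold)
    print(f"threshold={threshold:+.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

In this toy data, lowering the cut-off from 0.0 to −0.5 raises sensitivity and lowers specificity, which is the general behaviour one would expect when scanning thresholds.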
Performance comparison of different machine learning methods.
| Machine learning method | Best feature sets | Vector length | P(+): N(−) | Sensitivity (%) | Specificity (%) | Accuracy (%) | PPV (%) | MCC | F1 score (%) | AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| SVM | Tri-nucleotide composition | |||||||||
| Multilayer perceptron | Tri-nucleotide composition | 64 | 1:2 | 81.87 | 89.56 | 85.71 | 88.69 | 0.71 | 85.14 | 0.908 |
| Random forest | Tri-nucleotide composition | 64 | 1:2 | 66.48 | 95.88 | 81.18 | 94.16 | 0.68 | 77.94 | 0.927 |
Optimal parameter sets were used for the respective methods.
Performance measures of the individual methods on the SLT2 dataset.
| Method | SLT2 Sensitivity (%) | SLT2 Specificity (%) | SLT2 Accuracy (%) |
|---|---|---|---|
| QRNA | 59.00 | 71.00 | 65.00 |
| Alifoldz | 42.00 | 87.00 | 64.50 |
| MSARi | 2.00 | 100.00 | 51.00 |
| zMFold | 90.00 | 49.00 | 69.50 |
| RNAz2 | 27.00 | 98.00 | 62.50 |
| dynalign | 28.00 | 86.00 | 57.00 |
| vsFold | 25.00 | 88.00 | 56.50 |
| Arnedo | 67.00 | 78.00 | 72.50 |
Performance measures of the proposed SVM model on experimentally verified sRNAs of other bacteria that were not used in the training and testing datasets.
| Bacterial strain | No. of sRNAs | P(+): N(−) | Sensitivity (%) | Specificity (%) | Accuracy (%) | PPV (%) | MCC | F1 score (%) | AUC |
|---|---|---|---|---|---|---|---|---|---|
| E. coli K-12 | 80 | 1:2 | 73.75 | 88.75 | 81.25 | 86.76 | 0.63 | 79.73 | 0.901 |
| S. Typhi Ty2 | 38 | 1:2 | 89.47 | 88.16 | 88.82 | 88.31 | 0.78 | 88.89 | 0.926 |
Performance of the sliding window-based approach for identifying sRNAs from complete genomes.
| Dataset | No. of experimentally verified sRNAs | No. of sRNAs in intergenic regions | Positively predicted by proposed SVM model | % of prediction |
|---|---|---|---|---|
| SLT2 | 182 | 165 | 147 | 89.09 |
| S. Typhi Ty2 | 38 | 30 | 25 | 83.33 |
| E. coli K-12 | 80 | 46 | 31 | 67.39 |
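As a rough illustration of the sliding-window idea, the sketch below enumerates fixed-length windows over a sequence and reproduces the percentage column of the table above from its counts (positively predicted sRNAs divided by sRNAs in intergenic regions). The window length and step size are placeholders, since the values used in the study are not given here, and the helper names `sliding_windows` and `percent_predicted` are hypothetical.

```python
def sliding_windows(seq, window, step):
    """Yield (start, subsequence) for every full-length window in seq."""
    for start in range(0, len(seq) - window + 1, step):
        yield start, seq[start:start + window]


def percent_predicted(predicted, in_intergenic):
    """Percentage of intergenic sRNAs that were positively predicted."""
    return round(100.0 * predicted / in_intergenic, 2)


# Placeholder window and step sizes on a toy sequence.
windows = list(sliding_windows("ACGTACGTAC", window=4, step=3))
print(windows)

# Counts taken from the table: (positively predicted, sRNAs in intergenic regions).
for predicted, total in ((147, 165), (25, 30), (31, 46)):
    print(percent_predicted(predicted, total))
```

In a full pipeline, each window would be converted to a feature vector and scored by the trained classifier; that step is omitted here because its details are not described in this section.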