Amani Al-Ajlan, Achraf El Allali.
Abstract
BACKGROUND: Computational approaches, specifically machine-learning techniques, play an important role in many metagenomic analysis algorithms, such as gene prediction. Due to the large feature space, current de novo gene prediction algorithms use different combinations of classification algorithms to distinguish between coding and non-coding sequences.
Keywords: Feature selection; Gene prediction; Metagenomics; ORF; Prokaryotes; mRMR
Year: 2018 PMID: 30026811 PMCID: PMC6047368 DOI: 10.1186/s13040-018-0170-z
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Training data
| GC range | GC content (%) | Number of ORFs |
|---|---|---|
| 1 | 0-36.57 | 713,474 |
| 2 | 36.57-41.57 | 716,896 |
| 3 | 41.57-46 | 728,133 |
| 4 | 46-50.14 | 705,792 |
| 5 | 50.14-54.28 | 741,691 |
| 6 | 54.28-58.14 | 710,639 |
| 7 | 58.14-61.85 | 705,692 |
| 8 | 61.85-65 | 724,478 |
| 9 | 65-68.28 | 729,822 |
| 10 | 68.28-100 | 742,300 |
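
The table above partitions the training ORFs into ten GC-content ranges. As a minimal sketch of how a sequence could be assigned to one of these ranges, the Python below computes a GC percentage and looks up the range index. The function names are hypothetical, and two details are assumptions not stated here: a value lying exactly on a boundary is placed in the higher range, and GC content is measured over the full length of whatever sequence is passed in.

```python
from bisect import bisect_right

# Boundaries (in % GC) between the ten ranges listed in the training-data table.
GC_BOUNDARIES = [36.57, 41.57, 46.0, 50.14, 54.28, 58.14, 61.85, 65.0, 68.28]


def gc_content(seq: str) -> float:
    """Return the GC percentage (0-100) of a nucleotide sequence."""
    if not seq:
        raise ValueError("sequence must not be empty")
    seq = seq.upper()
    gc = sum(1 for base in seq if base in "GC")
    return 100.0 * gc / len(seq)


def gc_range(gc_percent: float) -> int:
    """Map a GC percentage to a range index from 1 to 10.

    Assumption: a value exactly equal to a boundary goes to the higher range.
    """
    if not 0.0 <= gc_percent <= 100.0:
        raise ValueError("GC percentage must lie in [0, 100]")
    return bisect_right(GC_BOUNDARIES, gc_percent) + 1
```

For example, `gc_range(gc_content("ATGC"))` evaluates a sequence with 50% GC and yields range 4 (46-50.14).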
Testing data
| Genomes | GenBank accession no. | Number of ORFs |
|---|---|---|
| Archaeoglobus fulgidus | NC_000917 | 206,257 |
| Methanocaldococcus jannaschii | NC_000909 | 111,202 |
| Natronomonas pharaonis | NC_007426 | 241,784 |
| Buchnera aphidicola | NC_002528 | 38,541 |
| Corynebacterium jeikeium | NC_007164 | 239,797 |
| Chlorobaculum tepidum | NC_002932 | 206,807 |
| Helicobacter pylori | NC_000921 | 120,138 |
| Prochlorococcus marinus | NC_007577 | 117,755 |
| Wolbachia endosymbiont | NC_006833 | 86,338 |
| Burkholderia pseudomallei | NC_006350 | 311,856 |
| Pseudomonas aeruginosa | NC_002516 | 494,924 |
Fig. 1 The proposed algorithm
Classification error rates vs. number of features
| mRMR Feature-set size | Error rate |
|---|---|
| 60 | 0.0321 |
| 200 | 0.0264 |
| 250 | 0.0257 |
| 300 | 0.0253 |
| 350 | 0.0249 |
| 400 | 0.0246 |
| 450 | 0.0245 |
Best RBF parameters for each GC range
| GC range | Best cost | Best gamma | Accuracy (%) | Sensitivity (%) | Specificity (%) |
|---|---|---|---|---|---|
| 1 | 100 | 1.5 | 97.89 | 94.03 | 98.53 |
| 2 | 100 | 1.5 | 98.37 | 95.68 | 98.75 |
| 3 | 100 | 1.5 | 98.40 | 95.66 | 98.74 |
| 4 | 100 | 2 | 98.28 | 94.67 | 98.71 |
| 5 | 100 | 2 | 98.22 | 94.80 | 98.61 |
| 6 | 100 | 2 | 98.05 | 94.05 | 98.49 |
| 7 | 100 | 2 | 98.30 | 94.93 | 98.71 |
| 8 | 100 | 1.5 | 98.70 | 96.51 | 98.99 |
| 9 | 100 | 2 | 98.95 | 97.09 | 99.22 |
| 10 | 100 | 2 | 99.08 | 97.66 | 99.31 |
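
Cost and gamma in the table above are the two hyperparameters of an SVM with a radial basis function (RBF) kernel. As a rough illustration of what gamma controls, the sketch below evaluates the standard RBF kernel, K(x, y) = exp(-gamma * ||x - y||^2), and stores the per-range values from the table. The names `rbf_kernel`, `BEST_GAMMA`, and `BEST_COST` are hypothetical, and the SVM library actually used is not stated here.

```python
import math
from typing import Sequence

# Best gamma per GC range, copied from the table; the best cost is 100 throughout.
BEST_GAMMA = {1: 1.5, 2: 1.5, 3: 1.5, 4: 2.0, 5: 2.0,
              6: 2.0, 7: 2.0, 8: 1.5, 9: 2.0, 10: 2.0}
BEST_COST = 100


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

A larger gamma makes the kernel value fall off faster with distance, so each training example influences a smaller neighbourhood of the feature space.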
Comparison of SVM and neural network on testing data
| Genomes | SVM Sp | SVM Sn | SVM H.M. | Neural network Sp | Neural network Sn | Neural network H.M. |
|---|---|---|---|---|---|---|
| A. fulgidus | 96.46 | 87.26 | 91.61 | 95.60 | 82.09 | 88.33 |
| M. jannaschii | 97.29 | 94.58 | 95.91 | 97.21 | 93.30 | 95.21 |
| N. pharaonis | 97.37 | 82.71 | 89.44 | 96.10 | 77.27 | 85.66 |
| B. aphidicola | 97.94 | 93.28 | 95.56 | 98.11 | 92.16 | 95.04 |
| C. jeikeium | 97.31 | 88.64 | 92.77 | 97.15 | 84.84 | 90.58 |
| C. tepidum | 95.93 | 80.84 | 87.74 | 94.71 | 75.99 | 84.32 |
| H. pylori | 97.67 | 92.09 | 94.80 | 97.45 | 91.28 | 94.26 |
| P. marinus | 98.58 | 87.65 | 92.79 | 98.71 | 85.22 | 91.47 |
| W. endosymbiont | 88.10 | 89.66 | 88.87 | 88.69 | 87.04 | 87.86 |
| B. pseudomallei | 97.56 | 85.83 | 91.32 | 97.95 | 81.43 | 88.93 |
| P. aeruginosa | 97.64 | 88.88 | 93.05 | 97.70 | 86.90 | 91.98 |
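
The H.M. columns combine Sp and Sn into a single score. Assuming H.M. is the ordinary harmonic mean of the two percentages (consistent with the caption of Fig. 2, but not spelled out here), a minimal Python sketch is given below; the function name is hypothetical.

```python
def harmonic_mean(sp: float, sn: float) -> float:
    """Harmonic mean of specificity (Sp) and sensitivity (Sn), both in percent."""
    if sp < 0 or sn < 0:
        raise ValueError("Sp and Sn must be non-negative")
    if sp + sn == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2.0 * sp * sn / (sp + sn)


# SVM result for A. fulgidus, using the rounded Sp and Sn from the table.
hm_a_fulgidus = harmonic_mean(96.46, 87.26)
```

Recomputing from the rounded table entries gives roughly 91.63 for A. fulgidus against the listed 91.61; the small gap is presumably due to rounding or to averaging over replicates before rounding.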
Comparison of our method, Orphelia, MGC, and Prodigal on testing data
| Genomes | Our method (SVM) Sp | Our method (SVM) Sn | Our method (SVM) H.M. | Orphelia Sp | Orphelia Sn | Orphelia H.M. | MGC Sp | MGC Sn | MGC H.M. | Prodigal Sp | Prodigal Sn | Prodigal H.M. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A. fulgidus | 96.46 ±0.16 | 87.26 ±0.19 | 91.61 ±0.11 | 88.57 ±0.21 | 80.58 ±0.17 | 84.38 ±0.16 | 95.04 ±0.14 | 84.13 ±0.23 | 89.31 ±0.15 | 95.79 ±0.15 | 96.13 ±0.08 | 95.96 ±0.10 |
| M. jannaschii | 97.29 ±0.13 | 94.58 ±0.15 | 95.91 ±0.12 | 95.20 ±0.17 | 90.46 ±0.16 | 92.77 ±0.14 | 97.19 ±0.12 | 92.63 ±0.19 | 94.85 ±0.13 | 95.14 ±0.14 | 95.15 ±0.15 | 95.15 ±0.12 |
| N. pharaonis | 97.37 ±0.08 | 82.71 ±0.20 | 89.44 ±0.13 | 75.99 ±0.34 | 68.74 ±0.34 | 72.17 ±0.33 | 95.28 ±0.12 | 85.79 ±0.20 | 90.29 ±0.14 | 97.48 ±0.10 | 95.77 ±0.18 | 96.62 ±0.12 |
| B. aphidicola | 97.94 ±0.11 | 93.28 ±0.37 | 95.56 ±0.22 | 95.54 ±0.28 | 89.40 ±0.33 | 92.37 ±0.22 | 98.01 ±0.19 | 91.11 ±0.37 | 94.43 ±0.23 | 96.65 ±0.27 | 96.97 ±0.26 | 96.81 ±0.25 |
| C. jeikeium | 97.31 ±0.11 | 88.64 ±0.21 | 92.77 ±0.14 | 79.52 ±0.22 | 74.23 ±0.23 | 76.79 ±0.22 | 96.13 ±0.11 | 87.70 ±0.23 | 91.72 ±0.17 | 95.31 ±0.19 | 94.99 ±0.10 | 95.15 ±0.10 |
| C. tepidum | 95.93 ±0.12 | 80.84 ±0.23 | 87.74 ±0.17 | 77.51 ±0.22 | 66.95 ±0.23 | 71.85 ±0.21 | 93.42 ±0.14 | 79.08 ±0.24 | 85.65 ±0.18 | 94.35 ±0.14 | 88.15 ±0.19 | 91.14 ±0.11 |
| H. pylori | 97.67 ±0.12 | 92.09 ±0.21 | 94.80 ±0.12 | 94.17 ±0.20 | 88.99 ±0.22 | 91.50 ±0.20 | 97.77 ±0.14 | 89.70 ±0.22 | 93.56 ±0.17 | 95.29 ±0.14 | 93.07 ±0.14 | 94.16 ±0.12 |
| P. marinus | 98.58 ±0.07 | 87.65 ±0.25 | 92.79 ±0.16 | 94.41 ±0.20 | 84.98 ±0.24 | 89.45 ±0.20 | 97.71 ±0.11 | 87.92 ±0.20 | 92.55 ±0.12 | 97.52 ±0.17 | 91.96 ±0.20 | 94.66 ±0.15 |
| W. endosymbiont | 88.10 ±0.33 | 89.66 ±0.20 | 88.87 ±0.20 | 86.24 ±0.20 | 83.79 ±0.20 | 84.99 ±0.20 | 88.25 ±0.20 | 87.85 ±0.20 | 88.05 ±0.20 | 81.52 ±0.41 | 92.27 ±0.25 | 86.56 ±0.31 |
| B. pseudomallei | 97.56 ±0.06 | 85.83 ±0.19 | 91.32 ±0.11 | 69.54 ±0.31 | 64.79 ±0.22 | 67.08 ±0.26 | 94.79 ±0.13 | 87.84 ±0.25 | 91.18 ±0.18 | 94.28 ±0.09 | 96.47 ±0.09 | 95.37 ±0.08 |
| P. aeruginosa | 97.64 ±0.08 | 88.88 ±0.14 | 93.05 ±0.11 | 71.21 ±0.20 | 68.40 ±0.18 | 69.78 ±0.19 | 96.16 ±0.09 | 91.70 ±0.11 | 93.88 ±0.08 | 96.47 ±0.05 | 97.88 ±0.06 | 97.17 ±0.05 |
Fig. 2 Harmonic mean of our method, Orphelia, MGC, and Prodigal