Xiao Liu, Bao-Jin Wang, Luo Xu, Hong-Ling Tang, Guo-Qing Xu.
Abstract
Genes that are indispensable for survival are essential genes. Many features have been proposed for computational prediction of essential genes. In this paper, the least absolute shrinkage and selection operator (LASSO) method was used to screen key sequence-based features related to gene essentiality. To assess the effect of the screening, the selected features were used to predict the essential genes of 31 bacterial species with a support vector machine classifier. For all 31 bacterial objects (21 Gram-negative objects and ten Gram-positive objects), the features in the Full, Gram-negative, and Gram-positive datasets were reduced from 57, 59, and 58 to 40, 37, and 38, respectively, without loss of prediction accuracy. The results showed that some features were redundant for gene essentiality and could therefore be eliminated from future analyses. The selected features contained more complex (or key) biological information for gene essentiality, and could be of use in related research projects, such as gene prediction, synthetic biology, and drug design.
Year: 2017 PMID: 28358836 PMCID: PMC5373589 DOI: 10.1371/journal.pone.0174638
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
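The screening step described in the abstract relies on the sparsity that an L1 penalty induces: coefficients of uninformative features are driven exactly to zero, and the features with non-zero coefficients form the selected set. The sketch below is a minimal, standard-library illustration of that idea using cyclic coordinate descent on a toy regression problem. The function names, the toy data, and the penalty value `alpha` are assumptions made for illustration; this is not the authors' implementation, and the LASSO settings actually used in the paper are not stated in this record.

```python
def soft_threshold(z, t):
    """Shrink z towards zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimise (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_w
    return w


# Hypothetical toy data: only the first column drives the response.
X = [
    [-1.5, 1.0, 1.0],
    [-0.5, -1.0, -1.0],
    [0.5, -1.0, 1.0],
    [1.5, 1.0, -1.0],
]
y = [-3.0, -1.0, 1.0, 3.0]

weights = lasso_coordinate_descent(X, y, alpha=0.1)
selected = [j for j, wj in enumerate(weights) if wj != 0.0]
print(weights, selected)
```

In this toy case only the first feature keeps a non-zero weight, and that weight is shrunk below the value 2 used to generate `y`, which is the usual price of the L1 penalty. In a workflow like the one in the abstract, the retained columns would then be passed to the classifier.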
Information on the 31 bacterial species.
| ID | Abbr. | NCBI Accession ID | Gram | Essential Gene Number | Sample Number |
|---|---|---|---|---|---|
| 1 | ABA | NC_005966 | - | 498 | 3307 |
| 2 | BSU | NC_000964 | + | 271 | 4175 |
| 3 | BFR | NC_016776 | - | 547 | 4290 |
| 4 | BTH | NC_004663 | - | 325 | 4778 |
| 5 | BPS | NC_006350/006351 | - | 505 | 5721 |
| 6 | BUT | NC_007650/007651 | - | 403 | 5631 |
| 7 | CJE | NC_002163 | - | 222 | 1572 |
| 8 | CCR | NC_011916 | - | 401 | 3182 |
| 9 | ECO | NC_000913 | - | 296 | 4140 |
| 10 | FNO | NC_008601 | - | 390 | 1719 |
| 11 | HIN | NC_000907 | - | 625 | 1602 |
| 12 | HPY | NC_000915 | - | 305 | 1457 |
| 13 | MTU | NC_000962 | + | 599 | 3872 |
| 14 | MGE | NC_000908 | + | 378 | 475 |
| 15 | MPU | NC_002771 | + | 309 | 782 |
| 16 | PGI | NC_010729 | - | 463 | 2089 |
| 17 | PAE | NC_002516 | - | 116 | 5476 |
| 18 | PAU | NC_008463 | - | 335 | 5892 |
| 19 | STY | NC_004631 | - | 347 | 4195 |
| 20 | STS | NC_016810 | - | 353 | 4446 |
| 21 | SET | NC_016856 | - | 104 | 5233 |
| 22 | SLT | NC_003197 | - | 228 | 4363 |
| 23 | SON | NC_004347 | - | 402 | 4065 |
| 24 | SWI | NC_009511 | - | 535 | 4850 |
| 25 | SAU | NC_002745 | + | 302 | 2582 |
| 26 | SAN | NC_007795 | + | 345 | 2751 |
| 27 | SPN | NC_003098 | + | 129 | 1793 |
| 28 | SPM | NC_007297 | + | 227 | 1865 |
| 29 | SPZ | NC_011375 | + | 241 | 1700 |
| 30 | SSA | NC_009009 | + | 218 | 2270 |
| 31 | VCH | NC_002505/002506 | - | 580 | 3351 |
Original features and results of selected features.
| Feature class | Abbreviation | Description | GN | GP | Full | Tool |
|---|---|---|---|---|---|---|
| Intrinsic feature | Gene size | Length of genes | | | | |
| | strand | Negative or positive strand | | | | |
| | protein size | Length of amino acids | | | | |
| Codon bias | T3s | Silent base compositions about T | | | | CodonW |
| | C3s | Silent base compositions about C | | | | |
| | A3s | Silent base compositions about A | | | | |
| | G3s | Silent base compositions about G | | | | |
| | CAI | Codon Adaptation Index | | | | |
| | CBI | Codon Bias Index | | | | |
| | Fop | Frequency of Optimal codons | | | | |
| | Nc | The effective number of codons | | | | |
| | GC3s | G+C content 3rd position of synonymous codons | | | | |
| | GC | G+C content of the gene | | | | |
| | L_sym | Length of system amino acids | | | | |
| | Gravy | Hydropathicity of protein | | | | |
| | Aromo | The frequency of aromatic amino acids | | | | |
| Amino acid usage | Amino acid | A, R, D, C, Q, H, I, N, L, K, M, F, P, S, T, W, Y, V | | | | |
| | Amino acid | R, D, C, E, H, L, G, N, K, F, P, S, T, M, V | | | | |
| | Amino acid | A, R, C, Q, D, H, I, G, N, L, K, M, F, P, S, T, W, V, Y | | | | |
| | Rare_aa_ratio | The frequencies of rare amino acids | | | | |
| | Close_aa_ratio | The number of codons that one third-base mutation is removed from a stop codon | | | | |
| Physio-chemical properties | M_weight | Molecular weight | | | | Pepstats |
| | I_Point | Isoelectric Point | | | | |
| | Tiny | Number of mole of the amino acids (A+C+G+S+T) | | | | |
| | Small | Number of mole of the amino acids (A+B+C+D+G+N+P+S+T+V) | | | | |
| | Aliphatic | Number of mole of the amino acids (A+I+L+V) | | | | |
| | Aromatic | Number of mole of the amino acids (F+H+W+Y) | | | | |
| | Non-polar | Number of mole of the amino acids (A+C+F+G+I+L+M+P+V+W+Y) | | | | |
| | Polar | Number of mole of the amino acids (D+E+H+K+N+Q+R+S+T+Z) | | | | |
| | Charged | Number of mole of the amino acids (B+D+E+H+K+R+Z) | | | | |
| | Basic | Number of mole of the amino acids (H+K+R) | | | | |
| | Acidic | Number of mole of the amino acids (B+D+E+Z) | | | | |
| Transmembrane helices | ExpAA | The number of transmembrane amino acids | | | | TMHMM3 |
| | First60 | The number of transmembrane amino acids in first 60 | | | | |
| | PredHel | The final prediction of transmembrane helices | | | | |
| Subcellular localization | Cytom | Cytoplasmic Membrane Score | | | | PSORTb v3.0 |
| | Extra | Extracellular Score | | | | |
| | OuterM | Outer Membrane Score | | | | |
| | Peri | Periplasmic Score | | | | |
| | Cyto | Cytoplasmic Score | | | | |
| | Cellw | Cell wall Score | | | | |
| | Loc_s | Final Score | | | | |
| Hurst exponent | Hurst | The Hurst exponent | | | | R package |
| Total features (dimension) | | | 37 | 38 | 40 | |
* indicates a selected feature. If a feature was selected from two or three of the sets (GN, GP, Full), then it should be considered significantly associated with essentiality.
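The rule in the note above (a feature counts as significantly associated with essentiality when it is selected in at least two of the GN, GP, and Full sets) amounts to a simple vote across the three selection results. The sketch below illustrates that vote. The feature names `feat_a` to `feat_e` and their assignments are hypothetical placeholders, not the paper's selection results, and `robust_features` is a helper name introduced here.

```python
from collections import Counter


def robust_features(selections, min_sets=2):
    """Return features selected in at least `min_sets` of the given sets."""
    counts = Counter(f for feats in selections.values() for f in set(feats))
    return sorted(f for f, c in counts.items() if c >= min_sets)


# Hypothetical selection results, for illustration only.
selections = {
    "GN": {"feat_a", "feat_b", "feat_c"},
    "GP": {"feat_b", "feat_c", "feat_d"},
    "Full": {"feat_c", "feat_e"},
}

robust = robust_features(selections)
print(robust)
```

Raising `min_sets` to 3 would keep only the features selected in all three sets.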
Fig 1. Workflow of analysis procedures.
Fig 2. Three ROC curves for predicting essential genes based on the original and selected features.
(A) ROC curves for Gram-negative dataset. (B) ROC curves for Gram-positive dataset. (C) ROC curves for Full dataset.
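The area under a ROC curve, reported as AUC in the tables, can be read as the probability that a randomly chosen essential gene receives a higher classifier score than a randomly chosen non-essential gene. The sketch below computes it directly from that pairwise definition. The labels and scores are made-up toy values, and `roc_auc` is a name introduced here; the quadratic loop is chosen for clarity and would be slow on genome-scale data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 1 = essential, 0 = non-essential (hypothetical scores).
labels = [1, 1, 0, 0, 0, 1]
scores = [0.9, 0.6, 0.7, 0.3, 0.2, 0.4]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

Here seven of the nine positive-negative pairs are ordered correctly, so the AUC is 7/9.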
Comparison of the classification results of original and selected features.
| Metric | Gram-Negative: Original features | Gram-Negative: Selected features | Gram-Negative: Variation | Gram-Positive: Original features | Gram-Positive: Selected features | Gram-Positive: Variation | Full: Original features | Full: Selected features | Full: Variation |
|---|---|---|---|---|---|---|---|---|---|
| Sensitivity | 0.695 | 0.713 | 0.019 | 0.737 | 0.729 | -0.009 | 0.708 | 0.715 | 0.007 |
| Specificity | 0.737 | 0.733 | -0.005 | 0.752 | 0.769 | 0.016 | 0.743 | 0.736 | -0.006 |
| AVE | 0.716 | 0.723 | 0.007 | 0.745 | 0.749 | 0.004 | 0.725 | 0.726 | 0.000 |
| AUC | 0.782 | 0.790 | 0.009 | 0.826 | 0.828 | 0.002 | 0.797 | 0.794 | -0.003 |
| Number of features | 59 | 37 | -22 | 58 | 38 | -20 | 57 | 40 | -17 |
| Optimization time (Sec) | 56439 | 46421 | -17.750% | 3671 | 3003 | -18.197% | 140375 | 116976 | -16.669% |
| Classification and prediction time (Sec) | 2245 | 1175 | -47.661% | 152 | 112 | -26.316% | 3853 | 3399 | -11.783% |
| Total running time (Sec) | 58684 | 47596 | -18.894% | 3823 | 3115 | -18.519% | 144228 | 120375 | -16.538% |
a The optimized parameters include C and gamma, and were determined using the Grid Method with default parameters.
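The percentages in the Variation columns of the timing rows are consistent with a relative change, (selected - original) / original, expressed in percent. The short sketch below reproduces the total running time figures from the table under that assumption; `relative_change` is a helper name introduced here.

```python
def relative_change(original, selected):
    """Relative change from `original` to `selected`, in percent."""
    return (selected - original) / original * 100.0


# Total running time in seconds (original, selected), copied from the table.
total_time = {
    "GN": (58684, 47596),
    "GP": (3823, 3115),
    "Full": (144228, 120375),
}

changes = {name: relative_change(o, s) for name, (o, s) in total_time.items()}
for name, pct in changes.items():
    print(f"{name}: {pct:.3f}%")
```

All three datasets show a reduction of roughly 17 to 19 percent in total running time after feature selection.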
Comparison of the prediction performance.
| Our Results | Saha S, | Song K, | Ning LW, | Deng J, | Gustafson AM, | Gerdes SY, | Joyce AR, | Plaimas, | Ye YN, | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GN | GP | FULL (GN+GP) | Max | Combine | ||||||||||||||
| SVM | KNN | (Min) | BLAST | CEG_MATCH | ||||||||||||||
| Sensitivity | 0.709 | 0.733 | 0.715 | 0.768 | 0.742 | 0.760 | 0.792 | 0.904 | / | / | / | 0.73 | 0.52 | 0.68 | 0.26 | / | / | / |
| (0.609) | ||||||||||||||||||
| Specificity | 0.733 | 0.786 | 0.736 | / | / | 0.867 | 0.858 | 0.926 | / | / | / | 0.92 | 0.96 | 0.88 | 0.25 | / | 0.431 | 0.60 |
| (0.778) | (0.345) | (0.694) | ||||||||||||||||
| AVE | 0.721 | 0.760 | 0.726 | / | / | 0.814 | 0.825 | 0.898 | / | / | / | / | / | / | / | / | / | / |
| (0.735) | ||||||||||||||||||
| AUC | 0.789 | 0.838 | 0.794 | 0.81 | 0.81 | 0.866 | 0.870 | 0.937 | 0.82 | 0.74 | 0.76 | / | / | / | / | 0.81 | / | / |
| (0.804) | (0.75) | |||||||||||||||||
| ACC | 0.731 | 0.763 | 0.734 | 0.741 | 0.734 | 0.904 | 0.903 | 0.960 | / | / | / | / | / | / | / | / | 0.694 | 0.712 |
| (0.813) | (0.677) | (0.701) | ||||||||||||||||
| PPV | 0.226 | 0.330 | 0.243 | 0.731 | 0.730 | 0.709 | 0.673 | 0.942 | / | / | / | 0.44 | 0.53 | 0.33 | 0.42 | / | / | / |
| (0.435) | ||||||||||||||||||
| Number of features | 37 | 38 | 40 | 13 | 13 | 494 | 494 | / | 158 | 158 | 158 | 13 | 28 | / | / | 33 | / | / |
| Number of objects | 21 | 10 | 31 | 1 | 1 | 11 | 11 | / | 1 | 1 | 16 | 1 | 1 | 1 | 1 | 3 | 16 | 16 |
1 k-nearest neighbor (KNN) method
2 Data of E. coli were used for training. The data of the other 11 objects were used as the test set, and the results were averaged.
3 Data of B. subtilis were used for training. The data of the other 11 objects were used as the test set, and the results were averaged.
4 The maximum and the minimum values of the prediction results.
5 Results based on cross validation were chosen for comparison.
6 BLAST: Identity >50. CEG_MATCH: K = 3.
7 The features were the 93’ Z-curve features (252 variables), orthologs values (187), and other DNA or amino acid sequence based features (55).
8 The features were the topology features (25) and the genomic and transcriptomic features (8).
9 Results of 16 objects were listed in [34].
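The low PPV values in the "Our Results" columns, despite sensitivity and specificity above 0.7, follow from class imbalance: essential genes are a small fraction of all samples. The sketch below shows the relationship, together with AVE, which in the tables matches the mean of sensitivity and specificity. The totals 10999 and 103624 are summed here from the "Essential Gene Number" and "Sample Number" columns of the species table. Treating the Full dataset as one pooled sample is an assumption of this sketch, not a statement of how the authors computed PPV, and the function names are introduced here.

```python
def ave(sensitivity, specificity):
    """Average of sensitivity and specificity."""
    return (sensitivity + specificity) / 2.0


def ppv_from_rates(sensitivity, specificity, prevalence):
    """Positive predictive value implied by the rates and class prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Totals summed from the species table (assumed pooled over all 31 objects).
essential_total = 10999
sample_total = 103624
prevalence = essential_total / sample_total

# Full-dataset sensitivity and specificity from the comparison table.
full_ave = ave(0.715, 0.736)
full_ppv = ppv_from_rates(0.715, 0.736, prevalence)
print(round(prevalence, 3), round(full_ave, 4), round(full_ppv, 3))
```

With about 10.6% of samples being essential, the implied PPV is close to the 0.243 reported for the Full dataset, which suggests that the low PPV reflects the imbalance and not a weak classifier.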