| Literature DB >> 33260643 |
Nguyen Quoc Khanh Le1,2,3, Duyen Thi Do4, Truong Nguyen Khanh Hung5,6, Luu Ho Thanh Lam5,7, Tuan-Tu Huynh8,9, Ngan Thi Kim Nguyen10.
Abstract
Essential genes contain key genomic information that could be central to a comprehensive understanding of life and evolution. Because of their importance, the study of essential genes has been considered a crucial problem in computational biology. Computational methods for identifying essential genes have become increasingly popular as a way to reduce the cost and time consumption of traditional experiments. A few models have addressed this problem, but their performance remains unsatisfactory because of high-dimensional features and the use of traditional machine learning algorithms. Thus, there is a need for a novel model that improves predictive performance on this problem from DNA sequence features. This study took advantage of a natural language processing (NLP) model to learn biological sequences by treating them as natural language words. To learn from the NLP features, a supervised model was subsequently employed using an ensemble deep neural network. Our proposed method identified essential genes with sensitivity, specificity, accuracy, Matthews correlation coefficient (MCC), and area under the receiver operating characteristic curve (AUC) values of 60.2%, 84.6%, 76.3%, 0.449, and 0.814, respectively. The overall performance outperformed the single models without ensembling, as well as state-of-the-art predictors on the same benchmark dataset. This indicates the effectiveness of the proposed method in determining essential genes in particular, and in other sequencing problems in general.
Keywords: DNA sequencing; continuous bag of words; deep learning; ensemble learning; essential genetics and genomics; fastText; prediction model
Year: 2020 PMID: 33260643 PMCID: PMC7730808 DOI: 10.3390/ijms21239070
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. Identification of essential genes at different levels of fastText n-grams. The performance of 6-grams (area under the receiver operating characteristic curve (AUC) = 0.78) was better than that of the other levels.
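Before fastText can embed a gene, the DNA sequence has to be tokenized into "words". A minimal sketch of the overlapping n-gram tokenization implied by Figure 1 (the paper found 6-grams best) is shown below; the function name and the stride-1 window are our assumptions, not code from the paper.

```python
def dna_to_ngram_sentence(sequence, n=6, stride=1):
    """Slide a window of size n over the sequence, emitting overlapping
    n-grams so that a gene becomes a 'sentence' of n-gram 'words'."""
    seq = sequence.upper()
    return [seq[i:i + n] for i in range(0, len(seq) - n + 1, stride)]

# A 9-nt toy sequence yields 9 - 6 + 1 = 4 overlapping 6-grams:
print(dna_to_ngram_sentence("ATGCGTACG"))
# → ['ATGCGT', 'TGCGTA', 'GCGTAC', 'CGTACG']
```

Such sentences would then be fed to a word-embedding trainer (e.g., the fastText package named in the keywords) to produce fixed-length gene vectors.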
Hyperparameters of different machine learning and deep learning classifiers used in this study. kNN: k-nearest neighbors; RF: random forest; SVM: support vector machine; MLP: multi-layer perceptron; CNN: convolutional neural network.
| Classifier | Optimal Parameters |
|---|---|
| kNN | k = 10 |
| RF | |
| SVM | c = 23768, g = 0.001953125 |
| MLP | 100–50 nodes, dropout = 0.5, optimizer = adam, learning rate = 0.001 |
| CNN | |
Five-fold cross-validation performance in identifying essential genes using different machine learning, deep learning, and ensemble learning techniques. MCC: Matthews correlation coefficient.
| Classifier | Sens (%) | Spec (%) | Acc (%) | MCC | AUC | Time (s) |
|---|---|---|---|---|---|---|
| kNN | 43.5 | 87.6 | 73.2 | 0.348 | 0.747 | 0.27 |
| RF | 46.9 | 86.6 | 73.6 | 0.367 | 0.762 | 12.85 |
| SVM | 35.9 | 92.3 | 74.0 | 0.353 | 0.775 | 3.37 |
| MLP | 43.5 | 89.9 | 74.8 | 0.385 | 0.775 | 94.32 |
| CNN | 42.3 | 90.4 | 74.7 | 0.381 | 0.775 | 105.18 |
| Ensemble | 50.5 | 90.2 | 77.3 | 0.452 | 0.814 | 208.12 |
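The table above shows the ensemble beating every single classifier on accuracy, MCC, and AUC. The paper does not give its exact combination rule, so the sketch below shows plain soft voting, one common way to ensemble classifiers: average each model's predicted probability of the positive (essential) class and threshold the mean. The function name and threshold are our assumptions.

```python
def soft_vote(prob_lists, threshold=0.5):
    """Combine per-model positive-class probabilities by averaging.

    prob_lists: one list of probabilities per model, all the same length
    (one entry per gene). Returns (binary labels, mean probabilities).
    """
    n_models = len(prob_lists)
    means = [sum(p[i] for p in prob_lists) / n_models
             for i in range(len(prob_lists[0]))]
    return [int(m >= threshold) for m in means], means

# Three models scoring two genes: the first gene is voted essential.
labels, means = soft_vote([[0.9, 0.2], [0.6, 0.4], [0.7, 0.1]])
print(labels)  # → [1, 0]
```

Soft voting tends to smooth out individual models' miscalibrations, which is consistent with the ensemble's higher MCC here, at the cost of the longer runtime shown in the Time column.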
Five-fold cross-validation performance in identifying essential genes using different features. PseDNC: pseudo k-tuple dinucleotide composition; PseTNC: pseudo k-tuple trinucleotide composition.
| Features | Sens (%) | Spec (%) | Acc (%) | MCC | AUC |
|---|---|---|---|---|---|
| k-mer | 35.9 | 90.2 | 72.4 | 0.316 | 0.698 |
| PseDNC | 36.5 | 91.0 | 73.2 | 0.337 | 0.637 |
| PseTNC | 31.7 | 93.4 | 73.3 | 0.331 | 0.625 |
| PCPseDNC | 37.6 | 89.5 | 72.6 | 0.322 | 0.704 |
| PCPseTNC | 33.4 | 93.0 | 73.5 | 0.341 | 0.720 |
| fastText | 50.5 | 90.2 | 77.3 | 0.452 | 0.814 |
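All of the tables report the same four threshold-based metrics, which are standard functions of the confusion-matrix counts. A self-contained reference implementation (ours, not the paper's) is:

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy, and MCC from confusion counts.

    tp/fn: essential genes predicted essential / non-essential;
    tn/fp: non-essential genes predicted non-essential / essential.
    """
    sens = tp / (tp + fn)                      # true positive rate
    spec = tn / (tn + fp)                      # true negative rate
    acc = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sens, spec, acc, mcc

# Hypothetical counts on an imbalanced set (100 essential, 100 non-essential):
print(binary_metrics(tp=50, fp=10, tn=90, fn=50))
```

MCC is the most informative single number here because, unlike accuracy, it penalizes the low sensitivity that imbalanced gene data tends to produce.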
Prediction of essential genes using different state-of-the-art predictors. SMOTE: synthetic minority over-sampling technique; NLP: natural language processing; LASSO: least absolute shrinkage and selection operator; WPCA: weighted principal component analysis.
| Dataset | Predictors | Feature | Sens (%) | Spec (%) | Acc (%) | MCC |
|---|---|---|---|---|---|---|
| Original | Aromolaran [ | Auto covariance, pseudo nucleotide composition, k-mer | 40.8 | 90.7 | 74.4 | 0.371 |
| | Campos et al. [ | Nucleotide composition, correlation features | 38.5 | 93.0 | 75.2 | 0.390 |
| | Liu et al. [ | Sequence-based features and LASSO | 41.3 | 89.7 | 73.8 | 0.361 |
| | Tian et al. [ | Hybrid features | 36.5 | 93.9 | 75.2 | 0.389 |
| | Deng et al. [ | Intrinsic and context-dependent genomic features | 34.0 | 93.5 | 74.1 | 0.355 |
| | Xu et al. [ | Hybrid features and WPCA | 42.7 | 86.9 | 72.6 | 0.331 |
| | Nigatu et al. [ | Information theoretic features | 38.8 | 92.1 | 74.8 | 0.377 |
| | Lin et al. [ | Hybrid features | 35.6 | 92.0 | 73.5 | 0.345 |
| | Pheg [ | Nucleotide composition | 35.0 | 94.4 | 75.1 | 0.383 |
| | iEsGene-ZCPseKNC [ | Nucleotide composition | 44.6 | 89.0 | 74.5 | 0.380 |
| | Ours | NLP-based features | 50.5 | 90.2 | 77.3 | 0.452 |
| SMOTE | Aromolaran [ | Auto covariance, pseudo nucleotide composition, k-mer | 55.3 | 82.2 | 73.5 | 0.384 |
| | Campos et al. [ | Nucleotide composition, correlation features | 52.9 | 85.0 | 74.5 | 0.399 |
| | Liu et al. [ | Sequence-based features and LASSO | 45.2 | 89.2 | 74.8 | 0.389 |
| | Tian et al. [ | Hybrid features | 50.5 | 85.5 | 74.1 | 0.384 |
| | Deng et al. [ | Intrinsic and context-dependent genomic features | 45.2 | 85.9 | 72.6 | 0.341 |
| | Xu et al. [ | Hybrid features and WPCA | 54.8 | 82.7 | 73.6 | 0.386 |
| | Nigatu et al. [ | Information theoretic features | 46.6 | 86.9 | 73.8 | 0.368 |
| | Lin et al. [ | Hybrid features | 44.2 | 87.8 | 73.5 | 0.359 |
| | Pheg [ | Nucleotide composition | 53.8 | 86.4 | 75.7 | 0.426 |
| | iEsGene-ZCPseKNC [ | Nucleotide composition | 63.7 | 77.0 | 72.6 | 0.396 |
| | Ours | NLP-based features | 60.2 | 84.6 | 76.3 | 0.449 |
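The SMOTE rows in the comparison above rebalance the training data by synthesizing minority-class (essential gene) samples. A minimal sketch of the SMOTE idea, interpolating between a minority point and one of its k nearest neighbours in feature space, follows; the function name, tuple-based points, and fixed seed are our assumptions for illustration.

```python
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points, each placed at a random position
    on the segment between a minority point and one of its k nearest
    neighbours (squared Euclidean distance)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation fraction in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(smote_like(corners, n_new=2))  # two points inside the unit square
```

Because synthetic points lie on segments between real minority samples, SMOTE raises sensitivity (visible in the table) at some cost in specificity.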
Figure 2. Performance results of identifying essential genes in representative cross-species datasets using the proposed model. Detailed information and predictive accuracy for all species are shown in Supplementary Table S2.
Figure 3. Workflow of the study for identifying essential genes from sequence information. The input comprised genes of different lengths containing different nucleotides. Word-embedding features were extracted using the fastText package and then learned by an ensemble deep neural network. After the ensemble network, the output contained binary probabilities indicating whether the represented genes were essential. Red and green triangles, green circles, blue squares, and red pentagons are examples of data points.