Tulio L Campos, Pasi K Korhonen, Robin B Gasser, Neil D Young.
Abstract
The availability of whole-genome sequences and associated multi-omics data sets, combined with advances in gene knockout and knockdown methods, has enabled large-scale annotation and exploration of gene and protein functions in eukaryotes. Knowing which genes are essential for the survival of eukaryotic organisms is paramount for an understanding of the basic mechanisms of life, and could assist in identifying intervention targets in eukaryotic pathogens and cancer. Here, we studied essential gene orthologs among selected species of eukaryotes, and then employed a systematic machine-learning approach, using protein sequence-derived features and selection procedures, to investigate essential gene predictions within and among species. We showed that essential gene orthologs make up only a small fraction of the total number of orthologs among the eukaryotic species studied. In addition, we demonstrated that machine-learning models trained with subsets of essentiality-related data performed better than random guessing of gene essentiality for a particular species. Consistent with our gene ortholog analysis, the prediction of essential genes among multiple (including distantly related) species is possible, yet challenging, suggesting that most essential genes are unique to a species. The present work provides a foundation for the expansion of genome-wide essentiality investigations in eukaryotes using machine-learning approaches.
Keywords: CRISPR, Clustered regularly interspaced short palindromic repeats; Essential genes; Essentiality prediction; Eukaryotes; GBM, Gradient boosting method; GI, Genetic interaction; GLM, Generalised linear model; GO, Gene ontology; ML, Machine-learning; Machine-learning; NN, Artificial neural network; OGEE, Online GEne essentiality database; PPI, Protein-protein interaction; PR-AUC, Area under the precision-recall curve; RF, Random Forest; RNAi, RNA interference; ROC-AUC, Area under the receiver operating characteristic curve; SPLS, Sparse partial least squares; SVM, Support-Vector machine
Year: 2019 PMID: 31312416 PMCID: PMC6607062 DOI: 10.1016/j.csbj.2019.05.008
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Fig. 1. Bioinformatic workflow for essential gene classification and evaluation using protein sequence-derived features and machine-learning methods.
Protein sequence-derived features utilised in the present study.
| Description | Number of features |
|---|---|
| Amino acid composition | 20 |
| Dipeptide composition | 400 |
| Tripeptide composition | 8000 |
| Protein autocorrelation features | 720 |
| Conjoint triad | 343 |
| Composition/Transition/Distribution | 147 |
| Quasi-Sequence-Order | 160 |
| Pseudo amino acid composition | 130 |
| Total | 9920 |
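
The first three rows of the table match the number of possible k-mers over the 20 standard amino acids (20^1 = 20, 20^2 = 400, 20^3 = 8000). The following is a minimal sketch of how such composition features could be computed. It assumes each feature is the relative frequency of a k-mer in overlapping windows; the function names `kmer_composition` and `composition_features` are hypothetical, and the exact definitions and software used in the study are not given here.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_composition(sequence, k):
    """Relative frequency of every possible k-mer over the 20 standard residues.

    Assumption: overlapping windows, normalised by the number of valid windows.
    """
    keys = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(keys, 0)
    total = 0
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:  # windows with non-standard symbols are skipped
            counts[kmer] += 1
            total += 1
    if total == 0:
        return {key: 0.0 for key in keys}
    return {key: counts[key] / total for key in keys}


def composition_features(sequence):
    """Concatenate amino acid, dipeptide and tripeptide compositions."""
    features = {}
    for k in (1, 2, 3):
        features.update(kmer_composition(sequence.upper(), k))
    return features


if __name__ == "__main__":
    feats = composition_features("MKTAYIAKQR")  # arbitrary toy sequence
    print(len(feats))  # 20 + 400 + 8000 features
```

The tripeptide block alone contributes 8000 mostly zero-valued columns for a typical protein, which is one reason a feature selection step is a natural companion to this kind of encoding.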
Fig. 2. A. Summary of gene essentiality data obtained from different sources and used in the present study. Included are the number of genes found with multiple conflicting entries (inconsistent) as well as genes not reported as either essential or non-essential, complementing the predicted proteomes. B. Diagram exhibiting the total (red) and shared (blue) ortholog identifiers of essential genes from the OrthoOMA database used in the present study (selected species and data sets). C. Pairwise essential gene orthologs identified using the OrthoOMA ortholog groups (format: species1_source1_species2_source2). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Fig. 3. Performance evaluation of essential gene classification of training sets (self-predictions) within selected eukaryotic species, using the areas under the receiver operating characteristic and precision-recall curves (ROC-AUC and PR-AUC; training set sizes between 10 and 90%, in 10% increments). The dots represent the calculated ROC-AUC/PR-AUC values, and linear models are fitted to the dots to summarise the performance of each machine-learning algorithm. Feature selection procedures were performed for each subsample.
Fig. 4. Performance evaluation of essential gene classification of test sets within selected eukaryotic species, using the areas under the receiver operating characteristic and precision-recall curves (ROC-AUC and PR-AUC; training set sizes between 10 and 90%, in 10% increments). The dots represent the calculated ROC-AUC/PR-AUC values, and linear models are fitted to the dots to summarise the performance of each machine-learning algorithm. Feature selection procedures were performed for each subsample.
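
For readers unfamiliar with the two metrics used in these figures, the sketch below computes them from scratch on toy data. It is an illustration only: ROC-AUC is computed as the probability that a randomly chosen essential gene scores higher than a randomly chosen non-essential gene, and PR-AUC is approximated by average precision, which is one common estimator and not necessarily the one used in the study. The labels and scores are made up.

```python
def roc_auc(labels, scores):
    """ROC-AUC as P(score of a positive > score of a negative); ties count half.

    Quadratic in the number of genes, which is fine for a small illustration.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision: mean of precision values at the rank of each positive.

    Assumption: tied scores are ordered arbitrarily by the sort.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


if __name__ == "__main__":
    labels = [1, 0, 1, 0, 0, 0]            # 1 = essential (toy data)
    scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
    print(roc_auc(labels, scores))
    print(average_precision(labels, scores))
    print(sum(labels) / len(labels))       # PR-AUC expected from random guessing
```

A random classifier is expected to reach a ROC-AUC of 0.5 regardless of class balance, whereas its expected PR-AUC equals the fraction of essential genes in the set. This makes PR-AUC the more informative of the two when essential genes are a small minority.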
Fig. 5. Heatmaps depicting the prediction performances (y-axis: ROC-AUC and PR-AUC for each test set) of five machine-learning models (x-axis) trained using multiple essentiality data sets (labels on top of the heatmaps represent each of the training sets).
Fig. 6. Heatmaps depicting the prediction performances (y-axis: ROC-AUC and PR-AUC) of four machine-learning models (x-axis) using a leave-one-species-out approach. Labels on top of each heatmap represent the species that was excluded from the training set. The Mm_OGEE and Hs_OGEE data sets were not included in any of the training sets.
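
The leave-one-species-out scheme described in this caption can be sketched as a simple split generator. The sketch assumes data set names follow a `species_source` pattern (as in Mm_OGEE and Hs_OGEE) and that the test set for each round consists of all data sets from the excluded species; both points are assumptions, and `SpA_src1`, `SpB_src1` and `SpB_src2` are placeholder names, not data sets from the study.

```python
def species_of(dataset_name):
    """Species prefix of a 'species_source' data set name (assumed naming pattern)."""
    return dataset_name.split("_", 1)[0]


def leave_one_species_out(dataset_names, never_train=("Mm_OGEE", "Hs_OGEE")):
    """Yield (excluded_species, training_sets, test_sets) for each round.

    Data sets listed in never_train are kept out of every training set.
    Assumption: the test sets are all data sets of the excluded species.
    """
    trainable = [name for name in dataset_names if name not in never_train]
    for species in sorted({species_of(name) for name in trainable}):
        train = [name for name in trainable if species_of(name) != species]
        test = [name for name in dataset_names if species_of(name) == species]
        yield species, train, test


if __name__ == "__main__":
    # Placeholder names except Mm_OGEE and Hs_OGEE, which appear in the caption.
    names = ["SpA_src1", "SpB_src1", "SpB_src2", "Mm_OGEE", "Hs_OGEE"]
    for species, train, test in leave_one_species_out(names):
        print(species, train, test)
```

Splitting by species rather than by data set keeps every data set from the excluded species out of training, so the reported score reflects transfer to an unseen species and not to an unseen experiment on a known one.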