David Tian, Stephanie Wenlock, Mitra Kabir, George Tzotzos, Andrew J Doig, Kathryn E Hentges.
Abstract
The genes that are required for organismal survival are annotated as 'essential genes'. Identifying all the essential genes of an animal species can reveal critical functions that are needed during the development of the organism. To inform studies on mouse development, we developed a supervised machine learning classifier based on phenotype data from mouse knockout experiments. We used this classifier to predict the essentiality of mouse genes lacking experimental data. Validation of our predictions against a blind test set of recent mouse knockout experimental data indicated a high level of accuracy (>80%). We also validated our predictions for other mouse mutagenesis methodologies, demonstrating that the predictions are accurate for lethal phenotypes isolated in random chemical mutagenesis screens and embryonic stem cell screens. The biological functions that are enriched in essential and non-essential genes have been identified, showing that essential genes tend to encode intracellular proteins that interact with nucleic acids. The genome distribution of predicted essential and non-essential genes was analysed, demonstrating that the density of essential genes varies throughout the genome. A comparison with human essential and non-essential genes was performed, revealing conservation between human and mouse gene essentiality status. Our genome-wide predictions of mouse essential genes will be of value for the planning of mouse knockout experiments and phenotyping assays, for understanding the functional processes required during mouse development, and for the prioritisation of disease candidate genes identified in human genome and exome sequence datasets.
Keywords: Essential genes; Essentiality database; Mouse knockout; Supervised machine learning
Year: 2018 PMID: 30563825 PMCID: PMC6307915 DOI: 10.1242/dmm.034546
Source DB: PubMed Journal: Dis Model Mech ISSN: 1754-8403 Impact factor: 5.758
Fig. 1. Prediction accuracies of the random forest classifiers. (A) ROC plot with AUC 0.803 for the random forest classifier trained on 80 features and tested on Test set 1. (B) Confusion matrix of the random forest classifier trained on 80 features and tested on Test set 1. (C) ROC plot with AUC 0.816 for the random forest classifier trained on the 39 features selected by the genetic algorithm feature selection and tested on blind test set 1. (D) Confusion matrix of the random forest classifier trained on the 39 features selected by the genetic algorithm feature selection and tested on blind test set 1.
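The two summaries reported in Fig. 1, a confusion matrix and an ROC AUC, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the labels and scores are invented toy values, the 0.5 decision threshold is an assumption, and the helper names `confusion_matrix` and `roc_auc` are hypothetical. The AUC is computed through its rank interpretation, the probability that a randomly chosen essential gene receives a higher score than a randomly chosen non-essential gene.

```python
from typing import List, Tuple


def confusion_matrix(y_true: List[int], y_pred: List[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels, where 1 = essential."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def roc_auc(y_true: List[int], scores: List[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct ranking.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (invented for illustration): 3 essential and 5 non-essential genes.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]

# Assumed decision threshold of 0.5 on the classifier score.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
auc = roc_auc(y_true, scores)
print(f"TP={tp} FP={fp} FN={fn} TN={tn} accuracy={accuracy:.2f} AUC={auc:.3f}")
```

The pairwise AUC is quadratic in the number of genes, which is fine for a toy example; a sort-based implementation would be preferable for a genome-wide test set.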
Fig. 2. Confusion matrices of the 6 classifiers trained on all 83 features. The machine learning algorithm is listed at the top of each chart: (A) random forest; (B) RBF kernel SVM; (C) linear SVM; (D) logistic regression; (E) naïve Bayes; (F) decision tree.
Fig. 3. Differences in ‘Essentiality’ gene prediction confidence levels for experimentally validated blind and alternative mutagenesis mouse genes. (A-C) A normal distribution was confirmed for alternative mutagenesis data (n=115 genes) using the Shapiro–Wilk test. Welch's 2-sample t-test identified a significant difference between correct and incorrect prediction confidence levels (P=0.0166) for predictions of alternative mutagenesis genes (A). Both essential (n=229 genes) and non-essential (n=802 genes) blind test set 1 data were not normally distributed (Shapiro–Wilk test). Using Wilcoxon's rank-sum 2-sided test, significant differences were found between prediction confidence levels of correct and incorrect predictions for essential (B) and non-essential (C) blind test set 1 genes (P=1.75×10⁻⁷ and P≤2.2×10⁻¹⁶, respectively).
Top 10 enriched GO terms found within DAVID for predicted essential and predicted non-essential mouse genes
GO Slim functional annotations for essential and non-essential genes
Network statistics of PPIs of known and predicted essential and non-essential datasets
Fig. 4. The genomic distribution of essential and non-essential mouse genes, separated into known and predicted essentiality. The percentages of essential and non-essential genes on each chromosome are compiled from the MED database. In the genome as a whole, we calculate that there are 28% essential genes and 72% non-essential genes when known and predicted essentiality statuses are combined. Data are provided in Table S8.
Human and mouse essential gene conservation