Tulio L Campos, Pasi K Korhonen, Paul W Sternberg, Robin B Gasser, Neil D Young.
Abstract
Defining genes that are essential for life has major implications for understanding critical biological processes and mechanisms. Although essential genes have been identified and characterised experimentally using functional genomic tools, it is challenging to predict such genes with confidence from molecular and phenomic data sets using computational methods. Using extensive data sets available for the model organism Caenorhabditis elegans, we constructed a machine-learning (ML)-based workflow for the prediction of essential genes on a genome-wide scale. We identified strong predictors for such genes and showed that trained ML models consistently achieve highly accurate classifications. Complementary analyses revealed an association between essential genes and chromosomal location. Our findings reveal that essential genes in C. elegans tend to be located in or near the centre of autosomal chromosomes; are positively correlated with low single nucleotide polymorphism (SNP) densities and epigenetic markers in promoter regions; are involved in protein and nucleotide processing; are transcribed in most cells; are enriched in reproductive tissues or are targets for small RNAs bound to the Argonaute CSR-1. Based on these results, we hypothesise an interplay between epigenetic markers and small RNA pathways in the germline, with transcription-based memory; this hypothesis warrants testing. From a technical perspective, further work is needed to evaluate whether the present ML-based approach will be applicable to other metazoans (including Drosophila melanogaster) for which comprehensive data sets (i.e. genomic, transcriptomic, proteomic, variomic, epigenetic and phenomic) are available.
Keywords: CDS, coding sequence; CRISPR, Clustered Regularly Interspaced Short Palindromic Repeats; Caenorhabditis elegans; ES, Essentiality Score; EST, expressed sequence tag; Essential genes; Essentiality predictions; GBM, Gradient Boosting Method; GFF, general feature format; GLM, Generalised Linear Model; GO, gene ontology; ML, machine-learning; Machine-learning; NN, Artificial Neural Network; PPI, protein-protein interaction; PR-AUC, Area Under the Precision-Recall Curve; RF, Random Forest; RNAi, RNA interference; ROC-AUC, Area Under the Receiver Operating Characteristic Curve; SNP, single nucleotide polymorphism; SPLS, Sparse Partial Least Squares; SVM, Support-Vector Machine; TEA, Tissue Enrichment Analysis tool (WormBase); TSS, transcription start site; VCF, variant call file
Year: 2020 PMID: 32489524 PMCID: PMC7251299 DOI: 10.1016/j.csbj.2020.05.008
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Fig. 1. Workflow employed in the present study. First, a wealth of publicly available ‘omics data sets for C. elegans was obtained (blue). Then, we applied a ‘scoring system’ to the phenomic data to annotate C. elegans genes for essentiality (green). Next, we extracted or engineered features (yellow) from the data sets to establish feature sets (FULL – all features; NR – all features from sequences containing <25% amino acid identity; NR_SELECTED – 28 highly-predictive features of essentiality, selected from the NR data set). These feature sets were used for a systematic evaluation of machine-learning (ML) approaches for essential gene predictions (orange). T-tests and correlation tests were performed on the FULL and NR_SELECTED sets, respectively. The performances of the individual ML models and the importance of the selected features for essentiality predictions were calculated and evaluated (orange). Finally, Gene Ontology (GO), transcription and tissue enrichment analyses were performed, as well as an analysis of the preferential genomic locations of SNPs and genes by essentiality annotation (grey). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Fig. 2. Curation of essential genes from phenotype data and performance of ML methods for essentiality predictions. A. C. elegans genes were curated for essentiality using phenotype data available in WormBase. For each gene, an essentiality score (ES) was calculated (y-axis) and ordered using the formula E²/T², where “E” is the number of entries relating to lethality/essentiality, and “T” is the total number of entries reported. Genes were annotated as ‘essential’ if ES was >0.9, ‘non-essential’ if ES was <0.1, or ‘conditionally-essential’ otherwise. B. In the systematic evaluation of gene essentiality predictions (‘essential’ vs. ‘non-essential’), the performance of six machine-learning (ML) algorithms and a default classifier was assessed, initially with a data set (FULL) containing all genes curated previously and their features. In addition, a non-redundant (NR) data set was created from genes whose sequences shared <25% amino acid sequence identity, and all features identified for these genes were included. Another data set containing the NR genes and a selection of the 28 best-predictive features (NR_SELECTED) was also evaluated. For each data set, random subsets of genes (10–90%, in 10% increments) were used as training sets (x-axis), and the remaining 90–10% were used as independent test sets. At each step, the prediction performance was evaluated on the test set using ROC-AUC (right) and PR-AUC (left) metrics. C. Violin and box plots of ROC-AUC and PR-AUC from 1000 bootstraps of RF, XGB and GBM, with random sampling of 90% of the NR_SELECTED set used for training and the remaining 10% of this feature set used for independent testing.
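The scoring rule in panel A can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: it assumes the formula in the legend is to be read as the squared counts E^2/T^2, and the gene names and entry counts are hypothetical.

```python
def essentiality_score(lethal_entries: int, total_entries: int) -> float:
    """Return ES = E^2 / T^2 (assumed reading of the formula in the legend)."""
    if total_entries <= 0:
        raise ValueError("total_entries must be positive")
    if not 0 <= lethal_entries <= total_entries:
        raise ValueError("lethal_entries must lie between 0 and total_entries")
    return lethal_entries ** 2 / total_entries ** 2


def annotate(es: float, upper: float = 0.9, lower: float = 0.1) -> str:
    """Apply the thresholds stated in the legend (>0.9 and <0.1)."""
    if es > upper:
        return "essential"
    if es < lower:
        return "non-essential"
    return "conditionally-essential"


# Hypothetical (E, T) counts per gene; not taken from WormBase.
phenotype_counts = {"geneA": (20, 20), "geneB": (1, 12), "geneC": (5, 10)}

annotations = {
    gene: annotate(essentiality_score(e, t))
    for gene, (e, t) in phenotype_counts.items()
}
```

Under this reading, squaring the ratio pushes genes with mixed evidence towards lower scores, so only genes with almost exclusively lethal entries exceed the 0.9 threshold.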
Fig. 3. Correlations of features with essentiality; distributions of single nucleotide polymorphisms (SNPs) and gene essentiality density along C. elegans chromosomes. A. The correlations (x-axis) of 28 highly-predictive features (y-axis) with gene essentiality. B. The pairwise correlations among these 28 predictors. C. The distribution of SNPs (1000-bp windows) along C. elegans chromosomes, based on a variant-call file (VCF) derived from whole-genome sequencing of natural C. elegans populations [37]. D. Density plots showing the distributions of genes along C. elegans chromosomes, stratified by essentiality annotations (red – ‘essential’; blue – ‘non-essential’; yellow – ‘conditionally-essential’). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
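The windowed SNP counts in panel C can be illustrated with a minimal sketch. It assumes that SNP positions for one chromosome have already been read from the VCF as 1-based coordinates and that windows are fixed and non-overlapping; the function name and the toy positions are hypothetical, and the authors' actual procedure may differ.

```python
def snp_counts_per_window(positions, chrom_length, window=1000):
    """Count SNPs in consecutive, non-overlapping windows of `window` bp."""
    n_windows = (chrom_length + window - 1) // window  # ceiling division
    counts = [0] * n_windows
    for pos in positions:
        if not 1 <= pos <= chrom_length:
            raise ValueError(f"position {pos} lies outside the chromosome")
        # VCF positions are 1-based, so position 1000 belongs to the first window.
        counts[(pos - 1) // window] += 1
    return counts


# Toy SNP positions on a hypothetical 4000-bp chromosome.
toy_positions = [5, 250, 999, 1000, 1001, 2500, 3999]
counts = snp_counts_per_window(toy_positions, chrom_length=4000)
```

Plotting such counts against the window start coordinate gives a per-chromosome SNP density profile of the kind the legend describes.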
Fig. 4. Relationship between ML predictions and the likelihood of a “lethal” phenotype upon knockout. Genes ranked by ML prediction probabilities were searched against a list of genes with at least one “lethal” phenotype reported in the GExplore database. Ratios were calculated cumulatively for genes from the highest to the lowest ML probabilities (red), and from the lowest to the highest (turquoise). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
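One way to compute cumulative ratios of this kind is sketched below. The legend does not define the ratio exactly, so the sketch assumes it is the fraction of genes seen so far that appear in the “lethal” list; the gene identifiers and probabilities are hypothetical.

```python
def cumulative_lethal_ratio(ranked_genes, lethal_genes):
    """Running fraction of ranked genes that are in the lethal list."""
    ratios = []
    hits = 0
    for rank, gene in enumerate(ranked_genes, start=1):
        if gene in lethal_genes:
            hits += 1
        ratios.append(hits / rank)
    return ratios


# Hypothetical ML probabilities and lethal-phenotype list.
predictions = {"g1": 0.95, "g2": 0.80, "g3": 0.40, "g4": 0.10}
lethal = {"g1", "g2", "g4"}

descending = sorted(predictions, key=predictions.get, reverse=True)
top_down = cumulative_lethal_ratio(descending, lethal)         # highest to lowest
bottom_up = cumulative_lethal_ratio(descending[::-1], lethal)  # lowest to highest
```

Both curves end at the same overall ratio; if the predictions carry information about lethality, the highest-to-lowest curve is expected to start above the lowest-to-highest curve.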