Michael Ku Yu1, Michael Kramer2, Janusz Dutkowski3, Rohith Srivas4, Katherine Licon5, Jason Kreisberg5, Cherie T Ng6, Nevan Krogan7, Roded Sharan8, Trey Ideker5. 1. Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla CA 92093, USA; Department of Medicine, University of California San Diego, La Jolla CA 92093, USA. 2. Department of Medicine, University of California San Diego, La Jolla CA 92093, USA; Biomedical Sciences Program, University of California San Diego, La Jolla CA 92093, USA. 3. Department of Medicine, University of California San Diego, La Jolla CA 92093, USA; Data4Cure, La Jolla, CA 92037, USA. 4. Department of Medicine, University of California San Diego, La Jolla CA 92093, USA; Department of Bioengineering, University of California San Diego, La Jolla CA 92093, USA. 5. Department of Medicine, University of California San Diego, La Jolla CA 92093, USA. 6. aTyr Pharmaceuticals, San Diego, CA 92121, USA. 7. Department of Cellular and Molecular Pharmacology, University of California San Francisco, San Francisco 94143, USA. 8. Blavatnik School of Computer Science, Tel-Aviv University, Tel Aviv 69978, Israel.
Abstract
Accurately translating genotype to phenotype requires accounting for the functional impact of genetic variation at many biological scales. Here we present a strategy for genotype-phenotype reasoning based on existing knowledge of cellular subsystems. These subsystems and their hierarchical organization are defined by the Gene Ontology or a complementary ontology inferred directly from previously published datasets. Guided by the ontology's hierarchical structure, we organize genotype data into an "ontotype," that is, a hierarchy of perturbations representing the effects of genetic variation at multiple cellular scales. The ontotype is then interpreted using logical rules generated by machine learning to predict phenotype. This approach substantially outperforms previous, non-hierarchical methods for translating yeast genotype to cell growth phenotype, and it accurately predicts the growth outcomes of two new screens of 2,503 double gene knockouts impacting DNA repair or nuclear lumen. Ontotypes also generalize to larger knockout combinations, setting the stage for interpreting the complex genetics of disease.
A central problem in genetics is to understand how different variations in DNA sequence, dispersed across a multitude of genes, can nonetheless elicit similar phenotypes (Waddington, 1942). In recent years, it has been repeatedly observed that different genetic drivers of a trait can be recognized by their aggregation in networks of pairwise protein or gene interactions (Califano et al., 2012; Greene et al., 2015; Hanahan and Weinberg, 2011; Kim and Przytycka, 2012; Ramanan et al., 2012; Wang et al., 2010). Rather than associate genotype with phenotype directly, variations in genotype are first mapped onto knowledge of gene networks; affected subnetworks are then statistically associated with phenotype. This approach can greatly increase our power to identify relevant associations between genotype and phenotype. This principle of “network-based” or “pathway-based” association (Califano et al., 2012) is now being applied to effectively map the genetics underlying complex phenotypes, including cancer and other common diseases (Hofree et al., 2013; Lee et al., 2011; Leiserson et al., 2014; Ng et al., 2012; Pe’er and Hacohen, 2011; Skafidas et al., 2014; Sullivan, 2012; Willsey et al., 2013).
In these studies, network knowledge is represented as a set of genes and pairwise gene interactions. In reality, however, genotype is transmitted to phenotype not only through gene-gene interactions but through a rich hierarchy of biological subsystems at multiple scales: Genotypic variations in nucleotides (1nm scale) give rise to functional changes in proteins (1–10nm), which in turn affect protein complexes (10–100nm), cellular processes (100nm), organelles (1μm) and, ultimately, phenotypic behaviors of cells (1–10μm), tissues (100μm-100mm) and complex organisms (>1m).
What has been less well-studied in genotype-phenotype association is how to leverage our extensive pre-existing knowledge across these scales, or how to identify the scales most relevant to a set of genetic variants (Deisboeck et al., 2011; Eissing et al., 2011; Walpole et al., 2013).
In many fields, knowledge across scales is modeled by ontologies: a factorization of prior knowledge about the world into a hierarchy of increasingly specific concepts (Brachman and Levesque, 2004). For instance, intelligent systems like Apple’s Siri and IBM’s Watson carry out logical reasoning using a large collection of world knowledge represented by ontologies (Carvunis and Ideker, 2014). In molecular and cellular biology, extensive knowledge of the hierarchy of subsystems in a cell has been represented by the Gene Ontology (GO), a community standard reference database that documents interrelationships among thousands of intracellular components, processes and functions in a large hierarchy of terms (The Gene Ontology Consortium, 2014). Thus far, genotype-phenotype association methods have sometimes used prior knowledge in GO by flattening the term hierarchy to a network, in which pairwise interactions connect genes annotated with the same GO term (Pesquita et al., 2009). This flattening, however, may discard important information about the rich hierarchy of biological systems connecting genotype to phenotype. Moreover, a hierarchical model is highly complementary, and in some ways orthogonal, to flat networks: GO is primarily concerned with “deep” connectivity up and down a hierarchy of cellular processes spanning dozens of scales, whereas network models typically focus on horizontal flow of signaling, transcriptional, or metabolic information among genes or reactions at the same scale (Lee et al., 2010, 2011).
Another advantage of GO is that it is continuously improved by a very large community of dozens of curators and editors, who update GO from new knowledge published in thousands of peer-reviewed papers each year (Balakrishnan et al., 2013; Huntley et al., 2014). To complement this process of manual curation, recently we and others have shown that a large hierarchy of cellular systems can be systematically assembled directly from analysis of genome-wide data sets, including molecular interactions and gene expression profiles; we call this assembly NeXO (Dutkowski et al., 2013; Gligorijević et al., 2014; Kramer et al., 2014). This ‘data-driven’ ontology closely resembles, and in some cases greatly revises and expands, the literature-curated GO.
Here we report a general approach for using deep hierarchical knowledge of the cell, represented by an ontology, to translate genotype to phenotype. This approach recursively aggregates the effects of genetic variation upwards through the hierarchy: in this way, genetic variants comprising genotype are converted to effects on the cell subsystems impacted by those variants. We call the set of all such effects ‘ontotype,’ representing variation at intermediate scales between nanoscopic changes in genes and macroscopic changes in phenotype.
Here, we focus on yeast genetic interactions, in which the deletion of two or more genes results in an unexpectedly slow or fast cellular growth phenotype. Genetic interactions have previously been screened systematically using synthetic genetic arrays in yeast (Costanzo et al., 2010); these experiments comprise ~3 million different genetic backgrounds and are one of the largest genotype-phenotype compendia in existence. We integrate these data with GO to produce a multi-scale computational model, the functionalized ontology.
The model accurately predicts growth phenotypes of 2,503 previously untested double deletion genotypes, and it is also capable of predicting the phenotypes that result from larger combinations of gene disruptions. Similar predictive power is achieved by substituting GO with NeXO, our data-driven ontology of cellular systems. In aggregate, this work suggests a strategy for building hierarchical models of the cell whose structure and function are learned completely from data.
Results
Association between genetic interactions and hierarchical relations among cellular systems
As preparation for modeling, we identified patterns by which genetic interactions are associated with, and thus biologically explained by, the structure of gene ontologies. We observed that sets of genes assigned to the same GO term tended to be highly enriched for genetic interactions (p < 10^-5), for both positive genetic interactions (double gene disruptions with better-than-expected growth, e.g. epistasis) and negative genetic interactions (double gene disruptions with worse-than-expected growth, e.g. synthetic lethality) (Figure 1A). Such interaction enrichment within GO terms occurred over a wide range of term sizes (the number of genes annotated to a term), suggesting that genetic interactions emerge from both broad and specific cellular mechanisms at multiple scales.
Figure 1
Patterns of genetic interaction reflect the hierarchical structure of the Gene Ontology
(A) Enrichment for negative (circle) or positive (triangle) genetic interactions among genes annotated to the same GO term as a function of term size, measured by the number of genes annotated to that term or its descendants. Enrichment is normalized as the fold change over expected for randomized GO annotations. (B) Genetic interactions are propagated up the GO hierarchy to support ‘between-term enrichment’ between the dynactin and kinesin complexes and ‘within-term enrichment’ within the parent ‘microtubule associated complex’. (C) Number of within-term and between-term enrichments highlighted by current genetic interaction data. Approximately half of within-term enrichments can be factored into one or more between-term enrichments that occur lower in the GO hierarchy. Percentages are calculated with respect to the total possible tests for within-term (2,719) and between-term (36,210) enrichments. (D) Number of genetic interactions involved in a within-term, between-term, or either type of enrichment. Percentages are calculated with respect to the total number of genetic interactions (107,133). The expected numbers of enrichments (C) and supporting interactions (D) were also calculated over randomized GO annotations (dark gray bars).
Due to the hierarchical structure of the cell, genetic interactions among genes annotated to a term can potentially be re-interpreted as interactions between the genes of different terms at a lower scale in GO. For example, the ‘parent’ term ‘microtubule-associated complex’ displays strong within-term interaction enrichment, which factors into strong between-term interaction enrichment across two of its ‘children’ terms, kinesin and dynactin (Figure 1B). We found that such hierarchical relationships were widespread in GO: approximately half of within-term enrichments could be factored into between-term enrichments among their descendants (Figure 1C). Occurrences of interactions within or between biological pathways have been previously investigated as separate biological interpretations (Bandyopadhyay et al., 2008; Bellay et al., 2011; Collins et al., 2010; Kelley and Ideker, 2005; Leiserson et al., 2011; Ma et al., 2008; Qi et al., 2008; Ulitsky et al., 2008). Here, both types of explanations can be applied to the same interaction, as they are related hierarchically within the unified structure of the cell. Overall, approximately 40,000 interactions were involved in 1,661 within- or between-term enrichments, representing a 24:1 compression of information (Figure 1D). Thus, GO integrates genetic interactions in an overarching hierarchy capturing multiple scales of cell biology. As one moves upwards in this hierarchy, separate disruptions to multiple systems converge to multiple disruptions to a single system, with the scale of this transition indicated naturally by the hierarchical structure.
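The distinction between within-term and between-term enrichment can be sketched with a toy calculation. The example below is only an illustration under stated assumptions: the gene names and interactions are invented, and the expected interaction density is taken from the overall background of the toy gene set, whereas the analysis above normalizes against randomized GO annotations.

```python
from itertools import combinations

# Hypothetical toy data: gene names and interactions are invented.
term_genes = {
    "kinesin": {"k1", "k2"},
    "dynactin": {"d1", "d2"},
}
all_genes = {"k1", "k2", "d1", "d2", "x1", "x2", "x3", "x4"}
interactions = {
    frozenset(pair)
    for pair in [("k1", "d1"), ("k1", "d2"), ("k2", "d1"), ("x1", "x2")]
}


def density(pairs):
    """Fraction of the given gene pairs that are genetic interactions."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(1 for pair in pairs if frozenset(pair) in interactions)
    return hits / len(pairs)


def within_pairs(genes):
    """All unordered gene pairs inside one term."""
    return combinations(sorted(genes), 2)


def between_pairs(genes_a, genes_b):
    """All gene pairs spanning two (disjoint) terms."""
    return ((a, b) for a in sorted(genes_a) for b in sorted(genes_b) if a != b)


# Simplifying assumption: expected density = background density of all pairs.
background = density(within_pairs(all_genes))

# The parent term contains the genes of both children.
parent = term_genes["kinesin"] | term_genes["dynactin"]
within_fold = density(within_pairs(parent)) / background
between_fold = density(
    between_pairs(term_genes["kinesin"], term_genes["dynactin"])
) / background

print(f"within-parent fold enrichment: {within_fold:.2f}")
print(f"between-children fold enrichment: {between_fold:.2f}")
```

In this toy case every interaction inside the parent term spans its two children, so the within-term enrichment of the parent factors entirely into a stronger between-term enrichment one level down.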
The ontotype: an intermediate between genotype and phenotype
Guided by this concordance between the GO hierarchy and genetic interactions, we developed a general system for ontology-based translation of genotype to phenotype that involves three general steps. First, the genotype is described according to convention by the set of genes that have been disrupted relative to wild type (e.g. bΔdΔ, Figure 2A). These disruptions are propagated recursively up the ontology, such that every term is assigned the disrupted genes annotated to that term plus all of those assigned to its children. For example, since the gene KIP1 encodes a subunit of the kinesin complex (Figure 1B), its deletion in a kip1Δ strain propagates upwards in the ontology to affect the parent term ‘kinesin complex’ and continues to propagate upwards to affect ancestor terms at higher scales such as ‘microtubule associated complex’ and ‘cytoskeleton’.
Figure 2
The ontotype method of translating genotype to phenotype
(A) The relationship between genotypic and phenotypic variation is modeled through an intermediate ‘ontotype’, defined as the profile of states corresponding to the effect of genotype on each cellular component, biological process, and molecular function represented as a term in GO. To generate an ontotype, perturbations to genes are propagated hierarchically through the ontology, altering term states. A random forest regresses to predict a phenotype using the ontotype as features. An example decision tree from the forest is shown. (B) Example genotype/ontotype/phenotype associations from the ontology in (A). Different genotypes (e.g. bΔdΔ and aΔdΔ) give rise to similar or identical phenotypes by influencing similar or identical combinations of terms.
Second, every term is assigned a functional state, representing the aggregate impact of gene disruptions on the activity of the component or process that term represents. Although it is possible to envisage many ways one might compute this functional impact, as proof-of-principle we explored a simple and parameter-free computation, the number of disrupted genes associated with the term. This general approach is iterated across all terms; we call the profile of states across all terms the ‘ontotype.’ In this way, the ontotype provides a complete picture of cell function and spans scales between genotype and phenotype. Whereas genotype describes the states of genes, and phenotype describes the states of observable traits, ontotype describes the states of all known biological objects. Many of these objects exist at scales bigger than genes but too small to be classically ‘observable’ by eye, such as protein complexes and other subcellular structures, or too diffuse, such as signaling pathways (Figure 2A). In its most general definition, ontotype encompasses both genotype and phenotype, with genes and observable traits positioned at lower and higher levels of the hierarchy of objects encoding life.
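The first two steps (upward propagation and counting disrupted genes per term) can be sketched as follows. This is a minimal illustration, not the published implementation: the four-term ontology is a toy, `GENE_D` is a hypothetical placeholder gene, and the function name `ontotype` is introduced here for convenience. Only the KIP1 example and the term names come from the text above.

```python
# Toy ontology: each term maps to its parent terms (a DAG in general).
parents = {
    "kinesin complex": ["microtubule associated complex"],
    "dynactin complex": ["microtubule associated complex"],
    "microtubule associated complex": ["cytoskeleton"],
    "cytoskeleton": [],
}

# Direct gene-to-term annotations. GENE_D is a hypothetical placeholder.
annotations = {
    "KIP1": ["kinesin complex"],
    "GENE_D": ["dynactin complex"],
}


def ontotype(genotype, annotations, parents):
    """Return {term: number of disrupted genes at or below that term}."""
    disrupted = {term: set() for term in parents}
    for gene in genotype:
        stack = list(annotations.get(gene, []))
        seen = set()
        while stack:
            term = stack.pop()
            if term in seen:
                continue  # a term may be reachable by several paths
            seen.add(term)
            disrupted[term].add(gene)  # sets prevent double counting
            stack.extend(parents[term])
    return {term: len(genes) for term, genes in disrupted.items()}


single = ontotype({"KIP1"}, annotations, parents)
double = ontotype({"KIP1", "GENE_D"}, annotations, parents)
print(single)
print(double)
```

Storing disrupted genes as sets before counting matters in a real ontology, where a gene can reach the same ancestor through more than one child term.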
A functionalized gene ontology integrating cell structure and functional prediction
Third, once genotypes are transformed to ontotypes, a supervised learning approach based on the technique of random forests regression (Breiman, 2001) is used to learn rules by which term states predict phenotypes. Rules are organized as a collection, or ‘forest’, of decision trees (Experimental Procedures), with a typical decision tree describing a series of logical true/false tests to evaluate the states of several terms (e.g., T4, T5, and T7 in Figure 2A). Making decisions on the states of terms rather than nucleotide variants or genes enables machine learning across a range of scales, so that different genotypes converging on similar ontotypes (e.g. aΔdΔ and bΔdΔ in Figure 2B) can yield the same phenotype. Decision tree logic was trained to predict quantitative genetic interaction scores from ~3 million tests for pairwise genetic interactions (Costanzo et al., 2010) (Experimental Procedures). This hierarchical structure of the ontology, when coupled to the decision logic described above, forms a “functionalized” ontology, that is, a computational cell model that defines both the sub-structures of the cell and how these sub-structures hierarchically translate genotype to phenotype.
Separate functionalized ontologies were trained using either the Gene Ontology curated from the Saccharomyces literature (Cherry et al., 2012) (FGO) or a data-driven ontology assembled from Saccharomyces datasets using the method of Network-extracted Ontologies (Dutkowski et al., 2013; Kramer et al., 2014) (FNeXO). Whereas GO represents knowledge of published cell biology, application of NeXO yielded an ontology whose hierarchy of cell systems was learned directly from publicly available data, including protein-protein interactions, gene expression profiles, and protein sequence properties but excluding any prior information about genetic interactions (datasets taken from YeastNet v3 study, Kim et al., 2014).
NeXO (4,805 terms) was tuned so that the resulting ontology was approximately similar in size to GO (5,125 terms). Alignment of these two ontologies revealed 1,614 significantly overlapping terms. Thus, NeXO represents a distinct hierarchy of cellular systems that provides an alternative to the hierarchy maintained by GO curators.
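The way a forest of decision trees turns term states into a predicted score can be illustrated with hand-written trees. This sketch shows prediction only, not training: the term labels T4, T5, and T7 echo Figure 2A, but every threshold and score below is invented, and averaging tree outputs is the standard prediction rule for random forest regression rather than a detail confirmed by the text.

```python
# One ontotype: term states are counts of disrupted genes per term.
ontotype = {"T4": 1, "T5": 0, "T7": 2}


# Each tree is a series of true/false tests on term states ending in a
# predicted interaction score. Thresholds and scores are invented.
def tree_1(o):
    if o["T7"] >= 2:
        return -0.20
    return 0.0


def tree_2(o):
    if o["T4"] >= 1:
        if o["T5"] >= 1:
            return 0.10
        return -0.10
    return 0.0


def tree_3(o):
    if o["T5"] == 0 and o["T7"] >= 1:
        return -0.30
    return 0.0


forest = [tree_1, tree_2, tree_3]


def predict(o, forest):
    """Regression forest prediction: the mean of the individual tree outputs."""
    return sum(tree(o) for tree in forest) / len(forest)


score = predict(ontotype, forest)
wild_type = predict({"T4": 0, "T5": 0, "T7": 0}, forest)
print(f"predicted interaction score: {score:.2f}")
```

Because the tests refer to term states rather than to individual genes, any two genotypes that produce the same ontotype receive the same prediction.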
Quantitative assessment of performance for genotype-phenotype translation
FGO accurately predicted growth phenotypes across a range of genetic interaction scores (Figure 3A,B). The correlation between predicted and measured scores was highly significant (Figure 3C, Pearson’s r = 0.35, p < 2.2×10^-16) and reduced substantially when a randomized version of the ontology was used (r = 0.04); the maximum achievable correlation, as previously determined by experimental genetic interaction replicates (Baryshnikova et al., 2010), was r = 0.67. Progressively removing either small or large terms from the model degraded the correlation (Figure 3D,E), indicating that all scales in the hierarchy aid in prediction. FNeXO achieved nearly the same correlation (Figure 3C, r = 0.32) and was also sensitive to randomization (r = 0.03).
Figure 3
Genome-wide prediction of pairwise genetic interactions in yeast
(A) Measured genetic interaction scores versus those predicted from ontotypes constructed from GO using four-fold cross validation. For each bin of predicted scores, box plots summarize the distribution of measured scores by its median (central horizontal line), interquartile range (box), and an additional 1.5σ (whiskers). (B) Number of gene pairs in each bin of predicted scores. (C) Method performance, as represented by the correlation of measured versus predicted interaction scores across gene pairs that meet an interaction significance criterion of p < 0.05 in Costanzo et al. Comparison is made to ontotypes constructed from a randomized GO or NeXO and to previous non-hierarchical methods for predicting genetic interactions. FBA correlation is reported for the set of 104,826 gene pairs considered by this model and for which gene annotations are available in GO. The ontotype correlations do not fluctuate greatly (<4%) whether computed over all gene pairs (shown) or the FBA gene pairs. See also Supplemental Figure S1–2. (D) Method performance when the ontotype is constructed from only GO terms that are no larger than (triangles) or no smaller than (circles) a size threshold. (E) The number of GO terms that meet each size threshold criteria. (F) Precision-recall curves for classification of negative genetic interactions.
Both functionalized ontologies compared favorably to non-hierarchical approaches for predicting genetic interactions (Boucher and Jenna, 2013; Lehner, 2013). We evaluated three state-of-the-art methods: Flux Balance Analysis (FBA), which uses a mechanistic model of yeast metabolic pathways to simulate the impact of gene deletions on cell growth (Szappanos et al., 2011); Guilt-By-Association (GBA), which predicts the phenotype of pairwise gene deletions based on the phenotypes of their network neighbors (Lee et al., 2010); and the Multi-Network Multi-Classifier (MNMC), a ‘black box’ supervised learning system which uses many different lines of experimental evidence as features to predict genetic interactions (Pandey et al., 2010, Experimental Procedures). In comparison to all of these approaches, the functionalized ontologies achieved substantially greater correlation between predicted and measured interaction scores (Figure 3C) as well as better tradeoffs in precision versus recall (Figure 3F) in four-fold cross-validation. We also assessed prediction performance in a challenging validation scenario in which the training set of genotypes does not disrupt any genes in the test set (Park and Marcotte, 2012, Supplemental Experimental Procedures). In this scenario, any genotype-phenotype logic that applies to individual genes is no longer generalizable; for example, promiscuous genes with a high degree of genetic interactions (Gillis and Pavlidis, 2012; Mackay, 2014) could be used to explain training data but not test data. In spite of this challenge, FGO still outperformed predictions made with a randomized GO or with the non-hierarchical methods (Supplemental Figure S1).
We found that the accuracy of growth phenotype prediction depends significantly on the degree to which cellular systems have been characterized in the gene ontology.
FGO was especially accurate at modeling genotypes for which the disrupted genes are well-characterized by GO annotations; conversely, it was far less able to model genotypes for which the genes are poorly characterized (Supplemental Figure S2). Moreover, many genes that are poorly characterized in GO are better characterized in NeXO, such that genotypes involving these genes lead to better phenotypic predictions by FNeXO than by FGO (Supplemental Figure S2A–C). These differences demonstrate the utility of data-driven ontologies for translating genotype to phenotype, especially in species that are lacking in GO curation but have ‘omics datasets from which a gene ontology can nonetheless be built.
Finally, we investigated whether hierarchical features (i.e. the ontotype) were essential, or equally good predictions could be made from ‘flat’ features derived from the same ontologies. GO was flattened by computing the semantic similarity (Resnik, 1995), which scores every pair of genes by their functional relatedness in GO. As a non-hierarchical representation of NeXO, we directly considered the data on which it had been based: pairwise gene-gene similarities derived from different types of experimental evidence in YeastNet. Use of these flat datasets derived from the two ontologies resulted in a substantial degradation in prediction performance (FLATGO and FLATNeXO, Figure 3C), even though the same random forests regression procedure was used as for the functionalized ontologies.
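The stricter validation scenario described above, in which no training genotype disrupts a gene that appears in the test set, can be sketched as a gene-level split of the double-deletion genotypes. This is one possible construction under stated assumptions, not necessarily the procedure in the Supplemental Experimental Procedures: the function name, the `test_fraction` parameter, and the choice to discard genotypes that mix a held-out gene with a training gene are all illustrative.

```python
import random
from itertools import combinations


def gene_disjoint_split(pairs, test_fraction=0.25, seed=0):
    """Split gene pairs so that training and test pairs share no genes."""
    genes = sorted({gene for pair in pairs for gene in pair})
    rng = random.Random(seed)
    rng.shuffle(genes)
    n_test = max(1, int(len(genes) * test_fraction))
    test_genes = set(genes[:n_test])

    # Training genotypes contain no held-out gene at all.
    train = [pair for pair in pairs if not (set(pair) & test_genes)]
    # Test genotypes consist only of held-out genes.
    test = [pair for pair in pairs if set(pair) <= test_genes]
    # Pairs mixing one held-out and one training gene are discarded here.
    return train, test, test_genes


# Hypothetical genotypes: all pairwise deletions among eight toy genes.
toy_genes = [f"g{i}" for i in range(8)]
toy_pairs = list(combinations(toy_genes, 2))

train, test, held_out = gene_disjoint_split(toy_pairs, test_fraction=0.5)
print(len(toy_pairs), len(train), len(test))
```

The cost of this design is visible even in the toy case: of 28 genotypes, only 6 remain for training and 6 for testing, because every mixed genotype is dropped.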
Simulating growth phenotypes for ‘new’ genotypes not yet observed or examined
We next used FGO to simulate growth for all 12,512,503 pairwise deletions of non-essential yeast genes, 73% of which had not yet been tested in the laboratory (Figure 4A, Supplemental File S1). A total of 41,605 genetic interactions were predicted. These predictions were concentrated within and between particular terms and term pairs (Figure 4A,B), covering a total of 1,367 unique terms and indicating where in the ontology the logic of FGO takes place. For example, FGO predicted many genetic interactions within ‘oxidative phosphorylation’ (Figure 4C), with negative interactions linking the sub-systems of electron and proton transport and positive interactions segregating entirely within electron transport. These distinct patterns of positive/negative segregation were observed broadly across FGO (Supplemental Figure S3). Of particular interest were predicted interactions between 71 term pairs, as these terms were only distantly related in GO (Table 1, Supplemental Table S1, Supplemental Experimental Procedures). For example, all ten genes in ‘intron homing’ had negative interactions with all four genes in the ‘Phosphatidylinositol-3-kinase complex’, although neither these terms nor their parents shared any genes, and these terms were in entirely separate branches of GO (biological process versus cellular component). Thus, FGO makes predictions guided by, but not rigidly confined to, known hierarchical relations among cellular subsystems. The unexpected connections point to potential new cellular functions and functional relationships important for regulating cell growth.
Figure 4
The Functionalized Gene Ontology
(A) Visualization of FGO structure and function. Terms and hierarchical parent-child relations are represented by nodes and black edges. Colored nodes and edges denote within- and between-term interaction enrichments, illustrating how terms and term combinations are used for prediction. (B) Venn diagrams showing number of term enrichments identified for measured interactions, predicted interactions, or both. (C) Example term ‘oxidative phosphorylation’, which factors into the transport of electrons (left child) versus protons (right child). Although both positive and negative genetic interactions are predicted within the oxidative phosphorylation genes (represented by a pie with both blue and red slices), positive interactions segregate within electron transport (blue pie) while negative interactions segregate between electron and proton transport (dotted red edge). See also Supplemental Figure S3.
Table 1
Top new functional relationships in FGO. See also Supplemental Table S1.
Term A (# of Genes) | Term B (# of Genes) | Interactions/Total (%) | p-value^a
Negative Interactions
intron homing (10) | phosphatidylinositol 3-kinase complex II (4) | 40/40 (100.0%) | 6.74E-96
negative regulation of chromatin silencing at silent mating-type cassette (8) | protein import into mitochondrial inner membrane (3) | 24/24 (100.0%) | 3.56E-55
pre-mRNA binding (5) | RNA pol II transcription coactivator activity in preinitiation complex assembly (3) | 15/15 (100.0%) | 2.86E-32
protein lipoylation (4) | carbon-oxygen lyase activity, acting on phosphates (3) | 12/12 (100.0%) | 1.23E-24
Swr1 complex (8) | U6 snRNP (3) | 22/24 (91.7%) | 1.20E-47
alpha-1,6-mannosyltransferase complex (6) | negative regulation of chromatin silencing involved in replicative cell aging (4) | |
Validation and expansion of the functionalized ontology of DNA repair and nuclear lumen
Key terms in FGO were ‘DNA repair’ and ‘nuclear lumen’, which featured prominently in the decision tree logic leading to a high concentration of predicted interactions (9.0 and 7.6 times the expected interaction density, respectively) according to particular patterns of disruption (Figure 5A, Supplemental Figure S4). Genetic perturbations within each term led to particularly accurate growth phenotypes in cross-validation, as the correlation between predicted interactions and those measured by Costanzo et al. was noticeably better for gene pairs in DNA repair or nuclear lumen (both r = 0.61) than for gene pairs in other terms (average r = 0.35, Supplemental Figure S2G, Supplemental Table S2). To test whether this performance generalized to new data, we experimentally measured growth phenotypes for 1,218 pairwise deletions of DNA repair genes and 1,285 pairwise deletions of nuclear lumen genes (2,503 in total) and scored these mutants for genetic interactions (Supplemental Table S3, Supplemental Experimental Procedures). Of these, 1,345 mutants had also been scored previously by Costanzo et al. Surprisingly, we observed that the new measurements were better predicted by FGO than by the previous measurements of those same genotypes (i.e., experimental replicates, Figure 5B). Such improvement suggests that functionalized ontologies may be able to reduce experimental noise by learning the overarching patterns of cellular subsystems that translate genotype to phenotype.
Figure 5
Elucidating the genetic logic of DNA repair and the nuclear lumen
(A) DNA repair has a rich structure of predicted genetic interactions among specific repair processes. Coloring and visual style of panels follow the convention of previous figures. See also Supplemental Figure S4. (B–D) Yeast growth was experimentally measured for double gene deletion strains in which both genes are involved in either DNA repair (green) or nuclear lumen (orange). See also Supplemental Table S2–3. (B) The new measurements are correlated with previous data by Costanzo et al., 2010 as well as predictions of a FGO trained with all previous data, or predictions of a “limited” FGO trained with all previous data excluding genotypes tested in the new screen. In all cases, correlation is computed among the genotypes tested by both the new screen and Costanzo et al. Among all genotypes in the new screen, we calculated receiver-operating (C) and precision-recall curves (D) for predicting negative genetic interactions in DNA repair and the nuclear lumen using the limited FGO. The corresponding curves across all gene pairs in the previous screen are reproduced for comparison (gray, see Figure 3F).
We next tested FGO’s ability to generalize to unseen mutant genotypes. For this purpose we constructed a “limited” FGO, trained only on those genotypes that had been tested earlier (Costanzo et al., 2010) but not by our new screens. This limited FGO achieved a high sensitivity versus specificity (Figure 5C) and precision versus recall (Figure 5D) in predicting the new interactions measured for DNA repair and nuclear lumen genes. Given this validation, we combined the genetic interaction scores from both new screens with previous data (Costanzo et al., 2010) and re-trained the ontotype decision logic on this more complete dataset. The structure of this improved FGO, with the accompanying ontotype-phenotype logic, is available online on the Network Data Exchange (http://goo.gl/cYIXWJ, UUID: 01b46d52-c3a5-11e5-8fbc-06603eb7f303, Pratt et al., 2015) and as a Cytoscape file in Supplemental File S2.
Toward more complex genotypes
Although the ontotype had been trained using double deletion genotypes, we hypothesized that, once trained, it might be capable of predictions for genotypes involving mutations to larger numbers of genes. Although few studies have examined three-way or higher-order genetic interactions, a recent study (Haber et al., 2013) showed proof-of-principle for a three-way gene deletion methodology, representing one of the few systematic screens for triple mutants to date. This work reported that deletion of CAC1 in combination with any gene in the HIR complex (HIR1, HIR2, HIR3, HPC2) results in a synthetic growth defect (negative genetic interaction); however, the additional deletion of a third gene, ASF1, suppresses this phenotype. Consistent with these findings, FGO predicted both the synthetic sickness of the double mutants and phenotypic suppression by the triple mutant (Figure 6A). Visual inspection of the model (Figure 6B) implicated decision logic based on the functional activities of two related processes, DNA replication-independent nucleosome assembly and nucleosome organization. Deleting a single gene in DNA replication-independent nucleosome assembly leads to a state in which the deletion of another gene functioning elsewhere in nucleosome organization causes synthetic sickness. In contrast, the triple mutants include deletion of two genes in DNA replication-independent nucleosome assembly (asf1ΔHIRΔ), leading to a neutral phenotype. This effect probably occurs because the double mutant impairs growth to such an extent that additional perturbations have no detectable effect. Indeed, whereas CAC1 is primarily involved in regulating DNA replication, ASF1 and the HIR complex have been linked to other chromatin-related processes, including transcriptional elongation (Formosa et al., 2002; Schwabish and Struhl, 2006) and mRNA export (Pamblanco et al., 2014).
This triple-mutant case study illustrates the complexity of logic in interpreting genetic interactions, underscoring the utility of a knowledge representation and reasoning system for unraveling such combinatorial genetic effects.
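The decision logic described above for Figure 6B can be sketched as a small rule over two term states. This is only an illustrative reading of the prose, not the trained model: the function name, the thresholds, and the returned labels are assumptions, and the term states are entered by hand rather than computed from GO.

```python
def predict_interaction(assembly_deleted: int, elsewhere_deleted: int) -> str:
    """Toy rule mirroring the prose description of the Figure 6B logic.

    assembly_deleted: number of deleted genes annotated to
        'DNA replication-independent nucleosome assembly' (hand-entered).
    elsewhere_deleted: number of deleted genes functioning elsewhere in
        'nucleosome organization' (hand-entered).
    Thresholds and labels are hypothetical, chosen only to match the text.
    """
    if assembly_deleted >= 2:
        # Two hits in the same process: further perturbation has no
        # detectable effect, so the predicted interaction is neutral.
        return "neutral"
    if assembly_deleted == 1 and elsewhere_deleted >= 1:
        # One hit in assembly plus one hit elsewhere: synthetic sickness.
        return "negative"
    return "neutral"


# Double mutant cac1Δ hir1Δ: one gene in assembly (HIR1), one elsewhere (CAC1).
double_mutant = predict_interaction(assembly_deleted=1, elsewhere_deleted=1)

# Triple mutant asf1Δ cac1Δ hir1Δ: two genes in assembly (ASF1, HIR1), one elsewhere.
triple_mutant = predict_interaction(assembly_deleted=2, elsewhere_deleted=1)
```

In the actual model the rules are learned by random forests over many term states; the sketch only shows why adding a third deletion can move a genotype from a negative branch to a neutral one.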
Figure 6
Prediction of triple mutants
(A) Measured versus predicted interaction scores for genotypes with pairwise and three-way deletions of ASF1, CAC1, and genes in the HIR complex (HIR1, HIR2, HIR3, HPC2) (Haber et al., 2013). (B) Relevant GO structure (left) and corresponding functional decision tree (right) for predicting the two- and three-way interactions in (A). At left, arrows represent parent-child relations and gene annotations in GO. At right, arrows represent decisions based on ontotype: numbers on arrows are term states; arrows point to predicted interaction scores (ε).
Discussion
Many years of work by the Gene Ontology Consortium have established an extensive description of cell structure spanning a hierarchy of biological scales. Here, we have shown that the ontology structure can also be used functionally for interpretation of genetic variants to make phenotypic predictions. The ability to systematically map and then integrate these two aspects, structure and function, outlines a general strategy for development of computational cell models. First, a knowledge base of the cell’s hierarchical structure is acquired, either through literature curation (GO) or data-driven methods (NeXO). In a second step, mathematical relations are learned by algorithms that translate how the functional states of these subsystems, the ontotype, give rise to a phenotype of interest. Together, these two steps constitute a paradigm by which cell structure is determined from physical information derived from literature or systematic data, and cell function is learned from genetic data such as synthetic-lethal interactions and genome-wide association studies.
Functionalized ontologies substantially outperformed previous phenotypic predictors (Figure 3C,F), a notable finding given the simplicity of the ontotype and its use as the sole feature set for learning. We believe this success follows from several key aspects of implementation. First and most important, the utility of hierarchical organization in genotype-phenotype translation cannot be overstated. Indeed, the functionalized ontologies also outperformed predictors based on non-hierarchical versions of the same information (Figure 3C) or truncated versions of the ontology (Figure 3D,E). From the perspective of the ontology, all mutations or variants in a genotype coalesce to the same cellular module, provided one looks at a high enough level (Figure 1B). A genotype may include some mutations that map to the same gene, others to the same protein complex, and still others to different complexes but to the same broad process or organelle, with all mutations falling within the highest scale represented by the cell itself. Propagating mutations upward through terms of increasing scale enables subsequent selection of the ‘right’ scale for accurate prediction. In this regard, FGO sheds light on previous, partially discrepant, studies of genetic interaction networks. Some analyses have found that negative genetic interactions tend to connect between complementary modules, whereas positive interactions tend to occur within a single module (Bandyopadhyay et al., 2008; Collins et al., 2010; Kelley and Ideker, 2005; Leiserson et al., 2011; Ma et al., 2008; Qi et al., 2008; Ulitsky et al., 2008); a more recent report identified dense patterns of both positive and negative interactions between modules (Bellay et al., 2011). Analysis of FGO suggests that both interpretations can be correct, depending on the scale of the module(s) within the cellular hierarchy.
The second factor in the success of functionalized ontologies is the sustained efforts of biologists at large. GO is a rich resource of cellular knowledge that is both broad, in its extensive coverage of cell biology, and deep, in its resolution of cell subsystems across many different scales. Although not perfect, this knowledge is continuously refined, updated and expanded by the sustained efforts of a global community. Given the staggering complexity of the cell, such a collaborative approach incorporating diverse expertise and tools may be instrumental in establishing robust and complete prior knowledge for computational cell modeling. Previously, cellular modeling efforts have typically involved independent curation within a single laboratory or institute.
The last factor that worked in our favor is that functionalized ontologies balance rigid modeling constraints imposed by prior knowledge with flexible statistical learning guided by experiments. Computing the ontotype requires no parameters and instead leverages the topology of the ontology. Logical rules for predicting phenotype are based on the ontotype, but their functional form, i.e. which terms are used and how their states are combined, is learned from data. In contrast, many previous efforts in mechanistic modeling (e.g., Cahan et al., 2014; Carrera et al., 2014; Deutscher et al., 2006; Karr et al., 2012; Lerman et al., 2012; Machado et al., 2011; O’Brien et al., 2013; Orth et al., 2010; Segrè et al., 2005; Szappanos et al., 2011; Szczurek et al., 2009; Takahashi et al., 2003; Tomita et al., 1999) have been driven by low-level prior knowledge in the form of biophysical equations. While naturally conferring a mechanistic explanation when correct, these equations are often of preset form and have sensitive parameters (Apgar et al., 2010; Ashyraliyev et al., 2009; Gutenkunst et al., 2007), such that achieving accurate predictions within one dataset risks overfitting.
Extending Functional Ontologies Beyond Current Limits
FGO based its predictions principally on 1,367 terms, spread across various biological processes, cellular components and molecular functions (Figure 4A). Although this coverage of cell biology is substantial (27% of the yeast GO), one might wonder whether it should be more complete. First, some term logic is likely missed because those terms are not frequently disrupted in the current set of genotypes. For example, genes annotated to 783 GO terms were never disrupted in any genotype tested (Costanzo et al., 2010). Second, some biological processes are likely not required for the phenotype tested – growth of cells in rich media – but instead may drive a wide variety of other phenotypes (Dowell et al., 2010; Hillenmeyer et al., 2008; Ideker and Krogan, 2012; Lee et al., 2014). Third, important processes or components may not yet have been curated in GO, and some existing terms might have errors in gene annotations or relations to other terms. Such false-positive and false-negative information could obscure a term’s utility in prediction. We expect that testing additional genotypes, phenotypes, and environmental conditions will increase the functional coverage of terms and enhance FGO with new and more robust logic.
Complex traits arise from a landscape of genetic variants and mutations, where it is often challenging to interpret the effects of individual genes due to many multi-gene interactions (Kim and Przytycka, 2012; Zuk et al., 2012). To address this challenge, we have shown that gene ontologies can be transformed into multi-scale models capable of general genotype-phenotype reasoning. Although based on simple rules of propagation, the model substantially outperforms previous methods for predicting cellular growth phenotypes, whether based on mechanistic modeling of pathways or ‘black-box’ machine learning methods. It also generalizes in ways that previous predictors are incapable of doing, including the ability to analyze genotypes of arbitrary complexity. These advances are important steps towards building intelligent systems that can one day interpret the complex genetics underlying human health and disease.
In moving forward, special consideration should be given to the mathematical functions that govern each term state. Here, we found success with a surprisingly straightforward and parameter-free function that counts the disrupted genes assigned to a term and its sub-terms. More generally, this function might be tailored to each term according to specific knowledge about the inner workings of that cellular component or process. Defining the mathematical relationships between genes within a cellular process has been the focus of ‘bottom up’ systems biology (Bruggeman and Westerhoff, 2007; Chen et al., 2010). In contrast, defining the broad organization of genes into cellular processes has been the domain of ‘top down’ systems biology. With its hierarchy of terms and functions spanning many different biological scales, a functionalized ontology may offer a means to bridge this long-standing divide.
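The parameter-free counting function mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `ontotype`, the toy term names, and the gene labels are hypothetical, and the sketch assumes that a gene annotated to several sub-terms of the same term is counted once for that term.

```python
from typing import Dict, Iterable, List, Set


def ontotype(
    deleted: Set[str],
    children: Dict[str, List[str]],
    annotations: Dict[str, Iterable[str]],
) -> Dict[str, int]:
    """Return, for every term, the number of deleted genes annotated to
    that term or to any of its sub-terms (assumed to be counted once each)."""
    cache: Dict[str, Set[str]] = {}

    def genes_under(term: str) -> Set[str]:
        # Genes annotated directly to the term plus genes of all descendants.
        if term not in cache:
            genes = set(annotations.get(term, ()))
            for child in children.get(term, ()):
                genes |= genes_under(child)
            cache[term] = genes
        return cache[term]

    # Collect every term that appears as a parent, a child, or an annotation key.
    terms = set(children) | set(annotations)
    for kids in children.values():
        terms.update(kids)
    return {term: len(genes_under(term) & deleted) for term in terms}


# Hypothetical toy ontology: parent term -> child terms.
toy_children = {
    "cell": ["process_A", "process_B"],
    "process_A": ["complex_A1", "complex_A2"],
}
# Hypothetical direct gene annotations.
toy_annotations = {
    "complex_A1": {"g1", "g2"},
    "complex_A2": {"g3"},
    "process_B": {"g4"},
}

# A double deletion hitting two different complexes within the same process.
states = ontotype({"g1", "g3"}, toy_children, toy_annotations)
```

In this toy case each complex has state 1, whereas `process_A` and `cell` have state 2, which illustrates how two mutations in different complexes coalesce at a larger scale. A term-specific function, as suggested in the text, would replace the final `len(...)` with a rule tailored to that term.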
Experimental Procedures
Genetic interaction data
Experimental genetic interaction scores for >6 million double mutants in yeast, measured using synthetic genetic arrays (Costanzo et al., 2010) (SGA, 1,711 queries × 3,885 arrays), were downloaded from http://drygin.ccbr.utoronto.ca/~costanzo2009/. Double gene deletion mutants impacting DNA repair and the nuclear lumen were generated on solid agar media using SGA technology as previously described (Collins et al., 2010; Tong and Boone, 2006). See also Supplemental Experimental Procedures.
Preparation of ontologies
We used all three branches of the Gene Ontology (Biological Process, Cellular Component, and Molecular Function) by joining them under an artificial root. We removed annotations with the evidence code “inferred by genetic interaction” (IGI) to avoid potential circularity in predicting genetic interactions. We also removed terms that were not annotated with any yeast genes or were redundant with respect to their children terms to construct a GO relevant to yeast (Supplemental Table S4), following a previously described procedure (Dutkowski et al., 2013, http://mhk7.github.io/alignOntology/).
To construct NeXO (Supplemental Table S5), we integrated the YeastNet v3 networks (Kim et al., 2014), spanning 68 experimental studies across 8 data types excluding genetic interactions, into a single network, and then applied the method of Clique Extracted Ontology (CliXO) (Kramer et al., 2014, http://mhk7.github.io/clixo_0.3/). See also Supplemental Experimental Procedures.
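The GO preparation steps described above (joining the three branches under an artificial root, dropping IGI annotations, and removing terms left without genes) can be sketched as follows. This is a simplified illustration, not the published procedure: the function names, the `ROOT` label, and the toy terms and genes are hypothetical, and the removal of terms that are redundant with their children is deliberately omitted because its exact criterion is not specified here.

```python
from typing import Dict, List, Set, Tuple


def genes_under(term, children, direct, cache):
    """Genes annotated to a term or to any of its descendants."""
    if term not in cache:
        genes = set(direct.get(term, ()))
        for child in children.get(term, ()):
            genes |= genes_under(child, children, direct, cache)
        cache[term] = genes
    return cache[term]


def prepare_ontology(
    branch_roots: List[str],
    children: Dict[str, List[str]],
    annotations: List[Tuple[str, str, str]],  # (gene, term, evidence code)
    root: str = "ROOT",  # hypothetical label for the artificial root
):
    # Step 1: join the branches under one artificial root.
    joined = {term: list(kids) for term, kids in children.items()}
    joined[root] = list(branch_roots)

    # Step 2: discard annotations inferred by genetic interaction (IGI).
    direct: Dict[str, Set[str]] = {}
    for gene, term, evidence in annotations:
        if evidence != "IGI":
            direct.setdefault(term, set()).add(gene)

    # Step 3: keep only terms that still carry at least one gene, either
    # directly or through a descendant. An empty term has only empty
    # descendants, so dropping it never disconnects a kept term.
    cache: Dict[str, Set[str]] = {}
    genes_under(root, joined, direct, cache)
    keep = {term for term, genes in cache.items() if genes}
    pruned = {
        term: [kid for kid in kids if kid in keep]
        for term, kids in joined.items()
        if term in keep
    }
    kept_annotations = {t: g for t, g in direct.items() if t in keep}
    return pruned, kept_annotations


# Hypothetical toy input: three branch roots with a few child terms.
toy_children = {"BP": ["bp1", "bp2"], "CC": ["cc1"], "MF": []}
toy_annotations = [
    ("g1", "bp1", "IMP"),
    ("g2", "bp2", "IGI"),  # removed: evidence comes from a genetic interaction
    ("g3", "cc1", "IDA"),
]
pruned, kept = prepare_ontology(["BP", "CC", "MF"], toy_children, toy_annotations)
```

In the toy input, `bp2` disappears because its only annotation is IGI, and `MF` disappears because no gene remains beneath it.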
Random forests regression
Random forests (Breiman, 2001) were used to regress genetic interaction scores ε, as described in the Results. Due to the very large size of the ontotype feature matrix, we optimized the random forest library from the Python scikit-learn package (Pedregosa et al., 2011); the modified code is available at https://github.com/michaelkyu/scikit-learn-fasterRF. While trees grown at approximately 29% (GO) or 37% (NeXO) of the maximal depth did improve performance slightly (<0.02 gain in correlation, Supplemental Figure S5), we chose to grow trees to maximal depth because it is unclear how significant this gain is and whether it would be reproducible in different random partitions of the data for cross validation or in different genotype-phenotype datasets. See also Supplemental Experimental Procedures.
Comparison of methods for predicting genetic interactions
We updated the MNMC method from the original (Pandey et al., 2010) for two reasons: the original was trained on a set of literature-curated synthetic lethal interactions much smaller than the set of genetic interactions considered in our study, and the features used by the method to score each gene pair had been updated since the 2010 publication. To train MNMC, we calculated five basic features that were identified in the original MNMC as among the most informative for predicting synthetic lethality of a gene pair. This updated MNMC outperformed the original MNMC (Supplemental Figure S6); this performance difference may be due to the five basic features being collected more recently. See also Supplemental Experimental Procedures.