Literature DB >> 30548534

Integrating molecular networks with genetic variant interpretation for precision medicine.

Emidio Capriotti¹, Kivilcim Ozturk², Hannah Carter³.

Abstract

More reliable and cheaper sequencing technologies have revealed the vast mutational landscapes characteristic of many phenotypes. The analysis of such genetic variants has led to successful identification of altered proteins underlying many Mendelian disorders. Nevertheless the simple one-variant one-phenotype model valid for many monogenic diseases does not capture the complexity of polygenic traits and disorders. Although experimental and computational approaches have improved detection of functionally deleterious variants and important interactions between gene products, the development of comprehensive models relating genotype and phenotypes remains a challenge in the field of genomic medicine. In this context, a new view of the pathologic state as significant perturbation of the network of interactions between biomolecules is crucial for the identification of biochemical pathways associated with complex phenotypes. Seminal studies in systems biology combined the analysis of genetic variation with protein-protein interaction networks to demonstrate that even as biological systems evolve to be robust to genetic variation, their topologies create disease vulnerabilities. More recent analyses model the impact of genetic variants as changes to the "wiring" of the interactome to better capture heterogeneity in genotype-phenotype relationships. These studies lay the foundation for using networks to predict variant effects at scale using machine-learning or algorithmic approaches. A wealth of databases and resources for the annotation of genotype-phenotype relationships have been developed to support developments in this area. This overview describes how study of the molecular interactome has generated insights linking the organization of biological systems to disease mechanism, and how this information can enable precision medicine. This article is categorized under: Translational, Genomic, and Systems Medicine > Translational Medicine Biological Mechanisms > Cell Signaling Models of Systems Properties and Processes > Mechanistic Models Analytical and Computational Methods > Computational Methods.

Entities: Chemical Disease Gene Mutation Species

Keywords: disease mechanism; genetic disease; network analysis; variant interpretation

Mesh：

Substances：
Receptor, trkB

Year: 2018 PMID： 30548534 PMCID： PMC6450710 DOI： 10.1002/wsbm.1443

Source DB: PubMed Journal: Wiley Interdiscip Rev Syst Biol Med ISSN： 1939-005X

INTRODUCTION

Recent advances in sequencing technologies have significantly reduced the costs of genome sequencing and genetic testing, allowing the detection of genetic variants at scale. In particular for humans, previous studies have aimed to identify genetic variants common to different populations (Abecasis et al., 2010) and nucleotide changes associated with phenotypes (Bush, Oetjens, & Crawford, 2016). Variant data collected in population studies have been used to describe the evolutionary history of humans (Schraiber & Akey, 2015) while, in medical settings, research has aimed to detect disease causing variants (Ku, Naidoo, & Pawitan, 2011) and/or variants that increase susceptibility (MacArthur et al., 2017). The analysis of genomic data and its relation to phenotypes is a fundamental step for enabling precision medicine (Fernald, Capriotti, Daneshjou, Karczewski, & Altman, 2011). Although in the last decades many studies have uncovered genetic variants associated to diseases (Amberger, Bocchini, Schiettecatte, Scott, & Hamosh, 2015), these discoveries only partially explain the biological complexity of most human diseases (Manolio et al., 2009). This observation is more evident in the case of polygenic disorders where the associated genetic variants are carried by only a fraction of the patients (Bomba, Walter, & Soranzo, 2017). Indeed many common and individually weak alleles have been detected for schizophrenia, bipolar disorder (Purcell et al., 2009) and rheumatoid arthritis (Stahl et al., 2012). The presence in the general population of large numbers of rare variants under strong selection suggests the hypothesis that these variants may contribute to a variety of diseases, potentially affecting many genes and pathways (Eichler et al., 2010). Furthermore, it is well established that genes do not cause disease in isolation but rather encode elements that form a dynamic molecular network in which perturbations may result in different phenotypes (Sahni et al., 2013; Vidal, Cusick, & Barabasi, 2011). As many disease mutations affect protein function or expression, this overview focuses on networks of proteins and their interactions. Indeed, knowledge of the protein–protein interaction network has proven relevant for understanding the mechanisms of many human disorders, including ataxia (Kahle et al., 2011), autism (Sakai et al., 2011), Huntington's disease (Kalathur, Pedro Pinto, Sahoo, Chaurasia, & Futschik, 2017) and breast cancer (Taylor et al., 2009). In addition, analysis of the interactome is important for the identification of cross‐phenotype genetic associations (Hackinger & Zeggini, 2017; Zhong et al., 2009). This phenomenon, referred to as pleiotropy, was introduced more than 100 years ago by Ludwig Plate to describe cases where a mutation affecting the same gene results in clinically distinguishable phenotypes (Stearns, 2010). These observations sustain the need for more accurate tools for genome interpretation that consider the constellation of variants carried by an individual as possible perturbations of the underlying molecular interaction networks. In this review we summarize available resources, and describe how analysis of the interactome has led to an understanding of how the organization of biological systems leads to disease vulnerabilities. We discuss emerging strategies for predicting the impact of genetic variation using interactome networks and highlight future opportunities for network analysis of variants.

KEY CONCEPTS

In principle, genetic variants in the protein coding region of the genome can have a broad array of effects on protein activity, ranging from no consequence to severe alteration to the function and/or structure of a protein. The effect of a variant at the single protein level may not reflect the severity of the associated phenotype. Some loss of function mutations are well tolerated. Instead the severity of perturbation to the complex network of molecular interactions in the cell may more closely capture potential to generate a phenotype. Missense mutations that change only a single amino acid in the protein can significantly affect protein–protein, protein–DNA and enzyme–substrate interactions (Sahni et al., 2015). Studying variants in the context of protein–protein interaction (PPI) networks and biochemical pathways can improve our understanding of the mechanisms underlying genotype–phenotype relationships.

Graphical modeling for computing on networks

Graphical models provide a mathematical framework for studying the architecture of biological systems. Biological systems can be represented as networks, wherein nodes usually represent biomolecules, and edges represent interactions among them. Graphical modeling then allows quantitative measures to be derived from the network topology for analyzing different aspects of biological systems. The network structure itself can be analyzed, or biological measurement data can be mapped onto network nodes and edges to facilitate integration or interpretation of those measurements in the context of the organization of the underlying system. In the context of the relating genotype to phenotype, genetic alterations are mapped onto their respective proteins to identify the PPIs and biochemical pathways that are potentially affected.

Network analysis measures

Various network measures have been developed to describe the characteristics of nodes within networks (Newman, 2010). In Figure 1 we summarize a few important network measures used to describe nodes, using the PPI network of the NTRK2 activation pathway as an example. Node degree describes the number of interaction partners, and can be used to designate proteins as hubs or peripheral nodes (Figure 1a). The clustering coefficient of a node describes how connected the immediate network neighborhood of a node is (Figure 1b). Various measures of centrality have been developed to capture the importance of a node to information flow in a network. For example, degree centrality captures how connected a node is to the rest of the network (Figure 1c), and betweenness centrality describes the number of shortest paths that traverse a node (Figure 1d). Nodes can also be assigned to modules within the network using community detection algorithms (Leung, Hui, Lio, & Crowcroft, 2009; Newman, 2006). An example is represented in Figure 1 Panel D where NTRK2, BDNF, NTF4 constitute a module obtained using the Girvan–Newman algorithm (Girvan & Newman, 2002) and NTRK2 is a bottleneck connecting two communities. The network measures described above can be used to identify nodes with key “roles” within the network topology (Guimera & Amaral, 2005).

Figure 1

Network analysis measures. PPI network of the NTRK2 activation pathway through FRS2/FRS3. (a) The degree (k) is the number of edges of a node. The degree of FRS2 is 8. The edges are highlighted in red. (b) The clustering coefficient C of a node is calculated as the ratio between the connected triangles (delimited by the solid red and gray lines) and the total number of possible triangles k × (k − 1). The dashed lines represent the unbound triangles. For FRS2, C is 0.75 (21/28). (c) The degree centrality (C ) of a node is the number of edges divided by the total number of possible edges. The C of FRS2 is 0.8 (8/10). Red dotted lines represent the missing edges. (d) The betweenness centrality (B) is the sum over all the possible pairs of the fraction of shortest path passing through a node (red) divided by the total number of shortest paths. B of FRS2 is 9.167 (18 × 0.5 + 0.167). Gray edges are part of the shortest paths not passing through FRS2. In this example, edge length is determined by the layout algorithm and does not have a quantitative interpretation

DATABASES AND RESOURCES

The development of new methods for analyzing and predicting the impact of genetic variants in the context of the PPI network requires the collection of high‐quality data from biological experiments and clinical reports. The primary sources of data needed for such analyses can be divided into three main categories: genetic variants, biological networks and disease association databases.

Genetic variant databases

There are a growing number of databases of genetic variants available on the internet. The most comprehensive database of small variants is dbSNP (Sherry et al., 2001) which, in the current version (Build 151), includes more than 113 million validated genetic variants. In spite of the name, dbSNP does not contain only single nucleotide polymorphisms (SNPs) but also rare and somatic variants. Considering these different types of genetic variation, single nucleotide variants (SNVs) account for ~90% of small variants. A significant amount of these data are the result of the 1000 Genomes project (Abecasis et al., 2010). The 1000 Genomes Consortium sequenced the whole genomes of more than 2,500 individuals from different populations allowing a more accurate characterization of the landscape of genetic variants in humans and better estimates of the average load of variants per individual (Abecasis et al., 2012). Several data sources are more focused on collecting information about the phenotypic effects of genetic variants. For example, Clinvar hosts curated information about the health consequences of genetic variants (Landrum et al., 2016). Clinvar includes ~440,000 variants with some supporting evidence of a relationship to human phenotypes (Clinvar release July 16, 2018). Focusing on the variants with clinical significance, Clinvar contains ~81,000 genetic variants classified either as “Pathogenic” or “Benign.” The Pathogenic subset consists of ~52,000 variants from ~3,500 genes which are associated with more than 4,400 Mendelian disorders. Clinvar also includes a small set of disease‐associated variants in intronic and noncoding regions (~4,900), and pathogenic synonymous SNVs (~200). Another important database that contains information about the impact single amino acid variants (SAVs) and their relationship to human phenotype is SwissVar (Mottaz, David, Veuthey, & Yip, 2010). SwissVar curators classify the impact of SAVs either as “Pathogenic” or “Polymorphism,” extracting relevant information from the literature. The current release of SwissVar database (release 18, July 2018) contains ~70,000 SAVs, 42% of which (~29,000) are “Pathogenic” in ~2,750 genes. These pathogenic SAVs are associated with more than 3,450 Mendelian diseases. Since 2008, the published results of genome‐wide association studies (GWAS) have been systematically reviewed to extract significant association between common variants (SNPs) and complex disorders. This data is hosted by the GWAS Catalog (MacArthur et al., 2017) that contains significant SNP‐Trait associations (p‐value <9e−6) for 50,900 unique genomic locations. About 63% of these loci are mapped to ~11,400 genes while the remaining variants are located in intergenic regions. Among complex diseases, cancer has been the focus of many sequencing studies (International Cancer Genome Consortium, 2010). The analysis of genomic data from cancer patients resulted in the detection of a large number of somatic variants, found by comparing the genetic variants in tumor cells with those in the normal cells from the same individual. The somatic variants detected with this approach are collected in the COSMIC database (Forbes et al., 2017). Version 85 of the COSMIC database (May 2018) contains ~4.4 million variants from ~253,000 tumors samples across 45 primary sites. A small number of the variants reported in COSMIC (Kamburov et al., 2015) are additionally described as causing clinical resistance to pharmaceutical therapies. These 141 mutations affected 21 different genes, but were most prevalent in ABL1, EGFR, and KIT.

Resources for biological network analysis

Biological networks can be built directly from expert knowledge, or in an unbiased fashion from large experimental screens (Rual et al., 2005) (see Box 1). Perhaps the most intuitive biological network model is a PPI network, where nodes represent proteins and edges indicate physical interactions between proteins. Other common networks model intracellular signaling, mRNA coexpression, gene regulation or metabolic flux. A broad selection of pathways and networks are hosted via online databases such as KEGG (Kanehisa, Furumichi, Tanabe, Sato, & Morishima, 2017), IntAct (Orchard et al., 2014), iRefIndex (Razick, Magklaras, & Donaldson, 2008), Reactome (Fabregat et al., 2018), Pathway Commons (Cerami et al., 2011), BioPlex (Huttlin et al., 2017), DIP (Salwinski et al., 2004) and STRING (Szklarczyk et al., 2017). These databases can be classified according to the type of information collected. Although KEGG (Kyoto Encyclopedia of Genes and Genomes) collects many types of information, it serves as a reference database for biological pathways. The KEGG PATHWAY resource consists of graphical representations of cellular processes, such as metabolism, membrane transport, signal transduction and cell cycle. Recently, the MINT (Licata et al., 2012) and IntAct (Kerrien et al., 2012) databases merged their efforts to provide a curated repository of experimentally determined interactions. The current version of IntAct (August 2018) contains ~546,000 unique PPIs, 35% of which occur between ~23,500 human proteins. IntAct also contains a small set of proteins unlikely to be engaged in an interaction. This negative interaction set represents less than 0.2% of the total number of IntAct interactions. Other resources such as iRefIndex (Razick et al., 2008), Reactome (Fabregat et al., 2018), Pathway Commons (Cerami et al., 2011) integrate pathways and/or PPI data from several primary sources. PPIs can be detected in a variety of ways. The two most common technologies for high‐throughput PPI screening are yeast two‐hybrid (Y2H) and mass spectrometry (MS). These technologies have advantages and limitations that must be considered when analyzing the resulting interactomes. Y2H can effectively detect binary interactions, but cannot detect multiprotein complexes. MS can characterize the elements of multiprotein complexes, however it does not provide enough information to determine which proteins in the complex are in direct physical contact. Sometimes such complexes are depicted as “cliques” in networks, such that all participating proteins in the complex are linked together by edges. Both technologies have associated false positive (finding an interaction when none exists) and false negative (failing to find an existing interaction) detection rates. Interactions can be also be obtained from low throughput experiments, for example co‐crystal structures obtained via X‐ray crystallography, cryo‐electron microscopy or negative stain electron microscopy. Because many studies that probe protein interactions are not performed as high‐throughput screens, the associated interactions are generally mined from the biomedical literature. Networks generated from literature mining are more susceptible to study bias than networks derived from high‐throughput screens, however networks constructed from the literature tend to contain more interactions and those interactions may be less prone to random error (Chen, Aronow, & Jegga, 2009). Another widely used resource for PPIs is the STRING (Szklarczyk et al., 2017) database which integrates known and predicted interaction data across multiple organisms. Apart from experimentally verified PPIs, STRING collects data derived from different sources including gene co‐expression analyses, automated text‐mining and computational inference based on gene orthology. For human alone, the latest release of STRING (version 10.5, May 2017) contains more than 691,000 unique associations between ~18,700 genes. New resources for studying biological networks, such as the Network Data Exchange (NDEx) (Pratt et al., 2015) and the Cell Collective (Helikar et al., 2012), are emerging to provide collaborative platforms for data sharing, analysis and model simulation. In particular NDEx implements a RESTful API which can be programmatically accessed by any application. In the most recent version of NDEx, the curators improved the quality and abundance of biological networks relevant to the cancer research community. Similarly, the Cell Collective platform enables users to build and analyze network models, and use them to run simulations via a web interface. This application can be used to simulate loss/gain of function and test possible scenarios in real time. High‐throughput experiments for detecting molecular interactions are important for mapping the interactome and are also useful for assessing the quality of available databases, validating the performance of methods that predicting PPIs and selecting sub‐networks obtained from the same experimental technologies. A recent study based on affinity purification mass spectrometry detected more than 56,000 interactions among ~11,000 human proteins (Huttlin et al., 2017). The results of this experiment are accessible through the BioPlex website (http://bioplex.hms.harvard.edu/). Crosslink mass spectrometry is an alternative technique to profile PPIs (Kaake et al., 2014; Zybailov, Glazko, Jaiswal, & Raney, 2013). A popular high‐throughput approach for detecting interactions is by yeast two‐hybrid (Y2H) experiments that detect binary PPIs (Cafarelli et al., 2017; Rolland et al., 2014; Rual et al., 2005). Many such interactions are available through the Human Reference Protein Interaction Mapping Project (HuRI; http://interactome.baderlab.org/). The large number of available resources for studying biological networks poses a question about the implications of selecting a network for a particular study, including the reliability of particular networks for specific applications (see Box 2). Focusing on the ability to recover disease gene sets, a recent study evaluated 21 human genome‐wide interaction networks (Huang et al., 2018). This analysis showed that performance increased with network size. The STRING database had the best overall performance; however after correcting for size, the smaller network from the Database of Interacting Proteins (DIP) (Salwinski et al., 2004) had the highest per edge performance. The availability of numerous biological networks constructed from different experimental techniques and literature mining poses a difficult question about the accuracy and reliability of networks for disease studies. Generally speaking the evaluation of the quality of the interaction networks is problem dependent. Focusing on the task of recovering disease‐gene associations based on colocation and connectivity in the interactome, larger networks tend to achieve better overall performance (higher sensitivity) than smaller networks. Contrarily, if the analysis aims to minimize the number of false positive genes recovered, a network with a smaller and well‐curated set of interactions on average scores with higher precision than larger networks (Peng, Wang, Peng, Wu, & Pan, 2017). Interactions that have been detected by multiple technologies are often considered more reliable, thus some studies include interactions with multiple evidences (Chou & Cai, 2006; Sharan et al., 2005; Trivodaliev, Bogojeska, & Kocarev, 2014). However this can throw away real interactions that could be informative for a particular study. As a rule of thumb, PPI data from X‐ray crystallography are more reliable than other types of data. An important limitation for evaluating the quality of PPI networks is the low number of known of noninteracting proteins (negative set). A fair assessment of the quality of a PPI network should include an analysis of the performance on a negative set.

Disease/phenotype annotation and classification

Pivotal resources for studying the impact of genetic variants in the context biological networks are databases for the annotation and classification of diseases and phenotypes. A systematic classification of diseases and their genetic causes was carried out by McKusick, who developed the primary comprehensive curated repository for genotype/phenotype relationships. Available online since 1987, the Online Mendelian Inheritance in Man (OMIM) database (Amberger et al., 2015) synthesizes and summarizes information extracted from the biomedical literature by careful curation. The OMIM database is freely available upon request for the academic community, and it its current version (August 2018) contains ~5,300 phenotypes with known molecular basis. Cataloging and description of distinct phenotypes is a limiting step for their analysis and comparison. To overcome this limitation, different standardized vocabularies have been developed to ensure consistent, reusable and sustainable descriptions of human diseases. Initially the U.S. National Library of Medicine (NLM) developed the Unified Medical Language System (UMLS) (Bodenreider, 2004) which includes names, concepts and relationships from different biomedical vocabularies. Similarly, Medical Subject Headings (MeSH) (Rogers, 1963) defines a hierarchically‐organized terminology for indexing and cataloging biomedical information, and the Systematized Nomenclature of Medicine (SNOMED) provides a systematic, computer‐processable collection of medical terms (Ruch, Gobeill, Lovis, & Geissbuhler, 2008). The medical terms developed by NLM curators are part of specific ontologies for the classification of disease and phenotype such as the Disease Ontology (DO) (Kibbe et al., 2015) and the Human Phenotype Ontology (HPO) (Köhler et al., 2017). DO is a hierarchical disease‐centric ontology collecting additional facts about disease. In the latest version, DO curators expanded the utilities for examination and comparison of genetic variants, phenotypes, proteins, drugs and epitopes. In contrast to DO, the HPO focuses on the analysis of phenotypic abnormalities. The HPO project is divided into three components: the phenotype vocabulary, disease‐phenotype annotations and algorithms that operate on these. With respect to DO, HPO implements a better nomenclature for the description of rare diseases. There are a number of additional resources for disease‐gene associations such as DisGeNET (Piñero et al., 2017), dSysMap (Mosca et al., 2015) and the Comparative Toxicogenomics Database (CTD) (Davis et al., 2017). In particular, DisGeNET is one of the largest collections of genes and variants associated with human diseases, and integrates data from expert curated repositories, GWAS catalogues, animal models and the scientific literature. The current version of DisGeNET (v5.0, Integrative Biomedical Informatics Group, Barcelona) contains more than 560,000 gene‐disease associations, between ~17,000 genes and more than 20,000 diseases and traits. In terms of variants, it contains more than 135,000 variant‐disease associations, between ~83,000 variants and ~9,200 diseases and phenotypes. Focusing more on protein structure, dSysMap maps human disease‐related mutations onto the structural interactome. In its latest version (April 2018), dSysMap contains ~29,000 mutations in ~2,700 proteins associated to ~3,600 phenotypes. Finally, the CTD is a database that aims to advance the understanding of the effect of environmental exposures on human health. The database is divided in six categories with 11 relationships among them. Apart from gene–disease associations and gene–gene interaction data, CTD collects associations between chemicals and diseases. The current version of the database (June 2018) is composed of ~37,500 curated gene–disease and ~211,500 curated chemical–disease associations. A summary of the resources described above is provided in Table 1.

Table 1

Selected databases and resources for variant interpretation in the context of biological interactions

Database	Data	Web address
Variant databases
1000 Genomes	Whole genome and variants of >2,500 individuals	http://www.internationalgenome.org
ClinVar	Human variants with clinical significance	https://www.ncbi.nlm.nih.gov/clinvar
COSMIC	Catalog of somatic mutations in cancer	https://cancer.sanger.ac.uk/cosmic
dbSNP	Small variants from several organisms	https://www.ncbi.nlm.nih.gov/snp
GWAS catalog	Disease‐associated variants from published GWAS	https://www.ebi.ac.uk/gwas
SwissVar	Annotated single amino acid variants	https://swissvar.expasy.org
Network resources
BioPlex	Human PPIs from AP‐MS	http://bioplex.hms.harvard.edu
HuRI	Human PPIs from Y2H	http://interactome.baderlab.org/
IntAct	Manually curated PPIs from literature	https://www.ebi.ac.uk/intact
iRefIndex	Integration of PPIs from many databases	http://irefindex.org
KEGG	Reference database for biochemical pathways	https://www.genome.jp/kegg
NDEx	Platform for sharing and analyzing biological networks	http://www.ndexbio.org
Pathway Commons	Human PPIs and pathways from different sources	http://www.pathwaycommons.org
Reactome	Integration of PPIs and pathways from many databases	https://reactome.org
STRING	Experimental and predicted PPIs	https://string‐db.org
Disease/phenotype association and classification
CTD	Curated gene and chemical‐phenotype associations	http://ctdbase.org
Disease ontology	Hierarchical ontology for description of diseases.	http://disease‐ontology.org
DisGeNet	Resource of variant and gene association to disease	http://www.disgenet.org
dSysMap	Maps of disease mutations on the structural interactome	https://dsysmap.irbbarcelona.org
HPO	Ontology for the description of phenotypic abnormalities	https://hpo.jax.org
OMIM	Database of genes implicated in Mendelian disorders	https://omim.org

Note. AP‐MS: affinity purification‐mass spectroscopy; PPI: protein–protein interaction; Y2H: yeast two‐hybrid.

Selected databases and resources for variant interpretation in the context of biological interactions Note. AP‐MS: affinity purification‐mass spectroscopy; PPI: protein–protein interaction; Y2H: yeast two‐hybrid.

Computational methods for variant, network and disease annotation

Although not directly relevant to the current review, it is worth noting that a variety of computational tools have been developed to prioritize, annotate and extend the three categories of information required for network analysis of variants. Over the last decade many algorithms have been developed to predict the impact of SNVs (Capriotti, Nehrt, Kann, & Bromberg, 2012; Compiani & Capriotti, 2013), PPIs (Esmaielbeiki, Krawczyk, Knapp, Nebel, & Deane, 2016; Gonzalez & Kann, 2012), and disease‐gene associations (Bromberg, 2013; Tranchevent et al., 2011). In particular many machine learning methods are available to predict deleterious SNVs (Capriotti et al., 2012) and the effect of single amino acid substitutions on protein stability (Compiani & Capriotti, 2013). The prediction of new PPIs can be performed using sequence and/or structure information (Gonzalez & Kann, 2012). Some methods have also been trained to predict the interface residues that mediate the interactions between proteins (Esmaielbeiki et al., 2016). New associations between genes and diseases are frequent in the literature. Thus, methods for mining the literature to recover new disease‐gene associations are essential (Gonzalez, Tahsin, Goodale, Greene, & Greene, 2016) The majority of such tools use algorithms comparing regular text, specific ontologies and biological networks (Bromberg, 2013; Tranchevent et al., 2011). The computational methods described in the above‐cited reviews represent important early attempts to bridge the gap between the vast numbers of catalogued genetic variants and their association with human phenotypes, and are frequently applied to inform mechanistic studies.

NETWORK TOPOLOGY AND DISEASE

Networks provide a versatile framework for modeling the architecture of biological systems. Biological network architectures arise through evolution which should select for characteristics that confer a fitness advantage to an organism. For example, protein interaction networks have evolved to be robust to random genetic variation (Albert, Jeong, & Barabasi, 2000; Félix & Barkoulas, 2015; Kaneko, 2012; Payne & Wagner, 2015). As a result, studying mutation rates together with location in PPI networks can provide information about evolutionary constraints on particular proteins. Proteins under stronger evolutionary constraint should be less tolerant to error, and mutations in those proteins should more likely be associated with extreme phenotypes. Thus topology should also be helpful for bridging the gap between genotype and phenotype. Indeed networks that recapitulate the organization of biological systems can be used to study the relevance of the constituent molecules and molecular interactions to fitness or disease (Barabási & Oltvai, 2004; Ideker & Krogan, 2012; Vidal et al., 2011).

Properties of biological interactome networks and how they relate to phenotype

Studies of PPI network topology have generated multiple insights linking protein location and connectivity within the network to particular phenotypes. The characteristics of PPI network topology that enable function and robustness of biological systems also create certain kinds of vulnerability. PPI networks tend to have a scale‐free topology, such that the number of edges with degree k scales as a power‐law distribution (p(k) ~ k ) (Barabasi & Albert, 1999) where the exponent (γ) typically ranges between 2 and 3. As a result, a minority of nodes are hubs with a very large number of interaction partners while most nodes participate in very few interactions. This is thought to render the system robust to random error, since genetic variants at random are more likely to affect a protein with few interactors, and thus cause only a minor perturbation the overall topology of the network. However, this leads to vulnerabilities, as mutations affecting a highly connected hub are likely to have a significant impact on the system (Albert et al., 2000). In PPIs networks, on average, the shortest path length of edges separating a pair of randomly selected nodes grows proportionally to the logarithm of the number of nodes in the network (small‐world network) (Albert, 2005). The shortest path length is thought to be important for efficient transfer of information and rapid response to perturbation. Redundant paths between nodes may confer robustness to genetic variation (Albert, 2005; Papin & Palsson, 2004). Although small‐world properties would not be expected to generate a modular network topology per se (Gallos, Makse, & Sigman, 2012), biological networks tend to be modular, with densely connected subnetworks that are linked into the global network architecture by a small number of connections (Han et al., 2004). Indeed, there is selection against the formation of new interactions between nodes that are already highly connected. Instead, links between highly connected nodes and nodes with few interactions are favored (Maslov & Sneppen, 2002). The observed balance between modularity and small‐worldness in PPI networks may provide the optimal architecture for information flow (Gallos et al., 2012; Jarman, Steur, Trengove, Tyukin, & van Leeuwen, 2017; Yachie et al., 2016). However this architecture also creates bottlenecks, nodes that bridge more clustered regions of the network, and as such may create additional vulnerabilities (Yu, Kim, Sprecher, Trifonov, & Gerstein, 2007). Integrating protein topology with other data layers, such as gene expression or protein structure can reveal more detail about how the elements of the network function together. Hubs can be further divided into party and date hubs, dependent on whether interaction partners are all co‐expressed with the hub protein, or co‐expressed at different times or locations (Han et al., 2004). Incorporating protein structure into analysis of topology, Kim, Lu, Xia, and Gerstein, (2006) observed that hubs can be grouped according to whether interactors used different interfaces, allowing multiple simultaneous interactions, or used only a single interface, in which case interactions would be mutually exclusive. Interestingly, multi‐interface hubs corresponded to party hubs, whereas hubs that interacted with multiple partners via a single interface were more likely to be date hubs. The differences in co‐expression and interface usage by date and party hubs suggest that different evolutionary constraints may act on these subsystems. Indeed, in studies of the Saccharomyces cerevisiae PPI, date hubs were found to evolve more rapidly than party hubs, and their removal had a more extreme effect on the average path length of the network (Bertin et al., 2007; Han et al., 2004; Kim et al., 2006). Similarly, multi‐interface hubs were reported to evolve more slowly than single interface hubs (Biswas, Acharya, Podder, & Ghosh, 2018). Bottlenecks are also reportedly significantly less coexpressed with other network nodes, suggesting they may play a more dynamic role in biological systems (Yu et al., 2007).

Network topology determines the potential of genes to drive phenotypes

Given the clear evidence that nodes with distinct characteristics in the network support different aspects of the function of biological systems and are under different evolutionary constraints, it makes sense to evaluate the implications for fitness‐related phenotypes. Many studies have taken advantage of network measures to examine different classes of gene. Essential genes encode proteins that are required for organismal survival, such that loss of the gene is lethal. Essential genes are reported to have higher degree (Jeong, Mason, Barabási, & Oltvai, 2001), higher clustering coefficients (Feldman, Rzhetsky, & Vitkup, 2008; Said, Begley, Oppenheim, Lauffenburger, & Samson, 2004) and higher betweenness centrality in the PPI network (Yu et al., 2007). Grouping high degree nodes according to their status as party and date hubs further revealed that party hubs are more often essential than date hubs (Han et al., 2004). Cancer genes were also found to be enriched at hubs by several studies (Garcia‐Alonso et al., 2014; Goh et al., 2007; Jonsson & Bates, 2006; Sun & Zhao, 2010). In contrast, Mendelian disease genes were found to be less central in the network than essential and cancer genes (Garcia‐Alonso et al., 2014; Goh et al., 2007), particularly when essential Mendelian genes are excluded from the analysis (Goh et al., 2007). Interestingly, disease genes associated with dominant disorders had higher degree in the network than genes associated with recessive disorders (Feldman et al., 2008). In contrast, gene deletion at the periphery of the network was less frequently associated with an essential or disease phenotypes (Garcia‐Alonso et al., 2014; Khurana, Fu, Chen, & Gerstein, 2013). Figure 2 shows an example of analyzing network feature distributions for different classes of gene.

Figure 2

Exploring network topology as a determinant of gene–phenotype relationships. Topological location within the network has implications for biological function. (a) Nodes can be described with respect to particular characteristics in the network, including high degree hubs (red), nodes at the periphery (yellow) and nodes with the highest centrality according to four popular measures of centrality. We calculated network measures including (b) degree and (c) betweenness centrality for four groups of genes: 1,371 essential genes (Hart et al., 2015), 125 cancer genes (Vogelstein et al., 2013), 2,921 Mendelian disease genes (Stenson et al., 2017), and 7,099 other genes based on the latest release of STRING (Szklarczyk et al., 2017) to illustrate the types of observation that have been revealed by systematic studies of genes with respect to location in the interactome The relationship between network location, gene essentiality and disease raises the possibility of inferring the importance of a gene using network measures. For example, Xu and Li (2006) used a k‐nearest neighbors approach to implicate genes with similar network characteristics as likely Mendelian disease genes. Such approaches have also been generalized to predicting drug targets and toxicities (Kotlyar, Fortney, & Jurisica, 2012; Sun, Zhu, Zheng, & Xu, 2015). Kotlyar et al. (2012) found that the centrality of genes regulated by a drug target was correlated with the toxicity of the drug. The modular organization of the PPI network has been useful for implicating disease genes. According to the disease module hypothesis, genes related to a particular disease or symptom are likely to reside in the same region of the interactome (Bauer‐Mehren et al., 2011; Goh et al., 2007; Menche et al., 2015). A variety of community detection algorithms are available to identify tightly clustered groups of nodes that are more likely to be functionally related. Approaches have included random walk‐based algorithms (Rosvall & Bergstrom, 2008) and nonnegative matrix factorization (Zhang, Wang, & Zhang, 2007). Other algorithms use modularity to implicate disease genes for various classes of genetic disease. For example, the HetRank method uses networks to rank candidate genes for monogenic diseases exhibiting locus heterogeneity (Dand et al., 2015). In the setting of complex multigenic risk for disorders such as obesity, heart disease or diabetes, disease modularity has been used to uncover shared biological mechanisms underlying diseases by mapping distant risk variants, such as are identified by genome wide association studies, to genes that are close together in the network. Under the assumption that risk genes for the same disorder are more likely to be functionally related, Taşan et al. (2015) used a network of functionally associated genes to prioritize genes in disease‐associated genomic regions. Efforts to catalog mutations in thousands of tumor genomes have uncovered substantial genetic heterogeneity in cancer as well; despite their phenotypic similarities (cancer cells display certain hallmark behaviors; Hanahan & Weinberg, 2011), individual tumors rarely share the same mutations (Tian, Basu, & Capriotti, 2015). Although there is very little overlap in the genes that are mutated between tumors, the genes affected by causal driver mutations tend to converge on a limited set of pathways (Ali & Sjöblom, 2009; Garraway & Lander, 2013; Vogelstein et al., 2013). Since genes within a pathway also tend to cluster in biological networks, mutations can be mapped onto a network in order to identify sub‐networks of genes that are enriched for alterations (Leiserson et al., 2015; Vandin, Upfal, & Raphael, 2011; Zhang, Zhang, Wang, & Zhang, 2013) or assess tumor similarity using the set of pathways or network regions mutated in common (Hofree, Shen, Carter, Gross, & Ideker, 2013; Zhong, Yang, Zhao, Shyr, & Li, 2015). A more in depth discussion of network analysis for tumor genomes is provided by Ozturk, Dow, Carlin, Bejar, and Carter (2018). A phenotype of great medical interest is synthetic lethality. In the setting of synthetic lethality, mutations impairing one gene render loss of function at another gene lethal to the cell (Shen & Ideker, 2018). The most extensive studies of synthetic lethality have been performed by knocking out all genes individually and in pairwise combination is S. cerevisiae (Costanzo et al., 2010). Synthetic lethality occurs when cells are robust to knock out of each gene independently but sensitive to the loss of both. This raises the possibility of designing therapies that exploit preexisting mutations in cells to selectively eliminate diseased cells, a strategy that has been successfully used to combat cancer (Bryant et al., 2005; Farmer et al., 2005; Lord, Tutt, & Ashworth, 2015). Analysis using the interactome network topology revealed that synthetic lethal genes pairs were frequently clustered and coded for functionally related proteins that shared interaction partners, implicating protein interactions as a major source of synthetic lethality (Talavera, Robertson, & Lovell, 2013). More recently, CRISPR‐Cas9 was used to analyze the consequences of pairwise gene knockout in mammalian cells. Synthetic lethal pairs overlapped across three cell lines, but also showed significant differences, suggesting that lethality may vary considerably across cell types and conditions (Shen et al., 2017; Shen & Ideker, 2018).

NETWORK‐INFORMED VARIANT INTERPRETATION

From the observed relationship between network topology and essential or disease genes, it follows that the potential of genetic variants to cause a phenotype is determined by the location of the altered protein within the network (Carter, Hofree, & Ideker, 2013; Feldman et al., 2008; Vidal et al., 2011). Analysis of loss of function variants with and without pathogenic consequences confirms that interactome topology is a determinant of genotype to phenotype relationships. Garcia‐Alonso et al. reported that loss of function variants in healthy individuals were more frequently observed in genes located near the periphery of the interactome. In contrast, loss of function variants with pathological phenotypes was more central (Garcia‐Alonso et al., 2014). Piñero, Berenstein, Gonzalez‐Perez, Chernomoretz, and Furlong (2016) found that network centrality was inversely correlated with tolerance to mutation and that this could be observed at different scales, global and local, within the interactome. Of note, the association between network centrality and tolerance to loss of function variants was found to hold for PPI and regulatory networks, but not metabolic networks (Khurana et al., 2013). These studies suggest that the topological location of a variant within the network may be helpful in determining its pathogenicity.

Modeling variants as network perturbations

Most genetic variants are not loss of function events, but rather result in more subtle changes to protein sequences, and mutations within the same protein can have very different effects on its function (Taipale, 2018). It has been shown experimentally that most nonsynonymous Mendelian disease mutations generate stable proteins, supporting that mutation effects on specific protein activities rather than absence of protein drives disease phenotypes (Sahni et al., 2015). Mapping variants to nodes in the network cannot capture such subtle differences in variant effect, however the integration of information about protein structure and functional sites with protein interactions has made it possible to better discriminate variants in some cases by mapping them to network edges. Das et al. (2014), Sahni et al. (2013) and Zhong et al. (2009) introduced the concept of “edgetics” to describe the potential of mutations to perturb distinct interactions in which a protein participates. Under this model, variants mapping to the core have the potential to eliminate all interactions by destabilizing the protein, while variants mapping to interaction interfaces have the potential to perturb specific subsets of interaction (Figure 3).

Figure 3

Conceptual framework of edgetics. Location of variants within a protein has implications for their phenotypic consequences. Variants that map to the core of the protein are more likely to destabilize it, resulting in a loss of all interactions in which the protein participates. In contrast, mutations at protein interaction interfaces are more likely to perturb specific interactions. Variants mapping to the protein surfaces outside of binding interfaces are less likely to create a phenotype than core or interface variants Studies of edgetic effects require information about the three‐dimensional structure of protein complexes, so that amino acid residues can be labeled according to their location in the protein core, on the surface or at an interface between interacting proteins (Figure 4). The framework of edgetics thus allows variants to be studied not only in the context of their location in the network, but also according to their direct impact on network topology. Structurally resolved interactome networks, which integrate information about the domains or amino acid residues that physically interact, are increasingly available to explore the mechanisms by which mutations cause disease at scale (Betts et al., 2015; Das et al., 2014; Meyer, Das, Wang, & Yu, 2013; Mosca, Céol, & Aloy, 2013; Vázquez, Valencia, & Pons, 2015). Multiple studies using structurally resolved networks revealed a statistical excess of known disease mutations at protein interaction interfaces (David, Razali, Wass, & Sternberg, 2012; Guo et al., 2013; Wang et al., 2012), with in‐frame disease mutations enriched at interfaces relative to truncating mutations (Wang et al., 2012), confirming the utility of such networks for systematically investigating disease mechanism.

Figure 4

Mapping amino acid position to potential to interfere with protein interactions. Protein structures of FGF2 and FGFR1 are shown on the left and right respectively, and as a complex in the center (protein data Bank structure ID: 1CVS). In the complex, residues are colored according to location in the protein core (purple and green), at the interface (pink and blue) or at the surface outside of the interface (transparent pink and blue) on the two proteins respectively Early edgetic analyses focused on Mendelian mutations and revealed several key advantages to the edgetic model. Zhong et al. (2009) proposed that edgetics had the potential to describe aspects of genetic disease that could not be captured by topological location of a protein alone (Figure 5). These aspects include: (a) allelic heterogeneity (or pleiotropy), in which a single gene is associated with multiple phenotypes, (b) locus heterogeneity, where a single disorder is caused by a mutation in one of several genes, (c) variable penetrance, wherein not all individuals with a variant have a disease, and (d) variable expressivity, wherein individuals with a disease are not affected equally. Indeed, classic examples of pleiotropy and locus heterogeneity could be explained by edgetics. Mutations in the WASP protein associated with Wiskott–Aldrich syndrome versus X‐linked Neutropenia were found to map to distinct interfaces, while mutations associated with hemolytic uremic syndrome in the C3 and CFH proteins were found to map to reciprocal interfaces on the two proteins (Wang et al., 2012). Guo et al. (2013) further investigated the relationship between edgetic effects and the inheritance mode of diseases caused by particular mutations, finding that recessive disease mutations affecting reciprocal interfaces were more likely to cause the same phenotype than similar mutations associated with dominant effects. They described cases where dominant truncating mutations removed some interfaces while preserving others, such as was found for TRIM27 mutations in ovarian cancer, supporting that truncating variants may also frequently have edgetic effects (Guo et al., 2013).

Figure 5

Modeling the edgetic effects of genetic variants supports exploration of disease mechanisms. (a) Pleiotropy can result when different variants in the same gene affect different interactions in which a protein participates. (b) Variants at reciprocal interfaces of interacting proteins can contribute to locus heterogeneity Edgetics have also provided mechanistic insights for complex diseases including autism and cancer. For example, a study of 1,733 de novo missense mutations from autism spectrum disorder exomes demonstrated a significant enrichment of missense mutations affecting protein interactions in probands relative to unaffected siblings (Chen et al., 2018). Experimental assessment using the CloneSeq pipeline to test the effect of mutations on binary protein interactions (Wei et al., 2014) found that 74 of 361 (20%) of tested interactions were altered in probands versus only 21 of 208 (10%) in unaffected siblings. Several studies have demonstrated that somatic mutations found in tumor genomes are overrepresented at protein interaction interfaces (Engin, Kreisberg, & Carter, 2016; Kamburov et al., 2015; Porta‐Pardo, Garcia‐Alonso, Hrabe, Dopazo, & Godzik, 2015; Raimondi et al., 2016), suggesting that perturbations of protein interactions frequently contribute to tumor development. The Cancer Cell Map Initiative is now systematically experimentally mapping the impact of driver mutations on the interactome in cancer to further reveal the links between network rewiring and tumorigenesis (Krogan, Lippman, Agard, Ashworth, & Ideker, 2015).

Genetic variants perturbing network topology

While most edgetic variants would be expected to perturb existing interactions, it is also possible for variants to generate new interactions. For examples, the R273H amino acid substitution in TP53 was found to create a binding site for NRD1, and this novel interaction was found to promote cellular invasion in the setting of cancer (Coffill et al., 2012; Sahni et al., 2013). Such novel interactions may be difficult to predict, and it remains unclear how frequently amino acid substitutions generate new interfaces. The simplest way for novel interactions to emerge is likely through modified specificity at existing binding sites, and such events may be most prevalent in proteins families with high functional similarity. For example, many mutations in cancer are thought to alter the specificity of kinases for their substrates (Creixell et al., 2015; Reimand, Wagih, & Bader, 2013). Frequent cancer mutations were reported at acetylation and ubiquitination sites as well (Narayan, Bader, & Reimand, 2016). Experimental evidence also suggested that many disease mutations affect the motif specificity of transcription factors, with many mutations reducing the specificity of DNA binding sites, thereby allowing more promiscuous binding (Sahni et al., 2015). Nonsynonymous variants and small insertions and deletions have been the focus of most edgetic studies, however there are other types of alteration that have the potential to “rewire” the interactome. Alternative splicing generates different proteins from the same gene, and dependent on which exons are included, these protein isoforms can include different binding interfaces. Yang et al. (2016) isoforms used experimental approaches to map the interactome of 1,423 protein, and subsequently Ghadie, Lambourne, Vidal, and Xia (2017)) used in silico analysis of interface containing domain usage by various splice isoforms to construct an isoform‐specific interactome. Both studies found that patterns of protein interaction with different isoforms were associated with divergent disease phenotypes, creating additional opportunity for pleiotropy (Ghadie et al., 2017; Yang et al., 2016). In cancer, aberrant splicing was enriched at domain families that mediate protein interactions which also frequently harbored other types of mutation (Climente‐González, Porta‐Pardo, Godzik, & Eyras, 2017). Not all genes create equal opportunity for “rewiring” biological networks. In cancer, mutations often affect genes capable of causing large changes in the network. For example, cancer mutations frequently affect genes involved in chromatin remodeling, resulting in widespread changes to gene expression (Garraway & Lander, 2013). Fusion proteins are another common event in tumor genomes that can result in network rewiring. A recent pan‐cancer analysis found that fusions often involved genes with high degree in the network and that did not interact prior to the fusion event (Latysheva et al., 2016). Some recurrent cancer fusions involve genes that regulate the activity of multiple targets such as transcription factors, (e.g., RUNX1, ERG) or kinases (e.g., ABL1, NTRK1). Mutations with large impacts on network architecture are likely to be uncommon in inherited disease, as their consequences may be too severe to support early development in multicellular organisms. This hypothesis is consistent with reports that Mendelian disease genes are less central in the PPI network and participate in fewer protein interactions than cancer genes (Garcia‐Alonso et al., 2014).

Studying variant accumulation at network edges instead of nodes

Studying proteins in the context of their location in the network focuses analysis on node‐level properties. In contrast, in the framework of edgetics, variants are viewed based on their ability to perturb network edges. From this perspective it becomes possible to analyze the accumulation of variation affecting network edges rather than network nodes in order to better understand disease mechanism. Many cases of mutations affecting interacting genes and leading to similar phenotypes such as Branchiootic syndrome, Charcot–Marie–Tooth disease and Polycystic kidney disease have been reported (Oti, Snel, Huynen, & Brunner, 2006). Several groups have developed scoring strategies to prioritize network edges that are enriched for disease mutation. This requires first mapping variants to interface residues or interacting domains of proteins. Some methods evaluate the ratio of observed to expected variants in interface regions controlling or the size of the region relative to the size of the protein (Porta‐Pardo et al., 2015) and or the size and the amino acid composition (Raimondi et al., 2016). An alternative approach uses the nonsynonymous to synonymous (dN/dS) ratio at interfaces, a signature of selective pressure that has been used to identify cancer genes (Greenman et al., 2007), to evaluate whether interfaces are unexpectedly biased toward functional variants (Engin et al., 2016). Mechismo evaluates the consequences of amino acid substitutions at interfaces in terms of their impact on the pair potentials relative to what is expected (Betts et al., 2015), enabling an assessment of the likely effect of the variant on the interface. Other analyses used methods such as FoldX (Schymkowitz et al., 2005) to estimate the impact of the amino acid substitution on the stability of the protein complex.

Using networks to predict the phenotypic consequences of variants

Since networks influence the potential of variants to have a phenotypic effect, and variants result in phenotypic effects by perturbing biological networks in different ways, network measures should be informative for computational tasks relating to variant prioritization and interpretation. Some methods have begun to incorporate network information into classification tasks. Khurana et al. (2013) combined network and evolutionary properties to build a classifier capable of predicting gene tolerance to loss of function mutations, enabling automated prioritization of loss of function events according to their potential to have phenotypic consequences. Location of a variant at a protein interface was found to be one of the most informative features for discriminating driver missense mutations from neutral passenger mutations in analysis of tumor genomes (Tokheim & Karchin, 2018). Gao et al. (2018) used features derived from gene regulatory networks to predict the functional consequences of noncoding variants under the assumption that the causal effects of gene‐expression altering variants must be transmitted through this network. There may be various ways to derive features from interactome networks for the purpose of predicting variant effects. Yu et al. (2016) derived “ontotypes,” a signature of nodes reached by a gene in an S. cerevisiae‐specific hierarchical interactome network, to predict the impact of specific gene knockouts on colony growth. Several approaches have used the structure of the network more directly in predicting variant effects. To capture the potential for variants in the same gene to have different effects on biological processes, Engin, Hofree, and Carter (2015) mapped cancer mutations affecting different interfaces on the same protein onto different perturbed network architectures. Network propagation was applied to identify downstream subnetworks specifically affected by each mutation (Figure 6). These subnetworks could then be analyzed for functional enrichment to suggest mechanisms by which mutations perturbing distinct activities of a protein could result in different phenotypes. In a different study, Poole, Leinonen, Shmulevich, Knijnenburg, and Bernard (2017) evaluated the statistical association of clusters of somatic mutations within selected proteins to pathway level changes in gene expression in tumors, raising the possibility of aggregating mutations with common edgetic effects across samples and simultaneously analyzing them using networks.

Figure 6

Propagating variant effects on networks. Variants can be used as signal sources for network propagation in order to identify network neighborhoods affected by variants. Edgetic effects can be used to constrain network propagation according to the effects of variants on specific protein interactions. On the left side of this schematic, two variants to the purple node affect interactions with different subsets of partners (indicated by blue and pink nodes respectively). Network propagation can be used to implicate network regions likely to be affected by each variant, and these can be contrasted to identify regions perturbed by both variants that could explain shared phenotypes (right network, circled purple shaded nodes), or regions affected specifically by each variant (right network, blue and pink shaded regions) which could help explain pleiotropic effects Recently, Ma et al. (2018) developed DeepCell wherein the hierarchical interactome was used to constrain the architecture of a deep convolutional neural network that was trained to predict growth effects from genotype in S. cerevisiae. DeepCell was found to effectively simulate growth phenotypes observed in the laboratory directly from genotype. Furthermore, the weights learned by the neural network could be used to develop testable hypotheses about the mechanisms by which variants generated growth phenotypes, with some variants displaying Boolean‐type effects on specific subsystems. Thus DeepCell mines the multiscale, hierarchical organization of the interactome to extract novel information linking genotype to phenotype. In summary, multiple strategies have been developed to annotate variants according their consequences for biological networks to support their interpretation in the context of various phenotypes. These works lay the foundation for the next generation of automated variant interpretation tools that will integrate information about the architecture of the biological system and the potential of variants to perturb it. Strategies for quantifying variant effects on the network, or using network information to predict variant activity have thus far focused largely on single variants, however we note that multiple variants can simultaneously be mapped to network effects. This raises the possibility of using networks to predict the joint effects of combinations of variants, such as occur in complex polygenic diseases.

CHALLENGES AND FUTURE DIRECTIONS

While early works integrating networks with variant information to understand the mechanisms driving genetic disease show great promise, many challenges remain to be addressed. At times, analyses of network topologies have generated contradictory findings (Coulomb, Bauer, Bernard, & Marsolier‐Kergoat, 2005; Gandhi et al., 2006; Ivanic, Yu, Wallqvist, & Reifman, 2009; Jordan, Wolf, & Koonin, 2003; Yu et al., 2008), suggesting that careful consideration must be given to the limitations of the available data and the implications of choices made in constructing and analyzing network models when drawing biological conclusions. Data availability and quality is an important consideration for network analyses. PPI networks remain incomplete and may contain many false positive connections. In addition, networks assembled from various published experiments may exhibit literature bias; proteins associated with certain phenotypes may be more studied and as a result, may appear more connected in the network, giving the illusion that higher degree nodes are associated with phenotypes of interest (Chen et al., 2009; Feldman et al., 2008; Xu & Li, 2006). Choice of network may thus influence the biological conclusions drawn from network analysis. Indeed, Huang et al. (2018) showed that network performance at recovering known disease genes varied according to the disease, and no single network performed best for all diseases. To partially address this issue, several methods for selecting high‐quality PPI datasets and score the reliability of an interaction have been developed (Peng et al., 2017). Models frequently focus on specific aspects of biology while ignoring others. For example, many molecular interactome networks ignore regulatory interactions by which distant proteins in the network can influence each other, and often do not account for posttranslational modifications. Hybrid networks that incorporate both physical and regulatory interactions can be constructed (Rioualen et al., 2017) but statistical analysis may be complicated by the inclusion of different types of relationship. More sophisticated applications of natural language processing may support direct inference of more complex network models directly from the literature (Gyori et al., 2017). Many of the approaches discussed in this overview treat proteins as static network nodes and do not attempt to incorporate protein levels. Integrating interactome networks with more quantitative modeling techniques that rely on differential equations (Schnoerr, Sanguinetti, & Grima, 2017) or agent based models (Abar, Theodoropoulos, Lemarinier, & O'Hare, 2017) could present a pathway for including quantitative and dynamic information about protein levels, however these methods are frequently computationally intensive and may not be practical for large networks. Structure is not available for the majority of human proteins, limiting the investigation of variants affecting protein interactions. The extent to which networks themselves are complete also remains poorly understood. Many conclusions have been drawn based on the architecture of the human interactome, however some estimates suggest that at most 20% of interactions have been experimentally measured (Menche et al., 2015). In addition, gene expression patterns differ widely across cell types, suggesting that for accurate inference, network architectures need to be cell‐type specific. Indeed, disease network modules tend to include genes that are coexpressed in specific tissues (Kitsak et al., 2016), and several groups have now constructed tissue‐specific networks to study disease variation (Greene et al., 2015; Pierson et al., 2015). Many of the analyses described here have yet to be revisited in a tissue‐ or cell‐type specific setting. Finally, network representations are usually static, whereas the biological networks that they represent are dynamic and conditional. Protein interactions often require particular localization or posttranslational modification. Novel technologies such as APEX, a proximity labeling technique recently developed to enable spatially resolved analysis of protein interaction networks (Lobingier et al., 2017), may provide a solution to further resolve cell‐type specific interactions and subcellular location thereof. Distinguishing between constitutive and transient interactions, and cell‐state specific interactions may be important for further understanding the potential of variants to generate relevant phenotypes (Perkins, Diboun, Dessailly, Lees, & Orengo, 2010). Thus, the Stable Isotope Labeling by Amino acids in Cell culture (SILAC) method, a mass spectrometry technique able to detect differences in protein abundance among samples using nonradioactive isotopic labeling, has been used to measure stability of PPI (Budayeva & Cristea, 2014). New technologies are emerging that can accelerate the pace of interaction profiling and that will create more complete networks and new opportunities for analysis. Next generation sequencing‐based interaction screening technologies such as BFG‐Y2H (Yachie et al., 2016) and CrY2H‐seq (Trigg et al., 2017) allow higher throughput screening for binary interaction. Combining such technologies with techniques like deep mutational scanning (Fowler & Fields, 2014) could allow more systematic profiling of the edgetic effects of variants. Technologies such as Perturb‐seq also allow high‐throughput profiling of the effects of variants on single cell transcriptional profiles, creating opportunities to quantify the broader effects of variants at the resolution of specific cell types (Dixit et al., 2016). Interactome networks have the potential to play a key role in the interpretation of such assays.

CONCLUSION

A rapidly growing body of research has demonstrated the utility of network analysis for understanding disease mechanism. Many early studies relied on interactome networks derived from model organisms, such as S. cerevisiae, but the less complete human interactome has been instrumental for studying the relationships between genetic variation, genes and disease. Many insights have been gained from the study of genes with clear phenotypes, including essential genes, Mendelian disease genes and cancer genes. Interactome networks have also been successfully used to identify drug targets and study mechanisms underlying toxicities. However, a significant proportion of human genetic disease remains poorly understood, particularly for complex multigenic disorders. As new high‐throughput experimental assays emerge, databases of genetic variation and network models will become increasingly available and more complete. Large consortia are generating invaluable multilayer datasets that can create new opportunities for integrated analysis for variant interpretation. For example, the eGTEx project will add epigenetic measurements to complement the library of tissue‐specific expression data in GTEx (GTEx Consortium, 2013), enabling integration of epigenetic factors and tissue specific gene regulation into models for interpreting disease variants (eGTEx Project, 2017; Li et al., 2017). These new methodologies and datasets create opportunities for the development of computational models that support network‐informed inference to predict the phenotypic consequences of genetic variation and reveal the mechanisms by which variants contribute to complex human diseases. Recent technological advances are enabling the development of emerging research fields such as the Molecular Pathological Epidemiology (MPE) incorporating interpersonal heterogeneity of a disease process into epidemiology (Hamada, Keum, Nishihara, & Ogino, 2017). In this framework the integration of data capturing the complex combination of genetic heterogeneity (endogenous factors) and the environment (exogenous factors) is providing novel insights underlying etiologic mechanisms of different cancer types (Ogino, Nowak, Hamada, Milner Jr, & Nishihara, 2018) and define new therapeutic strategies (Ogino et al., 2018). Although these early efforts show great promise, new technological and algorithmic advances will be required to realize the full potential of networks to inform variant interpretation and precision medicine.

CONFLICT OF INTEREST

The authors have declared no conflicts of interest for this article.