Challenges and opportunities in network-based solutions for biological questions.

Margaret G Guo, Daniel N Sosa, Russ B Altman.

Abstract

Network biology is useful for modeling complex biological phenomena; it has attracted attention with the advent of novel graph-based machine learning methods. However, biological applications of network methods often suffer from inadequate follow-up. In this perspective, we discuss obstacles for contemporary network approaches, focusing in particular on challenges in representing biological concepts, applying machine learning methods, and interpreting and validating computational findings about biology, in an effort to catalyze actionable biological discovery.
© The Author(s) 2021. Published by Oxford University Press.

Keywords:  biological validation; embeddings; interpretability; knowledge graphs; networks

Year:  2022        PMID: 34849568      PMCID: PMC8769687          DOI: 10.1093/bib/bbab437

Source DB:  PubMed          Journal:  Brief Bioinform        ISSN: 1467-5463            Impact factor:   13.994


Networks: A Useful But Limited Abstraction

With over 700 publicly available pathway and molecular interaction databases [4, 5], it is difficult to choose the right network. Networks can model biological systems at levels ranging from molecular to population-scale [6-13], where edges typically represent interactions between nodes corresponding to biological entities (drugs, genes, proteins, diseases, etc.; see [14, 15] for comprehensive reviews of graph theory applied to biological applications). Biological networks are often incomplete [16, 17]: as much as 80% of protein–protein interaction (PPI) data may be missing [18]. Even with high-throughput datasets [19-21], building accurate and comprehensive network models is a monumental task.

The first step toward ensuring network quality is proper documentation of process and metadata annotation. The second is to evaluate the network's ability to recapitulate known interactions; manual curation is the typical gold standard [22]. A silver standard is corroboration of interactions across orthogonally curated experimental sources, as done with PCNet [23]. Finally, network specificity can be improved by removing potential false-positive interactions due to experimental artifact; CRAPome, a contaminant repository for affinity purification–mass spectrometry (AP–MS) experiments used to build PPI networks, provides putative negative interaction data. Each of these approaches can increase confidence in the accuracy of new networks.

To address the issue of sparsity, networks are often aggregated from independent data sources to form a more comprehensive 'interactome' [24]. However, integrating heterogeneous information into a homogeneous network abstracts away biological nuance, such as cell-type specificity [8], spatial [25] and temporal [26] resolution, or environmental factors [27], and so precision suffers.
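As a concrete illustration of the aggregation and silver-standard ideas above, the sketch below unions two hypothetical PPI sources and measures how much of a curated gold-standard set the aggregate recovers. The gene pairs are toy placeholders, not real database contents:

```python
# Sketch: aggregate PPI edge lists from independent sources and check how
# well the union recapitulates a gold-standard interaction set.

def normalize(edges):
    """Treat PPI edges as undirected: store each pair in a canonical form."""
    return {frozenset(e) for e in edges}

# Two hypothetical, partially overlapping experimental sources.
source_a = normalize([("TP53", "MDM2"), ("BRCA1", "BARD1")])
source_b = normalize([("TP53", "MDM2"), ("EGFR", "GRB2")])

# Aggregated 'interactome': union of the independent sources.
interactome = source_a | source_b

# Silver-standard evaluation: fraction of curated interactions recovered.
gold = normalize([("TP53", "MDM2"), ("BRCA1", "BARD1"), ("KRAS", "RAF1")])
recall = len(interactome & gold) / len(gold)
print(f"{len(interactome)} aggregated edges, gold-standard recall = {recall:.2f}")
```

Normalizing each pair to a `frozenset` avoids counting `(A, B)` and `(B, A)` as distinct interactions, a common pitfall when merging undirected interaction data.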
In addition, PPI networks are inherently biased [28, 29] by the characteristics of experimental methods as well as external factors such as funding biases; these may make heavily studied proteins appear to have artificially high degree in networks. One potential solution to the problem of heterogeneous data is to use attributed knowledge graphs [30], in which edges are qualified by specific semantic relations between nodes and can record relevant attributes such as the 'confidence' in a relationship. These graphs are able to capture nuance that qualifies 'known' knowledge in the network [31, 32]. These techniques have not been broadly applied to molecular biology, and machine learning methods for these heterogeneous models are needed.

A host of network-based biological models can capture dynamic relationships, particularly the relationships between genes, proteins and other cellular entities in gene regulatory networks (GRNs; [33]). GRNs are flexible and enable temporal representation of node states that incorporate uncertainty in stochastic (as opposed to deterministic) models, making them amenable to Boolean [34, 35] and Bayesian network approaches [36]. These networks have been used to model dynamic cellular behavior [37-41]. Other architectures for dynamic models include differential equations [42], neural nets [43] and information theory-based approaches [44], all of which use gene expression data under differing experimental conditions to capture a system's behavior in response to perturbations. Identifying the perturbational datasets and parameters required to accurately recapitulate a system is a combinatorial optimization problem [37], making it computationally difficult to kinetically model full-scale networks. Increasing computing power and the proliferation of large-scale sequencing datasets may enable more tractable modeling of the dynamics of biological systems at scale.

Furthering Biologically Principled Inference Over Networks

A major force driving the explosion of network biology is the applicability of network-based machine learning methods to biological problems [1, 2, 45-47]. These have often been framed as the tasks of link prediction, community detection and network alignment [48]; comprehensive reviews [49, 50] survey applications of these network inference methods. Without diligence, however, the mapping from biological questions to neat network methods may be unprincipled and suffer from inadequate biological follow-up [3]. A key issue when using network inference methods is the quantity and quality of data used for training; however, systematic evaluations of the sensitivity of results to these parameters are rare. Huang et al. [23] studied the ability of different network topologies to recapitulate known disease gene sets using a network propagation approach [51]. They concluded that larger networks, such as STRINGdb [32], yield the best performance but observed diminishing returns with increasing network size. In addition, Menche et al. [18] used percolation theory (which describes the behavior of clustered components in networks as one randomly adds or removes edges) to draw connections between network sparsity and utility for biological tasks; they proposed heuristics about which disease gene sets might form identifiable modules in the network and their potential utility for applications.

Machine learning methodologies that use vectorized representations of graphs present opportunities and challenges when ported to biology. Recently, network embedding methods, whereby low-dimensional representations of network structures are learned, have become popular in network biology due to their power and flexibility [52]. In addition, graph-based representation learning has become popular in deep learning-based frameworks for inference over networks [53]. These methods, however, have limitations.
First, the network embedding strategy must be relevant in the context of a biological question. For instance, if nodes are embedded based on local network topology, then the biological problem should depend strongly on topology alone, since other features are not captured in this embedding. Second, embedding methods usually include simplifying assumptions, for example in translational and semantic matching models [54], which may limit their ability to capture symmetry, inversion and composition patterns among relations, all of which may be biologically relevant. Finally, many embedding methods offer no biological interpretation to explain predictions, which limits their broader utility to biologists, although work in this space is emerging [55, 56].
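The relation-pattern limitation can be seen directly in a translational embedding such as TransE, which scores a triple (h, r, t) as -||h + r - t||. The toy vectors below are untrained placeholders chosen only to illustrate the point:

```python
import numpy as np

# Sketch: TransE models a relation as a translation vector, so a relation
# embedding that fits (h, r, t) perfectly cannot also fit the reverse
# triple (t, r, h) unless r is (near) zero. Symmetric biological relations
# such as 'binds' therefore collapse under this assumption.

def transe_score(h, r, t):
    return -np.linalg.norm(h + r - t)

h = np.array([1.0, 0.0])   # placeholder embedding, e.g. protein A
t = np.array([0.0, 1.0])   # placeholder embedding, e.g. protein B
r = t - h                  # the ideal translation for (A, binds, B)

forward = transe_score(h, r, t)    # exact fit: score 0
backward = transe_score(t, r, h)   # the symmetric direction scores poorly
print(forward, backward)
```

The forward triple scores a perfect 0 while the reverse triple is heavily penalized, even though a symmetric relation should score both directions equally; this is the kind of modeling assumption that must be checked against the biology being represented.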

Closing the Loop with Biological Validation

In the machine learning community, validation typically entails data partitioning followed by testing on a held-out dataset containing gold standard interactions. Although this can lead to reproducible results, it has drawbacks. First, in network settings, a truly isolated, held-out partition of the data is difficult to construct: cross-validation via edge removal across the network removes key structural features, thus biasing algorithmic evaluation [57]. Second, biological gold standard data are incomplete, and 'truly negative' relationships are difficult to define [58]. Therefore, it is critical to validate on a variety of sources and use metrics that are robust to the level of missing data. Cross-validation across multiple networks may reduce network-specific bias. However, given that networks often share a common underlying structure and content, purely computational validation may not distinguish true biological discovery from sensitive information retrieval.

In biology, independent and prospective experimental validation remains the only generally agreed-upon gold standard. Indeed, the strongest form of validation comes from experimental and/or clinical evidence that supports network-generated hypotheses. Drug repurposing studies propose drugs that can be examined by subject matter experts and validated by in vitro drug screens or even clinical trials [22]. However, these efforts are rare due to their cost in time and money. Case studies can demonstrate biological applicability [59], but they provide only incremental evidence of biological validity.

Biologists routinely expect computational models to produce inferences that are mechanistically grounded and experimentally confirmable. 'Interpretable machine learning' seems desirable but is ill-defined [60]. For network biology, interpretability has two facets. 'Representational interpretability' is the ease of mapping biological abstractions to computational abstractions.
It defines the scope of information represented by the network; capturing nuance such as cell type, dynamics and directionality yields representations that are more faithful to the underlying biology [61-64]. 'Algorithmic interpretability' is the ability to generate traceable feature sets that support a biological hypothesis. For instance, in link prediction tasks over knowledge graphs, the capacity to find paths of known biological relations might serve as a form of deductive reasoning to support generated hypotheses [46, 65]. The pipeline from computational exploration to biological validation is not a linear path but rather an iterative process, wherein each step must be closely aligned with fundamental biological principles (Figure 1). We are optimistic that by first ensuring robust and relevant mappings to biological concepts, network methods will generate impactful insights that accelerate progress in biological discovery.
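A minimal sketch of the path-based reasoning described above, using a tiny invented knowledge graph (the entities and relation names are placeholders): a hypothesized drug-disease link is 'explained' by tracing a path of known typed edges:

```python
import networkx as nx

# Sketch: support a predicted drug-disease association by finding a path
# of known, typed relations in a toy knowledge graph.

kg = nx.DiGraph()
kg.add_edge("drugX", "GENE1", relation="inhibits")
kg.add_edge("GENE1", "pathwayY", relation="participates_in")
kg.add_edge("pathwayY", "diseaseZ", relation="dysregulated_in")

# Trace a path from the drug to the disease and render each hop with its
# semantic relation, giving a human-readable chain of evidence.
path = nx.shortest_path(kg, "drugX", "diseaseZ")
explanation = [
    f"{u} -[{kg[u][v]['relation']}]-> {v}"
    for u, v in zip(path, path[1:])
]
print("; ".join(explanation))
```

Such a chain does not prove the mechanism, but it gives a biologist a concrete, checkable hypothesis rather than an opaque score.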
Figure 1

A harmonious research pipeline for network methods in machine learning applied to biology.

The promise of network tools for biological discovery is great, although the field faces addressable computational and validation challenges. Heterogeneous network models, such as knowledge graphs, are needed to capture the growing number of literature-based and structured biological datasets, and they can provide context and metadata for properly qualifying our biological models. More powerful computing hardware allows cross-validating and testing on multiple networks, reducing network-specific bias while enabling better empirical 'null' models to assess significance within methods. Machine learning methods for these more complex, heterogeneous network models are still needed.
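One common form of empirical null model is degree-preserving rewiring: an observed statistic is compared against its distribution over randomly rewired graphs with the same degree sequence. The sketch below (a toy graph and an invented 'gene module', not a method prescribed by the article) assesses whether a node set is more densely connected than chance would predict:

```python
import networkx as nx

# Sketch: empirical null model via degree-preserving edge rewiring.

g = nx.barbell_graph(5, 0)        # two 5-cliques joined by a single edge
gene_set = [0, 1, 2, 3, 4]        # one clique: a clearly cohesive 'module'

def internal_edges(graph, nodes):
    """Number of edges with both endpoints inside the node set."""
    return graph.subgraph(nodes).number_of_edges()

observed = internal_edges(g, gene_set)

# Null distribution: rewire copies of the graph while preserving degrees.
null = []
for i in range(200):
    shuffled = g.copy()
    nx.double_edge_swap(shuffled, nswap=50, max_tries=5000, seed=i)
    null.append(internal_edges(shuffled, gene_set))

# Empirical P-value: how often rewiring matches the observed density.
p_value = sum(n >= observed for n in null) / len(null)
print(f"observed internal edges = {observed}, empirical P = {p_value:.3f}")
```

Because rewiring preserves every node's degree, a small empirical P-value indicates the module's cohesion is not merely an artifact of its members being high-degree (well-studied) nodes, the study bias discussed earlier.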