
Modeling in systems biology: Causal understanding before prediction?

Abstract

Babur et al. (2021) developed the CausalPath tool to infer causal signaling interactions from high-throughput proteomics data, which may foster mechanistic understanding of large-scale biological datasets.


Year: 2021 PMID: 34179849 PMCID: PMC8212131 DOI: 10.1016/j.patter.2021.100280

Source DB: PubMed Journal: Patterns (N Y) ISSN: 2666-3899

Main text

Recent advances in high-throughput technologies allow the acquisition of large-scale biological datasets of different modalities, such as transcriptomics, (phospho)proteomics, or metabolomics (collectively, “omics” data), even at the level of single cells. While these datasets promise unique opportunities to understand the molecular mechanisms behind biological phenotypes in health and disease, their correct interpretation is complicated by several factors. First, standard analysis methods in most cases return only lengthy lists of differentially expressed or phenotype-correlated genes or proteins, which hampers efforts to gain mechanistic insight into the observed phenotype. Second, the high dimensionality of experimental data (e.g., ∼20,000 features in the case of transcriptomics) makes it difficult to distinguish simple correlations from causal associations; for understanding and therapeutic intervention, the latter are essential. The main aim of systems biology modeling and analysis techniques is to overcome these limitations.

Generally, these approaches can be classified as knowledge-driven or data-driven (Figure 1). Knowledge-driven methods in most cases use extensive, curated lists of gene sets from connected biological processes or pathways and apply statistical methods (with or without explicit pathway information) to find overrepresentation or enrichment of these gene sets in biological datasets. These methods tend to give more biological insight than simple lists of differentially expressed genes and are thus more appropriate for hypothesis generation. However, in most cases the gene sets used are too general to identify real causal information from data.

Data-driven methodologies, including machine learning models, focus instead on predictive performance. The predictive performance of systems biology models is important for several reasons. First, predictive models are useful in fields of biology ranging from drug discovery to patient stratification. Second, one can argue that if a biological phenotype is predictable from omics data, the prediction model must have identified the underlying biological mechanisms. However, this latter claim is unfortunately overrated: machine learning models can learn technical biases and confounding factors of the analyzed datasets, which boost prediction performance but hamper biological understanding and generalization. In addition, several of the best-performing machine learning models are “black-box” models, meaning that it is difficult to derive the exact prediction mechanisms from them, which also prevents biological interpretation.
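The overrepresentation analysis mentioned above for knowledge-driven methods can be illustrated with a minimal sketch. It assumes a one-sided hypergeometric test, which is one common choice and not necessarily the statistic used by any specific tool discussed here. The gene names, gene sets, and function names are hypothetical and serve only as a toy example.

```python
from math import comb


def enrichment_p_value(universe_size, set_size, n_hits, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    X is the number of gene-set members found among n_hits genes drawn
    without replacement from a universe of universe_size genes.
    """
    total = comb(universe_size, n_hits)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, n_hits - k)
        for k in range(overlap, min(set_size, n_hits) + 1)
    )
    return tail / total


def overrepresentation(universe, gene_sets, hits):
    """Return {gene_set_name: p_value} for each gene set."""
    universe = set(universe)
    hits = set(hits) & universe
    results = {}
    for name, members in gene_sets.items():
        members = set(members) & universe
        overlap = len(members & hits)
        results[name] = enrichment_p_value(
            len(universe), len(members), len(hits), overlap
        )
    return results


# Hypothetical toy data: 20 measured genes, two gene sets, 5 "hit" genes.
universe = [f"g{i}" for i in range(20)]
gene_sets = {
    "toy_pathway": ["g0", "g1", "g2", "g3", "g4"],
    "toy_control": ["g15", "g16", "g17", "g18", "g19"],
}
hits = ["g0", "g1", "g2", "g3", "g10"]

for name, p in overrepresentation(universe, gene_sets, hits).items():
    print(f"{name}: p = {p:.4f}")
```

In this toy case four of the five hits fall in "toy_pathway", giving a small p value (about 0.0049), whereas "toy_control" shares no genes with the hits and gets p = 1. The result says that a gene set is overrepresented, but nothing about which gene acts on which, and this is the limitation the text points out.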

Figure 1

Schematic representation of systems biology modeling directions

Knowledge-driven methods (top) use literature-curated gene sets of functionally related genes to perform overrepresentation/enrichment analysis. The enriched gene sets can help to interpret associations with different biological mechanisms; however, causal interactions are hard to identify. Data-driven methods (bottom) use statistical/machine-learning methods to predict biological phenotypes. While these methods reach good predictive performance, their generalization and their ability to provide mechanistic insight are limited in several cases. Causal reasoning methods (middle) use prior-knowledge network information together with data to identify contextualized causal signaling networks. The identified causal interactions can be used for hypothesis generation; however, future benchmarking of these methods is needed. Figure was created with BioRender.com.

Recently, several new methods were developed to bridge the gap between knowledge- and data-driven methodologies [3, 4, 5]. These “causal reasoning tools” connect prior-knowledge networks (like signaling pathways or gene regulatory networks) with genome-scale gene expression or proteomics measurements and use statistical tools to identify contextualized, sample-specific signaling network alterations and thus causal effects explaining the observed data. In different benchmarks, these methods have been shown to estimate pathway activity changes better than classical knowledge-driven methodologies.

Babur et al. (2021) added a new, interesting methodology to this latter toolset. CausalPath uses relationships between kinases/phosphatases and their substrates, and between transcription factors and their regulated genes, from the Pathway Commons database to create graphical patterns. These graphical patterns are causal associations like “KinaseA is active when phosphorylated on site P1. Active KinaseA phosphorylates ProteinB on site P2.” Such patterns are matched with measurements like “KinaseA is phosphorylated on site P1, and ProteinB is phosphorylated on site P2”, leading to causal conjectures like “KinaseA phosphorylates ProteinB in the given dataset”, which identify the potential causal route of signaling. CausalPath also tests the statistical significance of the derived results using a data label permutation-based approach. In their paper, the authors test their methodology on different cancer-related datasets and successfully identify mechanisms of action of different ligands and drugs from proteomics data.

The results of Babur et al. (2021) also highlight the importance of using the correct type of prior knowledge with the corresponding omics modality. When they used gene regulatory networks with proteomics data, the inferred causal networks were not statistically significant, while using the same prior-knowledge network with gene-expression data resulted in significant causal associations. These results also highlight a general problem of systems biology modeling: because transcriptomics datasets are more abundant (compared to phosphoproteomics, for example), gene-expression data are more frequently used in modeling studies. However, the prior-knowledge networks used are in most cases defined on the level of protein activities (pathways). As the association between gene expression and protein abundance/activity can be modest, using gene-expression data with pathway networks can lead to incorrect interpretation of the results. These considerations, together with the results of Babur et al. (2021), suggest that it is crucial to use matching prior-knowledge networks and data, like gene regulatory networks with transcriptomics and signaling networks with (phospho)proteomics. Correct integration of different types of prior-knowledge networks and data types also promises to identify causal associations in multi-omics datasets.

While the most important aspect of causal reasoning tools is currently biological hypothesis generation, assessing the predictive performance of causal reasoning tools is also crucial for benchmarking the different methods and selecting the best-performing ones. In their paper, Babur et al. (2021) compared their method to several existing ones, which is a good first step in this direction. However, as more and more related tools are developed, it is crucial to perform unbiased, independent benchmarking. A bottleneck for this benchmarking is the availability of high-quality data in which causal associations are already known. For this purpose, perturbation data (where the general cause of changes is given by the perturbation used, e.g., a drug or genetic manipulation) look most suitable, but off-target effects of perturbations (drugs, small interfering RNA [siRNA]) can complicate method evaluation. Nevertheless, large-scale benchmarking projects, like the Dialogue on Reverse Engineering Assessment and Methods (DREAM) Challenges, can foster the development and assessment of causal reasoning systems biology tools in the future.
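The pattern matching and label-permutation test described above for CausalPath can be illustrated with a simplified sketch. This is not the CausalPath implementation: the relation format, the change threshold, the scoring by number of supported relations, and the exhaustive enumeration of label assignments (used here instead of random permutations because the toy dataset is tiny) are all assumptions made for illustration. The feature names and values are hypothetical.

```python
from itertools import combinations
from statistics import mean


def mean_change(values, treated_idx):
    """Mean of the treated samples minus mean of the remaining samples."""
    treated = [v for i, v in enumerate(values) if i in treated_idx]
    control = [v for i, v in enumerate(values) if i not in treated_idx]
    return mean(treated) - mean(control)


def sign(x):
    return (x > 0) - (x < 0)


def count_supported(relations, data, treated_idx, min_change):
    """Count prior-knowledge relations whose expected effect matches the data.

    Each relation is (source_feature, target_feature, effect), with effect
    +1 (e.g., phosphorylates) or -1 (e.g., dephosphorylates).
    """
    supported = 0
    for source, target, effect in relations:
        d_src = mean_change(data[source], treated_idx)
        d_tgt = mean_change(data[target], treated_idx)
        if abs(d_src) < min_change or abs(d_tgt) < min_change:
            continue  # one of the two features did not change enough
        if sign(d_src) * effect == sign(d_tgt):
            supported += 1
    return supported


def label_permutation_p(relations, data, treated_idx, min_change):
    """Fraction of label assignments scoring at least as high as observed."""
    n_samples = len(next(iter(data.values())))
    observed = count_supported(relations, data, set(treated_idx), min_change)
    labelings = [set(c) for c in combinations(range(n_samples), len(treated_idx))]
    as_extreme = sum(
        count_supported(relations, data, lab, min_change) >= observed
        for lab in labelings
    )
    return observed, as_extreme / len(labelings)


# Hypothetical toy data: samples 0-2 are controls, samples 3-5 are treated.
data = {
    "KinaseA_P1": [1.0, 1.1, 0.9, 2.0, 2.1, 1.9],
    "ProteinB_P2": [0.5, 0.6, 0.4, 1.5, 1.6, 1.4],
    "ProteinC_P3": [1.0, 1.2, 0.8, 1.1, 0.9, 1.0],
}
relations = [
    ("KinaseA_P1", "ProteinB_P2", +1),
    ("KinaseA_P1", "ProteinC_P3", +1),
]

observed, p_value = label_permutation_p(relations, data, [3, 4, 5], min_change=0.5)
print(observed, p_value)
```

Only the KinaseA to ProteinB relation is supported by the toy data, because ProteinC does not change between the groups. With three samples per group there are only 20 possible label assignments, so the smallest reachable p value is coarse; real datasets with more samples would use many random permutations instead.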

References

1. Why do pathway methods work better than they should?

Authors: Bence Szalai; Julio Saez-Rodriguez
Journal: FEBS Lett Date: 2020-12-14 Impact factor: 4.124

2. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.

Authors: Aravind Subramanian; Pablo Tamayo; Vamsi K Mootha; Sayan Mukherjee; Benjamin L Ebert; Michael A Gillette; Amanda Paulovich; Scott L Pomeroy; Todd R Golub; Eric S Lander; Jill P Mesirov
Journal: Proc Natl Acad Sci U S A Date: 2005-09-30 Impact factor: 11.205

3. Discovering causal pathways linking genomic events to transcriptional states using Tied Diffusion Through Interacting Events (TieDIE).

Authors: Evan O Paull; Daniel E Carlin; Mario Niepel; Peter K Sorger; David Haussler; Joshua M Stuart
Journal: Bioinformatics Date: 2013-08-27 Impact factor: 6.937

4. The Library of Integrated Network-Based Cellular Signatures NIH Program: System-Level Cataloging of Human Cells Response to Perturbations.

Authors: Alexandra B Keenan; Sherry L Jenkins; Kathleen M Jagodnik; Simon Koplev; Edward He; Denis Torre; Zichen Wang; Anders B Dohlman; Moshe C Silverstein; Alexander Lachmann; Maxim V Kuleshov; Avi Ma'ayan; Vasileios Stathias; Raymond Terryn; Daniel Cooper; Michele Forlin; Amar Koleti; Dusica Vidovic; Caty Chung; Stephan C Schürer; Jouzas Vasiliauskas; Marcin Pilarczyk; Behrouz Shamsaei; Mehdi Fazel; Yan Ren; Wen Niu; Nicholas A Clark; Shana White; Naim Mahi; Lixia Zhang; Michal Kouril; John F Reichard; Siva Sivaganesan; Mario Medvedovic; Jaroslaw Meller; Rick J Koch; Marc R Birtwistle; Ravi Iyengar; Eric A Sobie; Evren U Azeloglu; Julia Kaye; Jeannette Osterloh; Kelly Haston; Jaslin Kalra; Steve Finkbiener; Jonathan Li; Pamela Milani; Miriam Adam; Renan Escalante-Chong; Karen Sachs; Alex Lenail; Divya Ramamoorthy; Ernest Fraenkel; Gavin Daigle; Uzma Hussain; Alyssa Coye; Jeffrey Rothstein; Dhruv Sareen; Loren Ornelas; Maria Banuelos; Berhan Mandefro; Ritchie Ho; Clive N Svendsen; Ryan G Lim; Jennifer Stocksdale; Malcolm S Casale; Terri G Thompson; Jie Wu; Leslie M Thompson; Victoria Dardov; Vidya Venkatraman; Andrea Matlock; Jennifer E Van Eyk; Jacob D Jaffe; Malvina Papanastasiou; Aravind Subramanian; Todd R Golub; Sean D Erickson; Mohammad Fallahi-Sichani; Marc Hafner; Nathanael S Gray; Jia-Ren Lin; Caitlin E Mills; Jeremy L Muhlich; Mario Niepel; Caroline E Shamu; Elizabeth H Williams; David Wrobel; Peter K Sorger; Laura M Heiser; Joe W Gray; James E Korkola; Gordon B Mills; Mark LaBarge; Heidi S Feiler; Mark A Dane; Elmar Bucher; Michel Nederlof; Damir Sudar; Sean Gross; David F Kilburn; Rebecca Smith; Kaylyn Devlin; Ron Margolis; Leslie Derr; Albert Lee; Ajay Pillai
Journal: Cell Syst Date: 2017-11-29 Impact factor: 10.304

5. CausalR: extracting mechanistic sense from genome scale data.

Authors: Glyn Bradley; Steven J Barrett
Journal: Bioinformatics Date: 2017-11-15 Impact factor: 6.937

6. From expression footprints to causal pathways: contextualizing large signaling networks with CARNIVAL.

Authors: Anika Liu; Panuwat Trairatphisan; Enio Gjerga; Athanasios Didangelos; Jonathan Barratt; Julio Saez-Rodriguez
Journal: NPJ Syst Biol Appl Date: 2019-11-11

7. Systematic auditing is essential to debiasing machine learning in biology.

Authors: Fatma-Elzahraa Eid; Haitham A Elmarakeby; Yujia Alina Chan; Nadine Fornelos; Mahmoud ElHefnawi; Eliezer M Van Allen; Lenwood S Heath; Kasper Lage
Journal: Commun Biol Date: 2021-02-10

8. Causal integration of multi-omics data with prior knowledge to generate mechanistic hypotheses.

Authors: Aurelien Dugourd; Christoph Kuppe; Marco Sciacovelli; Enio Gjerga; Attila Gabor; Kristina B Emdal; Vitor Vieira; Dorte B Bekker-Jensen; Jennifer Kranz; Eric M J Bindels; Ana S H Costa; Abel Sousa; Pedro Beltrao; Miguel Rocha; Jesper V Olsen; Christian Frezza; Rafael Kramann; Julio Saez-Rodriguez
Journal: Mol Syst Biol Date: 2021-01 Impact factor: 11.429
