Nasim Vahabi, George Michailidis.
Abstract
With advances in Omics technologies and the dissemination of large-scale datasets, such as those from The Cancer Genome Atlas, the Alzheimer's Disease Neuroimaging Initiative, and Genotype-Tissue Expression, it is becoming increasingly possible to study complex biological processes and disease mechanisms more holistically. However, to obtain a comprehensive view of these complex systems, it is crucial to integrate data across Omics modalities and to leverage external knowledge available in biological databases. This review provides an overview of multi-Omics data integration methods based on different statistical approaches, focusing on unsupervised learning tasks, including disease onset prediction, biomarker discovery, disease subtyping, module discovery, and network/pathway analysis. We also briefly review feature selection methods, multi-Omics data sets, and resources/tools that constitute critical components for carrying out the integration.
Keywords: clustering method; data-ensemble; model-ensemble; multi-omics; network analysis; sequential analysis; unsupervised integration
Year: 2022 PMID: 35391796 PMCID: PMC8981526 DOI: 10.3389/fgene.2022.854752
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
FIGURE 1. Different layers of multi-Omics data (genome, transcriptome, proteome, metabolome), the interactions between them (black dashed arrows), the types of Omics features in each layer, and the approaches used to analyze Omics data in each layer (SNP: Single nucleotide polymorphism, SNV: Single nucleotide variation, CNV: Copy number variation, CNA: Copy number alteration, CGIs: CpG islands, Indels: Insertions and deletions, GWAS: Genome-wide association study, MWAS: Methylation-wide association study, RNA: Ribonucleic acid, mRNA: Messenger RNA, rRNA: Ribosomal RNA, tRNA: Transfer RNA, tmRNA: Transfer-messenger RNA, miRNA: Micro RNA, lncRNA: Long non-coding RNA, snRNA: Small nuclear RNA, siRNA: Small interfering RNA, GSE: Gene-set enrichment).
FIGURE 2. Unsupervised multi-Omics data integration pipeline (input data, integration methods, and output). Data-ensemble methods concatenate the multi-Omics data from different molecular layers into a single matrix that serves as the input data. Model-ensemble methods analyze each Omics data type independently and then ensemble/fuse the results to construct an integrative analysis. A “module” is a combination of different Omics markers with similar functions or associations regarding the underlying outcome. A “class” is a group of Omics markers that have the same effect on the outcome. A “sub-sample” is a group of biological samples (e.g., human, animal, plant) with the same behavior regarding the underlying outcome. The notation in the figure denotes the features and the outcome variable, together with the sample size and the number of Omics features in each Omics type.
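The distinction between data-ensemble and model-ensemble integration described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not any of the reviewed methods: the two toy blocks (`expression`, `methylation`), the helper names, and the use of a sign split on the first principal component as a stand-in clustering step are all hypothetical choices made for the example.

```python
import numpy as np

def zscore(block):
    """Standardize each feature (column) to mean 0 and unit variance."""
    return (block - block.mean(axis=0)) / block.std(axis=0)

def two_way_split(matrix):
    """Stand-in clustering step (an assumption of this sketch): split the
    samples (rows) into two groups by the sign of the first principal component."""
    centered = matrix - matrix.mean(axis=0)
    u, _, _ = np.linalg.svd(centered, full_matrices=False)
    return (u[:, 0] > 0).astype(int)

def data_ensemble(blocks):
    """Data-ensemble: concatenate all Omics blocks column-wise into a single
    matrix, then cluster that matrix once."""
    stacked = np.hstack([zscore(b) for b in blocks])
    return two_way_split(stacked)

def model_ensemble(blocks):
    """Model-ensemble: cluster each Omics block on its own, then fuse the
    per-block results through a co-association matrix."""
    per_block = [two_way_split(zscore(b)) for b in blocks]
    # co_assoc[i, j] = fraction of blocks placing samples i and j in the same cluster
    co_assoc = np.mean(
        [(lab[:, None] == lab[None, :]).astype(float) for lab in per_block],
        axis=0,
    )
    return two_way_split(co_assoc)

def agreement(a, b):
    """Fraction of samples labelled consistently, allowing swapped cluster ids."""
    match = np.mean(a == b)
    return max(match, 1.0 - match)

# Hypothetical toy data: the same 40 samples measured in two Omics layers
rng = np.random.default_rng(0)
truth = np.repeat([0, 1], 20)  # two simulated sub-samples of 20 samples each
expression = rng.normal(size=(40, 30)) + 2.0 * truth[:, None]
methylation = rng.normal(size=(40, 50)) - 2.0 * truth[:, None]
blocks = [expression, methylation]

labels_date = data_ensemble(blocks)   # one model on the concatenated matrix
labels_mode = model_ensemble(blocks)  # one model per block, results fused
print(agreement(labels_date, truth), agreement(labels_mode, truth))
```

In this sketch the data-ensemble path fits a single model to the concatenated matrix, whereas the model-ensemble path never mixes raw features across layers and only combines cluster assignments. Real methods replace the sign split with a proper clustering or factorization model.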
High-level: Unsupervised multi-Omics data integration methods.
| Category | Approach | Key methods |
|---|---|---|
| Regression/Association-based Integration Methods (refer to the corresponding low-level table) | Sequential Analysis | CNAMet (2011), MEMo (2012), iPAC (2013) |
| | CCA- and CIA-based Methods | Sparse MCCA (2009), BCCA (2013), MCIA (2014), sMCIA (2020) |
| | Factor Analysis-based Methods | Joint Bayesian Factor (2014), MOFA (2018), BayRel (2020) |
| Clustering-based Integration Methods (refer to the corresponding low-level table) | Kernel-based Clustering Methods | L-MKKM (2014), SNF (2014), rMKL-LPP (2015), WSNF (2016), mixKernel (2018), DSSF (2018), ANF (2018), NEMO (2019), ab-SNF (2019), MvNE (2020), INF (2020), SmSPK (2020), PAMOGK (2020) |
| | Matrix Factorization-based Clustering Methods | iCluster (2009), jNMF (2012), iClusterPlus (2013), FA (2013), moCluster (2016), JIVE (2016), iNMF (2016), PFA (2017), IS-Kmeans (2017), MOGSA (2019), SCFA (2020) |
| | Bayesian Clustering Methods | TMD (2010), PARADIGM (2010), PSDF (2011), MDI (2012), BCC (2013), LRAcluster (2015) |
| | Multivariate and Other Clustering Methods | COCA (2014), iPF (2015), Clusternomics (2017), PINS (2017), iDRW (2018), PINSPlus (2019), Subtype-GAN (2021) |
| Network-based Integration Methods (refer to the corresponding low-level table) | Matrix Factorization-based Networks | CMF (2008), NBS (2013), DFMF (2014), FUSENET (2015), Medusa (2016), MAE (2019), DisoFun (2020), IMCDriver (2021), RAIMC (2021) |
| | Bayesian Networks | PARADIGM (2010), CONEXIC (2010) |
| | Network Propagation-based Networks | GeneticInterPred (2010), RWRM (2012), TieDIE (2013), SNF (2014), HotNet2 (2015), NetICS (2018), RWR-M (2019), RWR-MH (2019), MSNE (2020), RWRF (2021) |
| | Correlation-based and Other Networks | WGCNA (2008), GGM (2011), GEM (2013), DBN (2015), Lemon-Tree (2015), TransNet (2018) |
Low-level: Regression/Association-based unsupervised integration methods.
| Approach | Method | Macro category* | Author | Objective | Omics data** | Software*** |
|---|---|---|---|---|---|---|
| Sequential Analysis | CNAMet | MS-SA | | Biomarker-prediction | CNV, DM, GE | |
| | MEMo (Mutual Exclusivity Modules) | MS-SA | | Module-discovery | CNA, GE | JAVA code |
| | iPAC (in-trans Process Associated and cis-Correlated) | MS-SA | | Biomarker-prediction | CNV, GE | |
| CCA & CIA | Sparse MCCA (Sparse Multiple Canonical Correlation Analysis) | DatE | | Disease insight, Hotspot-detection | GE, CNV | |
| | BCCA (Bayesian Canonical Correlation Analysis) | DatE | | Disease insight | Any Omics | |
| | MCIA (Multiple Co-Inertia Analysis) | DatE | | Disease-subtyping, Biomarker-prediction | GE, PE | |
| | sMCIA (sparse Multiple Co-Inertia Analysis) | DatE | | Biomarker-prediction | Any Omics | |
| Factor Analysis | Joint Bayesian Factor | DatE | | Biomarker-prediction | CNV, DM, GE | Matlab code |
| | MOFA (Multi-Omics Factor Analysis) | DatE | | Biomarker-prediction | Any Omics | |
| | BayRel (Bayesian Relational learning) | DatE | | Biomarker-prediction | Any Omics | TensorFlow |
*Macro categories include (A) Multi-step and Sequential Analysis (MS-SA), (B) Data-ensemble (DatE), (C) Model-ensemble (ModE). ** CNV: copy number variation, DM: DNA methylation, GE: gene expression, PE: Protein expression. ***R packages, unless otherwise stated.
Low-level: Network-based unsupervised integration methods.
| Approach | Model | Macro category* | Author | Omics data** | Objective | Software*** |
|---|---|---|---|---|---|---|
| Matrix Factorization-based (MF-based) Networks | CMF/CMF-W (Collective Matrix Factorization) | ModE | | Any Omics | Outcome/Interaction-prediction | Python code |
| | NBS (Network-Based Stratification) | ModE | | MiE, CNV, DM, GE, PE | Patient-subtyping | pyNBS Python code |
| | DFMF (Data Fusion by Matrix Factorization) | ModE | | GE, GO-terms, MeSH-descriptor | Gene function-prediction | |
| | FUSENET | ModE | | GE, Mutation | Disease-insight (Gene-Disease association-prediction) | Python code |
| | Medusa | ModE | | Any Omics | Module-discovery, Gene-Disease association-prediction | Python code |
| | MAE (Multi-view factorization AutoEncoder) | ModE | | MiE, DM, GE, PE, PPIs | Disease-prediction | PyTorch code |
| | DisoFun (Differentiate isoform Functions with collaborative matrix factorization) | ModE | | GE, IE | Disease-function Prediction | MATLAB code |
| | IMCDriver | DatE | | GE, Mutation, PPIs | Gene-discovery | Python code |
| | RAIMC (RBP-AS Target Prediction Based on Inductive Matrix Completion) | ModE | | AS, RBPs | Protein-prediction | MATLAB code |
| Bayesian Networks | PARADIGM (PAthway Recognition Algorithm using Data Integration on Genomic Models) | ModE | | CNV, GE, PE | Disease-subtyping, Disease-insight | |
| | CONEXIC | ModE | | GE, CNV | Gene-discovery | - |
| Network Propagation-based Networks (Random walk- and Network Fusion-based Methods) | GeneticInterPred | ModE | | GE, PE | Interaction-prediction | - |
| | RWRM (Random Walk with Restart on Multigraphs) | ModE | | GE, PPIs | Gene-prioritizing | - |
| | TieDIE (Tied Diffusion through Interacting Events) | ModE | | GE, TF, PPIs | Module/sub-network detection | Python code |
| | SNF (Similarity Network Fusion) | ModE | | MiE, DM, GE | Patient-subtyping | |
| | HotNet2 | ModE | | SNV, CNA, GE, PPIs | Sub-network detection | HotNet software |
| | NetICS | ModE | | MiE, CNV, GE | Biomarker-prediction | Matlab code |
| | RWR-M (Random Walk with Restart for Multiplex networks) | ModE | | GE, Co-expression, PPIs | Gene-prediction | R code |
| | RWR-MH (RWR for Multiplex-Heterogeneous networks) | ModE | | GE, Co-expression, PPIs | Gene-prediction | |
| | MSNE (Multiple Similarity Network Embedding) | ModE | | CNV, DM, GE | Disease-subtyping | Python code |
| | RWRF (Random Walk with Restart for multi-dimensional data Fusion) | ModE | | MiE, DM, GE | Disease-subtyping | R code |
| Correlation-based and Other Networks | WGCNA (Weighted Gene Co-expression Network Analysis) | DatE | | GE (from multiple platforms/species) | Gene-prioritizing | |
| | GGM (Gaussian Graphical Model) | ModE | | SNP, GE, Met | Metabolite-pathway reactions | - |
| | GEM (GEnome scale Metabolic models) | ModE | | GE, Met | Metabolite-subnetwork | - |
| | DBN (Deep Belief Network) | ModE | | MiE, DM, GE | Disease-subtyping | Python code |
| | Lemon-Tree | ModE | | CNV, GE | Biomarker-discovery | JAVA command |
| | TransNet (Transkingdom Network) | ModE | | Any Omics | Causal network | TransNetDemo R code |
*Macro categories include (A) Multi-step and Sequential Analysis (MS-SA), (B) Data-ensemble (DatE), (C) Model-ensemble (ModE). ** CNV: copy number variation, CNA: copy number alteration, SNV: single nucleotide variation, DM: DNA methylation, AS: alternative splicing, MiE: Micro RNA expression, GE: gene expression, TF: transcription factor, IE: isoform expression, PE: protein expression, RBPs: RNA-binding proteins, PPIs: protein-protein interactions, Met: Metabolite. ***R packages, unless otherwise stated.
Low-level: Clustering-based unsupervised integration methods.
| Approach | Clustering method | Macro category* | Author | Objective | Omics data** | Software*** |
|---|---|---|---|---|---|---|
| Kernel-based Clustering Methods | L-MKKM (Localized Multiple Kernel K-Means) | ModE | | Sample-subtyping | CNV, DM, GE | Matlab code |
| | SNF (Similarity Network Fusion) | ModE | | Disease-subtyping | Any Omics | |
| | rMKL-LPP (regularized Multiple Kernels Learning with Locality Preserving Projections) | ModE | | Disease-subtyping | DM, MiE, GE | - |
| | WSNF (Weighted SNF) | ModE | | Disease-subtyping | MiE, GE | |
| | mixKernel | ModE | | Sample-subtyping | GE, MiE, DM | |
| | DSSF (Deep Subspace Similarity Fusion) | ModE | | Disease-subtyping | DM, MiE, GE | - |
| | ANF (Affinity Network Fusion) | ModE | | Sample-subtyping | DM, MiE, GE | |
| | NEMO (NEighborhood based Multi-Omics clustering) | ModE | | Disease-subtyping | DM, MiE, GE | |
| | ab-SNF (association-signal-annotation boosted SNF) | ModE | | Sample-subtyping | DM, GE | R code |
| | MvNE (Multiview Neighborhood Embedding) | ModE | | Molecular-classification | DM, MiE, GE | - |
| | INF (Integrative Network Fusion) | DatE/ModE | | Disease-subtyping, Disease-prediction | CNV, MiE, GE, PE | Python/R code |
| | SmSPK (Smoothed Shortest Path graph Kernel) | ModE | | Sample-subtyping | GE, PE, Mutation | Python code |
| | PAMOGK (PAthway-based MultiOmic Graph Kernel clustering) | ModE | | Sample-subtyping | GE, PE, Mutation | Python code |
| (Non-negative) Matrix Factorization-based Clustering Methods | iCluster | ModE | | Disease-subtyping, Biomarker-identification | CNV, GE | |
| | jNMF (Joint Non-negative Matrix Factorization) | ModE | | Disease-insight, Module-discovery | MiE, DM, GE | - |
| | iClusterPlus | ModE | | Disease-subtyping, Biomarker-identification | CNV, DM, GE | |
| | FA (Factor Analysis) | DatE | | Disease-subtyping | MiE, GE, PE | |
| | moCluster | ModE | | Disease-subtyping, Molecular-subtyping | MiE, DM, PE | |
| | JIVE (Joint and Individual Variation Explained) | ModE | | Disease-subtyping | MiE, DM, GE | |
| | iNMF (integrative Non-negative Matrix Factorization) | ModE | | Disease-subtyping | MiE, DM, GE | Python code |
| | PFA (Pattern Fusion Analysis) | ModE | | Disease-subtyping | MiE, DM, GE | - |
| | IS-Kmeans | DatE | | Disease-subtyping | CNV, DM, GE | IS-Kmeans |
| | MOGSA (Multi-Omics Gene-Set Analysis) | DatE | | Disease-insight | GE, CNV, PE | |
| | SCFA (Subtyping via Consensus Factor Analysis) | ModE | | Disease-subtyping | DM, MiE, GE | R code |
| Bayesian Clustering Methods | TMD (Transcriptional Modules Discovery) | ModE | | Disease-subtyping | GE, TF | - |
| | PARADIGM (PAthway Recognition Algorithm using Data Integration on Genomic Models) | ModE | | Disease-subtyping and Disease-insight | CNV, GE, PE | |
| | PSDF (Patient-Specific Data Fusion) | ModE | | Disease-subtyping | CNV, GE | Matlab code |
| | MDI (Multiple Dataset Integration) | ModE | | Disease-subtyping | GE, PE | Matlab code |
| | BCC (Bayesian Consensus Clustering) | ModE | | Disease-subtyping | MiE, DM, GE, PE | |
| | LRAcluster (Low-Rank-Approximation) | ModE | | Disease-subtyping | CNV, DM, GE | |
| Multivariate and Other Clustering Methods | COCA (Cluster-Of-Cluster Assignment) | ModE | | Disease-subtyping | MiE, CNV, DM, GE, PE | |
| | iPF (integrative Phenotyping Framework) | DatE | | Sample-subtyping | MiE, GE | |
| | Clusternomics | ModE | | Disease-subtyping | MiE, DM, GE, PE | |
| | PINS (Perturbation clustering for data INtegration and disease Subtyping) | ModE | | Disease-subtyping | MiE, CNV, DM, GE | - |
| | iDRW (integrative Directed Random Walk) | DatE | | Disease-subtyping, Biomarker-discovery | DM, GE | R code |
| | PINSPlus | ModE | | Disease-subtyping | MiE, CNV, DM, GE | |
| | Subtype-GAN | ModE | | Disease-subtyping | MiE, CNV, DM, GE | R code |
*Macro categories include (A) Multi-step and Sequential Analysis (MS-SA), (B) Data-ensemble (DatE), (C) Model-ensemble (ModE). ** CNV: copy number variation, DM: DNA methylation, MiE: Micro RNA expression, GE: gene expression, TF: transcription factor, PE: protein expression. ***R packages, unless otherwise stated.