Cen Wu, Fei Zhou, Jie Ren, Xiaoxi Li, Yu Jiang, Shuangge Ma.
Abstract
High-throughput technologies have been used to generate a large amount of omics data. In the past, single-level analysis has been extensively conducted, where the omics measurements at different levels, including mRNA, microRNA, CNV and DNA methylation, are analyzed separately. As the molecular complexity of disease etiology exists at all different levels, integrative analysis offers an effective way to borrow strength across multi-level omics data and can be more powerful than single-level analysis. In this article, we focus on reviewing existing multi-omics integration studies, paying special attention to variable selection methods. We first summarize published reviews on integrating multi-level omics data. Next, after a brief overview of variable selection methods, we review existing supervised, semi-supervised and unsupervised integrative analyses within parallel and hierarchical integration studies, respectively. The strengths and limitations of the methods are discussed in detail. No existing integration method can dominate the rest. The computational aspects are also investigated. The review concludes with possible limitations and future directions for multi-level omics data integration.
Keywords: Bayesian variable selection; Penalization; integrative analysis; multi-level omics data; parallel and hierarchical integration
Year: 2019 PMID: 30669303 PMCID: PMC6473252 DOI: 10.3390/ht8010004
Source DB: PubMed Journal: High Throughput ISSN: 2571-5135
Figure 1. The Horizontal and Vertical Integration Schemes.
Reviews on Integrating Multi-level Omics Data (a partial list).
| Reference | Type | Description |
|---|---|---|
| Richardson et al. [ | Comprehensive | Review statistical methods for both vertical integration and horizontal integration. Introduce different types of genomic data (DNA, Epigenetic marks, RNA and protein), genomics data resources and annotation databases. |
| Bersanelli et al. [ | Comprehensive | Review mathematical and methodological aspects of data integration methods, with the following four categories (1) network-free non-Bayesian, (2) network-free Bayesian, (3) network-based non-Bayesian and (4) network-based Bayesian. |
| Hasin et al. [ | Comprehensive | Different from the studies with emphasis on statistical integration methods, this review focuses on biological perspectives, i.e., the genome first approach, the phenotype first approach and the environment first approach. |
| Huang et al. [ | Comprehensive | This review summarizes published integration studies, especially the matrix factorization methods, Bayesian methods, network based methods and multiple kernel learning methods. |
| Li et al. [ | Comprehensive | Review the integration of multi-view biological data from the machine learning perspective. Reviewed methods include Bayesian models and networks, ensemble learning, multi-modal deep learning and multi-modal matrix/tensor factorization. |
| Pucher et al. [ | Comprehensive (with case study) | Review three methods (sCCA, NMF and MALA) and assess their performance on pairwise integration of omics data. Examine the consistency among results identified by different methods. |
| Yu et al. [ | Comprehensive | This study first summarizes data resources (genomics, transcriptome, epigenomics, metagenomics and interactome) and data structure (vector, matrix, tensor and high-order cube). Methods are reviewed mainly following the bottom-up integration and top-down integration. |
| Zeng et al. [ | Comprehensive | The statistical learning methods are overviewed from the following aspects: exploratory analysis, clustering methods, network learning, regression based learning and biological knowledge enrichment learning. |
| Rappoport et al. [ | Clustering (with case study) | Review studies conducting joint clustering of multi-level omics data. Comprehensively assess the performance of nine clustering methods on ten types of cancer from TCGA. |
| Tini et al. [ | Unsupervised integration (with case study) | Evaluation of five unsupervised integration methods on BXD, Platelet, BRCA data sets, as well as simulated data. Investigate the influences of parameter tuning, complexity of integration (noise level) and feature selection on the performance of integrative analysis. |
| Chalise et al. [ | Clustering (with case study) | Investigate the performance of seven clustering methods on single-level data and three clustering methods on multi-level data. |
| Wang et al. [ | Clustering | Discuss the clustering methods in three major groups: direct integrative clustering, clustering of clusters and regulatory integrative clustering. This study is among the first to review integrative clustering with prior biological information such as regulatory structure, pathway and network information. |
| Ickstadt et al. [ | Bayesian | Review integrative Bayesian methods for gene prioritization, subgroup identification via Bayesian clustering analysis, omics feature selection and network learning. |
| Meng et al. [ | Dimension Reduction (with case study) | Review dimension reduction methods for integration and examine visualization and interpretation of simultaneous exploratory analyses of multiple data sets based on dimension reduction. |
| Rendleman et al. [ | Proteogenomics | This study is not another review on the statistical integrative methods. Instead, it discusses integration with an emphasis on the mass spectrometry-based proteomics data. |
| Yan et al. [ | Graph- and kernel-based (with case study) | Graph- and kernel- based integrative methods have been systematically reviewed and compared using GAW 19 data and TCGA Ovarian and Breast cancer data in this study. Kernel-based methods are generally more computationally expensive. They lead to more complicated but better models than those obtained from the graph-based integrative methods. |
| Wu et al. [present review] | Variable Selection based | This review investigates existing multi-omics integration studies from the variable selection point of view. This new perspective sheds fresh light on integrative analysis. |
Figure 2. A taxonomy of variable selection in supervised, unsupervised and semi-supervised analyses.
Published multi-omics integration studies using penalization methods (a partial list).
| Method | Formulation | Data | Package |
|---|---|---|---|
| Sparse CCA [ | PMD + L1 penalty | comparative genomic hybridization (CGH) data | PMA |
| Sparse mCCA [ | CCA criteria + LASSO/fused LASSO | DLBCL copy number variation data | PMA |
| Sparse sCCA [ | Modified CCA criteria + LASSO/fused LASSO | DLBCL data with gene expression and copy number variation data | PMA |
| Sparse PLS [ | Approximate loss (F norm) + LASSO | Liver toxicity data, arabidopsis data, wine yeast data | mixOmics |
| CollRe [ | Multiple least square loss + L1 penalty/ridge/fused LASSO | Neoadjuvant breast cancer data with gene expression and CNV | N/A |
| PCIA [ | Co-inertia-based loss + LASSO/network penalty | NCI-60 cancer cell lines gene expression and protein abundance data | PCIA |
| iCluster [ | Complete data loglikelihood + L1 penalty | Lung cancer gene expression and copy number data | iCluster |
| iCluster [ | Complete data loglikelihood + L1 penalty/fused LASSO/Elastic Net | Breast cancer DNA methylation and gene expression data | iCluster |
| iCluster+ [ | Complete data loglikelihood + L1 penalty | (1) CCLE data with copy number variation, gene expression and mutation | iClusterPlus |
| JIVE [ | Approximation loss + L1 penalty | TCGA GBM data with gene expression and miRNA | r.JIVE |
| LRM [ | Approximation Loss (F norm) + L1 penalty | TCGA | Github * |
| ARMI [ | Multiple LAD loss + L1 penalty | (1) TCGA SKCM gene expression and CNV | Github * |
| remMap [ | Least square loss + L1 penalty + L2 penalty | Breast cancer with RNA transcript level and DNA copy numbers | remMap |
| Robust network [ | Semiparametric LAD loss + MCP + group MCP + network penalty | TCGA cutaneous melanoma gene expression and CNV | Github * |
| GST-iCluster [ | Complete data loglikelihood + L1 penalty + approximated sparse overlapping group LASSO | (1) TCGA breast cancer mRNA, methylation and CNV | GSTiCluster |
| IS K-means [ | BCSS + L1 penalty | (1) TCGA breast cancer mRNA, CNV and methylation | IS-Kmeans |
Note: * The corresponding authors’ Github webpage.
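
Many formulations in the table above share the pattern "loss + L1 penalty". As a rough illustration of that pattern only, the following Python sketch fits a least squares loss with an L1 penalty by coordinate descent on two column-stacked feature blocks. It is not taken from any of the listed packages: the function names `soft_threshold` and `lasso_cd`, the block names `expr` and `cnv`, the simulated data and the tuning value `lam=0.2` are all assumptions made for this example.

```python
import numpy as np


def soft_threshold(z, lam):
    """Soft-thresholding: closed-form minimizer of 0.5*(b - z)^2 + lam*|b|."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)


def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/(2n)) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n      # per-feature scale x_j'x_j / n
    resid = y - X @ beta                   # current residual
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # correlation of feature j with the partial residual
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (beta[j] - new_bj)   # keep residual in sync
            beta[j] = new_bj
    return beta


# Simulated (hypothetical) data: two omics blocks measured on the same samples.
rng = np.random.default_rng(0)
n = 100
expr = rng.standard_normal((n, 20))   # stand-in for a gene expression block
cnv = rng.standard_normal((n, 20))    # stand-in for a copy number block
X = np.hstack([expr, cnv])
X = (X - X.mean(axis=0)) / X.std(axis=0)

true_beta = np.zeros(40)
true_beta[[0, 1, 20]] = [2.0, -1.5, 1.0]   # three truly associated features
y = X @ true_beta + 0.1 * rng.standard_normal(n)
y = y - y.mean()

beta_hat = lasso_cd(X, y, lam=0.2)
selected = np.flatnonzero(beta_hat)   # indices with nonzero coefficients
```

Features whose coefficients are shrunk exactly to zero are dropped, which is how the L1 penalty performs variable selection. The methods in the table differ in the loss they use (for example LAD or a complete data loglikelihood) and in any additional penalties (fused LASSO, MCP, network penalties), none of which this sketch attempts to reproduce.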
Summary of case studies from published reviews (a partial list).
| Reference | Methods Compared | Dataset | Major Conclusion |
|---|---|---|---|
| Rappoport et al. [ | K-means; | TCGA Cancer Data: AML, BIC, COAD, GBM, KIRC, LIHC, LUSC, SKCM, OV and SARC | MCCA has the best prediction performance under prognosis. rMKL-LPP outperforms the other methods in terms of the largest number of significantly enriched clinical labels in clusters. Multi-omics integration is not always superior to single-level analysis. |
| Tini et al. [ | MCCA [ | Murine liver (BXD), Platelet reactivity and breast cancer (BRCA). | When integrating more than two omics data sets, MFA performs best on simulated data. Integrating more omics data introduces noise, and SNF is the most robust method. |
| Pucher et al. [ | sCCA [ | The LUAD, the KIRC and the COAD data sets | For pairwise integration of omics data, sCCA has the best identification performance and is most computationally efficient. The consistency among results identified from different methods is low. |