Zhaoxiang Cai, Rebecca C Poulos, Jia Liu, Qing Zhong.
Abstract
Multi-omics data analysis is an important aspect of cancer molecular biology studies and has led to ground-breaking discoveries. Many efforts have been made to develop machine learning methods that automatically integrate omics data. Here, we review machine learning tools categorized as either general-purpose or task-specific, covering both supervised and unsupervised learning for integrative analysis of multi-omics data. We benchmark the performance of five machine learning approaches using data from the Cancer Cell Line Encyclopedia, reporting accuracy on cancer type classification and mean absolute error on drug response prediction, and evaluating runtime efficiency. This review provides recommendations to researchers regarding suitable machine learning method selection for their specific applications. It should also promote the development of novel machine learning methodologies for data integration, which will be essential for drug discovery, clinical trial design, and personalized treatments.
Keywords: machine learning; omics; systems biology
Year: 2022; PMID: 35169688; PMCID: PMC8829812; DOI: 10.1016/j.isci.2022.103798
Source DB: PubMed; Journal: iScience; ISSN: 2589-0042
Figure 1. Growth of publications in omics
Line charts showing the number of articles published in each year from 1995 to 2020 in PubMed, colored by different omics. The y axis is plotted in log scale. Search terms used are “genomics,” “epigenomics,” “transcriptomics,” “proteomics,” and “multi-omics”.
Figure 2. Illustration of early, middle, and late integration for merging data matrices generated by different omics
In early integration, features from different data matrices are concatenated. Middle integration uses machine learning models to consolidate data without concatenating features or merging results. In late integration, each omics layer is analyzed independently, and results are combined at the end.
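The difference between early and late integration described above can be illustrated with a minimal sketch. The toy matrices, the function names `early_integration` and `late_integration`, and the use of majority voting to combine per-layer results are assumptions made for illustration only; the caption does not specify how late-integration results are combined. Middle integration is not shown because it depends on the specific model used.

```python
from collections import Counter

# Toy data (made up): 3 samples measured in two hypothetical omics layers.
transcriptomics = [[0.2, 1.5], [0.9, 0.1], [0.4, 1.1]]  # samples x 2 features
proteomics = [[3.0], [0.5], [2.2]]                      # samples x 1 feature


def early_integration(*layers):
    """Concatenate each sample's feature vectors across all layers."""
    return [sum((list(row) for row in rows), []) for rows in zip(*layers)]


def late_integration(per_layer_predictions):
    """Combine independent per-layer predictions by majority vote (assumed rule)."""
    combined = []
    for votes in zip(*per_layer_predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Early integration: one row per sample with 2 + 1 = 3 features.
merged = early_integration(transcriptomics, proteomics)

# Late integration: hypothetical labels predicted separately from three layers.
layer_predictions = [
    ["lung", "skin", "lung"],
    ["lung", "skin", "skin"],
    ["lung", "breast", "lung"],
]
consensus = late_integration(layer_predictions)
```

In this sketch, `merged` would be passed to a single model, whereas `consensus` is produced only after one model per layer has already made its prediction.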
Key portals for accessing publicly available multi-omics datasets
| Portal | Omic and other data types | Notes |
|---|---|---|
| TCGA | Genomics, epigenomics, transcriptomics | Tumor data; large coverage of tumors |
| ICGC | Genomics, transcriptomics | Tumor data; powerful online analytics tools |
| CPTAC | Proteomics | Tumor data; the largest proteomic data portal |
| COSMIC Cell Lines | Genomics, epigenomics, transcriptomics, drug response, CRISPR-Cas9 screen | Cancer cell line data; manually curated; large coverage of cell lines |
| DepMap | Genomics, epigenomics, transcriptomics, proteomics, drug response, CRISPR-Cas9 screen | Cancer cell line data; large coverage of omic types; powerful online tools |
| COSMIC | Genomics, epigenomics, transcriptomics | Tumor data; manually curated; focus on genomics; overlap with other portals |
Figure 3. Unique contribution of this review
First, we describe a balance of both biological and technical content covering topics from genomics to proteomics and from machine learning to multi-omics integration tools. Second, we propose a new classification that categorizes the reviewed tools into two categories, namely general-purpose and task-specific, and then review these tools for four types of applications in biomedical sciences. Third, we provide an independent benchmarking analysis to compare integration methods for cancer type classification and drug response prediction.
Machine learning tools for multi-omics data integration
| Name | Model | Programming language | API for custom data | Can handle missing values | Publication year | Citations to date | Source code |
|---|---|---|---|---|---|---|---|
| General-purpose | | | | | | | |
| MOFA2/MOFA | Matrix factorisation | R | Yes | Yes | 2020/2018 | 77/295 | |
| sCCA | CCA | R | No | No | 2020 | 5 | |
| DIABLO | CCA/LDA | R | Yes | Yes | 2019 | 140 | |
| web-rMKL | Multi-kernel | Web interface | Yes | No | 2019 | 1 | |
| iClusterBayes/iClusterPlus/iCluster | Bayesian model | R | Yes | No | 2018/2013/2009 | 76/NA/206 | |
| moCluster | Bayesian model | R | Yes | No | 2016 | 49 | |
| sGCCA | CCA | R | Yes | No | 2014 | 134 | |
| JIVE | Matrix factorisation | R | Yes | No | 2013 | 331 | |
| DeepCCA | Deep learning + CCA | Python | No | No | 2013 | 73 | |
| Task-specific | | | | | | | |
| NEMO | Affinity clustering | R | Yes | No | 2019 | 50 | |
| Similarity Network Fusion (SNF) | Network-based | R/MATLAB | Yes | No | 2014 | 980 | |
| MOLI | Deep learning | Python | No | No | 2019 | 78 | |
| CaDRReS | Recommender system | Python 2 | No | No | 2018 | 49 | |
| HNMDRP | Network-based | R/MATLAB | No | No | 2018 | 47 | |
| DRLP | Network-based | MATLAB | No | No | 2017 | 52 | |
Figure 4. Details of the benchmarking analysis
(A) The process of determining the scope of the benchmarking analysis.
(B) An overview of the steps included in the benchmarking analysis.
Figure 5. Benchmarking of machine learning-based integration tools using the CCLE multi-omics data
(A) Accuracy of each method for cancer type prediction, showing standard errors of the mean derived from 100 runs of five-fold cross-validation, totalling 500 experiments (∗ signifies p value < 0.05 and ∗∗∗ signifies p value < 0.001 by an unpaired two-tailed Student’s t test).
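As a rough illustration of the statistics named in panel (A), the following sketch computes a standard error of the mean and an unpaired Student's t statistic with pooled variance. The accuracy values are made up and are not results from the benchmark, and the helper names `sem` and `student_t` are hypothetical.

```python
import math
import statistics


def sem(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


def student_t(a, b):
    """Unpaired Student's t statistic (pooled variance) and degrees of freedom."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    pooled_var = (
        (n_a - 1) * statistics.variance(a) + (n_b - 1) * statistics.variance(b)
    ) / df
    diff = statistics.mean(a) - statistics.mean(b)
    t = diff / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, df


# Made-up accuracies for two hypothetical methods (not benchmark values).
method_a = [0.80, 0.82, 0.78, 0.81, 0.79]
method_b = [0.70, 0.73, 0.69, 0.72, 0.71]

sem_a = sem(method_a)
t_stat, df = student_t(method_a, method_b)
```

Turning the t statistic into a two-tailed p value requires the cumulative distribution function of the t distribution, which is not in the Python standard library; `scipy.stats.ttest_ind(method_a, method_b)` is one way to obtain both values directly.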
(B) MAE comparison for drug response prediction across 1,448 drugs, with error bars representing standard errors of the mean (∗∗∗ signifies p value < 0.001 and n.s. stands for not significant by an unpaired two-tailed Student’s t test).
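The mean absolute error (MAE) used in panel (B) is the average absolute difference between measured and predicted drug response. A minimal sketch for a single hypothetical drug follows; the response values are made up, and the function name `mean_absolute_error` is chosen for illustration.

```python
def mean_absolute_error(measured, predicted):
    """Average of |measured - predicted| over paired observations."""
    if len(measured) != len(predicted):
        raise ValueError("measured and predicted must have the same length")
    return sum(abs(m - p) for m, p in zip(measured, predicted)) / len(measured)


# Made-up drug response values for four cell lines (not benchmark data).
measured_response = [0.90, 0.60, 0.75, 0.40]
predicted_response = [0.85, 0.70, 0.75, 0.50]

mae = mean_absolute_error(measured_response, predicted_response)
```

Lower values indicate better predictions; a benchmark across many drugs would repeat this calculation per drug and then summarize the resulting values.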
(C) Runtime comparison. PCA is omitted as the runtime was negligible compared with the five multi-omics integration methods.
(D) A summary of the benchmarking study, derived from the results of cancer type prediction, drug response prediction (MAE between the measured AUC and predicted AUC), runtime comparison, and the number of citations since publication. The number of citations for PCA was set to the maximum for better visualization and because of its widespread use. The inverses of the runtime and of the drug response prediction MAE are plotted so that higher values indicate better performance in all dimensions, and all values are plotted in the range of 0 to 1 in the radar plot.
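The transformation described for the radar plot can be sketched as follows. The caption states that runtime and MAE are inverted and that all values lie between 0 and 1, but it does not state how the rescaling is done; dividing by the maximum is an assumption here, and the method names and numbers are made up.

```python
def invert(values):
    """Take 1 / x so that smaller raw values (e.g. runtime) score higher."""
    return {name: 1.0 / v for name, v in values.items()}


def scale_to_unit(values):
    """Divide by the maximum so the best method maps to 1 (assumed scaling)."""
    top = max(values.values())
    return {name: v / top for name, v in values.items()}


# Made-up metrics for three hypothetical methods (not benchmark values).
accuracy = {"method_a": 0.6, "method_b": 0.8, "method_c": 0.4}
runtime_minutes = {"method_a": 2.0, "method_b": 8.0, "method_c": 4.0}

radar = {
    "accuracy": scale_to_unit(accuracy),
    "runtime": scale_to_unit(invert(runtime_minutes)),
}
```

After this step every axis points the same way: the fastest method scores 1 on the runtime axis, just as the most accurate method scores 1 on the accuracy axis.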