Dongjin Leng1, Linyi Zheng2, Yuqi Wen1, Yunhao Zhang2, Lianlian Wu3, Jing Wang4, Meihong Wang2, Zhongnan Zhang5, Song He6, Xiaochen Bo7.
Abstract
BACKGROUND: A fused method using a combination of multi-omics data enables a comprehensive study of complex biological processes and highlights the interrelationship of relevant biomolecules and their functions. Driven by high-throughput sequencing technologies, several promising deep learning methods have been proposed for fusing multi-omics data generated from a large number of samples.
Year: 2022 PMID: 35945544 PMCID: PMC9361561 DOI: 10.1186/s13059-022-02739-2
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 17.906
Fig. 1 Schematic of the benchmarking workflow. a Three kinds of multi-omics datasets were used: simulated, single-cell, and cancer. b 16 DL methods were used to fuse the multi-omics data. c The DL methods were evaluated in various scenarios
Evaluation of the DL-based multi-omics data fusion methods and metrics
| Dataset | Cancer types | Evaluation model | Task type | Evaluation metrics |
|---|---|---|---|---|
| Simulated | | lfNN, efNN, lfCNN, efCNN, moGCN, moGAT | Classification | 1. Accuracy 2. F1 macro 3. F1 weighted |
| Simulated | | lfAE, efAE, lfDAE, efDAE, lfVAE, efVAE, lfSVAE, efSVAE, lfmmdVAE, efmmdVAE | Clustering | 1. Jaccard Index 2. C-index 3. Silhouette score 4. Davies Bouldin score |
| Single-cell | | lfNN, efNN, lfCNN, efCNN, moGCN, moGAT | Classification | 1. Accuracy 2. F1 macro 3. F1 weighted |
| Single-cell | | lfAE, efAE, lfDAE, efDAE, lfVAE, efVAE, lfSVAE, efSVAE, lfmmdVAE, efmmdVAE | Clustering | 1. Jaccard Index 2. C-index 3. Silhouette score 4. Davies Bouldin score |
| Cancer | BRCA, GBM, SARC, LUAD, STAD | lfNN, efNN, lfCNN, efCNN, moGCN, moGAT | Classification | 1. Accuracy 2. F1 macro 3. F1 weighted |
| Cancer | AML, BRCA, COAD, GBM, KIRC, LIHC, LUSC, SKCM, OV, SARC | lfAE, efAE, lfDAE, efDAE, lfVAE, efVAE, lfSVAE, efSVAE, lfmmdVAE, efmmdVAE | Clustering | 1. C-index 2. Silhouette score 3. Davies Bouldin score |
| Cancer | | | Association of embeddings with survival | Cox proportional-hazards regression |
| Cancer | | | Association of embeddings with clinical annotations | Selectivity score |
BRCA breast cancer, GBM glioblastoma, SARC sarcoma, LUAD lung adenocarcinoma, STAD stomach cancer, AML acute myeloid leukemia, COAD colon cancer, KIRC kidney clear cell carcinoma, LIHC liver hepatocellular carcinoma, LUSC lung squamous cell carcinoma, SKCM melanoma, OV ovarian cancer
Fig. 2 Workflow of the evaluation on simulated multi-omics datasets. a The InterSIM CRAN package generated three kinds of omics data that were used as input. b Supervised DL methods were evaluated in the classification tasks. Performance was measured with 4-fold cross-validation and three metrics: accuracy, F1 macro, and F1 weighted score. c Unsupervised DL methods were first applied to fuse the simulated multi-omics data into 5-dimensional, 10-dimensional, and 15-dimensional embeddings. The k-means algorithm was then used to cluster the resulting embeddings. JI, C-index, silhouette score, and Davies Bouldin score were used as the clustering evaluation indexes
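The three classification metrics named in this caption can be illustrated with a minimal standard-library sketch. It is not the authors' evaluation code: the function name `classification_scores` and the toy labels are hypothetical, and classes that occur only in the predictions are ignored here, which may differ from library implementations.

```python
from collections import Counter


def classification_scores(y_true, y_pred):
    """Return (accuracy, macro F1, weighted F1) for two equal-length label lists.

    Assumption: only classes present in y_true are scored.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / n

    support = Counter(y_true)  # number of true samples per class
    f1_per_class = {}
    for label in support:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1_per_class[label] = 2 * tp / denom if denom else 0.0

    # Macro F1: every class counts equally, whatever its size.
    macro_f1 = sum(f1_per_class.values()) / len(f1_per_class)
    # Weighted F1: each class is weighted by its number of true samples.
    weighted_f1 = sum(f1_per_class[c] * support[c] for c in support) / n
    return accuracy, macro_f1, weighted_f1


# Toy example with an imbalanced two-class problem (hypothetical labels).
acc, f1_macro, f1_weighted = classification_scores(
    ["A", "A", "A", "B"], ["A", "A", "B", "B"]
)
```

On imbalanced subtype labels the macro and weighted F1 can diverge noticeably, which is one reason to report both alongside accuracy.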
Performance of six supervised methods in the condition that all clusters have the same size
| Methods | 5 clusters of the same size | 10 clusters of the same size | 15 clusters of the same size | ||||||
|---|---|---|---|---|---|---|---|---|---|
| Accuracy | F1 macro | F1 weighted | Accuracy | F1 macro | F1 weighted | Accuracy | F1 macro | F1 weighted | |
| lfNN | 0.900 | 0.860 | 0.875 | 0.860 | 0.792 | 0.818 | |||
| efNN | |||||||||
| lfCNN | 0.760 | 0.707 | 0.696 | 0.880 | 0.751 | 0.835 | |||
| efCNN | 0.920 | 0.911 | 0.893 | ||||||
| moGCN | |||||||||
| moGAT | |||||||||
Fig. 3 JI, C-index, silhouette score, and Davies Bouldin score of the ten unsupervised methods for clustering on simulated multi-omics datasets. An external index, JI (a), and three internal indices, C-index, silhouette score, and Davies Bouldin score (b, c, d), were calculated from the clustering of the simulated data. The cluster number was set to 5, 10, and 15. SS and RS denote the two conditions: all clusters have the same size (SS), or the clusters have variable random sizes (RS). The k-means clustering was run over 1000 times. The results in a are presented as mean JI values
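As an external index, JI compares a clustering against known labels. The caption does not say which formulation was used; the sketch below assumes the common pair-counting form (pairs grouped together in both partitions, divided by pairs grouped together in at least one). The function name and example labels are hypothetical.

```python
from itertools import combinations


def jaccard_index(true_labels, cluster_labels):
    """Pair-counting Jaccard index between two partitions of the same samples."""
    if len(true_labels) != len(cluster_labels):
        raise ValueError("both labelings must cover the same samples")

    both = only_true = only_cluster = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        if same_true and same_cluster:
            both += 1
        elif same_true:
            only_true += 1
        elif same_cluster:
            only_cluster += 1

    denom = both + only_true + only_cluster
    # Convention assumed here: two all-singleton partitions agree perfectly.
    return both / denom if denom else 1.0


# Cluster identifiers are arbitrary, so a relabeled perfect clustering scores 1.0.
ji = jaccard_index([0, 0, 1, 1], [5, 5, 7, 7])
```

Because the index is built from sample pairs, it does not require matching cluster identifiers to class names, which suits k-means output.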
Fig. 4 Workflow of the evaluation on single-cell multi-omics datasets. a Two kinds of omics data were used as input. b Supervised DL methods were evaluated in the classification tasks. Performance was measured with 4-fold cross-validation and three metrics: accuracy, F1 macro, and F1 weighted score. c Unsupervised DL methods were first applied to fuse the single-cell multi-omics data into two-dimensional embeddings. The k-means algorithm was then used to cluster the resulting embeddings into three categories. JI, C-index, silhouette score, and Davies Bouldin score were used as the clustering evaluation indexes
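The 4-fold cross-validation mentioned in this caption can be sketched as a plain index split. The caption does not say whether the folds were shuffled or stratified by class, so this example assumes neither; `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_samples, k=4):
    """Yield (train_indices, test_indices) for each of k folds.

    Assumption: samples are dealt round-robin into folds, with no shuffling
    and no stratification by class label.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")

    folds = [list(range(start, n_samples, k)) for start in range(k)]
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n_samples) if i not in held_out]
        yield train_idx, test_idx


# Each sample is held out exactly once across the four folds.
splits = list(k_fold_indices(10, k=4))
```

A metric such as accuracy would then be computed on each held-out fold and averaged over the four folds.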
Performance of six supervised methods on single-cell multi-omics datasets
| Accuracy | F1 macro | F1 weighted | |
|---|---|---|---|
| lfNN | |||
| efNN | |||
| lfCNN | 0.962 | 0.952 | 0.962 |
| efCNN | 0.981 | 0.976 | 0.981 |
| moGCN | |||
| moGAT |
Fig. 5 JI, C-index, silhouette score, and Davies Bouldin score of the ten unsupervised methods for clustering on single-cell multi-omics datasets. An external index, JI (a), and three internal indices, C-index, silhouette score, and Davies Bouldin score (b, c, d), were calculated from the clustering of the single-cell data. The cluster number was three. The k-means clustering was run over 1000 times. The results in a are presented as mean JI values
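The silhouette score is an internal index: it needs only the embeddings and the cluster assignment, not the true labels. The following is a small sketch of the standard definition using Euclidean distance on two-dimensional points; the function name and the toy coordinates are hypothetical, and singleton clusters are given a silhouette of 0 by convention.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette over all samples, using Euclidean distance."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # singleton cluster contributes 0
        # a: mean distance to the other members of the sample's own cluster
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Two well-separated pairs of 2-D embeddings (toy values).
embedding = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
score = silhouette_score(embedding, [0, 0, 1, 1])
```

Values close to 1 indicate compact, well-separated clusters, while negative values suggest samples assigned to the wrong cluster.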
Fig. 6 Workflow of the evaluation on cancer multi-omics datasets. a Three kinds of omics data were used as input. b Supervised DL methods were evaluated in the classification tasks. Performance was measured with 4-fold cross-validation and three metrics: accuracy, F1 macro, and F1 weighted score. c Unsupervised DL methods were first applied to fuse the cancer multi-omics data into 10-dimensional embeddings. The k-means algorithm was then used to cluster the resulting embeddings into several categories. C-index, silhouette score, and Davies Bouldin score were used as the clustering evaluation indexes. Furthermore, the associations of the embeddings with survival and clinical annotations were evaluated
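The Davies Bouldin score used here can be sketched from its textbook definition: for each cluster, take the worst ratio of summed within-cluster scatter to between-centroid distance, then average over clusters. This is an illustrative standard-library version with a hypothetical function name and toy points, not the benchmark's code; it assumes no two cluster centroids coincide.

```python
from math import dist


def davies_bouldin_score(points, labels):
    """Davies-Bouldin index with Euclidean distance; lower is better."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the index needs at least two clusters")

    centroids, scatter = {}, {}
    for label, members in clusters.items():
        centroid = tuple(sum(coord) / len(members) for coord in zip(*members))
        centroids[label] = centroid
        # Scatter: mean distance of the cluster's points to its centroid.
        scatter[label] = sum(dist(p, centroid) for p in members) / len(members)

    worst_ratios = []
    for i in clusters:
        # Assumes distinct centroids; identical centroids would divide by zero.
        worst_ratios.append(
            max(
                (scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
                for j in clusters
                if j != i
            )
        )
    return sum(worst_ratios) / len(worst_ratios)


# Two tight clusters ten units apart (toy values).
db = davies_bouldin_score(
    [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)], [0, 0, 1, 1]
)
```

Unlike the silhouette score, a lower Davies Bouldin value indicates better separation, so the two indices should be read in opposite directions.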
The sizes and features of cancer benchmark datasets used in classification task
| Cancers | Categories (cancer subtypes) | # of samples | # of features (Exp, Meth, miRNA) |
|---|---|---|---|
| BRCA | Luminal A: 28, Luminal B: 15, Basal-like: 12, HER2-enriched: 4 | 59 | 6000, 5000, 892 |
| GBM | Proneural: 71, Classical: 70, Mesenchymal: 84, Neural: 47 | 272 | 6000, 5000, 534 |
| SARC | DDLPS: 50, LMS: 80, UPS: 44, MFS: 17, MPNST: 5, SS: 10 | 206 | 6000, 5000, 1046 |
| LUAD | TRU: 51, PI: 52, PP: 41 | 144 | 6000, 5000, 554 |
| STAD | EBV: 20, MSI: 38, GS: 43, CIN: 97 | 198 | 6000, 5000, 519 |
# number, Exp gene expression, Meth DNA methylation, miRNA miRNA expression, DDLPS dedifferentiated liposarcoma, LMS leiomyosarcoma, UPS undifferentiated pleomorphic sarcoma, MFS myxofibrosarcoma, MPNST malignant peripheral nerve sheath tumor, SS synovial sarcoma, TRU formerly bronchioid, PI formerly squamoid, PP formerly magnoid, EBV Epstein–Barr virus, MSI microsatellite instability, GS genomically stable, CIN chromosomal instability
Performance of six supervised methods on cancer benchmark datasets used in classification task
| Methods | BRCA | GBM | SARC | LUAD | STAD | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | F1 macro | F1 weighted | Accuracy | F1 macro | F1 weighted | Accuracy | F1 macro | F1 weighted | Accuracy | F1 macro | F1 weighted | Accuracy | F1 macro | F1 weighted | |
| lfNN | 0.475 | 0.161 | 0.306 | 0.309 | 0.118 | 0.146 | 0.767 | 0.544 | 0.739 | 0.843 | 0.853 | 0.844 | |||
| efNN | 0.660 | 0.544 | 0.617 | 0.588 | 0.512 | 0.554 | 0.772 | 0.668 | 0.771 | 0.861 | 0.848 | 0.855 | |||
| lfCNN | 0.525 | 0.248 | 0.382 | 0.548 | 0.435 | 0.481 | 0.587 | 0.328 | 0.525 | 0.688 | 0.683 | 0.683 | 0.657 | 0.543 | 0.606 |
| efCNN | 0.594 | 0.431 | 0.546 | 0.455 | 0.347 | 0.372 | 0.786 | 0.626 | 0.768 | 0.868 | 0.866 | 0.868 | 0.818 | 0.834 | 0.818 |
| moGCN | 0.600 | 0.601 | 0.647 | 0.633 | 0.585 | 0.861 | 0.860 | 0.861 | 0.740 | 0.733 | 0.716 | ||||
| moGAT | 0.545 | 0.731 | 0.719 | 0.735 | 0.861 | 0.861 | 0.863 | 0.820 | 0.779 | 0.819 | |||||
Fig. 7 C-index, silhouette score, Davies Bouldin score, and the association of the embeddings with survival and clinical annotations for the ten unsupervised methods on cancer multi-omics datasets. C-index (a), silhouette score (b), and Davies Bouldin score (c) were calculated from the clustering of the cancer data. The number of clusters was set from two to six. The k-means clustering was run over 1000 times. d Embeddings with a strong association with survival (Bonferroni-corrected p-value smaller than 0.05). The X-axis represents the number of survival-associated embeddings. The Y-axis represents cancers, and every cancer is assigned a color. e Selectivity score of the ten unsupervised methods for ten different cancer types. The score is displayed if it is higher than the average score (0.49); the higher the selectivity score, the brighter the orange block
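The survival criterion in panel d rests on a Bonferroni correction. A minimal sketch follows, assuming the correction family is the set of embedding dimensions tested for one method on one cancer (the caption does not state this); the p-values are made up for illustration and `count_survival_associated` is a hypothetical name.

```python
def count_survival_associated(p_values, alpha=0.05):
    """Return (count, corrected) after Bonferroni correction.

    Each raw p-value is multiplied by the number of tests and capped at 1.0;
    an embedding counts as survival-associated if the corrected value < alpha.
    """
    n_tests = len(p_values)
    corrected = [min(1.0, p * n_tests) for p in p_values]
    count = sum(p < alpha for p in corrected)
    return count, corrected


# Hypothetical Cox-regression p-values for a 10-dimensional embedding.
raw = [0.001, 0.004, 0.006, 0.2, 0.5, 0.03, 0.9, 0.0499, 0.7, 0.01]
n_significant, adjusted = count_survival_associated(raw)
```

With ten tests the effective per-test threshold is 0.005, so raw p-values that look significant on their own (0.03, for instance) no longer count.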
Fig. 8 Graphic summary of the cancer sub-benchmark. a The details of testing the association of embeddings with survival. b The details of testing the association of embeddings with clinical annotations
Fig. 9 DL-based multi-omics data fusion methods benchmarked by average unified score in this study. a The unified performance of supervised models on the three datasets. b The unified performance of unsupervised models on the three datasets. The highest unified score in each scenario was used as the reference (marked 100%) to calculate the percentages
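The percentage scale described in this caption amounts to dividing each method's unified score by the best score in the same scenario. A short sketch, with hypothetical scores and a hypothetical function name, since the caption does not give the underlying values or how the unified score itself is computed:

```python
def percent_of_best(unified_scores):
    """Express each method's unified score as a percentage of the best one."""
    best = max(unified_scores.values())
    if best <= 0:
        raise ValueError("the reference score must be positive")
    return {name: 100.0 * score / best for name, score in unified_scores.items()}


# Made-up unified scores for one scenario; the top method maps to 100%.
relative = percent_of_best({"moGAT": 0.8, "efNN": 0.6, "lfCNN": 0.4})
```

Because the reference is chosen per scenario, percentages are comparable within a panel but not across scenarios.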