Theodoulos Rodosthenous, Vahid Shahrezaei, Marina Evangelou.
Abstract
MOTIVATION: Recent developments in technology have enabled researchers to collect multiple OMICS datasets for the same individuals. The conventional approach to understanding the relationships between the collected datasets and the complex trait of interest is to analyse each OMICS dataset separately from the rest, or to test for associations between the OMICS datasets. In this work we show that integrating multiple OMICS datasets, instead of analysing them separately, improves our understanding of the relationships between them as well as the predictive accuracy for the tested trait. Several approaches have been proposed for the integration of heterogeneous and high-dimensional (p≫n) data, such as OMICS. Sparse canonical correlation analysis (sCCA) is a promising one: it penalizes the canonical variables to produce sparse latent variables while achieving maximal correlation between the datasets. In recent years, a number of approaches for implementing sCCA have been proposed; they differ in their objective functions, in the iterative algorithms used to obtain the sparse latent variables, and in the assumptions they make about the original datasets.
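To make the idea of penalizing for sparsity while maximizing correlation concrete, the following is a minimal sketch of a rank-one sparse CCA with alternating soft-thresholded updates. It is not the implementation of any of the methods compared in the article. The function names (`soft_threshold`, `sparse_cca_rank1`) and the parameters `frac_u` and `frac_v` are assumptions made for illustration, and the threshold is set as a fraction of the largest entry, which is a simplification of searching for a threshold that meets an explicit L1 bound.

```python
import numpy as np


def soft_threshold(a, lam):
    """Shrink each entry of `a` towards zero by `lam` (the L1 proximal step)."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)


def _sparse_unit_vector(a, frac):
    """Soft-threshold `a` at `frac * max|a|`, then rescale to unit L2 norm."""
    out = soft_threshold(a, frac * np.abs(a).max())
    norm = np.linalg.norm(out)
    return out / norm if norm > 0 else out


def sparse_cca_rank1(X, Y, frac_u=0.5, frac_v=0.5, n_iter=100, tol=1e-8):
    """Illustrative rank-one sparse CCA (hypothetical helper, not the paper's code).

    Assumes X (n x p) and Y (n x q) have no constant columns.
    Returns sparse weight vectors u, v and the correlation of X @ u and Y @ v.
    """
    # Standardize columns so that X.T @ Y / n is a cross-correlation matrix.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    C = X.T @ Y / X.shape[0]

    # Start from the leading right singular vector of the cross-correlation.
    v = np.linalg.svd(C, full_matrices=False)[2][0]
    u = _sparse_unit_vector(C @ v, frac_u)

    # Alternate: update v with u fixed, then u with v fixed, until u stabilizes.
    for _ in range(n_iter):
        u_old = u
        v = _sparse_unit_vector(C.T @ u, frac_v)
        u = _sparse_unit_vector(C @ v, frac_u)
        if np.linalg.norm(u - u_old) < tol:
            break

    rho = np.corrcoef(X @ u, Y @ v)[0, 1]
    return u, v, rho
```

Larger values of `frac_u` and `frac_v` (between 0 and 1) zero out more weights; in practice such tuning parameters would be chosen by cross-validation or a similar procedure.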
Year: 2020 PMID: 32437529 PMCID: PMC7750936 DOI: 10.1093/bioinformatics/btaa530
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Data characteristics and simulation scenarios used to evaluate the three sCCA methods for integrating two or three datasets
| Scenarios | Data characteristics |
|---|---|
| | |
| Null | |
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
| | |
| 1 | |
| 2 | |
| 3 | |
Note: n represents the number of samples, while and p represent the cross-correlated and total number of features, respectively, in dataset i, for i = 1, 2.
Fig. 1. sCCA performance on simulated data for integrating two datasets. (a) ROC curve plots on all five sCCA methods after averaging over all data-generating models and all scenarios. (b) Box-plots of the overall loss of the first canonical vector () averaged over all data-generating models and scenarios, and (c) canonical correlation in the simulation studies for sCCA. (d) ROC curve plots, showing averaged results (over the models) for each scenario on . (Results on can be seen in the Supplementary Material). (e) ROC curve plots, showing averaged (over the scenarios) results for each model on
Null simulation model
| Canonical correlation on Null simulation model | | | | | |
|---|---|---|---|---|---|
| Sample size | | | | | |
| | 0.55 (0.08) | 0.81 (0.05) | 0.80 (0.02) | 0.96 (0.05) | 0.98 (0.02) |
| | 0.22 (0.03) | 0.48 (0.02) | 0.48 (0.04) | 0.51 (0.01) | 0.50 (0.02) |
| | 0.12 (0.03) | 0.26 (0.05) | 0.24 (0.03) | 0.26 (0.04) | 0.26 (0.04) |
Note: Canonical correlations of PMDCCA, ConvCCA and RelPMDCCA averaged across 100 runs on the null scenarios.
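As an illustration of why canonical correlations can be non-zero on null data, the sketch below computes the first canonical correlation of two independent Gaussian datasets at several sample sizes. It uses ordinary (non-sparse) CCA rather than PMDCCA, ConvCCA or RelPMDCCA, and the dimensions `p` and `q` and the function names are assumptions chosen for illustration; the sample sizes follow those named in the Fig. 2 caption. It does not reproduce the values in the table.

```python
import numpy as np


def first_canonical_correlation(X, Y):
    """First canonical correlation of ordinary CCA (assumes n > p, n > q, full rank)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases of the two column spaces; the singular values of
    # Qx.T @ Qy are the canonical correlations.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(min(s[0], 1.0))  # clip tiny numerical overshoot above 1


def null_correlations(sample_sizes=(100, 1000, 10000), p=20, q=20, seed=0):
    """Canonical correlation between independent datasets for each sample size."""
    rng = np.random.default_rng(seed)
    return {
        n: first_canonical_correlation(
            rng.normal(size=(n, p)), rng.normal(size=(n, q))
        )
        for n in sample_sizes
    }


if __name__ == "__main__":
    for n, rho in null_correlations().items():
        print(f"n = {n}: first canonical correlation = {rho:.2f}")
```

Because the two datasets are generated independently, any correlation found is spurious, and it shrinks as the sample size grows relative to the number of features.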
Fig. 2. sCCA performance on the Null scenario. ROC curves of the first canonical vector for all three sCCA methods on the Null scenario with sample sizes n = 100, 1000, 10 000
Orthogonality of sCCA methods
| Orthogonality of all simulations with five canonical variates | | | |
|---|---|---|---|
| Methods | | | |
| | None | Partial | Partial |
| | Full | Partial | Full |
| | Partial | None | None |
| | | | |
| | None | Partial | Full |
| | None | None | None |
| | Full | Full | Partial |
| | | | |
| | Full | Full | Full |
| | None | None | None |
| | Full | Full | Full |
Note: The table shows whether the algorithms succeed in obtaining orthogonal pairs. None refers to not obtaining orthogonality at all; Full refers to obtaining orthogonality between all pairs; Partial for some, but not all. For each scenario, simulations via the simple simulation model, single-latent variable model and covariance-based model are represented by the first, second and third rows, respectively.
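The None / Partial / Full labels defined in the note can be expressed as a simple check on the pairwise inner products of the estimated vectors. The sketch below is an assumption about how such a check might be coded, not the authors' procedure: the function name `orthogonality_status`, the tolerance `tol`, and the use of cosine similarity on the columns of a matrix `W` are all hypothetical choices.

```python
import numpy as np


def orthogonality_status(W, tol=1e-6):
    """Classify the columns of W as 'Full', 'Partial' or 'None' orthogonal.

    W holds one vector per column (all assumed non-zero). A pair counts as
    orthogonal when the absolute cosine of the angle between them is below
    `tol` (a hypothetical tolerance).
    """
    W = np.asarray(W, dtype=float)
    unit = W / np.linalg.norm(W, axis=0)       # scale columns to unit length
    gram = unit.T @ unit                       # pairwise cosines
    upper = np.triu_indices(W.shape[1], k=1)   # each distinct pair once
    is_orthogonal = np.abs(gram[upper]) < tol
    if is_orthogonal.all():
        return "Full"
    if not is_orthogonal.any():
        return "None"
    return "Partial"
```

For example, the first three columns of an identity matrix are classified as "Full", whereas three columns in which only one pair is orthogonal are classified as "Partial".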
Fig. 3. Multiple sCCA performance on simulated data for integrating three datasets. (a) Box-plots showing the canonical correlation across the ConvCCA, RelPMDCCA and PMDCCA methods in a multiple setting. (b) An example of a scatter plot for the first estimated canonical vector. (c) ROC curves on multiple sCCA simulations
Fig. 4. sCCA performance on nutriMouse data. Box-plots presenting the accuracy of sCCA methods, k-NN and SpReg with LASSO and SCAD with the response being (a) diet and (b) genotype. (c) Scatter plots of the canonical vectors from the first canonical variate pair of a random nutriMouse test set, after applying sCCA and multiple sCCA
Fig. 5. sCCA performance on cancerTypes data. (a) Model performance for the prediction of samples’ survival status. The best-performing model overall is shown in bold. (b) Scatter plots of canonical variates in cancerTypes analysis through multiple sCCA
A summary on the performance of sCCA methods based on both the simulation studies conducted and the analysis of real data
| Summary on the performance of sCCA methods | | |
|---|---|---|
| | | Great performance on simulation studies, especially on single-latent model |
| | | Over-fitted cancerTypes data and performed well on nutriMouse |
| | | Low time complexity |
| | | Good performance on simulation studies, especially on simple model |
| | | Over-fitted cancerTypes data and performed well on nutriMouse |
| | | Low time complexity |
| | | Moderate to good performance on simulation studies |
| | | Had the best performance in analysing two real datasets |
| | | High time complexity |
| | | |
| | | Good performance on simulation studies |
| | | Avoided over-fitting and improved performance in both data studies |
| | | Low time complexity |
| | | Very good performance on simulation studies |
| | | Avoided over-fitting and improved performance in both data studies |
| | | Low time complexity |
| | | Moderate to good performance on simulation studies |
| | | Overall obtained the best results in both data studies |
| | | High time complexity |
Note: This is an intuitive evaluation of the methods, split by whether two or multiple datasets are integrated.