
MOSS: multi-omic integration with sparse value decomposition.

Agustin Gonzalez-Reymundez1, Alexander Grueneberg1, Guanqi Lu1, Filipe Couto Alves1, Gonzalo Rincon2, Ana I Vazquez1.   

Abstract

SUMMARY: This article presents multi-omic integration with sparse value decomposition (MOSS), a free and open-source R package for integration and feature selection in multiple large omics datasets. The package is computationally efficient and offers biological insight through capabilities such as cluster analysis and identification of informative omic features.
AVAILABILITY AND IMPLEMENTATION: https://CRAN.R-project.org/package=MOSS.
SUPPLEMENTARY INFORMATION: Supplementary information can be found at https://github.com/agugonrey/GonzalezReymundez2021.
© The Author(s) 2022. Published by Oxford University Press.


Year:  2022        PMID: 35561193      PMCID: PMC9113319          DOI: 10.1093/bioinformatics/btac179

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

Omic data are characterized by many features measured across multiple layers (e.g. genome, transcriptome and proteome). Traditional methods (e.g. ordinary least squares) are therefore insufficient to extract meaningful insight from such multi-layer, high-dimensional data. To integrate multi-omic data effectively, novel methods have been developed (González-Reymúndez; Lock; Rohart; Shen, 2016; Zhang). These methods have contributed profoundly to our understanding of variation in complex traits across diverse levels of regulation (e.g. mutations in coding genes and epigenetic regulation) (Hasin; Ritchie). Thanks to ongoing data collection efforts, omic data keep increasing in both the number of features and the number of available samples. Larger sample sizes provide more opportunity for inference and prediction of characteristics of interest (Müller). However, larger data sizes also make computations progressively lengthier, and in some cases impossible to perform (Mangul). Moreover, they complicate the parallelization of complex algorithms (e.g. convolutional neural networks) (Chiroma). We developed ‘multi-omic integration with sparse value decomposition’ (MOSS) to address these limitations. MOSS is a free and open-source R package that performs data integration and feature selection on large datasets. It combines the flexibility of sparse singular value decomposition (SVD) with parallel and in-disk computations to accommodate data sizes reaching biobank dimensions.

2 Implementation

The package’s primary function is called moss. Omic data are given to moss as a list in which each element corresponds to a different omic (see the help pages for function moss). Each omic enters the function as a numeric array: the rows of each array represent samples (e.g. a subject per row) and the columns represent omic features (e.g. expression of a gene). The rows of the different numeric arrays in the list must be sorted in the same order (i.e. each row belongs to the same sample across omic blocks). Integration of omic blocks occurs by appending them, column-wise, into an extended matrix. Before the extended matrix is formed, blocks are normalized and standardized. If missing values are present, they are imputed by the mean. The effects of potential confounders can be adjusted internally by giving moss a data frame, vector or matrix with covariates. When omic blocks are too big to be handled in memory, File-backed Big Matrix (FBM) objects (Privé) can be passed to moss; for this, the package bigstatsr (Privé) must be installed. If the omic blocks fit in memory but are still too large to process in a reasonable time, moss can turn them into FBM objects internally. MOSS performs a sparse singular value decomposition (sSVD) on the integrated omic blocks to obtain latent dimensions as sparse factors (i.e. with zeroed-out elements), representing variability across subjects and features. Sparsity is imposed via the Elastic Net (EN) (Zou) on the sSVD solutions. MOSS allows automatic tuning of the number of nonzero elements, adapting the procedure in Shen and Huang (2008). The primary output of MOSS is a list with the results of standard (dense) and sparse SVD. However, a flexible set of arguments extends the output to include cluster analysis, non-linear embeddings and accompanying visualizations (Supplementary Information).
Further statistical and algorithmic details, a description of moss’ arguments and examples of usage are provided in the Supplementary Information.
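To illustrate the kind of decomposition described above, the following is a minimal sketch of a rank-1 sparse SVD with soft-thresholding, in the spirit of the Shen and Huang (2008) procedure that MOSS adapts. This is an illustrative Python (NumPy) rendition of the general technique, not MOSS's implementation: MOSS is written in R, uses an elastic-net penalty, extracts multiple latent factors and supports FBM-backed matrices; the function names and the lasso-only penalty here are assumptions for the sketch.

```python
import numpy as np

def soft_threshold(x, lam):
    # Elementwise soft-thresholding: sign(x) * max(|x| - lam, 0)
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def sparse_rank1_svd(X, lam, n_iter=100, tol=1e-8):
    # Rank-1 sparse SVD in the spirit of Shen and Huang (2008):
    # alternate updates of the sample factor u and the feature
    # loadings v, soft-thresholding v so that uninformative
    # features receive exactly zero loadings.
    u0, s, vt = np.linalg.svd(X, full_matrices=False)
    u, v = u0[:, 0], s[0] * vt[0]         # dense warm start
    for _ in range(n_iter):
        v = soft_threshold(X.T @ u, lam)  # sparse feature loadings
        if not np.any(v):
            break                         # penalty zeroed everything
        u_new = X @ v
        u_new /= np.linalg.norm(u_new)
        if np.linalg.norm(u_new - u) < tol:
            u = u_new
            break
        u = u_new
    return u, v  # nonzero entries of v flag the selected features
```

In MOSS itself, the analogous loadings come out of moss's sSVD output, and the degree of sparsity (the number of nonzero elements) is tuned automatically rather than fixed by hand as `lam` is here.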

3 MOSS identifies informative omic features as competently as existing methods

MOSS matches the performance of current analogous methods (Fig. 1A). To illustrate this point, we compared MOSS against existing methods of omic integration and feature selection, in terms of the methods’ ability to detect informative features. The methods included iCluster (Shen), NMF (Gaujoux and Seoighe, 2010), SNFtool (Wang), mixOmics (Rohart) and OmicsPLS (el Bouhaddani). The data consisted of simulations on top of gene and protein expression profiles from breast tumors from The Cancer Genome Atlas (TCGA; Chang) repository (see Supplementary Information), as supplied within mixOmics. In each simulation, omic features were decorrelated by randomly shuffling tumors, one feature at a time. To define informative features in each simulation, a subgroup of randomly chosen features was left intact; these features conserved the naturally occurring correlation present in the data. The two scenarios compared used 10% and 80% of the total features to define the signal, and 1000 random simulations were run per scenario. Figure 1A shows that MOSS ranks among the best-performing methods. When using strict variable selection (EN parameter equal to 1), MOSS’s performance is inversely related to the number of informative features. In scenarios with a larger number of informative features, methods more suitable for dense solutions, such as NMF, are more sensitive. However, MOSS can compensate for the loss in sensitivity by relaxing variable selection in favor of shrinkage (e.g. by setting the EN parameter to values between 0 and 1).
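The simulation scheme above can be sketched in a few lines: each feature is decorrelated by permuting samples independently within its column, while a randomly chosen subset of columns is left intact as the signal. This is a hedged Python (NumPy) illustration of the scheme as described in the text; the function name, argument names and return format are assumptions, not code from the study.

```python
import numpy as np

def simulate_informative_features(X, prop_signal=0.1, rng=None):
    # Decorrelate features by permuting samples independently within
    # each column; a randomly chosen subset of columns is left intact
    # and thus keeps the naturally occurring correlation (the "signal").
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    n_signal = int(round(prop_signal * p))
    is_signal = np.zeros(p, dtype=bool)
    is_signal[rng.choice(p, size=n_signal, replace=False)] = True
    X_sim = X.copy()
    for j in np.where(~is_signal)[0]:
        X_sim[:, j] = rng.permutation(X_sim[:, j])
    return X_sim, is_signal
```

With `prop_signal=0.1` or `prop_signal=0.8` this reproduces the two scenarios compared in Figure 1A; sensitivity and specificity are then scored against the `is_signal` mask.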
Fig. 1.

(A) Performance of MOSS and existing omic integration and feature selection methods. Each panel represents a different proportion of informative features. Each curve represents the average specificity and sensitivity of feature selection across 1000 random simulations for increasing degrees of sparsity (e.g. null-effect features). Confidence bands represent inter-simulation noise. (B) Comparison of computational time between MOSS and other methods. The plot shows the computational time taken by MOSS and five other omic integration methods. Scenarios correspond to different combinations of samples (n) and features (p) in simulated data. Column panels represent the number of samples, and row panels the number of features. Each bar represents a different omic integration method. The y-axis shows the time in hours. The symbols ‘*’ and ‘†’ denote a method running for more than a day or crashing, respectively. MOSS was used with dense matrices (reg. matrices) or file-backed big matrices (FBM). (C) Performance of MOSS on real high-dimensional data. The plot shows the performance of MOSS on simulations using data presented in González-Reymúndez and Vázquez (2020). Different colors represent alternative proportions of features with signal.

4 MOSS requires less computational time than existing methods and scales to datasets reaching biobank sizes

One of MOSS's essential capabilities is the handling of big data. While other tools show similar analytical performance (Fig. 1A), MOSS is specifically designed for big data. Even when regular R matrices are used (i.e. omic data handled in RAM), MOSS runs in a short amount of time compared with other omic integration and feature selection methods (Fig. 1B). For huge datasets (e.g. the scenario n = 1e5 and p = 1e6 in Fig. 1B), tuning the degree of sparsity with MOSS becomes prohibitive; however, dense solutions (i.e. without imposing sparsity) are still possible.
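MOSS's out-of-core mode relies on bigstatsr's file-backed matrices in R. The core idea, keeping the matrix on disk and streaming only one block of columns through RAM at a time, can be sketched in Python with numpy.memmap; the helper below is a hypothetical illustration of that idea, not how bigstatsr or MOSS are implemented. It computes the cross-product X.T @ u, the building block of the power-iteration updates used in SVD, without ever loading the full matrix.

```python
import numpy as np

def crossprod_chunked(path, shape, u, chunk=1024):
    # Compute X.T @ u for a matrix stored on disk, loading only one
    # block of `chunk` columns into RAM at a time. This mirrors the
    # idea behind file-backed matrices (FBM): the full matrix never
    # has to fit in memory.
    X = np.memmap(path, dtype=np.float64, mode="r", shape=shape)
    n, p = shape
    out = np.empty(p)
    for start in range(0, p, chunk):
        stop = min(start + chunk, p)
        out[start:stop] = X[:, start:stop].T @ u
    return out
```

Because each chunk is independent, such block-wise products also parallelize naturally across workers, which is how in-disk decompositions can scale to biobank-sized inputs.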

5 MOSS can be applied to high-dimensional real datasets

In González-Reymúndez and Vázquez (2020), we showed that MOSS could also retrieve biologically meaningful results from real data. Figure 1C shows the results of applying the above simulation scheme to data used in González-Reymúndez and Vázquez (2020), consisting of ∼60 000 features from whole-genome gene expression profiles, DNA methylation and copy numbers across ∼5000 tumors from 33 different cancer types.

6 Conclusions

Omic integration emerged as a group of techniques to analyze multiple omic data layers collectively and retrieve helpful information about shared biological processes (Hasin). However, the computational and statistical tools used to carry out these tasks are constantly challenged by the vast amount of data generated (Conesa and Beck, 2019; Gomez-Cabrero). As a result, omic integration can become a vast and challenging problem, and existing algorithms can become painfully slow or impossible to run. As a feature selection tool, MOSS performs best as the number of signal features decreases (e.g. some signaling pathways affected in cancer, such as the canonical MAPK pathway; Braicu). Lower performance for a larger number of signal features remains an unsolved challenge among omic integration and feature selection methods (Tini). In MOSS, this performance can be improved by relaxing variable selection in favor of shrinkage, by varying the value of the EN parameter. For instance, in González-Reymúndez and Vázquez (2020), an EN parameter value of 0.5 was used to show MOSS's ability to detect clusters of tumors beyond original diagnoses, as well as molecular signatures of potential therapeutic use. Tuning this additional parameter, however, can drastically increase computational time, particularly for large datasets. More sophisticated alternatives might involve the use of different penalties by omic block or set of features, a capability that we are considering for future versions of MOSS. Despite its benefits as a data integration and mining tool, MOSS lacks statistical inference to support feature selection. Future versions of MOSS could address this limitation by adopting fast bootstrap techniques applied to high-dimensional SVD (Fisher). In addition to unsupervised analyses, MOSS can fit supervised analyses via partial least squares, linear discriminant analysis and low-rank regressions.
Nevertheless, these options are currently limited by the lack of cross-validation schemes to evaluate supervised models and assess their performance. In sum, MOSS is a flexible and fast tool to perform data integration. It shares capabilities with popular methods, including estimation of latent data dimensions, feature selection and convenient graphical displays. Unlike these methods, however, MOSS integrates datasets too large to be handled in RAM and requires considerably less time.

1.  Multi-omics integration-a comparison of unsupervised clustering methodologies.

Authors:  Giulia Tini; Luca Marchetti; Corrado Priami; Marie-Pier Scott-Boyer
Journal:  Brief Bioinform       Date:  2019-07-19       Impact factor: 11.622

2.  Biobanks for life sciences and personalized medicine: importance of standardization, biosafety, biosecurity, and data management.

Authors:  Heimo Müller; Georges Dagher; Martina Loibner; Cornelia Stumptner; Penelope Kungl; Kurt Zatloukal
Journal:  Curr Opin Biotechnol       Date:  2019-12-30       Impact factor: 9.740

3.  The Cancer Genome Atlas Pan-Cancer analysis project.

Authors:  John N Weinstein; Eric A Collisson; Gordon B Mills; Kenna R Mills Shaw; Brad A Ozenberger; Kyle Ellrott; Ilya Shmulevich; Chris Sander; Joshua M Stuart
Journal:  Nat Genet       Date:  2013-10       Impact factor: 38.330

4.  Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis.

Authors:  Ronglai Shen; Adam B Olshen; Marc Ladanyi
Journal:  Bioinformatics       Date:  2009-09-16       Impact factor: 6.937

5.  Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr.

Authors:  Florian Privé; Hugues Aschard; Andrey Ziyatdinov; Michael G B Blum
Journal:  Bioinformatics       Date:  2018-08-15       Impact factor: 6.937

6.  Prediction of years of life after diagnosis of breast cancer using omics and omic-by-treatment interactions.

Authors:  Agustín González-Reymúndez; Gustavo de Los Campos; Lucía Gutiérrez; Sophia Y Lunt; Ana I Vazquez
Journal:  Eur J Hum Genet       Date:  2017-03-08       Impact factor: 4.246

7.  Distinct 5-methylcytosine profiles in poly(A) RNA from mouse embryonic stem cells and brain.

Authors:  Thomas Amort; Dietmar Rieder; Alexandra Wille; Daria Khokhlova-Cubberley; Christian Riml; Lukas Trixl; Xi-Yu Jia; Ronald Micura; Alexandra Lusser
Journal:  Genome Biol       Date:  2017-01-05       Impact factor: 13.583

8.  Multi-omic signatures identify pan-cancer classes of tumors beyond tissue of origin.

Authors:  Agustín González-Reymúndez; Ana I Vázquez
Journal:  Sci Rep       Date:  2020-05-20       Impact factor: 4.379

9.  Author Correction: A compendium of geochemical information from the Saanich Inlet water column.

Authors:  Mónica Torres-Beltrán; Alyse K Hawley; David Capelle; Elena Zaikova; David A Walsh; Andreas Mueller; Melanie Scofield; Chris Payne; Larysa Pakhomova; Sam Kheirandish; Jan Finke; Maya Bhatia; Olena Shevchuk; Esther A Gies; Diane Fairley; Céline Michiels; Curtis A Suttle; Frank Whitney; Sean A Crowe; Philippe D Tortell; Steven J Hallam
Journal:  Sci Data       Date:  2019-01-15       Impact factor: 6.444

10.  Increased Proportion of Variance Explained and Prediction Accuracy of Survival of Breast Cancer Patients with Use of Whole-Genome Multiomic Profiles.

Authors:  Ana I Vazquez; Yogasudha Veturi; Michael Behring; Sadeep Shrestha; Matias Kirst; Marcio F R Resende; Gustavo de Los Campos
Journal:  Genetics       Date:  2016-04-29       Impact factor: 4.562

