Laura M Zingaretti (1,2), Gilles Renand (3), Diego P Morgavi (4), Yuliaxis Ramayo-Caldas (3,5)
1. Plant and Animal Genomics, Statistical and Population Genomics Group, CSIC-IRTA-UAB-UB Consortium, Centre for Research in Agricultural Genomics (CRAG), 08193 Bellaterra, Spain.
2. IAPCBA and IAPCH, UNVM, Villa María, Córdoba 5900, Argentina.
3. URM Animal Genetics and Integrative Biology, GABI, INRA, AgroParisTech, Université Paris-Saclay, 78352 Jouy-en-Josas, France.
4. Animal Physiology and Livestock Systems Divisions, INRA, Herbivore Research Unit, Clermont Auvergne University, Saint Genès-Champanelle 63122, France.
5. Animal Breeding and Genetics Program, IRTA, 08140 Caldes de Montbui, Spain.
Abstract
MOTIVATION: We present Link-HD, an approach to integrate multiple datasets. Link-HD is a generalization of 'Structuration des Tableaux A Trois Indices de la Statistique-Analyse Conjointe de Tableaux', a family of methods designed to integrate information from heterogeneous data. Here, we extend the classical approach to handle a broader range of datasets (e.g. compositional data) and add methods for variable selection and taxon-set enrichment analysis.
RESULTS: The methodology is demonstrated by integrating rumen microbial communities from cows for which methane yield (CH4y) was individually measured. Our approach reproduces the significant link between rumen microbiota structure and CH4 emission. When analyzing the TARA's ocean data, Link-HD replicates published results, highlighting the relevance of temperature and of members of the phylum Proteobacteria to the structure and functionality of this ecosystem.
AVAILABILITY AND IMPLEMENTATION: The source code, examples and a complete manual are freely available on GitHub (https://github.com/lauzingaretti/LinkHD) and on Bioconductor (https://bioconductor.org/packages/release/bioc/html/LinkHD.html).
1 Introduction
The reduction of ‘omics’ technology costs now enables collection of data from multiple sources. This allows researchers to simultaneously study several datasets and investigate their relationship with complex traits. The integration of these heterogeneous datasets is not trivial, and several statistical methods have been developed to address this challenge (Argelaguet ; Mariette and Villa-Vialaneix, 2018; Meng ). In particular, integrating multiple microbial ecosystems poses unique challenges because these data are compositional and sparse. MixKernel (Mariette and Villa-Vialaneix, 2018) is a well-known tool designed to integrate heterogeneous datasets, including microbial communities, but it offers no method for taxonomic enrichment analysis. Another popular integrative approach is MOFA (Argelaguet ); however, it is unable to deal with compositional data.
Here, we present Link-HD, a tool to integrate and explore multiple microbial communities based on STATIS (Des Plantes, 1976), a family of multivariate methods to integrate multiple datasets. Link-HD extends STATIS with Regression Biplot (Ter Braak, 1997), clustering, differential abundance, taxonomic enrichment analysis and visualization tools. Link-HD analyzes distance tables computed from numerical, categorical or compositional data as a generalization of multidimensional scaling (Abdi ). Furthermore, Link-HD performs variable selection and can link the resulting common sub-space with phenotype information.
2 Materials and methods
Like STATIS, Link-HD aims to compare and analyze the relationships between datasets with a shared set of observations or variables. However, our package was specifically designed to integrate microbial communities and incorporate distances and transformations to deal with compositional data (Aitchison, 1982). The method is implemented in three main phases (Fig. 1).
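As an illustration of the kind of compositional transformation referred to above, the following is a minimal sketch of the centered log-ratio (CLR) transform and the Aitchison distance it induces. It is not Link-HD's implementation: the function names, the pseudocount used to avoid log(0) on sparse counts, and the toy OTU table are all assumptions made for this example.

```python
import math

def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample (a list of non-negative counts)."""
    # Assumption: a pseudocount is added so that zero counts do not produce log(0).
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    # Subtracting the mean log is the same as dividing by the geometric mean.
    return [v - mean_log for v in logs]

def aitchison_distance(x, y, pseudocount=0.5):
    """Euclidean distance between two CLR-transformed samples."""
    cx, cy = clr(x, pseudocount), clr(y, pseudocount)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(cx, cy)))

# Hypothetical OTU counts: three samples (rows) by four taxa (columns).
otu_table = [
    [10, 0, 5, 85],
    [12, 1, 4, 83],
    [50, 30, 15, 5],
]

# Pairwise distance table, the kind of input a distance-based integration step consumes.
dist = [[aitchison_distance(a, b) for b in otu_table] for a in otu_table]
```

Without a pseudocount the CLR is invariant to sequencing depth: multiplying every count in a sample by the same constant leaves its transformed values unchanged, which is why it suits relative-abundance data.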
Fig. 1.
Link-HD workflow. In the inter-structure step, raw data are transformed using cumulative sum scaling or centered log ratio, and the vector correlation coefficient (Rv) is computed. The second step computes the compromise (W) and, finally, the intra-structure step involves the eigendecomposition of W. Observations can be clustered, and methods for variable selection and for association with phenotypes are available.
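The first two steps of this workflow can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the package's code: it builds cross-product matrices from column-centered data tables (classical STATIS) rather than from distance tables, it obtains the weights from the leading eigenvector of the Rv matrix rescaled to sum to one (one common STATIS formulation), and all function names and the toy tables are hypothetical.

```python
import math

def cross_product(table):
    """Return S = Xc Xc^T for a samples-by-variables table after column centering."""
    n, p = len(table), len(table[0])
    means = [sum(row[j] for row in table) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in table]
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in centered] for ri in centered]

def frobenius_inner(a, b):
    """Trace(A^T B): the sum of element-wise products of two matrices."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def rv_coefficient(a, b):
    """Escoufier's Rv between two cross-product matrices (values in [0, 1])."""
    return frobenius_inner(a, b) / math.sqrt(frobenius_inner(a, a) * frobenius_inner(b, b))

def leading_eigenvector(matrix, iterations=500):
    """Power iteration; adequate for a small symmetric matrix with positive entries."""
    v = [1.0] * len(matrix)
    for _ in range(iterations):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def compromise(tables):
    """Return (Rv matrix, weights, compromise W) for tables sharing the same samples."""
    s_list = [cross_product(t) for t in tables]
    # Scale each S_k to unit Frobenius norm so that no table dominates by size alone.
    s_list = [[[x / math.sqrt(frobenius_inner(s, s)) for x in row] for row in s] for s in s_list]
    # Inter-structure: pairwise similarity between tables.
    rv = [[rv_coefficient(a, b) for b in s_list] for a in s_list]
    # Compromise: weights from the leading eigenvector of Rv, rescaled to sum to 1.
    v = leading_eigenvector(rv)
    weights = [x / sum(v) for x in v]
    n = len(s_list[0])
    w = [[sum(wk * s[i][j] for wk, s in zip(weights, s_list)) for j in range(n)] for i in range(n)]
    return rv, weights, w

# Hypothetical tables measured on the same four samples.
bacteria = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2]]
archaea = [[0.9], [2.1], [3.2], [3.9]]
protozoa = [[5.0, 1.0], [1.0, 4.0], [4.0, 2.0], [2.0, 5.0]]

rv, weights, w = compromise([bacteria, archaea, protozoa])
```

In this toy example the first two tables order the samples almost identically, so they receive larger weights than the third. The intra-structure step would then apply an eigendecomposition to `w` to obtain the low-dimensional sample coordinates.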
Inter-structure step: The algorithm first assesses the similarity between transformed distance tables using the vector correlation coefficient (Rv) (Escoufier, 1973), which can be interpreted as a general ‘vector covariance’ between matrices, i.e. this step evaluates similarity between the disparate datasets.
Compromise step: Next, the ‘compromise’ matrix is calculated, which is a weighted sum of each cross-product matrix. This step involves an optimization problem since the weights are chosen to maximize the correlation between the compromise matrix and each individual component.
Intra-structure step: Finally, the compromise matrix is evaluated through a Principal Component Analysis. The coordinates of the common elements are projected into a low rank space, where the relationships between them can be easily interpreted.
Variable selection is tackled by two alternative approaches: (i) by projecting all the input variables into the compromise through a general Biplot formulation (Ter Braak, 1997); and (ii) by computing the differential abundance of features between clusters of samples. A novelty of Link-HD is its ability to aggregate the selected variables at several taxonomic levels and to establish whether that level is enriched using a cumulative hypergeometric distribution. This function also allows users to add a custom OTUs list. Finally, the SPIEC-EASI (Kurtz ) tool can be used to visualize variable interactions.
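The enrichment test based on the cumulative hypergeometric distribution can be illustrated with a short sketch. The function names, the OTU-to-genus mapping and the list of selected OTUs below are hypothetical and serve only to show the calculation; Link-HD's own function and its arguments may differ.

```python
from collections import Counter
from math import comb

def hypergeom_upper_tail(hits, n_selected, n_in_taxon, n_universe):
    """P(X >= hits) when n_selected OTUs are drawn without replacement from the universe."""
    upper = min(n_selected, n_in_taxon)
    tail = sum(
        comb(n_in_taxon, k) * comb(n_universe - n_in_taxon, n_selected - k)
        for k in range(hits, upper + 1)
    )
    return tail / comb(n_universe, n_selected)

def taxon_enrichment(selected_otus, otu_to_taxon):
    """Return an enrichment p-value for every taxon represented among the selected OTUs."""
    taxon_sizes = Counter(otu_to_taxon.values())
    hits = Counter(otu_to_taxon[otu] for otu in selected_otus)
    n_universe, n_selected = len(otu_to_taxon), len(selected_otus)
    return {
        taxon: hypergeom_upper_tail(hits[taxon], n_selected, taxon_sizes[taxon], n_universe)
        for taxon in hits
    }

# Hypothetical universe of 40 OTUs: 8 assigned to GenusA, 32 to GenusB.
otu_to_taxon = {f"OTU{i}": ("GenusA" if i < 8 else "GenusB") for i in range(40)}
# Hypothetical variable-selection result: five GenusA OTUs and one GenusB OTU.
selected = ["OTU0", "OTU1", "OTU2", "OTU3", "OTU4", "OTU20"]

pvalues = taxon_enrichment(selected, otu_to_taxon)
```

Here GenusA holds a fifth of the universe but five of the six selected OTUs, so its upper-tail probability is small, whereas GenusB shows no evidence of enrichment. In practice the p-values would normally be corrected for the number of taxa tested.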
3 Case studies
We illustrate our approach with rumen microbial (Ramayo-Caldas ), TARA’s Ocean expedition (Sunagawa ) and transcriptome NCI-60 cell line datasets (Reinhold ).
In the rumen study, we integrated Bacteria, Archaea and Protozoa from 65 Holstein cows. Link-HD was able to reproduce previous results (Danielsson ; Kittelmann ; Ramayo-Caldas ), showing a link between the structure of the rumen microbiota and CH4 emission. We also identified microbial markers associated with CH4. In the TARA’s example, Link-HD replicates the relevant role of temperature and of the phylum Proteobacteria in the structure of this ecosystem, as described in Mariette and Villa-Vialaneix (2018). Finally, we show the potential of Link-HD to integrate other omics layers by using transcriptome data from the NCI-60 cell lines. Link-HD recapitulates the reported data structure (Meng ), and ontology analysis reveals several cancer-related pathways.
In all, our results demonstrate that Link-HD is robust in combining several heterogeneous data types. A detailed description of these case studies and the theory behind Link-HD is available at https://lauzingaretti.github.io/LinkHD/ and in Bioconductor (https://bioconductor.org/packages/release/bioc/html/LinkHD.html).
4 Conclusions
We have developed an R package that integrates multiple microbial communities and other ‘omics’ layers, combining several statistical methods in a fast, simple and flexible way.
Authors: Shinichi Sunagawa; Luis Pedro Coelho; Samuel Chaffron; Jens Roat Kultima; Karine Labadie; Guillem Salazar; Bardya Djahanschiri; Georg Zeller; Daniel R Mende; Adriana Alberti; Francisco M Cornejo-Castillo; Paul I Costea; Corinne Cruaud; Francesco d'Ovidio; Stefan Engelen; Isabel Ferrera; Josep M Gasol; Lionel Guidi; Falk Hildebrand; Florian Kokoszka; Cyrille Lepoivre; Gipsi Lima-Mendez; Julie Poulain; Bonnie T Poulos; Marta Royo-Llonch; Hugo Sarmento; Sara Vieira-Silva; Céline Dimier; Marc Picheral; Sarah Searson; Stefanie Kandels-Lewis; Chris Bowler; Colomban de Vargas; Gabriel Gorsky; Nigel Grimsley; Pascal Hingamp; Daniele Iudicone; Olivier Jaillon; Fabrice Not; Hiroyuki Ogata; Stephane Pesant; Sabrina Speich; Lars Stemmann; Matthew B Sullivan; Jean Weissenbach; Patrick Wincker; Eric Karsenti; Jeroen Raes; Silvia G Acinas; Peer Bork Journal: Science Date: 2015-05-22 Impact factor: 47.728
Authors: William C Reinhold; Margot Sunshine; Hongfang Liu; Sudhir Varma; Kurt W Kohn; Joel Morris; James Doroshow; Yves Pommier Journal: Cancer Res Date: 2012-07-15 Impact factor: 12.701
Authors: Zachary D Kurtz; Christian L Müller; Emily R Miraldi; Dan R Littman; Martin J Blaser; Richard A Bonneau Journal: PLoS Comput Biol Date: 2015-05-07 Impact factor: 4.475
Authors: Rebecca Danielsson; Johan Dicksved; Li Sun; Horacio Gonda; Bettina Müller; Anna Schnürer; Jan Bertilsson Journal: Front Microbiol Date: 2017-02-17 Impact factor: 5.640
Authors: Sandra Kittelmann; Cesar S Pinares-Patiño; Henning Seedorf; Michelle R Kirk; Siva Ganesh; John C McEwan; Peter H Janssen Journal: PLoS One Date: 2014-07-31 Impact factor: 3.240
Authors: M Saladrigas-García; M D'Angelo; H L Ko; S Traserra; P Nolis; Y Ramayo-Caldas; J M Folch; P Vergara; P Llonch; J F Pérez; S M Martín-Orúe Journal: Sci Rep Date: 2021-03-17 Impact factor: 4.379
Authors: M Saladrigas-García; M D'Angelo; H L Ko; P Nolis; Y Ramayo-Caldas; J M Folch; P Llonch; D Solà-Oriol; J F Pérez; S M Martín-Orúe Journal: Sci Rep Date: 2021-12-06 Impact factor: 4.379