
Multiset sparse redundancy analysis for high-dimensional omics data.

Attila Csala, Michel H Hof, Aeilko H Zwinderman.

Abstract

Redundancy Analysis (RDA) is a well-known method used to describe the directional relationship between related data sets. Recently, we proposed sparse Redundancy Analysis (sRDA) for high-dimensional genomic data analysis to find explanatory variables that explain the most variance of the response variables. As more and more biomolecular data become available from different biological levels, such as genotypic and phenotypic data from different omics domains, a natural research direction is to apply an integrated analysis approach in order to explore the underlying biological mechanism of certain phenotypes of the given organism. We show that the multiset sparse Redundancy Analysis (multi-sRDA) framework is a prominent candidate for high-dimensional omics data analysis since it accounts for the directional information transfer between omics sets, and, through its sparse solutions, the interpretability of the result is improved. In this paper, we also describe a software implementation for multi-sRDA, based on the Partial Least Squares Path Modeling algorithm. We test our method through simulation and real omics data analysis with data sets of 364,134 methylation markers, 18,424 gene expression markers, and 47 cytokine markers measured on 37 patients with Marfan syndrome.
© 2018 The Authors. Biometrical Journal Published by WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

Keywords:  high-dimensional data; multivariate statistics; omics data; redundancy analysis

Year:  2018        PMID: 30506971      PMCID: PMC6587877          DOI: 10.1002/bimj.201700248

Source DB:  PubMed          Journal:  Biom J        ISSN: 0323-3847            Impact factor:   2.207


INTRODUCTION

In recent years, technological advancement has enabled the generation of large amounts of biomolecular data from different biological levels, such as the genome, epigenome, transcriptome, proteome, and metabolome. Omics data from multiple biological levels are often measured on the same patients who share a certain phenotype, and the conceptual information flow between these different biological levels is well defined by the central dogma of molecular biology (Crick, 1970). Integromics is an emergent research area concerned with simultaneously analyzing multiple molecular and clinical data sources from various omics domains, in order to reveal complex biological pathways that are involved in certain phenotypes (Buescher & Driggers, 2016). In this paper, we describe a statistical tool that models the conceptual information flow between multiple omics data sets and extracts genotypic data from those data sets in order to model biological pathways of the phenotype of interest. We refer to this method as multiset sparse Redundancy Analysis (multi‐sRDA). Redundancy Analysis (RDA) is a well‐known multivariate method that models the information flow between two data sets by maximizing the redundancy index between explanatory and response variables; thus RDA measures the effect of the explanatory data set on the response data set (Johansson, 1981). RDA describes the relationships between data sets by finding a linear combination of the explanatory variables that explains the most variance of the response variables. Thus, if one wishes to perform an integrated analysis to explain the variability of phenotypes with multiple sets of genotypic data, multiset Redundancy Analysis (multi‐RDA) is a prominent candidate.
Although RDA's conceptual model was already described by van den Wollenberg (1977), and has been widely used in fields such as ecology (Oksanen et al., 2007) and psychology (Israels, 1984), it has received little attention in high‐dimensional genetic and genomic data analyses (Huang, Chaudhary, & Garmire, 2017). A well‐known characteristic of omics data is its high dimensionality (p ≫ n, where p is the number of variables and n is the number of observations). In this setting, RDA's software implementations are not applicable, since they rely on standard multivariate least squares regression steps in which high‐dimensional covariance matrices are noninvertible (Oksanen et al., 2007). Recently, we showed how penalization methods can be introduced into RDA to overcome the high dimensionality problem. By applying the Elastic Net (ENet) penalization to the multivariate regression equation in a Partial Least Squares (PLS) framework, we showed that sparse Redundancy Analysis (sRDA) is able to deal with high‐dimensional genetic and genomic data and select the important explanatory variables with high precision (Csala, Voorbraak, Zwinderman, & Hof, 2017). In this paper, we show how sRDA can be extended into a multiset multivariate method so that the effect of multiple explanatory variables, from multiple data sets, on a set of outcome variables, measured on one data set, can be assessed simultaneously. Multi‐sRDA is a particularly attractive method for omics data analysis since it is able to account for the directional information flow between data sets from different omics levels. In addition, multi‐sRDA provides easily interpretable sparse solutions by selecting the important explanatory variables that explain the most variation in their response data set, while excluding from the final model the unimportant variables that explain little to no variance of their response set.
In this paper, we extend our sRDA approach to multiple sets and demonstrate multi‐sRDA's capacity for high‐dimensional omics data by conducting simulation studies and analyzing real biomedical data from three different omics domains, measured on the same 37 patients, all diagnosed with the same disease phenotype, namely, Marfan syndrome. For the simulation studies, we generate multiple high‐dimensional data sets to demonstrate multi‐sRDA's ability to find the important explanatory variables in a high‐dimensional, multiple data set setting. For the real data analysis, we build a conceptual model that incorporates 364,134 methylation markers, 18,424 gene expression markers, and 47 serum cytokine markers. In this setting, multi‐sRDA models the information transfer from the epigenome through the transcriptome to the proteome. Through this real data analysis, we show how multi‐sRDA can be used to analyze multiple high‐dimensional omics sets in order to find the combination of those foremost biomolecular markers from various biological levels that explain the most variance of the phenotypic variables, while modeling the conceptual model of the central dogma of molecular biology. This paper is organized as follows. Section 2 describes sRDA and multi‐sRDA, with an algorithmic implementation of multi‐sRDA in the PLS framework. Section 3 describes the simulation studies that are used to assess multi‐sRDA's ability to find the important explanatory variables, and Section 4 describes an application of multi‐sRDA on real biomedical data. Section 5 discusses the findings of the real data analysis and concludes the paper.

DIRECTIONAL MULTISET MULTIVARIATE ANALYSES FOR HIGH‐DIMENSIONAL DATA

Sparse redundancy analysis

We start with a description of sRDA for two data sets: an explanatory data set X and a response data set Y. Consider data on n individuals distributed over two data sets, X and Y, where X is an n × p matrix containing p explanatory variables (i.e., X = [x_1, ..., x_p]) and Y is an n × q matrix with q response variables (i.e., Y = [y_1, ..., y_q]). Let ξ = Xα be the n‐dimensional latent variable of X and η = Yβ the n‐dimensional latent variable of Y. The column vectors α and β are the associated weights of X and Y, respectively. RDA estimates the amount of variance of a given set of response variables in terms of a given set of explanatory variables. The objective of RDA is to define a linear combination of X (denoted as ξ) that maximizes the sum of the squared correlations between ξ and y_1, ..., y_q, that is,

  argmax over α, β of Σ_{j=1}^{q} cor(Xα, y_j)².   (1)

The weights α and β can be estimated in an iterative multiple least‐squares regression framework (Fornell, Barclay, & Rhee, 1988). Estimating β involves q univariate least‐squares regressions of y_j on ξ, where y_j is the j‐th column of the matrix Y. Estimating α involves a multivariate least squares regression, for which the standard closed‐form solution

  α = (XᵀX)⁻¹ Xᵀ η   (2)

can be used if the number of explanatory variables is smaller than the number of observations (i.e., p < n). Since in omics data the number of explanatory variables is typically much larger than the number of individuals (i.e., p ≫ n), the closed‐form solution for the multivariate regression step leads to multicollinearity issues. In this setting, the covariance matrix XᵀX in Equation (2) is noninvertible. To solve this problem, we recently proposed using the ENet penalization, which involves both the LASSO and the Ridge regularization. ENet is applied to the multivariate regression step in the RDA framework (Csala et al., 2017), which turns Equation (2) into

  α̂ = argmin over α of ( ‖η − Xα‖² + λ1 ‖α‖_1 + λ2 ‖α‖² ),   (3)

where ‖α‖_1 denotes the 1‐norm of the vector α, λ1 is the LASSO penalty parameter, and λ2 is the Ridge penalty parameter.
Thus sRDA can be applied to high‐dimensional data sets X and Y to maximize the correlation described in Equation (1) by the following steps:

sparse Redundancy Analysis (sRDA) Algorithm
(i) Given explanatory data set X and response data set Y.
(ii) Preliminary steps:
  (a) Center and scale X and Y.
  (b) Set α(0) and β(0) to arbitrary vectors of length p and q, respectively.
  (c) Define the convergence criterion and a small positive tolerance γ.
(iii) Iterative alternating regression. While the summed squared differences between successive α and β estimates exceed γ:
  (a) calculate the latent variable ξ = Xα, normalized to unit variance;
  (b) estimate β by regressing each y_j on ξ, and calculate η = Yβ, normalized to unit variance;
  (c) estimate α by the penalized multivariate regression of η on X (Equation 3).
(iv) Upon convergence, return α and β.

More detail about sRDA can be found in Csala et al. (2017).
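The alternating steps above can be sketched in a few lines of code. The following is a minimal illustrative implementation using Univariate Soft Thresholding (UST, the special case of ENet discussed later in the paper), not the authors' reference software; the function name, the constant starting vector, and the parameter defaults are our own choices.

```python
import numpy as np

def soft_threshold(z, lam):
    # UST soft-thresholding: keeps only covariances exceeding lam in magnitude
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def srda_ust(X, Y, lam1=0.3, tol=1e-8, max_iter=500):
    """Minimal sRDA sketch with UST penalization. X: n x p explanatory set,
    Y: n x q response set. Assumes lam1 is small enough that at least one
    weight stays nonzero."""
    n = X.shape[0]
    X = (X - X.mean(0)) / X.std(0)           # center and scale
    Y = (Y - Y.mean(0)) / Y.std(0)
    alpha = np.ones(X.shape[1])              # arbitrary starting weights
    beta = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        xi = X @ alpha
        if xi.std() == 0:                    # all weights shrunk to zero
            break
        xi /= xi.std()                       # latent variable of X, unit variance
        beta = Y.T @ xi / n                  # q univariate regressions of y_j on xi
        eta = Y @ beta
        eta /= eta.std()                     # latent variable of Y, unit variance
        # penalized multivariate step, with the covariance matrix ignored (UST)
        new_alpha = soft_threshold(X.T @ eta / n, lam1)
        if np.sum((new_alpha - alpha) ** 2) < tol:
            alpha = new_alpha
            break
        alpha = new_alpha
    return alpha, beta
```

On simulated data with a few truly associated explanatory variables, the nonzero entries of the returned α should concentrate on those variables.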

Multiset sparse redundancy analysis

Now we describe multi‐sRDA for the case when the true explanatory variables are distributed over multiple data sets. Suppose that the information transfer between the data sets is well defined and can be modeled with explanatory and response data set pairs. The objective of multi‐sRDA is to maximize the correlation given in Equation (1) for every pair of explanatory and response data sets, that is,

  argmax over α_k, β of Σ_k Σ_j cor(X_k α_k, y_j)²,   (4)

where ξ_k = X_k α_k is the latent vector of the k‐th explanatory data set and y_j is the j‐th column of its response data set. More precisely, suppose there are three data sets X1, X2, and Y, and assume a directional information flow from X1 to X2 and information flows from both X1 and X2 toward Y; thus X1 and X2 are explanatory for Y, and X1 is also explanatory for X2 (see Figure 1). Our objective is to find the combination of variables in data sets X1 and X2 that explain the most variance in Y, while accounting for the fact that X1 also explains variance in X2 (i.e., X1 has an explanatory‐response relationship with X2). The resulting α1 and α2 weights for data sets X1 and X2, respectively, indicate the strength of the contribution of the explanatory variables. Due to the LASSO penalty, the sparse solution provides the set of variables that has the highest contribution among all variables. The β weights describe the strength of the correlation between the response variables from data set Y and the linear combination of their important explanatory variables. Our implementation of multi‐sRDA is based on Wold's Partial Least Squares path modeling algorithm (PLS‐PM) (Esposito Vinzi & Russolillo, 2013; Sanchez, 2013; Wold, 1975).
Figure 1

Multiset sRDA based on the PLS path modeling framework, where x11, x12, and x13 denote the variables of matrix X1, and α11, α12, and α13 are the elements of vector α1, denoting the individual regression weights of the explanatory variables from matrix X1 on their latent variable ξ1. Notes: Latent variables ξ1 and ξ2 are explanatory for the response latent variable η, and ξ1 is also explanatory for ξ2.

First, we present the algorithm for the case of three data sets (see Figure 1). Afterward, we generalize the algorithm to an arbitrary number of data sets.

Multiset sparse Redundancy Analysis algorithm for three data sets
(i) Given data sets X1, X2, and Y.
(ii) Preliminary steps:
  (a) Center and scale X1, X2, and Y.
  (b) Set α1(0), α2(0), and β(0) to arbitrary vectors of length p1, p2, and q, respectively.
  (c) Define the convergence criterion and a small positive tolerance γ.
(iii) Iterative alternating regression. While the summed squared differences between successive weight estimates exceed γ:
  Step (a) Estimate the initial latent variables ξ1 ∝ X1α1, ξ2 ∝ X2α2, and η ∝ Yβ, where ∝ indicates that the latent variable is normalized to unit variance.
  Step (b) Model the relationships between the data sets by calculating the correlation/regression weights between their latent variables (i.e., use regression weights for explanatory‐response data set pairs and correlation weights otherwise; see the general case for the precise description): θ12 are the weights from the regression model of ξ2 on ξ1, and θ13 and θ23 are the weights from the multiple regression model of η on ξ1 and ξ2.
  Step (c) Reestimate the latent variables.
  Step (d) Estimate the new α1, α2, and β weights and calculate the latent variables: α1 and α2 from the penalized multivariate regressions of the reestimated latents on X1 and X2, respectively, and β from the univariate regressions of the columns of Y on the reestimated latent.
  Step (e) Evaluate the convergence criterion and discard the old α1, α2, and β weights.
(iv) Upon convergence, return α1, α2, and β.

The previous example can be generalized to K explanatory data sets and a single response data set. In order to obtain the optimal α_k and β weights, the latent variables ξ_k and η are first estimated with arbitrary α_k(0) and β(0) weights (Step (ii)(a) in the multi‐sRDA Algorithm).
Ξ is the matrix of all latent variables (i.e., Ξ = [ξ1, ..., ξK, η]), and the links between the response and explanatory latent variables are modeled by multiple regression equations of the form

  ξ_j = Σ_i θ_ij ξ_i + ε_j,   (5)

where ξ_j denotes a response latent variable, ξ_i denotes a latent variable that is explanatory for ξ_j, and ε_j denotes the residual vector of ξ_j (Esposito Vinzi & Russolillo, 2013; Sanchez, 2013). The association between the latent variables is expressed by θ_ij (i.e., θ_ij is the coefficient denoting the effect of ξ_i on ξ_j). Coefficient θ_ij is obtained from the multiple multivariate regression of ξ_j on its latent explanatory variables, and the latent variables are updated with the effect of their explanatories (Step (ii)(b)). Once the matrix Θ of these coefficients is obtained, the latent variables are reestimated and stored in matrix Z (i.e., Z = ΞΘ, Step (ii)(c)). Then each latent explanatory variable ξ_k is recalculated with α_k weights that are obtained from the penalized multivariate regression of its reestimated latent on data set X_k. The latent response variable η is calculated with the updated β weights, which are obtained by the univariate regressions of the columns of Y on the reestimated latent (Step (ii)(d)). This process is repeated until convergence, and convergence is reached if the summed squared differences of the successive α and β estimates are smaller than γ, a predefined small positive tolerance (Step (ii)(e)).
For an arbitrary number of explanatory data sets and one response data set, the correlation described in Equation (4) can be maximized by the following steps:

Multiset sparse Redundancy Analysis general Algorithm
(i) Given K + 1 data sets, of which K are explanatory data sets X_k (indexed by k = 1, ..., K), and a response data set Y.
(ii) Preliminary steps:
  (a) Center and scale X_1, ..., X_K and Y.
  (b) Set α_k(0) (i.e., α_1(0), ..., α_K(0)) and β(0) to arbitrary vectors of length p_1, ..., p_K, and q, respectively.
  (c) Define the convergence criterion and a small positive tolerance γ.
(iii) Iterative alternating regression. While the convergence criterion is not met:
  Step (a) Estimate the initial latent variables ξ_k ∝ X_k α_k, where k is the index from 1 to K and ∝ indicates that ξ_k is normalized to unit variance. Define the matrix Ξ whose columns are the latent variables.
  Step (b) Model the relationships between the data sets by calculating the correlation/regression weights between their latent variables. Define Θ as the weight matrix whose element θ_ij is: if ξ_i is explanatory for ξ_j, the regression weight of ξ_i from the multiple regression of ξ_j on the submatrix of Ξ formed by all columns that are explanatory for ξ_j; otherwise, the correlation between ξ_i and ξ_j.
  Step (c) Reestimate the latent variables, Z = ΞΘ.
  Step (d) Estimate the new α_k and β weights and calculate the latent variables.
  Step (e) Evaluate the convergence criterion and discard the old α_k and β weights.
(iv) Upon convergence, return α_1, ..., α_K and β.

Note that the only assumption of this model is that there is a linear relationship between the explanatory and response variables that can be modeled through latent variables (Equations 5 and 6). Also, PLS‐PM does not require any assumptions regarding the sample size or the distribution of the explanatory and response variables; therefore it is considered to be an exploratory approach rather than a confirmatory one (Vinzi, Trinchera, & Amato, 2010). In the next section, we describe how to determine the best penalization parameters for the penalized multivariate regression step (i.e., selecting λ1 and λ2 when computing α_k).
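The alternating scheme can be reduced to a compact sketch when all explanatory latents point at one response set. The code below is a simplified illustration of the PLS‐PM‐style updates with UST, not the full general algorithm: the inner correlation/regression weighting of Step (b) is collapsed into a single least‐squares fit of the response latent on the explanatory latents, and all names are our own.

```python
import numpy as np

def soft_threshold(z, lam):
    # UST: elementwise soft-thresholding of the covariances with the latent
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def multi_srda_ust(X_list, Y, lams, tol=1e-8, max_iter=200):
    """Simplified multi-sRDA sketch: K explanatory sets, one response set.
    Assumes each lambda in `lams` leaves at least one nonzero weight."""
    n, q = Y.shape
    Xs = [(X - X.mean(0)) / X.std(0) for X in X_list]
    Y = (Y - Y.mean(0)) / Y.std(0)
    beta = np.ones(q)                        # arbitrary starting weights
    for _ in range(max_iter):
        eta = Y @ beta
        eta /= eta.std()                     # response latent, unit variance
        # outer step: sparse weights for each explanatory set (UST)
        alphas = [soft_threshold(X.T @ eta / n, lam) for X, lam in zip(Xs, lams)]
        xis = []
        for X, a in zip(Xs, alphas):
            xi = X @ a
            s = xi.std()
            xis.append(xi / s if s > 0 else xi)
        Xi = np.column_stack(xis)
        # inner step: path coefficients of the response latent on the
        # explanatory latents, giving a reestimated latent z
        theta, *_ = np.linalg.lstsq(Xi, eta, rcond=None)
        z = Xi @ theta
        z /= z.std()
        new_beta = Y.T @ z / n               # univariate regressions of y_j on z
        if np.sum((new_beta - beta) ** 2) < tol:
            beta = new_beta
            break
        beta = new_beta
    return alphas, beta
```

With two explanatory sets whose first variables drive a shared signal in Y, the returned sparse weight vectors should pick out exactly those variables.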

Selecting the best penalization parameters

As the main goal of the analysis is to maximize the correlation given in Equation (1) for every explanatory and response pair of data sets, we define the best penalization parameters as the values of λ1 and λ2 that lead to the maximization of the correlation described in Equation (4). To determine the best penalization parameters for the penalized multivariate regression (i.e., selecting λ1 and λ2 when computing α_k), we define a search space of values for both parameters and assess all possible combinations of these through a standard 10‐fold cross validation (CV) procedure. This procedure is repeated for every explanatory data set X_k, and the selected best penalization parameters are the ones that maximize the sum of squared correlations between the response variables in data set Y and their latent explanatories. Finding the best penalization parameters quickly becomes computationally expensive as the search space for λ1 and λ2 is extended. For example, extending the search space from two possible values for both λ1 and λ2 to two values for λ1 and three values for λ2 introduces 80 extra multi‐sRDA runs in a 4 data set setting, since the total number of runs is defined by: |search space(λ1)| × |search space(λ2)| × (number of data sets) × 10. This high computational burden can be mitigated by introducing Univariate Soft Thresholding (UST) penalization, which can be seen as a special version of ENet where λ2 → ∞. With UST, the covariance matrix XᵀX from the multivariate regression in Equation (3) is ignored, which implies that with UST we ignore the correlations between the variables within the explanatory data set X_k.
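The run count above is simple arithmetic; a two-line helper (our own naming) makes the 80-extra-runs example concrete.

```python
def total_cv_runs(n_lambda1, n_lambda2, n_datasets, n_folds=10):
    # total model fits in the grid search:
    # |grid(lambda1)| * |grid(lambda2)| * (number of data sets) * folds
    return n_lambda1 * n_lambda2 * n_datasets * n_folds
```

A 2 × 2 grid over 4 data sets with 10 folds costs 160 fits; moving to a 2 × 3 grid costs 240, i.e., 80 extra runs.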

Assessing the statistical significance of the proposed model and quantifying multi‐sRDA's ability to include the relevant variables in the final model

For assessing the statistical significance of the proposed model, we conduct a permutation study as follows. The null distribution of the sum of squared correlations between the latent explanatory variables and the response variables is approximated by removing the correlation between the explanatory and response data sets: permuting the rows of the matrices shuffles the observations, keeping the internal correlation structure of each matrix intact while removing the correlation between the involved matrices. We fit the model on the permuted data sets to obtain the weights. The estimated weights are used to determine the sum of squared correlations, and the null distribution of the sum of squared correlations is obtained by repeating the permutation test many times. In addition, a resampling method is used to approximate the confidence interval of the optimum estimate of the sum of absolute correlations. We use bootstrapping by taking a sample of observations from the original data, with replacement. We do this 100 times, fit a model on each bootstrap sample, and report the mean and selected quantiles of the resulting distribution. In order to assess multi‐sRDA's ability to select the relevant (i.e., truly associated) variables from each data set, we calculate the true‐positive rate (TPR) and true‐negative rate (TNR) during the simulation studies. That is, for TPR we study the proportion of the number of truly associated variables identified by the algorithm (i.e., those that the model assigned nonzero weights) to the total number of truly associated variables that were simulated during the data generation. For TNR, we assess the proportion of the number of truly nonassociated variables excluded from the final model (i.e., those truly nonassociated variables that the model assigned zero regression weights) to the total number of truly nonassociated variables.
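The permutation scheme described above can be sketched generically: shuffle the rows of the explanatory matrix, refit, and record the sum of squared correlations between the fitted latent and the response variables. This is an illustrative helper with our own names; `fit_fn` stands in for any fitting routine that returns a latent vector.

```python
import numpy as np

def permutation_null(fit_fn, X, Y, n_perm=200, seed=0):
    """Null distribution of the sum of squared correlations: permuting the
    rows of X breaks the X-Y association while keeping each matrix's
    internal correlation structure intact."""
    rng = np.random.default_rng(seed)
    null = np.empty(n_perm)
    for b in range(n_perm):
        Xp = X[rng.permutation(X.shape[0])]  # shuffle observations of X only
        xi = fit_fn(Xp, Y)
        corrs = np.corrcoef(np.column_stack([xi, Y]), rowvar=False)[0, 1:]
        null[b] = np.sum(corrs ** 2)         # sum of squared correlations
    return null
```

The observed statistic is then compared against this distribution, e.g., the p‐value is the fraction of permuted statistics at least as large as the observed one.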
It is important to note that the TPR and TNR measures are hampered by restricted computational resources. That is, finding the best penalization parameters becomes increasingly computationally expensive as the search space for the possible parameters extends, and it is possible that the optimal value is not included in the search space at all. For example, the model might not include all the associated variables due to a restricted search space for λ1: during the CV study one might restrict the search space to five possible values for λ1, and these all might result in a model with fewer nonzero weights than the total number of truly associated variables, that is, some of the truly associated variables will not be identified. In order to account for this bias, we provide a measure that is less restrained by the search spaces of the penalization parameters and is less affected when the number of nonzero weights allowed by λ1 is smaller than the number of truly associated variables. We call this the truly associated variable inclusion (TAVI) measure. TAVI gives a better indication of how many times some of the truly associated variables are included in the model. TAVI is defined as the proportion of truly associated variables identified by the algorithm to either the total number of truly associated variables or the number of nonzero weights allowed by the selected λ1 parameter, whichever is smaller, that is, TAVI = (number of truly associated variables identified) / min(number of truly associated variables, number of nonzero weights allowed by λ1).
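The three selection metrics can be computed directly from a fitted weight vector; the function below follows the definitions in the text (names and argument conventions are our own).

```python
import numpy as np

def selection_metrics(weights, true_idx, lam1_nonzeros=None):
    """TPR, TNR, and TAVI for one explanatory set.
    weights: estimated weight vector; true_idx: indices of the truly
    associated variables; lam1_nonzeros: number of nonzero weights allowed
    by the selected lambda_1 (caps the TAVI denominator)."""
    weights = np.asarray(weights)
    true_idx = np.asarray(true_idx)
    selected = np.flatnonzero(weights != 0)
    tp = np.intersect1d(selected, true_idx).size
    n_true = true_idx.size
    n_null = weights.size - n_true
    fp = selected.size - tp
    tpr = tp / n_true                        # identified / truly associated
    tnr = (n_null - fp) / n_null             # excluded nulls / truly nonassociated
    denom = n_true if lam1_nonzeros is None else min(n_true, lam1_nonzeros)
    tavi = tp / denom                        # TAVI with the lambda_1 cap
    return tpr, tnr, tavi
```

For example, with 3 truly associated variables of which 2 are recovered, one false positive among 7 nulls, and λ1 allowing only 2 nonzeros, TPR = 2/3, TNR = 6/7, and TAVI = 1.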

SIMULATION STUDIES

We assessed multi‐sRDA's ability to select the relevant variables from the explanatory data sets and the response data sets. That is, we quantified the method's ability to assign nonzero regression weights to the variables from the explanatory data sets that explain the most variation in the response data set and to indicate which response variables have the highest explained variance. To do this, we designed four simulation studies. The data were created in a similar fashion for all simulation studies; therefore, we first describe the general approach of data generation and then the particular parameters of the different studies.

Data generation

In general, we created K explanatory data sets, X_1, ..., X_K, and one response data set, Y. For all explanatory data sets X_k, we generated irrelevant variables that were not associated with their latent variables. The irrelevant variables were sampled from the standard normal distribution with mean 0 and a standard deviation of 1. We generated relevant variables that were associated with their latent variables, and the strength of the association was indicated by the regression weights α_k. The strengths of the associations between the data sets were modeled by the θ regression weights. More precisely:

(i) Generate the first latent variable: ξ_1 distributed N(0, 1).
(ii) Set k = 1.
(iii) Generate the k‐th data set (where k is the running index from 1 to K + 1):
(iv) for every relevant variable, draw x distributed N(α ξ_k, 1);
(v) for every irrelevant variable, draw x distributed N(0, 1);
(vi) generate the next latent variable, ξ_{k+1} distributed N(θ_k ξ_k, 1).

Steps (iv)–(vi) are repeated K + 1 times to obtain K + 1 data sets, and the last data set obtained was treated as the response data set (i.e., X_1, ..., X_K were explanatory data sets and Y was the response data set).
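The chained generation scheme above can be sketched compactly; this is an illustrative implementation with our own function and argument names, using the same noise scale (standard deviation 1) throughout.

```python
import numpy as np

def simulate_chain(n, n_irrel, alphas, thetas, seed=0):
    """Chained data generation sketch: data set k has len(alphas[k]) relevant
    variables tied to its latent variable plus n_irrel[k] pure-noise
    variables; consecutive latents are linked by thetas[k]. The last set
    generated plays the role of the response set Y."""
    rng = np.random.default_rng(seed)
    xi = rng.normal(size=n)                   # first latent variable, N(0, 1)
    sets = []
    for k in range(len(n_irrel)):
        # relevant variables: x = alpha * xi_k + standard normal noise
        rel = np.column_stack([a * xi + rng.normal(size=n) for a in alphas[k]])
        irr = rng.normal(size=(n, n_irrel[k]))  # irrelevant: pure noise
        sets.append(np.hstack([rel, irr]))
        if k < len(thetas):                   # next latent: theta_k * xi_k + noise
            xi = thetas[k] * xi + rng.normal(size=n)
    return sets[:-1], sets[-1]                # explanatory sets, response set
```

Relevant variables within a set are correlated through their shared latent variable, while irrelevant variables are not.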

Data simulation

First, we designed three simulation studies with different sample sizes: small (n = 100), medium (n = 250), and large (n = 500). Otherwise, all three simulation studies shared the same parameters: the number of data sets was set to 4 (three explanatory data sets X1, X2, and X3 and one response data set Y), the numbers of irrelevant variables per data set were (1,000, 500, 200, 10), the numbers of relevant variables were (10, 8, 8, 2), the strengths of the association between the latent variables were set to θ = (0.8, 0.7, 0.4), and the regression weights for the relevant variables in the four different data sets were α = ([0.5, 0.5, 0.5, 0.3, 0.3, 0.2, 0.2, 0.2, 0.2, 0.2], [0.5, 0.5, 0.5, 0.4, 0.4, 0.3, 0.2, 0.2], [0.5, 0.5, 0.5, 0.3, 0.3, 0.2, 0.2, 0.2], [0.5, 0.1]). In order to assess multi‐sRDA's ability to select the associated variables from each data set, we calculated the TPR, the TNR, and the TAVI rate over 1,000 simulations. The TPR measures are reported for all three explanatory data sets (i.e., X1, X2, and X3) in Table 1. TPR values ranged from 0.72 to 0.91 and showed a tendency to increase with increasing sample size. The measures for TAVI are reported in Table 1 for all three data sets; the values for TAVI ranged from 0.87 to 1.
Table 1

Results of simulation study

           n = 100   n = 250   n = 500
TPR X1     0.72      0.76      0.75
TPR X2     0.76      0.88      0.90
TPR X3     0.79      0.81      0.79
TAVI X1    0.89      0.99      0.99
TAVI X2    0.87      0.99      0.99
TAVI X3    0.92      0.99      1.00
TNR X1     0.99      0.99      0.99
TNR X2     0.99      0.99      0.99
TNR X3     0.99      0.99      0.99

Notes: True positive rates (TPR), the rate of truly associated variable inclusion (TAVI), and true negative rates (TNR) computed from 1,000 replicates.

The TNR measures are reported for the three explanatory data sets (i.e., X1, X2, and X3) in Table 1. The TNR values ranged from 0.985 to 0.998 and were not much affected by the varying sample size. We observed and reported earlier that ENet was slightly superior to UST in the precision of relevant explanatory variable selection in such a high‐dimensional setting, but ENet imposed huge computational costs compared to UST (Csala et al., 2017). Therefore, given the size of our data, we used UST penalization for all simulation studies. In addition, we designed a simulation study that resembles the size of the omics data used for the real data analysis in Section 4. This study had the following parameters: a sample size of 37, 3 data sets (two explanatory and one response), the numbers of irrelevant variables per data set were (360,000; 18,000; 47), the numbers of relevant variables were (100; 80; 8), the strengths of the association between the latent variables were set to θ = (0.8; 0.4), and the regression weights for the relevant variables in the three data sets varied between 0.2 and 0.5. The TPR, the TNR, and the TAVI rate were calculated over 100 simulations. TPR was 0.66 for the first explanatory data set and 0.79 for the second; TAVI was 0.68 for the first and 0.81 for the second; and TNR was 0.99 for both explanatory data sets.

HIGH‐DIMENSIONAL OMICS DATA ANALYSIS

We applied multi‐sRDA to high‐dimensional omics data in order to explain the variability in proteome variables with biomarkers from the epigenome and transcriptome. Our data sets included 364,134 methylation markers measured in blood leukocytes with the Illumina Infinium HumanMethylation450 BeadChip, 18,424 gene expression markers obtained from skin biopsies with Affymetrix Human Exon 1.0 ST Arrays, and 47 cytokine markers measured in blood plasma (Radonic et al., 2012). The data included in the analysis were measured on 37 patients with Marfan syndrome who participated in the Dutch COMPARE trial (Groenink et al., 2013). The conceptual model we built regarded the methylation data as explanatory for both the gene expression measurements and the cytokine markers, and the gene expression measurements as explanatory for the cytokine markers. Given the size of the data, we used UST penalization for our model, with which one multi‐sRDA run took 364 s. In order to find the best λ1 parameters for UST, we used 10‐fold CV, which resulted in selecting 150 nonzero explanatory variables from the methylation markers, with a sum of absolute correlations criterion of 6,842.32, and 15 nonzero explanatory variables from the gene expression markers, with a sum of absolute correlations criterion of 8.61 (see Figure 2). We show the sum of absolute correlations instead of the sum of squared correlations for ease of interpretation.
Figure 2

Representation of the 10‐fold CV results for finding the best penalization parameters for the number of nonzero weights (i.e., the λ1 penalty). Notes: On the x‐axis, the number of nonzeros selected is given, and on the y‐axis the values for the sum of absolute correlations are represented. On the left in red, these values are associated with the gene expression markers, and on the right in black with the methylation markers. A total of 150 methylation markers were selected for the final model that maximize the sum of squared correlations with the cytokine markers (weights are given in Table 3), resulting in a sum of absolute correlations criterion of 6,842.32; these CV results are represented with black dots, with the scale on the right side of the figure. Similarly, 15 gene expression markers were selected (with weights given in Table 4) that maximize the sum of squared correlations with the cytokine markers, with a sum of absolute correlations criterion of 8.61; these CV results are represented with red triangles, with the scale on the left side of the figure.

Table 4

The 40 highest weights from the resulting model for the gene expression and cytokine markers

Gene Expression Values              Cytokine Markers
Gene Name     β Weight              Cytokine Marker   β Weight
AASDH         0.761                 Eotaxin 43        0.508
AQP8          0.699                 FGF basic 44      0.462
AZIN1         0.703                 G CSF 57          0.314
BIRC2         0.675                 GM CSF 34         0.585
C11orf60      0.673                 GRO a 61          0.266
C12orf5       0.717                 HGF 62            0.184
C16orf61      0.704                 IFN a2 20         0.249
C20orf15      0.702                 IFN g 21          0.097
C9orf41       0.678                 IL 10 56          0.211
CEACAM4       0.686                 Il 12 p40 28      0.277
CGB5          0.710                 IL 12 p70 75      0.798
DKFZp779      0.694                 IL 13 51          0.333
DMRTC1        0.678                 IL 15 73          0.367
DPM1          0.684                 Il 16 27          0.312
DUSP19        0.672                 IL 17 76          0.793
FAM98A        0.674                 Il 18 42          0.495
FASTKD1       0.692                 Il 1a 63          0.273
GTPBP8        0.679                 IL 1b 39          0.907
HIGD1A        0.678                 IL 1ra 25         0.589
HIST1H3A      0.678                 IL 2 38           0.345
HSPH1         0.702                 Il 2Ra 13         0.208
IPO7          0.679                 Il 3 64           0.407
KIAA1841      0.709                 IL 4 52           0.146
KPNA2         0.734                 IL 5 33           0.196
KPNA2_A       0.714                 IL 6 19           0.145
MEF2B         0.680                 IL 7 74           0.094
MEMO1         0.707                 IL 9 77           0.167
MFSD1         0.674                 IP 10 48          0.286
MINPP1        0.737                 LIF 29            0.509
MRPL42        0.672                 M CSF 67          0.310
NDUFAF4       0.734                 MCP 3 26          0.174
NOC3L         0.681                 MIF 35            0.319
PTGR1         0.672                 MIG 14            0.657
RPP40         0.688                 MIP 1a 55         0.364
SGMS2         0.698                 MIP 1b 18         0.366
SIGLEC11      0.693                 PDGF bb 47        0.618
SNX16         0.745                 SCF 65            0.559
TAC4          0.693                 TNF b 30          0.244
TCEB1         0.691                 TRAIL 66          0.418
TRIM16L       0.691                 VEGF 45           0.316
Table 3

Methylation sites that were selected for the final model with their corresponding weights

Methylation Site    α Weight
cg21535222          0.0078
cg20314620          0.0077
cg27559870          0.0078
cg00571339          0.0079
cg00719323          0.0081
cg15133963          0.0084
cg17459215          0.0079
cg09451427          0.0077
cg10550308          0.0077
cg15999104          0.0077
cg26309929          0.0082
cg27507919          0.0084
cg01101647          0.0078
cg03415545          0.0078
cg07784166          0.0078
cg23008238          0.0079
cg19176453          0.0082
cg01287342          0.0077
cg12217600          0.0082
cg18862888          0.0077
cg07209141          0.0077
cg16993220          0.0078
cg25188298          0.0079
cg01230784          0.0077
cg00716025          0.0077
cg18514065          0.0082
cg13191049          0.0078
cg12391328          0.0078
cg21487894          0.0079
cg26217813          0.0080
cg03751734          0.0081
cg20110347          0.0079
cg02021210          0.0078
cg02394421          0.0080
cg03541057          0.0078
cg04993276          0.0078
cg05309877          0.0081
cg05650632          0.0079
cg07091551          0.0079
cg15568225          0.0078
cg16514995          0.0078
cg26352401          0.0081
cg09994323          0.0080
cg14932133          0.0078
cg17673205          0.0077
cg19032144          0.0083
cg20362287          0.0077
cg20419724          0.0078
cg25058482          0.0081
cg25325588          0.0077
ch.2.24120          0.0077
ch.2.15408          0.0078
ch.2.33765          0.0077
cg05997021          0.0078
cg06606548          0.0078
cg11271830          0.0077
cg12904680          0.0078
cg15146004          0.0078
cg23817981          0.0077
cg25562664          0.0077
ch.3.57315          0.0078
cg00045439          0.0086
cg00669856          0.0081
cg02945007          0.0080
cg05493561          0.0078
cg11818376          0.0077
cg16781275          0.0082
cg16818931          0.0083
cg21363050          0.0085
cg22904577          0.0079
cg27096981          0.0078
ch.4.16806          0.0084
ch.4.32075          0.0077
ch.4.33836          0.0079
cg06017028          0.0079
cg16537383          0.0079
cg17231906          0.0078
cg24194998          0.0081
cg25564121          0.0080
cg26891849          0.0077
ch.5.91394          0.0078
cg00949008          0.0080
cg01802635          0.0080
cg04673565          0.0080
cg06442073          0.0079
cg10684686          0.0078
cg12882103          0.0078
cg17774634          0.0078
cg18715868          0.0079
cg19004138          0.0079
cg03413884          0.0077
cg05896120          0.0077
cg13791269          0.0078
cg15394787          0.0077
cg19861697          0.0084
cg20752420          0.0082
cg20956366          0.0078
cg23124695          0.0084
cg27552912          0.0079
ch.7.16370          0.0077
cg00940812          0.0081
cg13562276          0.0078
cg16717480          0.0079
cg17813074          0.0078
cg24704908          0.0078
cg26669044          0.0078
cg13596081          0.0084
cg21225548          0.0083
cg22807823          0.0084
cg10715637          0.0078
cg23656443          0.0082
cg24903527          0.0082
cg01991180          0.0081
cg03569616          0.0081
cg09076770          0.0082
cg11031737          0.0077
cg14706107          0.0082
cg02076642          0.0082
cg05666055          0.0078
cg14263244          0.0079
cg24883413          0.0077
cg01837275          0.0081
cg02632314          0.0081
cg04570827          0.0077
cg06549607          0.0078
cg08542640          0.0079
cg14210817          0.0080
cg22417789          0.0077
cg27163659          0.0077
cg05100432          0.0080
cg10208821          0.0077
cg13015872          0.0078
cg20279895          0.0079
cg23114881          0.0080
cg24478145          0.0081
cg20560091          0.0080
cg27625481          0.0079
cg01289020          0.0081
cg02215945          0.0081
cg05160228          0.0079
cg07691874          0.0079
cg19803976          0.0080
cg05297461          0.0080
cg07688933          0.0079
cg18244544          0.0078
cg18708075          0.0078
cg03933490          0.0078
cg13888886          0.0081
cg18166947          0.0077
ch.22.9096          0.0077
In order to assess the statistical significance of our findings, we conducted a permutation study with the real data. We performed the permutation study 100 times and observed that the sum of absolute correlations obtained with the best penalization values was significantly different from the null distribution of the sum of absolute correlations for the methylation markers, but not for the gene expression markers (p‐value of 0.085) (see Figure 3). We used bootstrapping to estimate the confidence interval of the sum of absolute correlations, obtaining a 95% CI of [3872.59, 9147.52] for the methylation markers and a 95% CI of [8.65, 16.34] for the gene expression markers.
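The permutation and bootstrap procedures above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and for simplicity the latent scores are held fixed, whereas in the real analysis the model would be re-estimated on each permuted and bootstrapped sample.

```python
import numpy as np

def sum_abs_cor(latent, Y):
    """Sum of absolute Pearson correlations between one latent score vector
    and each column of the response matrix Y."""
    lc = (latent - latent.mean()) / latent.std()
    Yc = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    return np.abs(Yc.T @ lc / len(lc)).sum()

def permutation_pvalue(latent, Y, n_perm=100, seed=0):
    """Null distribution: permute the rows of Y to break its association
    with the latent variable; p-value is the fraction of permuted criteria
    at least as large as the observed one."""
    rng = np.random.default_rng(seed)
    observed = sum_abs_cor(latent, Y)
    null = np.array([sum_abs_cor(latent, Y[rng.permutation(len(Y))])
                     for _ in range(n_perm)])
    return observed, (null >= observed).mean()

def bootstrap_ci(latent, Y, n_boot=100, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample subjects with replacement and
    recompute the criterion on each resample."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    stats = [sum_abs_cor(latent[idx], Y[idx])
             for idx in (rng.integers(0, n, n) for _ in range(n_boot))]
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])
```

With a strong true association, the observed criterion falls far outside the permutation null, giving a small p-value, as in the real-data methylation result.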
Figure 3

The null distribution of the sum of absolute correlations of the methylation markers (in gray) plotted with the sum of absolute correlations (6842.32) observed with the optimal penalization parameters. Notes: The red dashed lines represent the 95% confidence interval of the distribution obtained through bootstrapping and the red dots represent the sum of absolute correlation values obtained for the 100 bootstrap samples.

The names and weights of the 150 methylation markers and the 15 genes that were included in our final model can be found in Tables 3 and 2 in Section 5. Our method also provided the genes and cytokines with the highest correlation coefficients in the response data sets: the gene expression values whose variability was most affected by the combination of the 150 selected methylation markers, and the cytokine markers whose variability was most affected by the combination of the 15 gene expression variables and the 150 selected methylation markers. These variables can be found in Table 4 with their corresponding weights, where the weights are the individual correlation coefficients with the combination of their explanatory variables.
Table 2

Gene expression variables that were selected for the final model with their corresponding weights

Gene Name     α Weight
AASDH         0.0746
AZIN1         0.0720
C12orf5       0.0742
C16orf61      0.0740
C20orf15      0.0781
HSPH1         0.0734
KIAA1841      0.0736
KPNA2         0.0782
KPNA2_A       0.0745
MINPP1        0.0743
NDUFAF4       0.0772
PTGR1         0.0763
SGMS2         0.0727
SNX16         0.0796
TAC4          0.0751
The final model is represented in Figure 4. The methylation markers are positioned in the upper right corner of the figure, with their corresponding latent variable ξ1; only the 15 markers with the highest α weights are represented out of the 150. The gene expression markers are plotted at the bottom with their latent variable ξ2, and the 40 markers with the highest weights are plotted. The cytokine markers are represented in the upper left of the plot with their latent variable η, and the 15 cytokine markers with the highest weights are represented. The observed correlation was 0.807 between ξ1 and ξ2, 0.763 between ξ1 and η, and 0.611 between ξ2 and η.
Figure 4

Plot of the model obtained by applying multi‐sRDA to the Marfan data set. Notes: Upper right are the weights of the methylation markers, from which only 15 are represented on the plot, with ξ1 their corresponding latent variable. At the bottom are the weights of the gene expression markers, from which the 40 highest correlated are plotted; ξ2 represents the gene expression markers' latent variable. Upper left are the weights of the cytokine markers, from which the 15 highest correlated are plotted; η represents the latent variable of the cytokine markers. The correlation between ξ1 and ξ2 is 0.807, between ξ1 and η is 0.763, and between ξ2 and η is 0.611.

We used an online overrepresentation analysis tool (available at https://reactome.org) to test whether the variables included in the final model can be associated with any known biological pathways. Several pathways were identified, including the “Cytokine Signaling in Immune system” pathway, with the genes TCEB1, PPP2CB, HIST1H3A, KPNA2, and BIRC2 involved, the “Cellular responses to stress” pathway, with the genes HIGD1A, TCEB1, AQP8, HSPH1, and HIST1H3A involved, and the “Regulation of HSF1‐mediated heat shock response” pathway, with the gene HSPH1 involved.

DISCUSSION

In the present paper, we have shown how sRDA can be extended into a multiset multivariate method (multi‐sRDA) so that the effect of multiple explanatory variables, from multiple data sets, on multiple outcome variables, measured on one data set, can be assessed simultaneously, while modeling the sequential information transfer between the involved data sets. We showed through simulation studies and through genomewide high‐dimensional omics data analysis that our proposed multi‐sRDA model is able to deal with data sets containing hundreds of thousands of variables and is able to indicate those explanatory genotypic variables that explain the most variation in the response phenotypic variables, with high precision. To quantify the precision of multi‐sRDA, we ran simulation studies and used two true‐positive measures and a negative rate measure. Our simulation studies indicate that multi‐sRDA is able to find important explanatory variables with high precision (TPR and TAVI range from 0.66 to 1, depending on sample size and the number of variables) and that the unimportant variables are almost always excluded from the final model (TNR ranges from 0.985 to 0.998). As we described, TAVI is less restrained by the size of the search spaces of the penalization parameters, and is less affected in the case when only a subset of the real associated variables are included in the model, thus when the number of nonzero weights is smaller than the number of the real associated variables, which we observed several times during the simulation studies. Measuring TAVI showed that in fact multi‐sRDA found a subset of the important explanatory variables almost every time for all the involved data sets in the medium and large sample size setting (range from 0.99 to 1). 
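The true-positive and true-negative rates reported above can be computed as in the following sketch. This is a generic illustration of such selection-accuracy measures, not the paper's simulation code, and the exact definition of TAVI is not reproduced here.

```python
def selection_rates(selected, truly_associated, p_total):
    """True-positive and true-negative rates for a sparse variable selection.

    selected, truly_associated: collections of variable indices;
    p_total: total number of candidate variables in the data set."""
    sel, truth = set(selected), set(truly_associated)
    tp = len(sel & truth)                      # truly associated and selected
    tn = p_total - len(sel | truth)            # unassociated and not selected
    tpr = tp / len(truth) if truth else float("nan")
    tnr = tn / (p_total - len(truth)) if p_total > len(truth) else float("nan")
    return tpr, tnr
```

For example, selecting variables {0, 1, 2} when {0, 1, 3} are truly associated out of 10 candidates yields a TPR of 2/3 and a TNR of 6/7.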
Note that the simulation studies are limited by the fact that the generated model is quite simple: the latent variable is not a combination of the explanatory variables; instead, the explanatory variables are directly sampled from a one‐dimensional distribution, which might not represent well the structure of the omics data used for the real data analysis. Also, there is a somewhat unexpected behavior in the optimization curve of the λ1 parameter for the methylation markers during the real data analysis (see Figure 2). Our best explanation is that around the value 140 for the λ1 penalty, the model starts to generate a linear combination of weights that correlates much more strongly with the response variables, which leads to the substantial increase in the sum of absolute correlations. Nevertheless, the overrepresentation analysis of the resulting multi‐sRDA model provided reasonable pathways: the “Cytokine Signaling in Immune system” pathway is plausible given that one of the data sets comprised cytokine markers, and the “Cellular responses to stress” pathway also makes sense from a biomedical perspective, since Marfan syndrome is characterized by hampered strength and elasticity of the connective tissue. We also applied regularized canonical correlation analysis (rCCA) (Waaijenborg & Zwinderman, 2009) to the same omics data sets that we used in our real data analysis. Canonical correlation analysis is a multivariate statistical method similar to RDA, but instead of maximizing the correlation between the explanatory latent variable and the response variables, it searches for latent variable pairs that have the maximum covariance with each other. Well‐known drawbacks are that CCA's results are not readily interpretable (i.e., the weights cannot be interpreted as correlation coefficients as in RDA) and that CCA cannot model the presumed sequential information transfer between data sets.
Nevertheless, for comparison, we applied rCCA on the three omics data sets described in Section 4. By treating all data sets as responses, thus running the analysis only with multivariate regression steps to calculate the weight vectors, multi‐sRDA's framework was used for rCCA. We observed a value of 6118.28 for the sum of absolute correlations between the outcome cytokine variables and the latent variable of the methylation markers, and a value of 7.47 for the sum of absolute correlations between the outcome cytokine variables and the latent variable of the gene expression markers. These values are substantially lower than those found with our sRDA method, which makes sense given CCA's different objective function. Thus, if directionality between data sets can be assumed, it is better to use our new method. Recently, the minimal Bayes Information Criterion (BIC) was proposed for selecting the optimal λ1 for sparse canonical correlation analysis (Wilms & Croux, 2015), which is an attractive alternative to CV for mitigating the computational burden, although CV outperformed BIC in the average number of iterations needed for algorithmic convergence in what the authors call the “ultra high‐dimensional” scenario. Although multi‐sRDA is based on the PLS framework, it should not be confused with penalized multiblock PLS models (Kawaguchi & Yamashita, 2017). Penalized multiblock PLS, like CCA, does not model the sequential information flow between data sets. When a well‐defined information transfer is assumed between data sets, multi‐sRDA has the advantage of accounting for this information flow and extracting the biologically relevant variables from the data sets, while penalized multiblock PLS extracts the variables that have the highest correlation with each other, disregarding the explanatory‐response relationships between data sets (Karaman et al., 2015).
The proposed multi‐sRDA model is easily extendable to multidimensional latent variable extraction. After the first latent variables are obtained for the data sets, the explained effects are subtracted from the original data sets in order to obtain residual data sets, on which the analysis can be repeated to obtain the following latent variables. Since the multidimensional latent variables are orthogonal to each other, each of them explains a different portion of the variance of its response data set. By definition, the first latent variable has the highest absolute sum of squared correlations with its response data set, and all following latent variables explain a smaller or equal portion of the variance of the response data set. The number of latent variables to be extracted from a data set can be determined using the permutation test proposed in Section 2; one might stop extracting further latent variables when the permutation test indicates nonsignificant results. However, obtaining multiple latent variables introduces further computational burden, which, as mentioned before, practically restricted the present model to the UTS penalization scheme. When the data sets consist of hundreds of thousands of variables, estimating the optimal penalization parameters is costly due to the CV procedure, and extracting l latent variables would increase the computational cost l‐fold. Currently, we are investigating how to implement multidimensional latent variable extraction so that it is also applicable in real omics data settings. In this paper, we extended our sRDA approach to multiple explanatory sets and demonstrated multi‐sRDA's applicability to high‐dimensional omics data by conducting simulation studies and analyzing real biomedical data from three different omics domains.
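The deflation step described above, subtracting the explained effect of a latent variable before extracting the next one, can be sketched as follows. This is a schematic under the assumption that the latent scores are available as a vector xi; it is not the package's actual code.

```python
import numpy as np

def deflate(X, xi):
    """Remove from each column of X the effect explained by the latent
    score vector xi, leaving residuals orthogonal to xi, so that the next
    latent variable extracted from the residuals captures a different
    portion of the variance."""
    xi = xi.reshape(-1, 1)
    loadings = X.T @ xi / (xi.T @ xi)   # OLS regression of each column on xi
    return X - xi @ loadings.T          # residual data set
```

Because the residuals are orthogonal to xi, deflating a second time with the same scores leaves the matrix unchanged, which is why each extracted latent variable explains a distinct portion of variance.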
We conclude that multi‐sRDA can be used to analyze multiple high‐dimensional omics sets in order to explore the combination of those foremost biomolecular markers from various biological levels that explain the most variance of the phenotypic variables, while modeling the conceptual model of the central dogma of molecular biology.

CONFLICTS OF INTEREST

The authors have declared no conflict of interest.
REFERENCES

1. Fornell C, Barclay DW, Rhee BD. A model and simple iterative algorithm for redundancy analysis. Multivariate Behav Res, 1988.
2. Groenink M, den Hartog AW, Franken R, et al. Losartan reduces aortic dilatation rate in adults with Marfan syndrome: a randomized controlled trial. Eur Heart J, 2013.
3. Wilms I, Croux C. Sparse canonical correlation analysis from a predictive point of view. Biom J, 2015.
4. Crick F. Central dogma of molecular biology. Nature, 1970.
5. Kawaguchi A, Yamashita F. Supervised multiblock sparse multivariable analysis with application to multimodal brain imaging genetics. Biostatistics, 2017.
6. Csala A, Voorbraak FPJM, Zwinderman AH, Hof MH. Sparse redundancy analysis of high-dimensional genetic and genomic data. Bioinformatics, 2017.
7. Radonic T, de Witte P, Groenink M, et al. Inflammation aggravates disease severity in Marfan syndrome patients. PLoS One, 2012.
8. Buescher JM, Driggers EM. Integration of omics: more than the sum of its parts. Cancer Metab, 2016.
9. Waaijenborg S, Zwinderman AH. Sparse canonical correlation analysis for identifying, connecting and completing gene-expression networks. BMC Bioinformatics, 2009.
10. Huang S, Chaudhary K, Garmire LX. More is better: recent progress in multi-omics data integration methods. Front Genet, 2017.