Envelope-based partial partial least squares with application to cytokine-based biomarker analysis for COVID-19.

Yeonhee Park¹, Zhihua Su², Dongjun Chung³.

Abstract

Partial least squares (PLS) regression is a popular alternative to ordinary least squares regression because of its superior prediction performance demonstrated in many cases. In various contemporary applications, the predictors include both continuous and categorical variables. A common practice in PLS regression is to treat the categorical variable as continuous. However, studies find that this practice may lead to biased estimates and invalid inferences (Schuberth et al., 2018). Based on a connection between the envelope model and PLS, we develop an envelope-based partial PLS estimator that considers the PLS regression on the conditional distributions of the response(s) and continuous predictors on the categorical predictors. Root-n consistency and asymptotic normality are established for this estimator. Numerical study shows that this approach can achieve more efficiency gains in estimation and produce better predictions. The method is applied for the identification of cytokine-based biomarkers for COVID-19 patients, which reveals the association between the cytokine-based biomarkers and patients' clinical information including disease status at admission and demographical characteristics. The efficient estimation leads to a clear scientific interpretation of the results.

Entities: Chemical

Keywords: Grassmann manifold; dimension reduction; envelope model; multivariate regression; partial least squares

Substances:
Biomarkers
Cytokines

Year: 2022 PMID: 36111618 PMCID: PMC9350235 DOI: 10.1002/sim.9526

Source DB: PubMed Journal: Stat Med ISSN: 0277-6715 Impact factor: 2.497

INTRODUCTION

COVID‐19 is a worldwide pandemic. As of April 2021, it had infected more than 147 million people and caused more than 3.1 million deaths worldwide. Despite tremendous efforts to improve the diagnosis and treatment of COVID‐19, we still have a limited understanding of the associations between the key immunologic factors and the clinical information of COVID‐19 patients. These associations can aid in the treatment and management of the disease. Many studies of COVID‐19 patients, such as the COVID‐IP project, have collected data on various biomarkers. It would be of great scientific and medical interest to develop a new statistical tool that facilitates the identification of such associations from COVID‐19 datasets. The multivariate linear regression model is a common tool for investigating the association between key immunologic factors (such as cytokines) and COVID‐19 patients' clinical information. Compared to traditional ordinary least squares (OLS) fitting, partial least squares (PLS) is a popular alternative known for its superior prediction performance. It originated in econometrics and is now widely used in many applied disciplines including chemometrics, social science, food science, and genetics. In many of these applications, the predictors include both categorical and continuous variables. For example, in the COVID‐19 dataset, where we study the impact of clinical variables on cytokine levels, the categorical predictors include the patient's sex, ethnicity, and indicators for underlying diseases such as asthma and diabetes, and the continuous predictors include measures of the patient's clinical status such as temperature, respiratory rate, and oxygen saturation. When the data have both continuous and categorical predictors, a common practice is to treat the categorical predictors as continuous. 
However, practitioners discovered that this can “lead to biased estimates and therefore to invalid inferences and erroneous conclusions”; see also Lohmoller and Hair et al. In this article, we resolve the issue via the link between PLS and the envelope model. The envelope model was first proposed by Cook et al; it achieves estimation efficiency in multivariate linear regression using dimension reduction techniques. Cook et al discovered a link between PLS and the envelope model: in the population they estimate the same parameter, but they use different sample estimation methods. Since its introduction, PLS has stood as an iterative moment‐based algorithm rather than a model‐based method. It is easy to use and fast to compute, but it is difficult to obtain a complete understanding of its properties or to make improvements that overcome its disadvantages. On the other hand, the estimation of the envelope model uses a model‐based objective function, which facilitates the theoretical investigation of its estimator. The link between PLS and the envelope model enables us to study PLS via the envelope model and to design new variants that adapt to different data structures. This article aims to develop an envelope‐based partial PLS (EPPLS) estimator. Instead of treating the categorical predictors as continuous, we condition both the response(s) and the continuous predictors on the categorical predictors and then perform the envelope estimation based on the conditional distributions. This provides us with a root‐n‐consistent estimator for the regression coefficients, and this estimator achieves larger efficiency gains and better prediction performance than OLS, PLS, and principal component regression (PCR) in our numerical study and the COVID‐19 dataset. We also establish consistency and the asymptotic distribution of this estimator. In addition, using the link between PLS and the envelope model, we derive a partial PLS (PPLS) algorithm, which is analogous to the PLS algorithm. 
The rest of the article is organized as follows. A review of the envelope methodology, as well as the link between PLS and the envelope model, is provided in Section 2. We propose the EPPLS, and discuss the estimation, theoretical properties, and order determination in Section 3. Based on the link between PLS and the envelope model, Section 4 derives a moment‐based iterative algorithm that yields a PPLS estimator. The numerical performance of the proposed estimators is investigated in Section 5 via simulations. The analysis of a COVID‐19 dataset is elaborated in Section 6. We conclude the article with a discussion in Section 7.

REVIEW OF THE ENVELOPE MODEL AND ITS CONNECTION TO PLS

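The envelope model was first introduced by Cook et al as an efficient method for estimating the regression coefficients in multivariate linear regression. It uses sufficient dimension reduction techniques to identify the part of the data that is immaterial to the estimation goal. The subsequent estimation is based only on the material part and is thus more efficient. The envelope model has since been adapted to many areas including PLS, generalized linear models, spatial regression models, variable selection, Bayesian analysis, and tensor regression. Code for fitting envelope models is included in the R package Renvlp, available on CRAN. A complete review of the envelope model is in Cook. Among all the envelope models, the predictor envelope model is most closely related to our discussion. Thus we review the envelope model in the context of the predictor envelope model. Consider a linear regression model

$$Y = \mu_Y + \beta^T (X - \mu_X) + \varepsilon, \tag{1}$$

where $Y \in \mathbb{R}^r$ is the univariate response ($r = 1$) or multivariate response vector ($r > 1$) with mean $\mu_Y$, $X \in \mathbb{R}^p$ is a predictor with mean $\mu_X$ and covariance matrix $\Sigma_X$, $\varepsilon$ denotes the error vector with mean $0$ and covariance matrix $\Sigma_{Y|X}$, and $\beta \in \mathbb{R}^{p \times r}$ denotes the unknown regression coefficients. The predictor envelope model assumes that part of $X$ is immaterial to the regression and does not affect the distribution of $Y$ directly or indirectly. Specifically, the predictor envelope model assumes that there is a subspace $\mathcal{S} \subseteq \mathbb{R}^p$ such that

$$\text{(a)}\ \operatorname{cov}(Y, Q_{\mathcal{S}} X \mid P_{\mathcal{S}} X) = 0, \qquad \text{(b)}\ \operatorname{cov}(P_{\mathcal{S}} X, Q_{\mathcal{S}} X) = 0, \tag{2}$$

where $P_{\mathcal{S}}$ denotes the projection matrix onto $\mathcal{S}$, $I_p$ denotes the $p \times p$ identity matrix, and $Q_{\mathcal{S}} = I_p - P_{\mathcal{S}}$. Assumption (2a) states that $Q_{\mathcal{S}} X$ provides no information about $Y$ given $P_{\mathcal{S}} X$, and (2b) implies that $P_{\mathcal{S}} X$ is uncorrelated with $Q_{\mathcal{S}} X$. Cook et al proved that the assumptions in (2) are equivalent to imposing the following structure on the model parameters:

$$\text{(c)}\ \operatorname{span}(\beta) \subseteq \mathcal{S}, \qquad \text{(d)}\ \Sigma_X = P_{\mathcal{S}} \Sigma_X P_{\mathcal{S}} + Q_{\mathcal{S}} \Sigma_X Q_{\mathcal{S}}. \tag{3}$$

The structure in (3c) asserts that the span of $\beta$ is contained in $\mathcal{S}$. When $\Sigma_X$ can be decomposed as in (3d), $\mathcal{S}$ is called a reducing subspace of $\Sigma_X$. The $\Sigma_X$‐envelope of $\beta$, denoted by $\mathcal{E}_{\Sigma_X}(\beta)$, is the smallest reducing subspace of $\Sigma_X$ that contains $\operatorname{span}(\beta)$. In other words, $\mathcal{E}_{\Sigma_X}(\beta)$ is the smallest subspace that satisfies (3), or equivalently the assumptions in (2). If it appears in subscripts, $\mathcal{E}_{\Sigma_X}(\beta)$ is abbreviated to $\mathcal{E}$. We call $P_{\mathcal{E}} X$ the material part of $X$ and $Q_{\mathcal{E}} X$ the immaterial part. Let $d$ ($0 \le d \le p$) denote the dimension of $\mathcal{E}_{\Sigma_X}(\beta)$, $\Gamma \in \mathbb{R}^{p \times d}$ an orthonormal basis of $\mathcal{E}_{\Sigma_X}(\beta)$, and $\Gamma_0 \in \mathbb{R}^{p \times (p-d)}$ an orthonormal basis of $\mathcal{E}_{\Sigma_X}(\beta)^{\perp}$, that is, the orthogonal complement of $\mathcal{E}_{\Sigma_X}(\beta)$. Then the coordinate form of the predictor envelope model is

$$Y = \mu_Y + \eta^T \Gamma^T (X - \mu_X) + \varepsilon, \qquad \Sigma_X = \Gamma \Omega \Gamma^T + \Gamma_0 \Omega_0 \Gamma_0^T, \tag{4}$$

where $\beta = \Gamma \eta$, $\eta \in \mathbb{R}^{d \times r}$ carries the coordinates of $\beta$ with respect to $\Gamma$, and $\Omega$ and $\Omega_0$ carry the coordinates of $\Sigma_X$ with respect to $\Gamma$ and $\Gamma_0$. From (4), $\Sigma_X$ is partitioned into the variation of the material part, $\Gamma \Omega \Gamma^T$, and the variation of the immaterial part, $\Gamma_0 \Omega_0 \Gamma_0^T$.

Estimation of the predictor envelope model uses the normal likelihood as an objective function and performs a manifold optimization to obtain the estimator of $\mathcal{E}_{\Sigma_X}(\beta)$. Let $S_X$ be the sample variance matrix of $X$, and $S_{X|Y}$ the sample conditional covariance matrix of $X$ given $Y$; then

$$\hat{\mathcal{E}}_{\Sigma_X}(\beta) = \operatorname*{arg\,min}_{\operatorname{span}(G) \in \mathcal{G}(p,d)} \; \log \lvert G^T S_{X|Y} G \rvert + \log \lvert G^T S_X^{-1} G \rvert, \tag{5}$$

where $\mathcal{G}(p,d)$ denotes the Grassmann manifold, which is the set of all $d$‐dimensional subspaces of a $p$‐dimensional space. Once we have $\hat{\mathcal{E}}_{\Sigma_X}(\beta)$, the estimator $\hat\Gamma$ of $\Gamma$ can be taken as any orthonormal basis of $\hat{\mathcal{E}}_{\Sigma_X}(\beta)$. Then the predictor envelope estimator of $\beta$ is $\hat\beta = \hat\Gamma (\hat\Gamma^T S_X \hat\Gamma)^{-1} \hat\Gamma^T S_{XY}$, where $S_{XY}$ is the sample covariance matrix of $X$ and $Y$. Cook et al showed that the predictor envelope estimator is asymptotically at least as efficient as the OLS estimator.

The predictor envelope model has a close connection with PLS. PLS aims to find a reduction of $X$, that is, $W^T X$, where $W \in \mathbb{R}^{p \times d}$, and then estimates $\beta$ based on the regression of $Y$ on $W^T X$. PLS uses a sequential moment‐based method to estimate $W$ columnwise, and different variants of PLS use slightly different algorithms. We take SIMPLS as an example, which is a popular variant implemented in various software packages. Suppose that $w_k$ is the vector obtained in the $k$th step ($k = 1, \ldots, d$). Let $\Sigma_{XY}$ denote the covariance matrix between $X$ and $Y$. Set $w_0 = 0$. At the $k$th step, let $W_k = (w_0, \ldots, w_k)$; then $w_{k+1}$ is obtained by

$$w_{k+1} = \operatorname*{arg\,max}_{w} \; w^T \Sigma_{XY} \Sigma_{XY}^T w \quad \text{subject to} \quad w^T \Sigma_X W_k = 0 \ \text{and} \ w^T w = 1. \tag{6}$$

Let $W = (w_1, \ldots, w_d)$. Changing the length constraint in (6) yields another popular PLS variant, NIPALS. Once an estimator of $W$ is obtained, denoted by $\hat W$, the PLS estimator of $\beta$ is $\hat W (\hat W^T S_X \hat W)^{-1} \hat W^T S_{XY}$. Cook et al showed a close connection between SIMPLS and the predictor envelope model: $\operatorname{span}(W) = \mathcal{E}_{\Sigma_X}(\beta)$. This indicates that at the population level SIMPLS seeks the same reduction as the predictor envelope model. At the sample level, SIMPLS and the predictor envelope model use different algorithms to estimate $\mathcal{E}_{\Sigma_X}(\beta)$: SIMPLS uses the moment‐based algorithm (6), while the predictor envelope model uses the likelihood‐based method (5). Based on this connection, we are able to study the properties of the SIMPLS estimator or develop its extensions through the predictor envelope model, and we call the predictor envelope model (4) envelope‐based partial least squares (EPLS) hereafter.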
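
As a rough illustration of the sequential, moment‐based construction described above, the following Python sketch computes SIMPLS‐type weight vectors from a sample covariance matrix of the predictors and a sample covariance matrix between predictors and responses, and then forms the corresponding PLS coefficient estimate. The function names (`simpls_weights`, `pls_coefficients`) are hypothetical, and the projection‐plus‐eigenvector step is one convenient way to solve the constrained maximization, not necessarily the implementation used in any particular software package.

```python
import numpy as np

def simpls_weights(S_X, S_XY, d):
    """Sequentially build d weight vectors. Each new vector w maximizes
    w' S_XY S_XY' w subject to w'w = 1 and w' S_X W_k = 0, where W_k
    collects the previously obtained vectors."""
    p = S_X.shape[0]
    M = S_XY @ S_XY.T                        # p x p, symmetric PSD
    W = np.zeros((p, 0))
    for _ in range(d):
        if W.shape[1] == 0:
            Q = np.eye(p)
        else:
            C = S_X @ W                      # directions w must be orthogonal to
            Q = np.eye(p) - C @ np.linalg.pinv(C)
        A = Q @ M @ Q
        A = (A + A.T) / 2                    # guard against rounding asymmetry
        _, vecs = np.linalg.eigh(A)
        w = vecs[:, -1]                      # leading eigenvector
        W = np.column_stack([W, w / np.linalg.norm(w)])
    return W

def pls_coefficients(S_X, S_XY, W):
    """Coefficient estimate W (W' S_X W)^{-1} W' S_XY."""
    return W @ np.linalg.solve(W.T @ S_X @ W, W.T @ S_XY)
```

When the number of components equals the number of predictors and the weights are linearly independent, the coefficient estimate coincides with OLS, which gives a quick sanity check on an implementation.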

ENVELOPE‐BASED PARTIAL PARTIAL LEAST SQUARES

Formulation

Suppose that is partitioned into and , where denotes a vector of continuous predictors and denotes a vector of categorical predictors (). Let and be the mean of and , respectively. Then the linear regression model in (1) can be written as where denotes the regression coefficients for the continuous predictors and denotes the regression coefficients for the categorical predictors. We further assume a working model between and , where is a matrix, and has mean and is independent of and . Let and . Then we have We impose similar assumptions as EPLS, but on the conditional distribution given . More specifically, it assumes that there is a subspace of such that Condition (i) indicates that given and , provides no information about , and condition (ii) implies that after removing the effects of , is uncorrelated with . Condition (i) implies that and condition (ii) implies is a reducing subspace of . The smallest subspace that satisfies both (i) and (ii) in (8) is the ‐envelope of , denoted by , or for short. Thus we have (iii) and (iv) . Note that just as the EPLS model, conditions (iii) and (iv) are equivalent to conditions (i) and (ii). Let denote the dimension of with , an orthonormal basis of and an orthonormal basis of . When (iii) and (iv) are satisfied, the coordinate form of the linear regression model (7) is where and carries the coordinates of with respect to . The matrices and are positive definite and contain the coordinates of with respect to and . Since the envelope structure is imposed on part of the predictors, we call (9) the envelope‐based partial PLS (EPPLS) model. When , the EPPLS model reduces to the standard linear regression model (7). The EPPLS model (9) has a close connection with the EPLS model (4). Let denote the population residuals from the linear regression of on , that is, . Then the linear model (7) can be reparameterized as where is a linear combination of and . Let denote the population residuals from the regression of on , then . 
Based on the reparametrization, we have which presents a multivariate linear regression model of on . Now we impose the EPLS structure (4) on (10). Let be the covariance matrix of . Then the ‐envelope of , denoted by , is the smallest reducing subspace of that contains . Since , is the same as in the EPPLS model (9). This relationship is analogous to the connection between the partial envelope model and the response envelope model for the residuals described in Su and Cook.

Estimation

We use the normal likelihood as an objective function for estimation. Let , be independent observations from the EPPLS model. Let , and be the data matrices, and , , and the sample means of , , and . Then , , and are the centered data matrices for , , and , respectively. The parameters under the EPPLS model are , , , , , , , and . Note that is not identifiable, only is identifiable. We first fix an orthonormal basis and estimate other parameters by maximizing the objective function. Based on the derivations in Supplemental material, the estimators of these parameters can be written as explicit functions of . We substitute them back to the objective function, which now only has one parameter , that is, . Let , , and denote the sample conditional variance of given , the sample conditional variance of given and the sample conditional covariance between and given , respectively. The estimator of the EPPLS can be obtained by solving the following optimization problem where denotes a Grassmann manifold. Details are provided in Section A of the Supplemental materials. Note that the objective function in (11) has the same form as the objective function for the EPLS model in (5) with being the response and being the predictor, which echoes the relationship between EPPLS model and the EPLS model discussed at the end of Section 3.1. The optimization in (11) can be solved using the computing algorithm in Cook et al, or applying existing software such as the R package Renvlp. Once we have , can be taken to be any orthonormal basis of , and can be taken to be any orthonormal basis of . Let denote the sample residuals from the regression of on , and let denote the sample residuals from the regression of on . Thus we have and . The estimators of the EPPLS parameters are Then where denotes the projection matrix onto with inner product, and is the OLS estimator of . Thus the EPPLS estimator is obtained by projecting the OLS estimator onto with the inner product. 
The estimator has the same expression as its OLS estimator, except that is replaced by the EPPLS estimator . When , the matrices and in (11) are singular. Since the objective function in (11) depends on the inverse of , and the inverse of is required in the algorithm to solve (11), we use a high‐dimensional precision matrix estimator to replace and . While many precision matrix estimators are applicable, for example, Sun and Zhang, Zhang and Zou, Khare et al, we adopt the sparse permutation invariant covariance estimator, [SPICE] since it guarantees to have a positive definite matrix and its consistency does not rely on any sparsity assumption. We use the R package PDSCE to compute the SPICE estimators of and , and denote the resulting estimators as and . Then and replace and in (11), as well as in the estimators in (12) and (13).
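
The conditioning step and the projection form of the estimator can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the objective is written in the standard predictor‐envelope form applied to residuals, which the text says (11) shares with (5); the basis `G` is supplied by the user instead of being found by Grassmann optimization; and all function names are hypothetical.

```python
import numpy as np

def residualize(A, X2):
    """Residuals from the least squares regression of A on X2 (with intercept)."""
    Z = np.column_stack([np.ones(len(X2)), X2])
    coef, *_ = np.linalg.lstsq(Z, A, rcond=None)
    return A - Z @ coef

def conditional_moments(X1, X2, Y):
    """Sample conditional (co)variances given the categorical predictors X2.
    X1 is n x p1, X2 is n x p2, Y is n x r (two-dimensional arrays)."""
    n = len(Y)
    R1, RY = residualize(X1, X2), residualize(Y, X2)
    S_1 = R1.T @ R1 / n                      # var of X1 given X2
    S_Y = RY.T @ RY / n                      # var of Y given X2
    S_1Y = R1.T @ RY / n                     # cov of X1 and Y given X2
    S_1_given_Y = S_1 - S_1Y @ np.linalg.solve(S_Y, S_1Y.T)
    return S_1, S_1Y, S_1_given_Y

def envelope_objective(G, S_1, S_1_given_Y):
    """Assumed objective: log|G' S_{1|Y} G| + log|G' S_1^{-1} G|."""
    _, a = np.linalg.slogdet(G.T @ S_1_given_Y @ G)
    _, b = np.linalg.slogdet(G.T @ np.linalg.inv(S_1) @ G)
    return a + b

def projected_estimator(G, S_1, S_1Y):
    """Project the OLS estimator onto span(G) in the S_1 inner product."""
    return G @ np.linalg.solve(G.T @ S_1 @ G, G.T @ S_1Y)
```

With `G` equal to the identity matrix (no dimension reduction), `projected_estimator` returns the OLS coefficients of the continuous predictors from the full regression, matching the statement that the model reduces to the standard linear regression in that case.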

Theoretical Properties

In this section, we establish consistency and asymptotic distribution of the EPPLS estimator. Let vec denote the vector operator that stacks the columns of a matrix to a vector, and let vech denote the vector half operator that stacks the lower triangle of a symmetric matrix to a vector. We use to denote the Kronecker product, to denote the Moore‐Penrose generalized inverse, and to denote convergence in distribution. The parameters in (9) include , , and the constituent parameters of the EPPLS model are , , , , , Under the EPPLS model, is a function of . Proposition 1 indicates that the EPPLS estimator is consistent and asymptotically normal even the errors are not normally distributed. Suppose that the EPPLS model ( ) holds, has finite fourth moments and is independently and identically distributed in the sample. Let denote the EPPLS estimator of , then we have where is the gradient matrix, and is the Fisher information matrix from the standard estimation (performed by OLS). In other words, is the asymptotic covariance matrix of the OLS estimator of . Furthermore, since is a positive semi‐definite matrix, the EPPLS estimator is more efficient than or as efficient as the standard estimator asymptotically. The finite fourth moment condition is required for the consistency of the estimators of and . If we further assume normality, then we can obtain the explicit expression of the asymptotic covariance matrix for the EPPLS estimators and , as shown in Proposition 2. Assume that the conditions in Proposition hold, and we further assume that is normally distributed. Then, where Note that the expression of is the same as the asymptotic covariance matrix of under the EPLS model (10) with being the response and being the predictor, (Proposition 9) except that according to Proposition 9, we should have instead of in . However, Lemma 1 below asserts that they are actually equal. 
The asymptotic variance again echoes the connection between the EPPLS model (9) and the EPLS model (10). Under the EPPLS model ( ), .

Order determination

To implement the EPPLS model, we first need to select , the dimension of . While many methods such as cross validation, likelihood ratio testing can be used for the selection of , we find that BIC has the best performance especially when the sample size is moderate to large. The BIC is constructed as , where is the maximized log likelihood and is the number of parameters of the EPPLS model with the dimension of being . We compute BIC for all possible and choose the one that minimizes BIC. The consistency of BIC is given in the following proposition. Assume that the EPPLS model ( ) holds and that is normally distributed. Let be the dimension selected by BIC. Then as tends to infinity. Proposition 3 indicates that when the sample size increases, BIC chooses the correct model with probability tending to 1. Normality is assumed here since BIC is a likelihood‐based method, thus is inaccurate when the data distribution widely differs from normal. However, numerical analysis (not shown here) indicates that BIC still performs well under a moderate departure from normality.
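
The selection rule can be sketched as below. The BIC expression did not survive in the text above, so the sketch assumes the usual form, minus twice the maximized log likelihood plus the number of parameters times log n; the function names and the toy numbers are hypothetical.

```python
import math

def bic(log_lik, n_params, n):
    """Assumed standard form: -2 * log-likelihood + n_params * log(n)."""
    return -2.0 * log_lik + n_params * math.log(n)

def select_dimension(log_liks, n_params, n):
    """log_liks and n_params map each candidate dimension to the maximized
    log likelihood and the parameter count of the model fitted at that
    dimension. Returns the minimizing dimension and all BIC scores."""
    scores = {d: bic(log_liks[d], n_params[d], n) for d in log_liks}
    best = min(scores, key=scores.get)
    return best, scores

# Toy usage with made-up log likelihoods and parameter counts.
best, scores = select_dimension(
    {0: -520.0, 1: -480.0, 2: -478.0}, {0: 10, 1: 14, 2: 18}, n=100
)
```

In the toy call, moving from dimension 1 to 2 improves the log likelihood only slightly, so the penalty term dominates and dimension 1 is selected.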

PARTIAL SIMPLS ALGORITHM

Based on the connection between SIMPLS and the EPLS model, we develop a moment‐based iterative algorithm, called the partial SIMPLS (PPLS) algorithm, for estimating the EPPLS subspace . Its result can be a standalone estimator or a starting value for the optimization in (11). Since the envelope in EPPLS (9) is the same as the predictor envelope in (10), PPLS estimates a basis of using the same algorithm (6) except by replacing by and by . Note that and in (6) are and in the context of (10), where is the covariance matrix between and . The sample estimator of and are and . Given the sample, PPLS estimates each column of the basis of sequentially. Set . At the th step, let , then is obtained by The algorithm (14) is terminated when . Based on Cook et al, estimates , and thus is an estimator for . Once we obtain , the PPLS estimator of is obtained by which has the same form as the EPPLS estimator of in (13). The estimators for other parameters including , , have the same form as the EPPLS estimators in (12) and (13) except that the estimators of the bases and are replaced by the PPLS estimator and , where is any orthonormal basis of . The dimension can be chosen by cross validation, which is a common practice for SIMPLS. It is feasible to derive the asymptotic distribution for PPLS estimator as in Propositions 1 and 2, but the form of the asymptotic variance is too complicated to be useful in practice. Hence we suggest using the bootstrap approach to estimate the variability of the PPLS estimator. Note that for the envelop‐based method EPPLS, we have the explicit form of the asymptotic variance, which is an advantage of EPPLS from an inferential statistical perspective.
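
The bootstrap suggested above for assessing the variability of the PPLS estimator can be sketched generically in Python. The sketch assumes a nonparametric bootstrap that resamples observations with replacement; the `estimator` callback stands in for a PPLS fit, and `partial_ols` is only a simple placeholder (OLS on residuals) used to make the example runnable. Both names are hypothetical.

```python
import numpy as np

def bootstrap_se(estimator, X1, X2, Y, n_boot=200, seed=0):
    """Elementwise bootstrap standard errors of estimator(X1, X2, Y).
    Rows are resampled jointly so that each resample keeps the
    (X1, X2, Y) triplets intact."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # sample rows with replacement
        draws.append(estimator(X1[idx], X2[idx], Y[idx]))
    return np.std(np.stack(draws), axis=0, ddof=1)

def partial_ols(X1, X2, Y):
    """Placeholder estimator: regress the residuals of Y on the residuals
    of X1, both taken after removing the effect of X2."""
    Z = np.column_stack([np.ones(len(Y)), X2])
    R1 = X1 - Z @ np.linalg.lstsq(Z, X1, rcond=None)[0]
    RY = Y - Z @ np.linalg.lstsq(Z, Y, rcond=None)[0]
    return np.linalg.lstsq(R1, RY, rcond=None)[0]
```

Any function with the same signature that returns a coefficient array, such as a PPLS fit with a fixed number of components, could be passed in place of `partial_ols`.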

SIMULATION STUDY

In this section, we compared EPPLS and PPLS with existing methods including OLS, PCR, categorical principal component analysis (PRINCALS ), correspondence analysis (CA ), PLS, and EPLS. PCR regards categorical variables as continuous variables while PRINCALS and CA use the mixture of continuous and categorical variables to fit the multivariate linear regression model. Specifically, PRINCALS considers continuous transformation of categorical variables through monotone spline function with degree 2 and CA uses multiple correspondence analysis for categorical variables. The envelope dimension was chosen by BIC for the envelope methods such as EPPLS and EPLS, and by cross‐validation for PPLS and PLS. For PCR, PRINCALS, and CA, the number of principal components (PC) is chosen such that the selected PCs explain at least of the total variation of all predictors. We first investigated a low‐dimensional case where OLS is used as a benchmark. The data were generated from model (9), with , , , and . The dimension of was varied from , 10, and 30. The matrix was obtained by normalizing a matrix of independent normal variates, was a matrix with each element being independent normal variates, and , where denotes a vector of 1. Let be an independent normal variates and be a matrix of independent normal variates, and . We have and , where denotes the spectral norm. To generate the continuous predictors , we let and . We let follow a multivariate normal distribution with a zero mean vector and variance matrix . The errors was generated from a multivariate normal distribution with mean and covariance . The categorical predictors were , where , , and were independent Bernoulli variates that take value 1 with probability 0.4, 0.5, and 0.8, respectively. We considered the sample size from 50 to 1000. For each sample size, 100 replications were simulated. First we investigated the computing time of each method. 
The computing time was averaged over 10 replications, and it included the selection of the number of components. The results are displayed in Table 1. PCR and OLS are the fastest methods to compute, followed by EPLS, PRINCALS, CA, and EPPLS. PLS and PPLS take the longest to compute. The computing time was measured on a 2.3 GHz quad‐core Intel Core i7 processor with 32 GB of memory.

TABLE 1

Computing time for methods used in simulation studies

(n,r)	EPPLS	EPLS	PPLS	PLS	PCR	PRINCALS	CA	OLS
(100, 1)	6.58 s	2.18 s	1.24 min	2.26 min	0.07 s	1.21 s	2.70 s	0.05 s
(100, 10)	8.69 s	1.98 s	2.23 min	4.17 min	0.06 s	0.97 s	2.31 s	0.18 s
(100, 30)	8.14 s	3.11 s	3.05 min	5.78 min	0.06 s	0.97 s	1.86 s	0.07 s
(300, 1)	4.48 s	1.16 s	1.94 min	3.83 min	0.05 s	2.04 s	3.60 s	0.06 s
(300, 10)	6.37 s	1.55 s	4.22 min	7.34 min	0.06 s	2.06 s	3.76 s	0.07 s
(300, 30)	7.84 s	4.08 s	6.90 min	12.61 min	0.07 s	2.07 s	3.62 s	0.10 s
(1000, 1)	5.79 s	0.77 s	4.94 min	9.76 min	0.06 s	4.50 s	7.95 s	0.16 s
(1000, 10)	6.12 s	1.31 s	10.78 min	19.95 min	0.08 s	4.55 s	8.29 s	0.24 s
(1000, 30)	9.47 s	2.46 s	23.59 min	41.55 min	0.09 s	4.63 s	8.20 s	0.56 s

For each replication, we estimated $\beta_1$ using EPPLS, EPLS, PPLS, PLS, PCR, PRINCALS, CA, and OLS, and calculated $\|\hat\beta_1 - \beta_1\|_F$, where $\|\cdot\|_F$ denotes the Frobenius norm. For EPLS, PLS, PCR, PRINCALS, CA, and OLS, we fitted the response on all predictors, and obtained $\hat\beta_1$ by extracting the submatrix of the estimated coefficient matrix that corresponds to the continuous predictors. The average and SD of $\|\hat\beta_1 - \beta_1\|_F$ based on the 100 replications are summarized in Table 2. Note that $\|\hat\beta_1 - \beta_1\|_F$ is the square root of the mean square error (MSE) of $\hat\beta_1$. Among all the methods, EPPLS has the smallest MSE, followed by PPLS. Note that PPLS and EPPLS estimate the same parameter in the population, but they use different sample estimation algorithms. EPPLS is likelihood‐based and is usually more efficient than PPLS. EPLS performs much better than OLS. However, it loses substantial efficiency compared to EPPLS. For each ($n$, $r$) pair, the error from EPLS is at least three times as large as that from EPPLS. This is because EPPLS treats categorical and continuous predictors differently in estimation, and is, therefore, more efficient. Most of the time, PCR, PRINCALS, CA, and PLS perform worse than OLS. This is because these methods seek the linear combinations of the predictors that provide either the largest variance or the largest covariance with the response. These directions are not necessarily the ones that carry information for the estimation of $\beta_1$. So the estimators from these methods may have a large bias and underperform OLS (see Figure 1). Note that although EPLS and PLS estimate the same parameter, EPLS is more stable than PLS, since it is a model‐based method and is proved to be consistent.

TABLE 2

Results of average (SD/) of based on 100 replications

r	Methods	n=50	n=100	n=300	n=1000
1	EPPLS	0.49(0.029)	0.58(0.054)	0.33(0.012)	0.07(0.005)
	EPLS	3.87(0.844)	3.27(0.539)	2.06(0.275)	0.23(0.005)
	PPLS	0.50(0.023)	1.13(0.065)	1.02(0.052)	0.07(0.003)
	PLS	2.64(0.001)	10.60(0.003)	15.64(0.002)	1.07(0.0001)
	PCR	2.60(0.015)	10.61(3.3∗10−4)	15.65(3.2∗10−4)	1.07(1.7∗10−4)
	PRINCALS	1.37(0.088)	10.62(8.8∗10−4)	15.65(2.6∗10−4)	1.08(2.2∗10−4)
	CA	2.67(0.008)	10.61(3.3∗10−4)	15.65(2.9∗10−4)	1.08(1.9∗10−4)
	OLS	12.91(0.812)	8.06(0.500)	4.37(0.284)	2.29(0.140)
10	EPPLS	1.27(0.031)	1.06(0.029)	0.63(0.018)	0.26(0.006)
	EPLS	10.20(1.692)	8.37(1.259)	7.98(0.767)	1.23(0.121)
	PPLS	3.43(0.200)	4.31(0.241)	3.36(0.175)	0.73(0.036)
	PLS	21.36(0.011)	40.62(0.002)	51.76(0.008)	20.93(0.002)
	PCR	20.91(0.060)	40.66(0.002)	51.81(0.001)	20.96(0.0002)
	PRINCALS	10.57(0.310)	40.68(0.002)	51.82(0.001)	20.96(2.2∗10−4)
	CA	21.39(0.012)	40.66(0.001)	51.81(8.1∗10−4)	20.96(2.1∗10−4)
	OLS	44.40(1.013)	29.19(0.692)	16.27(0.348)	8.58(0.206)
30	EPPLS	2.58(0.072)	1.49(0.034)	0.89(0.021)	0.45(0.011)
	EPLS	40.88(3.521)	25.95(2.567)	6.16(0.999)	3.88(0.517)
	PPLS	8.62(0.956)	6.36(0.366)	5.07(0.316)	1.76(0.108)
	PLS	47.49(0.024)	60.59(0.017)	74.61(0.011)	54.68(0.005)
	PCR	46.47(0.136)	60.66(0.002)	74.69(0.001)	54.74(0.001)
	PRINCALS	23.06(0.697)	60.69(0.004)	74.70(0.002)	54.74(6.5∗10−4)
	CA	47.57(0.026)	60.66(0.002)	74.69(0.001)	54.74(5.8∗10−4)
	OLS	77.08(1.150)	50.84(0.744)	28.85(0.360)	15.95(0.168)

FIGURE 1

Bias (left panel) and variance (right panel) for a randomly picked element of $\hat\beta_1$ when . The black solid line marks EPPLS, the blue dashed line marks EPLS, the magenta dotted line marks PPLS, the red dotted line marks PLS, the green dashed line marks PCR, the coral dashed line marks PRINCALS, the brown dashed line marks CA, and the orange dotted line marks OLS.

Figure 1 takes a closer look at the bias and variance of a randomly chosen element of $\hat\beta_1$. From the left panel, we see that the PLS estimator indeed carries a large bias, as indicated in Schuberth et al, when it treats the discrete predictors as continuous. The PCR, PRINCALS, and CA estimators also carry a large bias, since these methods do not take the information in the response into account when constructing the principal components. OLS and EPLS are consistent methods and do not have a large bias, but their estimators are more variable than the EPPLS and PPLS estimators, as shown in the right panel of Figure 1. PPLS has about the same bias as EPPLS and a slightly larger variance, but the difference is dwarfed by the magnitude of the variance of EPLS or OLS. Moreover, we investigated the performance of hypothesis testing for the coefficients based on the asymptotic distribution established in Proposition 2. To perform the hypothesis testing, we followed the simulation setting that generated Table 2, except that we set the first three rows of the envelope basis to zero. This implies that the first three rows of $\beta_1$ are zero vectors and the remaining elements of $\beta_1$ are nonzero. We then tested whether each element of $\beta_1$ is zero. Specifically, let $\beta_{1,ij}$ denote the $(i,j)$th element of $\beta_1$; we test the hypothesis $H_0: \beta_{1,ij} = 0$. The SE of the estimator of $\beta_{1,ij}$ and the P‐value of each test were calculated using the asymptotic distribution in Proposition 2. The simulation was replicated 100 times. We report the 5th, 50th, and 95th percentiles of the average P‐values in Table 3. The P‐values for the zero elements and the nonzero elements of $\beta_1$ are reported separately. The results show that, with the asymptotic distribution in Proposition 2, the hypothesis testing procedure is able to detect the nonzero elements of $\beta_1$ with high power. For the zero elements of $\beta_1$, the testing procedure also controls the Type I error at the desired level.
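
Given a coefficient estimate and its asymptotic SE, a two‐sided P‐value for the hypothesis that the coefficient is zero can be computed from the normal approximation. The sketch below assumes the SE has already been obtained (for example, from the asymptotic covariance in Proposition 2, which is not reproduced here); the function name is hypothetical.

```python
import math

def wald_p_value(estimate, std_error):
    """Two-sided P-value for H0: coefficient = 0 under asymptotic normality.
    Uses 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))."""
    z = estimate / std_error
    return math.erfc(abs(z) / math.sqrt(2.0))
```

For instance, an estimate that is 1.96 SEs away from zero gives a P‐value of about 0.05, and an estimate of exactly zero gives a P‐value of 1.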

TABLE 3

Results of the 5th, 50th, and 95th percentiles of the average ‐values based on 100 replications

		The 5th, 50th, and 95th percentiles of the P‐values
r	n	Zero element of β1	Nonzero element of β1
1	50	0.336, 0.907, 0.992	4.5∗10−8, 1.4∗10−3, 0.469
	100	0.394, 0.929, 0.994	0, 0, 0.002
	300	0.366, 0.923, 0.993	0, 0, 1.3∗10−14
	1000	0.731, 0.930, 0.993	0, 0, 0.003
10	50	0.191, 0.889, 0.990	0, 1.2∗10−9, 0.098
	100	0.275, 0.889, 0.989	0, 0, 6.3∗10−8
	300	0.279, 0.880, 0.993	0, 0, 0
	1000	0.344, 0.916, 0.990	0, 0, 6.5∗10−12
30	50	0.065, 0.840, 0.988	0, 0, 2.1∗10−4
	100	0.085, 0.854, 0.989	0, 0, 2.1∗10−13
	300	0.183, 0.861, 0.995	0, 0, 4.7∗10−8
	1000	0.305, 0.877, 0.993	0, 0, 0

For sensitivity analysis, we considered a situation where the immaterial part of has a larger variation than the material part (). The results of both scenarios (ie, and ) presented in Table 2 and Web Table 1 show that EPPLS yields the most efficiency gains, and its performance is quite stable. In addition, we considered the case where and do not have a linear relationship. The results are in Web Table 2. The performance of EPPLS is very stable and is still the best among all models under comparison. However, the performance of PPLS deteriorates so much that it underperforms OLS most of the time. The details of the sensitivity analyses are provided in Section C of the Supplementary materials.

We also investigated a high‐dimensional setting where the number of predictors exceeds the sample size. The data were generated in the same way as for Table 2, with the sample size fixed at 100, the number of binary predictors fixed at 10, and p1 = 150, 300, and 600. The ten binary predictors were drawn from Bernoulli distributions that take value 1 with probabilities , and 0.8. The coefficient matrix had structure , where each element in was an independent normal random variate. The coefficient matrix was , , , . In high‐dimensional settings, prediction is a more common criterion than MSE for comparing methods. We therefore computed the prediction error, defined as the square root of the mean squared residuals, for EPPLS, PPLS, EPLS, PLS, PCR, PRINCALS, and CA using five‐fold cross‐validation, with 100 replications for each sample size. Note that OLS is not applicable when the number of predictors exceeds the sample size.

The results are provided in Table 4. PLS is known for its stable performance in high‐dimensional settings, and it performs better than PCR, PRINCALS, and CA. By conditioning on the categorical variables, PPLS further reduces the prediction errors compared to PLS. The envelope methods EPLS and EPPLS work by removing the variation from the immaterial part, and they have the lowest prediction errors. Between the two envelope methods, EPPLS treats the continuous and categorical predictors separately by conditioning the response and the continuous predictors on the categorical predictors, and it has the best performance in all cases in Table 4.
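
The conditioning step described above can be illustrated with a small sketch. This is not the authors' EPPLS estimator: it only shows the idea of removing the linear effect of the categorical predictors from the response and the continuous predictors, and then extracting a single PLS weight direction from the residuals. The names `X1`, `X2`, `Y`, `residualize`, and `pls1_direction`, as well as the simulated data, are assumptions introduced for illustration.

```python
import numpy as np

def residualize(A, X2):
    """Residuals of each column of A after OLS on an intercept and X2."""
    Z = np.column_stack([np.ones(len(X2)), X2])
    coef, *_ = np.linalg.lstsq(Z, A, rcond=None)
    return A - Z @ coef

def pls1_direction(X, Y):
    """First PLS weight vector: leading left singular vector of X'Y."""
    U, _, _ = np.linalg.svd(X.T @ Y, full_matrices=False)
    return U[:, 0]

# Hypothetical simulated data (not the paper's simulation design).
rng = np.random.default_rng(0)
n, p1, r = 200, 5, 2
X2 = rng.binomial(1, 0.5, size=(n, 2)).astype(float)            # binary predictors
X1 = X2 @ rng.normal(size=(2, p1)) + rng.normal(size=(n, p1))   # continuous predictors
# Only the first continuous predictor drives the responses.
Y = X1[:, :1] @ np.ones((1, r)) + X2 @ rng.normal(size=(2, r)) \
    + 0.1 * rng.normal(size=(n, r))

# Condition the response and the continuous predictors on X2.
X1_res = residualize(X1, X2)
Y_res = residualize(Y, X2)

# One PLS direction computed on the residuals.
w = pls1_direction(X1_res, Y_res)
```

In this toy setting the weight vector `w` concentrates on the first continuous predictor, which is the only one that drives the responses once the binary predictors are accounted for.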

TABLE 4

Results of average (SD/) of the prediction errors based on 100 replications for high dimensional setting

r	Methods	p1=150		p1=300		p1=600
1	EPPLS	5.93	(0.097)	21.89	(0.300)	28.68	(0.606)
	EPLS	9.05	(0.127)	33.61	(0.432)	48.51	(1.073)
	PPLS	9.09	(0.104)	41.09	(0.442)	60.98	(1.278)
	PLS	13.17	(0.107)	60.43	(0.409)	96.24	(1.383)
	PCR	15.28	(0.154)	68.23	(0.640)	104.76	(1.758)
	PRINCALS	13.83	(0.116)	63.92	(0.476)	102.48	(0.819)
	CA	13.90	(0.123)	64.17	(0.484)	102.49	(0.810)
10	EPPLS	67.49	(1.060)	76.41	(1.814)	79.24	(1.694)
	EPLS	69.40	(0.601)	96.78	(0.697)	125.36	(1.947)
	PPLS	99.20	(1.070)	143.23	(1.462)	175.75	(3.214)
	PLS	146.04	(1.186)	211.45	(1.436)	279.92	(4.209)
	PCR	166.11	(1.617)	239.42	(2.239)	302.39	(5.555)
	PRINCALS	152.96	(1.358)	223.65	(1.680)	294.26	(2.350)
	CA	153.83	(1.252)	224.54	(1.668)	294.30	(2.317)
30	EPPLS	93.04	(1.022)	106.61	(1.064)	112.93	(2.394)
	EPLS	129.00	(1.032)	157.74	(1.047)	184.22	(2.509)
	PPLS	195.76	(2.099)	238.98	(2.400)	256.36	(5.496)
	PLS	288.34	(2.373)	353.39	(2.423)	414.83	(5.923)
	PCR	327.77	(3.207)	399.96	(3.730)	444.60	(6.878)
	PRINCALS	301.71	(2.694)	373.43	(2.809)	436.59	(3.480)
	CA	303.35	(2.495)	375.06	(2.799)	436.66	(3.420)
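
The prediction error reported in Table 4 (square root of the mean squared residuals under five‐fold cross‐validation) can be sketched as follows. This is a minimal illustration under stated assumptions: the residuals are pooled over all held‐out observations and all responses, and principal component regression (PCR) stands in as the fitted method because it is simple to write with NumPy alone. The helper names `pcr_fit_predict` and `cv_prediction_error`, the number of components `k`, and the simulated data are hypothetical.

```python
import numpy as np

def pcr_fit_predict(X_train, Y_train, X_test, k):
    """Principal component regression with k components (works when p > n)."""
    x_mean, y_mean = X_train.mean(axis=0), Y_train.mean(axis=0)
    Xc, Yc = X_train - x_mean, Y_train - y_mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:k].T                                  # p x k loading matrix
    gamma, *_ = np.linalg.lstsq(Xc @ V, Yc, rcond=None)
    return y_mean + (X_test - x_mean) @ V @ gamma

def cv_prediction_error(X, Y, fit_predict, n_folds=5, seed=0):
    """Root of the mean squared held-out residuals over n_folds folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    sq_resid = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
        pred = fit_predict(X[train_idx], Y[train_idx], X[test_idx])
        sq_resid.append((Y[test_idx] - pred) ** 2)
    return float(np.sqrt(np.mean(np.concatenate(sq_resid))))

# Hypothetical high-dimensional example: 150 predictors, 100 observations.
rng = np.random.default_rng(1)
n, p, r = 100, 150, 1
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=(p, r)) + rng.normal(size=(n, r))

err = cv_prediction_error(X, Y, lambda a, b, c: pcr_fit_predict(a, b, c, k=10))
```

Any method with a fit-then-predict interface can be passed in place of the PCR helper, which makes it straightforward to compare several methods on identical folds.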


DATA APPLICATION

COVID‐19 is a global pandemic that has affected 223 countries, areas, or territories. Studies show that cytokines are associated with COVID‐19 severity and survival, and identifying the association between cytokine‐based biomarkers and COVID‐19 severity and demographics leads to a better understanding and management of the disease. For this purpose, we analyzed the data from the study of Laing et al, which included 63 COVID‐19 patients. The data also contained 10 non‐COVID‐19 patients who were hospitalized for lower respiratory tract infections and served as controls. For each patient, measurements were obtained for 26 cytokines, as well as a set of clinical information including demographics, patient status at admission, and underlying disease status. Among the 73 patients, 9 had missing data on BMI, ethnicity, or cytokines and were excluded from the analysis. Our analysis was thus based on a dataset of 64 patients, including 26 severe cases, 22 moderate cases, 6 low cases, and 10 non‐COVID patients. Data and detailed protocols for this study are publicly available on the COVID‐IP project website (http://www.immunophenotype.org).

We took the logarithm of the cytokine measurements as the multivariate response vector. The continuous variables were 12 measurements of patient status at admission: temperature, blood glucose, National Early Warning Score 2 (NEWS2), serum lactate, the fraction of inspired oxygen, respiratory rate, oxygen saturation, heart rate, systolic blood pressure, diastolic blood pressure, coma score, and WHO score for severity of illness. The categorical variables were demographic information and indicators for underlying disease status. Demographic information comprised age, BMI, ethnicity, and sex. Age was a binary variable taking value 1 for patients 45 years and older, and 0 otherwise. BMI was measured on an ordinal scale with categories below 20, 20‐24, 25‐29, 30‐34, and 35 and above. The ethnicity variable included three categories: Asian, black, and Caucasian. We created two binary indicators, one for Asian and one for black. The sex indicator took value 1 for males and 2 for females. For underlying diseases, hypertension, ischemic heart disease, non‐asthma chronic lung disease, asthma, diabetes, and active malignancy were considered. This gave a total of 11 categorical variables. All variables were standardized.

We fitted the data with EPPLS, PPLS, EPLS, PLS, PCR, and OLS, and computed the prediction error as the root mean square error, obtained by five‐fold cross‐validation with 50 random splits of the data. OLS had the largest prediction error of 38.74, followed by PCR, which had a prediction error of 6.041. PLS and EPLS had similar prediction errors: 5.120 for PLS and 5.247 for EPLS. PPLS and EPPLS had the lowest prediction errors: 2.194 for PPLS and 2.192 for EPPLS. The efficiency gains obtained from EPPLS and PPLS also led to better prediction performance.

The estimation efficiency also led to a clear scientific interpretation of the results. Based on the regression coefficient estimators, we investigated the associations between cytokines and covariates. Figure 2 shows the heatmaps of from all six methods, and Web Figure 1 shows the clustering structure of the responses () and continuous variables (). Recall that presents the associations of the cytokines with patient status at admission. It was noteworthy that under EPPLS, interleukin 10 (IL10) stood out as the most important cytokine, highlighted by a clear, strong association with multiple measures of patient status at admission, including severity (admission_WHO_ordinal_scale), blood pressure (admission_BP_diastolic, admission_BP_systolic), serum lactate (admission_lactate_venous), and oxygen saturation (admission_os_sats). Under the normality assumption, Proposition 2 was applied to perform the hypothesis test of . The regression coefficients for the association of IL10 (IL10_Th_cyto_cyto) with admission_WHO_ordinal_scale, admission_BP_diastolic, admission_BP_systolic, admission_lactate_venous, and admission_os_sats are statistically significant with p‐values , , , , and , respectively. This is consistent with the report that IL10 is associated with COVID‐19 severity and mortality, cytokine storm, and intensive care unit (ICU) stay in COVID‐19 patients.

The importance of IL10 was not as evident in competing approaches based on the absolute values of . The OLS and PCR estimators were highly variable, and it was hard to extract much information from their coefficients. EPLS and PLS both showed a few influential cytokines including IL10, but it was not obvious that IL10 was the most important one, as it was under EPPLS. Although PPLS and EPPLS have the same estimation goal in the population, their sample performance can vary. In this example, PPLS also detected the strong association between IL10 and patient admission status, but the leading role of IL10 was not as obvious as under EPPLS.
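
The variable coding described above (log‐transformed cytokines, a binary age indicator with cutoff 45, two ethnicity indicators with Caucasian as the reference level, and standardization) can be sketched as below. The patient values are made up for illustration and are not from the COVID‐IP data. The variable names and the use of the sample standard deviation (`ddof=1`) for standardization are assumptions, since the text does not specify them.

```python
import numpy as np

# Hypothetical raw values for four patients (not real study data).
age_years = np.array([38, 61, 45, 72])
ethnicity = np.array(["Caucasian", "Asian", "black", "Caucasian"])
cytokine = np.array([[12.0, 3.5],
                     [40.2, 8.1],
                     [22.7, 5.0],
                     [95.3, 11.4]])   # two cytokines, raw scale

# Binary age indicator: 1 for patients 45 years and older, 0 otherwise.
age_45_plus = (age_years >= 45).astype(float)

# Two binary ethnicity indicators; Caucasian is the reference category.
is_asian = (ethnicity == "Asian").astype(float)
is_black = (ethnicity == "black").astype(float)

X2 = np.column_stack([age_45_plus, is_asian, is_black])  # categorical block
Y = np.log(cytokine)                                     # log-transformed responses

def standardize(A):
    """Center each column and scale it to unit sample standard deviation."""
    return (A - A.mean(axis=0)) / A.std(axis=0, ddof=1)

X2_std = standardize(X2)
Y_std = standardize(Y)
```

With three ethnicity categories, two indicators are enough: a patient with both indicators equal to 0 belongs to the reference category.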

FIGURE 2

Heatmaps of the regression coefficients of under EPPLS (left of 1st row), EPLS (right of 1st row), PPLS (left of 2nd row), PLS (right of 2nd row), PCR (left of 3rd row), and OLS (right of 3rd row)

In addition to IL10, interleukin 6 (IL6) and CXCL10 (IP10) were identified as co‐leading cytokines. Interestingly, Laing et al reported that the status of COVID‐19 patients is characterized by a severity‐related triad of IL10, IL6, and IP10. The triad/block of IL10, IL6, and IP10 was most obvious under PLS but was also visible under EPPLS from the clustering structure of in Web Figure 1. However, the triad/block was missed by EPLS, PPLS, PCR, and OLS. We also noted that under EPPLS, interferon‐γ or type II interferon (IFNg) had coefficient estimates similar to those of IP10, which is consistent with the observation that IFNg levels are correlated with IP10. This similarity was not present in EPLS, PPLS, PLS, PCR, or OLS.

Figure 3 shows the heatmaps for the estimators of , which present the associations of cytokines with demographics and underlying diseases. First, we noticed the strong association of IP10 with sex. It has been reported that men have a higher risk of infection, mortality, and comorbidities from COVID‐19 compared to women. Thus it is important to investigate the sex difference in COVID‐19. Recently, Takahashi et al reported the association of IP10 with the sex difference in immune responses that underlie COVID‐19 disease outcomes. This association was also captured by all models, although it appeared weaker under OLS.

Second, we observed a clear association of interferon‐γ (IFNg) with both the Asian and black populations and a strong association of type III interferon (IFNl2.3) with the black population under EPPLS. This association was not observed under EPLS and PLS, and the association between IFNl2.3 and the black population was weak under OLS. Significant racial/ethnic disparities have been reported for COVID‐19, with a disproportionate burden on African and Latino populations. Hence, the cytokine markers IP10, IFNg, and IFNl2.3 can potentially be important for understanding the biological mechanisms associated with sex bias and racial/ethnic disparities in COVID‐19.

Third, we observed the association of interleukin 2 (IL2) with multiple pre‐existing disease statuses, including ischemic heart disease (IHD), asthma, and hypertension (HTN), which was not clear under EPLS and PLS. An association of IL2 with asthma was previously reported. Therefore IL2 can potentially be considered as a marker for pre‐existing disease status.

Finally, we observed the strong association of interferon‐λ1 (IFNl1) with age under EPPLS, EPLS, PPLS, and OLS, but not under PLS or PCR. Recently, Dinnon et al developed a mouse model for COVID‐19, which can be used to study age‐related disease pathogenesis of COVID‐19. IFNl1 is a potential clinical target for the treatment of human COVID‐19 using this mouse model. Heatmaps with a uniform color scale are in Web Figures 3 and 4 in the Supplementary materials.

FIGURE 3

Heatmaps of the regression coefficients of under EPPLS (left of 1st row), EPLS (right of 1st row), PPLS (left of 2nd row), PLS (right of 2nd row), PCR (left of 3rd row), and OLS (right of 3rd row)

DISCUSSION

We have proposed an EPPLS model which achieves estimation efficiency when both continuous and categorical predictors are present. EPPLS is proposed when the categorical predictors are assumed to be fixed in the model formulation, but it is also applicable to cases when is random and follows a certain distribution. If all predictors are continuous, the idea of EPPLS can be applied when part of the predictors is of main interest. The proposed model can potentially be applied to generalized linear regression where the response variable is categorical. EPPLS can also incorporate heteroscedastic structure, spatial correlation, or time dependence. A Bayesian approach can also be derived for these models, which allows users to incorporate prior information for estimation.

Theoretical properties in Section 3.3, such as consistency and asymptotic normality, have been established when the number of predictors is smaller than the sample size. In high‐dimensional settings where the number of predictors exceeds the sample size, these theoretical properties are not valid without further assumptions such as sparsity, low‐rank structure, or other parametric structures. Numerically, EPPLS shows better prediction performance than the other methods in our simulation settings. Developing variants of EPPLS that better adapt to high‐dimensional settings is an interesting topic for future study.

12 in total

1. Interleukin-2 levels in exhaled breath condensates, asthma severity, and asthma control in nonallergic asthma.

Authors: Sawad Boonpiyathad; Prapaporn Pornsuriyasak; Supranee Buranapraditkun; Jettanong Klaewsongkram
Journal: Allergy Asthma Proc Date: 2013 Sep-Oct Impact factor: 2.587

2. Sex differences in SARS-CoV-2 infection rates and the potential link to prostate cancer.

Authors: Dimple Chakravarty; Sujit S Nair; Nada Hammouda; Parita Ratnani; Yasmine Gharib; Vinayak Wagaskar; Nihal Mohamed; Dara Lundon; Zachary Dovey; Natasha Kyprianou; Ashutosh K Tewari
Journal: Commun Biol Date: 2020-07-08

3. Envelope-based partial partial least squares with application to cytokine-based biomarker analysis for COVID-19.

Authors: Yeonhee Park; Zhihua Su; Dongjun Chung
Journal: Stat Med Date: 2022-07-15 Impact factor: 2.497

4. A guide to modern statistical analysis of immunological data.

Authors: Bernd Genser; Philip J Cooper; Maria Yazdanbakhsh; Mauricio L Barreto; Laura C Rodrigues
Journal: BMC Immunol Date: 2007-10-26 Impact factor: 3.615

5. A dynamic COVID-19 immune signature includes associations with poor prognosis.

Authors: Adam G Laing; Anna Lorenc; Irene Del Molino Del Barrio; Abhishek Das; Matthew Fish; Leticia Monin; Miguel Muñoz-Ruiz; Duncan R McKenzie; Thomas S Hayday; Isaac Francos-Quijorna; Shraddha Kamdar; Magdalene Joseph; Daniel Davies; Richard Davis; Aislinn Jennings; Iva Zlatareva; Pierre Vantourout; Yin Wu; Vasiliki Sofra; Florencia Cano; Maria Greco; Efstathios Theodoridis; Joshua D Freedman; Sarah Gee; Julie Nuo En Chan; Sarah Ryan; Eva Bugallo-Blanco; Pärt Peterson; Kai Kisand; Liis Haljasmägi; Loubna Chadli; Philippe Moingeon; Lauren Martinez; Blair Merrick; Karen Bisnauthsing; Kate Brooks; Mohammad A A Ibrahim; Jeremy Mason; Federico Lopez Gomez; Kola Babalola; Sultan Abdul-Jawad; John Cason; Christine Mant; Jeffrey Seow; Carl Graham; Katie J Doores; Francesca Di Rosa; Jonathan Edgeworth; Manu Shankar-Hari; Adrian C Hayday
Journal: Nat Med Date: 2020-08-17 Impact factor: 87.241

6. Partial least squares path modeling using ordinal categorical indicators.

Authors: Florian Schuberth; Jörg Henseler; Theo K Dijkstra
Journal: Qual Quant Date: 2016-09-14

7. An interactive web-based dashboard to track COVID-19 in real time.

Authors: Ensheng Dong; Hongru Du; Lauren Gardner
Journal: Lancet Infect Dis Date: 2020-02-19 Impact factor: 25.071

8. A Potential Role of Interleukin 10 in COVID-19 Pathogenesis.

Authors: Ligong Lu; Hui Zhang; Danielle J Dauphars; You-Wen He
Journal: Trends Immunol Date: 2020-11-02 Impact factor: 16.687

9. A mouse-adapted model of SARS-CoV-2 to test COVID-19 countermeasures.

Authors: Kenneth H Dinnon; Sarah R Leist; Alexandra Schäfer; Caitlin E Edwards; David R Martinez; Stephanie A Montgomery; Ande West; Boyd L Yount; Yixuan J Hou; Lily E Adams; Kendra L Gully; Ariane J Brown; Emily Huang; Matthew D Bryant; Ingrid C Choong; Jeffrey S Glenn; Lisa E Gralinski; Timothy P Sheahan; Ralph S Baric
Journal: Nature Date: 2020-08-27 Impact factor: 49.962

10. Sex differences in immune responses that underlie COVID-19 disease outcomes.

Authors: Takehiro Takahashi; Mallory K Ellingson; Patrick Wong; Benjamin Israelow; Carolina Lucas; Jon Klein; Julio Silva; Tianyang Mao; Ji Eun Oh; Maria Tokuyama; Peiwen Lu; Arvind Venkataraman; Annsea Park; Feimei Liu; Amit Meir; Jonathan Sun; Eric Y Wang; Arnau Casanovas-Massana; Anne L Wyllie; Chantal B F Vogels; Rebecca Earnest; Sarah Lapidus; Isabel M Ott; Adam J Moore; Albert Shaw; John B Fournier; Camila D Odio; Shelli Farhadian; Charles Dela Cruz; Nathan D Grubaugh; Wade L Schulz; Aaron M Ring; Albert I Ko; Saad B Omer; Akiko Iwasaki
Journal: Nature Date: 2020-08-26 Impact factor: 49.962
