Literature DB >> 34075095

Relationship between gene regulation network structure and prediction accuracy in high dimensional regression.

Yuichi Okinaga¹, Daisuke Kyogoku², Satoshi Kondo³, Atsushi J Nagano^4,5, Kei Hirose^6,7.

Abstract

The least absolute shrinkage and selection operator (lasso) and principal component regression (PCR) are popular methods of estimating traits from high-dimensional omics data, such as transcriptomes. The prediction accuracy of these estimation methods is highly dependent on the covariance structure, which is characterized by gene regulation networks. However, the manner in which the structure of a gene regulation network together with the sample size affects prediction accuracy has not yet been sufficiently investigated. In this study, Monte Carlo simulations are conducted to investigate the prediction accuracy for several network structures under various sample sizes. When the gene regulation network is a random graph, a sufficiently large number of observations are required to ensure good prediction accuracy with the lasso. The PCR provided poor prediction accuracy regardless of the sample size. However, a real gene regulation network is likely to exhibit a scale-free structure. In such cases, the simulation indicates that a relatively small number of observations, such as [Formula: see text], is sufficient to allow the accurate prediction of traits from a transcriptome with the lasso.

Entities: Chemical Disease Gene Species

Year: 2021 PMID： 34075095 PMCID： PMC8169869 DOI： 10.1038/s41598-021-90791-6

Source DB: PubMed Journal: Sci Rep ISSN： 2045-2322 Impact factor: 4.379

Introduction

Technological advancements have enabled the collection of highly multidimensional data from biological systems[1-4]. For example, RNA sequencing quantifies expression levels of thousands of genes. Such omics data is useful in predicting organismal traits, with empirical applications including diagnosis and classification of diseases and prediction of patient survival[5-8] and possible future applications in predicting crop yields[9], insecticide resistance[10], and environmental adaptation[11]. A common challenge in predicting traits from omics data is the dimension of the data far exceeding that of the sample size (known as high-dimensional regression). For example, if one is to apply least-squares estimation in multiple regression (e.g. ) to predict a trait value from a transcriptome, the sample size needs to be (at least) larger than the number of model parameters. However, because transcriptome studies typically observe thousands of genes, a sample size exceeding the number of genes is not realistic at present. In this case, high-dimensional regression modeling must be considered. The least absolute shrinkage and selection operator (lasso[12]) is one of the most frequently used methods for high-dimensional regression. It simultaneously achieves variable selection and parameter estimation. Theoretically, the prediction accuracy of the lasso is highly dependent on the correlation structure among exploratory variables; it is high under certain strong conditions, such as the compatibility condition[13]. However, in practice, it is not easy to check whether the compatibility condition holds. Another popular estimation method for high-dimensional regression is principal component regression (PCR[14]). PCR is a two-stage procedure: first, principal component analysis is conducted for predictors, following which the regression model on which the principal components are used as predictors is fitted. This method may perform well when the exploratory variables are highly correlated. It is reasonable to assume that gene regulation networks will result in conditional independence among the levels of gene expression[15-17]. Here, two variables are conditionally independent when they are independent given other variables (e.g. two focal variables are independently influenced by a third variable[18]). When a random vector of exploratory variables follows a multivariate normal distribution, two variables are conditionally independent if and only if the corresponding element of the inverse covariance matrix is zero. Essentially, the networks are characterized by the nonzero pattern of the inverse covariance matrix. One of the most notable characteristics of biological networks is their scale-free nature, that is, the degree distribution of the network follows a power-law expressed as ()[19,20]. Empirical studies suggest that biological networks are often scale-free[21-23], although exceptions have also been found[24]. Therefore, it is reasonable to consider the problem of high-dimensional regression when the networks of exploratory variables are scale-free. Here, it should be noted that the relative performance of different high-dimensional regression techniques may depend on sample sizes. However, to the best of our knowledge, the effect of the gene regulation network structure together with sample size on prediction accuracy has not yet been sufficiently investigated. This paper provides a general simulation framework to study the effects of correlation structure in explanatory variables. As an example, the prediction of ambient temperature from the transcriptome, for which good empirical data is available[11,25], is considered. It should be noted that the implementation of the proposed procedure is independent of the empirical data[11,25]; the proposed framework may be applied to predict any consequence of gene expression differences. The proposed framework is based on the Monte Carlo simulations. Three datasets of transcriptome and their traits are generated. The datasets are characterized by the covariance structure of exploratory variables; one of the covariance structures corresponds to the scale-free gene regulation network. Both lasso and PCR are applied to these simulated datasets to investigate the prediction accuracy with different types of gene regulation networks. The sample size is also varied to examine its effect on the prediction accuracy. The remainder of this paper is organized as follows. Section “Prediction methods for high-dimensional data” describes prediction methods for high-dimensional regression in the given simulation. Section “Simulation framework” discusses the proposed simulation framework. Finally, Section “Concluding remarks” presents the concluding remarks.

Prediction methods for high-dimensional data

Suppose that we have n observations where are p-dimensional vector of explanatory variables and are responses . Let and . Consider the linear regression model:where is a vector of error variables with and .

Lasso

The lasso minimizes a loss function that consists of quadratic loss with a penalty based on an norm of a parameter vector:where is a regularization parameter. Because of the nature of the norm in the penalty term, some of the elements of the coefficients are estimated to be exactly zero. Thus, variable selection is conducted, and only variables that correspond to nonzero coefficients affect the responses.

PCR

In some cases, the first few largest eigenvalues of the covariance matrix of predictors (i.e., proportional contributions of principle components) can be considerably large (e.g., spiked covariance model[26]). In such a case, the lasso may not function effectively in terms of both prediction accuracy and consistency in model selection, because the conditions for its effective performance (e.g., compatibility condition[27]) may not be satisfied. This issue could be addressed using PCR because it transforms data with a large number of highly correlated variables into a few principal components. In the first stage of PCR, principal component analysis is applied to predictors. The ith observation of predictor, , is linearly mapped onto a -dimensional vector, , where A is a matrix. The matrix A is obtained by the following least squares optimization problem[28]:here, is the sample mean vector, that is, . In this work, the number of projected dimensions, d, was chosen such that d principle components collectively explain 90% or more variance (and principle components do not). Then, in the second stage, regression analysis is conducted, for which the principal components, , are used as predictors. Here, the regression coefficients in the second stage are estimated by the lasso.

Simulation framework

Overview of the simulation. An overview of the simulation is presented in Fig. 1. First, the model that defines the relationship between the trait and the levels of gene expression was parameterized. This was done using the empirical data[11], which quantified the transcriptome of wild Arabidopsis halleri subsp. gemmifera weekly for two years in their natural habitat as well as bihourly on the equinoxes and solstices (p = 17,205 genes for observations). Three types of simulated data were generated using different covariance matrices of genes, denoted as (). is the sample covariance matrix of genes. Generally, none of the elements of the inverse of sample covariance matrix are exactly zero, implying that each gene interacts with all the other genes. Such a fully connected network is ineffective in terms of interpretation of the mechanism of gene regulation. Thus, two other covariance matrices were produced to simulate sparse networks based on the sample covariance matrix . is generated by the graphical lasso[29], which corresponds to the random graph. Although the graphical lasso is widely used because of its computational efficiency, real networks are often scale-free. Therefore, , which corresponds to the scale-free network, was generated here. The estimation of scale-free networks is achieved by the reweighted graphical lasso[30]. Based on these three covariance matrices (), the simulated transcriptome data were generated from the multivariate normal distribution. The simulated trait data were generated from simulated transcriptome data. Finally, lasso and PCR were applied to these simulated data to compare their prediction accuracies. The sample sizes of the simulated data were varied to investigate the relationship between prediction accuracy and sample sizes.

Figure 1

Overview of the simulation.

Evaluation of the estimation procedure

The performance of the estimation procedure is investigated by the following expected prediction error:where and follow or 3) and , respectively. The estimator is obtained using current observations, while and correspond to future observations. The (), , and are true values but unknown. In practice, these parameters are defined by using the actual dataset, (X, ). Detail of setting of these parameters will be presented in the next subsection. To estimate the expected prediction error, the Monte Carlo simulation is conducted. We first randomly generate training and test data, and , respectively. Here, follows a multivariate normal distribution with mean vector and variance–covariance matrix , where is the sample mean of X. Then, is generated by , where is a random sample from with I being an identity matrix. The test data, , are generated by the same procedure as but independent of . The number of observations for the training and test data are N () and 1000, respectively. The lasso and the PCR are performed with training data , following which RMSE is calculated in (10). The above process, from random generation of data to RMSE calculation, was performed 100 times.

Parameter setting

Covariance structures

Here, the characterization of the network structure of predictors by conditional independence is considered. When the predictors follow a multivariate normal distribution, the network structure based on the conditional independence corresponds to the nonzero pattern of the inverse covariance (precision) matrix. In other words, the network structure is characterized by the inverse covariance matrix of predictors. Let S be the sample covariance matrix of predictors, that is, . Let . is a ridge estimator of the sample variance-covariance matrix, that is, . Here is a small positive value (in this simulation, ). The term allows the existence of . Note that because is not sparse, it leads to the complete graph, which is of no use in interpreting gene regulatory networks. To generate a covariance matrix whose inverse matrix is sparse, penalization is employed for the estimation of and as follows:where are penalty terms which enhance the sparsity of the inverse covariance matrix. To estimate the sparse inverse covariance matrix, the lasso penalty is typically used as follows:where . The problem (3) is referred to as the graphical lasso[29], and there exists several efficient algorithms to obtain the solution[31-33]. The estimator of (2) with (3) corresponds to and . The lasso penalty (3) does not enhance scale-free networks. It penalizes all edges equally so that the estimated graph is likely to be a random graph, that is, the degree distribution becomes a binomial distribution. To enhance scale-free networks (i.e., power-law distribution), the log penalty[30] is used as follows:where and are tuning parameters. We note that the penalty (4) is slightly different from original definition[30], expressed asWhen we do not assume that , the estimate of the inverse covariance matrix with (5) is not symmetric. Since the original graphical lasso algorithm does not assume that [31,34], we slightly modify the penalty as in Eq. (4). Notably, in (4) coincides with (5) when . From a Bayesian viewpoint, the prior distribution which corresponds to the log penalty becomes the power–law distribution[30]; thus, the penalty (4) is likely to estimate the scale-free networks. The estimator of (2) with (4) corresponds to . Because the log-penalty (4) is nonconvex, it is not easy to directly optimize (2). To implement the maximization problem (2), the minorize-maximization (MM) algorithm[35] has been constructed[30], in which the weighted lasso penalty with current parameter is used:where are the weightsIn general, must be symmetric, so that Eq. (7) can be expressed asBecause the weighted graphical lasso can be implemented by a standard graphical lasso algorithm, the estimator is obtained as the following algorithm. To obtain and , the tuning parameters and must be determined. Following the experiments[30], was set for . To select the value of the regularization parameter , several candidates were first prepared. In our simulation, the candidates were . From these, the value of was selected such that the extended Bayesian information criterion (EBIC[36,37])was minimized. Here, q is the number of nonzero parameters of the upper triangular matrix of , and is a tuning parameter. As the value of increases, a sparser graph is generated. Note that corresponds to the ordinary BIC[38]. We set because is shown to yield good performance in both simulated and real data analyses[37]. As a result, the EBIC selected . Set . Get via ordinary graphical lasso. Repeat 2 to 4 until convergence. Update weights using (7). Get via the weighted graphical lasso (2) with penalty (6). . The upper triangular matrix must be estimated with the reweighted graphical lasso problem. A value of p = 17205 results in million parameters. As a result, with the machine used in this study (Intel Core Xeon 3 GHz, 128 GB memory), it would take several days to conduct the reweighted graphical lasso approach, even with a small number of iterations such as . For this reason, iterations were employed to produce here. Finally, and were scaled such that their signal-to-noise ratio became . Logarithm graph of the largest 30 eigenvalues of (square), (circle) and (triangle). The horizontal axis expresses the index of eigenvalues arranged in descending order. Figure 2 depicts the logarithm of the largest 30 eigenvalues of (). The first few largest eigenvalues of are significantly larger than those of , implying that the scale-free networks tend to produce predictors with large correlations.

Figure 2

Logarithm graph of the largest 30 eigenvalues of (square), (circle) and (triangle). The horizontal axis expresses the index of eigenvalues arranged in descending order.

Regression parameters

The values of and are determined as follows. First, 10-fold cross-validation is performed as described below, and the regularization parameter in (1) is selected. The data (X, Y) are divided into ten datasets, , which consist of almost equal sample sizes. Let , and (). For each j (), the training and test data are defined by and , respectively. Then, the parameter () is found by the lasso:For each j (), the verification error is calculated as follows:Then, is adopted such that it minimizes , the mean of . Following this, the dataset (X, Y) is again randomly divided into two datasets: test data and training data . Lasso estimation (1) is performed using the training data, with obtained by the above 10-fold cross-validation. Then, is defined as the lasso estimator, resulting in the number of nonzero parameters of being 259. Figure 3 shows the histogram of nonzero parameters of . It is seen that the majority of the nonzero coefficients were close to zero; only 15 parameters had absolute values larger than 0.1.

Figure 3

Histogram of 259 nonzero parameters of .

In addition, the root mean squared error (RMSE) is calculated as follows:and the variance of errors, , is defined by . Histogram of 259 nonzero parameters of .

Results

The box and whisker plot of the RMSE and the coefficient of determination () are illustrated in Figs. 4 and 5. The horizontal axis is N (the number of observations of training data) and the vertical axis is the RMSE or based on 1000 observations of test data.

Figure 4

Figure 5

Box and whisker plot of . The variance-covariance matrix used in the simulations is in (a, b), in (c, d), and in (e, f). The regression model is estimated by the lasso in (a), (c), and (e) and by PCR in (b), (d), and (f).

Box and whisker plot of RMSE. The variance-covariance matrix used in the simulations is in (a, b), in (c, d), and in (e, f). The regression model is estimated by the lasso in (a), (c), and (e) and by PCR in (b), (d), and (f). Box and whisker plot of . The variance-covariance matrix used in the simulations is in (a, b), in (c, d), and in (e, f). The regression model is estimated by the lasso in (a), (c), and (e) and by PCR in (b), (d), and (f). Scatter plot of and the eigenvector corresponding to the maximum eigenvalue of . The nonzero elements of are not drawn. We compared the performance of the lasso with that of the PCR. When and were used, the PCR performed worse than the lasso for small sample sizes. For , the prediction performance with PCR was unsatisfactory even when the sample size N increased. The poor performance of the PCR can be attributed to the predictors associated with small eigenvalues; these predictors affected the prediction performance. Figure 6 depicts a scatter plot of nonzero elements of and the eigenvector for the maximum eigenvalue of . As can be seen, only a significant amount of correlation existed; in fact, the correlation coefficient was only 0.068.

Figure 6

Scatter plot of and the eigenvector corresponding to the maximum eigenvalue of . The nonzero elements of are not drawn.

The prediction accuracy was compared among the three covariance structures. In all the cases except PCR with , the values of RMSE decreased and increased with the increase in the value of N. Further, was unstable for small sample sizes for all the cases when the lasso was applied. For large sample sizes, the of was better than that of and . As described before, was the sample covariance matrix, while (and ) was estimated using the graphical lasso. As the lasso-type regularization methods shrink parameters toward zero, the correlations among the exploratory variables reduce when the graphical lasso is used. Therefore, and resulted in smaller correlations as compared to . Consequently, the may increase with stronger correlations. We compared the RMSE results of and . With , we found that a sufficiently large number of observations is required to yield a small RMSE with the lasso. Meanwhile, resulted in a small RMSE with a relatively small number of observations, such as .

Code availability

The proposed simulation is implemented in R package simrnet, which is available at https://github.com/keihirose/simrnet. Below is a sample code of the simrnet in R: When , it took less than 12 min to conduct the simulation with 100 replications using the machine employed herein (Intel Core Xeon 3 GHz, 128 GB memory). For high-dimensional data such as p = 17,205, which was used in the simulation presented in this paper, several days were required to complete the simulation task.

Concluding remarks

In a gene regulation network, a gene regulates a small portion of a genome, not all the genes in a genome. This indicates that gene regulation network is expected to be a sparse network rather than a complete graph. Therefore, two covariance matrices indicating sparse networks (, ) were prepared in addition to a covariance matrix derived from empirical data (). Generally, although hundreds of genes contribute to defining a trait, their contributions are not equal. It is frequently observed that genes regulating a trait include a few large-effect genes and many small-effect genes. This property was reflected in the distribution of (Fig. 3). We considered the case where a limited number of regression coefficients significantly contributed to the definition of a trait. The Monte Carlo simulation result indicated that regardless of the network structure, the number of observations should be greater than at least 200 to accurately predict traits from a transcriptome (, , Figs. 4 and 5). We also found that the lasso generally provided better accuracy than the PCR. In particular, when the gene regulation network was random (), the prediction accuracy of the PCR was poor even if the sample size increased. In conclusion, it is important to sufficiently secure large sample sizes when performing regression analysis of data that exhibits either the random graph and the scale-free network. Additionally, we concluded that the lasso would be preferable to the PCR to ensure a good prediction accuracy. Conventional theory on the relationship between RMSE and sample size has been developed under the assumption that the sample size exceeds the number of exploratory variables[39]. However, omics data, which is rapidly being accumulated, results in high dimensional data with strong correlations. Thus, our simulation study considered more complicated settings than the traditional ones. Our simulation, or its extension, may be used in the future to find clues about theoretical aspects that may ultimately lead to the development of a sample size determination technique for omics data. Other than the scale-free network, the small-world network is another notable property in the networks literature[40]. The definition of the small-word networks is that the shortest path length between two randomly chosen variables is proportional to ; that is, it is considerably small compared with the network size. The small-world networks have been investigated in various fields of research, including the biology[41-43]. Some statistical properties of the small-world networks have also been studied[44-46]. The investigation of the prediction accuracy in the small-world networks would be interesting but beyond the scope of this research. We would like to take this as a future research topic. The development of methods that provides better prediction accuracy than the lasso in various network structures with small sample sizes would also be an important future research topic.

25 in total

1. Emergence of scaling in random networks

Authors:
Journal: Science Date: 1999-10-15 Impact factor: 47.728

Review 2. Network biology: understanding the cell's functional organization.

Authors: Albert-László Barabási; Zoltán N Oltvai
Journal: Nat Rev Genet Date: 2004-02 Impact factor: 53.242

3. Predicting survival from microarray data--a comparative study.

Authors: H M Bøvelstad; S Nygård; H L Størvold; M Aldrin; Ø Borgan; A Frigessi; O C Lingjaerde
Journal: Bioinformatics Date: 2007-06-06 Impact factor: 6.937

4. A Markov random field model for network-based analysis of genomic data.

Authors: Zhi Wei; Hongzhe Li
Journal: Bioinformatics Date: 2007-05-05 Impact factor: 6.937

5. Collective dynamics of 'small-world' networks.

Authors: D J Watts; S H Strogatz
Journal: Nature Date: 1998-06-04 Impact factor: 49.962

Review 6. Visualization of omics data for systems biology.

Authors: Nils Gehlenborg; Seán I O'Donoghue; Nitin S Baliga; Alexander Goesmann; Matthew A Hibbs; Hiroaki Kitano; Oliver Kohlbacher; Heiko Neuweger; Reinhard Schneider; Dan Tenenbaum; Anne-Claude Gavin
Journal: Nat Methods Date: 2010-03 Impact factor: 28.547

7. Deciphering and prediction of transcriptome dynamics under fluctuating field conditions.

Authors: Atsushi J Nagano; Yutaka Sato; Motohiro Mihara; Baltazar A Antonio; Ritsuko Motoyama; Hironori Itoh; Yoshiaki Nagamura; Takeshi Izawa
Journal: Cell Date: 2012-12-07 Impact factor: 41.582

8. Dysregulation of expression correlates with rare-allele burden and fitness loss in maize.

Authors: Karl A G Kremling; Shu-Yun Chen; Mei-Hsiu Su; Nicholas K Lepak; M Cinta Romay; Kelly L Swarts; Fei Lu; Anne Lorant; Peter J Bradbury; Edward S Buckler
Journal: Nature Date: 2018-03-14 Impact factor: 49.962

Review 9. Multi-omics approaches to disease.

Authors: Yehudit Hasin; Marcus Seldin; Aldons Lusis
Journal: Genome Biol Date: 2017-05-05 Impact factor: 13.583

Review 10. Review of biological network data and its applications.

Authors: Donghyeon Yu; Minsoo Kim; Guanghua Xiao; Tae Hyun Hwang
Journal: Genomics Inform Date: 2013-12-31