
Penalized estimation of the Gaussian graphical model from data with replicates.

Wessel N van Wieringen1,2, Yao Chen3.   

Abstract

Gaussian graphical models are usually estimated from unreplicated data. Such data, however, likely comprise both signal and noise, and the two cannot be deconvoluted from unreplicated data. In practice, the noise is then simply ignored. We point out the consequences of this practice for the reconstruction of the conditional independence graph of the signal. Replicated data allow for the deconvolution of signal and noise and the reconstruction of the former's conditional independence graph. Hereto we present a penalized Expectation-Maximization algorithm. The penalty parameter is chosen to maximize the F-fold cross-validated log-likelihood. Sampling schemes of the folds from replicated data are discussed. By simulation we investigate the effect of replicates on the reconstruction of the signal's conditional independence graph. Moreover, we compare the proposed method to several obvious competitors. In an application we use data from oncogenomic studies with replicates to reconstruct gene-gene interaction networks, operationalized as conditional independence graphs. This yields a realistic portrait of the effect of ignoring sources of variation other than sampling variation. In addition, it has implications for the reproducibility of inferred gene-gene interaction networks reported in the literature.
© 2021 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd.


Keywords:  conditional independence graph; inverse covariance; network; reproducibility; ridge penalty


Year:  2021        PMID: 33987868      PMCID: PMC8360145          DOI: 10.1002/sim.9028

Source DB:  PubMed          Journal:  Stat Med        ISSN: 0277-6715            Impact factor:   2.497


INTRODUCTION

Gaussian graphical models are used to model (static) molecular networks. These models, and subsequently the network, are learned from omics data. Such data are typically gene expression data that represent the activity of the entities (ie, genes) that constitute the nodes of the network. Data used for the aforementioned purpose are usually acquired as a side product of a clinical or an observational study. Within the context of such studies patients are characterized molecularly once, which is mainly due to financial reasons but also a lack of awareness of the importance of replicates. Consequently, for Gaussian graphical modeling one assumes that only sampling variation is present, ignoring other sources of variation. Here we investigate the consequences of this assumption for the reconstruction of the molecular network.

A Gaussian graphical model is a multivariate normal distribution, N(0_p, Ω⁻¹), where Ω is the inverse covariance matrix, henceforth called the precision matrix. The specification of the multivariate normal in terms of the precision matrix is due to the fact that the off-diagonal elements of Ω contain the information on the conditional (in)dependencies among the variates. A zero off-diagonal element implies that the corresponding variates are conditionally independent, given all other variates, while a nonzero off-diagonal element indicates that there is no such conditional independence. For more on Gaussian graphical models refer to the monographs of Whittaker and Lauritzen.

The parameter Ω of the Gaussian graphical model is usually estimated by means of likelihood maximization. The estimation requires a sample of p-dimensional, independent random vectors Y_i, i = 1, …, n, from the distribution N(0_p, Ω⁻¹). The maximum likelihood estimator of the precision matrix then is the inverse of the sample covariance matrix: Ω̂ = S⁻¹, where S = n⁻¹ Σ_{i=1}^n Y_i Y_i^T. When p > n, this estimator is not defined as S is singular.
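The singularity of S in the high-dimensional regime is quickly checked numerically; a minimal sketch with arbitrary dimensions:

```python
import numpy as np

# Minimal illustration: when p > n the sample covariance matrix S has
# rank at most n, so it is singular and its inverse (the maximum
# likelihood estimator of the precision matrix) does not exist.
rng = np.random.default_rng(0)
n, p = 10, 25
Y = rng.standard_normal((n, p))
S = (Y.T @ Y) / n
print(np.linalg.matrix_rank(S))   # at most n = 10, far below p = 25
```

Any penalized estimator must therefore guarantee invertibility by other means than the data alone.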
One then resorts to penalized maximum likelihood procedures that define the estimator as the maximizer of the log-likelihood augmented with a penalty term. Both the Gaussian graphical model and the maximum likelihood estimator of its parameter assume that the variation among the random vectors is only due to the sampling from N(0_p, Ω⁻¹) and that other sources of variation are absent. In practice, however, such alternative sources of variation, for example, measurement error, are likely to be present in any data, as was realized almost a century ago (Shewhart, p. 378): “An element of chance enters into every measurement; hence every set of measurements is inherently a sample of certain more or less unknown conditions. Even in those few instances where we believe that the objective reality being measured is constant, the measurements of this constant are influenced by chance or unknown causes.” Translated to the present context: to acquire an individual's molecular profile, for instance, the preparation of the sample and the experimentation contribute substantially to the noise in the eventual observation. This dilutes the biological signal present in the data. The presented maximum likelihood estimator then does not estimate the signal precision matrix but a convoluted version of it. The convoluted version may harbor different conditional (in)dependence relations than the original one (examples are given in Section 2). The direct in-house motivation behind this work stems from an omics study in which only a few samples were replicated. These replications were due to doubts about the quality of some measurements (hybridizations). After closer inspection, it turned out that both hybridizations of the few replicated samples were of acceptable quality, and both were included in the dataset. Standard methods for the estimation of Gaussian graphical models cannot accommodate replicates. To apply such methods would require us to choose one replicate and ignore the other.
This felt undesirable and suboptimal. We thus proceeded to modify the aforementioned existing methodology to accommodate replicates. In addition, we seized the opportunity to exploit the inclusion of replicates in the learning of Gaussian graphical models. Here we present the results of this endeavour. In this work, we investigate how ignoring other sources of variation, such as measurement error, affects the estimated Gaussian graphical model. Hereto we consider studies with a design that is partially replicated. The replicates enable the separation of the sampling variation from that of other causes. Data from such a study are described by a Gaussian graphical model endowed with a “signal+noise” structure. We present its maximum likelihood estimation, in particular high‐dimensionally, including the choice of the penalty parameter. In an extensive simulation study, we then investigate the effect of taking into account other sources of variation on the estimation of the signal precision and the related conditional (in)dependence graph. The paper closes with a re‐analysis of several Cancer Genome Atlas studies that repeatedly characterized a subset of the included samples transcriptomically by different platforms. The re‐analysis demonstrates the effect of ignoring variation due to technical and experimental differences between platforms on the reproducibility of reconstructed molecular networks.

Related work

Wainwright considers Gaussian graphical model estimation from corrupted data, which, in light of Shewhart's quote above, could better be called realistic data. A corrupted observation is formed by the sum of a signal and a noise random variable. Both variables are drawn independently from two different multivariate Gaussian distributions. Wainwright discusses the estimation of the signal's precision matrix, in which knowledge of the noise's precision matrix is assumed to be available from means other than the data at hand (ie, effectively known). This knowledge is then used to correct the sample covariance matrix, from which, through a corrected graphical lasso procedure, the signal's precision matrix is estimated. Hence, replicates are not considered as a means to unravel signal and noise. The MAQC/SEQC (MicroArray/SEquencing Quality Control) initiatives have used replicates to study the reproducibility of findings reported by studies involving molecular high-throughput techniques. Of particular interest here, as that issue is revisited in Section 5, is the study of Zhang et al. In that study, micro-array and RNA-seq platforms are compared with respect to transcriptomic characterization of a certain cancer and clinical endpoint prediction, but not gene-gene interaction network construction. The reproducibility of reconstructed networks has been studied previously. Langfelder et al quantify the evolutionary preservation of networks by comparing networks reconstructed from data of mice and man. Bellot et al carried out a benchmark study of network reconstruction methods, comparing their reproducibility between two subsamples of the same experiment. Finally, Vinciotti et al assessed the reproducibility of networks reconstructed from micro-array and RNA-seq platforms, concluding it is poor at the individual edge level but better at an aggregate one.
While only the latter study of Vinciotti et al considers a study with replicates, even there the dependency among replicates is not addressed explicitly. Hence, even Vinciotti et al do not separate signal from (technical) noise.

EXPERIMENT, DATA, AND MODEL

Consider an unstructured observational study with certain samples interrogated molecularly multiple times. Let Y_{i,k} be a p-dimensional random variable representing the data resulting from the k-th replicate, with k = 1, …, K_i, of this measurement on sample i = 1, …, n. We model the data from the described study by an additive model: Y_{i,k} = Z_i + ε_{i,k}. In this model Z_i can be thought of as the signal present in sample i, while ε_{i,k} represents the noise in the k-th replicate of sample i. We assume the signal and error both to follow a multivariate normal distribution but with different covariance matrices (specified through their inverses, the precision matrices): Z_i ~ N(0_p, Ω_z⁻¹) and ε_{i,k} ~ N(0_p, Ω_ε⁻¹), respectively. Additionally, we take the signals and errors to be independent, in the sense that Z_{i1} and Z_{i2} are independent for i1 ≠ i2, ε_{i,k1} and ε_{i,k2} are independent for k1 ≠ k2, and every Z_i is independent of every ε_{i,k}. Thus, Y_{i,k} ~ N(0_p, Ω_z⁻¹ + Ω_ε⁻¹), with the following marginal and conditional (in)dependence relations: observations from different samples are independent, while replicates of the same sample are marginally dependent through the shared signal Z_i but conditionally independent given Z_i.

The above simple “signal+noise” model enables us to illustrate the effect of only taking sampling variation into account when estimating a Gaussian graphical model. In the presence of other sources of variation, as captured by the parameter Ω_ε, one ought to infer the conditional independencies from Ω_z. Common practice, however, bases this inference on the precision matrix of the observations, (Ω_z⁻¹ + Ω_ε⁻¹)⁻¹. This leads to false positively and false negatively inferred edges. To see this, one may construct a numerical example in which an off-diagonal element of Ω_z equals zero while the corresponding element of (Ω_z⁻¹ + Ω_ε⁻¹)⁻¹ does not; there, ignorance of all sources but sampling variation induces a false-positive edge. A numerical example for the opposite, a false-negative edge, is also easily constructed. The difference in the conditional independence graphs inferred from Ω_z and (Ω_z⁻¹ + Ω_ε⁻¹)⁻¹ can be quantified more generally. Hereto use the result of Miller on the inverse of a sum of two matrices to write the observation precision matrix in terms of Ω_z: (Ω_z⁻¹ + Ω_ε⁻¹)⁻¹ = Ω_z − Ω_z (Ω_z + Ω_ε)⁻¹ Ω_z. Hence, nonzero off-diagonal elements of Ω_z (Ω_z + Ω_ε)⁻¹ Ω_z reveal differences in the strength of the edges of the conditional independence graphs inferred from observation and signal precision matrices.
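The false-positive mechanism can be checked numerically; a minimal sketch in which the matrices are illustrative stand-ins (not the paper's original example):

```python
import numpy as np

# Hypothetical stand-in matrices: the signal precision omega_z has a
# zero in entry (1,3), ie variates 1 and 3 are conditionally
# independent in the signal.
omega_z = np.array([[2.0, 1.0, 0.0],
                    [1.0, 2.0, 1.0],
                    [0.0, 1.0, 2.0]])
omega_e = np.array([[3.0, 0.5, 0.5],   # non-diagonal error precision
                    [0.5, 3.0, 0.5],
                    [0.5, 0.5, 3.0]])

# Precision matrix of the observations Y = Z + eps.
omega_y = np.linalg.inv(np.linalg.inv(omega_z) + np.linalg.inv(omega_e))

print(omega_z[0, 2], omega_y[0, 2])   # zero in the signal, nonzero observed
```

Inferring edges from `omega_y` would thus declare a (false-positive) edge between variates 1 and 3 that is absent in the signal.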
To write down the likelihood of the data under the specified model, the following lemma, a generalization of the result presented in appendix A of Riebler et al, is needed. It specifies the elements of the inverse of the joint covariance matrix of the vector of replicates of a sample. Let Y be a multivariate normal random variable partitioned into K equally sized blocks as Y = (Y_1^T, …, Y_K^T)^T with covariance matrix Σ_Y = 1_{K×K} ⊗ Σ_z + I_{K×K} ⊗ Σ_ε, with Σ_z = Ω_z⁻¹ and Σ_ε = Ω_ε⁻¹ both p × p dimensional, symmetric and positive definite matrices. Then its covariance matrix has determinant |Σ_Y| = |Σ_ε|^{K−1} · |Σ_ε + K Σ_z|, while the blocks of its inverse equal (Σ_Y⁻¹)_{k,k} = Σ_ε⁻¹ − Σ_ε⁻¹ Σ_z (Σ_ε + K Σ_z)⁻¹ and (Σ_Y⁻¹)_{k,k′} = −Σ_ε⁻¹ Σ_z (Σ_ε + K Σ_z)⁻¹ for k, k′ = 1, …, K and k ≠ k′. Moreover, the blocks of Σ_Y⁻¹ satisfy: (Σ_Y⁻¹)_{k,k} − (Σ_Y⁻¹)_{k,k′} = Σ_ε⁻¹ and (Σ_Y⁻¹)_{k,k} + (K − 1)(Σ_Y⁻¹)_{k,k′} = (Σ_ε + K Σ_z)⁻¹. The determinant identity follows from the factorization of Σ_Y, the determinant of a Kronecker product (cf, section 16.3.e of Harville), the specifics of the eigenvalues of 1_{K×K}, and the use of well-known results from standard linear algebra on eigen-decompositions and determinants. Furthermore, the inverse is verified by use of straightforward linear algebra. Finally, the relations for the blocks of the inverse are immediate from these analytic expressions. Invoking Lemma 1 and some algebraic manipulations, the log-likelihood of the data can now be formulated in terms of Ω_ε and (Σ_ε + K Σ_z)⁻¹ together with two sample covariance-type statistics: one formed from the replicate-wise averages of the data and one from the within-sample deviations from these averages.
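The determinant identity and block structure of Lemma 1 (in the random-intercept form Σ_Y = 1_{K×K} ⊗ Σ_z + I ⊗ Σ_ε assumed here) can be verified numerically on random positive definite matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
p, K = 3, 4

def random_spd(p):
    # Random symmetric positive definite p x p matrix.
    A = rng.standard_normal((p, p))
    return A @ A.T + p * np.eye(p)

sigma_z, sigma_e = random_spd(p), random_spd(p)

# Covariance of the stacked replicates: 1_{KxK} (x) Sigma_z + I_K (x) Sigma_e.
sigma_y = np.kron(np.ones((K, K)), sigma_z) + np.kron(np.eye(K), sigma_e)

# Determinant identity: |Sigma_y| = |Sigma_e|^(K-1) |Sigma_e + K Sigma_z|.
lhs = np.linalg.slogdet(sigma_y)[1]
rhs = (K - 1) * np.linalg.slogdet(sigma_e)[1] \
    + np.linalg.slogdet(sigma_e + K * sigma_z)[1]

# Blocks of the inverse: off-diagonal and diagonal p x p blocks.
inv = np.linalg.inv(sigma_y)
off = -np.linalg.inv(sigma_e) @ sigma_z @ np.linalg.inv(sigma_e + K * sigma_z)
diag = np.linalg.inv(sigma_e) + off
```

The log-determinants are compared (rather than raw determinants) for numerical stability.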

ESTIMATION

We estimate the parameters Ω_z and Ω_ε by means of likelihood maximization. The maximizer is found by means of the EM algorithm, an iterative procedure that alternates between the so-called E- and M-steps. The procedure starts from initial parameter estimates. In the E-step, or Expectation step, sufficient statistics for the estimation of the parameters are obtained. In the M-step, or Maximization step, the parameter estimates are updated by means of (complete) likelihood maximization, given the data and the acquired sufficient statistics.

The E-step produces sufficient statistics for the distribution of the unobserved Z_i. As the Z_i follow a multivariate normal distribution, the sufficient statistics are the sample versions of its first two moments. But as the Z_i are unobserved, these are replaced by the expectation of these two sample moments conditional on the data, using the current parameter estimates. These expectations are found from the joint distribution of (Z_i, Y_{i,1}, …, Y_{i,K_i}). This distribution is a zero-centered multivariate normal whose covariance matrix follows from the model specification; its inverse follows from the analytic expression of the inverse of a 2 × 2 block matrix (theorem 8.5.11 of Harville) in combination with Lemma 1, and its determinant, |Ω_z|⁻¹ adjusted for the error contribution, is immediate from theorem 13.3.8 of Harville. Then, using theorem 2.5.1 of Anderson, which provides an analytic expression of the conditional distribution of a subset of variates given the others, the conditional expectations of the sufficient statistics are: E(Z_i | Y_{i,1}, …, Y_{i,K_i}) = (Ω_z + K_i Ω_ε)⁻¹ K_i Ω_ε Ȳ_i, with Ȳ_i the replicate-wise average, and Var(Z_i | Y_{i,1}, …, Y_{i,K_i}) = (Ω_z + K_i Ω_ε)⁻¹. The first conditional moment is obtained by means of the result presented in Lemma 1, and the second then follows from the Inverse Variance lemma (proposition 5.7.3 of Whittaker). These moments need to be evaluated for each sample i, which involves the inverse of p × p-dimensional matrices that depends on K_i.
Computationally, it is then most efficient to evaluate these moments for groups of samples with an identical number of replicates, such that redundant inversions are avoided. Finally, for use in the M-step these sufficient statistics are evaluated by plugging in the current estimates of the precision matrices.

The M-step finds updates of the parameter estimates, given the estimates of the Z_i obtained in the E-step, through maximization of the so-called complete likelihood, which is the joint likelihood of (Z_i, Y_{i,1}, …, Y_{i,K_i}), i = 1, …, n. Taking the logarithm yields the complete log-likelihood. For the expectation of the complete log-likelihood, simply replace the sample moments of the Z_i by their expectations with respect to the conditional distribution of the Z_i given the data, as derived above. Maximization of the expected complete log-likelihood can now be done with respect to the two parameters separately. This yields Ω̂_z as the inverse of the conditionally expected sample covariance matrix of the Z_i, and Ω̂_ε as the inverse of the conditionally expected sample covariance matrix of the errors Y_{i,k} − Z_i. In these updates, the E(Z_i | Y_i) and Var(Z_i | Y_i), as obtained in the E-step, are used for the evaluation of the updated estimates.

The EM algorithm applies the E- and M-steps iteratively until convergence. Convergence is reached when the log-likelihood no longer improves much between subsequent iterations. Following Zhu and Melnykov, we operationalize this as the absolute relative change in the complete log-likelihood. Convergence of the algorithm is then warranted by Jensen's inequality, which implies (after some algebra) that an improvement in the complete log-likelihood implies one in the log-likelihood.

Omics studies, from which the gene-gene interaction network is reconstructed, are often undersampled. The resulting high-dimensional situation requires a modification of the loss criterion. A penalty augments the log-likelihood to ensure the existence of a unique and well-defined estimator. Here the ridge penalty, the sum of the squares of the elements of the precision matrices, each with its own penalty parameter, that is, λ_z‖Ω_z‖²_F + λ_ε‖Ω_ε‖²_F with ‖·‖_F the Frobenius norm, is used.
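One E- and M-step for the balanced case (K_i = K for all i) can be sketched as follows; this is an illustrative, unpenalized version written for this summary (function and variable names are not the authors'):

```python
import numpy as np

def em_step(Y, omega_z, omega_e):
    """One EM iteration for the signal+noise model Y_ik = Z_i + eps_ik
    with balanced replication. Y has shape (n, K, p). Sketch only; the
    paper's algorithm adds a ridge penalty in the M-step.
    """
    n, K, p = Y.shape
    ybar = Y.mean(axis=1)                    # replicate-wise averages
    # E-step: posterior moments of the latent signal Z_i given the data.
    post_var = np.linalg.inv(omega_z + K * omega_e)   # Var(Z_i | Y_i)
    m = (K * ybar @ omega_e) @ post_var               # E(Z_i | Y_i), rows
    # M-step: conditionally expected sample covariances, then invert.
    sigma_z = (m.T @ m) / n + post_var
    resid = Y - m[:, None, :]                         # Y_ik - E(Z_i | Y_i)
    sigma_e = np.einsum('ikp,ikq->pq', resid, resid) / (n * K) + post_var
    return np.linalg.inv(sigma_z), np.linalg.inv(sigma_e)

# Toy usage on simulated data with identity signal and moderate noise.
rng = np.random.default_rng(0)
n, K, p = 200, 2, 5
Z = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)
Y = Z[:, None, :] + 0.5 * rng.standard_normal((n, K, p))
oz, oe = em_step(Y, np.eye(p), np.eye(p))
```

Iterating `em_step` until the relative change in the (complete) log-likelihood is small gives the unpenalized estimator.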
The maximizer of the thus penalized log-likelihood is found by means of a penalized EM algorithm, derived as its unpenalized counterpart. Effectively, the penalization leaves the E-step unaffected and leads to a minor modification of the M-step. In the latter, the expectation of the complete log-likelihood (1) augmented with the ridge penalty is now maximized with respect to the parameters. This can be done per parameter separately and yields a closed-form ridge update (cf, van Wieringen and Peeters) for Ω_z, evaluated at the E-step statistics. For that of Ω_ε, replace the relevant statistics and penalty parameter by their counterparts. In the above, the ridge penalty may be replaced by the graphical lasso penalty: λ_z‖Ω_z‖_1 + λ_ε‖Ω_ε‖_1. Estimates of Ω_z and Ω_ε are then found by a row/column updating scheme.
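The closed-form ridge precision update referred to solves the estimating equation Ω⁻¹ − S − λΩ = 0 spectrally; a sketch with a zero penalty target (the general van Wieringen-Peeters estimator also allows a nonzero target matrix):

```python
import numpy as np

def ridge_precision(S, lam):
    """Ridge (Frobenius) penalized precision estimator with zero
    target: the unique positive definite solution of
    Omega^{-1} - S - lam * Omega = 0, obtained by applying the scalar
    root of lam*w^2 + s*w - 1 = 0 to each eigenvalue s of S.
    """
    vals, vecs = np.linalg.eigh(S)
    w = (np.sqrt(vals**2 / 4 + lam) - vals / 2) / lam
    return (vecs * w) @ vecs.T    # V diag(w) V^T

# Usage on an arbitrary (possibly indefinite) symmetric input matrix.
rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))
S = (A + A.T) / 2
Om = ridge_precision(S, lam=1.0)
```

Note that the estimator exists and is positive definite even when S is singular or indefinite, which is what makes it usable inside the penalized M-step.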

Diagonal

We consider a simplification of the model for the estimation of the signal and error precision matrices in high dimensions. While the inclusion of replicates in the design enables the separation of sampling variation from that of other sources, it brings about the estimation of additional parameters (compared to the estimation of only Ω_z from an unreplicated design). In addition, an extra penalty parameter needs to be chosen. The recovery of conditional independencies is already a challenging task (especially from high-dimensional studies), but it is further hampered by penalization. Penalization tends to shrink the precision's off-diagonal elements more than its diagonal ones and thereby obstructs the deconvolution of the contributions of signal and error to the conditional (in)dependencies. The simplification of the model may be achieved by the adoption of assumptions on the structure of the precision matrices. This is undesirable for the signal precision matrix, as it is the (conditional) relations within the signal that are of primary interest. However, it may be acceptable to make such an assumption for Ω_ε, as interest is not in the dependencies among the elements of the error ε_{i,k}. Their independence may therefore be a reasonable simplification (which is investigated in Sections 4 and 5.2). This independence assumption corresponds to a diagonal Ω_ε, which involves only p parameters. Incorporation of the diagonality assumption on Ω_ε into the estimation requires only a minor modification of the penalized EM algorithm. In the M-step the complete likelihood (1) is now maximized with respect to Ω_ε over diagonal matrices only, that is, elementwise over its diagonal entries for j = 1, …, p, leaving the estimate of Ω_z unaffected. In particular, the need for penalization of Ω_ε has vanished, as the resulting estimate is well-defined by the independence assumption and the positivity of the estimates of its diagonal elements. The gain in computation time bought by this diagonal assumption is investigated in SM If of Appendix S1.
We illustrate the effect of the diagonal error assumption on the reconstruction of the conditional independence graph. For simplicity, we assume here K_i = K for all i. We then study the limiting behavior, in either n or K, of the M-step's estimate of the signal covariance, writing the inverse of this estimator with the analytic expressions for E(Z_i | Y_i) and Var(Z_i | Y_i) substituted. For a limiting sample size, note that, by the law of large numbers, the sample covariance of the replicate-wise averages converges to its population counterpart. Substitution of this limit into the estimator yields, after some linear algebraic manipulations, a limit in which no assumption on the error precision matrix has been made. Should we erroneously have assumed a diagonal error precision matrix, temporarily taken as known, the limit acquires a second summand that characterizes the effect of the diagonal error precision matrix assumption; it vanishes when diagonality is justified. On the other hand, with a fixed sample size n but a large number of replicates K, the assumption becomes irrelevant: in the M-step of the algorithm the error contribution vanishes as K → ∞. Intuitively, this is evident when the Z_i are estimated by the average of the Y_{i,1}, …, Y_{i,K}. For large K the error thus averages out. Consequently, the diagonal error assumption does not affect the estimate of Z_i, or the associated Ω_z. For small n and K, simulations revealed (not shown) that the model with a full error precision matrix performs (slightly) better in edge recovery than that with a diagonal one. The fit of the former is generally better than that of the latter, unless, of course, the error precision matrix is indeed diagonal.
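The averaging-out of the error for growing K can be demonstrated directly; an illustrative check with hypothetical parameter values (the covariance of the replicate-wise averages equals Σ_z + Σ_ε/K, so its deviation from the pure-signal covariance shrinks like 1/K):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 4, 50000
sigma_z = 0.5 * np.ones((p, p)) + 0.5 * np.eye(p)   # signal covariance
sigma_e = 0.3 * np.ones((p, p)) + 0.7 * np.eye(p)   # non-diagonal error covariance

devs = []
for K in (1, 4, 16):
    Z = rng.multivariate_normal(np.zeros(p), sigma_z, size=n)
    eps = rng.multivariate_normal(np.zeros(p), sigma_e, size=(n, K))
    ybar = Z + eps.mean(axis=1)          # replicate-wise averages
    emp = (ybar.T @ ybar) / n            # their empirical covariance
    # Deviation from the pure-signal covariance Sigma_z.
    devs.append(np.max(np.abs(emp - sigma_z)))
print(devs)
```

As K grows, the structure of Σ_ε (diagonal or not) contributes ever less to the averages, which is the intuition behind the irrelevance of the diagonality assumption for large K.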

Penalty selection

We choose the penalty parameters λ_z and λ_ε for the signal and error precision matrices by means of F-fold cross-validation (with 2 ≤ F ≤ n). This procedure evaluates, for given λ_z and λ_ε, the performance (in some sense) of the estimated precision matrices on novel data. We consider the (λ_z, λ_ε)-combination that yields the best performance on these data to be optimal. We use this optimal penalty parameter combination to arrive at the final estimates of the two precision matrices. With novel data unavailable for performance evaluation, they are mimicked by sample splitting. This splits the data into F equally sized groups (henceforth called splits). The splits are left out one at a time to represent the “novel” data. Data from the remaining splits are used to obtain the precision matrices' estimates, while their performance is assessed on the “novel” data from the left-out split. Each split plays the role of “novel” data once, which results in F performance estimates. We take the average of the F performances to be indicative of the performance of the precision matrix estimators for the employed (λ_z, λ_ε)-combination. The most commonly used performance measure for the selection of the penalty parameters of penalized precision matrix estimators is the cross-validated log-likelihood, that is, the averaged (over the splits) log-likelihood of data from the left-out split given the estimates derived from the data of all-but-the-left-out splits. We use this criterion here too. For practical purposes, the log-likelihood needs to be evaluated computationally efficiently, as the cross-validated log-likelihood requires the calculation of the log-likelihood F times for each (λ_z, λ_ε)-combination. To achieve this efficiency, an eigen-decomposition V D V^T, with the p × p-dimensional matrix V containing the eigenvectors as columns and the diagonal matrix D the corresponding eigenvalues on its diagonal, is used to rewrite the log-likelihood in terms of the sample covariance-type statistics defined at the end of Section 2.
Clearly, this avoids the formation and inversion of the (K_i p) × (K_i p)-dimensional covariance matrix for each different number of replicates. A study design with replicates allows for various ways of constructing the F cross-validation splits. Three strategies may be conceived:

Replicate-based splitting: Form a (Σ_i K_i) × p-dimensional matrix with each row containing the data from a replicate. Then divide the rows randomly over the F splits.

Sample-based splitting: Divide the samples randomly over the F splits. A split's data are then formed by the replicated data of the samples that have been assigned to the split.

Stratified splitting: Stratify for the number of replicates while randomly assigning the same number of samples to each split. This ensures that the distribution of the number of replicates in each split is representative of the prevalence of the K_i's encountered in the study.

Taken at face value the above splitting strategies may all seem valid. However, the first two may yield splits that are neither representative nor balanced. In particular, the first strategy may, when F is large or the K_i are small, yield splits that are unlikely to contain replicated observations of the same sample. In practice, K_i is usually small, rendering replicate-based splitting a poor choice. The sample-based strategy may, when the number of replicates is unbalanced among samples, occasionally produce splits that accidentally comprise much more data than others. The resulting cross-validated performance need then not be representative. Hence, generally, the third strategy is the safest option and is employed in the remainder. However, if the number of replicates is common to all samples, stratified and sample-based splitting are equivalent. In a simulation study we compared the consequences of sample-based and stratified splitting for the reconstruction of the conditional independence graph, as well as the fold size. The results are presented in SM If of Appendix S1.
In this study a sample's number of replicates equals either one or four, randomly chosen in a two-to-one ratio. Little to no difference is observed in the reconstruction performance. This suggests that generally both splitting strategies are viable. The optimal (λ_z, λ_ε)-combination, that is, the combination of penalty parameter values that yields the precision matrix estimates with the best cross-validated performance, can in principle be found by a simple exhaustive grid search. Here we use the quasi-Newton approach of Byrd et al available through the optim-function of R. Alternatively, a tailor-made gradient ascent or descent approach may be developed as outlined in the work of Feng and Simon. But, in light of the limited number of penalty parameters to be optimized over, the latter is not expected to give a substantial computational gain in comparison to the employed quasi-Newton approach and is therefore not pursued.
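The stratified splitting strategy can be sketched as follows (a hypothetical helper written for this summary, grouping samples by their replicate count before dealing them over the folds):

```python
import random
from collections import defaultdict

def stratified_folds(n_replicates, F, seed=0):
    """Assign samples to F cross-validation folds, stratifying on the
    number of replicates so each fold's replicate distribution is
    representative (the third splitting strategy above).

    n_replicates: list with K_i, the replicate count of sample i.
    Returns a list with fold[i] in {0, ..., F-1} for each sample.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for i, k in enumerate(n_replicates):
        strata[k].append(i)
    fold = [None] * len(n_replicates)
    for samples in strata.values():
        rng.shuffle(samples)
        # Deal the shuffled samples of this stratum over the folds.
        for j, i in enumerate(samples):
            fold[i] = j % F
    return fold

# Example: two-thirds of the samples have one replicate, one-third four.
ks = [1] * 12 + [4] * 6
folds = stratified_folds(ks, F=3)
```

Each fold then receives the same number of singleton and of quadruply replicated samples, so no fold accidentally comprises much more data than another.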

SIMULATION

We study the quantification of the signal from replicated data through simulation. In the simulation, the support of the employed signal precision matrices corresponds to archetypical topologies such as a chain, block, and scale-free network. The error precision matrices are either diagonal or have a common nonzero off-diagonal conditional covariance. Furthermore, the sample size n ∈ {10, 25, 50, 75, 100}, the dimension p ∈ {10, 25, 50}, and the number of replicates K_i ∈ {1, 2, 3, 4}. Each setup is repeated a hundred times. Full simulation setup details are given in SM Ia of Appendix S1. The aforementioned quantification comprises the performance of the signal precision matrix estimator by the Frobenius loss as well as the ability to reconstruct the signal's conditional independence graph through the AUC and pAUC (partial Area Under the Curve). The latter two statistics are calculated using an edge selection procedure based on the absolute value of the partial correlations obtained from the estimated signal precision matrix. We place a threshold on these absolute values and select the edges with values exceeding the threshold. The selected edges are compared to the true graph to obtain the specificity and sensitivity. The threshold is varied over the unit interval. From the resulting (specificity, sensitivity)-pairs, we calculate the AUC and pAUC. Using the aforementioned performance measures, we first study the effect of (the number of) replicates for various sample sizes and dimensions, but also parameter choices. Results are presented in Figure 1 and in SM Ib of Appendix S1, which, for reasons of space and brevity, are limited to one representative combination of signal and error precision matrix. The results indicate that the performance of the estimator improves in all senses specified above. This performance gain is largest from K = 1 to K = 2 and levels off for larger numbers of replicates.
However, instead of characterizing each sample in duplicate, it is generally more rewarding to double the sample size, as that appears to yield a larger improvement in performance. New samples are of course easily acquired in a simulation study, but this need not be a trivial exercise in a clinical context. Finally, it should be kept in mind that these conclusions are confined to the particulars of the parameter choices. For instance, simulations (not shown) with a smaller signal-to-noise ratio reveal that the levelling off occurs at larger K.
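The edge-recovery evaluation above can be sketched in code; for brevity this hypothetical helper computes the full AUC via the rank (Mann-Whitney) formulation, whereas the paper additionally reports the pAUC, which restricts the integration range of the ROC curve:

```python
import numpy as np

def edge_auc(prec_est, adj_true):
    """Edge-recovery AUC: score candidate edges by the absolute partial
    correlation implied by an estimated precision matrix, then compare
    against the true adjacency via the rank form of the AUC.
    """
    p = prec_est.shape[0]
    d = np.sqrt(np.diag(prec_est))
    pcor = -prec_est / np.outer(d, d)          # partial correlations
    iu = np.triu_indices(p, k=1)               # candidate edges
    scores, truth = np.abs(pcor[iu]), adj_true[iu].astype(bool)
    pos, neg = scores[truth], scores[~truth]
    # P(score of a true edge exceeds that of a non-edge); ties count half.
    gt = (pos[:, None] > neg[None, :]).mean()
    eq = (pos[:, None] == neg[None, :]).mean()
    return gt + 0.5 * eq

# Toy check: one true edge (1,2) carried by a strong partial correlation.
prec = np.array([[2.0, -1.0, 0.0],
                 [-1.0, 2.0, 0.0],
                 [0.0, 0.0, 2.0]])
adj = np.zeros((3, 3), dtype=int)
adj[0, 1] = adj[1, 0] = 1
auc = edge_auc(prec, adj)
```

Sweeping a threshold over the scores, as the text describes, traces out the same ROC curve that this rank statistic summarizes.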
FIGURE 1

Various simulation results w.r.t. edge recovery for a banded signal precision matrix and a uniform error precision matrix. All plots show the partial AUC, integrated w.r.t. 1 − specificity from 0 to 0.1, of 100 simulation runs. In the top panel p = 50 and the pAUCs are plotted against various (n, K)-combinations. The left bottom panel plots, for p = 10, 25, 50, the averaged pAUC vs the number of replicated samples with Σ_i K_i = 100 and all K_i ∈ {1, 2}. The right bottom panel, in which (n, p, K) = (50, 50, 2), shows boxplots of pAUCs of five methods. Legend for the labels at its tick marks: “L2, full Ω_ε”: ridge penalized EM algorithm without the diagonal error precision matrix assumption; “L2, diag Ω_ε”: ridge penalized EM algorithm with the diagonal error precision matrix assumption; “L2, Y average”: ridge penalized estimation of Ω_z from replicate-wise averaged data; “L1, diag Ω_ε”: lasso penalized EM algorithm with the diagonal error precision matrix assumption; “L1, Y average”: lasso penalized estimation of Ω_z from replicate-wise averaged data. [Colour figure can be viewed at wileyonlinelibrary.com]

A tangible implication of two replicates (K = 2) over that of a single (K = 1) observation per individual is an improvement of the estimates. The elements of the estimated precision matrix are, on average over all employed settings and topologies, 0.03 closer to their true value. Similarly, the off-diagonal elements of the corresponding partial correlation matrices are, again on the same average, 0.02 closer to their true value. This improvement is largest for the larger sample sizes, the smaller dimensions, and the larger elements of the signal precision matrix. It can then go up to 0.1 for off-diagonal elements, and even over 0.2 for diagonal elements (of the precision matrix). Another takeaway of this simulation can be deduced from the scale of the y-axis of Figure 1 and its companions in Appendix S1. On the basis of pure chance, one would expect the pAUCs to be around 0.005.
Simultaneously, the maximum achievable pAUC is 0.1. While the results clearly exceed the chance benchmark, they are not close to their maximum. This demonstrates the notorious difficulty of the network reconstruction problem for moderate dimensions (p = 50). The difficulty is readily grasped when realizing that, for p variates, one needs to estimate p(p + 1)/2 parameters (counting only those related to the signal) from a small number of samples. However, the achieved pAUCs are also due to the chosen simulation settings. Other settings, for example, less noise or stronger effect sizes, would have yielded a better pAUC. But, for instance, we based the sample size range on practice, where our in-house studies rarely exceed a hundred samples and often involve fewer. Nonetheless, the pAUC plots, as the simulation intends, clearly show the effect of the inclusion of replicates, in particular in relation to the sample size and dimension. Finally, the reported pAUCs serve as a warning that, for a small sample size and few duplicates, results are in urgent need of validation. Secondly, we assess whether the full design needs replication, or whether it is best to replicate only part of the samples. Hereto we adopt the settings of the previous simulation, with the following modification. We set the total number of measurements, Σ_i K_i with K_i ∈ {1, 2} for all i, equal to one hundred. Under this restriction, we vary the number of samples with a single observation and with two replicates. With this study design, the above simulation is repeated. The bottom left panel of Figure 1 shows the achieved pAUC against the number of replicated samples. At first, there is a clear gain with each additional replicated sample, although less obvious for the p = 50 case. This gain, however, levels off after a certain number of replicated samples; the precise number depends, among others, on the dimension and signal-to-noise level. It even goes down when further samples are replicated.
This indicates that at some point it is more worthwhile to include biological—should they be available—rather than technical replicates. Especially when p = 50, the gain of replication is limited and more biological samples are to be preferred. Thirdly, we compare in simulation the proposed method to some obvious competitors: (i) the lasso penalized EM algorithm with a diagonal error precision matrix, (ii) the ridge precision estimator applied to replicate-wise averaged data, and (iii) the graphical lasso precision estimator applied to replicate-wise averaged data. The simulation setup is as above but with K = 2 throughout and n ∈ {10, 25, 50}. Results are presented as boxplots in the figures of SM Ie of Appendix S1, again limited to a representative combination of signal and error precision matrix. The main takeaways are twofold. Firstly, the ridge penalized methods generally outperform their lasso counterparts in terms of network reconstruction, in particular for the larger p. Secondly, network reconstruction by means of the ridge and lasso precision estimators from averaged data works reasonably well (the latter only with large n and small p). That is, the support of the estimated network based on averaged data does not substantially differ with respect to the AUC-type measures. However, the values of the estimated signal precision matrices on the basis of averaged data can differ substantially in terms of the Frobenius norm. A more detailed conclusion is given in SM Ie of Appendix S1. Originally, we included a lasso penalized EM algorithm with a penalty on both precision matrices. The resulting algorithm's convergence was slow, while the search for optimal cross-validated penalty parameters was prohibitively slow. Moreover, the results of the other lasso penalized methods, that is, (i) and (iii), suggest its performance would not exceed that of their ridge counterparts.
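Competitor (ii), replicate-wise averaging followed by ridge estimation, can be sketched in a few lines. The simple estimator (S + λI)⁻¹ below is an illustrative stand-in for the ridge precision estimator actually used in the comparison, and the array layout is an assumption of ours.

```python
import numpy as np

def ridge_precision_from_averages(Y, lam):
    """Precision estimate from replicate-averaged data.

    Y   : array of shape (n, K, p) -- n samples, K technical replicates, p variates.
    lam : ridge penalty; (S + lam * I)^{-1} is a simple ridge-type stand-in.
    """
    Ybar = Y.mean(axis=1)                      # replicate-wise averaging
    S = np.cov(Ybar, rowvar=False)             # sample covariance of the averages
    return np.linalg.inv(S + lam * np.eye(S.shape[0]))
```

Averaging shrinks the error variance by a factor K but, unlike the "signal+error" model, cannot separate the two variance components, which is why the estimated precision values (though not necessarily the support) can be off.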

ILLUSTRATION

We present an illustration of the use of the presented methodology through a re-analysis of several oncogenomics studies with replicated observations. The aim of this re-analysis is threefold: (i) to clarify the consequences of conditional independence graph reconstruction from an error-diluted signal, (ii) to assess the tenability, for the current purpose, of the independence assumption among the errors as implied by a diagonal error precision matrix, and (iii) to elucidate the differences between conditional independence graphs reconstructed from replicated and nonreplicated data. The data stem from three TCGA (The Cancer Genome Atlas) studies into the molecular characterization of cancers of three tissue types: breast (n = 526), lung (n = 151), and ovary (n = 294). Each study interrogated a sample's transcriptome twice (ie, K = 2 for all i), by both gene expression arrays and RNA sequencing. These data have been downloaded using the TCGA2STAT package. Subsequently, each dataset has been subsetted into ten smaller ones, each formed by restricting the original dataset to a subset of the genes. The preserved genes in each subsetted dataset map to one of ten signaling pathways that are believed to be involved in cancer. The definitions of these pathways are taken from KEGG and are available in R through the KEGG.db package as sets of so-called Entrez identifiers. These identifiers are matched to those of the genes present in the datasets. The latter step required conversion of the gene names to their Entrez identifiers, for which we have used the biomaRt package. The pathways' names and their dataset-specific dimension (ranging from p = 29 to p = 247) and sample size are tabulated in SM IIa of Appendix S1. Finally, to meet the distributional assumptions of the presented model the data have been Gaussianized variate-wise, an operation that preserves the conditional independencies among the variates.
Other assumptions are checked visually (see SM IIc of Appendix S1), and found to be unproblematic. We analyze the data in the following ways. We fit the presented model, with and without the diagonal assumption on the error precision matrix, with the penalty parameter(s) chosen through stratified 10‐fold cross‐validation. Additionally, we learn the platform‐specific, that is, array and sequencing, precision matrices from the data using the ridge precision matrix estimator that uses a penalty parameter found through 10‐fold cross‐validation. In the remainder of this section we scrutinize the resulting precision matrices to meet the aims formulated at the beginning of this illustration.

The effect of the error

The deconvolution of signal and error by fitting the "signal+error" model facilitates the study of the consequences of the error for the learning of the conditional independence graph. This study comprises (i) the quantification of the contributions of signal and error to the observation, (ii) the comparison of partial correlations derived from the signal and error-diluted observation precision matrices, and (iii) the comparison of the therefrom inferred conditional independence graphs. The fitted model enables us to investigate whether the observations are dominated by either the signal or the error. Hereto we employ the mutual information, a generalized correlation measure, which quantifies the dilution of the signal by the addition of the error (or vice versa). Concentrating on the former, the mutual information between Y and Z is I(Y; Z) = h(Y) − h(Y | Z), where h(Y) is the (differential) entropy of Y. Large values of I(Y; Z) indicate that Z contains a lot of information on Y, whereas I(Y; Z) = 0 means the random variables are independent. Here, in the multivariate normal case, by theorem 9.4.1 of Cover and Thomas, I(Y; Z) = ½ log(|Σy| / |Σε|), with Σy and Σε the covariance matrices of the observation and the error. Similarly, I(Y; ε) = ½ log(|Σy| / |Σz|). For each (dataset, pathway)-combination we evaluate these mutual informations by plug-in estimates of the precision matrices. The results are tabulated in SM IIe of Appendix S1. These tables show that, structurally over all (dataset, pathway)-combinations, the I(Y; Z) are substantially larger than the I(Y; ε). From this we conclude that the observations are dominated by the signal and not by the error. This conclusion is corroborated by exploratory analyses presented in SM IIe of Appendix S1. Consequently, when replicates are not available, the inference of the signal-related conditional independence graph directly from the estimated observation-related precision matrix is not completely in vain. We compare the distributions of the partial correlations, the basis of the inference of the conditional independence graph, derived from the estimated signal and observation precision matrices.
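Under the Gaussian "signal+error" model Y = Z + ε with Z and ε independent, the mutual informations used in this section reduce to log-ratios of covariance determinants, since h(Y | Z) = h(ε) and h(Y | ε) = h(Z). The sketch below is a plug-in computation under that assumption; it takes covariance matrices as inputs (the paper works with their precision counterparts).

```python
import numpy as np

def mi_signal(Sigma_z, Sigma_eps):
    """I(Y; Z) in nats for Y = Z + eps, Z ~ N(0, Sigma_z), eps ~ N(0, Sigma_eps) independent."""
    _, logdet_y = np.linalg.slogdet(Sigma_z + Sigma_eps)   # from h(Y)
    _, logdet_e = np.linalg.slogdet(Sigma_eps)             # from h(Y | Z) = h(eps)
    return 0.5 * (logdet_y - logdet_e)

def mi_error(Sigma_z, Sigma_eps):
    """I(Y; eps), obtained symmetrically via h(Y | eps) = h(Z)."""
    _, logdet_y = np.linalg.slogdet(Sigma_z + Sigma_eps)
    _, logdet_z = np.linalg.slogdet(Sigma_z)
    return 0.5 * (logdet_y - logdet_z)
```

In the univariate case with signal variance 3 and error variance 1, for instance, I(Y; Z) = ½ log 4 exceeds I(Y; ε) = ½ log(4/3): a dominant signal makes mi_signal the larger of the two, the pattern the tables in SM IIe report.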
Hereto we generate (i) qq-plots (see SM IIf of Appendix S1) and (ii) the densities (not shown) of the differences between corresponding partial correlations. The qq-plots suggest that both partial correlation distributions are reasonably similar, with differences appearing mainly in the tails. The densities of the partial correlation differences confirm this, as most mass is concentrated around and close to zero. However, the different tail behavior implies that the error indeed obscures true edges as well as introduces spurious ones in the inferred conditional independence graph. We now quantify the effect of the error on the inferred conditional independence graph as follows. This graph is inferred from both partial correlation matrices, that is, the ones derived from the signal and observation precision matrix estimates. Each graph is formed by simply taking the top r, r = 1, … , 250, largest (in an absolute sense) unique partial correlations from the corresponding matrix. The percentage of overlapping edges among the selected edges of the two graphs is plotted against the number of selected edges (see Figure 2), again for each (dataset, pathway)-combination. As expected, this percentage is unstable for small r, but settles for larger ones. On average, over datasets and pathways, it settles at approximately 70%. If we translate this to the inference of molecular networks through the learning of conditional independence graphs from a single platform, it suggests that a little over one in four absent/present edges reported in the literature is either a false positive or a false negative.
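The edge-overlap curve described above can be sketched generically: rank the unique off-diagonal partial correlations of two matrices by absolute value and compute the percentage of shared top-r edges. Function names are ours, for illustration only.

```python
import numpy as np

def top_r_edges(pcor, r):
    """The r unique off-diagonal pairs with the largest |partial correlation|."""
    iu = np.triu_indices(pcor.shape[0], k=1)          # unique edges, i < j
    order = np.argsort(-np.abs(pcor[iu]))[:r]
    return set(zip(iu[0][order].tolist(), iu[1][order].tolist()))

def overlap_percentage(pcor_a, pcor_b, r):
    """Percentage of shared edges among the top-r edges of two partial correlation matrices."""
    return 100.0 * len(top_r_edges(pcor_a, r) & top_r_edges(pcor_b, r)) / r
```

Sweeping r from 1 to 250 and plotting overlap_percentage against r yields a curve of the kind shown in the left panel of Figure 2.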
FIGURE 2

Left panel: the percentage of overlapping edges (y-axis) between the conditional independence graphs formed by selecting the top r (x-axis) strongest (in an absolute sense) partial correlations from the standardized signal precision matrix and the "observation" precision matrix. Each line represents a different pathway and connects the percentages of overlapping edges found for tops of varying size r, r = 1, … , 250. Right panel: boxplots of partial correlations of randomly selected edges evaluated from a fixed signal Z diluted with varying errors. For reference, the partial correlations from the undiluted signals are added as blue diamonds [Colour figure can be viewed at wileyonlinelibrary.com]

Finally, we illustrate the diluting effect of the error on the estimation of partial correlations. Hereto the signals Zi, i = 1, … , n, are estimated from the data and model estimates stemming from the apoptosis pathway of the TCGA lung study. We then simulate observed data by adding to these signals errors drawn from the estimated error distribution. The samples are thus unreplicated. The "signal" and "observed" partial correlations are then obtained from the standardized inverses of their respective sample covariance matrices. To capture the spread in the latter, they are evaluated for a hundred error draws. Boxplots of the resulting hundred "observation" partial correlations corresponding to thirty randomly selected edges are displayed in Figure 2. Their "signal" partial correlations are plotted on top of them as blue diamonds. This reveals that some partial correlations are indeed clearly weakened by dilution of the signal with the error. Simultaneously, others are strengthened, possibly introducing spuriously inferred edges. The partial correlations estimated from an unreplicated, error-diluted signal can thus differ substantially from those obtained from the signal itself. This has consequences for the reconstruction of the network.
To illustrate one of the implications, we infer the network from the top 100 strongest (in an absolute sense) partial correlations derived from the standardized signal precision matrix estimate. This yields a network of 13 unconnected nodes and one large connected component involving 66 nodes. The large connected component is the topological feature of interest. We assess whether it persists when the network is reconstructed from an error-diluted signal. Such a signal is created as above, from which the corresponding partial correlation matrix is estimated, and in turn a network is reconstructed by selection of its top 100 strongest edges. This exercise is repeated a hundred times. The hundred networks derived from the error-diluted signal all exhibit a large connected component. In over 85% of these networks this component involves 50 or more nodes. Hence, without replicated samples the prominent network feature is generally preserved, but it is also partially obscured due to error dilution.
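The dilution exercise can be mimicked in a few lines: fix a signal matrix, add a fresh Gaussian error draw, and recompute the partial correlations from the standardized inverse of the sample covariance matrix. The i.i.d. error and its level below are illustrative assumptions, and n must exceed p for the unpenalized inverse to exist.

```python
import numpy as np

def partial_correlations(precision):
    """Standardize a precision matrix into a partial correlation matrix."""
    d = 1.0 / np.sqrt(np.diag(precision))
    pcor = -precision * np.outer(d, d)     # rho_ij = -omega_ij / sqrt(omega_ii * omega_jj)
    np.fill_diagonal(pcor, 1.0)
    return pcor

def diluted_pcor(Z, sigma_eps, rng):
    """Partial correlations of Y = Z + eps for one fresh error draw (requires n > p)."""
    Y = Z + rng.normal(scale=sigma_eps, size=Z.shape)
    S = np.cov(Y, rowvar=False)
    return partial_correlations(np.linalg.inv(S))
```

Repeating diluted_pcor a hundred times for a fixed Z and collecting the entries of selected edges reproduces the kind of boxplots shown in the right panel of Figure 2.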

The diagonal assumption

The assumption of a diagonal error precision matrix discussed in Section 3.1 is evaluated. Previously, we proposed this assumption for computational reasons, in particular when the penalty parameter is chosen via cross-validation. Here we study its effect on the reconstruction of the conditional independence graph from real data. For starters we compare the models with a full and a diagonal error precision matrix by means of Akaike's Information Criterion (AIC), which balances the model's fit with its parsimony. For the model with a full error precision matrix the AIC equals twice the number of model parameters, 2p(p + 1), minus twice the log-likelihood evaluated at the estimated model parameters under the full assumption. For the model with a diagonal error precision matrix the first term is replaced by p(p + 1) + 2p and the corresponding estimators are used in the log-likelihood. These estimated AICs are reported in SM IId of Appendix S1. They reveal that the AICs of the model with a full error precision matrix are better (ie, smaller) than those of the model with a diagonal one. Hence, the improvement in the description of the data by the more elaborate model over the simpler one outweighs the former's use of additional parameters. This suggests that the full model is to be preferred when used for the reconstruction of the conditional independence graph. Although the model with a diagonal error precision matrix is not preferred on the basis of the AIC, it could still be a good basis for the reconstruction of the conditional independence graph. As in Section 5.1, qq-plots of the partial correlations, derived from the signal precision matrix estimate under both error assumptions, are drawn for every (dataset, pathway)-combination (see SM IIf of Appendix S1). These plots reveal little difference in the distribution of these partial correlations. Additionally, the densities of the differences of corresponding partial correlations of both models are plotted (not shown).
Generally, these densities are tightly concentrated around zero, suggesting that the estimate of the signal precision matrix under the assumption of a diagonal error precision matrix may still be a good basis for the reconstruction of the conditional independence graph. This is quantified, as in Section 5.1, by the percentage of overlapping edges between the conditional independence graphs reconstructed from both estimates, selecting only the top r, r = 1, … , 250, largest (in an absolute sense) partial correlations. For each (dataset, pathway)-combination we plot these percentages against the number of selected edges r. In all cases the percentage of overlapping edges between the two reconstructed networks exceeds 85% and is on average around 90%. Hence, for initial screening purposes the simpler model may suffice, although the computational efficiency gain comes at a cost.
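Writing Ωz and Ωε for the signal and error precision matrices and ℓ for the log-likelihood evaluated at the estimates, the AIC comparison described above amounts to the following (our rendering; a full symmetric p × p precision matrix carries p(p + 1)/2 free parameters and a diagonal one p, and the full model comprises two full matrices):

```latex
\mathrm{AIC}_{\mathrm{full}} = 2\,p(p+1) - 2\,\ell\bigl(\hat{\Omega}_z, \hat{\Omega}_\varepsilon\bigr),
\qquad
\mathrm{AIC}_{\mathrm{diag}} = p(p+1) + 2p - 2\,\ell\bigl(\hat{\Omega}_z, \hat{\Omega}_\varepsilon^{\mathrm{diag}}\bigr).
```

The smaller AIC wins; the reported tables favor the full model despite its p(p − 1)/2 extra error parameters.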

The platform differences

In the spirit of the MAQC we compare the reconstruction of the conditional independence graphs (CIGs) from individual—but also joint—platform data (all plots are deferred to SM IIf of Appendix S1). In the remainder we refer to these graphs as the "microarray CIG," the "RNA-seq CIG," and the "joint CIG." The percentage of overlap among the top r edges of the microarray and RNA-seq CIGs varies roughly between 35% and 55% (cf SM IIf of Appendix S1) over pathways and datasets. This suggests that roughly only a third to a half of the edges reported in the literature will reproduce in subsequent studies when a different platform is used. In our comparison of the individual platforms' CIGs to the joint one, we assume that (a) the variation in the data comprises only sampling variation and that due to the use of the two different platforms, and (b) these variation components can be estimated adequately and without (!) too much error by the proposed penalized EM algorithm in combination with a cross-validated penalty parameter. The percentage of overlap in the top r edges of the joint CIG and either the microarray or the RNA-seq CIG fluctuates around approximately 60% and 65%, respectively, over pathways and datasets. The outlying percentages for the MAPK pathway in the ovarian data are due to extremely large cross-validated penalties in both platform-specific precision matrix estimates. The overlap of the joint CIG with the RNA-seq one is systematically a little larger than that with the microarray platform. Irrespective of this minor difference, these percentages suggest that—although it is unknown which—roughly 65% of the gene-gene interactions reported in the literature are correctly identified, should the aforementioned assumptions be tenable. This 65% is slightly smaller than, but otherwise in line with, the approximately 70% found in Section 5.1 when investigating the effect of the error.
The former percentage can be dissected into the overlap percentages among the top r edges between:

(i) the joint CIG and the intersection of the microarray and RNA-seq CIGs. This overlap percentage ranges from 35% to 55%, depending on pathway and dataset. In particular, the plots also indicate that, if an edge is in the overlap of the platform-specific CIGs, it is most likely to be in the joint CIG.

(ii) the joint and microarray CIGs, counting edges not present in the RNA-seq CIG. This ranges more or less from 15% to 20%, while, vice versa, the overlap between the joint and RNA-seq CIGs on edges not present in the microarray CIG fluctuates between 20% and 25%. The latter's larger overlap is in line with the overall larger overlap between these two CIGs. Irrespectively, this indicates that there are indeed platform-specific edges.

(iii) the joint CIG and the edges present in neither the microarray nor the RNA-seq CIG. This ranges from 5% to 15% (and sometimes as high as 25%) over the pathways and datasets. It reveals the number of edges obscured by the use of a particular platform.

These percentages should be related to the probability of an edge being common to two independently reconstructed networks with the same number of nodes p and an equal number of selected edges r. For p = 50 and r = 250 this probability is approximately 4.16% (and lower for smaller r or larger p). Hence, as the observed percentages easily exceed this reference of 4.16%, there is definitely shared information between the platform-specific CIGs. Although not perfect, it reflects the cohesion of the pathways' gene expression data. Finally, we draw up a more specific inventory of the overlap between the joint, microarray, and RNA-seq CIGs. Hereto we identify, using the TCGA lung cancer data, for all pathways the 100 strongest edges from the corresponding partial correlation matrices. For each pathway we evaluate the overlap between all combinations of the resulting CIGs, shown in Figure 3.
Were all three CIGs identical, a bar would be solid brown and reach 100 on the x-axis. Similarly, without any overlap among the CIGs, a bar would comprise three equally sized blocks, coloured red, yellow, and blue, and reach 300. In Figure 3 the bars reach, on average, approximately 175, which indicates a reasonable amount of overlap. Unsurprisingly, the joint CIGs share most with both other CIGs, individually and with their intersection. However, there are also approximately 15 edges present only in the joint CIG, which—if correct—are missed without replication. On a similar note, a much larger number of edges is specific to either the RNA-seq or the microarray CIG. Hence, using a single platform without replication, one clearly identifies a substantial number of edges that are unlikely to reproduce.
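The 4.16% reference value quoted above is consistent with a simple combinatorial benchmark: with E = p(p − 1)/2 possible edges and both networks selecting r edges independently at random, a given possible edge lies in both with probability (r/E)². A quick check, under that interpretation:

```python
def chance_edge_overlap(p, r):
    """Probability that a given possible edge is selected in both of two
    independent random top-r edge sets on p nodes."""
    n_edges = p * (p - 1) // 2           # number of possible edges
    return (r / n_edges) ** 2

print(round(100 * chance_edge_overlap(50, 250), 2))  # -> 4.16 (percent)
```

As the text notes, this reference probability shrinks for smaller r or larger p, so observed overlaps well above it signal genuinely shared structure.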
FIGURE 3

On the right, a horizontal bar plot of—per pathway—the number of overlapping edges between the top 100 strongest edges of the CIGs reconstructed from the RNA-seq, microarray, and joint data. The left panel represents the accompanying color legend via a Venn diagram [Colour figure can be viewed at wileyonlinelibrary.com]


CONCLUSION

Assuming a simple "signal+noise" model, we showed in Section 2 the possible consequences of ignoring variation from sources other than sampling for the reconstruction of the cohesion among the variates of a Gaussian random variable: conditional dependencies may be obscured and spurious ones introduced. We pointed out that this may be overcome when observations have been replicated, as different sources of variation can then be separated. We presented methodology for the estimation of the parameters associated with these sources, which harbor the sought conditional (in)dependencies. Simulations showed that most is gained from duplication and that, for example, triplicated observations add little. They also revealed that, when pragmatically using replicate-wise averaged data instead of the more complicated proposed "signal+noise" model-based approach, the support of the signal precision matrix can be reconstructed quite well but the corresponding estimated values of the precision matrix can be inaccurate. Finally, through an extensive re-analysis of data from oncogenomics studies with replicated observations, the effect of the omission of replicates, but also the gain of their inclusion, has been tangibly illustrated. In particular, it provides insight into the reproducibility of published gene-gene interaction networks, which indicates that care is to be taken with the validity of these networks. A further note of caution is needed. So far, false-positive and false-negative edges of the reconstructed conditional independence graph have only been attributed to the presence of the variation introduced by the use of different platforms. On the one hand, this ignores the uncertainty in the estimation due to the use of a sample of finite size, which itself introduces falsely inferred absent and present edges. On the other, the focus is—due to the design of the employed TCGA studies—on the error quantifiable from technical replicates.
This ignores the fact that mRNA levels may vary considerably over the day. This biological within-sample variation cannot be quantified from the used TCGA studies. That would require studies with a longitudinal setup in which samples are characterized at multiple instances, and for their analysis different statistical methodology is needed. Both are the subject of follow-up research.

1.  The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements.

Authors:  Leming Shi; Laura H Reid; Wendell D Jones; Richard Shippy; Janet A Warrington; Shawn C Baker; Patrick J Collins; Francoise de Longueville; Ernest S Kawasaki; Kathleen Y Lee; Yuling Luo; Yongming Andrew Sun; James C Willey; Robert A Setterquist; Gavin M Fischer; Weida Tong; Yvonne P Dragan; David J Dix; Felix W Frueh; Frederico M Goodsaid; Damir Herman; Roderick V Jensen; Charles D Johnson; Edward K Lobenhofer; Raj K Puri; Uwe Schrf; Jean Thierry-Mieg; Charles Wang; Mike Wilson; Paul K Wolber; Lu Zhang; Shashi Amur; Wenjun Bao; Catalin C Barbacioru; Anne Bergstrom Lucas; Vincent Bertholet; Cecilie Boysen; Bud Bromley; Donna Brown; Alan Brunner; Roger Canales; Xiaoxi Megan Cao; Thomas A Cebula; James J Chen; Jing Cheng; Tzu-Ming Chu; Eugene Chudin; John Corson; J Christopher Corton; Lisa J Croner; Christopher Davies; Timothy S Davison; Glenda Delenstarr; Xutao Deng; David Dorris; Aron C Eklund; Xiao-hui Fan; Hong Fang; Stephanie Fulmer-Smentek; James C Fuscoe; Kathryn Gallagher; Weigong Ge; Lei Guo; Xu Guo; Janet Hager; Paul K Haje; Jing Han; Tao Han; Heather C Harbottle; Stephen C Harris; Eli Hatchwell; Craig A Hauser; Susan Hester; Huixiao Hong; Patrick Hurban; Scott A Jackson; Hanlee Ji; Charles R Knight; Winston P Kuo; J Eugene LeClerc; Shawn Levy; Quan-Zhen Li; Chunmei Liu; Ying Liu; Michael J Lombardi; Yunqing Ma; Scott R Magnuson; Botoul Maqsodi; Tim McDaniel; Nan Mei; Ola Myklebost; Baitang Ning; Natalia Novoradovskaya; Michael S Orr; Terry W Osborn; Adam Papallo; Tucker A Patterson; Roger G Perkins; Elizabeth H Peters; Ron Peterson; Kenneth L Philips; P Scott Pine; Lajos Pusztai; Feng Qian; Hongzu Ren; Mitch Rosen; Barry A Rosenzweig; Raymond R Samaha; Mark Schena; Gary P Schroth; Svetlana Shchegrova; Dave D Smith; Frank Staedtler; Zhenqiang Su; Hongmei Sun; Zoltan Szallasi; Zivana Tezak; Danielle Thierry-Mieg; Karol L Thompson; Irina Tikhonova; Yaron Turpaz; Beena Vallanat; Christophe Van; Stephen J Walker; Sue Jane Wang; Yonghong Wang; Russ 
Wolfinger; Alex Wong; Jie Wu; Chunlin Xiao; Qian Xie; Jun Xu; Wen Yang; Liang Zhang; Sheng Zhong; Yaping Zong; William Slikker
Journal:  Nat Biotechnol       Date:  2006-09       Impact factor: 54.908

2.  Sparse inverse covariance estimation with the graphical lasso.

Authors:  Jerome Friedman; Trevor Hastie; Robert Tibshirani
Journal:  Biostatistics       Date:  2007-12-12       Impact factor: 5.899

3.  KEGG: Kyoto Encyclopedia of Genes and Genomes.

Authors:  H Ogata; S Goto; K Sato; W Fujibuchi; H Bono; M Kanehisa
Journal:  Nucleic Acids Res       Date:  1999-01-01       Impact factor: 16.971

4.  The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models.

Authors:  Leming Shi; Gregory Campbell; Wendell D Jones; Fabien Campagne; Zhining Wen; Stephen J Walker; Zhenqiang Su; Tzu-Ming Chu; Federico M Goodsaid; Lajos Pusztai; John D Shaughnessy; André Oberthuer; Russell S Thomas; Richard S Paules; Mark Fielden; Bart Barlogie; Weijie Chen; Pan Du; Matthias Fischer; Cesare Furlanello; Brandon D Gallas; Xijin Ge; Dalila B Megherbi; W Fraser Symmans; May D Wang; John Zhang; Hans Bitter; Benedikt Brors; Pierre R Bushel; Max Bylesjo; Minjun Chen; Jie Cheng; Jing Cheng; Jeff Chou; Timothy S Davison; Mauro Delorenzi; Youping Deng; Viswanath Devanarayan; David J Dix; Joaquin Dopazo; Kevin C Dorff; Fathi Elloumi; Jianqing Fan; Shicai Fan; Xiaohui Fan; Hong Fang; Nina Gonzaludo; Kenneth R Hess; Huixiao Hong; Jun Huan; Rafael A Irizarry; Richard Judson; Dilafruz Juraeva; Samir Lababidi; Christophe G Lambert; Li Li; Yanen Li; Zhen Li; Simon M Lin; Guozhen Liu; Edward K Lobenhofer; Jun Luo; Wen Luo; Matthew N McCall; Yuri Nikolsky; Gene A Pennello; Roger G Perkins; Reena Philip; Vlad Popovici; Nathan D Price; Feng Qian; Andreas Scherer; Tieliu Shi; Weiwei Shi; Jaeyun Sung; Danielle Thierry-Mieg; Jean Thierry-Mieg; Venkata Thodima; Johan Trygg; Lakshmi Vishnuvajjala; Sue Jane Wang; Jianping Wu; Yichao Wu; Qian Xie; Waleed A Yousef; Liang Zhang; Xuegong Zhang; Sheng Zhong; Yiming Zhou; Sheng Zhu; Dhivya Arasappan; Wenjun Bao; Anne Bergstrom Lucas; Frank Berthold; Richard J Brennan; Andreas Buness; Jennifer G Catalano; Chang Chang; Rong Chen; Yiyu Cheng; Jian Cui; Wendy Czika; Francesca Demichelis; Xutao Deng; Damir Dosymbekov; Roland Eils; Yang Feng; Jennifer Fostel; Stephanie Fulmer-Smentek; James C Fuscoe; Laurent Gatto; Weigong Ge; Darlene R Goldstein; Li Guo; Donald N Halbert; Jing Han; Stephen C Harris; Christos Hatzis; Damir Herman; Jianping Huang; Roderick V Jensen; Rui Jiang; Charles D Johnson; Giuseppe Jurman; Yvonne Kahlert; Sadik A Khuder; Matthias Kohl; Jianying Li; Li Li; Menglong Li; Quan-Zhen Li; Shao Li; Zhiguang Li; Jie 
Liu; Ying Liu; Zhichao Liu; Lu Meng; Manuel Madera; Francisco Martinez-Murillo; Ignacio Medina; Joseph Meehan; Kelci Miclaus; Richard A Moffitt; David Montaner; Piali Mukherjee; George J Mulligan; Padraic Neville; Tatiana Nikolskaya; Baitang Ning; Grier P Page; Joel Parker; R Mitchell Parry; Xuejun Peng; Ron L Peterson; John H Phan; Brian Quanz; Yi Ren; Samantha Riccadonna; Alan H Roter; Frank W Samuelson; Martin M Schumacher; Joseph D Shambaugh; Qiang Shi; Richard Shippy; Shengzhu Si; Aaron Smalter; Christos Sotiriou; Mat Soukup; Frank Staedtler; Guido Steiner; Todd H Stokes; Qinglan Sun; Pei-Yi Tan; Rong Tang; Zivana Tezak; Brett Thorn; Marina Tsyganova; Yaron Turpaz; Silvia C Vega; Roberto Visintainer; Juergen von Frese; Charles Wang; Eric Wang; Junwei Wang; Wei Wang; Frank Westermann; James C Willey; Matthew Woods; Shujian Wu; Nianqing Xiao; Joshua Xu; Lei Xu; Lun Yang; Xiao Zeng; Jialu Zhang; Li Zhang; Min Zhang; Chen Zhao; Raj K Puri; Uwe Scherf; Weida Tong; Russell D Wolfinger
Journal:  Nat Biotechnol       Date:  2010-07-30       Impact factor: 54.908

5.  Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt.

Authors:  Steffen Durinck; Paul T Spellman; Ewan Birney; Wolfgang Huber
Journal:  Nat Protoc       Date:  2009-07-23       Impact factor: 13.491

6.  Is my network module preserved and reproducible?

Authors:  Peter Langfelder; Rui Luo; Michael C Oldham; Steve Horvath
Journal:  PLoS Comput Biol       Date:  2011-01-20       Impact factor: 4.475

7.  Integrated genomic analyses of ovarian carcinoma.

Authors: 
Journal:  Nature       Date:  2011-06-29       Impact factor: 49.962

8.  Comparison of RNA-seq and microarray-based models for clinical endpoint prediction.

Authors:  Wenqian Zhang; Ying Yu; Falk Hertwig; Jean Thierry-Mieg; Wenwei Zhang; Danielle Thierry-Mieg; Jian Wang; Cesare Furlanello; Viswanath Devanarayan; Jie Cheng; Youping Deng; Barbara Hero; Huixiao Hong; Meiwen Jia; Li Li; Simon M Lin; Yuri Nikolsky; André Oberthuer; Tao Qing; Zhenqiang Su; Ruth Volland; Charles Wang; May D Wang; Junmei Ai; Davide Albanese; Shahab Asgharzadeh; Smadar Avigad; Wenjun Bao; Marina Bessarabova; Murray H Brilliant; Benedikt Brors; Marco Chierici; Tzu-Ming Chu; Jibin Zhang; Richard G Grundy; Min Max He; Scott Hebbring; Howard L Kaufman; Samir Lababidi; Lee J Lancashire; Yan Li; Xin X Lu; Heng Luo; Xiwen Ma; Baitang Ning; Rosa Noguera; Martin Peifer; John H Phan; Frederik Roels; Carolina Rosswog; Susan Shao; Jie Shen; Jessica Theissen; Gian Paolo Tonini; Jo Vandesompele; Po-Yen Wu; Wenzhong Xiao; Joshua Xu; Weihong Xu; Jiekun Xuan; Yong Yang; Zhan Ye; Zirui Dong; Ke K Zhang; Ye Yin; Chen Zhao; Yuanting Zheng; Russell D Wolfinger; Tieliu Shi; Linda H Malkas; Frank Berthold; Jun Wang; Weida Tong; Leming Shi; Zhiyu Peng; Matthias Fischer
Journal:  Genome Biol       Date:  2015-06-25       Impact factor: 13.583

9.  Comprehensive molecular portraits of human breast tumours.

Authors: 
Journal:  Nature       Date:  2012-09-23       Impact factor: 49.962

10.  Comprehensive genomic characterization of squamous cell lung cancers.

Authors: 
Journal:  Nature       Date:  2012-09-09       Impact factor: 49.962
