Literature DB >> 17239251

Bayesian profiling of molecular signatures to predict event times.

Abstract

BACKGROUND: It is of particular interest to identify cancer-specific molecular signatures for early diagnosis, monitoring effects of treatment and predicting patient survival time. Molecular information about patients is usually generated from high throughput technologies such as microarray and mass spectrometry. Statistically, we are challenged by the large number of candidates but only a small number of patients in the study, and the right-censored clinical data further complicate the analysis.
RESULTS: We present a two-stage procedure to profile molecular signatures for survival outcomes. Firstly, we group closely-related molecular features into linkage clusters, each portraying either similar or opposite functions and playing similar roles in prognosis; secondly, a Bayesian approach is developed to rank the centroids of these linkage clusters and provide a list of the main molecular features closely related to the outcome of interest. A simulation study showed the superior performance of our approach. When it was applied to data on diffuse large B-cell lymphoma (DLBCL), we were able to identify some new candidate signatures for disease prognosis.
CONCLUSION: This multivariate approach provides researchers with a more reliable list of molecular features profiled in terms of their prognostic relationship to the event times, and generates dependable information for subsequent identification of prognostic molecular signatures through either biological procedures or further data analysis.

Entities: Chemical Disease Gene Species

Mesh：

Year: 2007 PMID： 17239251 PMCID： PMC1796541 DOI： 10.1186/1742-4682-4-3

Source DB: PubMed Journal: Theor Biol Med Model ISSN： 1742-4682 Impact factor: 2.432

Background

High-throughput biotechnologies such as microarray and mass spectrometry permit simultaneous measurements of enormous bodies of genomic, proteomic, and metabolic information to be made. Such information helps us understand the molecular basis of important clinical outcomes, and thus improves the efficiency as well as accuracy in clinical decision making. More specifically, a small subset of these molecules can be used as biomarkers in daily clinical practice for detecting disease at early stages, measuring disease progress, monitoring the efficacy of treatments, and potentially accelerating the drug discovery process. However, the promise of genomics, proteomics, and metabolomics in clinical medicine rests on identifying these disease-specific molecular signatures. Clinical and preclinical studies of patients' genomics and proteomics profiles usually present datasets that share common characteristics, i.e., many molecular features ("large p") collected from few individuals ("small n"). The statistical challenge is to mine prognostic signatures from thousands of candidates by efficiently extracting information from samples of limited size, i.e., "small n large p" datasets. Moreover, the clinical outcomes measured for certain patients, e.g., survival times of cancer patients, are usually censored data, which further complicates the statistical analysis. There has been extensive research on the classification and prediction of cancer using gene expression information [1-3], but there has been less progress in identifying individual molecules that can be used to predict the clinical outcome. We devote this paper to developing a Bayesian approach to profile molecular features on the basis of their prognostic relations to event times. The proportional hazard model [4] has a long history in modeling the association of risk factors to the right-censored event times observed in clinical study [5,6]. Through this model, it has been of special interest to develop a systematic approach to identifying molecular signatures for event times with "small n large p" datasets. However, the overwhelmingly larger number of molecular candidates compared to the number of individuals prohibits exhaustive variable selection because of the heavy computation and model-overfitting considerations. A variety of strategies have been proposed in the literature. The first is to reduce the list of genotypic candidates by univariately associating each of them with phenotypic clinical outcome [1,7], and then regress the clinical outcome on the selected candidates. The second employs principal component analysis (PCA) to build up "eigengenes" (i.e., linear combinations of genes) and associates these with phenotypic clinical outcomes, and the identification of molecular signatures is further explored on the basis of these [8]. The third strategy employs partial least squares (PLS) [9,10] to construct orthogonal "eigengenes" [11]. Other strategies have also been used to reveal interesting prognostic molecular signatures for certain event times [12-15]. Recently, Tadesse et al. [16] proposed a Bayesian error-in-variable survival model to identify genes of which the expression levels are associated with survival outcome. It is widely accepted that most genes measured in microarray experiments provide little information for predicting patient survival, so a necessary step in the analysis is to reduce the number of candidates before identifying prognostic molecular signatures with a relatively small sample. This reduction is usually carried out by ranking molecular features (either the original molecular candidates or the "eigengenes") according to either z scores [7] or Cox scores [17-19], which measure the univariate association of each molecular feature with the event time. Several top-ranked molecular features are further explored for their prognostic associations with the event time. As shown in our simulation study, employing the univariate Cox scores to profile molecular features can be misleading as it may miss many important candidates but select many false-prognostic ones. Indeed, molecular features with high univariate association to the event time may not necessarily predict the event time effectively when applied together. As shown by Sha et al. [20] and Tadesse et al. [21], the disease may often be affected jointly by subsets of the genes while each individual gene might have a relatively weak effect. This study focuses on developing an efficient yet robust approach to profiling molecular features on the basis of their prognostic associations with the event time, taking advantage of the Bayesian framework for the proportional hazard model proposed by Kalbfleisch [22]. We acknowledge the high correlation between some molecular features due to the complicated genetic architecture. For example, genes involved in the same metabolic pathway may be similarly or oppositely regulated. These closely-related molecular features can result in collinearity between the candidates, and should therefore be grouped together in order to address their prognostic associations with the event time properly. Here, we group closely-related molecular features into linkage clusters. A centroid "gene" is constructed to represent each linkage cluster and thus partially solve the collinearity issue. As univariate Cox scores are unable to account for the complicated correlation structures among molecular features, we employ the Bayesian approach to construct a natural framework for molecular feature profiling. We first propose a two-stage procedure for profiling prognostic molecular signatures for event times, and present the construction of linkage clusters as well as their centroids. A Bayesian framework of the Cox proportional hazard model is specified for "large p small n" data and a profiling criterion is described accordingly. The performance of our approach is evaluated via a simulation study and application to data concerning diffuse large B-cell lymphoma (DLBCL) [15].

Results

Simulation study

To evaluate the performance of the proposed approach, we simulated 20 survival datasets, each having p = 1, 000 features and n = 125 independent individuals. The feature values were generated from an autoregressive process of order one, with autocorrelation ρ = 0.5 and unit variance white noise. The event times follow an exponential distribution of which the rate is determined by a linear combination of the 12 features with non-zero coefficients. Independent random censoring times were generated from standard exponential distributions, and this induced censoring of approximately 50% of the observed event times. Among the 1, 000 autocorrelated features, the indices of the 12 with non-zero coefficients are 150, 151, 300, 302, 450, 453, 600, 604, 750, 755, 900, 906, and their values alternate between 1 and -1. Such constant-magnitude coefficients were chosen in order to evaluate the effect of correlation among features on the profiling, as the correlations between the pairs of features, i.e., (150, 151), (300, 302), (450, 453), (600, 604), (750, 755), and (900, 906), decrease geometrically from 0.5 to 0.015625. As shown in Figure 1, these feature pairs have similar chances of being selected as top features while being ranked by the Bayesian approach. In this simulation study, the proposed Bayesian approach could select each non-zero coefficient feature with high probability (more than 0.8) when more than 12 features were selected in total. However, when univariate Cox scores are used, a feature pair with higher correlation is more likely to be among the selected top features, and in general, all 12 features are less likely to be correctly selected, as shown in Figure 2. The percentages of the 12 non-zero coefficient features selected into top features (i.e., success rates) are shown in Figure 3 when using the Bayesian approach, or the univariate Cox scores. The univariate Cox scores can lead to very high false discovery rates because the features with non-zero coefficients are usually ranked very low. Furthermore, as shown in Figure 3, the success rates of selecting features with non-zero coefficients are very low even when a large number of features are selected. On the other hand, when more than 12 features are selected using the Bayesian approach, the success rates are usually higher than 0.8 and approach 1 very quickly as more features are selected.

Figure 1

Figure 2

Frequency of successes using univariate Cox scores. For each of the six feature pairs, the frequency of successes (y-axis) is calculated as the total number of correct detections in the 20 simulated datasets when univariate Cox scores are used to select a certain number of features (x-axis).

Figure 3

Comparison between the Bayesian approach and Cox scores. Shown as success rates (y-axis) are the true positive rates when a certain number of features (x-axis) are selected in each of the 20 simulated datasets. The solid line represents the results from the Bayesian approach and the dotted line represents the results using univariate Cox scores.

Frequency of successes using the Bayesian approach. For each of the six feature pairs, the frequency of successes (y-axis) is calculated as the total number of correct detections in the 20 simulated datasets when the Bayesian approach is used to select a certain number of features (x-axis). Frequency of successes using univariate Cox scores. For each of the six feature pairs, the frequency of successes (y-axis) is calculated as the total number of correct detections in the 20 simulated datasets when univariate Cox scores are used to select a certain number of features (x-axis). Comparison between the Bayesian approach and Cox scores. Shown as success rates (y-axis) are the true positive rates when a certain number of features (x-axis) are selected in each of the 20 simulated datasets. The solid line represents the results from the Bayesian approach and the dotted line represents the results using univariate Cox scores.

Application to a real dataset

We applied the proposed two-stage procedure to data on diffuse large B-cell lymphoma (DLBCL) [15]. These data include the expression levels of 7, 399 genes from a total of 240 patients. The genomic information for each patient was obtained at the beginning of the study, and the patients were followed up until death or the end of the project. The missing gene expression values were imputed using the nearest neighbor averaging approach [12,23]. Using the single linkage clustering approach in Cluster 3.0 [24], we identified 5,656 linkage clusters by pruning the hierarchical tree such that the node distances within branches are less than 0.2. There are 4,944 linkage clusters containing only one gene, while the largest has 186 genes. We then consider selecting prognostic molecular features from the 5, 656 candidates, each being the centroid of a linkage cluster. The univariate Cox scores of all candidate clusters are calculated and shown in decreasing order in Figure 4. There are 761 candidates with Cox scores above the 95 percentile of the distribution, and 290 candidates with Cox scores above the 99 percentile of the distribution. We selected the top 100, 200, 300, and 500 candidates with the largest Cox scores and applied our Bayesian method to profile them. The top 25 of the 500 candidates are listed in Table 1.

Figure 4

Table 1

Bayesian Profiling of the DLBCL Data. The 16 starred genes are proposed by our exploratory selection. Bracketed are the numbers of features from the same gene, which are included in the same cluster.

GenBank Accession No.	Cox		100		200		300		500

	Score	Rank	p˜ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGWbaCgaacaaaa@2E24@_k	Rank	p˜ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGWbaCgaacaaaa@2E24@_k	Rank	p˜ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGWbaCgaacaaaa@2E24@_k	Rank	p˜ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGWbaCgaacaaaa@2E24@_k	Rank
*D42043	18.91	6	0.99	2	0.99	1	0.99	1	0.98	1
*D88532	9.13	122	-	-	0.99	2	0.89	2	0.93	2
*U50196	6.82	272	-	-	-	-	0.12	6	0.32	3
*BC012161	21.28	3	0.81	4	0.55	4	0.61	4	0.31	4
*AF414120	5.77	387	-	-	-	-	-	-	0.27	5
*AF004709	5.27	460	-	-	-	-	-	-	0.20	6
AF127481	17.14	8	1.00	1	0.90	3	0.67	3	0.19	7
*AK025954	5.93	365	-	-	-	-	-	-	0.14	8
*AA504484 (2)	5.23	468	-	-	-	-	-	-	0.11	9
J00220	8.40	167	-	-	0.32	9	0.09	11	0.09	10
*AA837319	9.84	96	0.45	6	0.20	11	0.07	15	0.09	11
AA027985	11.44	50	0.30	9	0.18	13	0.09	10	0.09	12
*D13666	12.85	32	0.34	8	0.34	7	0.10	7	0.08	13
LC_33732	7.52	223	-	-	-	-	0.07	16	0.08	14
*AK000271	7.01	260	-	-	-	-	0.09	12	0.08	15
AF134159	14.11	18	0.43	7	0.20	10	0.08	13	0.07	16
*AA805749	5.72	393	-	-	-	-	-	-	0.07	17
*U46767	11.33	52	0.51	5	0.20	12	0.13	5	0.06	18
*AA804793	9.21	117	-	-	0.43	6	0.09	9	0.06	19
AA829241	12.76	33	0.25	10	0.12	21	0.06	19	0.06	20
*NM_000176 (2)	11.28	56	0.19	12	0.13	18	0.02	61	0.06	21
X00452 (5)	14.42	16	0.92	3	0.46	5	0.10	8	0.06	22
X00457 (3)
X62744 (3)
U15085 (4)
M16276 (4)
K01171 (5)
M20430 (4)
M83664 (2)
K01144 (6)
AK000170
LC_24239
*X52186	5.44	436	-	-	-	-	-	-	0.05	23
U59302	6.68	287	-	-	-	-	0.07	14	0.05	24
AA825732	7.67	215	-	-	-	-	0.05	21	0.05	25

Cox Score plot for DLBCL data. This figure shows the descending Cox scores of 5,656 Candidates in the DLBCL data. The dotted and dashed lines indicate the 99 and 95 percentiles of the distribution respectively. Bayesian Profiling of the DLBCL Data. The 16 starred genes are proposed by our exploratory selection. Bracketed are the numbers of features from the same gene, which are included in the same cluster. Employing our Bayesian approach to profile the 500 candidates with the largest Cox scores, the posterior probabilities, i.e., defined in (2), of the top 25 clusters range from 0.0538 to 0.9825. However, the ranks of these 25 clusters vary widely when their univariate Cox scores are used, and only five of those with the top 25 univariate Cox scores appear in this list. Therefore, it may be misleading to profile the clusters for their prognostic ability on the basis of their univariate Cox scores, since many false prognostic features can be highly ranked owing to the complicated correlation structure among features. When fewer than 500 candidates, for example, 100, 200 or 300, are profiled with the Bayesian approach, most of those that appeared in the top 25 of the 500 profiled candidates are also among the top 25 clusters as long as they are profiled. Indeed, the only exception is the cluster with two features in gene NM_00176, which was ranked at 61 when 300 candidates were profiled by the Bayesian approach. However, the complicated correlation structure between clusters makes it preferable to profile a number of clusters sufficient to avoid missing critical prognostic features. An exploratory selection of prognostic features from the top 25 clusters shown in Table 1 implies that 16 genes may be considered to construct prognostic features for the event time, and some of these features were ignored from the lists of 100, 200, and 300 candidates chosen on the basis of their univariate Cox scores. The cluster with 38 features from 11 genes is not one of the 16 selected, though all those genes except AK000170 belong to the MHC class II signature group defined by Rosenwald et al. [15]. D13666, which was reported by both Sha et al. [20] and Gui and Li [25], belongs to the lymph-node signature group, and BC012161 and AF134159 belong to the proliferation signature group (see Rosenwald et al. [15]). D42043, D88532, BC012161, and LC_33732 were also reported by Sha et al. [20]. It is interesting to observe that, among the 16 selected genes, AF414120 (gene CTLA4) is a member of the immunoglobulin superfamily and encodes a protein that transmits an inhibitory signal to T cells (Ling et al. [26]). AF127481, a lymphoid blast crisis oncogene (LBC), plays an important role in regulating the Rho/Rac GTPase cycle while the Rho/Rac family of small GTPases mediates cytoskeletal reorganization, gene transcription, and cell cycle progression through unique signal transduction pathways (Sterpetti et al. [27]). U46767 (gene CCL13) encodes a cytokine that plays a role in the accumulation of leukocytes during inflammation (Garcia-Zepeda et al. [28]). NM_000176 (gene NR3C1) encodes a receptor for glucocorticoids that can act as both a transcription factor and a regulator of other transcription factors. This protein can also be found in heteromeric cytoplasmic complexes along with heat shock factors and immunophilins (Subramaniam et al. [29]). X52186 (gene ITGB4) encodes the integrin beta 4 subunit, a receptor for the laminins, which tends to associate with the alpha 6 subunit and is likely to play a pivotal role in the biology of invasive carcinoma (Hogervorst et al. [30]).

Discussion

With high-throughput techniques now available, there has been extensive recent discussion of disease-specific molecular signatures [31,32]. The whole genome and proteome profiles for each of the limited number of patients presents an enormous number of molecular candidates with a complicated correlation structure. Here we group highly-correlated molecular features into linkage clusters in order to profile prognostic signatures. While the molecular features within the same linkage clusters are expected to have similar prognostic association with the event time, physical linkages and metabolic pathways will be able to provide confirmatory information. Thereupon, we strongly suggest that available genome, proteome, and metaboleme information be explored and combined with observed profiles to construct the linkage clusters. By doing this, we can improve the reliability significantly and establish the biological functionality of linkage clusters without overusing the limited number of profiles. When an optimal subset with a prespecified number of candidates is targeted, classical model selection approaches may be employed to explore all possible subsets and identify the best one. However, "small n large p" datasets may still obstruct this practice because of the enormous computation and unidentifiable models involved. Indeed, when p diverges as n → ∞, many classical approaches may not work even if p <n. Although univariate Cox scores are frequently utilized to profile candidates and accordingly select the subset, it is risky to identify prognostic signatures by this approach as it is easy to include false signatures but miss the true ones owing to the strong correlations among molecular features. For example, when a molecular feature is positively correlated with both true signatures, it may happen that the false one, instead of the two true ones, is selected. Ein-Dor et al. [33] discussed the discrepancies while using a univariate approach. Built upon the multivariate proportional hazard model (1), the proposed Bayesian approach is able to search all possible subsets of a certain size stochastically via Gibbs sampling. With restrictive priors for "small n large p" datasets, the posterior probability serves as a relative measure for profiling each candidate's prognostic association with the event time, accounting for other candidates. It is straightforward to extend this Bayesian approach to profile molecular signatures by controlling other clinical factors [3] and considering microenvironments [34]. The three-component prior for the coefficients in model (1) is crucial in constructing the profiling criterion. First, the prior probability of each component can be controlled with a uniform distribution on a subset of [0,1] to guarantee that the model is identifiable such that a Gibbs sampler can feasibly be employed to search the parameter space stochastically. Putting these prior probabilities on a restricted interval allows various numbers of nonzero coefficients in model (1). Second, the three-component prior approach provides flexibility in the possible imbalance between the scales and/or sizes of positive and negative coefficients in the model. Third, the three-component prior automatically results in a three-component posterior distribution for each coefficient, with the posterior probability of each component available for further calculation. In summary, the profiling criterion, posterior probability (), has a natural explanation and can be easily implemented in practice.

Methods

Construction of linkage clusters and their centroids

To facilitate pattern recognition and reveal otherwise hidden structures and functions, genes and proteins are usually clustered into groups based on different biological metrics, such as sequence similarity [35] or expression profiles [12,36]. With gene expression data only, many approaches have been applied to cluster genes that exhibit similar expression profiles across samples (see the review by Jörnsten and Yu, [36]). As we are interested in the prognostic relationships of genes to the event time, highly correlated genes are more likely to have the same power to predict the event time and therefore should be grouped into the same cluster. Here we define linkage clusters as groups of genes with large pairwise correlations in absolute value. As shown by Shaffer et al. [37], these linkage clusters of molecular features may reveal functional similarity/dissimilarity. Proteins regulated in the same metabolic pathway may also act similarly or oppositely. Although expressions of genes/proteins are usually observed with measurement errors, these linkage clusters can still be identified in experimental data. As shown in Figure 5, the molecular features can be mutually correlated as highly as correlation coefficient |ρ| ≥ 0.96. From a statistical point of view, these closely-related molecular features can cause collinearity or near collinearity in multivariate identification of prognostic signatures and therefore destabilize the identification result if all these closely-related molecular features are included in the prognostic model.

Figure 5

Illustration of Linkage Blocks. Part (a) indicates the color bar used in the other parts. Parts (b), (c) and (d) represent three different linkage blocks with correlation coefficients being 0.96, 0.98, and 0.97 respectively, where each row corresponds to a gene and each column corresponds to an individual. Hierarchical clustering approaches (e.g., the complete linkage clustering or single linkage clustering approach in Cluster 3.0 by de Hoon et al. [24]) can be used to construct these linkage clusters. With the mutual correlation coefficients estimated from the data, we use the absolute values of correlation coefficients as the similarity scores, i.e., with the distance measure d = 1 - |ρ|. We prune the hierarchical tree with a prespecified value for the distance, e.g., d ≤ 0.2 (hence absolute values of the correlation coefficients are no less than 0.8). The molecular features within the same branch are assumed to be within the same linkage cluster. The centroid of the linkage cluster is used to represent all the elements within the linkage cluster, and subsequent identification of prognostic signatures proceeds by associating these centroids only with the event time. The expression levels of the centroids are calculated by standardizing expressions of all genes. More specifically, for each linkage cluster, we first randomly select one gene and reverse the expression signs of all genes within the cluster that are negatively correlated with this gene. Then the expression level of the centroid is calculated by averaging the expression levels of all the genes within the same linkage cluster. Meanwhile, the measurement errors in expression levels are attenuated after averaging the gene expression levels within the same linkage cluster.

Bayesian framework of proportional hazard model

Suppose that p linkage clusters are identified and therefore the expressions of the p centroids are calculated for each of the n individuals. The observed event time for the i-th individual is denoted y, with δindicating whether it was right-censored owing to loss in follow-up (i.e., δ= 0 if right-censored and δ= 1 otherwise). Accordingly, the expressions of the centroids are denoted = (z, z, …, z). We use the popular Proportional Hazard Model [4] to associate the molecular features with the event time, i.e., the hazard function is modeled as follows: λ(t|) = λ0(t)exp(), (1) where includes the p coefficients for all the centroids, and λ0(·) is an unspecified baseline hazard function. Further, let = {(y, δ, z) : i = 1, 2, …, n} be the observed data, and (t) = {i : y≥ t} be the risk set at time t. Following Kalbfleisch [22], we construct the Bayesian framework to estimate by considering only the partial likelihood function, which avoids the nuisance baseline hazard function λ0(t). With a large number of available linkage clusters, the time-to-event of interest may be associated with a relatively small number of linkage clusters. On the other hand, the available "large p small n" data sets hamper us in detecting linkage clusters with too weak effects on the time-to-event of interest, and we expect to be able to identify those linkage clusters with strong effects. We therefore incorporate this important prior information by considering the following prior distribution for each β: β~ (1 - w+ - w-)δ{0} + w+N+(0, ) + w-N-(0, ), where N+(μ, σ2) and N-(μ, σ2) are the truncated Gaussian distributions with only positive and negative parts, respectively. As shown by Zhang et al. [38] and Zhang et al. [39], this three-component prior has some theoretical properties and allows a possible imbalance between scales and/or sizes of positive and negative coefficients in model (1). A priori, the hyperparameters and are assumed to follow inverse gamma distributions as IG(1, φ+) and IG(1, φ-), respectively. Here, sufficiently large φ+ and φ- are recommended to approximate the noninformative priors 1/ and 1/, respectively. For the prior distribution of (w+,w-), a noninformative prior (such as Dirichlet(1, 1, 1)) is not applicable since the model is not identifiable with p » n. As shown in Zhang et al. [40], the number of reliably identified significant predictors is limited by the sample size n. Following Zhang et al. [39] and Zhang et al. [41], we let w+ + w- ~ Unif(0, 2/p), which guarantees the model to be identifiable. On the other hand, as the number of candidate predictors is large, the upper bound, 2/p, on (w+ + w-) can be so restrictive that the resultant posterior probability for a true predictor to be significant can be very small. However, these posterior probabilities, as relative measures of significance, play an important role in profiling all features for their prognostic relations to the event time.

The Gibbs sampler

In view of the large number of parameters to be estimated, we consider a Gibbs sampler to obtain the posterior distributions of the parameters and make inferences. The Gibbs sampler can be developed by iteratively sampling each parameter from its full conditional distribution. For simplicity, let -include all components of except β, and write g(β|β-, ) = PL(β|) when βis of particular interest. Then the full conditional distribution of βis a mixture of a point mass at zero and two continuous distributions, i.e., β|w+, w-,,, ~ (1 - - )δ{0} + + , where and zare the distributions corresponding to the following probability density functions, and the probabilities for βto be positive and negative are, respectively, Here, wand ware normalization coefficients, which can be calculated as The full conditional distribution of w+ and w- is (w+, w-, 1 - w+ - w-)| ~ Dirichlet(+, -, p - + - -), w+ + w- ≤ , where + = #{k : β> 0} and - = #{k : β< 0}. Finally, the full conditional distribution of and are The parameters were initialized on the basis of estimators from univariate approaches. After the initial burn-in period (5,000 in the following analysis), the next 5,000 iterations in the Markov chain were used for inference without thinning. Convergence of the algorithm was checked by the diagnostic tools in Cowles and Carlin [42].

Profiling criterion

The significance of each centroid in model (1) is determined by one pair of parameters. They are, for the j-th centroid, the posterior probabilities p= P(β> 0|) and p= P(β< 0|). Given data , the marginal posterior distribution of βis still a mixture of three components, i.e., being positive with probability p= E[|], being negative with probability p= E[|], and having a point mass at zero with probability 1 - p- p. The two parameters pand pcan be estimated from the Markov chains of and drawn from the above Gibbs sampler. With moderately large p, the upper bound 2/p on (w+ + w-) may not be restrictive and βcan be estimated with the median value of its posterior probability. However, p is usually much larger than n in gene expression data and, as a result, pand pmay be heavily shrunk to zero. Therefore, identifying significant prognostic centroids with the posterior median values will be too conservative. Instead, we suggest profiling the prognostic association of k-th centroid to the event time by = max{p, p}, (2) which is a relative measure when p is much larger than n. As demonstrated in the simulation study, this profiling criterion performs much better than the popular univariate Cox score.

Authors' contributions

DZ and MZ both contributed to the development of the modeling method. DZ wrote the Matlab code and did the simulation study. MZ analyzed the real data. Both authors read and approved the final manuscript.

34 in total

1. 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns.

Authors: T Hastie; R Tibshirani; M B Eisen; A Alizadeh; R Levy; L Staudt; W C Chan; D Botstein; P Brown
Journal: Genome Biol Date: 2000-08-04 Impact factor: 13.583

Review 2. Signatures of the immune response.

Authors: A L Shaffer; A Rosenwald; E M Hurt; J M Giltnane; L T Lam; O K Pickeral; L M Staudt
Journal: Immunity Date: 2001-09 Impact factor: 31.745

3. Simultaneous gene clustering and subset selection for sample classification via MDL.

Authors: Rebecka Jörnsten; Bin Yu
Journal: Bioinformatics Date: 2003-06-12 Impact factor: 6.937

4. Predicting the clinical status of human breast cancer by using gene expression profiles.

Authors: M West; C Blanchette; H Dressman; E Huang; S Ishida; R Spang; H Zuzan; J A Olson; J R Marks; J R Nevins
Journal: Proc Natl Acad Sci U S A Date: 2001-09-18 Impact factor: 11.205

5. Tumor classification by partial least squares using microarray gene expression data.

Authors: Danh V Nguyen; David M Rocke
Journal: Bioinformatics Date: 2002-01 Impact factor: 6.937

6. Kernel Cox regression models for linking gene expression profiles to censored survival data.

Authors: Hongzhe Li; Yihui Luan
Journal: Pac Symp Biocomput Date: 2003

7. Linking gene expression data with patient survival times using partial least squares.

Authors: Peter J Park; Lu Tian; Isaac S Kohane
Journal: Bioinformatics Date: 2002 Impact factor: 6.937

8. The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma.

Authors: Andreas Rosenwald; George Wright; Wing C Chan; Joseph M Connors; Elias Campo; Richard I Fisher; Randy D Gascoyne; H Konrad Muller-Hermelink; Erlend B Smeland; Jena M Giltnane; Elaine M Hurt; Hong Zhao; Lauren Averett; Liming Yang; Wyndham H Wilson; Elaine S Jaffe; Richard Simon; Richard D Klausner; John Powell; Patricia L Duffey; Dan L Longo; Timothy C Greiner; Dennis D Weisenburger; Warren G Sanger; Bhavana J Dave; James C Lynch; Julie Vose; James O Armitage; Emilio Montserrat; Armando López-Guillermo; Thomas M Grogan; Thomas P Miller; Michel LeBlanc; German Ott; Stein Kvaloy; Jan Delabie; Harald Holte; Peter Krajci; Trond Stokke; Louis M Staudt
Journal: N Engl J Med Date: 2002-06-20 Impact factor: 91.245

9. Gene-expression profiles predict survival of patients with lung adenocarcinoma.

Authors: David G Beer; Sharon L R Kardia; Chiang-Ching Huang; Thomas J Giordano; Albert M Levin; David E Misek; Lin Lin; Guoan Chen; Tarek G Gharib; Dafydd G Thomas; Michelle L Lizyness; Rork Kuick; Satoru Hayasaka; Jeremy M G Taylor; Mark D Iannettoni; Mark B Orringer; Samir Hanash
Journal: Nat Med Date: 2002-07-15 Impact factor: 53.440

10. Variable selection for large p small n regression models with incomplete data: mapping QTL with epistases.

Authors: Min Zhang; Dabao Zhang; Martin T Wells
Journal: BMC Bioinformatics Date: 2008-05-29 Impact factor: 3.169

1 in total

1. Transcriptional profiling of VEGF-A and VEGF-C target genes in lymphatic endothelium reveals endothelial-specific molecule-1 as a novel mediator of lymphangiogenesis.

Authors: Jay W Shin; Reto Huggenberger; Michael Detmar
Journal: Blood Date: 2008-07-09 Impact factor: 22.113

1 in total