Chen Zhao, Leming Shi, Weida Tong, John D Shaughnessy, André Oberthuer, Lajos Pusztai, Youping Deng, W Fraser Symmans, Tieliu Shi.
Abstract
BACKGROUND: Microarray data have been used for gene signature selection to predict clinical outcomes. Many studies have attempted to identify factors that affect models' performance, with little success. Fine-tuning model parameters and optimizing each step of the modeling process often result in over-fitting without improving performance.
Year: 2011 PMID: 22369035 PMCID: PMC3287499 DOI: 10.1186/1471-2164-12-S5-S3
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Overview of the analysis flow chart. 1) The entire analysis rests on the hypothesis that a causal relation exists between the expression levels of specific genes and an endpoint phenotype. 2) The strength of the relation is calculated by the proposed consistency degree, which is validated against the performance of thousands of models. 3) By maximizing the consistency degree, the endpoint phenotype can be redefined to be most consistent with gene expression levels. 4) Finally, the traditional and redefined endpoints are compared by the ranges of phenotype variance and the positions of unpredictable samples in gene expression levels.
Figure 2. Change point comparison between the given and redefined cutoffs for endpoint F. A. Change point comparison. The figure has two parts: the upper part shows the posterior mean and the lower part the posterior probability. In the upper part, the blue solid lines and the translucent skewed ellipses belong to the change point chart for the cutoff of 730 days (2*365) given by the MAQC-II project, while the others belong to the change point chart for the redefined cutoff of 1,561 days. The posterior mean for the given cutoff is linearly scaled to the range of the posterior mean for the redefined cutoff. The boxplots on the left side of the change point charts represent the ranges of the first component values of the OS negative samples. The larger points represent the unpredictable samples: the skewed elliptical points in purple mark their positions in the chart with the given cutoff, and the circular points in orange mark their positions in the chart with the redefined cutoff. In the posterior probability part of the figure, the dashed lines are reference lines at the maximum posterior probability values. B. Kaplan-Meier survival curves. Six Kaplan-Meier survival curves are shown, one for each of six datasets: the training, test, and combined training-test datasets for both the EFS and OS endpoints. The three EFS curves (blue, green, wheat) decline more steeply than the OS curves (purple, yellow, red) and lie beneath them.
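The linear scaling mentioned in the caption for panel A can be illustrated with a minimal sketch. It assumes an ordinary min-max mapping of one series onto the range of another; the function name `rescale_to_range` and the numbers are hypothetical and are not taken from the paper, which does not state the exact scaling formula.

```python
def rescale_to_range(values, target):
    """Linearly map `values` onto the [min, max] range of `target`.

    Assumption: "linearly scaled" means a min-max transform; the paper
    does not spell out the formula.
    """
    lo, hi = min(values), max(values)
    t_lo, t_hi = min(target), max(target)
    if hi == lo:
        # A constant series has no spread; place it mid-range.
        return [(t_lo + t_hi) / 2.0 for _ in values]
    scale = (t_hi - t_lo) / (hi - lo)
    return [t_lo + (v - lo) * scale for v in values]


# Hypothetical posterior means, for illustration only.
given_cutoff_mean = [1.0, 2.0, 3.0]
redefined_cutoff_mean = [-2.0, 0.0, 4.0]

scaled = rescale_to_range(given_cutoff_mean, redefined_cutoff_mean)
print(scaled)  # [-2.0, 1.0, 4.0]
```

After such a transform the two curves share one vertical axis, so the locations of their change points can be compared directly even though the original magnitudes differ.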
Figure 3. Unpredictability relationship between endpoints F and G. A) The error-rate cutoff for selecting OS (endpoint F) unpredictable samples is set to 0.9. The height of each bar represents the probability density of samples at the corresponding EFS (endpoint G) prediction error: gray bars for all samples and red bars for the OS unpredictable samples, those with an error rate larger than 0.9. The OS unpredictable samples tend to have a high EFS prediction error rate. B) The same pattern holds for the EFS unpredictable samples.
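The selection described in panel A can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes each sample carries one error rate per endpoint (for example, the fraction of models that misclassify it), and the helper names and all numbers are hypothetical.

```python
def unpredictable(error_rates, cutoff=0.9):
    """Return the IDs of samples whose error rate exceeds `cutoff`."""
    return {sid for sid, err in error_rates.items() if err > cutoff}


def mean(values):
    """Arithmetic mean; NaN for an empty input."""
    values = list(values)
    return sum(values) / len(values) if values else float("nan")


# Hypothetical per-sample error rates, for illustration only.
os_error = {"s1": 0.95, "s2": 0.10, "s3": 0.92, "s4": 0.30}
efs_error = {"s1": 0.90, "s2": 0.20, "s3": 0.80, "s4": 0.10}

# Samples that are unpredictable for OS at the 0.9 cutoff.
os_hard = unpredictable(os_error, cutoff=0.9)

# Compare the EFS error of that subset with the EFS error of all samples.
mean_efs_all = mean(efs_error.values())
mean_efs_os_hard = mean(efs_error[sid] for sid in os_hard)

print(sorted(os_hard))
print(mean_efs_all, mean_efs_os_hard)
```

The figure compares full density histograms rather than means; a mean is used here only to keep the sketch short. Swapping the roles of `os_error` and `efs_error` gives the comparison shown in panel B.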