Eric Bair, Robert Tibshirani.
Abstract
An important goal of DNA microarray research is to develop tools that diagnose cancer more accurately from the genetic profile of a tumor. Several techniques for this type of diagnosis exist in the literature. Unfortunately, most of them assume that different subtypes of cancer are already known to exist, so their utility is limited when such subtypes have not been previously identified. Although methods for identifying such subtypes exist, they do not work well for all datasets. It would be desirable to develop a procedure for finding such subtypes that is applicable in a wide variety of circumstances. Even if nothing is known about possible subtypes of a certain form of cancer, clinical information about the patients, such as their survival time, is often available. In this study, we develop procedures that use both the gene expression data and the clinical data to identify subtypes of cancer and use this knowledge to diagnose future patients. These procedures were successfully applied to several publicly available datasets. We present diagnostic procedures that accurately predict the survival of future patients based on the gene expression profiles and survival times of previous patients. This has the potential to be a powerful tool for diagnosing and treating cancer.
Year: 2004 PMID: 15094809 PMCID: PMC387275 DOI: 10.1371/journal.pbio.0020108
Source DB: PubMed Journal: PLoS Biol ISSN: 1544-9173 Impact factor: 8.029
Figure 1. Two Patient Subgroups with Overlapping Survival Times
Figure 2. Comparison of the Survival Curves of the “Low-Risk” and “High-Risk” Groups
These were obtained by applying nearest shrunken centroids to the DLBCL test data. Patients in the training data were assigned to either the “low-risk” or “high-risk” group depending on whether or not their survival time was greater than the median survival time of all the patients.
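The median split described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `median_risk_labels` and the toy survival times are hypothetical, and the sketch ignores censoring, which a real survival analysis would have to handle.

```python
import numpy as np

def median_risk_labels(survival_times):
    """Label a patient 'low-risk' if survival time exceeds the median, else 'high-risk'.

    Assumption: all survival times are fully observed (no censoring).
    """
    times = np.asarray(survival_times, dtype=float)
    median = np.median(times)
    return np.where(times > median, "low-risk", "high-risk")

# Hypothetical training survival times (arbitrary units); the median is 4.8.
times = [2.0, 5.5, 1.2, 8.3, 4.1, 9.7]
labels = median_risk_labels(times)
```

Labels of this kind could then serve as class labels for a classifier such as nearest shrunken centroids.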
Figure 3. Comparison of the Survival Curves Resulting from Applying Two Different Clustering Methods to the DLBCL Data
Figure 4. Comparison of the Survival Curves Resulting from Applying Two Different Clustering Methods to the DLBCL Data
Figure 5. Survival Curves for Clusters Derived from the DLBCL Data
Figure 6. Plot of Survival Versus the Predictor υ^I for the DLBCL Data
Supervised Principal Components Applied to Breast Cancer Data
Comparison of the values of the R² statistic of the Cox proportional hazards model (and the p-value of the associated log-rank statistic) obtained by fitting the times to metastasis to our supervised principal components method and the discrete predictor described in van't Veer et al. (2002).
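For orientation, one widely used definition of an R² statistic for a Cox proportional hazards model is the likelihood-based form below. Whether this is the exact definition behind the table is an assumption; the caption does not say. Here n is the number of samples, ℓ(β̂) is the log partial likelihood at the fitted coefficients, and ℓ(0) is the log partial likelihood of the null model.

```latex
R^2 = 1 - \exp\!\left(-\frac{2}{n}\,\bigl[\ell(\hat{\beta}) - \ell(0)\bigr]\right)
```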
Comparison of the Different Methods on Four Datasets
Comparison of the different methods applied to the DLBCL data of Rosenwald et al. (2002), the breast cancer data of van't Veer et al. (2002), the lung cancer data of Beer et al. (2002), and the acute myeloid leukemia (AML) data of Bullinger et al. (2004). The methods are (1) assigning samples to a “low-risk” or “high-risk” group based on their median survival time; (2) using 2-means clustering based on the genes with the largest Cox scores; (3) using the supervised principal components method; (4) using 2-means clustering based on the genes with the largest PLS-corrected Cox scores; (5) using the continuous predictor; (6) using 2-means clustering to identify two subgroups; (7) partitioning the training data into “low-risk” and “high-risk” subgroups by choosing the split that minimizes the p-value of the log-rank test when applied to the two resulting groups; (8) using SVMs, similar to the method of Li and Luan (2003); (9) using a discretized version of (8); (10) using partial least squares regression, similar to the method of Nguyen and Rocke (2002a); (11) using a discretized version of (10); (12) using the method of Beer et al. (2002).
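The general idea behind a supervised principal components approach (rank genes by a univariate score against the outcome, keep the top-scoring genes, then take the leading principal component of the reduced matrix) can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' implementation: it swaps the Cox score for the absolute Pearson correlation with an uncensored outcome, and the function name `supervised_pc`, the simulated data, and the choice of 10 genes are all hypothetical.

```python
import numpy as np

def supervised_pc(X, y, n_genes):
    """Return (indices of selected genes, first principal component scores).

    X: samples x genes expression matrix; y: outcome per sample.
    Assumption: y is uncensored, so |Pearson correlation| stands in for the Cox score.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)          # center each gene
    yc = y - y.mean()
    # Univariate score per gene: absolute correlation with the outcome.
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    scores = np.abs(Xc.T @ yc) / np.where(denom > 0, denom, 1.0)
    top = np.argsort(scores)[::-1][:n_genes]
    # First principal component of the reduced matrix via SVD.
    U, s, _ = np.linalg.svd(Xc[:, top], full_matrices=False)
    pc1 = U[:, 0] * s[0]
    return top, pc1

# Simulated data: 40 samples, 200 genes, the first 10 genes carry a latent signal.
rng = np.random.default_rng(0)
n, p = 40, 200
latent = rng.normal(size=n)
X = rng.normal(size=(n, p))
X[:, :10] += 2.0 * latent[:, None]
y = latent + 0.1 * rng.normal(size=n)

top, pc1 = supervised_pc(X, y, n_genes=10)
```

The resulting component `pc1` could then be used as a continuous predictor in a regression model for the outcome. Its sign is arbitrary, as with any principal component.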
Comparison of the Different Methods on Our Simulated Data
The methods are (1) assigning samples to a “low-risk” or “high-risk” group based on their median survival time; (2) using 2-means clustering to identify two subgroups; (3) using 2-means clustering based on the genes with the largest Cox scores; (4) using the supervised principal components method; (5) using 2-means clustering based on the genes with the largest PLS-corrected Cox scores; (6) using the continuous predictor. Each entry in the table represents the mean over 10 simulations; the standard error is given in parentheses.
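Several of the methods listed rely on 2-means clustering of the samples. A minimal sketch using Lloyd's algorithm is shown below. The function name `two_means`, the deterministic initialization (first sample plus the sample farthest from it), and the toy matrix are illustrative assumptions; the caption does not specify how the clustering was run. To cluster on selected genes only, one would pass the corresponding columns of the expression matrix.

```python
import numpy as np

def two_means(X, n_iter=100):
    """Cluster rows of X (samples x features) into two groups with Lloyd's algorithm.

    Assumption: centers start at the first sample and the sample farthest from it.
    """
    X = np.asarray(X, dtype=float)
    far = ((X - X[0]) ** 2).sum(axis=1).argmax()
    centers = np.stack([X[0], X[far]])
    for _ in range(n_iter):
        # Squared distance from every sample to each of the two centers.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center; keep the old one if a cluster is empty.
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(2)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy data: two well-separated groups of three samples each.
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = two_means(X)
```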
Comparison of the Different Methods on Our Simulated Data
The methods are (1) assigning samples to a “low-risk” or “high-risk” group based on their median survival time; (2) using 2-means clustering to identify two subgroups; (3) using 2-means clustering based on the genes with the largest Cox scores; (4) using the supervised principal components method; (5) using 2-means clustering based on the genes with the largest PLS-corrected Cox scores; (6) using the continuous predictor. Each entry in the table represents the mean over 10 simulations; the standard error is given in parentheses.