John W Emerson, Marisa Dolled-Filhart, Lyndsay Harris, David L Rimm, David P Tuck.
Abstract
Missing data pose one of the greatest challenges in the rigorous evaluation of biomarkers. The limited availability of specimens with complete clinical annotation and quality biomaterial often leads to underpowered studies. Tissue microarray studies, for example, may be further handicapped by the loss of data points because of unevaluable staining, core loss, or the lack of tumor in the histospot. This paper presents a novel approach to these common problems in the context of a tissue protein biomarker analysis in a cohort of patients with breast cancer. Our analysis develops techniques based on multiple imputation to address the missing value problem. We first select markers using a training cohort, identifying a small subset of protein expression levels that are most useful in predicting patient survival. The best model is obtained by including both the protein markers (COX6C, GATA3, NAT1, and ESR1) and lymph node status. The use of either lymph node status or the four protein expression levels provides similar improvements in goodness-of-fit, with both significantly better than a baseline clinical model. Using the same multiple imputation strategy, we then validate the results out-of-sample on a larger independent cohort. Our approach of integrating multiple imputation with each stage of the analysis serves as an example that may be replicated or adapted in future studies with missing values.
Keywords: biomarker; breast cancer; immunohistochemistry; multiple imputation; variable selection
Year: 2008 PMID: 19352457 PMCID: PMC2664700 DOI: 10.4137/cin.s911
Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351
Primary antibody details. An asterisk (*) denotes the variables selected for analysis with the training cohort.
| Antibody | Source | Species (Dilution, time) |
|---|---|---|
| ACADSB | Gift of Gerry | rabbit polyclonal (1:5000, 1 hour) |
| AGR2 | Gift of Devon Thompson | rabbit polyclonal (1:1000, 1 hour) |
| BCL2 | DAKO, clone 124 | mouse monoclonal (1:40, 1 hour) |
| BNIP3 | BD Pharmingen | rabbit polyclonal (1:500, overnight) |
| CA12* | Gift of William Sly | rabbit polyclonal (1:2000, 30 minutes) |
| CAV1* | Transduction Labs, clone 2297 | mouse monoclonal (1:100, overnight) |
| CD24 | Neomarkers, clone 24C02 Ab-2 | mouse monoclonal (1:50, 1 hour) |
| CDH3* | BD Transduction Labs clone 56 | mouse monoclonal (1:200, overnight) |
| COX6C* | Molecular Probes clone 3G5 | mouse monoclonal (1:100, overnight) |
| CTSD | DAKO | rabbit polyclonal (1:1000, 1 hour) |
| EEF1D | Gift of Ong Lee Lee | rabbit polyclonal (1:5000, 1 hour) |
| ESR1* | DAKO Estrogen Receptor antibody clone 1D5 | mouse monoclonal (1:50, 1 hour) |
| GATA3* | Santa Cruz, clone HG3-31 | mouse monoclonal (1:100, 1 hour) |
| GGH | Gift of Thomas J. Ryan | rabbit polyclonal (1:400, 1 hour) |
| GLUL | BD Transduction Labs clone 6 | mouse monoclonal (1:1000, overnight) |
| GRB7 | Santa Cruz | rabbit polyclonal (1:250, 1 hour) |
| GSTP1* | DAKO clone 353-10 | mouse monoclonal (1:50, 1 hour) |
| HER2 | DAKO | rabbit polyclonal (1:8000, 1 hour) |
| HSP27 | Neomarkers clone Ab-1 G3.1 | mouse monoclonal (1:50, 30 minutes) |
| IGFBP2 | Santa Cruz | goat polyclonal (1:1000, 30 minutes) |
| IGFBP4* | Austral Biologicals | mouse monoclonal (1:50, 1 hour) |
| IGFBP5* | Austral Biologicals | mouse monoclonal (1:100, 1 hour) |
| IRAK1 | Santa Cruz | rabbit polyclonal (1:100, 1 hour) |
| JUP | BD Transduction Labs | mouse monoclonal (1:1000, overnight) |
| KRT7 | DAKO clone TL 12/30 | mouse monoclonal (1:50, 1 hour) |
| KRT8 | DAKO clone 25BH11 | mouse monoclonal (1:100, 1 hour) |
| KRT18 | DAKO clone DC10 | mouse monoclonal (1:50, 1 hour) |
| KRT19 | DAKO clone RCK108 | mouse monoclonal (1:50, 1 hour) |
| MUC1 | Novocastra | mouse monoclonal (1:100, overnight) |
| MYC* | DAKO clone 1D5 | mouse monoclonal (1:200, 1 hour) |
| NAT1* | Gift of Edith Sim | rabbit polyclonal (1:1000, 1 hour) |
| PCNT1 | Gift of Stephen Doxsey | rabbit polyclonal (1:500, 1 hour) |
| PFK | Gift of George Dunaway | rabbit polyclonal (1:2000, 1 hour) |
| RNF110 | Santa Cruz | rabbit polyclonal (1:400, overnight) |
| SERPINA3 | DAKO | rabbit polyclonal (1:3200, 10 minutes) |
| SLC7A5 | Serotec | rabbit polyclonal (1:50, 1 hour) |
| SLC9A3R1 | Gift of Vijaya Ramesh | rabbit polyclonal (1:50, overnight) |
| TFF1 | DAKO clone BC04 | mouse monoclonal (1:5000, overnight) |
| TFF3 | Gift of Daniel Podolsky | rabbit polyclonal (1:500, 1 hour) |
| THBS1 | Neomarkers clone A6.1 Ab4 | mouse monoclonal (1:50, overnight) |
| TIMP3* | Oncogene Research clone 136-13H4 Ab-1 | mouse monoclonal (1:50, overnight) |
| XBP1* | Santa Cruz | rabbit polyclonal (1:200, overnight) |
Figure 1. Marker Selection.
Figure 2. Out-of-sample validation.
A comparison of the training and validation cohorts. Univariate Cox proportional hazard coefficients (with 95% confidence intervals) show the similarities between the cohorts with the exception of nuclear grade (which appears to have a statistically significant relationship to survival in the validation cohort, but not the training cohort).
| Variable | Statistic | Training (n = 236) | Validation (n = 338) |
|---|---|---|---|
| Age at diagnosis | Missing values (percent) | 0 (0%) | 0 (0%) |
| | Mean (standard deviation) | 59.9 (12.4) | 56.8 (12.0) |
| | Hazard ratio (95% confidence interval) | 1.00 (0.989–1.02) | 1.00 (0.987–1.01) |
| Lymph node status | Missing values (percent) | 1 (0.4%) | 0 (0%) |
| | Node positive (percent) | 119 (50.4%) | 169 (50%) |
| | Node negative (percent) | 116 (49.2%) | 169 (50%) |
| | Positive nodes: Mean (standard deviation) | 6.6 (8.0) | 6.0 (6.1) |
| | Hazard ratio (95% confidence interval) | 1.04 (1.02–1.07) | 1.06 (1.04–1.07) |
| Nuclear grade | Missing values (percent) | 9 (3.8%) | 28 (8.3%) |
| | Grade 1: Count (percent) | 36 (15.3%) | 64 (18.9%) |
| | Grade 2: Count (percent) | 118 (50%) | 169 (50%) |
| | Grade 3: Count (percent) | 73 (30.9%) | 77 (22.8%) |
| | Hazard ratio (95% confidence interval) | 1.12 (0.846–1.49) | 1.39 (1.10–1.77) |
| Tumor size | Missing values (percent) | 0 (0%) | 37 (10.9%) |
| | Mean (standard deviation) | 2.97 (2.24) | 2.79 (2.12) |
| | Hazard ratio (95% confidence interval) | 1.10 (1.05–1.16) | 1.15 (1.08–1.23) |
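The caption's remark about nuclear grade can be checked directly from the reported intervals: a 95% confidence interval for a hazard ratio that excludes 1 corresponds to a statistically significant univariate association at the 5% level, and the Cox coefficient is the natural log of the hazard ratio. The sketch below applies that check to the values in the table above. The row labels (age at diagnosis, positive nodes, nuclear grade, tumor size) are inferred from the baseline model variables and the order of the rows; they are an assumption, not labels supplied by the table itself.

```python
import math

# Hazard ratios with 95% confidence intervals, copied from the table above.
# Row labels are inferred from context (assumption), not given in the table.
rows = {
    ("Age at diagnosis", "training"): (1.00, 0.989, 1.02),
    ("Age at diagnosis", "validation"): (1.00, 0.987, 1.01),
    ("Positive nodes", "training"): (1.04, 1.02, 1.07),
    ("Positive nodes", "validation"): (1.06, 1.04, 1.07),
    ("Nuclear grade", "training"): (1.12, 0.846, 1.49),
    ("Nuclear grade", "validation"): (1.39, 1.10, 1.77),
    ("Tumor size", "training"): (1.10, 1.05, 1.16),
    ("Tumor size", "validation"): (1.15, 1.08, 1.23),
}


def excludes_one(lower, upper):
    """True when the interval for the hazard ratio does not contain 1."""
    return lower > 1.0 or upper < 1.0


summary = {}
for key, (hr, lower, upper) in rows.items():
    summary[key] = {
        "coef": math.log(hr),  # Cox coefficient = log hazard ratio
        "significant": excludes_one(lower, upper),
    }

for (variable, cohort), s in summary.items():
    print(f"{variable:17s} {cohort:10s} "
          f"coef={s['coef']:+.3f} significant={s['significant']}")
```

Under this reading, nuclear grade is the only variable whose interval excludes 1 in one cohort (validation) but not the other (training), which matches the exception noted in the caption.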
Model selection on training data.
| Model name | Variables | Improvement in R² over Baseline Clinical Model, mean (standard deviation) | Improvement in log-likelihood over Baseline Clinical Model, mean (standard deviation) |
|---|---|---|---|
| Baseline | Age at Diagnosis, Nuclear Grade, Tumor Size | NA | NA |
| M1 | Baseline + COX6C | 0.0537 (0.0157) | 6.78 (2.05) |
| M2 | M1 + GATA3 (N) | 0.0655 (0.0158) | 8.33 (2.08) |
| M3 | M2 + ESR1 (N) | 0.0693 (0.0155) | 8.82 (2.06) |
| M4 | M3 + NAT1.Total | 0.0743 (0.0170) | 9.48 (2.26) |
| Nodes | Baseline + Positive Nodes | 0.0312 (0.0011) | 3.89 (0.13) |
| Combined | Baseline + M4 + Positive Nodes | 0.0993 (0.0171) | 12.85 (2.35) |
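The table reports each improvement as a mean with a standard deviation, presumably taken over the multiply imputed datasets. A minimal sketch of how such a summary can be produced follows. It is illustrative only and makes several assumptions that are not from the source: the data are synthetic, ordinary least-squares R² stands in for the Cox model goodness-of-fit, and a simple random draw from the observed values (hot-deck) stands in for whatever imputation model the study used. All function and variable names are hypothetical.

```python
import random
import statistics


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5


def r2_baseline(y, x1):
    """R^2 of the least-squares fit y ~ x1."""
    return pearson(y, x1) ** 2


def r2_with_marker(y, x1, x2):
    """R^2 of the least-squares fit y ~ x1 + x2 (two-predictor formula)."""
    r1, r2, r12 = pearson(y, x1), pearson(y, x2), pearson(x1, x2)
    return (r1 ** 2 + r2 ** 2 - 2 * r1 * r2 * r12) / (1 - r12 ** 2)


rng = random.Random(0)
n, m = 200, 20  # sample size and number of imputations (both assumed)

# Synthetic data: x1 is a complete "clinical" covariate, x2 a "marker".
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2_true = [rng.gauss(0, 1) for _ in range(n)]
y = [a + 0.8 * b + rng.gauss(0, 1) for a, b in zip(x1, x2_true)]

# Remove roughly 20% of the marker values at random.
x2 = [v if rng.random() > 0.2 else None for v in x2_true]
observed = [v for v in x2 if v is not None]

improvements = []
for _ in range(m):
    # Hot-deck imputation: fill each gap with a random observed value.
    completed = [v if v is not None else rng.choice(observed) for v in x2]
    gain = r2_with_marker(y, x1, completed) - r2_baseline(y, x1)
    improvements.append(gain)

mean_gain = statistics.fmean(improvements)
sd_gain = statistics.stdev(improvements)
print(f"R^2 improvement over baseline: {mean_gain:.4f} ({sd_gain:.4f})")
```

Because the baseline covariate here has no missing values, only the marker model varies between imputations. The spread of the improvements therefore reflects the uncertainty introduced by imputing the marker, which is the quantity a standard deviation column of this kind summarizes.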
Figure 3. Validation of four-marker model.
Each plot depicts the distribution of improvements in the goodness-of-fit statistics for three candidate models compared to the baseline model containing only the clinical factors: “Nodes” (lymph node status and clinical factors); “Markers” (four selected protein markers and clinical factors); and “Combined” (clinical factors, protein markers, and nodal status).
Simulation results. The table shows the number of times (out of 50) that the four markers were captured by the variable selection process. The last column indicates the mistaken inclusions of spurious variables.
| Method | Beta1 | Beta2 | Beta3 | Beta4 | Others |
|---|---|---|---|---|---|
| stepMI | 50 | 50 | 50 | 6 | 159 |
| Drop | 9 | 4 | 2 | 0 | 35 |
| Median | 50 | 50 | 50 | 43 | 572 |
| KNN | 50 | 50 | 50 | 37 | 512 |
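The counts are easier to compare as rates. The short sketch below converts each count into a capture rate out of the 50 runs and divides the "Others" column by 50 to get an average number of spurious inclusions per run. Treating "Others" as a total summed over all 50 runs is an assumption, made because the values exceed 50; the function name `summarize` is hypothetical.

```python
N_RUNS = 50  # number of simulation runs stated in the table caption

# Counts copied from the table: selections of each true marker
# (Beta1..Beta4), then the "Others" column of spurious inclusions.
counts = {
    "stepMI": ([50, 50, 50, 6], 159),
    "Drop": ([9, 4, 2, 0], 35),
    "Median": ([50, 50, 50, 43], 572),
    "KNN": ([50, 50, 50, 37], 512),
}


def summarize(true_hits, spurious_total, n_runs=N_RUNS):
    """Return per-marker capture rates and spurious inclusions per run."""
    capture_rates = [hits / n_runs for hits in true_hits]
    spurious_per_run = spurious_total / n_runs  # assumes a total over runs
    return capture_rates, spurious_per_run


summary = {method: summarize(*values) for method, values in counts.items()}

for method, (rates, spurious) in summary.items():
    rate_text = ", ".join(f"{r:.0%}" for r in rates)
    print(f"{method:7s} capture: {rate_text}; spurious per run: {spurious:.2f}")
```

Under this reading, stepMI averages about 3.2 spurious inclusions per run against roughly 11.4 for Median and 10.2 for KNN, but it captures Beta4 in only 6 of 50 runs, while Drop rarely captures any of the four markers.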