A M G Ali, S-J Dawson, F M Blows, E Provenzano, I O Ellis, L Baglietto, D Huntsman, C Caldas, P D Pharoah.
Abstract
BACKGROUND: Tissue micro-arrays (TMAs) are increasingly used to generate data on the molecular phenotype of tumours in clinical epidemiology studies, such as studies of disease prognosis. However, TMA data are particularly prone to missingness. A variety of methods for dealing with missing data are available, but the validity of each approach depends on the structure of the missing data, and there are few empirical studies of missing data in molecular pathology. The purpose of this study was to compare the results of four commonly used approaches to handling missing data in a large, multi-centre study of the molecular pathological determinants of prognosis in breast cancer.
PATIENTS AND METHODS: We pooled data from over 11,000 cases of invasive breast cancer from five studies that collected information on seven prognostic indicators together with survival time data. We compared the results of a multivariate Cox regression using four approaches to handling missing data: complete case analysis (CCA), mean substitution (MS), multiple imputation without inclusion of the outcome (MI−) and multiple imputation with inclusion of the outcome (MI+). We also performed an analysis in which missing data were simulated under different assumptions and the results of the four methods were compared.
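As a rough illustration of the two simplest approaches named in the abstract, the sketch below applies complete case analysis and mean substitution to a handful of hypothetical records. The field names and values are invented for illustration, and the functions are only one plausible reading of these methods, not the study's actual procedure; multiple imputation is not shown because it needs a statistical library.

```python
from statistics import mean

# Hypothetical toy records; None marks a missing covariate value.
records = [
    {"grade": 1, "er_positive": 1},
    {"grade": 3, "er_positive": None},
    {"grade": None, "er_positive": 0},
    {"grade": 2, "er_positive": 1},
]


def complete_cases(rows):
    """Complete case analysis: keep only rows with no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def mean_substitute(rows):
    """Mean substitution: fill each gap with the observed mean of its column."""
    columns = list(rows[0].keys())
    col_means = {
        c: mean(r[c] for r in rows if r[c] is not None) for c in columns
    }
    return [
        {c: (r[c] if r[c] is not None else col_means[c]) for c in columns}
        for r in rows
    ]


cca_rows = complete_cases(records)   # drops the two incomplete records
ms_rows = mean_substitute(records)   # keeps all four records
```

Complete case analysis shrinks the sample, whereas mean substitution keeps every record but replaces each gap with the same value, which reduces the apparent spread of that covariate.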
Year: 2011 PMID: 21266980 PMCID: PMC3049587 DOI: 10.1038/sj.bjc.6606078
Source DB: PubMed Journal: Br J Cancer ISSN: 0007-0920 Impact factor: 7.640
Baseline characteristics of breast cancer datasets
| Characteristic | Number (%) |
|---|---|
| Mean age | 55 (N/A) |
| Mean follow-up | 8.4 (N/A) |
| Number of breast deaths | 2677 (24) |
| 5-year survival | 8002 (71) |
| Grade | |
| 1 | 1437 (13) |
| 2 | 4155 (36) |
| 3 | 4546 (41) |
| Missing | 1074 (10) |
| Nodal status | |
| Negative | 5478 (49) |
| Positive | 4060 (36) |
| Missing | 1674 (15) |
| Tumour size | |
| <2 | 4545 (41) |
| 2–4.9 | 4664 (41) |
| 5+ | 697 (6) |
| Missing | 1306 (12) |
| ER status | |
| Negative | 3037 (27) |
| Positive | 7458 (67) |
| Missing | 717 (6) |
| PR status | |
| Negative | 3963 (35) |
| Positive | 5030 (45) |
| Missing | 2219 (20) |
| HER2 status | |
| Negative | 7068 (63) |
| Positive | 1104 (10) |
| Missing | 3040 (27) |
| BCL2 status | |
| Negative | 2185 (19) |
| Positive | 5700 (51) |
| Missing | 3327 (30) |
Abbreviations: BCL2=B-cell lymphoma 2; ER=oestrogen receptor; HER2=human epidermal growth factor receptor-2; N/A=not applicable; PR=progesterone receptor.
Number of missing values for prognostic covariates
| Covariate | Observed, n | Missing, n (%) |
|---|---|---|
| Grade | 10138 | 1074 (10) |
| Nodal status | 9538 | 1674 (15) |
| Tumour size | 9906 | 1306 (12) |
| ER status | 10495 | 717 (6) |
| PR status | 8993 | 2219 (20) |
| HER2 status | 8172 | 3040 (27) |
| BCL2 status | 7885 | 3327 (30) |
Abbreviations: BCL2=B-cell lymphoma 2; ER=oestrogen receptor; HER2=human epidermal growth factor receptor-2; PR=progesterone receptor.
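The percentages in the table above can be checked from the counts alone. The short sketch below copies the two counts for each covariate from the table, confirms that every row sums to the same total number of cases (11212), and recomputes the rounded percentage of missing values. The variable and function names are illustrative only.

```python
# Counts copied from the table above: covariate -> (observed, missing).
counts = {
    "Grade": (10138, 1074),
    "Nodal status": (9538, 1674),
    "Tumour size": (9906, 1306),
    "ER status": (10495, 717),
    "PR status": (8993, 2219),
    "HER2 status": (8172, 3040),
    "BCL2 status": (7885, 3327),
}


def missing_percent(observed, missing):
    """Percentage of cases with a missing value, rounded to a whole number."""
    return round(100 * missing / (observed + missing))


# Every covariate should account for the same number of cases.
totals = {name: obs + mis for name, (obs, mis) in counts.items()}

# Recomputed percentages, to compare with the bracketed values in the table.
percents = {name: missing_percent(obs, mis) for name, (obs, mis) in counts.items()}
```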
Correlation of missingness in breast cancer prognostic factors
| | Grade | Nodal status | Size group | ER status | PR status | HER2 status |
|---|---|---|---|---|---|---|
| Grade | 1.00 | | | | | |
| Nodal status | 0.09 | 1.00 | | | | |
| Size group | 0.10 | 0.73 | 1.00 | | | |
| ER status | 0.01 | −0.03 | (0.01) | 1.00 | | |
| PR status | (−0.01) | 0.08 | 0.07 | 0.41 | 1.00 | |
| HER2 status | (0.01) | 0.28 | 0.23 | 0.36 | 0.57 | 1.00 |
| BCL2 status | −0.04 | 0.07 | 0.05 | 0.30 | 0.58 | 0.57 |
Abbreviations: BCL2=B-cell lymphoma 2; ER=oestrogen receptor; HER2=human epidermal growth factor receptor-2; PR=progesterone receptor.
Coefficients within parentheses are not statistically significant (P>0.05).
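One way to obtain a correlation of missingness is to turn each covariate into a 0/1 indicator (1 when the value is missing) and correlate the indicators pairwise. The record does not say which coefficient was used, so the sketch below assumes a plain Pearson correlation on the indicators; the two toy columns are hypothetical.

```python
from math import sqrt


def missing_indicator(values):
    """Return 1 where a value is missing (None) and 0 where it is observed."""
    return [1 if v is None else 0 for v in values]


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical marker columns for six cases; None marks a missing score.
pr = [1, None, 0, None, 1, 0]
her2 = [0, None, 0, None, 1, None]

r = pearson(missing_indicator(pr), missing_indicator(her2))
```

A high positive value means the two markers tend to be missing in the same cases, which is the pattern the table shows for the markers with the most missing values.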
Comparison of coefficients (log hazard ratio) and standard errors (s.e.) from analyses based on four methods for handling missing data
| Variable | CCA coefficient | CCA s.e. | MS coefficient | MS s.e. | MI− coefficient | MI− s.e. | MI+ coefficient | MI+ s.e. |
|---|---|---|---|---|---|---|---|---|
| Grade | 1.05 | 0.14 | 1.05 | 0.10 | 0.95 | 0.10 | 0.97 | 0.10 |
| Nodal status | 1.17 | 0.13 | 0.97 | 0.09 | 0.97 | 0.10 | 1.10 | 0.10 |
| Tumour size | 0.41 | 0.05 | 0.45 | 0.03 | 0.40 | 0.04 | 0.43 | 0.04 |
| ER status | −0.89 | 0.15 | −0.93 | 0.11 | −0.83 | 0.11 | −0.80 | 0.11 |
| PR status | −1.13 | 0.16 | −1.00 | 0.12 | −0.94 | 0.12 | −1.00 | 0.13 |
| HER2 status | 0.27 | 0.07 | 0.27 | 0.05 | 0.23 | 0.05 | 0.26 | 0.05 |
| BCL2 status | −0.26 | 0.07 | −0.20 | 0.06 | −0.17 | 0.06 | −0.20 | 0.06 |
| Time effect | | | | | | | | |
| Nodal status | −0.22 | 0.08 | −0.13 | 0.06 | −0.19 | 0.06 | −0.19 | 0.06 |
| Grade | −0.40 | 0.08 | −0.37 | 0.06 | −0.34 | 0.06 | −0.33 | 0.06 |
| ER status | 0.71 | 0.10 | 0.68 | 0.07 | 0.63 | 0.08 | 0.63 | 0.08 |
| PR status | 0.55 | 0.10 | 0.49 | 0.08 | 0.46 | 0.08 | 0.47 | 0.08 |
Abbreviations: BCL2=B-cell lymphoma 2; CCA=complete case analysis; ER=oestrogen receptor; HER2=human epidermal growth factor receptor-2; MI−=multiple imputation without the outcome; MI+=multiple imputation with the outcome; MS=mean substitution; PR=progesterone receptor.
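Because the table reports log hazard ratios, it can help to see one converted to the hazard ratio scale. The sketch below exponentiates a coefficient and builds an approximate 95% interval as exp(coefficient ± 1.96 × s.e.). The normal-approximation interval is an assumption made here for illustration and is not stated in the record; the function name is hypothetical.

```python
from math import exp


def hazard_ratio_ci(log_hr, se, z=1.96):
    """Return (hazard ratio, lower limit, upper limit) on the ratio scale.

    Assumes a normal approximation for the log hazard ratio, so the
    interval is exp(log_hr - z * se) to exp(log_hr + z * se).
    """
    return exp(log_hr), exp(log_hr - z * se), exp(log_hr + z * se)


# First coefficient / s.e. pair in the Grade row of the table above.
hr, lower, upper = hazard_ratio_ci(1.05, 0.14)
print(f"HR {hr:.2f} (approx. 95% CI {lower:.2f} to {upper:.2f})")
```

Applied to the other columns of the same row, the same conversion shows how a smaller standard error narrows the interval even when the coefficient itself barely changes.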
MD and MAD for each imputation method, averaged over 100 simulations
| | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| MD | | | | | | | | |
| Grade | | | − | −0.03 | −0.04 | | − | −0.03 |
| Nodal status | | 0.03 | | | − | 0.07 | | |
| Tumour size | −0.01 | −0.02 | | | | 0.02 | | 0.01 |
| ER status | 0.02 | | 0.05 | | | | 0.06 | 0.02 |
| PR status | −0.09 | −0.20 | | − | | −0.22 | | −0.05 |
| HER2 status | − | | −0.03 | | − | − | | − |
| BCL2 status | | | − | 0.03 | | | −0.06 | −0.01 |
| MAD | | | | | | | | |
| Grade | | 0.13 | | 0.14 | | | | 0.14 |
| Nodal status | | 0.12 | 0.18 | | | | | |
| Tumour size | | 0.09 | | 0.09 | | | 0.11 | 0.11 |
| ER status | | 0.17 | | 0.14 | | 0.20 | | 0.15 |
| PR status | | 0.24 | | | 0.29 | 0.26 | | |
| HER2 status | | 0.14 | | 0.14 | | 0.15 | | 0.16 |
| BCL2 status | | 0.14 | | 0.15 | | 0.19 | | 0.18 |
Abbreviations: BCL2=B-cell lymphoma 2; CCA=complete case analysis; ER=oestrogen receptor; HER2=human epidermal growth factor receptor-2; MAD=mean absolute difference; MAR=missing at random; MCAR=missing completely at random; MD=mean deviation; MI−=multiple imputation without the outcome; MI+=multiple imputation with the outcome; MS=mean substitution; PR=progesterone receptor.
Numbers in bold indicate method with best result for that variable. Underlined numbers indicate method with worst result for that variable.
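The two summary measures in this table can be sketched as follows, assuming that MD is the average signed difference between each simulated estimate and a reference value, and that MAD is the average of the absolute differences. These definitions are inferred from the abbreviations (mean deviation, mean absolute difference) rather than stated in the record, and the numbers below are hypothetical.

```python
from statistics import mean


def mean_deviation(estimates, reference):
    """Average signed difference from the reference (positive and negative errors cancel)."""
    return mean(e - reference for e in estimates)


def mean_absolute_difference(estimates, reference):
    """Average absolute difference from the reference (errors cannot cancel)."""
    return mean(abs(e - reference) for e in estimates)


# Hypothetical log hazard ratio estimates from four simulated datasets.
reference = 1.0
estimates = [0.9, 1.1, 1.2, 0.8]

md = mean_deviation(estimates, reference)             # close to 0: errors cancel
mad = mean_absolute_difference(estimates, reference)  # about 0.15
```

Under these assumed definitions, a method can show an MD near zero while still having a large MAD, which is why the two measures are reported side by side.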
Figure 1. Confidence limits for multivariate log hazard ratio estimates for each prognostic variable using four approaches to handling missing data in 100 datasets with missing data simulated as missing at random (MAR). CCA=complete case analysis; MS=mean substitution; MI−=multiple imputation without the outcome; MI+=multiple imputation with the outcome. The horizontal lines represent the true estimates.
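To make the simulation idea concrete, the sketch below masks values in a toy covariate under two mechanisms: missing completely at random (MCAR), where every value has the same chance of being removed, and missing at random (MAR), where the chance depends on another, fully observed variable. The record does not describe how the study generated its missing data, so the functions, rates and variable names here are hypothetical.

```python
import random


def mask_mcar(values, rate, rng):
    """MCAR: each value is set to None with the same probability `rate`."""
    return [None if rng.random() < rate else v for v in values]


def mask_mar(values, driver, rate_if_0, rate_if_1, rng):
    """MAR: the probability of masking depends on an observed 0/1 `driver`."""
    return [
        None if rng.random() < (rate_if_1 if d == 1 else rate_if_0) else v
        for v, d in zip(values, driver)
    ]


rng = random.Random(0)  # fixed seed so the toy run is reproducible

marker = [1, 0, 1, 1, 0, 1, 0, 0]         # hypothetical covariate to mask
node_positive = [1, 1, 0, 0, 1, 0, 1, 0]  # hypothetical fully observed driver

mcar = mask_mcar(marker, 0.3, rng)
mar = mask_mar(marker, node_positive, 0.1, 0.5, rng)
```

Repeating such a masking step many times on a complete dataset, then refitting the model with each method, gives the kind of spread of estimates that the figure summarises.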