Mustapha Abubakar, William J Howat, Frances Daley, Lila Zabaglo, Leigh-Anne McDuffus, Fiona Blows, Penny Coulson, H Raza Ali, Javier Benitez, Roger Milne, Herman Brenner, Christa Stegmaier, Arto Mannermaa, Jenny Chang-Claude, Anja Rudolph, Peter Sinn, Fergus J Couch, Rob A E M Tollenaar, Peter Devilee, Jonine Figueroa, Mark E Sherman, Jolanta Lissowska, Stephen Hewitt, Diana Eccles, Maartje J Hooning, Antoinette Hollestelle, John Wm Martens, Carolien Hm van Deurzen, Manjeet K Bolla, Qin Wang, Michael Jones, Minouk Schoemaker, Annegien Broeks, Flora E van Leeuwen, Laura Van't Veer, Anthony J Swerdlow, Nick Orr, Mitch Dowsett, Douglas Easton, Marjanka K Schmidt, Paul D Pharoah, Montserrat Garcia-Closas.
Abstract
Automated methods are needed to facilitate high-throughput and reproducible scoring of Ki67 and other markers in breast cancer tissue microarrays (TMAs) in large-scale studies. To address this need, we developed an automated protocol for Ki67 scoring and evaluated its performance in studies from the Breast Cancer Association Consortium. We utilized 166 TMAs containing 16,953 tumour cores representing 9,059 breast cancer cases, from 13 studies, with information on other clinical and pathological characteristics. TMAs were stained for Ki67 using standard immunohistochemical procedures, and scanned and digitized using the Ariol system. An automated algorithm was developed for the scoring of Ki67, and scores were compared to computer-assisted visual (CAV) scores in a subset of 15 TMAs in a training set. We also assessed the correlation between automated Ki67 scores and other clinical and pathological characteristics. Overall, we observed good discriminatory accuracy (AUC = 85%) and good agreement (kappa = 0.64) between the automated and CAV scoring methods in the training set. The performance of the automated method varied by TMA (kappa range = 0.37-0.87) and study (kappa range = 0.39-0.69). The automated method performed better in satisfactory cores (kappa = 0.68) than in suboptimal cores (kappa = 0.51; p-value for comparison = 0.005), and among cores with higher total nuclei counted by the machine (4,000-4,500 cells: kappa = 0.78) than among those with lower counts (50-500 cells: kappa = 0.41; p-value = 0.010). Among the 9,059 cases in this study, the correlations between automated Ki67 and clinical and pathological characteristics were found to be in the expected directions. Our findings indicate that automated scoring of Ki67 can be an efficient method to obtain good quality data across large numbers of TMAs from multicentre studies. However, robust algorithm development and rigorous pre- and post-analytical quality control procedures are necessary in order to ensure satisfactory performance.
Keywords: Ki67; automated algorithm; breast cancer; immunohistochemistry; tissue microarrays
Year: 2016 PMID: 27499923 PMCID: PMC4958735 DOI: 10.1002/cjp2.42
Source DB: PubMed Journal: J Pathol Clin Res ISSN: 2056-4538
Description of the source populations, numbers of cases and designs of TMAs used in this study
| Study acronym | Country | Cases (N) | Age at diagnosis mean (range) | TMAs | Cores per case | Cores per TMA | Core size (mm) | Total cores per study |
|---|---|---|---|---|---|---|---|---|
| ABCS | Netherlands | 892 | 43 (19–50) | 24 | 1–6 | 15–328 | 0.6 | 2,449 |
| CNIO | Spain | 164 | 60 (35–81) | 4 | 1–2 | 80–133 | 1.0 | 316 |
| ESTHER | Germany | 258 | 62 (50–75) | 6 | 1–2 | 78–91 | 0.6 | 461 |
| KBCP | Finland | 276 | 59 (30–92) | 12 | 1–3 | 63–94 | 1.0 | 724 |
| MARIE | Germany | 808 | 62 (50–75) | 27 | 1–5 | 32–92 | 0.6 | 1,490 |
| MCBCS | USA | 491 | 58 (22–87) | 7 | 1–8 | 131–301 | 0.6 | 1,630 |
| ORIGO | Netherlands | 383 | 53 (22–87) | 9 | 1–9 | 67–223 | 0.6 | 991 |
| PBCS | Poland | 1,236 | 56 (27–75) | 22 | 1–2 | 66–145 | 1.0 | 2,358 |
| POSH | UK | 73 | 36 (27–41) | 5 | 1–5 | 75–114 | 0.6 | 194 |
| RBCS | Netherlands | 234 | 45 (25–84) | 6 | 1–5 | 134–199 | 0.6 | 642 |
| SEARCH | UK | 3,528 | 52 (24–70) | 24 | 1–3 | 120–167 | 0.6 | 4,037 |
| UKBGS | UK | 367 | 56 (24–84) | 14 | 1–4 | 62–114 | 1.0 | 1,130 |
| kConFab | Australia | 349 | 45 (20–77) | 6 | 1–2 | 65–114 | 0.6 | 531 |
| Totals | | 9,059 | 56 (19–92) | 166 | 1–9 | 15–328 | 0.6–1.0 | 16,953 |
Figure 1. Study design. Of the 166 TMAs, 15 were selected as the training set and were used to develop an algorithm that was applied to the scoring of all 166 TMAs, containing 16,953 tissue cores. The agreement between automated and visual scores was determined for the TMAs in the training set. Furthermore, a subset of the TMAs (N = 22) had pathologists' semi‐quantitative Ki67 scores; automated scores from these were compared with the pathologists' scores and the agreement between the two was also determined. In the next stage of the study, scores derived using the automated method were combined with information on other clinical and pathological characteristics for all subjects in the study (N = 9,059). The distribution of Ki67 scores across categories and its association with pathological characteristics were then determined.
Figure 2. Schematic representation of the stages involved in the development of a centralised scoring protocol. Of the 166 TMAs, 15 were randomly selected as the training set. Two protocols were developed and adopted for scoring: a computer‐assisted visual (CAV) protocol and an automated scoring protocol. Using the CAV protocol, a grid was used to demarcate each core, positive and negative nuclei were counted in at least six well‐delineated areas of the core, and the average score was obtained (right hand panel: (A) tumour core; (B) demarcation into regions by a grid; (C) counting of positive and negative nuclei within the squares). For the automated scoring protocol (Stage 1), 15 TMA‐specific classifiers were tuned and used for scoring (left hand panel: (D) region of interest; (E) colour detection of DAB/positive nuclei; (F) colour detection of haematoxylin/negative nuclei; (G) combined detection of positive and negative nuclei). In the next stage (Stage 2), one classifier was selected, tuned further, and used to score all 15 TMAs. Agreement with the CAV protocol was further tested and the impact of quality control on the performance of this classifier was then assessed (Stage 3). In the final stage (Stage 4), this classifier was applied to the scoring of all 166 TMAs in this study.
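The CAV counting step described in the caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the caption does not say whether the "average score" is the mean of per-square percentages or a pooled percentage, so the sketch assumes the former, and the function names and nuclei counts are hypothetical.

```python
from statistics import mean


def region_ki67_percent(positive: int, negative: int) -> float:
    """Percentage of Ki67-positive nuclei in one grid square."""
    total = positive + negative
    if total == 0:
        raise ValueError("region contains no counted nuclei")
    return 100.0 * positive / total


def cav_core_score(regions, min_regions: int = 6) -> float:
    """Average the per-region percentages for one core.

    `regions` is a list of (positive, negative) nuclei counts.
    Averaging per-region percentages is an assumption of this sketch.
    """
    if len(regions) < min_regions:
        raise ValueError(f"need at least {min_regions} regions, got {len(regions)}")
    return mean(region_ki67_percent(p, n) for p, n in regions)


# Hypothetical counts for six grid squares of a single core.
example_regions = [(12, 88), (20, 80), (5, 95), (15, 85), (10, 90), (28, 72)]
example_score = cav_core_score(example_regions)
```

Requiring at least six regions mirrors the "at least six well‐delineated areas" stated in the caption; cores with fewer countable squares raise an error here rather than being scored silently.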
Agreement parameters (observed agreement and kappa statistic) and discriminatory accuracy (AUC) parameters for visual and automated scores (derived using TMA‐specific and Universal classifiers) overall and for each of the 15 TMAs in the training set
| TMA Name | N | TMA‐specific classifier: AUC (95% CI) | TMA‐specific classifier: Observed agreement (95% CI) | TMA‐specific classifier: Kappa (95% CI) | Universal classifier: AUC (95% CI) | Universal classifier: Observed agreement (95% CI) | Universal classifier: Kappa (95% CI) |
|---|---|---|---|---|---|---|---|
| TMA 1 | 102 | 69 (59, 79) | 73 (64, 82) | 0.29 (0.21, 0.39) | 78 (69, 87) | 80 (71, 88) | 0.37 (0.28, 0.47) |
| TMA 2 | 89 | 93 (88, 99) | 82 (72, 89) | 0.57 (0.45, 0.67) | 91 (84, 97) | 90 (82, 95) | 0.75 (0.65, 0.84) |
| TMA 3 | 120 | 88 (82, 94) | 87 (79, 92) | 0.60 (0.51, 0.69) | 86 (80, 93) | 84 (75, 90) | 0.49 (0.40, 0.58) |
| TMA 4 | 154 | 87 (81, 92) | 91 (85, 95) | 0.71 (0.64, 0.78) | 83 (77, 90) | 87 (81, 92) | 0.58 (0.50, 0.66) |
| TMA 5 | 89 | 94 (88, 99) | 93 (86, 97) | 0.81 (0.71, 0.88) | 87 (80, 95) | 89 (82, 95) | 0.69 (0.58, 0.78) |
| TMA 6 | 74 | 91 (83, 98) | 89 (80, 95) | 0.60 (0.47, 0.71) | 80 (64, 96) | 84 (73, 91) | 0.44 (0.33, 0.57) |
| TMA 7 | 101 | 86 (79, 93) | 89 (81, 94) | 0.62 (0.52, 0.72) | 88 (81, 95) | 90 (83, 95) | 0.67 (0.57, 0.76) |
| TMA 8 | 104 | 96 (93, 100) | 84 (75, 90) | 0.59 (0.49, 0.68) | 91 (84, 97) | 80 (71, 87) | 0.37 (0.27, 0.47) |
| TMA 9 | 70 | 97 (95, 100) | 94 (86, 98) | 0.84 (0.74, 0.92) | 98 (95, 100) | 95 (86, 98) | 0.85 (0.75, 0.93) |
| TMA 10 | 70 | 90 (83, 98) | 93 (84, 98) | 0.79 (0.67, 0.87) | 94 (90, 99) | 96 (88, 99) | 0.87 (0.77, 0.94) |
| TMA 11 | 69 | 91 (84, 98) | 90 (80, 96) | 0.72 (0.60, 0.83) | 89 (81, 97) | 90 (80, 96) | 0.73 (0.62, 0.84) |
| TMA 12 | 86 | 90 (83, 96) | 85 (76, 92) | 0.35 (0.25, 0.46) | 91 (84, 97) | 88 (80, 94) | 0.47 (0.36, 0.58) |
| TMA 13 | 72 | 70 (58, 82) | 69 (57, 80) | 0.27 (0.17, 0.38) | 84 (72, 96) | 92 (83, 97) | 0.73 (0.62, 0.83) |
| TMA 14 | 75 | 87 (79, 95) | 75 (65, 85) | 0.40 (0.29, 0.52) | 85 (75, 94) | 87 (77, 93) | 0.64 (0.52, 0.75) |
| TMA 15 | 71 | 70 (57, 82) | 82 (71, 90) | 0.34 (0.23, 0.46) | 80 (70, 91) | 87 (77, 94) | 0.56 (0.44, 0.68) |
| Overall | 1,346 | 83 (81, 86) | 85 (83, 87) | 0.58 (0.55, 0.61) | 85 (83, 87) | 87 (86, 89) | 0.64 (0.61, 0.66) |
TMA‐specific classifiers are automated algorithms that were trained specifically for each individual TMA. The Universal classifier is a single automated algorithm tuned across the spectrum of TMAs in the training set and used for the scoring of all 15 TMAs. The Area Under the Curve (AUC) was determined by plotting a Receiver Operating Characteristic (ROC) curve of the continuous automated Ki67 score against categories of the visual scores, dichotomised at 10%, the most commonly reported cut‐off point in the literature (33).
The agreement and kappa statistics were determined by comparing quartiles (<25th, 25th–50th, >50th–75th and >75th percentiles) of both the visual and automated scores using weighted kappa statistics. N represents the number of cores on each TMA.
*The Universal classifier was adopted for use in the scoring of all TMAs (N = 166) in this study.
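The two statistics described in the table notes can be illustrated with a short Python sketch. It is not the authors' code. The source does not state which kappa weighting scheme was used, so linear weights are assumed here, and the function names and example scores are hypothetical.

```python
from statistics import quantiles


def roc_auc(scores, labels):
    """AUC as the probability that a positive case outranks a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def quartile_category(value, cuts):
    """Map a score to 0..3: <25th, 25th-50th, >50th-75th, >75th percentile."""
    q1, q2, q3 = cuts
    if value < q1:
        return 0
    if value <= q2:
        return 1
    if value <= q3:
        return 2
    return 3


def weighted_kappa(a, b, k=4):
    """Cohen's kappa with linear agreement weights (an assumption of this sketch)."""
    n = len(a)
    observed = [[0.0] * k for _ in range(k)]
    for i, j in zip(a, b):
        observed[i][j] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    cells = [(i, j) for i in range(k) for j in range(k)]
    po = sum((1 - abs(i - j) / (k - 1)) * observed[i][j] for i, j in cells)
    pe = sum((1 - abs(i - j) / (k - 1)) * row[i] * col[j] for i, j in cells)
    return (po - pe) / (1 - pe)


# Hypothetical Ki67 scores (%) for eight cores.
visual = [2, 5, 8, 12, 15, 20, 30, 45]
automated = [3, 9, 11, 4, 18, 22, 28, 50]

# Discrimination: continuous automated score vs visual score dichotomised at 10%.
auc = roc_auc(automated, [v > 10 for v in visual])

# Agreement: quartile categories of each method compared with weighted kappa.
visual_q = [quartile_category(v, quantiles(visual, n=4)) for v in visual]
automated_q = [quartile_category(v, quantiles(automated, n=4)) for v in automated]
kappa = weighted_kappa(visual_q, automated_q)
```

Each method is binned on its own quartile cut points, so the kappa reflects agreement in ranking rather than agreement in absolute Ki67 percentage.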
Figure 3. Graphs comparing the ROC curves for the discriminatory accuracy of the automated continuous Ki67 scores against categories of the visual score by classifier type (TMA‐specific and universal) among representative TMAs. In TMA 1, the universal classifier showed better discrimination than the TMA‐specific classifier; in TMA 6, the TMA‐specific classifier showed better discrimination; in TMA 9, no difference was observed between the two classifier types. Overall, both classifiers showed similar discriminatory accuracy.
Agreement (observed agreement, kappa statistic) and discriminatory accuracy (AUC) parameters for the automated and visual scores according to quality control status (satisfactory, N = 950 and suboptimal, N = 396) overall and among the 15 TMAs in the training set
| TMA Name | Satisfactory QC: N | Satisfactory QC: AUC (95% CI) | Satisfactory QC: Observed agreement (95% CI) | Satisfactory QC: Kappa (95% CI) | Suboptimal QC: N | Suboptimal QC: AUC (95% CI) | Suboptimal QC: Observed agreement (95% CI) | Suboptimal QC: Kappa (95% CI) |
|---|---|---|---|---|---|---|---|---|
| TMA 1 | 65 | 82 (71, 92) | 78 (67, 88) | 0.31 (0.20, 0.43) | 37 | 79 (64, 94) | 84 (68, 94) | 0.42 (0.25, 0.58) |
| TMA 2 | 63 | 93 (85, 100) | 91 (82, 97) | 0.78 (0.66, 0.87) | 26 | 88 (74, 100) | 86 (65, 96) | 0.61 (0.41, 0.79) |
| TMA 3 | 73 | 92 (86, 98) | 87 (76, 93) | 0.61 (0.50, 0.73) | 47 | 82 (69, 95) | 79 (64, 89) | 0.28 (0.17, 0.44) |
| TMA 4 | 98 | 86 (79, 93) | 90 (83, 96) | 0.69 (0.59, 0.78) | 56 | 80 (67, 93) | 82 (70, 91) | 0.34 (0.25, 0.81) |
| TMA 5 | 76 | 91 (84, 97) | 90 (80, 95) | 0.70 (0.60, 0.81) | 13 | 69 (37, 100) | 89 (64, 100) | 0.51 (0.60, 0.81) |
| TMA 6 | 61 | 89 (77, 100) | 85 (74, 93) | 0.49 (0.36, 0.62) | 13 | 58 (14, 100) | 77 (46, 95) | 0.19 (0.10, 0.54) |
| TMA 7 | 84 | 88 (81, 95) | 91 (82, 96) | 0.69 (0.58, 0.79) | 17 | 79 (48, 100) | 88 (64, 99) | 0.57 (0.33, 0.81) |
| TMA 8 | 87 | 89 (81, 97) | 80 (71, 88) | 0.38 (0.28, 0.49) | 17 | 99 (95, 100) | 78 (50, 93) | 0.31 (0.10, 0.56) |
| TMA 9 | 44 | 100 (99, 100) | 95 (85, 99) | 0.85 (0.70, 0.93) | 26 | 96 (91, 100) | 95 (80, 100) | 0.79 (0.61, 0.93) |
| TMA 10 | 48 | 98 (95, 100) | 96 (86, 99) | 0.88 (0.75, 0.95) | 22 | 82 (63, 100) | 95 (77, 100) | 0.82 (0.60, 0.95) |
| TMA 11 | 48 | 92 (84, 99) | 93 (83, 99) | 0.81 (0.67, 0.91) | 21 | 91 (79, 100) | 85 (64, 97) | 0.54 (0.30, 0.74) |
| TMA 12 | 53 | 93 (86, 100) | 89 (77, 96) | 0.55 (0.40, 0.68) | 33 | 83 (65, 100) | 87 (72, 97) | 0.30 (0.16, 0.48) |
| TMA 13 | 45 | 86 (73, 99) | 89 (76, 96) | 0.68 (0.51, 0.80) | 27 | 97 (91, 100) | 96 (81, 100) | 0.85 (0.66, 0.95) |
| TMA 14 | 55 | 89 (78, 100) | 91 (80, 97) | 0.75 (0.61, 0.85) | 20 | 69 (44, 93) | 76 (51, 91) | 0.27 (0.11, 0.54) |
| TMA 15 | 50 | 91 (82, 100) | 90 (78, 97) | 0.71 (0.58, 0.84) | 21 | 49 (20, 78) | 78 (53, 92) | 0.03 (0.01, 0.23) |
| Overall | 950 | 86 (84, 89) | 89 (86, 91) | 0.68 (0.65, 0.71) | 396 | 82 (78, 86) | 85 (81, 88) | 0.51 (0.46, 0.56) |
Suboptimal QC cores were those that did not meet the criteria to be considered satisfactory but were still suitable for scoring, eg, cores with few tumour cells (50–500 cells), partially folded cores, staining artefact or suboptimal/poor fixation. N represents the number of cores on each TMA classified as being of either satisfactory or suboptimal QC.
Figure 4. Graphs comparing the ROC curves for the discriminatory accuracy of the automated continuous scores against categories of the visual score by QC status among representative TMAs. The discriminatory accuracy was better among cores with satisfactory QC, overall and in TMAs 1 and 15. This difference was, however, less obvious in TMA 9 than in TMAs 1 and 15.
Agreement (observed agreement, kappa statistics) and discriminatory accuracy (AUC) parameters for automated and visual scores according to categories of the total nuclei counted by the machine among the 15 TMAs in the training set (N = 1,346)
| Total nuclei count | N | AUC (95% CI) | Observed agreement (95% CI) | Kappa (95% CI) |
|---|---|---|---|---|
| 50–500 | 151 | 80 (73, 87) | 78 (71, 84) | 0.41 (0.33, 0.49) |
| >500–1,000 | 227 | 80 (74, 86) | 86 (81, 91) | 0.57 (0.51, 0.64) |
| >1,000–1,500 | 207 | 85 (80, 90) | 87 (82, 91) | 0.61 (0.54, 0.68) |
| >1,500–2,000 | 172 | 90 (85, 95) | 90 (85, 94) | 0.72 (0.65, 0.79) |
| >2,000–2,500 | 106 | 88 (82, 95) | 91 (83, 95) | 0.72 (0.62, 0.80) |
| >2,500–3,000 | 87 | 82 (72, 92) | 89 (81, 95) | 0.67 (0.56, 0.76) |
| >3,000–3,500 | 90 | 88 (81, 95) | 88 (79, 94) | 0.67 (0.57, 0.77) |
| >3,500–4,000 | 74 | 92 (86, 98) | 93 (85, 98) | 0.77 (0.66, 0.86) |
| >4,000–4,500 | 56 | 91 (83, 99) | 92 (80, 97) | 0.78 (0.66, 0.88) |
| > 4,500 | 176 | 90 (85, 95) | 88 (82, 92) | 0.68 (0.61, 0.75) |
N.B.: Evidence for a strongly positive linear relationship between mean total nuclei count and agreement parameters was observed [kappa (r = 0.85, p‐value = 0.004); observed agreement (r = 0.80, p‐value = 0.01); AUC (r = 0.79, p‐value = 0.01)]. N represents the number of cores for each category of total nuclei count.
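The kind of correlation reported in this note can be sketched as a plain Pearson coefficient. One caveat matters: the note correlates agreement with the mean total nuclei count per category, which is not reported in the table, so this sketch substitutes the ordinal rank of each category (an assumption). It therefore illustrates the calculation only and is not expected to reproduce r = 0.85.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Kappa point estimates from the table above, lowest to highest count category.
kappa_by_category = [0.41, 0.57, 0.61, 0.72, 0.72, 0.67, 0.67, 0.77, 0.78, 0.68]

# Stand-in for the unreported mean nuclei counts: category rank 1..10 (assumption).
category_rank = list(range(1, len(kappa_by_category) + 1))

r_rank_kappa = pearson_r(category_rank, kappa_by_category)
```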
Figure 5. Distribution of Ki67 scores by method of scoring. Ki67 scores for the Computer‐Assisted Visual (CAV) and automated (TMA‐specific and Universal classifier) methods for each of the 15 TMAs in the training set and overall. The TMA‐specific classifier yielded higher Ki67 scores in all but two TMAs, ie, TMAs 2 and 8 (red arrows).
Subject level AUC and kappa agreement between automated Ki67 and visually derived scores for a subset of the participating studies for which visual scores were available (N = 1,849)
| Study | Cases (N) | AUC (95% CI) | Observed agreement (95% CI) | Kappa (95% CI) |
|---|---|---|---|---|
| ABCS | 215 | 86 (79, 94) | 87 (82, 87) | 0.52 (0.45, 0.59) |
| CNIO | 154 | 87 (78, 97) | 79 (72, 85) | 0.39 (0.32, 0.47) |
| ESTHER | 244 | 95 (93, 98) | 92 (88, 95) | 0.69 (0.62, 0.74) |
| PBCS | 1,236 | 88 (87, 91) | 89 (87, 91) | 0.50 (0.47, 0.52) |
| TMAs in training set* | | | | |
| Yes | 613 | 90 (86, 93) | 87 (84, 90) | 0.54 (0.50, 0.58) |
| No | 1,236 | 89 (87, 91) | 89 (87, 91) | 0.50 (0.47, 0.52) |
| Overall | 1,849 | 90 (88, 91) | 88 (87, 90) | 0.65 (0.63, 0.67) |
Semi‐quantitative categories of visual scores were used to determine kappa agreement. AUC was determined using continuous automated scores and dichotomous categories of visual scores.
*Agreement analyses were stratified by whether or not a study had TMAs in the training set. ABCS, CNIO and ESTHER all had TMAs in the training set while PBCS did not have TMAs in the training set.
Odds ratio and 95% CI for the association between clinical and pathological characteristics of breast cancer with categories of Ki67 (≤10% vs. >10%) among 9,059 patients
| Characteristic | Cases (N) | OR (95% CI)* | p‐value |
|---|---|---|---|
| Age at diagnosis (years) | | | |
| <35 | 328 | 1.00 (Referent) | |
| 35–50 | 3,043 | 0.64 (0.50–0.83) | 1.00E‐03 |
| >50–65 | 4,064 | 0.55 (0.43–0.72) | 4.79E‐06 |
| >65 | 1,414 | 0.60 (0.45–0.80) | 2.43E‐04 |
| Grade | | | |
| Low grade | 1,696 | 1.00 (Referent) | |
| Intermediate grade | 3,684 | 1.69 (1.45–1.97) | 4.71E‐12 |
| High grade | 2,552 | 4.18 (3.57–4.89) | 3.57E‐72 |
| Stage | | | |
| I | 3,214 | 1.00 (Referent) | |
| II | 3,534 | 1.15 (1.03–1.27) | 1.00E‐02 |
| III | 473 | 1.41 (1.13–1.28) | 2.00E‐03 |
| IV | 97 | 1.77 (1.15–2.72) | 9.00E‐03 |
| Morphology | | | |
| Ductal | 4,315 | 1.00 (Referent) | |
| Lobular | 860 | 0.36 (0.29–0.43) | 1.98E‐25 |
| Other | 648 | 0.68 (0.56–0.82) | 4.62E‐05 |
| Tumour size | | | |
| <2 cm | 4,492 | 1.00 (Referent) | |
| 2–4.9 cm | 2,565 | 1.31 (1.17–1.46) | 6.64E‐07 |
| >5 cm | 244 | 1.29 (0.96–1.72) | 8.60E‐02 |
| | | | |
| Negative | 4,758 | 1.00 (Referent) | |
| Positive | 3,168 | 1.11 (1.00–1.23) | 4.00E‐02 |
| | | | |
| Negative | 2,222 | 1.00 (Referent) | |
| Positive | 6,128 | 0.42 (0.38–0.47) | 1.09E‐55 |
| | | | |
| Negative | 2,853 | 1.00 (Referent) | |
| Positive | 4,919 | 0.51 (0.46–0.56) | 1.68E‐36 |
| | | | |
| Negative | 5,379 | 1.00 (Referent) | |
| Positive | 1,060 | 1.61 (1.40–1.85) | 1.30E‐11 |
| | | | |
| Negative | 2,407 | 1.00 (Referent) | |
| Positive | 356 | 3.08 (2.40–3.95) | 4.61E‐19 |
| | | | |
| Negative | 4,184 | 1.00 (Referent) | |
| Positive | 623 | 1.73 (1.45–2.07) | 5.69E‐10 |
All variables were modelled separately and each model was adjusted for age at diagnosis and study group. Other morphology includes all other histological subtypes of breast cancer that are neither invasive ductal (NOS) nor invasive lobular.
*OR refers to the odds of high Ki67 expression (>10%) in each category of a clinico‐pathological characteristic relative to the referent category.
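For readers unfamiliar with the quantity tabulated above, the following Python sketch computes a crude odds ratio with a Wald 95% confidence interval from a 2×2 table. This is a simplification: the estimates in the table come from models adjusted for age at diagnosis and study group, which a crude calculation does not reproduce. The function name and the counts are hypothetical.

```python
import math


def odds_ratio_wald(a, b, c, d, z=1.96):
    """Crude odds ratio and Wald confidence interval from a 2x2 table.

    a: category of interest, high Ki67 (>10%)
    b: category of interest, low Ki67 (<=10%)
    c: referent category, high Ki67
    d: referent category, low Ki67
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, not taken from the study.
or_estimate, ci_lower, ci_upper = odds_ratio_wald(60, 40, 30, 70)
```

An interval that excludes 1.00 corresponds to a two‐sided p‐value below 0.05 under the same approximation, which is how the OR and p‐value columns in the table relate to each other.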