Bálint Cserni1, Rita Bori2, Erika Csörgő2, Orsolya Oláh-Németh3, Tamás Pancsa3, Anita Sejben3, István Sejben2, András Vörös3, Tamás Zombori3, Tibor Nyári4, Gábor Cserni5,6.
Abstract
The reproducibility of assessing potential biomarkers is crucial for their implementation. ONEST (Observers Needed to Evaluate Subjective Tests) has been recently introduced as a new additive evaluation method for the assessment of reliability, by demonstrating how the number of observers impact on interobserver agreement. Oestrogen receptor (ER), progesterone receptor (PR), and Ki67 proliferation marker immunohistochemical stainings were assessed on 50 core needle biopsy and 50 excision samples from breast cancers by 9 pathologists according to daily practice. ER and PR statuses based on the percentages of stained nuclei were the most consistently assessed parameters (intraclass correlation coefficients, ICC 0.918-0.996), whereas Ki67 with 5 different theoretical or St Gallen Consensus Conference-proposed cut-off values demonstrated moderate to good reproducibility (ICC: 0.625-0.760). ONEST highlighted that consistent tests like ER and PR assessment needed only 2 or 3 observers for optimal evaluation of reproducibility, and the width between plots of the best and worst overall percent agreement values for 100 randomly selected permutations of observers was narrow. In contrast, with less consistently evaluated tests of Ki67 categorization, ONEST suggested at least 5 observers required for more trustful assessment of reliability, and the bandwidth of the best and worst plots was wider (up to 34% difference between two observers). ONEST has additional value to traditional calculations of the interobserver agreement by not only highlighting the number of observers needed to trustfully evaluate reproducibility but also by highlighting the rate of agreement with an increasing number of observers and disagreement between the better and worse ratings.Entities:
Keywords: Breast cancer; Ki67; ONEST; Oestrogen receptor; Progesterone receptor; Reproducibility
Year: 2021 PMID: 34415429 PMCID: PMC8724065 DOI: 10.1007/s00428-021-03172-9
Source DB: PubMed Journal: Virchows Arch ISSN: 0945-6317 Impact factor: 4.064
Fig. 1 ONEST plots of ER (A), PR (B), and Ki67 (C) classifications into < 1%, 1–10%, and > 10% categories on CNB, with all 100 random permutations of pathologists (A-B-C 1) and only the best and worst OPA values (A-B-C 2). Note that C2 best demonstrates how, with an increasing number of pathologists, the OPA decreases until reaching a plateau at 4 pathologists. The classification can be characterized by the distance between the minimum and maximum OPA with 2 pathologists (0.94 − 0.76 = 0.18), the number of pathologists required to reach the plateau (4), the approximate value of the plateau (0.64), and the OPA with all pathologists (0.62). Categorizations with good reproducibility have a narrow gap (bandwidth) between the maximum and minimum values, reach the plateau with few pathologists, and have a high OPA with all pathologists (A1, A2). While A1, B1, and C1 each show 100 OPA curves (OPAC), A2, B2, and C2 show the minimum and maximum OPA values, which do not necessarily overlap with an OPAC from the 100 permutations but obviously overlap with an OPAC from all permutations. The worst scenario, i.e., the minimum OPA values, was selected to characterize the categorizations
ICC (95% confidence interval, CI) values for the investigated categories
ER oestrogen receptor, PR progesterone receptor, QS quick score or Allred score; intensity refers to average intensity scores; (%) refers to the recorded percentage values, with each different value representing a different category; 3 categories refers to the < 1%, 1–10%, and > 10% categorization; St Gallen (year) refers to the categories of low/(intermediate)/high Ki67 labelling as defined by the St Gallen Consensus Conference of the given year (see “Methods” section). The greyscale reflects the categorization of the level of reliability into excellent (ICC > 0.9), good to excellent, good (ICC 0.75–0.9), moderate to good, and moderate (ICC 0.5–0.75), from white to deeper shades of grey; the 95% CIs are taken into account for the categorization [25]
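The greyscale categorization described above amounts to a mapping from the 95% CI bounds of an ICC to a reliability label, following the Koo and Li convention cited as [25]. The function below is a hedged sketch of one such reading; the function name and the exact band boundaries at the cut points are my own assumptions:

```python
def reliability_label(ci_low, ci_high):
    """Label an ICC's reliability from its 95% CI bounds
    (poor < 0.5 <= moderate < 0.75 <= good < 0.9 <= excellent)."""
    def band(x):
        if x < 0.5:
            return "poor"
        if x < 0.75:
            return "moderate"
        if x < 0.9:
            return "good"
        return "excellent"
    lo, hi = band(ci_low), band(ci_high)
    # A CI spanning two bands yields a combined label such as "moderate to good".
    return lo if lo == hi else f"{lo} to {hi}"
```

For example, the ER/PR range reported in the abstract (0.918-0.996) falls entirely in the excellent band, while the Ki67 range (0.625-0.760) straddles the moderate and good bands.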
Main results of the ONEST analyses of different parameters
| Parameter | Maximum OPA difference | Pathologists needed for plateau | OPA with 9 pathologists |
|---|---|---|---|
| ER categories (< 1%, 1–10%, > 10%) CNB | 0.04 | 2 | 0.96 |
| ER categories (< 1%, 1–10%, > 10%) EXC | 0.02 | 2 | 0.98 |
| ER intensity CNB | 0.32 | 5 | 0.48 |
| ER intensity EXC | 0.36 | 4 | 0.38 |
| ER Allred scores (0,2; 3–4; 5–6; 7–8) CNB | 0.12 | 4 | 0.72 |
| ER Allred scores (0,2; 3–4; 5–6; 7–8) EXC | 0.10 | 2 | 0.90 |
| PR categories (< 1%, 1–10%, > 10%) CNB | 0.12 | 3 | 0.82 |
| PR categories (< 1%, 1–10%, > 10%) EXC | 0.18 | 3 | 0.76 |
| PR intensity CNB | 0.36 | 4 | 0.38 |
| PR intensity EXC | 0.42 | 4 | 0.36 |
| PR Allred scores (0,2; 3–4; 5–6; 7–8) CNB | 0.22 | 5 | 0.48 |
| PR Allred scores (0,2; 3–4; 5–6; 7–8) EXC | 0.20 | 3 | 0.58 |
| Ki67 categories (< 1%, 1–10%, > 10%) CNB | 0.18 | 4 | 0.62 |
| Ki67 categories (< 1%, 1–10%, > 10%) EXC | 0.26 | 4 | 0.44 |
| Ki67 St Gallen 2009 CNB | 0.30 | 4 | 0.32 |
| Ki67 St Gallen 2009 EXC | 0.28 | 4 | 0.38 |
| Ki67 St Gallen 2011 CNB | 0.18 | 5 | 0.60 |
| Ki67 St Gallen 2011 EXC | 0.24 | 4 | 0.50 |
| Ki67 St Gallen 2013 CNB | 0.22 | 5 | 0.52 |
| Ki67 St Gallen 2013 EXC | 0.26 | 5 | 0.54 |
| Ki67 St Gallen 2015 CNB | 0.30 | 4 | 0.32 |
| Ki67 St Gallen 2015 EXC | 0.34 | 5 | 0.26 |
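The three summary statistics tabulated above (maximum OPA difference with 2 pathologists, number of pathologists needed to reach the plateau, and OPA with all 9 pathologists) can be read off a set of ONEST curves. A hedged sketch with illustrative toy curves; the helper name and the plateau tolerance are my own assumptions, not the paper's definition:

```python
def onest_summary(curves, tol=0.02):
    """Summarize ONEST curves: bandwidth at 2 observers, plateau
    position, and OPA with all observers.
    curves[i][k - 2] is the OPA of the first k observers in permutation i."""
    first = [c[0] for c in curves]          # OPA values with 2 observers
    bandwidth = max(first) - min(first)     # maximum OPA difference
    final = curves[0][-1]                   # OPA with all observers (identical across curves)
    # Plateau: smallest observer count whose minimum OPA is within tol of the final value.
    mins = [min(c[k] for c in curves) for k in range(len(curves[0]))]
    plateau = next(k + 2 for k, m in enumerate(mins) if m - final <= tol)
    return bandwidth, plateau, final

# Two toy curves for 9 observers (8 points each, for 2..9 observers).
curves = [
    [0.90, 0.80, 0.70, 0.62, 0.62, 0.62, 0.62, 0.62],
    [0.76, 0.70, 0.63, 0.62, 0.62, 0.62, 0.62, 0.62],
]
bandwidth, plateau, final = onest_summary(curves)
```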
Fig. 2 Comparison of OPAs derived from 100 and from all permutations of pathologists for Ki67 categorization according to the St Gallen 2013 recommendation. MIN: minimum, MAX: maximum, AVE: average; (100): for the 100 permutations, (All): for all 9! permutations. MIN(All) and MAX(All) represent the worst and best OPAC, whereas the MIN(100) and MAX(100) curves lie on the worst and best OPA values and do not necessarily represent an OPAC. The AVE values are simply the averages of the 100 or 9! OPA values belonging to the number of pathologists on the x-axis. The MAX curves overlap completely, the AVE curves virtually overlap completely, and the MIN(100) and MIN(All) curves deviate slightly, but the differences are not significant (p = 0.64; Kruskal–Wallis test)