Mohamed Amgad, Lamees A Atteya, Hagar Hussein, Kareem Hosny Mohammed, Ehab Hafiz, Maha A T Elsebaie, Ahmed M Alhusseiny, Mohamed Atef AlMoslemany, Abdelmagid M Elmatboly, Philip A Pappalardo, Rokia Adel Sakr, Pooya Mobadersany, Ahmad Rachid, Anas M Saad, Ahmad M Alkashash, Inas A Ruhban, Anas Alrefai, Nada M Elgazar, Ali Abdulkarim, Abo-Alela Farag, Amira Etman, Ahmed G Elsaeed, Yahya Alagha, Yomna A Amer, Ahmed M Raslan, Menatalla K Nadim, Mai A T Elsebaie, Ahmed Ayad, Liza E Hanna, Ahmed Gadallah, Mohamed Elkady, Bradley Drumheller, David Jaye, David Manthey, David A Gutman, Habiba Elfandy, Lee A D Cooper.
Abstract
BACKGROUND: Deep learning enables accurate high-resolution mapping of cells and tissue structures that can serve as the foundation of interpretable machine-learning models for computational pathology. However, generating adequate labels for these structures is a critical barrier, given the time and effort required from pathologists.
Keywords: breast cancer; crowdsourcing; deep learning; nucleus classification; nucleus segmentation
Year: 2022 PMID: 35579553 PMCID: PMC9112766 DOI: 10.1093/gigascience/giac037
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 7.658
Figure 1: Dataset annotation and quality control procedure. A. Nucleus classes annotated. B. Annotation procedure and resulting datasets. Two approaches were used to obtain nucleus labels from non-pathologists (NPs). (Top) The first approach focused on breadth, collecting single-rater annotations over a large number of FOVs to obtain the majority of data in this study. NPs were given feedback on their annotations, and 2 study coordinators corrected and standardized all single-rater NP annotations on the basis of input from a senior pathologist. (Bottom) The second approach evaluated interrater reliability and agreement, obtaining annotations from multiple NPs for a smaller set of shared FOVs. Annotations were also obtained from pathologists for these FOVs to measure NP reliability. The procedure for inferring a single set of labels from multiple participants is described in Fig. 2. We distinguished between inferred non-pathologist labels (NP-labels) and inferred pathologist truth (P-truth) for clarity. Three multi-rater datasets were obtained: an Evaluation dataset, which is the primary multi-rater dataset, as well as Bootstrap and Unbiased experimental controls to measure the value of algorithmic suggestions. In all datasets except the Unbiased control, participants were shown algorithmic suggestions for nucleus boundaries and classes. They were directed to click nuclei with correct boundary suggestions and annotate other nuclei with bounding boxes. The pipeline to obtain algorithmic suggestions consisted of 2 steps: (i) using image processing to obtain bootstrapped suggestions (Bootstrap control); (ii) training a Mask R-CNN deep-learning model to refine the bootstrapped suggestions (single-rater and Evaluation datasets).
Figure 2: Inference from multi-rater datasets. The purpose of this step was to infer the nucleus locations and classifications from multi-rater data. A. The first step involved agglomerative hierarchical clustering of bounding boxes using intersection-over-union (IOU) as a similarity measure. We imposed a constraint during clustering that prevents merging annotations where a single participant has annotated overlapping nuclei. Participant intention was preserved by demoting annotations from the same participant to the next node (Step 5, arrow). After clustering was complete, a threshold IOU value was used to obtain the final clusters (Step 5, black nodes). Within each cluster, the medoid bounding box was chosen as an anchor proposal. The result was a set of anchors with corresponding clustered annotations. When a participant did not match to an anchor, it was considered a conscious decision not to annotate a nucleus at that location. B. Once anchors were obtained, an expectation-maximization procedure was used to estimate (i) which anchors represent actual nuclei and (ii) which classes to assign these anchors. The expectation-maximization procedure estimates and accounts for the reliability of each participant for each classification. Expectation-maximization was performed separately for NPs and pathologists. C. Grouping of nucleus classes. Consistent with standard practice in object detection, nuclei were grouped, on the basis of clinical reasoning, into 5 classes and 3 super-classes.
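The clustering step in panel A can be illustrated with a minimal Python sketch. It is a simplified stand-in, not the authors' implementation: it merges clusters greedily using average-linkage IOU (the linkage choice is an assumption), refuses any merge that would place two boxes from the same participant in one cluster, and stops at a threshold. The dendrogram "demotion" described in the caption is not reproduced. The function names, the `(xmin, ymin, xmax, ymax)` box format, and the sample data are hypothetical.

```python
from itertools import combinations

def iou(a, b):
    """Intersection-over-union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def cluster_boxes(annotations, threshold=0.25):
    """Greedy agglomerative clustering of (participant_id, box) annotations.

    Two clusters are merged only if their average pairwise IOU is at least
    `threshold` AND they share no participant (the clustering constraint).
    Returns a list of clusters, each a list of indices into `annotations`.
    """
    clusters = [[i] for i in range(len(annotations))]

    def similarity(c1, c2):
        # Average linkage: an assumption made for this sketch.
        total = sum(iou(annotations[i][1], annotations[j][1]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    def share_participant(c1, c2):
        return bool({annotations[i][0] for i in c1} & {annotations[j][0] for j in c2})

    while True:
        best = None
        for x, y in combinations(range(len(clusters)), 2):
            if share_participant(clusters[x], clusters[y]):
                continue  # keep one participant's overlapping nuclei separate
            s = similarity(clusters[x], clusters[y])
            if s >= threshold and (best is None or s > best[0]):
                best = (s, x, y)
        if best is None:
            break
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x is unaffected
    return clusters

def medoid(cluster, annotations):
    """Index of the box with the highest total IOU to the rest of its cluster."""
    return max(cluster, key=lambda i: sum(iou(annotations[i][1], annotations[j][1])
                                          for j in cluster))

# Hypothetical data: participant p1 annotated two overlapping nuclei.
annotations = [
    ("p1", (0, 0, 10, 10)),
    ("p2", (1, 1, 11, 11)),
    ("p1", (6, 0, 16, 10)),
    ("p3", (50, 50, 60, 60)),
]
clusters = cluster_boxes(annotations, threshold=0.25)
anchors = [medoid(c, annotations) for c in clusters]
```

In this toy input, the second p1 box overlaps the first cluster enough to pass the 0.25 threshold on its own, but the constraint keeps it as a separate anchor proposal because p1 already contributed a box to that cluster.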
Figure 3: Accuracy of participant annotations. A. Detection precision-recall comparing annotations to inferred P-truth. Junior pathologists tend to have similar precision but higher recall than senior pathologists, possibly reflecting the time constraints of pathologists. PPV: positive predictive value. B. Classification ROC for classes and super-classes. The overall classification accuracy of inferred NP-labels was high. However, class-balanced accuracy (macro-average) is notably lower because NPs are less reliable annotators of uncommon classes. FPR: false-positive rate. C. Confusion between pathologist annotations and inferred P-truth. D. Multidimensional scaling (MDS) analysis of interrater classification agreement. Some clustering by participant experience (blue ellipse) highlights the importance of modeling reliability during label inference. E. A simulation was used to measure how redundancy affects the classification accuracy of inferred NP-labels. While keeping the total number of NPs constant, we randomly kept annotations for a variable number of NPs per FOV. Accuracy in these simulations was class-dependent, with stromal nuclei requiring more redundancy for accurate inference. Each simulation is represented by one notched box plot, where notches correspond to the bootstrapped 95% interval around the median, and the whiskers extend for 1.5× the interquartile range.
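The notches in panel E are described as a bootstrapped 95% interval around the median. A minimal sketch of one common way to compute such an interval (the percentile bootstrap) follows; the caption does not state which bootstrap variant or how many resamples were used, so `n_boot`, `seed`, the function name, and the sample accuracies are all assumptions for illustration.

```python
import random
import statistics

def bootstrap_median_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for the median of `values`.

    Resamples `values` with replacement `n_boot` times, takes the median of
    each resample, and returns the alpha/2 and 1 - alpha/2 percentiles.
    """
    rng = random.Random(seed)
    n = len(values)
    medians = sorted(
        statistics.median(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    lo = medians[int((alpha / 2) * n_boot)]
    hi = medians[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical per-simulation accuracies, not values from the study.
accuracies = [0.71, 0.74, 0.78, 0.80, 0.81, 0.83, 0.85, 0.86, 0.90]
lo, hi = bootstrap_median_ci(accuracies)
```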
Figure 4: Effect of algorithmic suggestions on annotation abundance and accuracy. We compared annotations from the Evaluation dataset and controls to measure the effect of suggestions and Mask R-CNN refinement on the acquisition of nucleus segmentation data and the accuracy of annotations. A. Example annotations from a single participant. Algorithmic suggestions allow the collection of accurate nucleus segmentations without added effort. Yellow points indicate clicks to approve suggestions. B. The number of segmented nuclei clicked is significantly higher for the Evaluation dataset than for the Bootstrap control, indicating that refinement improves suggestion quality. C. Accuracy of algorithmic segmentation suggestions. The comparison is made against a limited set of manually traced segmentation boundaries obtained from 1 senior pathologist (SP). Suggestions that were determined to be correct by the expectation-maximization procedure had significantly more accurate segmentation boundaries. D. Self-agreement for annotations in the presence or absence of algorithmic suggestions. The agreement is substantial for non-pathologist (NP) and pathologist (P) groups, indicating that algorithmic suggestions do not affect classification decisions adversely. Pathologists have higher self-agreement and are less impressionable than NPs. E. ROC curves for the classification accuracy of inferred NP-labels, using inferred P-truth as the reference. **P < 0.01; ***P < 0.001.
Figure 5: Effect of clustering on detection and interrater agreement. A. Stricter IOU thresholds reduce the number of anchor proposals generated by clustering but increase agreement. A threshold of 0.25 provides more anchor proposals with negligible difference in agreement from the 0.5 threshold. The shaded region indicates that by design, there are no anchor proposals with <2 clustered annotations. B. The clustering constraint prevents annotations from the same participant from being assigned to the same anchor, preserving participant intention when annotating overlapping nuclei. This results in better detection of overlapping nuclei during clustering (upper panel) and also affects the inferred P-truth for anchors (bottom panel). C. Interrater classification agreement among pathologists for tested clustering thresholds. D. Pairwise interrater classification agreement (Cohen κ) at 0.25 IOU threshold. **P < 0.01; ***P < 0.001.
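Panel D reports pairwise agreement as Cohen κ, which corrects the observed agreement between two raters for the agreement expected by chance. A minimal sketch of the standard unweighted formula, κ = (p_o − p_e) / (1 − p_e), is given below. The function name and the class labels are hypothetical and only illustrate the calculation; they are not data from the study.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels from two raters for six matched anchors.
rater_1 = ["tumor", "tumor", "stroma", "til", "til", "stroma"]
rater_2 = ["tumor", "stroma", "stroma", "til", "til", "stroma"]
kappa = cohen_kappa(rater_1, rater_2)  # 0.75 for this toy input
```

Here the raters agree on 5 of 6 items (p_o = 5/6) and the chance agreement is p_e = 1/3, giving κ = 0.75.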