| Literature DB >> 32747665 |
Christian Marzahl, Marc Aubreville, Christof A Bertram, Jason Stayt, Anne-Katherine Jasensky, Florian Bartenschlager, Marco Fragoso-Garcia, Ann K Barton, Svenja Elsemann, Samir Jabari, Jens Krauth, Prathmesh Madhu, Jörn Voigt, Jenny Hill, Robert Klopfleisch, Andreas Maier.
Abstract
Exercise-induced pulmonary hemorrhage (EIPH) is a common condition in sport horses with a negative impact on performance. Cytology of bronchoalveolar lavage fluid using a scoring system is considered the most sensitive diagnostic method: macrophages are graded by the amount of hemosiderin in their cytoplasm. The current gold standard is manual grading, which is, however, monotonous and time-consuming. We evaluated state-of-the-art deep learning-based methods for single-cell macrophage classification, compared them against the performance of nine cytology experts, and assessed inter- and intra-observer variability. Additionally, we evaluated object detection methods on a novel data set of 17 completely annotated cytology whole slide images (WSI) containing 78,047 hemosiderophages. Our deep learning-based approach reached a concordance of 0.85, partially exceeding human expert concordance (0.68 to 0.86, mean 0.73, SD 0.04). Intra-observer variability was high (0.68 to 0.88) and inter-observer concordance was moderate (Fleiss' kappa = 0.67). Our object detection approach achieves a mean average precision of 0.66 over the five grades on gigapixel whole slide images, with a computation time of under two minutes per slide. To mitigate the high inter- and intra-rater variability, we propose our automated object detection pipeline, enabling accurate, reproducible and quick EIPH scoring in WSI.
Year: 2020 PMID: 32747665 PMCID: PMC7398908 DOI: 10.1038/s41598-020-65958-2
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Data set statistics for each fully annotated WSI.
| File | Staining | Total Cells | Score | Grade 0 | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Mean | SD |
|---|---|---|---|---|---|---|---|---|---|---|
| 01_EIPH | Prussian | 4446 | 126 | 1013 | 1782 | 1218 | 348 | 85 | 1.26 | 0.96 |
| 02_EIPH | Prussian | 12812 | 72 | 5084 | 6203 | 1450 | 64 | 11 | 0.72 | 0.68 |
| 03_EIPH | Prussian | 6325 | 37 | 4295 | 1697 | 330 | 3 | 0 | 0.37 | 0.58 |
| 04_EIPH | Prussian | 5448 | 63 | 2551 | 2379 | 508 | 10 | 0 | 0.63 | 0.66 |
| 05_EIPH | Prussian | 2489 | 34 | 1754 | 634 | 99 | 2 | 0 | 0.34 | 0.55 |
| 06_EIPH | Turnbull | 2992 | 41 | 1908 | 933 | 148 | 3 | 0 | 0.41 | 0.59 |
| 07_EIPH | Turnbull | 1073 | 235 | 48 | 127 | 352 | 495 | 51 | 2.35 | 0.91 |
| 08_EIPH | Turnbull | 924 | 67 | 471 | 290 | 160 | 3 | 0 | 0.67 | 0.76 |
| 09_EIPH | Turnbull | 4752 | 216 | 568 | 1053 | 932 | 1446 | 753 | 2.16 | 1.27 |
| 10_EIPH | Prussian | 10385 | 208 | 592 | 2131 | 4037 | 3098 | 527 | 2.08 | 0.96 |
| 11_EIPH | Prussian | 5751 | 59 | 2839 | 2452 | 435 | 25 | 0 | 0.59 | 0.65 |
| 12_EIPH | Turnbull | 1112 | 35 | 767 | 302 | 43 | 0 | 0 | 0.35 | 0.55 |
| 13_EIPH | Turnbull | 968 | 43 | 637 | 252 | 70 | 8 | 1 | 0.43 | 0.67 |
| 14_EIPH | Prussian | 3143 | 39 | 1995 | 1062 | 81 | 5 | 0 | 0.39 | 0.55 |
The columns give the total number of alveolar macrophages/hemosiderophages, the number of cells at each hemosiderin grade, and the corresponding mean grade and standard deviation. The final three bold rows indicate the test set.
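The Score column above is simply the mean hemosiderin grade scaled by 100 (e.g. 01_EIPH: mean 1.26, score 126). A minimal sketch of that calculation, with a function name of our own choosing rather than code from the paper:

```python
# Sketch: total hemosiderin score as the mean macrophage grade x 100.
# `eiph_score` is an illustrative helper, not code released with the paper.
def eiph_score(counts):
    """counts[g] = number of macrophages with hemosiderin grade g (0-4)."""
    total = sum(counts.values())
    mean_grade = sum(g * n for g, n in counts.items()) / total
    return round(100 * mean_grade), mean_grade

# Grade counts for slide 01_EIPH from the table above.
counts_01 = {0: 1013, 1: 1782, 2: 1218, 3: 348, 4: 85}
score, mean_grade = eiph_score(counts_01)
print(score, round(mean_grade, 2))  # 126 1.26
```

Applying the same function to any other row of the table reproduces its Score and Mean columns.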
Figure 1 Left: Clumps of hemosiderin in an area with artefacts (hair). The staining method used cannot distinguish between intra-cellular and extra-cellular hemosiderin, making annotation of this area especially ambiguous. Centre: Example of the sampling strategy on image 17_EIPH (Turnbull blue) with 7,095 cells; note the high sampling probability for the node containing the only grade-four cell. Each cell is marked with a dot. Right: Object detection result for a region of image 17_EIPH (Turnbull blue), with ground truth on top and predictions at the bottom.
Figure 2 Cell-based regression results on the test data set, visualised as a density histogram of the predicted scores. For example, both cells in the middle are labelled grade two, yet the regression model assigned them very different scores, which is consistent with their visual appearance.
Figure 3 Object detection and score prediction based on RetinaNet. (a) ResNet-18 serves as the backbone network for the (c) Feature Pyramid Network[39], which generates rich, multi-scale features. The features ResNet-18 extracts from the patch are additionally used for a direct regression-based score estimate. (d) predicts a regression-based score for each cell, (e) classifies each cell into one of the five grades or background, and (f) regresses from anchor boxes to ground-truth bounding boxes.
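The box-regression head (f) predicts offsets relative to anchor boxes rather than absolute coordinates. A minimal sketch of the standard R-CNN/RetinaNet decoding step, written as plain Python for illustration (not code released with the paper):

```python
# Sketch: applying predicted regression deltas (dx, dy, dw, dh) to an
# anchor box to obtain the final detection box, as in head (f).
import math

def decode_box(anchor, deltas):
    """anchor = (cx, cy, w, h) in centre format; deltas from the network."""
    cx, cy, w, h = anchor
    dx, dy, dw, dh = deltas
    pred_cx = cx + dx * w          # centre shift, scaled by anchor size
    pred_cy = cy + dy * h
    pred_w = w * math.exp(dw)      # width/height offsets in log space
    pred_h = h * math.exp(dh)
    # convert centre format to corner format (x1, y1, x2, y2)
    return (pred_cx - pred_w / 2, pred_cy - pred_h / 2,
            pred_cx + pred_w / 2, pred_cy + pred_h / 2)

# Zero deltas reproduce the anchor itself.
print(decode_box((50, 50, 20, 20), (0, 0, 0, 0)))  # (40.0, 40.0, 60.0, 60.0)
```

The log-space width and height parameterisation keeps predicted boxes positive regardless of the raw network output.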
Figure 4 From left to right: confusion matrix for the automatic single-cell classification results; accumulated confusion matrix for all human experts; performance metrics diagram visualising concordance with the ground truth for trials one and two (Con-V0, Con-V1), together with the intra-rater concordance (Con-IR) and Cohen's kappa. [B = Beginner, P = Professional, E = Expert, DL = deep learning approach.]
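The inter-observer agreement reported in the abstract (Fleiss' kappa = 0.67) extends Cohen's kappa to more than two raters. A self-contained sketch of the statistic, with made-up toy data for illustration:

```python
# Sketch: Fleiss' kappa for multiple raters. ratings[i][j] = number of
# raters assigning cell i to grade j. The example data are invented;
# this is not the paper's evaluation code.
def fleiss_kappa(ratings):
    N = len(ratings)                      # number of rated cells
    n = sum(ratings[0])                   # raters per cell (constant)
    k = len(ratings[0])                   # number of grade categories
    # marginal proportion of assignments to each category
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    # per-cell observed agreement
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N                  # mean observed agreement
    P_e = sum(p * p for p in p_j)         # chance agreement
    return (P_bar - P_e) / (1 - P_e)

# Perfect agreement among 3 raters on 2 cells gives kappa = 1.0.
print(fleiss_kappa([[3, 0], [0, 3]]))  # 1.0
```

Values around 0.67, as observed for the nine raters, are conventionally read as substantial but clearly imperfect agreement.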
Figure 5 The left diagram visualises the regression error for the single-cell classification task. The three remaining panels show the object detection results on the test set (slide #17) across 1,049 patches of size 1024 × 1024: ground truth (left), predictions (middle) and error (right). Large errors at the outer circle boundary can be explained by missed cell annotations.
Comparison of multiple object detection architectures with their corresponding backbone, number of parameters, accuracy, score error and average inference speed per test WSI.
| Architecture | Backbone | Parameters | mAP_50 | Score Error | Inference Speed |
|---|---|---|---|---|---|
| Ours | RN-18 | 11,434,555 | 0.64 | 15 | 101 s |
| Ours | RN-18 | 11,987,739 | 0.65 | 13 | 101 s |
| Ours | RN-18 | 13,683,675 | 0.66 | 9 | 103 s |
| Ours | RN-18 | 22,625,439 | 0.66 | 9 | 111 s |
| RetinaNet | RN-18 | 19,729,755 | 0.66 | 9 | 111 s |
| RetinaNet | RN-34 | 29,837,915 | 0.66 | 9 | 142 s |
| RetinaNet | RN-50 | 36,288,347 | 0.66 | 8 | 258 s |
| SSD | MobileNetV2 | 13,871,354 | 0.61 | 21 | 105 s |
| Faster-RCNN | RN-50 | 128,383,642 | 0.66 | 7 | 305 s |
| SVM | RBF-Kernel | / | / | 21 | 65 s |
| DL-Regression | RN-18 | 11,704,897 | / | 19 | 92 s |
We incrementally increased the number of channels and convolutional layers in our implementation until the mAP converged to 0.66. The errors of the deep learning-based regression and of the support vector machine are shown for comparison.
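The mAP_50 column scores a prediction as a true positive only if it overlaps an unmatched ground-truth box of the same grade with an intersection over union (IoU) of at least 0.5. A sketch of just that matching criterion; the full AP computation (confidence ranking, precision-recall integration, averaging over the five grades) is omitted:

```python
# Sketch: the IoU >= 0.5 matching test behind the mAP_50 metric.
# Illustrative only; not the paper's evaluation code.
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

pred, gt = (0, 0, 10, 10), (5, 0, 15, 10)
overlap = iou(pred, gt)
print(round(overlap, 3), overlap >= 0.5)  # 0.333 False: not a match at IoU 0.5
```

Because hemosiderophages are roughly circular and similar in size, the 0.5 threshold is a comparatively forgiving criterion here, and most of the residual error stems from grade confusion rather than localisation.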