| Literature DB >> 36106328 |
Robin S Mayer, Steffen Gretser, Lara E Heckmann, Paul K Ziegler, Britta Walter, Henning Reis, Katrin Bankov, Sven Becker, Jochen Triesch, Peter J Wild, Nadine Flinner.
Abstract
Computational pathology has attracted considerable recent interest, with many algorithms being introduced to detect, for example, cancer lesions or molecular features. However, a large gap remains between artificial intelligence (AI) technology and practice, since only a small fraction of these applications is used in routine diagnostics. The main obstacles are the transferability of convolutional neural network (CNN) models to data from other sources and the identification of uncertain predictions. The role of tissue quality itself is also largely unknown. Here, we demonstrated that samples of the TCGA ovarian cancer (TCGA-OV) dataset from different tissue source sites have different quality characteristics and that CNN performance is linked to this property. CNNs performed best on high-quality data. Quality control tools were partially able to identify low-quality tiles, but their use did not increase the performance of the trained CNNs. Furthermore, we trained NoisyEnsembles by introducing label noise during training. These NoisyEnsembles could improve CNN performance on low-quality, unknown datasets. Moreover, performance increased as the ensemble became more consistent, suggesting that incorrect predictions can be discarded efficiently to avoid wrong diagnostic decisions.
Keywords: computational pathology; data perturbation; deep learning; ensemble learning; machine learning; ovarian cancer; quality control; tissue quality
Year: 2022 PMID: 36106328 PMCID: PMC9464871 DOI: 10.3389/fmed.2022.959068
Source DB: PubMed Journal: Front Med (Lausanne) ISSN: 2296-858X
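The abstract's central technique, training a NoisyEnsemble by flipping a fraction of the training labels and then trusting only predictions the ensemble members agree on, can be sketched as follows. This is a minimal Python illustration, not the paper's implementation; the helper names (inject_label_noise, ensemble_predict), the 0/1 label coding, and the agreement rule are all assumptions.

import numpy as np

def inject_label_noise(labels, noise_level, seed=None):
    # Flip a random fraction of binary tile labels (0 = healthy, 1 = cancer).
    # Hypothetical helper illustrating the "label noise during training" idea.
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    n_flip = int(noise_level * len(labels))
    idx = rng.choice(len(labels), size=n_flip, replace=False)
    labels[idx] = 1 - labels[idx]
    return labels

def ensemble_predict(member_preds, min_agreement=1.0):
    # Majority vote over ensemble members; predictions whose agreement falls
    # below min_agreement are marked invalid and can be discarded.
    member_preds = np.asarray(member_preds)   # (n_members, n_tiles), 0/1
    votes = member_preds.mean(axis=0)         # fraction voting "cancer"
    majority = (votes >= 0.5).astype(int)
    agreement = np.where(majority == 1, votes, 1 - votes)
    return majority, agreement >= min_agreement

Tiles for which ensemble_predict returns an invalid flag would be withheld from automatic classification rather than risk a wrong diagnostic decision.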
Figure 1. CNN performance depended on tissue quality and tissue source site. The TCGA-OV dataset was split into four subsets, and 4 × 15 individual CNNs were trained with different train-validation splits, using one subset at a time as the test set. The whole procedure was repeated 10 times. CNN performance was only recorded for WSIs in the test set. (A) Average accuracy for each WSI. The identifiers with their accuracies are given in Supplementary Table S1. Error bars depict standard deviation. (B) Boxplot of the accuracy values of all slides, grouped by the tissue quality assigned by pathologist P1. At least one category is significantly different according to an ANOVA (p = 0.00011); significant differences between groups (post-hoc t-test with Bonferroni-Holm p-value adjustment, p < 0.05) are marked with *. (C) Contingency table of tissue source site vs. tissue quality. (D) Boxplot of the accuracy values of all slides from the different tissue source sites. At least one category is significantly different according to an ANOVA (p = 0.00005); significant differences between groups (post-hoc t-test with Bonferroni-Holm p-value adjustment, p < 0.05) are marked with *.
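The statistics reported in this caption, a one-way ANOVA followed by pairwise post-hoc t-tests with Bonferroni-Holm adjustment, can be reproduced roughly as below; the function name quality_group_test and the dict-of-lists input layout are illustrative assumptions.

from itertools import combinations
from scipy.stats import f_oneway, ttest_ind
from statsmodels.stats.multitest import multipletests

def quality_group_test(acc_by_group, alpha=0.05):
    # acc_by_group: dict mapping group name (e.g. quality level or source
    # site) -> list of per-WSI accuracies.
    _, p_anova = f_oneway(*acc_by_group.values())  # omnibus test
    pairs = list(combinations(acc_by_group, 2))
    raw_p = [ttest_ind(acc_by_group[a], acc_by_group[b]).pvalue
             for a, b in pairs]
    # Holm step-down adjustment for the pairwise comparisons.
    reject, p_adj, _, _ = multipletests(raw_p, alpha=alpha, method="holm")
    return p_anova, dict(zip(pairs, zip(p_adj, reject)))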
Figure 3. Quality control tools did not increase performance. Percentage of retained tissue per WSI by (A) HistoQC and (B) PathProfiler for the three quality levels assigned by pathologist P1. For HistoQC, at least one group differs (ANOVA, p = 0.00003; post-hoc t-tests with Bonferroni-Holm p-value adjustment, p < 0.05, marked with *); for PathProfiler, there are no significant differences between groups (ANOVA, p = 0.34). (C) The CNNs trained for Figure 2 were evaluated on test datasets that were quality controlled by either HistoQC (black) or PathProfiler (light gray) and thus contained no low-quality tiles. Plotted is the performance difference between the original and the quality-controlled test datasets. Error bars depict standard deviation.
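Panel (C)'s comparison, the performance change after removing tiles a QC tool rejected, reduces to masking the per-tile predictions. A minimal sketch, assuming binary per-tile predictions and a boolean keep-mask derived from the HistoQC or PathProfiler output; qc_performance_delta is a hypothetical helper.

import numpy as np

def qc_performance_delta(preds, labels, keep_mask):
    # keep_mask: True where the QC tool retained the tile. Returns the
    # accuracy difference (QC-filtered minus original test set).
    preds, labels = np.asarray(preds), np.asarray(labels)
    keep_mask = np.asarray(keep_mask, dtype=bool)
    acc_all = (preds == labels).mean()
    acc_qc = (preds[keep_mask] == labels[keep_mask]).mean()
    return acc_qc - acc_all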
Figure 4. NoisyEnsembles could identify tiles with uncertain predictions. CNNs were trained with data from tissue source site A (left column) or tissue source site B (right column), while a defined fraction of tiles in the training dataset was deliberately mislabeled. For each patient, either healthy or cancerous tiles were used during training. Hold-out test sets were chosen randomly, and with the remaining data 15 CNNs with different train-validation splits were trained and combined into a bagging ensemble. The complete procedure was repeated 10 times. (A,B) Average ensemble accuracy for different test datasets. (C–F) NoisyEnsemble with a noise level of 15% during training. (C,D) Number of tiles with a valid ensemble prediction at the respective ensemble-agreement level. (E,F) Ensemble accuracy for individual levels of ensemble agreement; predictions with lower agreement were ignored in the respective categories. Error bars depict standard deviation.
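Panels (C–F) relate ensemble agreement to accuracy and to the number of tiles that still receive a valid prediction. A sketch of that analysis, assuming binary per-member predictions; accuracy_vs_agreement and the threshold grid are illustrative, not the authors' code.

import numpy as np

def accuracy_vs_agreement(member_preds, labels, thresholds=(0.6, 0.8, 1.0)):
    # For each agreement cutoff, report how many tiles keep a valid
    # prediction and how accurate the ensemble is on those tiles.
    member_preds, labels = np.asarray(member_preds), np.asarray(labels)
    votes = member_preds.mean(axis=0)
    majority = (votes >= 0.5).astype(int)
    agreement = np.where(majority == 1, votes, 1 - votes)
    out = {}
    for t in thresholds:
        valid = agreement >= t
        acc = (majority[valid] == labels[valid]).mean() if valid.any() else float("nan")
        out[t] = {"n_tiles": int(valid.sum()), "accuracy": float(acc)}
    return out

Raising the cutoff trades coverage for accuracy, which matches the caption's observation that ignoring low-agreement predictions improves the remaining ones.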
Figure 2. CNN transferability depended on the quality of the training data. CNNs were exclusively trained and validated on data from (A) source site A (n = 21+7), (B) source site B (n = 27+7), or (C) both source sites (n = 48+14), and tested on hold-out test data from site A (n = 7), site B (n = 7), and external data from UKF. Hold-out test data were randomly chosen 10 times, and for every test set 15 individual CNNs with different train-validation splits were trained and their performance was measured. Boxplots show all recorded performances. Blue: site A; orange: site B; green: UKF.
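The evaluation scheme described here (10 random hold-out draws, each with 15 train-validation splits) might look like the following; repeated_holdout and the placeholder train_cnn call are assumptions, and the 80/20 validation split is arbitrary.

import numpy as np
from sklearn.model_selection import train_test_split

def repeated_holdout(site_slides, n_test=7, n_repeats=10, n_splits=15, seed=0):
    # Per repeat: hold out n_test slides as the test set, then draw
    # n_splits different train-validation partitions from the remainder.
    rng = np.random.default_rng(seed)
    for repeat in range(n_repeats):
        test = list(rng.choice(site_slides, size=n_test, replace=False))
        rest = [s for s in site_slides if s not in set(test)]
        for split in range(n_splits):
            train, val = train_test_split(
                rest, test_size=0.2, random_state=int(rng.integers(1 << 31)))
            # A CNN would be trained on (train, val) and evaluated on `test`;
            # train_cnn is a placeholder, not the authors' code.
            yield repeat, split, train, val, test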