Young-Gon Kim, Sungchul Kim, Cristina Eunbee Cho, In Hye Song, Hee Jin Lee, Soomin Ahn, So Yeon Park, Gyungyub Gong, Namkug Kim.
Abstract
Fast and accurate confirmation of metastasis on the frozen tissue section of an intraoperative sentinel lymph node biopsy is essential for critical surgical decisions. However, accurate diagnosis by pathologists is difficult within the time limitations. Training a robust and accurate deep learning model is also difficult owing to the limited number of frozen datasets with high-quality labels. To overcome these issues, we validated the effectiveness of transfer learning from CAMELYON16 to improve the performance of a convolutional neural network (CNN)-based classification model on our frozen dataset (N = 297) from Asan Medical Center (AMC). Among the 297 whole slide images (WSIs), 157 and 40 WSIs were used for training and validation, respectively, and models were trained with dataset ratios of 2, 4, 8, 20, 40, and 100%. The remaining 100 WSIs were used to evaluate model performance in terms of patch- and slide-level classification. An additional 228 WSIs from Seoul National University Bundang Hospital (SNUBH) were used for external validation. Models with three initial weights, i.e., scratch-based (random initialization), ImageNet-based, and CAMELYON16-based, were compared to assess the effectiveness of each initialization. In the patch-level classification results on the AMC dataset, CAMELYON16-based models trained with a small dataset (up to 40%, i.e., 62 WSIs) showed a significantly higher area under the curve (AUC) of 0.929 than the scratch- and ImageNet-based models at 0.897 and 0.919, respectively, while CAMELYON16-based and ImageNet-based models trained with 100% of the training dataset showed comparable AUCs of 0.944 and 0.943, respectively. For the external validation, CAMELYON16-based models showed higher AUCs than the scratch- and ImageNet-based models. The feasibility of transfer learning to enhance model performance was thus validated for frozen section datasets with limited numbers.
Year: 2020 PMID: 33318495 PMCID: PMC7736325 DOI: 10.1038/s41598-020-78129-0
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Clinicopathologic characteristics of the patients in AMC and SNUBH datasets.
| Characteristic | AMC training set | AMC validation set | AMC test set | SNUBH test set |
|---|---|---|---|---|
| Age, median (range) | 50 (28–80) | 49 (30–68) | 47 (34–75) | 52 (25–87) |
| Female | 157 (100%) | 40 (100%) | 100 (100%) | 228 (100%) |
| Metastasis present, size > 2 mm | 68 (43.3%) | 14 (35%) | 40 (40%) | 84 (36.8%) |
| Metastasis present, size ≤ 2 mm | 35 (22.3%) | 5 (12.5%) | 15 (15%) | 44 (19.2%) |
| Metastasis absent | 54 (34.4%) | 21 (52.5%) | 45 (45%) | 100 (44%) |
AMC, Asan Medical Center; SNUBH, Seoul National University Bundang Hospital.
Figure 1. A validation workflow for transfer learning. (a) Two initial classification models are trained from two public datasets (i.e., ImageNet and CAMELYON16 datasets). (b) Private dataset from AMC is used to train classification models with different initial weights based on scratch-based (random initialization), ImageNet-based (pre-trained), and CAMELYON16-based (pre-trained) models. Different dataset ratios are selected at 2, 4, 8, 20, 40, and 100% of the total AMC dataset to train models for validation of the transfer learning effectiveness.
Figure 2. Validation loss and accuracy comparison of scratch-, ImageNet-, and CAMELYON16-based models for different AMC dataset ratios. (a) 4%, (b) 20%, (c) 100% of the total training dataset.
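The dataset ratios in this workflow can be illustrated with a minimal sketch. It assumes each ratio is applied to the 157 training WSIs and rounded down, which reproduces the 62 WSIs reported for the 40% ratio. The nested sampling scheme, the fixed seed, and the function names are illustrative assumptions, not the authors' procedure.

```python
import random

# Dataset ratios (percent of the training WSIs) named in the workflow.
RATIOS_PERCENT = [2, 4, 8, 20, 40, 100]


def subset_size(n_train, percent):
    """Number of WSIs for a ratio, rounded down (assumed rounding rule)."""
    return max(1, n_train * percent // 100)


def sample_subsets(slide_ids, seed=0):
    """Return one nested subset of slide IDs per ratio (hypothetical scheme)."""
    shuffled = list(slide_ids)
    random.Random(seed).shuffle(shuffled)
    return {p: shuffled[:subset_size(len(shuffled), p)] for p in RATIOS_PERCENT}


# 157 placeholder slide IDs stand in for the AMC training WSIs.
subsets = sample_subsets(range(157))
sizes = {p: len(ids) for p, ids in subsets.items()}
print(sizes)
```

Only the 62-slide count at 40% is stated in the source; the other subset sizes follow from the assumed rounding rule.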
Performance comparison of models based on different initial weights for different ratios of the training dataset.
| Model | Ratio | AMC patch-level sensitivity | AMC patch-level specificity | AMC patch-level accuracy | AMC patch-level AUC | AMC slide-level AUC | SNUBH slide-level AUC |
|---|---|---|---|---|---|---|---|
| Scratch-based | 2% | – | – | – | – | – | – |
| Scratch-based | 4% | – | – | – | – | – | – |
| Scratch-based | 8% | – | – | – | – | – | – |
| Scratch-based | 20% | 0.737 | 0.848 | 0.798 | 0.887** | 0.689** | 0.540** |
| Scratch-based | 40% | 0.715 | 0.873 | 0.801 | 0.897** | 0.703** | 0.495** |
| Scratch-based | 100% | 0.627 | 0.930 | 0.791 | 0.914** | 0.781** | 0.437** |
| ImageNet-based | 2% | 0.060 | 0.989 | 0.5669 | 0.825** | 0.739** | 0.592* |
| ImageNet-based | 4% | 0.284 | 0.976 | 0.660 | 0.869** | 0.724** | 0.625 |
| ImageNet-based | 8% | 0.184 | 0.995 | 0.626 | 0.885** | 0.794** | 0.667 |
| ImageNet-based | 20% | 0.431 | 0.966 | 0.722 | 0.894** | 0.847* | 0.723 |
| ImageNet-based | 40% | 0.483 | 0.972 | 0.749 | 0.919** | 0.870 | 0.819** |
| ImageNet-based | 100% | 0.660 | 0.968 | 0.828 | 0.943 | 0.888 | 0.798 |
| CAMELYON16-based | 2% | 0.475 | 0.898 | 0.705 | 0.843 | 0.814 | 0.689 |
| CAMELYON16-based | 4% | 0.617 | 0.907 | 0.775 | 0.881 | 0.874 | 0.667 |
| CAMELYON16-based | 8% | 0.474 | 0.967 | 0.742 | 0.895 | 0.873 | 0.695 |
| CAMELYON16-based | 20% | 0.494 | 0.973 | 0.755 | 0.912 | 0.867 | 0.763 |
| CAMELYON16-based | 40% | 0.534 | 0.972 | 0.772 | 0.929 | 0.878 | 0.749 |
| CAMELYON16-based | 100% | 0.656 | 0.969 | 0.827 | 0.944 | 0.886 | 0.804 |
For patch-level evaluation on the AMC dataset, sensitivity, specificity, and accuracy at a threshold of 0.5 and AUC were measured; for slide-level evaluation, AUC was measured on both the AMC and SNUBH datasets.
Statistical comparisons between the AUCs of the CAMELYON16-based model and those of the other models trained with the same ratio of the training dataset were performed to determine whether the differences in performance between initial weights were significant.
AMC, Asan Medical Center; SNUBH, Seoul National University Bundang Hospital.
*p-value < 0.05, **p-value < 0.0005.
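The patch-level metrics named in these notes can be sketched as follows. This is a minimal, dependency-free illustration, assuming binary labels (1 = tumor patch), predicted scores in [0, 1], and that a score equal to the threshold counts as positive; the function names and toy data are hypothetical and not taken from the study.

```python
def threshold_metrics(labels, scores, threshold=0.5):
    """Sensitivity, specificity, and accuracy at a fixed score threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / len(labels)
    return sensitivity, specificity, accuracy


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example with six patches (illustrative values only).
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.2, 0.6, 0.7, 0.1]
sens, spec, acc = threshold_metrics(labels, scores)
area = auc(labels, scores)
print(sens, spec, acc, area)
```

Because AUC is threshold-independent while sensitivity is not, a model can combine low sensitivity at 0.5 with a fairly high AUC, as in the 2% row that reports sensitivity 0.060 and AUC 0.825.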
Figure 3. An example of Grad-CAMs for tumor patch with scratch-, ImageNet-, and CAMELYON16-based models trained with different dataset ratios. Confidence below each result indicates the value of the last fully connected layer in each model. (a) Input patch, Grad-CAM results of models trained with (b) 4%, (c) 20%, and (d) 100% of the total AMC dataset.
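For readers unfamiliar with Grad-CAM, the core computation can be sketched in a few lines. This sketch follows the standard Grad-CAM formulation (channel weights from spatially averaged gradients, then a ReLU over the weighted sum of feature maps) and assumes the activations and gradients of one convolutional layer have already been extracted from a model. The function name and toy inputs are hypothetical, and the layer and framework used in the study are not stated here.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heat map from K feature maps and their gradients.

    activations, gradients: lists of K maps, each an H x W nested list.
    """
    # Channel weight = spatial mean of the class-score gradient for that channel.
    weights = [
        sum(sum(row) for row in g) / (len(g) * len(g[0])) for g in gradients
    ]
    height, width = len(activations[0]), len(activations[0][0])
    # Weighted sum over channels, followed by ReLU to keep positive evidence.
    return [
        [
            max(0.0, sum(w * a[i][j] for w, a in zip(weights, activations)))
            for j in range(width)
        ]
        for i in range(height)
    ]


# Toy example: two 2 x 2 feature maps with constant gradients (illustrative).
acts = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 2.0], [2.0, 0.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
cam = grad_cam(acts, grads)
print(cam)
```

In practice the resulting map is usually upsampled to the input patch size and overlaid on it, which is omitted here.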
Performance comparison of models based on different initial weights and learning rates for different ratios of the training dataset.
| Ratio | Learning rate | AUC, scratch-based model | AUC, ImageNet-based model | AUC, CAMELYON16-based model |
|---|---|---|---|---|
| 20% | 5e−2 | 0.516 | 0.511 | 0.572 |
| 20% | 5e−3 | 0.623 | 0.766 | 0.799 |
| 20% | 5e−4 | 0.887 | 0.894 | 0.912 |
| 20% | 5e−5 | 0.890 | 0.891 | 0.920 |
| 20% | 5e−6 | 0.813 | 0.872 | 0.910 |
| 40% | 5e−2 | 0.546 | 0.599 | 0.597 |
| 40% | 5e−3 | 0.759 | 0.803 | 0.812 |
| 40% | 5e−4 | 0.897 | 0.919 | 0.929 |
| 40% | 5e−5 | 0.881 | 0.905 | 0.930 |
| 40% | 5e−6 | 0.822 | 0.824 | 0.931 |
| 100% | 5e−2 | 0.546 | 0.616 | 0.631 |
| 100% | 5e−3 | 0.754 | 0.804 | 0.811 |
| 100% | 5e−4 | 0.914 | 0.943 | 0.944 |
| 100% | 5e−5 | 0.921 | 0.944 | 0.938 |
| 100% | 5e−6 | 0.911 | 0.906 | 0.901 |
AUC was measured for patch-level evaluation in the AMC dataset.
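To read the learning-rate table more easily, the following sketch looks up the learning rate with the highest patch-level AUC for each ratio and initialization. The AUC values are copied from the table; the data layout and the function name are illustrative assumptions.

```python
MODELS = ("scratch", "ImageNet", "CAMELYON16")

# ratio (%) -> learning rate -> AUCs in the order given by MODELS.
AUC_TABLE = {
    20: {"5e-2": (0.516, 0.511, 0.572), "5e-3": (0.623, 0.766, 0.799),
         "5e-4": (0.887, 0.894, 0.912), "5e-5": (0.890, 0.891, 0.920),
         "5e-6": (0.813, 0.872, 0.910)},
    40: {"5e-2": (0.546, 0.599, 0.597), "5e-3": (0.759, 0.803, 0.812),
         "5e-4": (0.897, 0.919, 0.929), "5e-5": (0.881, 0.905, 0.930),
         "5e-6": (0.822, 0.824, 0.931)},
    100: {"5e-2": (0.546, 0.616, 0.631), "5e-3": (0.754, 0.804, 0.811),
          "5e-4": (0.914, 0.943, 0.944), "5e-5": (0.921, 0.944, 0.938),
          "5e-6": (0.911, 0.906, 0.901)},
}


def best_learning_rate(ratio, model):
    """Return (learning rate, AUC) with the highest AUC for a ratio and model."""
    idx = MODELS.index(model)
    rates = AUC_TABLE[ratio]
    best = max(rates, key=lambda lr: rates[lr][idx])
    return best, rates[best][idx]


for ratio in AUC_TABLE:
    for model in MODELS:
        print(ratio, model, best_learning_rate(ratio, model))
```

The 5e−4 rows coincide with the patch-level AUCs quoted in the abstract for the 40% and 100% ratios, and the best AUC for every combination occurs at a learning rate of 5e−4 or smaller.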