Mina Rezaei, Janne J. Näppi, Christoph Lippert, Christoph Meinel, Hiroyuki Yoshida.
Abstract
Purpose: The identification of abnormalities that are relatively rare within otherwise normal anatomy is a major challenge for deep learning in the semantic segmentation of medical images. The small number of samples of the minority classes in the training data makes it difficult to learn an optimal classification boundary, while the far more frequent samples of the majority classes hamper the generalization of the boundary between the infrequently occurring target objects and the majority classes. In this paper, we developed a novel generative multi-adversarial network, called Ensemble-GAN, for mitigating this class imbalance problem in the semantic segmentation of abdominal images.
Method: The Ensemble-GAN framework is composed of a single generator and a multi-discriminator variant that handles the class imbalance problem to provide better generalization than existing approaches. The ensemble model aggregates the estimates of multiple models by training them from different initializations and with losses computed on various subsets of the training data. The single generator network takes the input image as a condition and predicts the corresponding semantic segmentation by using feedback from the ensemble of discriminator networks. To evaluate the framework, we trained it on two public datasets with different imbalance ratios and imaging modalities: CHAOS 2019 and LiTS 2017.
Result: In terms of the F1 score, the accuracies of the semantic segmentation of the healthy spleen, liver, and left and right kidneys were 0.93, 0.96, 0.90, and 0.94, respectively. The overall F1 scores for simultaneous segmentation of the lesions and the liver were 0.83 and 0.94, respectively.
Conclusion: The proposed Ensemble-GAN framework demonstrated outstanding performance in the semantic segmentation of medical images in comparison with other approaches on popular abdominal imaging benchmarks.
The Ensemble-GAN has the potential to segment abdominal images more accurately than human experts.
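The single-generator/multi-discriminator idea described in the abstract can be sketched minimally as an aggregation of adversarial feedback from several discriminators (illustrative only; the function names and the uniform weighting are assumptions, not the paper's implementation):

```python
import math

def bce(p, y):
    """Binary cross-entropy for one predicted probability p and target label y."""
    eps = 1e-7
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def generator_loss(disc_scores_on_fake, weights=None):
    """Aggregate adversarial feedback from an ensemble of discriminators.

    disc_scores_on_fake: probability each discriminator assigns to the
    generator's output being real. The generator is trained to fool every
    discriminator, i.e., to push each score toward the 'real' label 1.
    """
    if weights is None:
        weights = [1.0 / len(disc_scores_on_fake)] * len(disc_scores_on_fake)
    return sum(w * bce(p, 1.0) for w, p in zip(weights, disc_scores_on_fake))
```

The generator's loss shrinks as more discriminators are fooled, which is the feedback signal the abstract describes; in the paper, each discriminator additionally uses a different loss and operates on different feature maps.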
Keywords: Abdominal imaging; Generative multi-discriminative networks; Imbalanced learning; Semantic segmentation
Year: 2020 PMID: 32897490 PMCID: PMC7603459 DOI: 10.1007/s11548-020-02254-4
Source DB: PubMed Journal: Int J Comput Assist Radiol Surg ISSN: 1861-6410 Impact factor: 3.421
Fig. 1 Overview of the architecture of the proposed Ensemble-GAN, composed of a generator and multiple discriminators. The generator network (G) is a modified stacked hourglass architecture that takes random noise and a medical image as the condition and predicts the semantic segmentation through an ensemble of discriminator losses. Each discriminator (with a different loss) distinguishes between the ground truth and the different global and local feature maps predicted by G
Effect of hyperparameters on semantic segmentation results in terms of F1 score
| Architecture | CHAOS: Liver | CHAOS: Spleen | CHAOS: R-kidney | CHAOS: L-kidney | LiTS: Liver | LiTS: Lesion |
|---|---|---|---|---|---|---|
| Conditional GAN | ||||||
| 1 Disc. | 0.86 | 0.77 | 0.83 | 0.89 | 0.84 | 0.82 |
| Ensemble-GAN (1) | ||||||
| 2 Disc. | 0.89 | 0.88 | 0.90 | 0.91 | 0.90 | 0.84 |
| 2 Disc. | 0.89 | 0.89 | 0.91 | 0.91 | 0.91 | 0.85 |
| 2 Disc. | 0.90 | 0.89 | 0.91 | 0.92 | 0.91 | 0.85 |
| Ensemble-GAN (3) | ||||||
| 3 Disc. | 0.94 | 0.90 | 0.93 | 0.94 | 0.95 | 0.87 |
| 3 Disc. | 0.95 | 0.91 | 0.93 | 0.94 | 0.96 | 0.89 |
| 3 Disc. | 0.95 | 0.92 | 0.93 | 0.94 | 0.96 | 0.89 |
The F1 scores obtained across 100 epochs on both datasets are shown in the table
Fig. 2 a In the early epochs of training of the Ensemble-GAN, when G improves, the Ds deteriorate because G and the Ds work against each other. b After several epochs of training, the ensemble of Ds reaches the point of improving the segmentation output from G. As a result, the Ensemble-GAN shows good convergence, where the ensemble of Ds is unable to differentiate between the real and fake distributions. Here, loss G denotes the loss of the generator, and the two discriminator loss terms denote the adversarial losses of the discriminator on real and fake images, respectively, calculated on the high-resolution feature maps. The Attri term denotes the losses calculated on the low-resolution label map
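The convergence point described in the caption, where a discriminator can no longer separate real from fake, can be illustrated with the standard minimax BCE losses (a toy sketch, not the paper's exact loss terms): when a discriminator outputs 0.5 for both real and generated inputs, its combined loss settles at 2·log 2 ≈ 1.386.

```python
import math

def bce(p, y):
    # binary cross-entropy for one predicted probability p and target label y
    eps = 1e-7
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def discriminator_loss(p_real, p_fake):
    # D should assign label 1 to real samples and label 0 to generated ones;
    # its loss is low while it still separates the two distributions.
    return bce(p_real, 1.0) + bce(p_fake, 0.0)
```

A well-separating discriminator (e.g. scores 0.95 on real, 0.05 on fake) has a much lower loss than one at the equilibrium point, which matches the loss curves the figure describes.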
Accuracy for simultaneous liver and lesion segmentation in terms of the Dice score and the average symmetric surface distance (ASSD) on the test data, where index 1 refers to the liver and index 2 to the lesions

| Approaches | Dice1 | Dice2 | ASSD1 | ASSD2 |
|---|---|---|---|---|
| Ensemble-GAN (1) | 0.91 | 0.80 | 1.4 | 1.9 |
| Ensemble-GAN (2) | 0.92 | 0.81 | 1.4 | 1.7 |
| Ensemble-GAN (3) | 0.94 | 0.84 | 1.3 | 1.6 |
| cGAN | 0.85 | 0.81 | 1.8 | 2.1 |
| UNet | 0.72 | 0.70 | 19.04 | 19.04 |
| Cascaded-UNet | 0.93 | 0.93 | 2.3 | 2.3 |
| UNet+3DCRF | 0.95 | 0.50 | 0.92 | 1.3 |
| ResNet+Fusion | 0.95 | 0.50 | 0.84 | 13.33 |
| SuperAI | 0.96 | 0.81 | – | 1.1 |
| H-DenseUNet | 0.96 | 0.82 | 1.45 | 1.1 |
| coupleFCN | 0.78 | 0.77 | – | – |
The top four rows show the accuracy of the liver segmentation
| Architecture | VOE | RVD | ASSD | MSSD | F1 | Precision | Recall | Kappa |
|---|---|---|---|---|---|---|---|---|
| Ensemble-GAN (1) | 17 | 9.2 | – | 46.8 | 0.90 | 0.91 | 0.86 | 0.77 |
| Ensemble-GAN (2) | 16 | 7.7 | – | 41.2 | 0.91 | 0.91 | 0.86 | 0.79 |
| Ensemble-GAN (3) | 14 | 6.2 | – | 40.3 | 0.95 | 0.94 | 0.89 | 0.80 |
| cGAN | 21 | 10.8 | – | 87.1 | 0.88 | 0.90 | 0.79 | 0.68 |
| ResNet+Fusion | 16 | 5.3 | – | 48.3 | – | – | – | – |
| SuperAI | 36 | 4.27 | 1.1 | 6.2 | – | – | – | – |
| H-DenseUNet | 39 | 7.8 | 1.1 | 7.0 | – | – | – | – |
| coupleFCN | 35 | 12 | 1.0 | 7.0 | – | – | – | – |

VOE volume overlap error, RVD relative volume difference, ASSD average symmetric surface distance, MSSD maximum symmetric surface distance
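The overlap-based metrics in this footnote can be computed directly from binary masks; below is a minimal NumPy sketch (the surface-distance metrics ASSD/MSSD additionally require boundary extraction and distance transforms, so they are omitted here):

```python
import numpy as np

def voe(pred, gt):
    # volume overlap error: 1 minus the Jaccard index of the two binary masks
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 - inter / union

def rvd(pred, gt):
    # relative volume difference between the predicted and reference volumes
    return (pred.sum() - gt.sum()) / gt.sum()

def dice(pred, gt):
    # Dice similarity coefficient (twice the overlap over the summed volumes)
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())
```

For example, a prediction covering two voxels of which one overlaps a one-voxel ground truth gives VOE = 0.5, RVD = 1.0, and Dice = 2/3.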
Fig. 3 Semantic segmentation results obtained by Ensemble-GAN (3) on the LiTS dataset
The top three rows show the average accuracy for the semantic segmentation of abdominal CT and MR images with respect to the measurements obtained by the challenge organizer
| Architecture | VOE | RAVD | ASSD | MSSD | DICE | F1 | Precision | Recall |
|---|---|---|---|---|---|---|---|---|
| Ensemble-GAN (1) | 0.14 | 6.9 | 6.1 | 39.1 | 0.91 | 0.95 | 0.96 | 0.89 |
| Ensemble-GAN (2) | 0.15 | 5.7 | 5.8 | 40.1 | 0.92 | 0.94 | 0.96 | 0.88 |
| Ensemble-GAN (3) | 0.12 | 3.1 | 2.9 | 32.1 | 0.94 | 0.96 | 0.97 | 0.90 |
| cGAN | 2 | 10.8 | 17.3 | 51 | 0.83 | 0.85 | 0.69 | – |
| PKDIA [ | – | 8.43 | 6.37 | 33.1 | 0.88 | – | – | – |
| OvGUMEMoRIAL [ | – | 50 | 5.2 | 74.0 | 0.85 | – | – | – |
| IITKGP-KLIV [ | – | 13.5 | 16.6 | 130 | 0.63 | – | – | – |
| ISDUE [ | – | 14.0 | 9.81 | 37.1 | 0.85 | – | – | – |
VOE (volume overlap error), RAVD (relative absolute volume difference), ASSD (average symmetric surface distance), and MSSD (maximum symmetric surface distance) are defined by the CHAOS organizers. The average F1 score, precision, and recall are calculated as measures for the handling of the class imbalance problem
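The class-imbalance measures cited here (F1 score, precision, recall) follow the usual confusion-matrix definitions over the foreground class; a minimal NumPy sketch for binary masks (illustrative, not the challenge's evaluation code):

```python
import numpy as np

def precision_recall_f1(pred, gt):
    """Confusion-matrix measures for a binary segmentation mask."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()    # predicted foreground, truly foreground
    fp = np.logical_and(pred, ~gt).sum()   # predicted foreground, actually background
    fn = np.logical_and(~pred, gt).sum()   # missed foreground
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Because precision and recall are computed only over the (minority) foreground class, F1 is far more informative than plain voxel accuracy when the background dominates, which is why it is used as the imbalance measure here.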
Fig. 4 Semantic segmentation results obtained by Ensemble-GAN (3) on the CHAOS dataset
Fig. 5 Different losses induce results of different quality. Each column shows the results predicted by a different model
Effectiveness of each component and network architecture
| Architecture | CHAOS: Liver | CHAOS: Spleen | CHAOS: Right kidney | CHAOS: Left kidney | LiTS: Liver | LiTS: Lesion |
|---|---|---|---|---|---|---|
| Conditional GAN | ||||||
| 1 Disc. | 0.88 ± 0.08 | 0.80 ± 0.08 | 0.84 ± 0.09 | 0.91 ± 0.02 | 0.87 ± 0.02 | 0.82 ± 0.11 |
| 1 Disc. | 0.89 ± 0.06 | 0.83 ± 0.12 | 0.86 ± 0.03 | 0.92 ± 0.05 | 0.88 ± 0.01 | 0.84 ± 0.05 |
| 1 Disc. | 0.89 ± 0.03 | 0.81 ± 0.14 | 0.85 ± 0.08 | 0.92 ± 0.03 | 0.88 ± 0.02 | 0.83 ± 0.07 |
| 1 Disc. | 0.87 ± 0.14 | 0.77 ± 0.20 | 0.83 ± 0.05 | 0.90 ± 0.05 | 0.86 ± 0.02 | 0.82 ± 0.04 |
| Cyclic-Ensemble-GAN | ||||||
| 2 Disc. | 0.89 ± 0.05 | 0.88 ± 0.03 | 0.91 ± 0.06 | 0.91 ± 0.04 | 0.89 ± 0.01 | 0.85 ± 0.08 |
| Ensemble-GAN (1) | ||||||
| 2 Disc. | 0.89 ± 0.02 | 0.87 ± 0.10 | 0.90 ± 0.04 | 0.91 ± 0.09 | 0.92 ± 0.07 | 0.84 ± 0.02 |
| 2 Disc. | 0.89 ± 0.04 | 0.88 ± 0.06 | 0.91 ± 0.03 | 0.92 ± 0.03 | 0.93 ± 0.01 | 0.86 ± 0.23 |
| 2 Disc. | 0.92 ± 0.02 | 0.89 ± 0.05 | 0.92 ± 0.02 | – | 0.93 ± 0.02 | 0.85 ± 0.22 |
| 2 Disc. | 0.91 ± 0.03 | 0.88 ± 0.02 | 0.91 ± 0.14 | 0.93 ± 0.03 | 0.93 ± 0.02 | 0.85 ± 0.05 |
| Ensemble-GAN (2) | ||||||
| 3 Disc. | 0.92 ± 0.02 | 0.90 ± 0.12 | 0.91 ± 0.05 | 0.94 ± 0.03 | 0.92 ± 0.02 | 0.88 ± 0.02 |
| Ensemble-GAN (3) | ||||||
| 3 Disc. | 0.92 ± 0.03 | 0.94 ± 0.04 | 0.96 ± 0.07 | 0.89 ± 0.02 | ||
| 3 Disc. | 0.94 ± 0.08 | 0.93 ± 0.06 | 0.93 ± 0.05 | |||
The F1 scores obtained across 100 epochs using the different datasets with different imbalanced ratios and image modalities are shown in the table. Bold scores indicate the best F1 score obtained for each dataset