Holger R Roth, Ziyue Xu, Carlos Tor-Díez, Ramon Sanchez Jacob, Jonathan Zember, Jose Molto, Wenqi Li, Sheng Xu, Baris Turkbey, Evrim Turkbey, Dong Yang, Ahmed Harouni, Nicola Rieke, Shishuai Hu, Fabian Isensee, Claire Tang, Qinji Yu, Jan Sölter, Tong Zheng, Vitali Liauchuk, Ziqi Zhou, Jan Hendrik Moltz, Bruno Oliveira, Yong Xia, Klaus H Maier-Hein, Qikai Li, Andreas Husch, Luyang Zhang, Vassili Kovalev, Li Kang, Alessa Hering, João L Vilaça, Mona Flores, Daguang Xu, Bradford Wood, Marius George Linguraru.
Abstract
Artificial intelligence (AI) methods for the automatic detection and quantification of COVID-19 lesions in chest computed tomography (CT) might play an important role in the monitoring and management of the disease. We organized an international challenge and competition for the development and comparison of AI algorithms for this task, which we supported with public data and state-of-the-art benchmark methods. Board-certified radiologists annotated 295 public images from two sources (A and B) for algorithm training (n=199, source A), validation (n=50, source A) and testing (n=23, source A; n=23, source B). There were 1,096 registered teams, of which 225 and 98 completed the validation and testing phases, respectively. The challenge showed that AI models could be rapidly designed by diverse teams with the potential to measure disease or facilitate timely and patient-specific interventions. This paper provides an overview and the major outcomes of the COVID-19 Lung CT Lesion Segmentation Challenge - 2020.
Keywords: COVID-19; Challenge; Medical image segmentation
Year: 2022 PMID: 36156419 PMCID: PMC9444848 DOI: 10.1016/j.media.2022.102605
Source DB: PubMed Journal: Med Image Anal ISSN: 1361-8415 Impact factor: 13.828
Fig. 1 The countries of origin of the 98 teams that completed the training, validation and test phases of the challenge.
Fig. 2 Demographic information of the leaders of the 98 teams that completed the training, validation and test phases of the challenge. The top row shows the age group (left), student status (middle) and sex (right) of the participant. The middle row shows the highest degree (left) and job category (right). Bottom row shows the algorithm characteristics for the 98 submissions that completed the training, validation and test phases of the challenge. We report if algorithms were fully-automated (left), used external data for training (middle) or used a general pre-trained network for initialization (right).
Fig. 3 Data variability between “seen” and “unseen” sources; (a) Illustration of the differences in the voxel spacing and voxel volume grouped by training, validation, and test sets. (b) Differences in COVID-19 lesion volumes across the image data sources. (c) Normalized histograms showing the CT intensity distributions of the “seen” and “unseen” data sources in Hounsfield units (HU). Note, −1000 HU corresponds to air, and 750 to cancellous bone (Patrick et al., 2017).
Fig. 4 Blob plot visualization of the ranking variability via bootstrapping. An algorithm’s ranking stability is shown across the different tasks, illustrating the ranking uncertainty of the algorithm in each task. For more details see Wiesenfarth et al. (2021).
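The idea of assessing ranking variability via bootstrapping can be illustrated with a minimal sketch. This is not the implementation of Wiesenfarth et al. (2021); it assumes that algorithms are ranked by their mean score over test cases and that higher scores are better. The function name `bootstrap_rank_counts` and the toy scores are hypothetical.

```python
import random
from collections import Counter


def bootstrap_rank_counts(scores, n_boot=1000, seed=0):
    """Count how often each algorithm attains each rank under resampling.

    scores: dict mapping algorithm ID -> list of per-case scores, with all
            lists in the same case order (higher is assumed to be better).
    Returns a dict mapping algorithm ID -> Counter {rank: count}.
    """
    rng = random.Random(seed)
    algos = list(scores)
    n_cases = len(scores[algos[0]])
    if any(len(scores[a]) != n_cases for a in algos):
        raise ValueError("all algorithms need scores for the same cases")

    counts = {a: Counter() for a in algos}
    for _ in range(n_boot):
        # Resample test cases with replacement (one bootstrap replicate).
        sample = [rng.randrange(n_cases) for _ in range(n_cases)]
        means = {a: sum(scores[a][i] for i in sample) / n_cases for a in algos}
        # Rank by mean score; exact ties fall back to input order (assumption).
        ordered = sorted(algos, key=lambda a: means[a], reverse=True)
        for rank, a in enumerate(ordered, start=1):
            counts[a][rank] += 1
    return counts


# Toy, made-up per-case scores for two hypothetical algorithms.
toy_scores = {"A": [0.9, 0.8, 0.7], "B": [0.5, 0.4, 0.3]}
toy_counts = bootstrap_rank_counts(toy_scores, n_boot=200, seed=1)
```

An algorithm whose counts concentrate on a single rank has a stable ranking; counts spread over many ranks indicate ranking uncertainty, which is what the blob sizes in such a plot convey.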
Top-10 finalists after statistical ranking. “Value” represents the average rank the algorithm achieved across all tasks. We also show if methods were automated, used external data for training, the input data dimensions used in the algorithms, and the network architecture.
| Rank | Value | ID # | Fully automated | External data | Pre-trained | Ensemble | Data dimensions | Network | Authors | Country |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2.6 | 53 | ✓ | ✓ | ✗ | ✗ | 3D | nnU-Net | S. Hu et al. | China |
| 2 | 6.0 | 38 | ✓ | ✗ | ✗ | ✓ | 3D | nnU-Net | F. Isensee et al. | Germany |
| 3 | 7.7 | 65 | ✓ | ✗ | ✗ | ✓ | 2D/3D | nnU-Net | C. Tang | USA |
| 4 | 8.4 | 58 | ✓ | ✗ | ✗ | ✓ | 3D | nnU-Net | Q. Yu et al. | China |
| 5 | 8.5 | 31 | ✓ | ✗ | ✗ | ✓ | 3D | nnU-Net | J. Sölter et al. | Luxembourg |
| 6 | 9.2 | 50 | ✓ | ✗ | ✗ | ✓ | 2D/3D | nnU-Net | T. Zheng & L. Zhang | Japan |
| 6 | 9.2 | 68 | ✓ | ✗ | ✓ | ✗ | 2D/3D | VGG16 Hybrid, | V. Liauchuk et al. | Belarus |
| 8 | 9.4 | 95 | ✓ | ✗ | ✗ | ✓ | 3D | nnU-Net | Z. Zhou et al. | China |
| 9 | 10.6 | 29 | ✓ | ✗ | ✗ | ✗ | 3D | nnU-Net | J. Moltz et al. | Germany |
| 10 | 11.3 | 15 | ✓ | ✗ | ✗ | ✗ | 3D | U-Net | B. Oliveira et al. | Portugal |
Dice coefficients of the top-10 algorithms on (left) all test data, (middle) “seen” data (Dataset 1), and (right) “unseen” test data (Dataset 2).
| All test cases: ID # | mean | std | median | “Seen” test cases: ID # | mean | std | median | “Unseen” test cases: ID # | mean | std | median |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 53 | 0.666 | 0.236 | 0.754 | 38 | 0.740 | 0.195 | 0.797 | 53 | 0.598 | 0.264 | 0.700 |
| 58 | 0.658 | 0.242 | 0.741 | 53 | 0.734 | 0.182 | 0.782 | 95 | 0.593 | 0.258 | 0.677 |
| 95 | 0.658 | 0.237 | 0.729 | 31 | 0.729 | 0.190 | 0.769 | 58 | 0.588 | 0.263 | 0.724 |
| 38 | 0.654 | 0.268 | 0.763 | 65 | 0.729 | 0.186 | 0.778 | 15 | 0.581 | 0.264 | 0.670 |
| 15 | 0.649 | 0.242 | 0.716 | 58 | 0.728 | 0.195 | 0.789 | 68 | 0.570 | 0.276 | 0.703 |
| 68 | 0.646 | 0.251 | 0.753 | 95 | 0.723 | 0.193 | 0.783 | 38 | 0.569 | 0.302 | 0.729 |
| 31 | 0.645 | 0.265 | 0.753 | 68 | 0.723 | 0.196 | 0.779 | 50 | 0.562 | 0.279 | 0.692 |
| 65 | 0.644 | 0.258 | 0.754 | 29 | 0.722 | 0.187 | 0.711 | 31 | 0.561 | 0.300 | 0.685 |
| 50 | 0.639 | 0.252 | 0.733 | 15 | 0.717 | 0.197 | 0.751 | 65 | 0.559 | 0.291 | 0.686 |
| 29 | 0.634 | 0.259 | 0.705 | 50 | 0.716 | 0.194 | 0.773 | 29 | 0.545 | 0.289 | 0.647 |
Fig. 5 Performance of the top-10 algorithms on the tasks used in the challenge, namely the Dice coefficient (top row), Normalized Surface Dice (middle row), and Normalized Absolute Volume Error (bottom row) on the “seen” (a, c, e) and “unseen” (b, d, f) test datasets, respectively. Algorithms are ranked from left to right based on their performance, individually for each task.
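For readers unfamiliar with these metrics, two of them can be sketched in a few lines. The Dice coefficient below follows its standard definition, 2|P ∩ T| / (|P| + |T|). The volume error is written as |V_pred − V_true| / V_true, which is an assumption about the normalization, since the text shown here does not define it. Masks are taken as flat sequences of 0/1 voxel labels, and the function names are hypothetical.

```python
def dice_coefficient(pred, truth):
    """Dice overlap between two flat binary masks of equal length."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / total


def normalized_abs_volume_error(pred, truth):
    """Absolute lesion-volume difference relative to the reference volume.

    Assumption: normalization by the ground-truth volume; voxel counts
    stand in for physical volume (equal voxel size in both masks).
    """
    v_pred = sum(1 for p in pred if p)
    v_true = sum(1 for t in truth if t)
    if v_true == 0:
        raise ValueError("reference volume is zero; error is undefined")
    return abs(v_pred - v_true) / v_true


# Toy, made-up masks: 3 predicted voxels, 2 reference voxels, 2 shared.
toy_pred = [1, 1, 1, 0]
toy_truth = [1, 1, 0, 0]
toy_dice = dice_coefficient(toy_pred, toy_truth)
toy_volume_error = normalized_abs_volume_error(toy_pred, toy_truth)
```

Note that the two metrics capture different failure modes: a prediction with the correct total volume in the wrong place has zero volume error but a low Dice score. Normalized Surface Dice additionally requires boundary distances within a tolerance and is omitted from this sketch.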
Fig. 6 Example test case from the “seen” data source (Dataset 1). The performance of the top algorithms #53 and #38 is shown in green and blue, respectively. Ground truth annotations are shown in red (a: axial view, b: coronal view).
Fig. 7 Example test case from the “unseen” data source (a: axial view, b: coronal view). Top algorithms #53 and #38, shown in green and blue, respectively, both predict a false-positive lesion at the location of a normal lung vessel. At the same time, they missed the real lesion, shown in red (c: axial view, d: coronal view).
Fig. 8 Podium plots for “seen” (a) and “unseen” (b) test data. The participating algorithms are color-coded. Each colored dot shows the Dice coefficient achieved by the respective algorithm. The same test cases are connected by a line. The lower part of the charts displays the relative frequency with which a given algorithm achieves each podium place, i.e. each rank.