Literature DB >> 36156419

Rapid artificial intelligence solutions in a pandemic-The COVID-19-20 Lung CT Lesion Segmentation Challenge.

Holger R Roth¹, Ziyue Xu², Carlos Tor-Díez³, Ramon Sanchez Jacob⁴, Jonathan Zember⁴, Jose Molto⁴, Wenqi Li², Sheng Xu⁵, Baris Turkbey⁵, Evrim Turkbey⁵, Dong Yang², Ahmed Harouni², Nicola Rieke², Shishuai Hu⁶, Fabian Isensee⁷, Claire Tang⁸, Qinji Yu⁹, Jan Sölter¹⁰, Tong Zheng¹¹, Vitali Liauchuk¹², Ziqi Zhou¹³, Jan Hendrik Moltz¹⁴, Bruno Oliveira¹⁵, Yong Xia⁶, Klaus H Maier-Hein¹⁶, Qikai Li⁹, Andreas Husch¹⁷, Luyang Zhang¹¹, Vassili Kovalev¹², Li Kang¹³, Alessa Hering¹⁷, João L Vilaça¹⁸, Mona Flores², Daguang Xu², Bradford Wood⁵, Marius George Linguraru¹⁹.

Abstract

Artificial intelligence (AI) methods for the automatic detection and quantification of COVID-19 lesions in chest computed tomography (CT) might play an important role in the monitoring and management of the disease. We organized an international challenge and competition for the development and comparison of AI algorithms for this task, which we supported with public data and state-of-the-art benchmark methods. Board Certified Radiologists annotated 295 public images from two sources (A and B) for algorithms training (n=199, source A), validation (n=50, source A) and testing (n=23, source A; n=23, source B). There were 1,096 registered teams of which 225 and 98 completed the validation and testing phases, respectively. The challenge showed that AI models could be rapidly designed by diverse teams with the potential to measure disease or facilitate timely and patient-specific interventions. This paper provides an overview and the major outcomes of the COVID-19 Lung CT Lesion Segmentation Challenge - 2020.

Entities: Chemical

Keywords: COVID-19; Challenge; Medical image segmentation

Mesh：

Year: 2022 PMID： 36156419 PMCID： PMC9444848 DOI： 10.1016/j.media.2022.102605

Source DB: PubMed Journal: Med Image Anal ISSN： 1361-8415 Impact factor: 13.828

Introduction

The SARS-CoV-2 pandemic has had a devastating impact on the global healthcare systems. As of May 28, 2021, more than 169 million people have been infected in the world with over 3.5 million deaths (Hopkins, 2021). COVID-19 is known to affect nearly every organ system, including the lungs, brain, kidneys, liver, gastrointestinal tract, and cardiovascular system. The manifestations of the disease in the lung may be early indicators of future problems. These manifestations have been intensively reported in the adult populations and occasionally in pediatric subjects (Nino et al., 2021, Zang et al., 2021, Larici et al., 2020, Ojha et al., 2020, Wan et al., 2020). Since the early days of the pandemic, lung imaging has been critical for both the early identification and management of individuals affected by COVID-19 (Rubin et al., 2020). Imaging also provides invaluable support for the evaluation of patients with long COVID and after the acute sequelae of the diseases. Repeated waves of infection and changes in the disease course require data, including imaging, classification, quantification, and response tools, as well as standardized reliable interpretation as the global society struggles to provide widely available vaccines and faces evolving challenges such as new mutations of the virus. The most common lung imaging modalities utilized for the evaluation of SARS-CoV-2 infections are chest radiographs (CXR) and chest computerized tomography (CT) with ultrasound (US) being used more sparingly. Chest CT is the reference modality that most accurately demonstrates the acute lung manifestations of COVID-19 (Bao et al., 2020, Zhu et al., 2020). As observed in CT, the most common findings in the chest of the affected individuals were ground-glass opacities (GGO) and pneumonic consolidations. Other manifestations include interstitial abnormalities, crazy paving pattern, halo signs, pleural abnormalities, bronchiectasis, bronchovascular bundle thickening, air bronchograms, lymphadenopathy, and pleural/pericardial effusions. The sensitivity of chest CT to detect these abnormalities in subjects with confirmed COVID-19 was widely variable and somewhat subjective, reported in the range of 44%–97% (median 69%) (Merkus and Klein, 2020). Beside its role in the identification of patterns of SARS-CoV-2 infections, lung CT is also important in the determination of the severity of COVID-19 (Wan et al., 2020, Bao et al., 2020, Zhu et al., 2020, Cao et al., 2020, Bernheim et al., 2020). The presence, location and extension of the lung abnormalities are critical factors for the clinical management of patients to potentially facilitate decisions towards more timely and personalized medical interventions. Quantification of lesions may further provide the tracking of disease progression and response to therapeutic countermeasures. Thus, improving COVID-19 treatment starts with a clearer understanding of the patient’s disease state, which must include accurate identification, delineation and quantification of lung lesions and disease phenotypes and patterns. A prior lack of global data collaboration limited clinicians and scientists in their ability to quickly and effectively understand COVID-19 disease, its severity and outcomes. As access to data has improved, quality annotations have remained a limiting factor in the development of useful artificial intelligence (AI) models derived from machine learning and deep learning (LeCun et al., 2015). Thus, a multitude of AI approaches have been developed, published and indicated great potential for clinical support, but they were often overfit, being trained using proprietary data or from a single site (Kang et al., 2020, Wang et al., 2020b, Fan et al., 2020, Oulefki et al., 2021, Shan et al., 2021, Ippolito et al., 2021). Alternatively, federated approaches allow algorithms to access data from multiple sites without the need of sharing raw data, but through this paradigm access is granted to a single algorithm and consortium, with sharing of model weights instead of raw data (Yang et al., 2021, Dayan et al., 2021). In particular, deep neural networks were used for the identification and segmentation of abnormal lung regions affected by SARS-CoV-2 infection. These can be grouped into two main classes: classification models that extract the affected region inside the lung area by comparison with data from healthy subjects (Bai et al., 2020, Li et al., 2020, Mei et al., 2020, Wang et al., 2020a), and segmentation models that directly extract the abnormal lung areas according to patterns in the image and (typically using fully convolutional networks) (Fan et al., 2020, Shan et al., 2021, Huang et al., 2020, Zhang et al., 2020, Zhou et al., 2021). Without access to public data and an adequate platform to evaluate and compare their performance, AI approaches risk being overtrained, irreproducible, and ultimately clinically not useful. Thus, public efforts are needed to accelerate the understanding of the role of AI towards informing manifestations and qualifying impact of health crises such as the COVID-19 pandemic. The COVID-19 Lung CT Lesion Segmentation Challenge 2020 (COVID-19-20) created the public platform to evaluate emerging AI methods for the segmentation and quantification of lung lesions caused by SARS-CoV-2 infection from CT images. This effort required a multi-disciplinary team science partnership among global communities in a broad variety of often disparate fields, including radiology, computer science, data science and image processing. The goal was to rapidly team up to combine multi-disciplinary expertise towards the development of tools to simultaneously both define and address unmet clinical needs created by the pandemic. The COVID-19-20 platform provided access to multi-institutional, multinational images originating from patients of different ages and gender, and with variable disease severity. The challenge team provided the ability to quickly label a public dataset, allowing radiologists to rapidly add precise annotations. Open access was offered to the annotated CTs of subjects with PCR-confirmed COVID-19, and to a baseline deep learning pipeline based on MONAI (Project MONAI, 2021) that could serve as a starting point for further algorithmic improvements. The challenge was hosted on a widely used competition website () for easy and secure data access control. This paper presents an overview of this challenge and competition, including the data resources and the top ten AI algorithms identified from a highly competitive field of participants who tested the data in December 2020.

Submissions

The challenge was launched on November 2, 2020. The training and validation data were released and 1,096 teams registered before the training phase was closed on December 8, 2020. The 225 teams that completed the validation phase were given access to the test data. Ninety-eight teams from 29 countries on six continents completed the test phase. Fig. 1 shows the countries of origin of the 98 teams. Test results were released on December 18, 2020, and the statistical ranking of the top ten teams (see Section 3) was unveiled during a virtual mini symposium on January 11, 2021.1 Fig. 2 shows the demographic information for the team leaders, i.e., age group, sex, highest educational degree, student status and job category, and algorithmic characteristics for the 98 submissions that completed the training, validation and test phases. We requested participants to disclose whether they used external data for training their algorithms or if they used a general-purpose pre-trained network for initialization (e.g., a network pre-trained for another lung disease). The use of public networks pre-trained for the segmentation of COVID-19 lesions was not allowed (e.g., Clara_train_covid19_ct_lesion_seg2 ).

Fig. 1

The countries of origin of the 98 teams that completed the training, validation and test phases of the challenge.

Fig. 2

Demographic information of the leaders of the 98 teams that completed the training, validation and test phases of the challenge. The top row shows the age group (left), student status (middle) and sex (right) of the participant. The middle row shows the highest degree (left) and job category (right). Bottom row shows the algorithm characteristics for the 98 submissions that completed the training, validation and test phases of the challenge. We report if algorithms were fully-automated (left), used external data for training (middle) or used a general pre-trained network for initialization (right).

Participants uploaded the results on the validation and test data to the hosting website for evaluation. Only (semi-)automated methods were allowed. Submission of manual annotations was prohibited. For validation, the number of submissions from each user was limited to once-a-day for the purpose of refining their algorithms based on the live performance indicators on the challenge validation leaderboard.3 Submission of results on the test data was collected without showing the leaderboard and the last submission was used for final ranking. The test phase was open only to participants who had already submitted their results on the validation set. The leaderboard and final ranking are public and hosted on the challenge website.4 The countries of origin of the 98 teams that completed the training, validation and test phases of the challenge. Demographic information of the leaders of the 98 teams that completed the training, validation and test phases of the challenge. The top row shows the age group (left), student status (middle) and sex (right) of the participant. The middle row shows the highest degree (left) and job category (right). Bottom row shows the algorithm characteristics for the 98 submissions that completed the training, validation and test phases of the challenge. We report if algorithms were fully-automated (left), used external data for training (middle) or used a general pre-trained network for initialization (right).

Materials & results

Data sources

This challenge utilized data from two public resources on chest CT images, namely the “CT Images in COVID-19”5 (Harmon et al., 2020) (Dataset 1) and “COVID-19-AR”6 (Desai et al., 2020) (Dataset 2) available on The Cancer Imaging Archive (TCIA) (Clark et al., 2013). CT images were acquired without intravenous contrast enhancement from patients with positive Reverse Transcription Polymerase Chain Reaction (RT-PCR) for SARS-CoV-2. Dataset 1 originated from China, while dataset 2 was acquired from the US population. In total, we used 295 images, including 272 images from Dataset 1 and 23 images from Dataset 2. Of these images, 199 and 50 from Dataset 1 were used for training and validation, respectively. We therefore refer to Dataset 1 as the “seen” data source that participants used to train and validate their algorithms during the first phase of the challenge. The test set contained 23 images each from Datasets 1 and 2 (46 images in total). Hence, Dataset 2 was only used in the testing phase, and we refer to it as the “unseen” data source. Descriptive statistics, such as x-, y-, and z-spacings and voxel volume in both data sources are shown in Fig. 3. We also show the differences in COVID-19 lesion volumes annotated between the two data sources.

Fig. 3

Data variability between “seen” and “unseen” sources; (a) Illustration of the differences in the voxel spacing and voxel volume grouped by training, validation, and test sets. (b) Differences in COVID-19 lesion volumes across the image data sources. (c) Normalized histograms showing the CT intensity distributions of the “seen” and “unseen” data sources in Hounsfield units (HU). Note, −1000 HU corresponds to air, and 750 to cancellous bone (Patrick et al., 2017).

Annotation protocol

All images were automatically segmented by a previously trained model for COVID lesion segmentation (Yang et al., 2021) that is publicly available.7 These segmentations were subsequently used as a starting point for board certified radiologists (RS, JZ, JM) who equally divided the dataset between them and manually adjudicated and corrected the segmentations. Therefore, the annotation for each case was confirmed by one radiologist. The annotation tool used was ITKSnap8 (Yushkevich et al., 2006) showing multiple reformatted views of the CT scans, and allowing manipulations and corrections of the initial automated segmentation results in three dimensions. All abnormal lung lesions related to COVID-19 were included. Lesions comprised of consolidation, GGO (focal or diffuse), septal thickening, crazy paving, and bronchiectasis, consistent with Bao et al. (2020). Note, that diffuse GGOs were still readily identifiable by the radiologists and defined as abnormal.

Evaluation metrics

We used the three evaluation metrics described below. These metrics were both used to evaluate the performance of different algorithms, and to establish the interobserver variability. Dice Coefficient (Dice). A common evaluation metric of segmentation accuracy defined as the overlap between the volume of the ground truth segmentation and the predicted segmentation volume : . Normalized Surface Dice (NSD). Similarly to Dice, it provides the normalized measure of agreement between the surface of the prediction and the surface of the ground (Andrearczyk et al., 2021). We chose a threshold of 1 mm to define an “acceptable” derivation between the ground truth surface and the predicted surface. Normalized Absolute Volume Error (NAVE). The volume of COVID-19 lesion burden inside the patient’s lung can play an important role for clinical assessment (Sun et al., 2020). Therefore, a measure was chosen that assesses the agreement between the predicted and ground truth lesion volumes, defined as . Note, we used the negative of this value for ranking purposes as higher values indicate better performance in our ranking approach.

Interobserver performance

As a benchmark for comparing the AI algorithms with human performance on the lesion segmentation task, we measured the human interobserver agreement. We compared the annotations utilized in Yang et al. (2021) from 245 of the 272 cases from Dataset 1 used in the challenge with the ones obtained by our radiologists. The interobserver agreement showed mean ±standard deviation (median) of Dice, NSD, and NAVE of 0.702 ±0.172 (0.756), 0.538 ±0.147 (0.563), and 0.601 ±1.969 (0.180), respectively.

Statistical ranking method

Recent work on ranking analysis for biomedical imaging challenges has shown that ranking results can vary significantly depending on the chosen type of metric and ranking scheme (Maier-Hein et al., 2018). Most biomedical challenges use approaches such as “aggregate-then-rank” or “rank-then-aggregate”, which do not account for statistical differences between algorithms (Maier-Hein et al., 2018, Wiesenfarth et al., 2021). These findings motivated the development of a challenge ranking toolkit9 (Wiesenfarth et al., 2021) that we employed for our evaluation. This toolkit utilizes statistical hypothesis testing applied to each possible pair of algorithms. This allows us to better assess the differences between the evaluated metrics. Following the notation of Wiesenfarth et al. (2021), our challenge contained tasks (Dice, NSD, NAVE on each “seen” and “unseen” test data). The test cases for each task are denoted as . In our case, for each task. A bootstrap approach is used to evaluate the ranking stability of the different algorithms. This means that ranking is performed repeatable on bootstrap samples, see Fig. 4. The statistical test employed to determine the consensus ranking is the one-sided Wilcoxon signed rank test with a significance level of , adjusted for multiple testing according to Holm (Wiesenfarth et al., 2021).

Fig. 4

Blob plot visualization of the ranking variability via bootstrapping. An algorithm’s ranking stability is shown across the different tasks, illustrating the ranking uncertainty of the algorithm in each task. For more details see Wiesenfarth et al. (2021).

Each of the m tasks contributed equality to the final consensus using the Euclidean distance between averaged ranks across tasks. We ranked the 98 submitted algorithms using the proposed statistical consensus ranking algorithm to determine the top-10 methods, including the challenge winning algorithm. Blob plot visualization of the ranking variability via bootstrapping. An algorithm’s ranking stability is shown across the different tasks, illustrating the ranking uncertainty of the algorithm in each task. For more details see Wiesenfarth et al. (2021).

Summary of top-10 algorithms

We show the final ranking of the top-10 performing algorithms in Table 1. All top-10 algorithms were fully-automated methods, and all were based on some variation of the U-Net (Ronneberger et al., 2015, Çiçek et al., 2016), a fully convolutional network (Long et al., 2015) for image segmentation based on the popular encoder–decoder design with skip connections (Long et al., 2015, Drozdzal et al., 2016). U-Net has dominated the field of biomedical image segmentation in recent years (Haque and Neubert, 2020) and most challenge participants opted to use one of its implementations. In particular the nnU-Net open-source framework10 (Isensee et al., 2021), which has shown success in multiple biomedical image segmentation challenges, was a popular choice for challenge participants. The U-Net architectures included 2D, 3D, high- and low-resolution configurations. One team used the open-source platform MONAI11 (#68). The majority of algorithms used challenge data only with one method including additional unlabeled data from the public TCIA source (#53), which was done with pseudo labels in a semi-supervised approach. The majority directly targeted the segmentation of COVID-19 lesions, while one participant (#31) targeted multiple outputs, including body and lung masks.

Table 1

Top-10 finalists after statistical ranking. “Value” represents the average rank the algorithm achieved across all tasks. We also show if methods were automated, used external data for training, the input data dimensions used in the algorithms, and the network architecture.

Rank	Value	ID #	Fully automated	Extra data	Pre-trained	Ensemble	Data dimension	Network architecture	Authors	Country
1	2.6	53	✓	✓	✗	✗	3D	nnU-Net	S. Hu et al.	China
2	6.0	38	✓	✗	✗	✓	3D	nnU-Net	F. Isensee et al.	Germany
3	7.7	65	✓	✗	✗	✓	2D/3D	nnU-Net	C. Tang	USA
4	8.4	58	✓	✗	✗	✓	3D	nnU-Net	Q. Yu et al.	China
5	8.5	31	✓	✗	✗	✓	3D	nnU-Net	J. Sölter et al.	Luxembourg
6	9.2	50	✓	✗	✗	✓	2D/3D	nnU-Net	T. Zheng & L. Zhang	Japan
6	9.2	68	✓	✗	✓	✗	2D/3D	VGG16 Hybrid, MONAI	V. Liauchuk et al.	Belarus
8	9.4	95	✓	✗	✗	✓	3D	nnU-Net	Z. Zhou et al.	China
9	10.6	29	✓	✗	✗	✗	3D	nnU-Net	J. Moltz et al.	Germany
10	11.3	15	✓	✗	✗	✗	3D	U-Net	B. Oliveira et al.	Portugal

A popular loss function for biomedical image segmentation is the Dice loss (Milletari et al., 2016). In this challenge, most finalists utilized it together with additional cross entropy, top-k (Lyu et al., 2020), and focal loss (Lin et al., 2017). An important strategy for winning image segmentation is model ensembling, the fusion of predictions from several independently trained models. Here, most methods used 5-fold cross validation and model ensemble to arrive at a consensus prediction. A full description of the top-10 finalists’ algorithms by their authors is given in the Supplementary Material. Top-10 finalists after statistical ranking. “Value” represents the average rank the algorithm achieved across all tasks. We also show if methods were automated, used external data for training, the input data dimensions used in the algorithms, and the network architecture.

Ranking results

Table 2 shows the mean and standard deviation of the Dice coefficients for the top-10 performing algorithms on test cases from the “seen” and “unseen” data sources. Top algorithms performed relatively similar to each other, but all showed a marked decrease when being evaluated on the “unseen” data (Table 2).

Table 2

Dice coefficients of the top-10 algorithms on (left) all test data, (middle) “seen” data (Dataset 1), and (right) “unseen” test data (Dataset 2).

All test cases:				“Seen” test cases:				“Unseen” test cases:
ID #	mean	std	median	ID #	mean	std	median	ID #	mean	std	median
53	0.666	0.236	0.754	38	0.740	0.195	0.797	53	0.598	0.264	0.700
58	0.658	0.242	0.741	53	0.734	0.182	0.782	95	0.593	0.258	0.677
95	0.658	0.237	0.729	31	0.729	0.190	0.769	58	0.588	0.263	0.724
38	0.654	0.268	0.763	65	0.729	0.186	0.778	15	0.581	0.264	0.670
15	0.649	0.242	0.716	58	0.728	0.195	0.789	68	0.570	0.276	0.703
68	0.646	0.251	0.753	95	0.723	0.193	0.783	38	0.569	0.302	0.729
31	0.645	0.265	0.753	68	0.723	0.196	0.779	50	0.562	0.279	0.692
65	0.644	0.258	0.754	29	0.722	0.187	0.711	31	0.561	0.300	0.685
50	0.639	0.252	0.733	15	0.717	0.197	0.751	65	0.559	0.291	0.686
29	0.634	0.259	0.705	50	0.716	0.194	0.773	29	0.545	0.289	0.647

Fig. 5 shows boxplots of the top-10 performing algorithms for each of the tasks. In general, methods present more outliers on the “unseen” test dataset. Fig. 6 shows a typical example from the “seen” test data source. The 1st (#53) and 2nd (#38) ranked algorithms achieved a mean Dice coefficient >0.734 Dice on the “seen” dataset. Fig. 6 shows that most of the COVID-19 related lesions were well segmented by the automated algorithms. In contrast, Fig. 7 shows a challenging case from the “unseen” test data source. Both top-performing algorithms (#53 and #38) generated a false-positive segmentation region at a normal lung vessel while missing the real lesion. Their performance dropped to a Dice coefficient <0.598 on the “unseen” dataset. To illustrate the general performance of the top-10 algorithms on the individual test cases, Fig. 8 shows podium plots (Eugster et al., 2008) with the performance of different algorithms on the same test case connected by a line. Due to the limited test set size for task (n=23), there was mostly no statistically significant difference found between the top-10 algorithms across the six tasks apart for #50 being significantly better than #29 on task “NSD (unseen)” at the 5% significance level.

Fig. 5

Top-10 algorithms performance measured for the tasks used in the challenge, namely the Dice coefficient (top row), Normalized Surface Dice (middle row), and Normalized Absolute Volume Error (bottom row) on the “seen” (a, c, e) and “unseen” test datasets (b, d, f), respectively. Algorithms are ranked based on their performance from left to right individually for each task.

Fig. 6

Example test case from the “seen” data data source (Dataset 1). The performance of the top algorithms #53 and #38 is shown in green and blue, respectively. Ground truth annotations are shown in red (a: axial view, b: coronal view).

Fig. 7

Example test case from the “unseen” data source. (a: axial view, b: coronal view) Top algorithms #53 and #38, shown in green and blue, respectively, both predict a false-positive lesion at the locations of a normal lung vessel. At the same time they missed the real lesion in red (c: axial view, d: coronal view).

Fig. 8

Podium plots for “seen” (a) and “unseen” (b) test data. The participating algorithms are color-coded. Each colored dot shows the Dice coefficient achieved by the respective algorithm. The same test cases are connected by a line. The lower part of the charts displays the relative frequency for a given algorithm to achieve a podium place, i.e. rank achieved by a given algorithm.

Dice coefficients of the top-10 algorithms on (left) all test data, (middle) “seen” data (Dataset 1), and (right) “unseen” test data (Dataset 2). Top-10 algorithms performance measured for the tasks used in the challenge, namely the Dice coefficient (top row), Normalized Surface Dice (middle row), and Normalized Absolute Volume Error (bottom row) on the “seen” (a, c, e) and “unseen” test datasets (b, d, f), respectively. Algorithms are ranked based on their performance from left to right individually for each task. Example test case from the “seen” data data source (Dataset 1). The performance of the top algorithms #53 and #38 is shown in green and blue, respectively. Ground truth annotations are shown in red (a: axial view, b: coronal view). Example test case from the “unseen” data source. (a: axial view, b: coronal view) Top algorithms #53 and #38, shown in green and blue, respectively, both predict a false-positive lesion at the locations of a normal lung vessel. At the same time they missed the real lesion in red (c: axial view, d: coronal view). Podium plots for “seen” (a) and “unseen” (b) test data. The participating algorithms are color-coded. Each colored dot shows the Dice coefficient achieved by the respective algorithm. The same test cases are connected by a line. The lower part of the charts displays the relative frequency for a given algorithm to achieve a podium place, i.e. rank achieved by a given algorithm.

Discussion

Performance of algorithms

Automatic AI algorithms showed great potential to accurately segment the lung COVID-19 lesions from CT images. In the validation phase, 87 out of 225 methods achieved superior Dice coefficients than the interobserver criteria (0.702), with the top team achieving a Dice coefficient of 0.771 (9.8% improvement). However, their level of robustness is inferior to the radiologist’s performance: the top team gets a Dice coefficient of 0.666 on the test data12 (5.1% decrease). This discrepancy could be due to various reasons. One reason could be the domain shift as half of the test data is from an “unseen” source that has not been used in the training or validation phases. Another reason could be the limited number of allowed submissions for the testing phase, which mitigates the possibility for overfitting to the test data. Moreover, the limited number of training data could also affect algorithm performance. The evaluation of the analysis of top-10 algorithms revealed that the ensemble of segmentation from various individual automated methods plays an important role compared to other factors such as the complexity of the network architecture, the learning rate, losses, etc. Most 10 top teams used model ensembles to reduce outliers and improved their performance by collecting the consensus segmentation from separately trained models. This observation also shows that the training pipeline can potentially be further improved based on novel concepts like AutoML (Yang et al., 2019, Zoph et al., 2018) or neural architectures search (Yu et al., 2020, Zhu et al., 2019, Liu et al., 2018) algorithms.

Use of external training data

Only one of the top 10 teams, which was the winning team of the challenge, used external data in their final solution. Using this semi-supervised training approach, they obtained an improvement of 4.27% and 0.86% Dice coefficient on the training and validation data, respectively. Another team did similar work in a student-teacher manner and saw improvement in the validation score. However, they submitted their final results without using the external data after noticing partial overlap between the chosen unlabelled external dataset and the provided training data. Both teams demonstrate that using external data, even unlabelled, could improve the segmentation performance. While this finding clearly calls for larger training datasets, it also shows the great potential of semi-supervised methods to achieve more robust solutions, especially for the healthcare domain where the annotation cost is much higher than in other fields (Tajbakhsh et al., 2020).

U-net dominance

All top-10 teams used a 2D/3D U-Net variant with at most minor modifications. While this seems to conflict with hundreds of yearly publications creating new network architectures, it also shows that most existing deep learning algorithms lack the robustness offered by model ensembles to handle large data variations (e.g., voxel spacing, contrast, etc.) when training data are limited. nnU-Net (Isensee et al., 2021) was adopted by 5 out of the 10 teams to build an end-to-end solution while another team used MONAI.13 Unsurprisingly, these findings show that the majority of participants employed well-validated, open-source resources.

Data variability and generalizability gap

The challenge was designed to use “seen” and “unseen” data sources and thus evaluate the generalizability of AI algorithms in front of variable clinical protocols. Our data sources varied in provenience (China and US), scanner manufacturers (various, as typical in routine clinical practice) and imaging protocols (voxel spacing). Fig. 3 illustrates that the volumes of the annotated COVID lesions have similar distributions on the two data sources. However, there are substantial differences in the voxel spacing used for CT reconstruction in the data (“seen” cases have a 5 mm z-spacing, while the “unseen” cases were in the 3 mm range). To overcome these differences, most participants used common data normalization strategies, such as resampling all data to a constant voxel spacing (Isensee et al., 2021). Still, these differences in voxel spacing, together with variability in scanner manufacturers and imaging protocols, were likely the main contributors to the generalization gap seen in the performance of algorithms on the “unseen” test cases. Additional factors were related to the variability of manifestations of the disease in the lungs. For example, in the challenging case from the “unseen” test data source shown in Fig. 7, the top-performing algorithms generated false-positive predictions at a normal lung vessel while missing to segment the real lesion. Domain shifts like the ones observed in the data used in this challenge are still proving to be challenging for current AI models – an observation in line with other recent works discussing the shortcomings of AI models for the diagnosis of COVID-19 (Roberts et al., 2021, Wynants et al., 2020). Disease phase variability may also have broadened the features of what defines a standard or expected set of features. Early disease may not look like later disease cycles on CT, which may have also increased model noise.

Potential for clinical use

Segmentation and classification models have been postulated to impact diagnosis in outbreak settings with delayed or unavailable PCR, however the point of care classification of COVID-19 versus other pneumonia such as influenzae, could prove of some value during flu season in specific outbreak settings as an epidemiologic tool or as a red flag for patient isolation at the scanner, by early identification, thus expediting or prioritizing interpretation using more conventional radiologist review and verification. AI models have also been proposed to assist in triage or selection of resource-limited therapeutics or critical care, prognostication or prediction of outcomes, or as one data element of a multi-modal model combining clinical, laboratory and imaging data. Standardized response criteria for clinical trials can provide a level “playing field”, thus uniformly defining effects of medical and other countermeasures, or specific scenarios for patient-specific therapies. Specific phenotypes may respond to certain therapies, for example. Imaging AI could thus play a role in determining the optimal disease phase for steroid administration or monoclonal antibodies, or even characterize the presence of different disease manifestations according to variant or underlying comorbidity, although many of these clinical or research utilities are quite speculative. AI models in COVID-19 have been justifiably criticized for a lack of generalize-ability, lack of clinical testing and validation, impracticality of model design, “me-too” models and studies, and easy replaceability of functionality with standard clinical tools. Potential clinical impact has yet to match the excitement from the data science and computational community nor realize the promise at the outset of the pandemic. Federated learning and open-source tools and modeling may help address this, especially for specific research questions for clinical trials or radiologist-sparse settings.

Limitations

The challenge organizers aimed to create a fair and robust evaluation platform for (semi-)automatic AI algorithms. This was a timely effort completed with limited resources, thus several factors could potentially be improved in retrospect. For example, 295 annotated CT images from two different data sources were used in the challenge, which may be suboptimal data quantity for training deep learning algorithms, as performance metrics improve with size of datasets. However, the challenge set a benchmark for the development and evaluation of AI methods to segment lung lesions in COVID-19, the first of its kind to our knowledge, which was reflected by the large number of participants. It is advisable to add more data in future challenges, even if the data are non-annotated as the results of this challenge indicated. Another limitation may be the data annotation. Each case was annotated by one radiologist who rectified the prediction from a publicly available COVID lesion segmentation AI model.14 Although these initial predictions may be considered as a suggestion from an expert, which is a typical workflow for many AI data annotation solutions, a second verification from another human expert would likely further improve the annotation quality. Finally, the statistical consensus ranking algorithm over multiple tasks, although it overcomes the limitations of ranking based on single evaluation metrics, is computed only at the image level. The ranking does not provide a measurement of the algorithm on the lesion level, thus without consideration of each lesion’s clinical relevance. Such information, which was nor available in our data, could be important for clinical diagnosis and tracking of disease progress. It could also provide a more granular interpretation of the strengths and weaknesses of each algorithm, and a guidance on how to improve them.

Conclusions

The COVID-19 Lung CT Lesion Segmentation Challenge - 2020 provided the platform to develop and evaluate AI algorithms for the detection and quantification of lung lesions from CT images. AI models help in the visualization and measurement of COVID specific lesions in the lungs of infected patients, potentially facilitating more timely and patient-specific medical interventions. Over one thousand teams registered to participate in the challenge participating in this challenge reflecting the engagement of the global scientific community to combat COVID-19. The AI models could be rapidly trained and showed good performance that was comparable to expert clinicians. However, robustness to “unseen” data decreased in the testing phase, indicating that larger and more diverse data may be beneficial for training. A more granular interpretation of the strengths and weaknesses of each algorithm might highlight pathways on the road towards a future where AI and deep learning might help standardize, quantify, assess disease response, select patients or therapies, or predict outcomes. But first steps first, as the scientific community builds multi-disciplinary teams to develop new tools and methodology to walk before we run. As more AI applications are being introduced in the biomedical space, it is essential to adequately validate and compare the functionality of these applications through challenges as proposed in this paper.

Code availability

The baseline deep learning pipeline based on MONAI is available at https://github.com/Project-MONAI/tutorials/tree/master/3d_segmentation/challenge_baseline. The model and software to automatically segment COVID lesions in chest CT is available at https://ngc.nvidia.com/catalog/models/nvidia:clara_train_covid19_ct_lesion_seg. The software used by radiologist to correct the automatically generated lesion annotations was ITK-Snap and is available at http://www.itksnap.org/pmwiki/pmwiki.php. Evaluation scripts used for ranking the algorithms will be made available on at https://covid-segmentation.grand-challenge.org. The ranking software used is available at https://github.com/wiesenfa/challengeR.

CRediT authorship contribution statement

Holger R. Roth: Conceptualized, Co-organized the challenge, Drafted, Performed the statistical analysis of the algorithms testing results. Ziyue Xu: Conceptualized, Co-organized the challenge, Drafted. Carlos Tor Diez: Conceptualized, Co-organized the challenge, Drafted. Ramon Sanchez Jacob: Provided expert annotations of the images to be used during training, Validation, Testing phases of the challenge. Jonathan Zember: Provided expert annotations of the images to be used during training, Validation, Testing phases of the challenge. Wenqi Li: Conceptualized, Co-organized the challenge, Drafted, Developed the baseline deep learning pipeline. Sheng Xu: Provided the initial data and expert annotations for analyzing the interobserver performance. Baris Turkbey: Provided the initial data and expert annotations for analyzing the interobserver performance. Evrim Turkbey: Provided the initial data and expert annotations for analyzing the interobserver performance. Dong Yang: Provided the automatically generated segmentations. Ahmed Harouni: Provided the automatically generated segmentations. Nicola Rieke: Conceptualized, Co-organized the challenge, Drafted. Shishuai Hu: Participated in the challenge, achieved a top-10 rank, Provided the results analyzed in this article. Fabian Isensee: Participated in the challenge, achieved a top-10 rank, Provided the results analyzed in this article. Claire Tang: Participated in the challenge, achieved a top-10 rank, Provided the results analyzed in this article. Qinji Yu: Participated in the challenge, achieved a top-10 rank, Provided the results analyzed in this article. Jan Sölter: Participated in the challenge, achieved a top-10 rank, Provided the results analyzed in this article. Tong Zheng: Participated in the challenge, achieved a top-10 rank, Provided the results analyzed in this article. Vitali Liauchuk: Participated in the challenge, achieved a top-10 rank, Provided the results analyzed in this article. Ziqi Zhou: Participated in the challenge, achieved a top-10 rank, Provided the results analyzed in this article. Jan Hendrik Moltz: Provided expert annotations of the images to be used during training, Validation, Testing phases of the challenge. Bruno Oliveira: Participated in the challenge, achieved a top-10 rank, Provided the results analyzed in this article. Yong Xia: Participated in the challenge, achieved a top-10 rank, Provided the results analyzed in this article. Klaus H. Maier-Hein: Participated in the challenge, achieved a top-10 rank, Provided the results analyzed in this article. Qikai Li: Participated in the challenge, achieved a top-10 rank, Provided the results analyzed in this article. Andreas Husch: Participated in the challenge, achieved a top-10 rank, Provided the results analyzed in this article. Luyang Zhang: Participated in the challenge, achieved a top-10 rank, Provided the results analyzed in this article. Vassili Kovalev: Participated in the challenge, achieved a top-10 rank, Provided the results analyzed in this article. Li Kang: Participated in the challenge, achieved a top-10 rank, Provided the results analyzed in this article. Alessa Hering: Participated in the challenge, achieved a top-10 rank, Provided the results analyzed in this article. João L. Vilaça: Participated in the challenge, achieved a top-10 rank, Provided the results analyzed in this article. Mona Flores: Participated in the challenge, achieved a top-10 rank, Provided the results analyzed in this article. Daguang Xu: Conceptualized, Co-organized the challenge, Drafted. Bradford Wood: Provided the initial data and expert annotations for analyzing the interobserver performance. Marius George Linguraru: Conceptualized, Co-organized the challenge, Drafted.

2 in total

1. SR-CycleGAN: super-resolution of clinical CT to micro-CT level with multi-modality super-resolution loss.

Authors: Tong Zheng; Hirohisa Oda; Yuichiro Hayashi; Takayasu Moriya; Shota Nakamura; Masaki Mori; Hirotsugu Takabatake; Hiroshi Natori; Masahiro Oda; Kensaku Mori
Journal: J Med Imaging (Bellingham) Date: 2022-04-05

2. Survey of Supervised Learning for Medical Image Processing.

Authors: Abeer Aljuaid; Mohd Anwar
Journal: SN Comput Sci Date: 2022-05-17

2 in total