
Active deep learning from a noisy teacher for semi-supervised 3D image segmentation: Application to COVID-19 pneumonia infection in CT.

Mohammad Arafat Hussain1, Zahra Mirikharaji2, Mohammad Momeny3, Mahmoud Marhamati4, Ali Asghar Neshat5, Rafeef Garbi6, Ghassan Hamarneh7.   

Abstract

Supervised deep learning has become a standard approach to solving medical image segmentation tasks. However, serious difficulties in attaining pixel-level annotations for sufficiently large volumetric datasets in real-life applications have highlighted the critical need for alternative approaches, such as semi-supervised learning, where model training can leverage small expert-annotated datasets to enable learning from much larger datasets without laborious annotation. Most of the semi-supervised approaches combine expert annotations and machine-generated annotations with equal weights within deep model training, despite the latter annotations being relatively unreliable and likely to affect model optimization negatively. To overcome this, we propose an active learning approach that uses an example re-weighting strategy, where machine-annotated samples are weighted (i) based on the similarity of their gradient directions of descent to those of expert-annotated data, and (ii) based on the gradient magnitude of the last layer of the deep model. Specifically, we present an active learning strategy with a query function that enables the selection of reliable and more informative samples from machine-annotated batch data generated by a noisy teacher. When validated on clinical COVID-19 CT benchmark data, our method improved the performance of pneumonia infection segmentation compared to the state of the art.
Copyright © 2022 Elsevier Ltd. All rights reserved.


Keywords:  Active learning; COVID-19; Deep learning; Noisy teacher; Pneumonia; Segmentation; Semi-supervised learning

Year:  2022        PMID: 36257092      PMCID: PMC9540707          DOI: 10.1016/j.compmedimag.2022.102127

Source DB:  PubMed          Journal:  Comput Med Imaging Graph        ISSN: 0895-6111            Impact factor:   7.422


Introduction

Supervised learning using deep neural networks has been extensively used for volumetric medical image segmentation in recent years. However, adequate training of deep segmentation models for volumetric medical data, e.g., computed tomography (CT), requires prohibitive amounts of annotated data at the pixel/voxel level that are often very difficult to obtain (Fan et al., 2020) in practical clinical settings. For example, segmentation of pneumonia infection regions in CT can be beneficial as a first step within detection and analysis methods based on convolutional neural networks (CNN) for coronavirus disease 2019 (COVID-19) (Zhou et al., 2020, Gao et al., 2020a, Amyar et al., 2020, Harmon et al., 2020, H.t. et al., 2020, Paluru et al., 2021, Tilborghs et al., 2020, Ghomi et al., 2020, Voulodimos et al., 2021a, Hasan et al., 2021, Voulodimos et al., 2021b, Ranjbarzadeh et al., 2021, Elharrouss et al., 2020, Singh et al., 2021, Yan et al., 2021). The health condition of many hospitalized COVID-19 patients often deteriorates to the point of requiring mechanical ventilation or high-flow oxygenation (Lassau et al., 2021). Therefore, identifying patients with COVID-19 at increased risk of developing severe disease can help healthcare professionals plan ahead and make justified decisions on the allocation of resources in the intensive care unit (ICU) (Phua et al., 2020). The severity of COVID-19 is positively correlated with the size and spread of lung infections of various forms (e.g., ground-glass opacity (GGO), consolidation, interstitial thickening, air bronchograms, pleural effusion, and fibrous strips) (Xiong et al., 2020). The size and spread of lung infection can be estimated from chest CT scans by segmenting the infection (Wang et al., 2020). 
However, using supervised learning for COVID-19 infection segmentation is challenged by the scarcity of pixel-level expert-annotated data, hindering the development of models that can reliably perform as well in the wild (i.e., generalize to data from various other sources/scanners) as on the limited data used to train them (Ma et al., 2020a). To address the complexity of attaining sufficient pixel-level annotated data for supervised learning, various types of weakly supervised learning approaches have been proposed that can be grouped based on the level and/or quality of supervision used (Zhou, 2018): (a) incomplete supervision (Zhang et al., 2019), where only a subset of training data comes paired with labels; (b) inexact supervision (Xu and Lee, 2020), where the training data includes only coarse-grained labels; (c) inaccurate supervision (Zhang et al., 2019), where the training data includes labels that are not necessarily correct (i.e., noisy). For example, to bypass the lack of sufficient expert-annotated COVID-19 CT data at pixel-level, Laradji et al. (2021) proposed an infection segmentation approach with ‘inexact supervision,’ where weak annotations, in the form of points, clicked within infection areas, were instead used. Nonetheless, even such weak supervision remains challenging and time-consuming, as annotators must go through all image slices of volumetric medical data for manual labeling. In addition, intra- and inter-annotator reproducibility are often poor, which causes variability in the manually annotated segmentation masks. Alternatively, in semi-supervised learning, e.g., Cheplygina et al., 2019, Abdel-Basset et al., 2020, Yang et al., 2021 and Li et al. (2021), limited pixel-level expert-annotated data are combined with a large pool of machine-annotated data for training deep segmentation models. 
Furthermore, most semi-supervised learning approaches were designed for 2D data only, for example Fan et al., 2020, Ma et al., 2020a, Laradji et al., 2021, Abdel-Basset et al., 2020, Yang et al., 2021, Taghanaki et al., 2019a, Souly et al., 2017, Lee et al., 2019, Hong et al., 2015 and Wang et al. (2020). However, native 3D segmentation is preferable to reconstructing 3D segmentations from stacked slice-based 2D segmentation masks, as native 3D analysis captures the true 3D spatial context of the underlying structures within the imaged field of view (Çiçek et al., 2016). In addition, stacked slice-based 2D segmentation approaches assume isotropic voxel spacing (i.e., that the slice thickness, the distance between two axial slices, matches the in-plane resolution), which is often not the case. Machine-generated annotations (i.e., pseudo-labels) are generally less reliable and typically require further correction by experts (Marzahl et al., 2020). Thus, machine-generated annotations left uncorrected by experts are likely to lead to incorrect predictions being reinforced during network optimization, which in turn leads to worse task performance at test time. To address this problem, a semi-supervised active learning (SSAL) strategy is sometimes used, which generally follows a pipeline of (i) a query function for selecting “informative” samples from the annotation-free data pools, (ii) forwarding those samples to oracle annotators to generate ground-truth annotations, and subsequently (iii) adding the newly annotated data to the training data pool (Zhao et al., 2021, Gao et al., 2020b, Calma et al., 2018, Lv et al., 2022, Bull et al., 2018). However, such oracle annotation systems share limitations similar to those of expert supervision in medical imaging applications, namely the time and labor demands placed upon expert radiologists, who are rarely available for such manual dense annotation tasks, as well as poor intra- and inter-annotator reproducibility. 
To overcome the challenge of producing time- and labor-intensive expert annotations for informative 3D volumetric medical images drawn from annotation-free data pools in the active learning framework, we propose a sample re-weighting-based (Xu et al., 2021) semi-supervised learning approach (i.e., one that emphasizes and picks only informative data during training), namely ‘SSAL from a noisy teacher,’ and showcase its utility for pneumonia infection segmentation in clinical CT scans of COVID-19 patients. The proposed method uses alternatives to the conventional human oracle-based active learning steps discussed in the previous paragraph. The proposed method consists of several steps. First, it generates voxel-level annotations (pseudo-annotations) using supervised deep learning. Second, since machine-generated annotations are less reliable, we generate gradient-based relative sample weights that reflect the “trustworthiness” of the samples during training. These sample weights are estimated from the similarity of the gradient directions between the annotation-free batch data and the expert-annotated validation data. Third, because weighting samples based on gradient similarity alone may lead to underestimation of a more diversely informative data sample, we adopt a gradient magnitude-based strategy (Ash et al., 2019) to generate another set of sample weights that reflect the “informativeness” of the samples during training. Fourth, we generate an overall sample weight by combining the “trustworthiness” and “informativeness” sample weights. Finally, we use a query mechanism to choose more informative and trustworthy samples in a batch of annotation-free data by rectifying the combined weight per sample, and subsequently use these combined sample weights during model optimization. 
Our sample re-weighting-based adaptive data sampling strategy can be viewed as a pool-based SSAL strategy, in which a few annotation-free training samples are adaptively chosen in each training cycle based on preset criteria and presented for annotation to an oracle annotator, i.e., an expert teacher, albeit a noisy teacher in our case. The contributions of this paper are summarized as follows. We propose an active-learning-from-a-noisy-teacher approach that uses an example re-weighting strategy to exploit data without expert annotations in deep model training. Our re-weighting strategy uses ‘gradient similarity’ and ‘gradient magnitude’ to determine sample weights that reflect the ‘trustworthiness’ and ‘informativeness,’ respectively, of machine-annotated data. Our active learning strategy uses a query function that enables the selection of reliable and more informative samples from machine-annotated batch data. We show the effectiveness of our proposed approach on clinical COVID-19 benchmark CT data.

Methodology

Method overview

We design the working pipeline of the proposed method using the following steps. Initially, we generate voxel-level annotations (pseudo-annotations) using supervised deep models, while accounting for the fact that these machine-generated annotations are less reliable (a noisy teacher) than human expert annotations. We then generate a gradient-based relative weight per sample that reflects its “trustworthiness” during training; this sample weight is estimated from the similarity of the gradient directions between the annotation-free sample data and the expert-annotated validation data. Gradient similarity-based sample re-weighting has previously been explored for deep learning from inaccurate labels on 2D RGB images (Ren et al., 2018, Mirikharaji et al., 2019); however, it has not been explored for volumetric radiographic images. As the primary aim of an active learning algorithm is to identify and label only maximally informative samples, gradient similarity-based sample weighting may underestimate a more diversely informative data sample. Ash et al. (2019) showed the efficacy of using the gradient magnitude, with respect to the parameters in the final CNN layer, as a measure of the model’s uncertainty: a higher gradient magnitude in the last layer, resulting from a higher training loss, implies that the interrogated training sample contains new information (Ash et al., 2019) that the model has not yet seen. In our proposed approach, we adopt this gradient magnitude-based strategy and generate another set of sample weights that reflect the samples’ “informativeness” during training. Afterwards, we generate an overall sample weight by combining the “trustworthiness” and “informativeness” sample weights. 
Finally, we use a query mechanism to choose more informative and trustworthy samples in a batch of annotation-free data by rectification (i.e., choosing more useful data in a batch) of the combined sample weight, and subsequently use these combined sample weights in the batch during the model optimization.

Data partition

Our method uses a small set of expert-annotated volumetric imaging data to produce pseudo-annotations for a much larger set of data lacking annotations, $\mathcal{D}_u$. We start by dividing the small cohort of expert-annotated data $\mathcal{D}_a$ into a training set $\mathcal{D}_{tr}$, a validation set $\mathcal{D}_{val}$, and a held-out test set $\mathcal{D}_{ts}$, where the three subsets are pairwise disjoint and together cover $\mathcal{D}_a$. We then train a deep 3D segmentation model on this small training set and use the resulting model to generate pseudo-annotations on the large pool of annotation-free scans. We then combine these generated pseudo-annotated data with the expert-annotated validation data to train an example re-weighted active learning model, in which we assign varying weights to each pseudo-annotated training example based on its gradient direction (see Fig. 1).
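For concreteness, this partition can be sketched as follows; the helper name `partition`, the random seed, and the split sizes are illustrative assumptions, not the authors' code:

```python
import random

def partition(expert_ids, n_train, n_val, seed=0):
    """Split expert-annotated scan IDs into disjoint train/val/test sets.

    Hypothetical helper illustrating the data partition; the actual split
    sizes used in the paper are given in the Implementation details section.
    """
    ids = list(expert_ids)
    random.Random(seed).shuffle(ids)
    d_tr = set(ids[:n_train])
    d_val = set(ids[n_train:n_train + n_val])
    d_ts = set(ids[n_train + n_val:])
    # the three subsets are pairwise disjoint by construction
    assert d_tr.isdisjoint(d_val) and d_val.isdisjoint(d_ts) and d_tr.isdisjoint(d_ts)
    return d_tr, d_val, d_ts
```

Splitting at the patient level (before any augmentation) is what keeps augmented copies of one patient from leaking across the sets.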
Fig. 1

Schematic diagram of the proposed example re-weighted active learning from noisy teacher approach for 3D image segmentation.


Supervised learning for pseudo-label generation

Using the training set $\mathcal{D}_{tr}$, our objective is to train a 3D CNN $\Phi_s(\theta_s)$ in a fully supervised fashion (Step 1 in Fig. 1), which is then used to generate pseudo-labels for $\mathcal{D}_u$, where $\theta_s$ is the set of learnable parameters. For model optimization, we adopt the combo loss (Taghanaki et al., 2019b) after we extend it with a sample-specific weighting:
$$\mathcal{L} = \sum_{i=1}^{N} w_i \Bigg[ -\frac{\alpha}{V} \sum_{v=1}^{V} \sum_{c=1}^{C} g_{i,v,c} \ln p_{i,v,c} \;-\; (1-\alpha) \sum_{c=1}^{C} \frac{2 \sum_{v=1}^{V} p_{i,v,c}\, g_{i,v,c}}{\sum_{v=1}^{V} \big( p_{i,v,c} + g_{i,v,c} \big)} \Bigg], \tag{1}$$
where $N$ is the number of volumes in the training mini-batch, $w_i$ is the scalar weight associated with each training example, $C$ is the total number of class labels, $V$ is the total number of voxels in each volume, and $p_{i,v,c}$ are the components of the output label probability vectors corresponding to the one-hot-encoded ground-truth label vectors $g_{i,v,c}$. For the supervised stage, we use uniform weights $w_i = 1/N$. After training $\Phi_s$, we generate pseudo-labels for $\mathcal{D}_u$ by applying $\Phi_s$ to each annotation-free scan (Step 2 in Fig. 1).
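As an illustration, a sample-weighted combo loss can be sketched in NumPy. This is a simplified variant (fixed $\alpha = 0.5$, plain cross-entropy plus a soft-Dice term, no within-cross-entropy class weighting) and not the authors' exact implementation:

```python
import numpy as np

def weighted_combo_loss(p, g, w, alpha=0.5, eps=1e-7):
    """Sample-weighted combo loss sketch.

    p, g: (N, C, V) arrays of predicted probabilities and one-hot ground
    truth (N volumes, C classes, V voxels); w: (N,) per-sample weights.
    alpha balances the cross-entropy and Dice terms (0.5 is an assumption).
    """
    # voxel-wise cross-entropy, averaged over voxels -> one value per sample
    ce = -np.mean(np.sum(g * np.log(p + eps), axis=1), axis=1)        # (N,)
    # soft Dice per class, averaged over classes -> one value per sample
    inter = np.sum(p * g, axis=2)                                     # (N, C)
    dice = np.mean((2 * inter + eps) /
                   (np.sum(p, axis=2) + np.sum(g, axis=2) + eps), axis=1)  # (N,)
    # combo: minimize cross-entropy and (1 - Dice), weighted per sample
    return np.sum(w * (alpha * ce + (1 - alpha) * (1 - dice)))
```

With uniform weights `w = np.full(N, 1.0 / N)` this reduces to the ordinary (unweighted) combo loss used in the supervised stage.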

Semi-supervised active learning from noisy teacher

Our active learning strategy aims to assign weights to pseudo-annotated data samples and subsequently select more reliable examples for the training mini-batch. We train a 3D CNN $\Phi_{ss}(\theta)$ in a semi-supervised fashion (Step 3 in Fig. 1). In a standard training loss function (in our case, the combo loss in Eq. (1)), all input data are equally weighted, i.e., $w_i = 1/N$ in Eq. (1). Given that our expanded training dataset now consists of both an expert-annotated set $\mathcal{D}_{val}$ and a pseudo-annotated set $\mathcal{D}_u$, where $\mathcal{D}_u$ is generally less reliable than $\mathcal{D}_{val}$, our method instead learns a data re-weighting strategy, where we minimize the weighted combo loss $\mathcal{L}_{\mathcal{B}}$ for a mini-batch $\mathcal{B}$ of set $\mathcal{D}_u$ as follows:
$$\mathcal{L}_{\mathcal{B}}(\theta; \mathbf{w}) = \sum_{i \in \mathcal{B}} w_i\, \ell_i(\theta), \tag{2}$$
where $\ell_i$ is the combo loss of sample $i$. In deep learning, the model parameters are updated using gradient descent:
$$\theta_{t+1} = \theta_t - \eta\, \nabla_{\theta}\, \mathcal{L}_{\mathcal{B}}(\theta_t; \mathbf{w}), \tag{3}$$
where $\eta$ is the step size and $t$ denotes the training step. Given the reduced trustworthiness of the generated pseudo-labels, our method inspects the gradient descent direction of each training mini-batch of the pseudo-annotated dataset in each iteration on the training loss surface. Instead of equal weighting, we re-weight the samples according to the similarity of their descent directions to the descent direction of the validation combo loss $\mathcal{L}_{val}$ surface:
$$\mathbf{w}^{*} = \arg\min_{\mathbf{w} \ge 0}\, \mathcal{L}_{val}\big(\theta^{*}(\mathbf{w})\big), \qquad \theta^{*}(\mathbf{w}) = \arg\min_{\theta}\, \mathcal{L}_{\mathcal{B}}(\theta; \mathbf{w}). \tag{4}$$
However, solving Eq. (4) to find the optimal $\mathbf{w}^{*}$ for each update step of the network parameters is computationally intensive, as it requires two nested loops of optimization. Therefore, to approximate $\mathbf{w}$ for every gradient descent step, a meta-learning procedure is used. At step $t$, we take a perturbed lookahead step $\hat{\theta}_{t+1}(\epsilon) = \theta_t - \eta \nabla_{\theta} \sum_{i \in \mathcal{B}} \epsilon_i \ell_i(\theta_t)$ (obtained in Step 3.1 in Fig. 1), perform a single gradient descent step on a mini-batch of validation samples (shown as Step 3.2 in Fig. 1) with respect to $\hat{\theta}_{t+1}$, and rectify the output to generate a non-negative weight (shown as Step 3.3 in Fig. 1):
$$u_i = -\eta' \left. \frac{\partial\, \mathcal{L}_{val}\big(\hat{\theta}_{t+1}(\epsilon)\big)}{\partial \epsilon_i} \right|_{\epsilon_i = 0}, \qquad w_i^{s} = \phi\big(\max(u_i, 0)\big), \tag{5}$$
where $\eta'$ is the descent step size on $\mathcal{L}_{val}$, and $\phi$ is the normalization function that ensures $\sum_i w_i^{s} = 1$. 
We also estimate the embedding of the gradient at step $t$ for pseudo-annotated data samples from the magnitude of the gradient with respect to the parameters $\theta^{out}$ in the final layer of the meta-network as (Ash et al., 2019):
$$g_i = \big\| \nabla_{\theta^{out}}\, \ell_i\big(\hat{\theta}_{t+1}\big) \big\|. \tag{6}$$
We then generate the gradient magnitude-based sample weights from the gradient embedding (Eq. (6)) as:
$$w_i^{m} = \frac{g_i}{\sum_{j} g_j}. \tag{7}$$
To ensure a balance between the “trustworthiness” and “informativeness” of a particular pseudo-annotated data sample, we combine $w_i^{s}$ and $w_i^{m}$ using a relative weight $\lambda$ as:
$$w_i = \lambda\, w_i^{s} + (1-\lambda)\, w_i^{m}. \tag{8}$$
Although the optimum set of weights for a mini-batch is expected to have positive values, which would allow all the image volumes in the mini-batch to contribute to the optimization of $\theta$, the constituent weight per sample differs based on the sample’s descent direction similarity to those of the validation data, as well as its gradient embedding $g_i$. Therefore, we choose the more reliable and informative pseudo-annotated samples from the mini-batch, namely those whose descent directions are most similar to those of the validation data and which introduce newer information. This approach makes the optimization more robust to noisy pseudo-annotated samples and also mimics an active learning strategy. To achieve this, we further rectify $w_i$ so that only samples with weights greater than the uniform value $1/N$ are selected:
$$\hat{w}_i = \begin{cases} w_i, & \text{if } w_i > 1/N \\ 0, & \text{otherwise.} \end{cases} \tag{9}$$
After learning the adaptive weights $\hat{w}_i$, we perform a final backward pass to estimate the gradient and update the network parameters as:
$$\theta_{t+1} = \theta_t - \eta\, \nabla_{\theta} \sum_{i \in \mathcal{B}} \hat{w}_i\, \ell_i(\theta_t). \tag{10}$$
We also present a description of our method in Algorithm 1, which describes the technical steps in each training iteration. The iteration starts by loading the pseudo- and expert-annotated batch data (lines 1 and 2, respectively). Then the 3D meta-network loads the current parameters from the main 3D CNN model of identical architecture (line 4). The steps to learn the weights per pseudo-annotated sample, based on gradient similarity and gradient magnitude, are shown in lines 5–16. 
Finally, the sample re-weighted loss calculation and updating of parameters of the main 3D CNN model are shown in lines 17–20.
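The per-iteration re-weighting described above can be sketched on a toy problem. The sketch below substitutes a linear regressor with squared loss for the 3D UNet and combo loss, and folds the lookahead step, gradient-similarity weights, gradient-magnitude weights, rectification, and final update into one function; all names and hyperparameter values are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def reweight_step(theta, Xp, yp, Xv, yv, eta=0.1, lam=0.5, delta=1e-8):
    """One training iteration of the sample re-weighting scheme, sketched
    on a toy linear regressor with squared loss."""
    n = len(Xp)
    # per-sample gradients of the pseudo-annotated batch
    g = np.stack([2 * (x @ theta - t) * x for x, t in zip(Xp, yp)])      # (n, d)
    # validation gradient; the lookahead step evaluated at epsilon = 0
    # leaves theta unchanged, so we differentiate here
    gv = np.mean([2 * (x @ theta - t) * x for x, t in zip(Xv, yv)], axis=0)
    # "trustworthiness": similarity of each sample's descent direction to
    # the validation descent direction, rectified to be non-negative
    u = np.maximum(eta * (g @ gv), 0.0)
    ws = u / (u.sum() + delta)
    # "informativeness": gradient magnitude (here the model has one layer,
    # so the whole gradient stands in for the last-layer gradient)
    m = np.linalg.norm(g, axis=1)
    wm = m / (m.sum() + delta)
    # combined weight, then active-learning rectification: zero out samples
    # whose weight is not above the uniform value 1/n, and renormalize
    w = lam * ws + (1 - lam) * wm
    w = np.where(w > 1.0 / n, w, 0.0)
    w = w / (w.sum() + delta)
    # final re-weighted parameter update
    return theta - eta * (w[:, None] * g).sum(axis=0), w
```

In the sketch, a pseudo-annotated sample whose (noisy) label contradicts the validation data gets a descent direction opposed to the validation gradient, so its similarity weight is rectified to zero and, after the active-learning cut at $1/n$, it is excluded from the update entirely.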

Data

We used three clinical COVID-19 CT datasets to evaluate our proposed method, two of which are publicly available and include expert annotations of infections at the voxel level. The third database is private and was accessed with proper institutional ethics approval (IR.ESFARAYENUMS.REC.1398.019, Esfarayen Faculty of Medical Sciences, 2020-03-18; 2020s0128, Simon Fraser University).

COVID-19 CT benchmark dataset

The first public database is the COVID-19 CT Benchmark dataset of Ma et al. (2020b) (henceforth, “Benchmark” refers to these data), which contains 20 CT volumes from 20 patients with expert voxel-level annotations of COVID-19 infection in the lungs. The proportion of COVID-19 pneumonia infection in the lungs ranges from 0.01% to 59%. The left and right lungs and the pneumonia infection in these data were annotated in three steps: first, junior annotators with 1–5 years of experience annotated the data slice-by-slice in the axial direction using ITKSnap; the annotations were then refined by two radiologists with 5–10 years of experience. Finally, a senior radiologist with more than 10 years of experience verified and refined the annotations.

COVID-19 lung CT lesion segmentation challenge 2020 dataset

The second public database is the COVID-19 Lung CT Lesion Segmentation Challenge 2020 database of An et al. (2020) (henceforth, “Challenge” refers to these data), which contains 199 chest CT scans from 199 patients with ground truth pixel-level annotations of COVID-19 lesions in the lungs. These data were acquired without intravenous contrast enhancement from COVID-19 patients confirmed by reverse transcription polymerase chain reaction (RT-PCR) in China. COVID-19 infection in these CT volumes was initially segmented using a previously trained model to segment the COVID-19 lesion (Yang et al., 2021). Later, a group of experienced radiologists used the initial segmentation as a starting point for the subsequent ITKSnap-based adjudication and correction of infection masks.

COVID-19 CT private dataset

The third database is composed of 1473 CT scans of 623 patients of Imam Khomeini Hospital, Esfarayen, Iran (henceforth, “Hospital” refers to these data). We accessed these data with all required ethics approvals in place (IR.ESFARAYENUMS.REC.1398.019, Esfarayen Faculty of Medical Sciences, 2020-03-18; 2020s0128, Simon Fraser University). Of these scans, 567 were confirmed to be from COVID-19 patients and 906 from non-COVID patients. None of the scans in this third database had pixel-level annotations of lung infections. These data were acquired using a Toshiba Alexion CT scanner (Toshiba, Minato City, Tokyo, Japan). The axial pixel dimension ranges between 0.571 and 0.763 mm, and the slice thickness was 7 mm.

Implementation details

To standardize the clinical datasets, we resampled all CT volumes to a common voxel dimension by trilinear interpolation. We used a modified version of 3D UNet (Çiçek et al., 2016) as the CNN (for both the supervised and semi-supervised models), which has residual connections around the convolutional blocks as in Kerfoot et al. (2018). We show our 3D UNet architecture in tabular form in Table 1, which also lists the number of trainable parameters in each layer of the network. We used the Adam optimizer with an initial learning rate of 0.01 to train our networks. We carried out two experiments, one using the Benchmark and Hospital data and the other using the Challenge and Hospital data. We used the Challenge and Hospital data for the ablation study (i.e., experiment 2) and the Benchmark and Hospital data for the comparison of our performance with the state-of-the-art approaches (i.e., experiment 1). In both cases, the Hospital scans constituted the data without voxel-level annotation. Before running the experiments, we augmented the Benchmark and Challenge datasets by flipping left–right and up–down (i.e., the sample size was increased 4-fold). For experiment 1, we used 16 volumes from the Benchmark dataset as the training set during supervised learning (Step 1 in Fig. 1). We used the remaining 64 Benchmark CT volumes in 4-fold cross-validation, with 48 volumes as the validation set (used in gradient similarity estimation during training) and 16 volumes as the held-out test set in each fold. For experiment 2, we used the pseudo-annotations from experiment 1, and thus used 636 (i.e., 4 × 159) volumes as the validation set (used in gradient similarity estimation during training) and 160 (i.e., 4 × 40) volumes as the held-out test set in 5-fold cross-validation. In both experiments, the entire third dataset was used as the annotation-free data. We ensured that the augmented data of a patient were never split between the training, validation, and test sets. 
Training of the supervised and semi-supervised models was scheduled to run for 500 epochs each in both experiments, taking about 1 day and 8 days, respectively. However, we often stopped training early if the training error was found to be saturated. We also fixed the size of the training batch and the relative weight between the two sample-weight terms. We implemented our method in PyTorch version 1.6.0 and Python version 3.7.4. Training was performed on a workstation with an Intel E5-2650 v4 Broadwell 2.2 GHz processor, an Nvidia P100 Pascal GPU with 16 GB of VRAM, and 64 GB of RAM.
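The four-fold flip augmentation mentioned above can be sketched as follows; which array axes correspond to left–right and up–down is an assumption, since the source does not specify the axis convention:

```python
import numpy as np

def augment_flips(vol):
    """Four-fold augmentation by left-right and up-down flips of a 3D
    volume with shape (depth, height, width). Axis choices are illustrative:
    axis 2 is taken as left-right and axis 1 as up-down."""
    lr = np.flip(vol, axis=2)            # left-right flip
    ud = np.flip(vol, axis=1)            # up-down flip
    both = np.flip(lr, axis=1)           # both flips combined
    return [vol, lr, ud, both]
```

The same flips must be applied to the annotation masks, and, as noted above, all four copies of a patient's scan must stay on the same side of the train/validation/test split.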
Table 1

Our 3D UNet architecture. Acronyms—BN: batch normalization, Conv3D: 3D convolution, Conv3D-Res: Conv3D is used for residual connection, PReLU: parametric rectified linear unit, and I: identity connection.

Block type  | Kernel | Stride | Activation | BN  | Repeat | Input size       | Output size | # trainable params
Conv3D      | 3³     | 2      | PReLU      | Yes | 2      | 96³ × 1          | 48³ × 16    | 897
Conv3D      | 3³     | 1      | PReLU      | Yes | 1      | 48³ × 16         | 48³ × 16    | 6,929
Conv3D-Res  | 3³     | 2      | –          | –   | 1      | 48³ × 16         | 48³ × 16    | 0
Conv3D      | 3³     | 2      | PReLU      | Yes | 2      | 48³ × 16         | 24³ × 32    | 27,713
Conv3D      | 3³     | 1      | PReLU      | Yes | 1      | 24³ × 32         | 24³ × 32    | 27,681
Conv3D-Res  | 3³     | 2      | –          | –   | 1      | 24³ × 32         | 24³ × 32    | 0
Conv3D      | 3³     | 2      | PReLU      | Yes | 2      | 24³ × 32         | 12³ × 64    | 110,720
Conv3D      | 3³     | 1      | PReLU      | Yes | 1      | 12³ × 64         | 12³ × 64    | 110,657
Conv3D-Res  | 3³     | 2      | –          | –   | 1      | 12³ × 64         | 12³ × 64    | 0
Conv3D      | 3³     | 2      | PReLU      | Yes | 2      | 12³ × 64         | 6³ × 128    | 442,625
Conv3D      | 3³     | 1      | PReLU      | Yes | 1      | 6³ × 128         | 6³ × 128    | 442,497
Conv3D-Res  | 3³     | 2      | –          | –   | 1      | 6³ × 128         | 6³ × 128    | 0
Conv3D      | 3³     | 1      | PReLU      | Yes | 2      | 6³ × 128         | 6³ × 256    | 918,017
Conv3D      | 3³     | 1      | PReLU      | Yes | 1      | 6³ × 256         | 6³ × 256    | 1,769,729
Conv3D-Res  | 1³     | 1      | –          | –   | 1      | 6³ × 256         | 6³ × 256    | 0
Conv3D      | 3³     | 2      | PReLU      | Yes | 1      | 6³ × (256+128ᵃ)  | 12³ × 64    | 663,617
Conv3D      | 3³     | 1      | PReLU      | Yes | 1      | 12³ × 64         | 12³ × 64    | 110,657
Conv3D-Res  | I      | –      | –          | –   | 1      | 12³ × 64         | 12³ × 64    | 0
Conv3D      | 3³     | 2      | PReLU      | Yes | 1      | 12³ × (64+64ᵃ)   | 24³ × 32    | 110,621
Conv3D      | 3³     | 1      | PReLU      | Yes | 1      | 24³ × 32         | 24³ × 32    | 27,681
Conv3D-Res  | I      | –      | –          | –   | 1      | 24³ × 32         | 24³ × 32    | 0
Conv3D      | 3³     | 2      | PReLU      | Yes | 1      | 24³ × (32+32ᵃ)   | 48³ × 16    | 27,665
Conv3D      | 3³     | 1      | PReLU      | Yes | 1      | 48³ × 16         | 48³ × 16    | 6,929
Conv3D-Res  | I      | –      | –          | –   | 1      | 48³ × 16         | 48³ × 16    | 0
Conv3D      | 3³     | 2      | PReLU      | Yes | 1      | 48³ × (16+16ᵃ)   | 96³ × 2     | 1,731
Conv3D      | 3³     | 1      | –          | –   | 1      | 96³ × 2          | 96³ × 2     | 110
Conv3D-Res  | I      | –      | –          | –   | 1      | 96³ × 2          | 96³ × 2     | 0

Total: 4,806,481

ᵃ Represents the skip connection between the encoder and decoder sides of the network.


Results and discussion

In this section, we first present our ablation study on the Challenge dataset and then compare the pneumonia infection segmentation performance of our proposed method with that of the state-of-the-art methods on the Benchmark dataset.
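For reference, the two metrics reported throughout this section can be computed as sketched below; this is a minimal illustration (the brute-force Hausdorff distance assumes small, non-empty masks), not the authors' evaluation code. In practice, an optimized routine such as `scipy.spatial.distance.directed_hausdorff` would be used:

```python
import numpy as np

def dice_score(a, b, eps=1e-7):
    """Dice overlap between two binary masks of the same shape."""
    a, b = a.astype(bool), b.astype(bool)
    return (2.0 * np.logical_and(a, b).sum() + eps) / (a.sum() + b.sum() + eps)

def hausdorff(a, b):
    """Brute-force symmetric Hausdorff distance between the voxel
    coordinates of two non-empty binary masks (small masks only)."""
    pa, pb = np.argwhere(a), np.argwhere(b)
    d = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=2)
    return max(d.min(axis=1).max(), d.min(axis=0).max())
```

Dice rewards volumetric overlap (higher is better), while the Hausdorff distance penalizes the worst boundary disagreement (lower is better), which is why the two metrics can rank methods differently in the tables below.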

Ablation study on the challenge data

Here, we present the results of our ablation study in Table 2 to demonstrate the incremental contributions of the different modules of our proposed method. In this table, we present 5-fold cross-validated Dice scores and Hausdorff distances for several approaches: our fully supervised method; a semi-supervised approach adopting the training strategy from Fan et al. (2020); the proposed gradient similarity-based sample re-weighting method (RGS; re-weighting by gradient similarity, without weight rectification); the proposed last-layer gradient magnitude-based sample re-weighting method (RGM; re-weighting by gradient magnitude, without weight rectification); the proposed combined gradient similarity- and last-layer gradient magnitude-based sample re-weighting (RGS&M; combined re-weighting, without rectification); and the proposed combined sample re-weighting with AL method (RGS&MAL; combined re-weighting, with rectification). Note that for the semi-supervised approach, we only adopted the semi-supervision strategy from Fan et al. (2020) but not their deep model, as it was designed for 2D data. This semi-supervision strategy used machine-generated annotations (i.e., noisy annotations) for the annotation-free data pool during training. As a result, we see in Table 2 that the performance of this approach in terms of the mean Dice and mean Hausdorff distance is worse than that of our fully supervised approach for folds 3 and 4. The overall performance of the semi-supervised method is slightly better than that of the fully supervised approach in terms of average Dice; however, the opposite is seen in terms of the mean Hausdorff distance. 
Next, we see in Table 2 that the sample re-weighting strategies using the gradient similarity and the last-layer gradient magnitude (used in the RGS and RGM methods, respectively) lead to better segmentation performance in terms of Dice score than the fully supervised and semi-supervised approaches, because they incorporate the “trustworthiness” and “informativeness” of pseudo-annotated samples into the model loss calculation. However, the Hausdorff distance of the RGS method is worse than that of the semi-supervised approach for folds 1, 2, 4, and 5. A similar but marginally worse performance trend is seen for the RGM approach for folds 4 and 5. We further see in Table 2 that the RGS&M method, where we combined the gradient similarity and last-layer gradient magnitude weights, improves the segmentation performance in terms of the Dice score beyond either the RGS or RGM method alone. However, mixed performance is seen in terms of the Hausdorff distance for folds 1 to 5, compared to the semi-supervised, RGS, and RGM approaches. Nonetheless, the average Hausdorff distance of the RGS&M method is better than that of the fully supervised, semi-supervised, RGS, and RGM approaches. Finally, we see in Table 2 that our active learning strategy using weight rectification, which completely removes the contribution of less trustworthy and less informative samples from the batch-wise loss calculation, leads to the best segmentation performance in terms of the Dice score. Furthermore, the average Hausdorff distance of the proposed RGS&MAL method outperforms all other methods. We also performed the two-sample t-test on the 5-fold mean Dice scores between our proposed RGS&MAL and the other methods mentioned in Table 2. The estimated p-values are 0.0027, 0.0048, 0.0330, 0.1963, and 0.6404 for the fully supervised, semi-supervised, RGS, RGM, and RGS&M methods, respectively. 
Since RGS, RGM, and RGS&M are intrinsic parts of our proposed RGS&MAL method, differences in segmentation performance among these approaches may be non-significant. However, as expected, the proposed RGS&MAL method showed statistically significant improvements (p < 0.01) in terms of Dice score compared to the fully supervised and semi-supervised approaches.
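The significance test above can be reproduced with a standard pooled-variance two-sample t statistic, sketched below; the input values in the usage test are placeholders, not the paper's fold-wise Dice scores, and in practice `scipy.stats.ttest_ind` would also return the p-value:

```python
import math
from statistics import mean, variance

def two_sample_t(a, b):
    """Student's two-sample t statistic with pooled variance, the quantity
    scipy.stats.ttest_ind computes by default. a and b are the fold-wise
    mean Dice scores of the two methods being compared."""
    na, nb = len(a), len(b)
    # pooled sample variance across the two groups
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))
```

With 5 folds per method, the statistic has 8 degrees of freedom, and the p-value follows from the t-distribution CDF at that statistic.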
Table 2

5-fold cross-validation performance in terms of Dice scores and Hausdorff distances of our implemented methods for segmenting COVID-19 pneumonia infection in the Challenge data. The upward arrow (↑) indicates that higher is better, and the downward arrow (↓) indicates that lower is better. Colored values indicate the best performance in terms of the Dice score and the Hausdorff distance, respectively. (Acronyms—RGS: sample re-weighting based on gradient similarity only, RGSAL: sample re-weighting based on gradient similarity followed by active learning, RGS&MAL: sample re-weighting based on both gradient similarity and last layer gradient magnitude followed by active learning, Met: metrics, DS: Dice score, HD: Hausdorff distance.)

Table 3

4-fold cross-validation performance in terms of Dice scores and Hausdorff distances of our implemented methods for segmenting pneumonia infection in the Benchmark data. The upward arrow (↑) indicates that higher is better, and the downward arrow (↓) indicates that lower is better. Colored values indicate the best performance in terms of the Dice score and the Hausdorff distance, respectively. (Acronyms—RGS: sample re-weighting based on gradient similarity only, RGSAL: sample re-weighting based on gradient similarity followed by active learning, RGS&MAL: sample re-weighting based on both gradient similarity and last layer gradient magnitude followed by active learning, Met: metrics, DS: Dice score, HD: Hausdorff distance.)

In Fig. 2, we demonstrate the qualitative performance comparison of the fully supervised, semi-supervised, RGS, RGM, RGS&M, and RGS&MAL methods. Here, we show the axial CT slices and corresponding expert-annotated pneumonia infection masks for seven COVID-19-positive patients. We see in this figure that all methods performed reasonably well in identifying pneumonia infections in the lung. However, for more irregular and complex infection patterns (e.g., patients I, IV, V, and VII), the masks produced by the proposed RGS&MAL method, shown in the last row, are the best match to the expert-annotated infection masks, shown in the second row. We also see in the last three columns that there are considerable false positives (indicated with blue arrows) and false negatives (indicated with yellow arrows) in the infection masks produced by the different methods except for the proposed RGS&MAL approach. This evidence further supports the efficacy of the proposed RGS&MAL method.
Fig. 2

Qualitative performance comparison by our implemented methods in pneumonia infection segmentation on the Challenge data. The first row shows the axial CT slices of seven COVID-19-infected patients. The second row shows the expert-generated infection mask overlaid on the corresponding CT slices. The third to eighth rows show infection segmentation masks generated by different approaches. Blue arrows indicate false positives and yellow arrows indicate false negatives.
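For concreteness, the two evaluation metrics reported throughout (Dice score and Hausdorff distance) can be sketched in a few lines of Python. This toy version operates on binary masks represented as sets of voxel coordinates; it is an illustrative sketch, not the authors' implementation, which works on full 3D CT volumes.

```python
import math

def dice(a, b):
    """Dice score between two binary masks given as sets of voxel coordinates."""
    if not a and not b:
        return 1.0  # two empty masks agree perfectly by convention
    return 2.0 * len(a & b) / (len(a) + len(b))

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets (Euclidean)."""
    def directed(p, q):
        # farthest that any point of p lies from its nearest neighbor in q
        return max(min(math.dist(u, v) for v in q) for u in p)
    return max(directed(a, b), directed(b, a))

# Two masks of two voxels each, overlapping in one voxel:
a = {(0, 0), (0, 1)}
b = {(0, 1), (0, 2)}
print(dice(a, b))       # 2*1/(2+2) = 0.5
print(hausdorff(a, b))  # 1.0
```

A higher Dice score indicates better volumetric overlap, while a lower Hausdorff distance indicates that the worst boundary error is smaller, which is why the tables mark the two metrics with opposite arrows.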


Performance comparison on the benchmark data

In Table 3, we show the Dice scores and Hausdorff distances achieved in 4-fold cross-validation by our fully supervised method, the proposed gradient similarity-based sample re-weighting method (RGS; re-weighting without rectification), the proposed gradient similarity-based sample re-weighting with AL (RGSAL; re-weighting with rectification), and the proposed gradient similarity- and last-layer gradient magnitude-based sample re-weighting with AL (RGS&MAL; re-weighting with both gradient terms and rectification). Since we compared our performance to that of the state-of-the-art methods on the same Benchmark dataset, and part of the data had to be set aside for generating pseudo annotations, we used 4-fold cross-validation so that the test cohort contains the same number of samples as in the state-of-the-art evaluations. Table 3 shows that the proposed RGS method, which uses CT volumes with noisy annotations during training, achieves a better Dice score than the fully supervised approach. This confirms that the gradient similarity between expert-annotated and noisily annotated data helps automatically emphasize the more trustworthy samples of a training batch in the loss calculation, and hence in the parameter updates during back-propagation. We further see in Table 3 that the RGSAL method achieves a better Dice score than the RGS method alone in all folds except Fold 4, demonstrating that completely removing the contribution of less trustworthy samples (i.e., AL via rectification) from the batch-wise loss calculation improves the model's segmentation performance. Finally, Table 3 shows that the RGS&MAL method achieves the best Dice score among all approaches in all folds.
This shows that re-weighting a machine-annotated sample based on its gradient similarity and last-layer gradient magnitude, reflecting its "trustworthiness" and "informativeness", respectively, improves the accuracy of the deep segmentation model. Here also, after incorporating both data informativeness and label trustworthiness into the sample re-weighting, followed by the complete removal of the contribution of less trustworthy and less informative samples (i.e., AL via rectification) from the batch-wise loss calculation, we obtain a better-performing model. We also observe in Table 3 that the average Hausdorff distance of the proposed RGS&MAL method is the best among all techniques. We further performed two-sample t-tests on the 4-fold mean Dice scores between our proposed RGS&MAL and the other methods in Table 3. The estimated p-values are 0.0021, 0.3399, and 0.5180 for the fully supervised, RGS, and RGSAL methods, respectively. Since the RGS and RGSAL methods are intrinsic parts of our proposed RGS&MAL method, differences in segmentation performance among these approaches may not be significant. However, as expected, the proposed RGS&MAL showed a statistically significant improvement (p < 0.01) in Dice score over the fully supervised approach. In Fig. 3, we present a qualitative comparison of the fully supervised, RGS, RGSAL, and RGS&MAL methods, showing axial CT slices and the corresponding expert-annotated pneumonia infection masks for five COVID-19-positive patients. As with the Challenge data, all methods performed reasonably well in identifying pneumonia infections in the lung. However, for more irregular and complex infection patterns (i.e., patients I, III, IV, and V), the masks produced by the proposed RGS&MAL method (shown in the last row) best match the expert-annotated infection masks (shown in the second row). This qualitative performance further supports the efficacy of the proposed RGS&MAL approach.
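The gradient-similarity re-weighting discussed above can be sketched as follows. This is a simplified stand-in for the paper's scheme, with flat-vector gradients and illustrative function names (the authors operate on 3D UNet parameter gradients): each machine-annotated sample's gradient is compared to the expert/validation gradient via cosine similarity, negative similarities are rectified to zero, and the surviving weights are normalized.

```python
import math

def cosine(u, v):
    """Cosine similarity between two flat gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def reweight(sample_grads, val_grad):
    """Weight machine-annotated samples by the rectified cosine similarity
    between each per-sample gradient and the expert/validation gradient."""
    raw = [max(0.0, cosine(g, val_grad)) for g in sample_grads]
    total = sum(raw)
    # samples whose descent direction opposes the expert gradient get zero weight
    return [r / total for r in raw] if total > 0 else raw

# One aligned, one opposed, and one orthogonal pseudo-labeled sample:
print(reweight([[1, 0], [-1, 0], [0, 1]], [1, 0]))  # [1.0, 0.0, 0.0]
```

The rectification step is what makes this behave like a query function: samples whose noisy labels would push the model away from the expert-validated descent direction contribute nothing to the batch loss.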
Fig. 3

Qualitative performance comparison by our implemented methods in pneumonia infection segmentation on the Benchmark data. The first row shows the axial CT slices of five COVID-19-infected patients. The second row shows the expert-generated infection mask overlaid on the corresponding CT slices. The third to sixth rows show infection segmentation masks generated by different approaches.

Next, in Table 4, we show the Dice scores achieved by different methods for segmenting pneumonia infection in the Benchmark dataset. We include results for two fully supervised approaches (Isensee et al. (2019) and our 3D UNet implementation), five semi-supervised approaches (Chen et al. (2020), Yu et al. (2019), Ma et al. (2020a), Fan et al. (2020), and the proposed RGS method), and two active learning by noisy teacher approaches, namely the proposed RGSAL and the proposed RGS&MAL. Note that Ma et al. (2020a) are the curators and publishers of the Benchmark dataset (Ma et al., 2020b) that we use for validation in this paper. Furthermore, the methods of Isensee et al. (2019), Chen et al. (2020), and Yu et al. (2019) were implemented and tested on the same Benchmark dataset by Ma et al. (2020a), adhering to the exact computation and pre-processing steps described in the respective articles. Therefore, in Table 4 we report the mean Dice scores of the methods by Isensee et al. (2019), Chen et al. (2020), Yu et al. (2019), and Ma et al. (2020a) as reported by Ma et al. (2020a). This also avoids any performance degradation that our own re-implementation of these methods could have introduced. We also include the mean Dice reported by Fan et al. (2020) on the same Benchmark dataset, although they did not report a standard deviation. Comparing the Dice scores in Table 4, we see that our base 3D UNet outperformed all of the competing methods from the literature in infection segmentation.
A further improvement in the mean Dice score is achieved by our proposed semi-supervised RGS method. Although trained with machine-annotated data, the proposed RGS method outperformed our base 3D UNet model thanks to the gradient similarity-based sample re-weighting. Finally, Table 4 shows that the proposed RGSAL method improves the Dice score over the RGS method, and the proposed RGS&MAL method performs best among all methods. We also performed two-sample t-tests between our proposed RGS&MAL and the other state-of-the-art methods in Table 4; the estimated p-values are 0.0002, 0.0001, 0.0003, 0.0023, 0.0567, and 3.86e−36 for Isensee et al. (2019), our 3D UNet, Chen et al. (2020), Yu et al. (2019), Ma et al. (2020a), and Fan et al. (2020), respectively. Except against the method of Ma et al. (2020a), the performance improvement of the proposed RGS&MAL is statistically significant (p < 0.01). This result again reinforces our claim that re-weighting pseudo-annotated samples based on their gradient similarity and last-layer gradient magnitude, followed by the complete removal of the contribution of less trustworthy and less informative samples (i.e., AL via rectification) from the batch-wise loss calculation, leads to better model optimization and segmentation performance.
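The combination of trustworthiness and informativeness followed by rectification (the AL query) can be sketched as below. This is an illustrative simplification, not the paper's exact rule: `sims` stands in for the gradient-similarity weights, `grad_mags` for the last-layer gradient magnitudes, and `tau` for the empirically chosen rectification threshold; all names are hypothetical.

```python
def query_weights(sims, grad_mags, tau):
    """AL query sketch: combine trustworthiness (gradient similarity) with
    informativeness (last-layer gradient magnitude), zero out samples whose
    combined weight falls below tau, and renormalize the survivors."""
    peak = max(grad_mags)
    # normalize gradient magnitudes to [0, 1] as an informativeness score
    info = [g / peak if peak > 0 else 0.0 for g in grad_mags]
    combined = [s * i for s, i in zip(sims, info)]
    # rectification: completely remove low-weight samples from the batch loss
    kept = [c if c >= tau else 0.0 for c in combined]
    total = sum(kept)
    return [c / total for c in kept] if total > 0 else kept

# The third sample's combined weight (0.1 * 1.0) falls below tau = 0.3,
# so its contribution to the batch-wise loss is removed entirely:
w = query_weights(sims=[0.9, 0.8, 0.1], grad_mags=[2.0, 1.0, 2.0], tau=0.3)
```

Thresholding rather than merely down-weighting is what distinguishes RGSAL/RGS&MAL from plain RGS in the tables above: unreliable pseudo-annotated samples stop influencing the parameter updates at all.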
Table 4

Dice scores achieved by contrasting methods in the segmentation of pneumonia infections in Benchmark data. (Acronyms—RGS: sample re-weighting based on gradient similarity only, RGSAL: sample re-weighting based on gradient similarity followed by active learning, RGS&MAL: sample re-weighting based on both gradient similarity and last layer gradient magnitude followed by active learning.)

Method type                           Method                   Mean Dice
------------------------------------  -----------------------  ----------------
Fully supervised                      Isensee et al. (2019)    0.6728 ± 0.2220
                                      Our 3D UNet              0.7307 ± 0.0660
Semi-supervised                       Chen et al. (2020)       0.6759 ± 0.2230
                                      Yu et al. (2019)         0.6962 ± 0.2030
                                      Ma et al. (2020a)        0.7225 ± 0.1989
                                      Fan et al. (2020)        0.5970
                                      Proposed RGS             0.7548 ± 0.06517
Active learning from noisy teacher    Proposed RGSAL           0.7562 ± 0.0848
                                      Proposed RGS&MAL         0.7635 ± 0.0687
Although the proposed RGS&MAL approach showed better segmentation performance than the other state-of-the-art methods, it is slightly more computationally expensive than training a 3D UNet under full supervision. The proposed approach needs an additional forward pass and gradient calculation for the meta-network in each iteration, although the parameters of the meta-network are not updated. The meta-network is an identical 3D UNet that loads the current parameter state of the actual 3D UNet being trained at each iteration. Despite the use of this additional meta-network, the total number of trainable parameters of the proposed method remains the same, as shown in Table 1. Because of the additional forward pass and gradient calculation, the total training time is slightly higher for our proposed model than for a 3D UNet trained under full supervision. Despite this slightly longer training time and higher computational complexity when incorporating larger training data without expert annotation, our approach showed better segmentation performance than the other state-of-the-art approaches.
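The extra meta-network pass described above can be illustrated with a toy one-parameter model whose loss is (w*x − y)². The sketch below is a deliberately simplified stand-in for the 3D UNet training loop: the "meta-network" is just the current weight value, the sample-agreement rule is a sign test rather than the paper's cosine similarity, and all names are hypothetical.

```python
def grad(w, x, y):
    """d/dw of the squared error (w*x - y)^2."""
    return 2.0 * (w * x - y) * x

def train_step(w, expert_batch, pseudo_batch, lr=0.01):
    meta_w = w  # the meta-network loads the CURRENT weights each iteration
    # expert/validation gradient, averaged over the expert-annotated batch
    g_val = sum(grad(meta_w, x, y) for x, y in expert_batch) / len(expert_batch)
    weights, grads = [], []
    for x, y in pseudo_batch:
        # the extra forward pass + gradient calculation happens here,
        # through the meta-network, whose parameters are never updated
        g = grad(meta_w, x, y)
        grads.append(g)
        # keep only samples whose descent direction agrees with the expert gradient
        weights.append(1.0 if g * g_val > 0 else 0.0)
    total = sum(weights) or 1.0
    step = sum(wt * g for wt, g in zip(weights, grads)) / total
    return w - lr * step  # only the main model is updated; meta_w is discarded

# The second pseudo sample has a flipped label, so it is filtered out:
new_w = train_step(0.0, expert_batch=[(1.0, 2.0)], pseudo_batch=[(1.0, 2.0), (1.0, -2.0)])
```

Because `meta_w` only mirrors the main weights, the trainable parameter count is unchanged; the cost is the extra per-sample gradient computation each iteration, which matches the training-time overhead discussed above.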

Conclusions

We proposed a new semi-supervised segmentation method that deploys a noisy teacher-based active deep learning strategy. We use an example re-weighting scheme that adaptively weights pseudo-annotated training samples based on (i) the similarity of their gradient directions to those of the expert-annotated validation data and (ii) the gradient magnitude of the last layer of the deep model. We incorporated the trustworthiness and informativeness of pseudo-annotated data samples within an active learning strategy via a query function in the re-weighting process that favors more trustworthy and more informative samples in a batch. We validated our approach on three different clinical CT databases of COVID-19 and non-COVID pneumonia lung images and demonstrated that our method outperforms the state of the art in COVID-19 pneumonia infection segmentation. Our proposed method achieved the highest Dice score while using fewer expert-annotated samples during semi-supervised model training. Conventional deep learning models often struggle to maintain accurate predictions when the training and testing data come from different sources, a problem referred to as 'domain shift.' Since our proposed approach utilizes the gradient similarity between the training and validation data (from two different sources), it may be more robust to domain shift. Additionally, our approach produced accurate pneumonia infection masks even though we had extremely limited expert-annotated data. Because attaining expert annotations from radiologists is a common bottleneck for supervised learning on volumetric medical images, our method can contribute substantially to image-based diagnosis in clinical environments by leveraging the large pools of annotation-free data commonly available in hospital records.

While demonstrating the best COVID-19 pneumonia infection segmentation performance relative to the state of the art, our method has a few limitations that we plan to address in the future. For example, we empirically chose the value of one of the method's hyperparameters. Our future plans include developing an automatic, data-driven process to set it on the fly during model training. We also plan to validate our method on a larger expert-annotated data pool once available.

CRediT authorship contribution statement

Mohammad Arafat Hussain: Conceptualization, Methodology, Software, Data curation, Visualization, Investigation, Writing – original draft. Zahra Mirikharaji: Conceptualization, Methodology, Software, Writing – review & editing. Mohammad Momeny: Data curation. Mahmoud Marhamati: Data curation. Ali Asghar Neshat: Data curation. Rafeef Garbi: Supervision, Writing – review & editing. Ghassan Hamarneh: Supervision, Software, Data curation, Methodology, Writing – review & editing.

Declaration of Competing Interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Mohammad Arafat Hussain reports financial support was provided by Simon Fraser University. Mohammad Arafat Hussain reports a relationship with Simon Fraser University that includes: employment.
References (9 of 25 shown)

1.  Not-so-supervised: A survey of semi-supervised, multi-instance, and transfer learning in medical image analysis. (Review)

Authors:  Veronika Cheplygina; Marleen de Bruijne; Josien P W Pluim
Journal:  Med Image Anal       Date:  2019-03-29       Impact factor: 8.545

2.  DSAL: Deeply Supervised Active Learning From Strong and Weak Labelers for Biomedical Image Segmentation.

Authors:  Ziyuan Zhao; Zeng Zeng; Kaixin Xu; Cen Chen; Cuntai Guan
Journal:  IEEE J Biomed Health Inform       Date:  2021-10-05       Impact factor: 5.772

3.  nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation.

Authors:  Fabian Isensee; Paul F Jaeger; Simon A A Kohl; Jens Petersen; Klaus H Maier-Hein
Journal:  Nat Methods       Date:  2020-12-07       Impact factor: 28.547

4.  Federated semi-supervised learning for COVID region segmentation in chest CT using multi-national data from China, Italy, Japan.

Authors:  Dong Yang; Ziyue Xu; Wenqi Li; Andriy Myronenko; Holger R Roth; Stephanie Harmon; Sheng Xu; Baris Turkbey; Evrim Turkbey; Xiaosong Wang; Wentao Zhu; Gianpaolo Carrafiello; Francesca Patella; Maurizio Cariati; Hirofumi Obinata; Hitoshi Mori; Kaku Tamura; Peng An; Bradford J Wood; Daguang Xu
Journal:  Med Image Anal       Date:  2021-02-06       Impact factor: 8.545

5.  Segmentation of COVID-19 pneumonia lesions: A deep learning approach.

Authors:  Zahra Ghomi; Reza Mirshahi; Arash Khameneh Bagheri; Ali Fattahpour; Saeed Mohammadiun; Abdorreza Alavi Gharahbagh; Abtin Djavadifar; Hossein Arabalibeik; Rehan Sadiq; Kasun Hewage
Journal:  Med J Islam Repub Iran       Date:  2020-12-22

6.  A Few-Shot U-Net Deep Learning Model for COVID-19 Infected Area Segmentation in CT Images.

Authors:  Athanasios Voulodimos; Eftychios Protopapadakis; Iason Katsamenis; Anastasios Doulamis; Nikolaos Doulamis
Journal:  Sensors (Basel)       Date:  2021-03-22       Impact factor: 3.576

7.  Integrating deep learning CT-scan model, biological and clinical variables to predict severity of COVID-19 patients.

Authors:  Nathalie Lassau; Samy Ammari; Emilie Chouzenoux; Hugo Gortais; Paul Herent; Matthieu Devilder; Samer Soliman; Olivier Meyrignac; Marie-Pauline Talabard; Jean-Philippe Lamarque; Remy Dubois; Nicolas Loiseau; Paul Trichelair; Etienne Bendjebbar; Gabriel Garcia; Corinne Balleyguier; Mansouria Merad; Annabelle Stoclin; Simon Jegou; Franck Griscelli; Nicolas Tetelboum; Yingping Li; Sagar Verma; Matthieu Terris; Tasnim Dardouri; Kavya Gupta; Ana Neacsu; Frank Chemouni; Meriem Sefta; Paul Jehanno; Imad Bousaid; Yannick Boursin; Emmanuel Planchet; Mikael Azoulay; Jocelyn Dachary; Fabien Brulport; Adrian Gonzalez; Olivier Dehaene; Jean-Baptiste Schiratti; Kathryn Schutte; Jean-Christophe Pesquet; Hugues Talbot; Elodie Pronier; Gilles Wainrib; Thomas Clozel; Fabrice Barlesi; Marie-France Bellin; Michael G B Blum
Journal:  Nat Commun       Date:  2021-01-27       Impact factor: 14.919

8.  A Rapid, Accurate and Machine-Agnostic Segmentation and Quantification Method for CT-Based COVID-19 Diagnosis.

Authors:  Longxi Zhou; Zhongxiao Li; Juexiao Zhou; Haoyang Li; Yupeng Chen; Yuxin Huang; Dexuan Xie; Lintao Zhao; Ming Fan; Shahrukh Hashmi; Faisal Abdelkareem; Riham Eiada; Xigang Xiao; Lihua Li; Zhaowen Qiu; Xin Gao
Journal:  IEEE Trans Med Imaging       Date:  2020-08       Impact factor: 11.037

9.  Artificial intelligence for the detection of COVID-19 pneumonia on chest CT using multinational datasets.

Authors:  Stephanie A Harmon; Thomas H Sanford; Sheng Xu; Evrim B Turkbey; Holger Roth; Ziyue Xu; Dong Yang; Andriy Myronenko; Victoria Anderson; Amel Amalou; Maxime Blain; Michael Kassin; Dilara Long; Nicole Varble; Stephanie M Walker; Ulas Bagci; Anna Maria Ierardi; Elvira Stellato; Guido Giovanni Plensich; Giuseppe Franceschelli; Cristiano Girlando; Giovanni Irmici; Dominic Labella; Dima Hammoud; Ashkan Malayeri; Elizabeth Jones; Ronald M Summers; Peter L Choyke; Daguang Xu; Mona Flores; Kaku Tamura; Hirofumi Obinata; Hitoshi Mori; Francesca Patella; Maurizio Cariati; Gianpaolo Carrafiello; Peng An; Bradford J Wood; Baris Turkbey
Journal:  Nat Commun       Date:  2020-08-14       Impact factor: 14.919

