Literature DB >> 33519100

FSS-2019-nCov: A deep learning architecture for semi-supervised few-shot segmentation of COVID-19 infection.

Mohamed Abdel-Basset1, Victor Chang2, Hossam Hawash1, Ripon K Chakrabortty3, Michael Ryan3.   

Abstract

The newly discovered coronavirus (COVID-19) pneumonia is presenting major challenges to research in terms of diagnosis and disease quantification. Deep-learning (DL) techniques allow extremely precise image segmentation; yet, they necessitate huge volumes of manually labeled data to be trained in a supervised manner. Few-Shot Learning (FSL) paradigms tackle this issue by learning a novel category from a small number of annotated instances. We present an innovative semi-supervised few-shot segmentation (FSS) approach for efficient segmentation of 2019-nCov infection (FSS-2019-nCov) from only a small number of annotated lung CT scans. The key challenge of this study is to provide accurate segmentation of COVID-19 infection from a limited number of annotated instances. For that purpose, we propose a novel dual-path deep-learning architecture for FSS. Each path contains an encoder-decoder (E-D) architecture to extract high-level information while maintaining the channel information of COVID-19 CT slices. The E-D architecture primarily consists of three main modules: a feature encoder module, a context enrichment (CE) module, and a feature decoder module. We utilize the pre-trained ResNet34 as an encoder backbone for feature extraction. The CE module comprises a newly proposed Smoothed Atrous Convolution (SAC) block and a Multi-scale Pyramid Pooling (MPP) block. The conditioner path takes pairs of CT images and their labels as input and produces a relevant knowledge representation that is transferred to the segmentation path and used to segment new images. To enable effective collaboration between both paths, we propose an adaptive recombination and recalibration (RR) module that permits intensive knowledge exchange between the paths with only a trivial increase in computational complexity. The model is extended to multi-class labeling for various types of lung infections.
This contribution overcomes the limitation of the lack of large numbers of COVID-19 CT scans. It also provides a general framework for lung disease diagnosis in limited data situations.
© 2020 Elsevier B.V. All rights reserved.

Keywords:  COVID-19; CT images; Context fusion; Deep learning; Few-shot segmentation

Year:  2020        PMID: 33519100      PMCID: PMC7836902          DOI: 10.1016/j.knosys.2020.106647

Source DB:  PubMed          Journal:  Knowl Based Syst        ISSN: 0950-7051            Impact factor:   8.038


Introduction

In December 2019, a global health crisis began with the spread of the novel Coronaviridae species called severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and, specifically, the novel Coronavirus Disease (COVID-19) [1]. Over the last few months, the CSSE at Johns Hopkins University has reported 4,733,349 infections and 313,384 deaths in 180 countries around the world (online access: 17 May). The reverse-transcription polymerase chain reaction (RT-PCR) is regarded as the major means of inspecting for COVID-19 infection. However, the lack of equipment and the restrictions of appropriate testing settings limit rapid and precise screening. Additionally, the RT-PCR test has been shown to have high false-negative rates [2]. Radiological imaging methods (such as X-ray and computed tomography (CT)) provide a significant supplement to RT-PCR tests and have shown their efficiency in lung disease diagnosis and quantification [3]. Moreover, several studies show that chest CT analysis results in higher performance (greater sensitivity) in COVID-19 detection compared to RT-PCR [4]. In comparison to X-rays, CT screening has the advantage of a three-dimensional representation of the patient's lung. Recent studies [5] indicate that the distinctive infection indications of ground-glass opacity (GGO) and consolidation can be detected from CT scans. GGO is defined as hazy increased lung attenuation with preservation of bronchial and vascular margins, whereas consolidation is identified as opacification with obscuration of the margins of vessels and airway walls [6]. Therefore, the qualitative assessment of contagion and longitudinal variations in CT images could provide beneficial and substantial information about COVID-19. However, the manual delineation of lung infections is laborious and time-consuming, and the accuracy of infection annotation depends heavily on the knowledge and experience of the radiologist.
There is, therefore, a need for automatic and accurate segmentation techniques that enable rapid screening of COVID-19. Recently, a wide variety of deep-learning approaches has been used for semantic image segmentation. Among them, fully convolutional neural networks (F-CNNs) have shown superior performance on both traditional and medical images [7], [8], [9], [10], [11], [12], [13]. Notwithstanding their great success in image segmentation, F-CNNs require thousands of labeled images for training, and their performance degrades when only a small number of annotated images are available [14]. Consequently, an improved mechanism is required for F-CNN training that enables the segmentation of a new semantic class based on a limited number of labeled images [15]. Such approaches frequently use transfer learning (TL) to transfer the knowledge from pre-trained models to offer an initialization that is later fine-tuned with the new data to adapt to the underlying problem. Yet, such fine-tuning of the pre-trained model is still subject to the overfitting problem and requires a reasonable number of labeled images (at least in the order of hundreds) [16]. In situations where there is very little data (such as with COVID-19), the new class has a limited number of annotated images, so such fine-tuning-based TL usually results in overfitting [17]. Few-shot learning (FSL) is an artificial intelligence technique that effectively enables a model to generalize to an unseen semantic class from a few instances. The primary notion of few-shot learning is driven by an aspect of human learning in which rapid learning of new semantics is possible from a few observations, exploiting the knowledge acquired from prior experience. Few-shot learning has been extensively studied for object detection and image classification, and lately has been used for medical image segmentation.
Performing pixel-wise prediction in such an extremely low-data regime is a highly challenging task, since learning must be conducted from scarce labeled instances: medical experts are required to label the images manually [18]. In this paper, we introduce a novel semi-supervised few-shot segmentation (FSS) approach designed specifically for segmenting volumetric COVID-19 CT scans. The key to attaining this objective is the combination of the recently proposed recombination and recalibration module within the construction of the proposed architecture.

Few-shot segmentation

FSL techniques for image segmentation aim to generalize a model to a newly observed image with limited annotation, using the knowledge learned from various annotated images. The FSL network architecture for image segmentation usually comprises three portions: the conditioner path, the segmentation path, and the interaction blocks. During inference, the model is supplied with a pair of inputs called the support set, which contains a group of images belonging to the semantic class coupled with their corresponding masks. Simultaneously, a group of unlabeled query images is passed into the model to be segmented. Specifically, the support set is fed forward into the conditioner to produce several feature maps within the middle layers of the conditioner path. These maps are referred to as the knowledge representation, since they encompass the critical information essential to performing segmentation. The generated knowledge representation is captured via interaction blocks, which are primarily responsible for passing the pertinent information to the corresponding layers within the segmentation path. Meanwhile, the input query image is fed into the segmentation path, which makes use of the transferred knowledge to produce a segmentation mask. Consequently, the major role of the interaction blocks is to communicate the learned knowledge from the conditioner path to the segmentation path and build a powerful architecture for semantic image segmentation. However, most of the present methods [19], [20], [21] utilize weak interactions between the paths, such as a single interaction module at the end layer of the network [16], [17].
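The episode structure described above can be sketched at a toy level. Everything here is hypothetical scaffolding: the function names, the scalar "knowledge representation", and the thresholding rule are illustrative stand-ins, not the paper's actual networks. The sketch only shows how a support set conditions the segmentation of an unlabeled query.

```python
import numpy as np

def conditioner_path(support_images, support_masks):
    """Hypothetical conditioner: distills the support set into a
    knowledge representation (here, a masked average intensity)."""
    reps = [img[mask.astype(bool)].mean()
            for img, mask in zip(support_images, support_masks)]
    return float(np.mean(reps))  # toy "knowledge representation"

def segmentation_path(query_image, knowledge):
    """Hypothetical segmentation path: uses the transferred knowledge
    to produce a binary mask for the query slice."""
    return (query_image >= knowledge).astype(np.uint8)

# One inference episode: a support set (images + masks) and a query slice.
support_imgs = [np.array([[0.1, 0.9], [0.8, 0.2]]),
                np.array([[0.7, 0.1], [0.2, 0.9]])]
support_msks = [np.array([[0, 1], [1, 0]]),
                np.array([[1, 0], [0, 1]])]
query = np.array([[0.95, 0.05], [0.4, 0.85]])

k = conditioner_path(support_imgs, support_msks)
pred_mask = segmentation_path(query, k)
```

In a real FSS network, `k` would be a set of intermediate feature maps and the interaction blocks would inject it at several layers of the segmentation path rather than at a single point.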

Semantic segmentation

Swift advances in medical imaging equipment such as scanners necessitate efficient lesion segmentation techniques that are capable of segmenting the entire infection region and discriminating between relevant interior diseased lesions. Specifically, this requires that segmentation approaches be able to learn more thorough features of the various types of infection lesions, which are usually only minor portions of CT images, with an irregular appearance and an intensity comparable to normal areas. Even though a variety of DL models have offered good solutions for automated lesion segmentation, their lesion segmentation performance requires two crucial enhancements: (1) expanding the receptive field to learn extra features by stacking several convolutions and pooling operations must not result in a layer-by-layer decrease in the resolution of the feature maps, which would cause the loss of fractional and small features of the lesion; and (2) owing to the diverse sizes of COVID-19 lesions in CT images, the DL technique must be able to segment lesions at a range of scales. To address these issues, dilated (atrous) convolutions have recently been employed for capturing multi-scale information in segmentation networks using atrous spatial pyramid pooling (ASPP) [22]. This primarily has two motives. First, atrous convolution samples the incoming input data at fixed, regular intervals to calculate the output feature map. Second, nearby context can be a beneficial form of supplementary information for differentiating diverse tissues, including both infected and uninfected regions.
Despite the efficiency of atrous convolution in capturing multi-scale semantic representations, incorporating it into a segmentation model has two drawbacks that degrade segmentation performance [23], [24]: (1) local information loss, since its kernel performs only partial sampling on nine pixel positions and neglects the pixel values at the in-between sites; and (2) the gridding-artifacts problem [25].
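The sparse-sampling behavior behind both drawbacks is easy to make concrete. The sketch below (a minimal illustration, not from the paper) lists which 1-D input indices a 3-tap kernel with dilation 2 actually reads, and which in-between sites it skips entirely:

```python
def dilated_positions(center, kernel_size=3, dilation=2):
    """Input positions sampled by a dilated (atrous) kernel around `center`.
    With dilation > 1 the in-between sites are skipped entirely, which is
    the root of the gridding-artifacts problem described above."""
    half = kernel_size // 2
    return [center + dilation * off for off in range(-half, half + 1)]

# A 3-tap kernel with dilation 2 centred at index 4 reads indices 2, 4, 6;
# indices 3 and 5 never contribute to this output unit.
sampled = dilated_positions(4, kernel_size=3, dilation=2)
skipped = [i for i in range(min(sampled), max(sampled) + 1) if i not in sampled]
```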

Challenges and goals

The current computer vision literature for few-shot segmentation (FSS) employs TL from pre-trained models in both paths to effectively segment RGB images [26]. TL models enable effective exploitation of prior knowledge, with informative features available from the beginning of training. Accordingly, adopting a weak interaction module between the conditioner and segmentation paths (i.e., at an earlier or later layer) is adequate to train such models efficiently. However, extending this learning technique to medical images has not realized satisfactory performance due to the absence of DL models pre-trained on medical data, which limits the performance gains realized by TL in medical domains. Hence, we introduce a robust interaction module that enables knowledge communication at several intermediate locations between the paths while simplifying the gradient flow through the two paths. In view of this, we propose the RR module to communicate the learned representation between the two paths of the FSS network. The module receives the extracted conditioner feature maps as input and performs concurrent spatial and channel squeezing to learn from the feature maps of the conditioner path. This is used to accomplish excitation on the corresponding feature map of the segmentation path. A common shortcoming of most U-Net-like networks is that the strided convolutions and successive pooling layers gradually decrease the representational resolution in order to learn compressed feature representations. Although this behavior is valuable for object detection or classification, it hinders the segmentation task, which necessitates comprehensive spatial representation. Intuitively, maintaining high-resolution feature maps at the intermediate stages can enhance the performance of the segmentation model. Nevertheless, it raises the dimension of the feature maps, which slows the training operation and complicates the optimization process.
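A minimal numeric sketch of this squeeze-and-excite style interaction, assuming a plausible form (global-average squeezes followed by sigmoid gates and a max-based recombination); the paper's exact RR formulation may differ:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rr_interaction(cond_feat, seg_feat):
    """Assumed RR-style interaction: squeeze the conditioner feature map
    along spatial and channel axes, then use the resulting gates to
    recalibrate (excite) the segmentation feature map.
    cond_feat, seg_feat: (C, H, W) arrays of matching shape."""
    # Channel squeeze: global average over space -> one gate per channel.
    channel_gate = sigmoid(cond_feat.mean(axis=(1, 2)))   # shape (C,)
    # Spatial squeeze: average over channels -> one gate per pixel.
    spatial_gate = sigmoid(cond_feat.mean(axis=0))        # shape (H, W)
    # Recombine: keep the stronger of the two recalibrations per element.
    channel_out = seg_feat * channel_gate[:, None, None]
    spatial_out = seg_feat * spatial_gate[None, :, :]
    return np.maximum(channel_out, spatial_out)

cond = np.random.default_rng(0).normal(size=(8, 4, 4))
seg = np.ones((8, 4, 4))
out = rr_interaction(cond, seg)
```

Because the gates come from the conditioner path, the segmentation path is modulated by support-set knowledge at every level where such a block is inserted, which is what allows the strong inter-path interaction described above.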
Hence, there is a trade-off between high resolution and training speed. In general, the U-Net is shaped with an E-D structure, where the encoder seeks to progressively minimize the spatial size of the feature maps and acquire more complex semantic representations, while the decoder seeks to retrieve both the details of the segmentation target (i.e., the lesion) and the spatial size. Thus, it is necessary to learn more advanced representations in the encoder and maintain more spatial representation in the decoder to ensure optimal segmentation performance. Inspired by the discussion above and by Inception-Net [27], in which the network gets deeper and wider, we propose a novel smoothed atrous convolution (SAC) module. Unlike traditional U-Net architectures, which are limited to learning multi-scale representations via 3 × 3 convolutions and pooling layers during encoding, the proposed SAC can learn and extract a deeper and wider range of semantic representations using four parallel paths of multi-scale smoothed atrous convolutions, while residual links are employed to avoid gradient-vanishing issues. Additionally, we introduce a multi-scale pyramid pooling (MPP) module inspired by spatial pyramid pooling [28]. The MPP module further learns multi-scale contextual representations of the SAC module output by employing pooling layers of varied sizes, without requiring any additional learning parameters. Integrating these two modules in the middle of the E-D architecture helps gain greater improvement and reserves extra spatial representation to enhance segmentation performance. The generated E-D architecture is used to build the segmentation and conditioner paths. Furthermore, to avoid the overfitting problem and gain better generalization, we train our model using SSL by incorporating unlabeled CT slices during training.
Although most current studies on FSS concentrate on volumetric images with multiple annotated slices, we focus on axial scans of COVID-19. It is time-consuming to manually annotate the lung nodules or infection regions on all slices of the CT images of COVID-19 patients. Therefore, we introduce a novel technique, called FSS-2019-nCov, which is able to accurately pair a limited number of annotated COVID-19 support slices with all the slices of the query set.

Contributions

The primary contributions of our paper are:
- A novel COVID-19 segmentation technique based on FSS that enables better generalization from a small number of annotated CT slices in both binary and multi-class scenarios.
- A SAC block and an MPP block for efficient exploitation of high-level contextual and spatial information, which assist in overcoming the problem of infection size variation. Both blocks are integrated within the encoder-decoder architecture that is adapted to form the conditioner and segmentation paths.
- Adaptive feature recombination and recalibration (RR) modules that effectuate knowledge representation interaction between the two paths.
- A resultant increase in generalization performance achieved by semi-supervised training of the proposed FSS-2019-nCov.

Paper organization

The remainder of the paper is structured as follows. Section 2 reviews the current related studies. Detailed explanations and information corresponding to our proposed frameworks and principles incorporated are presented in Section 3. Proposed experimental conditions, comparison studies, and comprehensive analysis are provided in Section 4. Finally, the conclusions and intended future directions are explained in Section 5.

Related work

In this section, three kinds of studies related to our work are discussed: chest CT segmentation, semi-supervised learning, and COVID-19 segmentation.

Chest CT segmentation

The CT scan is a prevalent diagnostic tool for lung diseases [6]. Practically, segmenting a variety of lesions from chest CT images supplies clinicians with substantial information for lung disease diagnosis and quantification [29]. Several studies achieve chest nodule segmentation using a feature extractor accompanied by a classifier. For instance, Kumar et al. [30] introduced a new supervised CNN to fuse complementary multi-modality information from lung cancer scans. Ozdemir et al. [31] addressed lung cancer diagnosis using a 3D probabilistic DL approach for nodule segmentation and diagnosis while modeling prediction uncertainty. Gerard et al. [32] proposed a coarse-to-fine cascade of two CNNs to reduce the impact of thin structures on the segmentation network. Jiang et al. [33] developed two residual networks to concurrently combine features across several resolutions and levels to detect lung tumors. Additionally, Cheplygina et al. [34] reviewed recent studies of semi-supervised techniques for medical imaging tasks. While such approaches are extremely successful in data-intensive problems, they are often obstructed by very small datasets and suffer from low generalization capability, making them inefficient for the underlying task of COVID-19 segmentation. To tackle these issues, we propose an FSL-based approach that enables learning from limited data.

Few-shot segmentation (FSS)

Recently, many studies have explored FSS with deep learning. Caelles et al. [19] performed video segmentation using the first-frame annotation, based on the notion of tuning pre-trained architectures. Even though their model operates effectively in this scenario, it is subject to overfitting and necessitates retraining to adopt a new class, which hampers the swiftness of adaptation. Shaban et al. [16] introduced a two-step approach, where the first step processes the new image-label pair to infer the classification parameters for the second step, which receives a query image and predicts the corresponding segmentation. Dong et al. [20] improved this approach to address numerous unidentified classes and perform multi-class segmentation simultaneously. Rakelly et al. [21] applied the approach in a more difficult scenario, choosing a tiny set of landmarks to provide the supervision of the support set rather than using a densely annotated binary mask. The training process in the aforementioned studies relies on TL models. Despite the effectiveness of TL in many computer vision studies, there are no comparable pre-trained architectures in medical imaging. In the medical imaging domain, FSS was first proposed in [35]; the authors used adversarial learning for brain image segmentation based on one or two annotated brain images, inspired by the earlier success of SSL. Zhao et al. [36] exploited learned transformations to heavily augment a fully labeled volume for one-shot segmentation. Roy et al. [18] introduced a two-stage model and applied the recently proposed squeeze-and-excite modules to empower the knowledge exchange between both paths and smooth the gradient flow. However, these studies suffer from a number of shortcomings. First, they rely on the assumption that every shot is a complete 3D image that comprises many 2D slices.
Second, they construct huge architectures without analyzing the effectiveness of every building block, which results in complex and potentially inconsistent models. Finally, they consider neither contextual information nor multi-scale features. Motivated by this, our study investigates the role of unsupervised data in the process of segmenting COVID-19 CT scans in an FSL scenario. Predominantly, we build on the success of FSS studies on natural images. To further boost the performance of the proposed FSS-2019-nCov, we leverage unannotated axial CT slices as a supervisory signal; incorporating unannotated slices into auxiliary tasks has been used to improve the generalization capabilities of deep learning approaches in many studies.

Semi-supervised learning

Owing to the challenge of finding entirely annotated data, semi-supervised learning (SSL) has been attracting much attention as a way to enhance network performance using a small amount of annotated data and a very large amount of unlabeled data [34]. SSL has been widely adopted for training deep models, which seek to optimize a supervised loss on labeled data and an unsupervised loss on unlabeled data [37], [38]. Fan et al. [39] proposed using a weighted Intersection-over-Union (IoU) loss for edge supervision and a cross-entropy loss for segmentation supervision. In a nutshell, current deep SSL models regularize the network by imposing fine-grained and reliable classification boundaries, which are robust to random disturbance. Other approaches enhance the supervision signals by exploiting the acquired knowledge, such as pseudo-labels and temporal ensemble dependency [40]. Inspired by the recent success of SSL architectures in the studies mentioned above, we adopt SSL in model training to attain better generalization and avoid the overfitting effects that may be incurred with pre-trained models.

COVID-19 segmentation

Recently, artificial intelligence has been widely adopted in multiple applications applied to COVID-19 detection [41]. These applications can be categorized into three groups [42]: patient-level applications, concerned with medical image analysis tasks (e.g., segmentation, classification, and quantification); molecular applications, dedicated to protein structure (e.g., protein interactions, drug repurposing, etc.); and societal applications, concentrated on epidemiology-related tasks. In this paper, we focus on patient-level applications, specifically those that depend on CT scans. For example, Wang et al. [43] introduced an adapted inception network to classify COVID-19 patients from normal cases by training the network on ROIs delineated by two experienced radiologists according to the characteristics of pneumonia. Chen et al. [44] accumulated 46,096 slices of CT volumes from confirmed COVID-19 cases and other disease cases; the collected CT slices were used to train a U-Net++ [12] to identify COVID-19 cases, and their results demonstrate that the model diagnoses COVID-19 as well as radiologists. Additionally, other models have been proposed to act as AI-assisted systems for COVID-19 diagnosis, including ResNet [45], [46] and U-Net [39], [43]. Moreover, deep learning has been utilized for infection segmentation in lung CT scans, and the obtained quantitative outcomes can be exploited to assess disease severity [47], quantify infection [3], and screen infection at large scale [48]. All of the studies mentioned above assume a large amount of data with which to train their models in a supervised manner, but the lack of annotated CT scans for COVID-19 means that such approaches lack utility. Fan et al. [39] were the first to tackle this problem using a semi-supervised learning scheme; however, they first segment infection regions and then use them to guide the multi-class segmentation, which results in suboptimal performance.
This motivates us to use FSS to enable better generalization from small data samples using the newly proposed context encoder-decoder architecture, efficiently exchanging this knowledge with the segmentation path via the proposed RR module. We also boost model generalization through semi-supervised training that incorporates unlabeled CT slices.

Proposed approach

In this section, we present a detailed explanation of the proposed FSS-2019-nCov in terms of network architecture, network building blocks, and cost function. Then, we introduce the semi-supervised variant of FSS-2019-nCov and clarify how an SSL paradigm can be used to increase the number of training instances and improve segmentation performance. In addition, we extend our model to multi-label classification for a variety of lung infections. Finally, we present the implementation details.

Problem formulation for FSS

In the infection segmentation scenario, the training data for FSS-2019-nCov contains N pairs of CT axial scans and their respective annotation maps. In the multi-class scenario, every semantic label k ∈ {1, 2, …, K} has an annotation map, where K is the number of training classes (e.g., in multi-class COVID-19 segmentation, labels 1, 2, and 3 represent GGO, consolidation, and background, respectively). FSS-2019-nCov learns an objective function such that, given a support set for a new semantic label and a query slice, the COVID-19 infection segmentation of the query is predicted. There is no intersection between the semantic labels of the training and testing data. The most remarkable aspect of FSS is that test classes exist in the training data as the background class. This could be exploited as a form of prior knowledge during testing in cases where scarce instances are provided with the infection annotated.
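For illustration only, the formulation can be mirrored in code: the training and testing label sets are disjoint, and each test-time episode pairs a small annotated support set with an unlabeled query slice. All names and the toy dataset below are hypothetical.

```python
import random

# Hypothetical class split for illustration: training and testing
# semantic labels must be disjoint, as required by the FSS formulation.
TRAIN_CLASSES = {"background", "lung", "effusion"}
TEST_CLASSES = {"GGO", "consolidation"}

def sample_episode(dataset, cls, n_support=1, seed=0):
    """Build one FSS episode for class `cls`: a small support set of
    (slice, mask) pairs plus one held-out query slice (mask withheld)."""
    rng = random.Random(seed)
    pool = [ex for ex in dataset if ex["label"] == cls]
    picked = rng.sample(pool, n_support + 1)  # distinct examples
    support = picked[:n_support]              # annotated support set
    query = picked[n_support]["slice"]        # unlabeled query slice
    return support, query

dataset = [{"label": "GGO", "slice": f"ct_{i}", "mask": f"m_{i}"}
           for i in range(5)]
support, query = sample_episode(dataset, "GGO", n_support=2)
```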
Fig. 4

The architecture of the proposed MPP module containing five parallel paths for changing input resolution. Convolution layers are employed to capture different resolution information. The global average pooling (GAP) layer is employed to implement the residual connection.

FSS-2019-nCov architecture

As previously stated, the architecture of FSS-2019-nCov comprises three modules: the conditioner path, the adaptive interaction module, and the segmentation path. The conditioner path learns the visual information of the support set to infer the infection on the query slice. The adaptive interaction module effectively conveys the learned representation, in terms of feature maps, to the segmentation path. The segmentation path makes use of the acquired representation to segment the query slice. Fig. 1 shows the detailed architecture of the proposed FSS-2019-nCov, which is further described in the following subsections. In FSS-2019-nCov, both the conditioner and segmentation paths have an identical layout. In this way, the feature maps in each path have the same spatial resolution, which facilitates and empowers the interaction between corresponding blocks and eases gradient flow.
Fig. 1

The architecture of the proposed FSS-2019-nCov. It consists of two identical paths with the encoder-decoder structure, namely the conditioner path (upper) and the segmentation path (lower). The recombination and recalibration (RR) blocks (see Fig. 4) are introduced to effectuate knowledge interaction between the two paths. The axial CT images are passed through a feature encoder blocks (E) module that is implemented with the pre-trained ResNet-34 blocks. The context enrichment module is then introduced to generate an improved semantic representation using the SAC and MPP modules. Finally, the acquired representations are passed into the feature decoder blocks (D).

E-D architecture of conditioner and segmentation paths

The conditioner path has an encoder-decoder-based architecture consisting of four encoder blocks based on pre-trained Res2Net [49], four decoder blocks, and a context enrichment (CE) module (see Fig. 1). Feature encoding: Recently, the Res2Net [49] architecture has shown great success in many computer vision tasks and has validated its effectiveness over other residual architectures owing to its multi-scale feature extraction capability, which enables fine-grained representations at every network layer. Motivated by this, we implement the encoder path using Res2Net-50 as the backbone architecture. The structure of the encoder (Res2Net) module is presented in Fig. 2(a); its multi-scale processing enables learning more representative information from the input CT images, and the residual links facilitate network convergence and avoid the gradient-vanishing problem. From the input image, the E blocks acquire the global representation of the target entity (i.e., lesions) and the relevant class properties of the target's parts [38], [44]. Nevertheless, these kinds of representations might gradually weaken as they are transmitted to deeper levels [34]. Thus, we introduce the CE module to tackle this issue, as presented in Fig. 1.
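The multi-scale mechanism that distinguishes Res2Net from a plain residual block can be sketched with shapes alone. In the toy version below the 3 × 3 convolutions are replaced by identity additions, so it only demonstrates the channel split and the hierarchical reuse of the previous group's output; it is an assumption-level sketch, not the pre-trained backbone itself.

```python
import numpy as np

def res2net_block(x, scales=4):
    """Toy Res2Net-style multi-scale block (shapes only, identity 'convs').
    Channels are split into `scales` groups; each group after the first is
    processed together with the output of the previous group, giving
    progressively larger receptive fields within a single block."""
    groups = np.split(x, scales, axis=0)   # split along the channel axis
    outs = [groups[0]]                     # first group passes through
    prev = np.zeros_like(groups[0])
    for g in groups[1:]:
        prev = g + prev                    # stand-in for conv3x3(g + prev)
        outs.append(prev)
    y = np.concatenate(outs, axis=0)       # re-assemble all channel groups
    return y + x                           # residual connection

x = np.arange(8 * 2 * 2, dtype=float).reshape(8, 2, 2)
y = res2net_block(x, scales=4)
```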
Fig. 2

Illustration of the encoder and decoder modules used in the proposed FSS-2019-nCov: (a) the encoder module implemented using Res2Net module [49]; and (b) the architecture of the decoder module.

Context Enrichment: The CE module is introduced to learn semantic context representations and hence provide more informative feature maps. It contains two blocks: the smoothed atrous convolution (SAC) block and the multi-scale pyramid pooling (MPP) block. Smoothed atrous convolution: The typical convolutional layer is widely adopted for feature extraction in many semantic segmentation tasks [22]. Nevertheless, it still suffers from semantic information loss caused by pooling layers. To tackle this shortcoming, atrous convolution has been used in many segmentation tasks [50]. However, atrous convolution (with dilation larger than 1) still suffers from the gridding-artifacts problem [25], in which the calculation of neighboring outputs is based on disjoint sets of input units, causing local information discrepancy and degrading network performance. This issue has been tackled by the recently proposed separable and shared convolution (SS-Conv) [25]. In an attempt to capture the multi-level information learned through the encoder, we propose the SAC block presented in Fig. 3, in which we stack SS-Conv layers in the form of four cascaded tracks with receptive fields of 3, 7, 9, and 19, achieved by gradually increasing the number of stacked SS-Conv layers from track to track.
Inspired by the inception module [51], we attach a 1 × 1 SS-Conv with activation at the end of each track. Finally, we concatenate the outputs of the four tracks with the original feature maps to form the output of the SAC block. SS-Convs with a large receptive field effectively capture and produce more detailed information for large infection areas, whereas SS-Convs with a small receptive field are better for small infection areas. By integrating atrous SS-Convs of various atrous rates, the SAC enables efficient feature extraction for infections of various sizes.
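The receptive-field growth along the four tracks follows the standard stacking rule for stride-1 convolutions. The helper below computes it; the specific kernel/dilation schedules are our own guesses, used only to show that, for example, a receptive field of 19 is reachable with four stacked 3 × 3 layers.

```python
def stacked_receptive_field(layers):
    """Receptive field of a stack of stride-1 convolutions.
    Each layer with kernel size k and dilation d enlarges the
    receptive field of one output unit by (k - 1) * d."""
    rf = 1
    for k, d in layers:
        rf += (k - 1) * d
    return rf

rf_two_plain = stacked_receptive_field([(3, 1), (3, 1)])   # two plain 3x3 convs
rf_single_dilated = stacked_receptive_field([(3, 3)])      # one 3x3 conv, dilation 3
# One hypothetical schedule reaching the largest track size of 19:
rf_deep_track = stacked_receptive_field([(3, 1), (3, 2), (3, 3), (3, 3)])
```

The comparison between `rf_two_plain` and `rf_single_dilated` also illustrates why atrous layers are attractive: a single dilated convolution sees a wider window than two stacked plain ones, at a fraction of the parameter cost.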
Fig. 3

The architecture of the proposed SAC module consisting of four parallel paths. Each path from left to right contains 1, 2, 3, and 4 separable and shared convolutions, respectively.

Multi-scale pyramid pooling: The most challenging issue in infection segmentation is the wide variety of infection sizes in medical scans. For instance, the size of a GGO in the middle or late stage can be much larger than in the early stage of COVID-19 infection [5], [6]. To tackle this problem, we propose multi-scale pooling layers that rely on several effective fields of view to distinguish infections of various sizes, as shown in Fig. 4. Unlike [28], MPP takes the incoming feature maps and passes them to four paths that alter their resolution using a pooling layer (average or max), so that the resolution at each path is decreased to 1/2, 1/4, or 1/8 of the corresponding input. Then, a 3 × 3 convolution is employed to extract and learn multi-scale contextual representations. Additionally, we redesign the residual connection of [28] to be implemented with global average pooling (GAP). Unlike [28], the MPP module can capture extra contextual information from the received input because the average pooling operation processes input maps at the regional level instead of the point level. For example, given a 64 × 64 input map, decreasing its resolution to 1/8 creates a new map of 16 (i.e., 4 × 4) points, of which a 3 × 3 convolution can capture nine, thereby increasing the information-consumption ratio. Thus, applying such a pooling layer enables most of the input map values to contribute to the output map of the MPP module. Additionally, reducing the input map resolution decreases the computational burden and logically increases time efficiency compared with [28]. Further, the use of non-dilated convolutions in the MPP module also helps avoid the gridding-artifacts problem [25]. After the convolution layer, the low-resolution feature map is up-sampled using bilinear interpolation to obtain a feature map of the same size as the input feature map.
Furthermore, similar to SAC, the up-sampled feature maps are concatenated with the input feature map. Finally, the concatenated map is passed into a 1 × 1 convolution to generate the final output of the MPP module. An optimal-parameter grid search showed that the stride should be set to 2, 4, 6, and 8, corresponding to kernel dimensions of 2, 4, 6, and 8, respectively. Feature Decoding: To restore high-resolution feature representations rapidly and efficiently, four simple D blocks are employed to form the decoder path. The main purpose of the decoder is to restore the spatial representation from the sophisticated features generated by the CE module while progressively fusing the global contextual information. The architecture of the decoder blocks presented in Fig. 2(b) contains a 3 × 3 de-convolution, followed by a sampling layer that reduces the number of network parameters. The output of a D block is obtained after a 1 × 1 convolution. The generated map of the last D block is directly up-sampled to the same dimensions as the original image. The D blocks have 64, 128, 256, and 512 filters, sequentially.
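The multi-scale pooling scheme of the MPP block described above can be sketched in a few lines. This is an illustrative reduction, not the authors' code: the 3 × 3 convolutions are omitted, and nearest-neighbor upsampling stands in for the bilinear interpolation used in the paper.

```python
def avg_pool(x, k):
    """k x k average pooling with stride k on a 2D map (nested lists)."""
    h, w = len(x) // k, len(x[0]) // k
    return [[sum(x[i * k + di][j * k + dj] for di in range(k) for dj in range(k)) / (k * k)
             for j in range(w)] for i in range(h)]

def upsample_nearest(x, k):
    """Nearest-neighbor upsampling by an integer factor k."""
    return [[x[i // k][j // k] for j in range(len(x[0]) * k)]
            for i in range(len(x) * k)]

def mpp(x, factors=(2, 4, 8)):
    """Return the input map plus one pooled-and-restored map per scale,
    all at the input resolution, ready for channel-wise concatenation."""
    return [x] + [upsample_nearest(avg_pool(x, k), k) for k in factors]

maps = mpp([[float(i * 8 + j) for j in range(8)] for i in range(8)])
# Every branch output matches the input's spatial size:
assert all(len(m) == 8 and len(m[0]) == 8 for m in maps)
```

The regional averaging is what lets a subsequent small convolution "see" most of the original input values, as argued in the text.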

Conditioner path

The main task of the conditioner path is to take as input the support set, consisting of a CT slice and its mask, and pass it through the proposed encoder-decoder architecture to learn a visual representation that is used to generate informative task-specific feature maps and enable detection of the area to be segmented in the query slice in the segmentation path. In this paper, the feature maps of the middle layers of the conditioner path are referred to as the knowledge representation. The conditioner path has a two-channel input formed by stacking the support slice and its mask.

Adaptive interaction module

The interaction module plays an essential role in the proposed FSS-2019-nCov. It consists of multiple interaction blocks that take the knowledge representation generated by the conditioner path as input and transfer it to the segmentation path to conduct the query-slice segmentation. The essential characteristics of these blocks are: (1) a slight increase in the computational complexity of the model; (2) improved gradient flow, which facilitates model training; and (3) adaptive exploitation of channel-wise relationships. For this purpose, we introduce a modified version of the recently proposed feature recalibration block (SegSE) and combine it with the feature recombination block [52] to obtain the recalibration and recombination (RR) module presented in Fig. 5(c). SegSE blocks are computational blocks that achieve adaptive recalibration of feature maps, acting as a channel-wise attention mechanism that improves the discriminative power of the generated feature maps with a marginal increase in model complexity.
Fig. 5

The architecture of the RR module: (a) illustration of the recalibration block implemented using separable and shareable convolution; (b) illustration of the recombination block; and (c) integration of both recalibration and recombination in a single module.

Recalibration Module: Since there is a spatial correspondence between the segmentation pixels/voxels and the units of the feature maps, applying a channel squeeze-and-excitation (SE) operation [53] can suppress entire feature maps that may encompass significant regions. To address this, we use a spatially adaptive variant of SE (SegSE) that enables concurrent spatial and channel SE, which is more appropriate for COVID-19 semantic segmentation. The architecture of the SegSE block is presented in Fig. 5(a). The spatial structure and correspondence of the feature maps are preserved by replacing the global average pooling of the SE block with a layer that captures large-scale contextual information through dilated kernels operating over adjacent voxels, without increasing the kernels' parameters. Assume the convolution layer performs a transformation that maps an input of height H, width W, and C feature maps to an output of height H', width W', and C' feature maps. A squeezed feature map is then obtained using Eqs. (1), (2), in which the transformation denotes batch normalization followed by the activation function, k is the kernel size, d is the dilation rate (determined by the scale of the layer), and the number of kernels is reduced by a reduction factor r. Hence, increasing the number of layers increases the field of view, so the units of the feature maps represent a wider area of the input space. After that, to obtain the recalibration feature maps, a convolutional layer with 1 × 1 kernels operates on the squeezed maps, and its output is fed into a sigmoid function, as formulated in Eq. (3). We thus integrate the squeeze-and-excitation operation, since the dilated layer decreases the number of feature maps, forming a bottleneck.
Finally, element-wise multiplication with the input yields the recalibrated feature maps; the recalibration of a given feature map is calculated with Eq. (4), and the overall operation of the SegSE block follows accordingly. Recombination Module: The main purpose of recombination is to strengthen the representativeness of the features by linearly combining them (see Fig. 5(b)). Accordingly, we utilize a convolutional layer with a kernel size of 1 × 1: the feature maps are first expanded by an expansion factor and then compressed back to the original number of channels. The recombination operation is formulated in Eq. (5).
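A schematic, purely illustrative rendering of the two operations may help: recalibration gates each channel by a sigmoid factor, and recombination mixes channels linearly, as a 1 × 1 convolution does. The gate logits and mixing weights below are placeholders for what the network would learn; the spatially adaptive (dilated) squeeze of SegSE is abstracted away.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def recalibrate(feats, gate_logits):
    """Channel-wise recalibration: scale each feature map by sigmoid(gate).
    feats is a list of 2D maps (channels); one gate logit per channel."""
    return [[[v * sigmoid(g) for v in row] for row in fmap]
            for fmap, g in zip(feats, gate_logits)]

def recombine(feats, weights):
    """1x1 convolution across channels: each output channel is a linear
    combination of all input channels (expansion/compression is just the
    choice of how many weight rows to use)."""
    h, w = len(feats[0]), len(feats[0][0])
    return [[[sum(wc * feats[c][i][j] for c, wc in enumerate(w_row))
              for j in range(w)] for i in range(h)]
            for w_row in weights]

chans = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
# A gate logit of 0 halves a channel, since sigmoid(0) = 0.5:
assert recalibrate(chans, [0.0, 0.0])[0][0][0] == 0.5
# Identity mixing weights return the input channels unchanged:
assert recombine(chans, [[1.0, 0.0], [0.0, 1.0]]) == chans
```

In the actual RR module the gates come from the dilated SegSE branch and the mixing weights from a learned expand-then-compress pair of 1 × 1 convolutions.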

Segmentation path

The main target of the segmentation path is to segment the input query slice utilizing the knowledge representation acquired from the conditioner path, which contains high-level informative features about the previously unseen query slice. The SegSE blocks within the interaction modules compress the feature maps of the intermediate layers of the conditioner and then perform cross-channel feature recalibration on the feature maps of the segmentation path. The architecture of the segmentation path is symmetric to the conditioner path, with two main variations: (1) unlike the segmentation path, no interaction blocks are placed after the encoding and decoding modules of the conditioner path; and (2) the segmentation path ends with a final classification block whose layer produces the output segmentation maps, which are subsequently fed into a final activation function to infer the infection segmentation of the query slice.

Semi-supervised training

Currently, only a small number of annotated CT images for COVID-19 patients are available. The manual segmentation of the lung area and COVID-19 lesions is laborious and time-consuming, and most studies focus on studying the virus itself and finding the best inhibitor. To tackle this data-limitation problem, we propose to train FSS-2019-nCov in a semi-supervised manner, in which the widely available unannotated CT images are exploited to augment the training data. Motivated by recent studies [39], [54], [55], a random sampling mechanism is used to gradually expand the CT training data with unannotated CT images. Algorithm 1 is employed to estimate and generate the pseudo labels corresponding to the unannotated CT images. The resulting CT scans, along with their pseudo labels, are subsequently used to train the proposed FSS-2019-nCov. Semi-supervised training of the proposed FSS-2019-nCov has several benefits, summarized as follows. First, the training and selection technique is straightforward and easy to implement. Second, it is threshold-free and does not require measures to evaluate the predicted annotations. Third, it helps avoid overfitting and can provide more robust performance than other semi-supervised training approaches, as demonstrated by recently published studies [39], [54], [55].
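Algorithm 1 itself is not reproduced here, but the random-sampling expansion it describes can be sketched roughly as follows. The `train` and `predict` callables are placeholders for the actual FSS-2019-nCov training and inference steps; the round count and batch size are illustrative assumptions.

```python
import random

def semi_supervised_expand(labeled, unlabeled, train, predict,
                           rounds=3, per_round=2, seed=0):
    """Gradually expand the training pool with pseudo-labeled CT slices.

    labeled:   list of (slice, mask) pairs
    unlabeled: list of slices without masks
    train:     callable(pool) -> model
    predict:   callable(model, slice) -> pseudo mask
    """
    rng = random.Random(seed)
    pool = list(labeled)
    remaining = list(unlabeled)
    model = train(pool)
    for _ in range(rounds):
        if not remaining:
            break
        batch = rng.sample(remaining, min(per_round, len(remaining)))
        # Pseudo-label the sampled slices with the current model and fold
        # them into the training pool (threshold-free, as stated above).
        pool.extend((x, predict(model, x)) for x in batch)
        remaining = [x for x in remaining if x not in batch]
        model = train(pool)
    return model, pool

# Toy run with stand-in callables ("model" is just the pool size here):
model, pool = semi_supervised_expand(
    labeled=[("slice0", "mask0")],
    unlabeled=["slice1", "slice2", "slice3"],
    train=lambda p: len(p),
    predict=lambda m, x: "pseudo-" + x,
)
assert len(pool) == 4  # 1 labeled + 3 pseudo-labeled
```

The key property mirrored here is that no confidence threshold gates which pseudo-labels are admitted; selection is purely by random sampling.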

Model training methodology

We train our model using the training mechanism adopted in [16], [18], where a batch sampler randomly selects a mini-batch that is subsequently used for model training. As opposed to traditional supervised training, we implement the following steps for picking the samples of a mini-batch in every iteration. First, a semantic label is randomly selected. Second, two CT slices and their corresponding label maps are randomly sampled, such that both contain the selected semantic label. Third, the label maps are binarized so that the selected label becomes the foreground and the remaining areas become the background. Fourth, the two pairs respectively establish the support set and the query set, where the query mask serves as the ground truth (GT) for calculating the loss. To sum up, FSS-2019-nCov takes the two pairs as a training batch: the support pair is combined into a two-channel input to the conditioner path, while the query slice is used as the input of the segmentation path. Both inputs pass through the two paths of the model in a feed-forward manner, seeking to predict the segmentation of the query slice for the selected label. The Dice loss [56] between the prediction and the GT is calculated using Eq. (6), where x ranges over the pixels of the prediction map. To reduce the loss, the batch sampler offers different instances belonging to diverse labels; the loss is calculated for the particular selected label, and the weights are subsequently updated. The continuous alteration of the inputs at each iteration makes the model converge, so the prediction becomes agnostic to the selected label. We train FSS-2019-nCov to minimize the segmentation loss on annotated slices only. Simultaneously, to leverage the unannotated CT slices, we employ an auxiliary manifold embedding loss on the latent feature representations of both labeled and unlabeled samples to diminish the discrepancy between similar inputs in the latent space [57]; the similarity among unlabeled CT slices is specified by prior knowledge.
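To make the per-iteration steps concrete, here is a minimal sketch of the label binarization (step three) and a soft Dice loss in the spirit of Eq. (6); the epsilon smoothing term is a common implementation detail assumed here, not taken from the paper.

```python
def binarize(label_map, cls):
    """Step three: the selected semantic label becomes foreground (1),
    all remaining areas become background (0)."""
    return [[1.0 if v == cls else 0.0 for v in row] for row in label_map]

def dice_loss(pred, gt, eps=1e-6):
    """Soft Dice loss between a prediction map and the binary GT:
    1 - 2*|P∩G| / (|P|+|G|), a standard form of the Dice loss [56]."""
    inter = sum(p * g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    total = sum(p for r in pred for p in r) + sum(g for r in gt for g in r)
    return 1.0 - (2.0 * inter + eps) / (total + eps)

labels = [[0, 2, 2], [0, 1, 2]]
fg = binarize(labels, 2)
assert fg == [[0.0, 1.0, 1.0], [0.0, 0.0, 1.0]]
assert dice_loss(fg, fg) < 1e-6  # perfect prediction -> near-zero loss
```

In training, `pred` would be the soft output of the segmentation path for the query slice and `fg` the binarized query mask.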
The final objective function can be formulated using Lagrangian multipliers, as shown in Eq. (7), where the regularization parameters weight the embedding loss at each hidden layer. Naturally, this loss function seeks to minimize the distance between the hidden representations of similar, adjacent data samples and otherwise pushes them away from each other. Furthermore, through extensive experiments we tried different training parameters to find the optimal configuration for our model, and we obtained the highest performance using the parameters shown in Table 1.
Table 1

Model training parameters.

Parameter                Value
Learning rate            0.001
Weight decay constant    0.0001
Momentum                 0.9
No. of epochs            50
Iterations per epoch     300
Optimizer                SGD
Balance factor           0.5

Experiments and results

Dataset

Two annotated CT datasets, publicly released by the Italian Society of Medical and Interventional Radiology [58], are employed for model evaluation. The first dataset (CT-1) comprises 110 axial CT slices belonging to 60 patients positively confirmed to have COVID-19. The CT slices were grayscaled, resized, and compiled into a NIfTI file. The size of each slice was set to 512 × 512 pixels. An experienced radiologist annotated the CT slices with three class labels, namely pleural effusion, GGO, and consolidation. We eliminated two images because of their low resolution. We split the CT-1 data into a training set of 38 CT images, a validation set of 20 images, and a test set of 50 images. The second dataset (CT-2) comprises nine CT volumes consisting of 829 slices, among which 373 axial CT slices were annotated and positively confirmed as COVID-19; 638 axial slices (285 lesion-free slices and 353 infected slices) were selected for model evaluation. The annotated CT slices were resized from 630 × 630 to 512 × 512 resolution, as with the CT-1 data. For semi-supervised training, a total of 1600 unannotated axial CT images were collected from the COVID-19 CT dataset [59], comprising 20 CT volumes from distinct COVID-19 patients. The data was then prepared by eliminating non-lung regions to form the unlabeled training set. All slices were preprocessed with an intensity normalization procedure.
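The paper does not specify which intensity normalization procedure was used; per-slice min-max scaling, sketched below, is one common choice (per-slice z-score normalization would be an equally plausible reading).

```python
def normalize_intensity(slice2d):
    """Min-max normalize a CT slice (nested lists) to [0, 1].
    This is an assumed procedure; the paper does not name one."""
    flat = [v for row in slice2d for v in row]
    lo, hi = min(flat), max(flat)
    rng = (hi - lo) or 1.0  # avoid division by zero on constant slices
    return [[(v - lo) / rng for v in row] for row in slice2d]

ct = [[-1000.0, 0.0], [400.0, 1000.0]]  # toy Hounsfield-like values
norm = normalize_intensity(ct)
assert norm[0][0] == 0.0 and norm[1][1] == 1.0
```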

Comparative studies

Baseline architectures. In the infection region segmentation scenario, we compare our model with robust semantic segmentation models, including U-Net [9], H-DenseUNet [11], U-Net++ [12], SegNet [13], FCN8s [7], DeepLabV3+ [14], and SE-Net [18], and we also compare the proposed approach against the recently proposed Inf-Net for COVID-19 segmentation [39]. In the multi-class scenario, we compare the proposed FSS-2019-nCov against the aforementioned models, including DeepLabV3+ [14] with different stride values, FCN8s [7], Semi-Inf-Net-U-Net [9], Semi-Inf-Net-FCN8s [39], and Semi-Inf-Net & MC [39].

Evaluation metrics

In this study, we choose three broadly adopted metrics for performance evaluation, namely sensitivity (Sens), specificity (Spec), and the Dice similarity coefficient (DSC). To measure the overlap between the segmentation outcome, represented by set A, and the ground truth, represented by set B, the DSC is calculated as formulated in Eq. (8): DSC = 2|A ∩ B| / (|A| + |B|), where |·| denotes set size and ∩ denotes intersection. The resulting score always lies between 0 and 1; a high DSC reflects better segmentation performance. Also, following [39], we adopt three additional object-detection metrics. (1) The structural similarity between a computed map and the GT mask is measured with the Structure Measure (Sα), with a balance factor α between object-aware similarity (So) and region-aware similarity (Sr), according to Eq. (9): Sα = α·So + (1 − α)·Sr. Here, we choose α = 0.5, as recommended by the original study [60] and other recent studies on COVID-19 segmentation [39], semantic segmentation [61], and object detection [62]. (2) The recently proposed Enhanced-alignment Measure (Eϕ) measures the local and global similarity between two maps based on Eq. (10), where w and h respectively denote the width and height of the GT, (x, y) is the pixel position in the GT, and ϕ denotes the enhanced alignment matrix. Eϕ is calculated by transforming the prediction into a binary mask with a threshold in the range [0, 255], as introduced in [63]; we report the average Eϕ over all thresholds. (3) The Mean Absolute Error (MAE) computes the pixel-level error between the prediction and the GT, as formulated in Eq. (11).
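For reference, the two fully specified metrics, DSC (Eq. (8)) and MAE (Eq. (11)), reduce to a few lines each; this is an illustrative implementation for binary masks, not the authors' evaluation code.

```python
def dsc(a, b):
    """Dice similarity coefficient, Eq. (8): 2|A ∩ B| / (|A| + |B|),
    for binary masks given as nested lists of 0/1."""
    inter = sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    size = sum(x for r in a for x in r) + sum(y for r in b for y in r)
    return 2.0 * inter / size if size else 1.0

def mae(pred, gt):
    """Mean absolute error, Eq. (11): average pixel-wise |pred - gt|."""
    n = len(pred) * len(pred[0])
    return sum(abs(p - g) for rp, rg in zip(pred, gt) for p, g in zip(rp, rg)) / n

a = [[1, 1], [0, 0]]
b = [[1, 0], [0, 0]]
assert abs(dsc(a, b) - 2 / 3) < 1e-9  # |A∩B| = 1, |A| + |B| = 3
assert mae(a, b) == 0.25
```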

Results and discussion

Whole lung infection segmentation

In Table 2, we present the results of the proposed FSS-2019-nCov on the aforementioned metrics. Our model performs COVID-19 infection segmentation with a DSC of 0.798, sensitivity of 0.803, specificity of 0.986, Sα of 0.834, Eϕ of 0.908, and MAE of 0.065, outperforming the cutting-edge studies on the first four metrics. It can also be observed that the SSL-based architectures (i.e., Inf-Net [39], Semi-Inf-Net [39], and FSS-2019-nCov) achieve the highest performance on all metrics compared with the supervised models, which require a large number of samples to learn. This supports our choice of training FSS-2019-nCov in a semi-supervised manner. In addition, FSS-2019-nCov achieved 4%, 5%, 2%, and 2% improvements on DSC, Sens, Spec, and Sα, respectively, over the recently proposed Semi-Inf-Net, which validates the effectiveness of FSS for tackling problems with low volumes of data. Nevertheless, Semi-Inf-Net still shows the lowest MAE. This might be explained by the negative impact of eliminating the skip connections in our E-D architecture, and it also demonstrates the effectiveness of the GT guidance presented in [39].
Table 2

Model comparison for COVID-19 infection segmentation.

Methods               Pre-trained architecture   DSC↑    Sens↑   Spec↑   Sα↑     Eϕ↑     MAE↓
U-Net [9]             VGG16                      0.459   0.568   0.881   0.639   0.651   0.196
H-DenseUNet [11]      DenseNet-101               0.537   0.611   0.870   0.663   0.683   0.189
U-Net++ [12]          VGG16                      0.607   0.701   0.932   0.739   0.751   0.139
SegNet [13]           VGG16                      0.657   0.728   0.941   0.744   0.750   0.129
Inf-Net [39]          Res2Net                    0.705   0.746   0.966   0.798   0.851   0.086
SE-Net [18]           –                          0.621   0.719   0.949   0.751   0.801   0.142
Semi-Inf-Net [39]     Res2Net                    0.752   0.757   0.965   0.818   0.902   0.061
*FSS-2019-nCov        Res2Net                    0.798   0.803   0.986   0.834   0.908   0.065

↑ denotes ‘higher is better’; ↓ denotes ‘lower is better’.

In addition, we can further confirm the effectiveness of semi-supervised FSS-2019-nCov by providing a visual comparison of the output of different models, as presented in Fig. 6.
Fig. 6

Lung infection segmentation using the proposed FSS-2019-nCov. The first row shows original CT images from the test set. The corresponding segmentation outcomes from U-Net [9], U-Net++ [12], Inf-Net [39], Semi-Inf-Net [39], and SE-Net [18] are presented in the second to sixth rows, respectively. The segmentation results of the proposed FSS-2019-nCov are presented in the seventh row. The corresponding ground-truth label for every image is presented in the last row.


Multi-class scenario

In addition to whole-lung infection segmentation, we seek to provide a more informative segmentation of the different classes of lung infection, namely GGO, which appears as a hazy gray shade, and consolidation, which appears as opacification with obscuration of margins. Thus, we evaluate the proposed FSS-2019-nCov in the context of multi-class lung infection to validate the model's efficiency in providing clinicians with fine-grained information for COVID-19 diagnosis and quantification. Table 3 presents the quantitative results of the multi-class FSS-2019-nCov compared with state-of-the-art approaches. For the GGO lesion, FSS-2019-nCov achieved a DSC of 0.679, sensitivity of 0.768, specificity of 0.980, Sα of 0.735, Eϕ of 0.894, and MAE of 0.061. It can be noted that the supervised models (i.e., FCN8s and DeepLabV3+) with pre-trained backbones show unacceptable performance owing to the data-hungry nature of supervised learning. Among them, the multi-class version of U-Net [9] shows comparatively higher results on several metrics. Additionally, the few-shot-based SE-Net [18] shows a 3% improvement on the DSC measure despite the absence of a pre-trained backbone, which explains the superiority of few-shot learning in limited-data scenarios. Moreover, the semi-supervised approaches (Semi-Inf-Net-FCN8s and Semi-Inf-Net & MC) show better performance than the supervised models or the few-shot-based SE-Net [18]. This reflects the effect of incorporating unlabeled samples in training, which improves both classification and generalization performance. Furthermore, FSS-2019-nCov obtains 2.2%, 3.7%, and 1.7% improvements on DSC, sensitivity, and Sα, respectively, over the best result for each measure. For the consolidation lesion, FSS-2019-nCov achieved a DSC of 0.529, sensitivity of 0.534, specificity of 0.983, Sα of 0.661, Eϕ of 0.797, and MAE of 0.045.
The model shows similar behavior in segmenting this lesion: as noted in Table 3, we attain 5%, 1%, 1%, and 5% improvements on DSC, sensitivity, specificity, and Sα, respectively. However, Semi-Inf-Net-FCN8s obtained a slight improvement over our model on the MAE measure, which could result from the effectiveness of its parallel partial decoder in reducing the pixel-wise error between the segmentation result and the GT, even though it increases the computational burden. The above discussion further validates that integrating TL, SSL, and FSL in a single segmentation framework extensively improves segmentation performance in scarce-annotation scenarios.
Table 3

Model comparison for GGO and consolidation segmentation.

                                                   GGO segmentation                                Consolidation segmentation
Methods                  Pre-trained architecture  DSC↑   Sens↑  Spec↑  Sα↑    Eϕ↑    MAE↓         DSC↑   Sens↑  Spec↑  Sα↑    Eϕ↑    MAE↓
FCN8s [7]                VGG16                     0.482  0.552  0.917  0.591  0.788  0.098        0.289  0.281  0.728  0.573  0.581  0.058
DeepLabV3+ (s=8) [14]    ResNet101                 0.402  0.501  0.871  0.553  0.682  0.121        0.157  0.173  0.744  0.511  0.556  0.065
DeepLabV3+ (s=16) [14]   ResNet101                 0.457  0.728  0.845  0.559  0.673  0.149        0.245  0.322  0.721  0.526  0.619  0.079
U-Net [9]                VGG16                     0.462  0.374  0.988  0.564  0.731  0.079        0.421  0.427  0.978  0.581  0.781  0.053
SE-Net [18]              –                         0.508  0.415  0.889  0.541  0.751  0.075        0.449  0.467  0.958  0.554  0.797  0.051
Semi-Inf-Net-FCN8s [39]  Res2Net + VGG16           0.657  0.731  0.954  0.722  0.884  0.073        0.318  0.251  0.819  0.582  0.588  0.043
Semi-Inf-Net & MC [39]   VGG16 + Res2Net           0.639  0.631  0.973  0.715  0.904  0.070        0.471  0.527  0.979  0.618  0.781  0.045
*FSS-2019-nCov           Res2Net                   0.679  0.768  0.980  0.739  0.894  0.061        0.529  0.534  0.983  0.661  0.797  0.045

↑ denotes ‘higher is better’; ↓ denotes ‘lower is better’.


Generalization analysis

The generalization capability of a segmentation model is an important aspect of demonstrating its effectiveness in real-world scenarios. To understand and analyze the generalization capability of the proposed FSS-2019-nCov, we evaluate it against the previously mentioned comparative approaches on the CT-2 data and present the corresponding results in Table 4. The proposed FSS-2019-nCov shows robust generalization performance, outperforming the other approaches on all measures, even though the data comprise axial slices with no lesions (i.e., lesion-free slices). This can be attributed to the use of two datasets during training, i.e., the CT-1 data and the unannotated CT slices extracted from 20 CT volumes. Further, the unannotated data comprise many lesion-free slices, which ensures that FSS-2019-nCov can efficiently deal with lesion-free slices. We can therefore conclude that FSS-2019-nCov is a general lesion segmentation technique that can be applied to a variety of diseases.
Table 4

The results of evaluating different comparative models on the CT-2 dataset.

Methods              DSC↑    Sens↑   Spec↑   Sα↑     Eϕ↑     MAE↓
U-Net [9]            0.337   0.682   0.841   0.523   0.649   0.221
H-DenseUNet [11]     0.419   0.635   0.964   0.547   0.561   0.167
U-Net++ [12]         0.462   0.881   0.937   0.589   0.614   0.115
SegNet [13]          0.453   0.844   0.932   0.624   0.633   0.107
Inf-Net [39]         0.579   0.870   0.974   0.651   0.742   0.054
SE-Net [18]          0.555   0.837   0.924   0.673   0.713   0.054
Semi-Inf-Net [39]    0.597   0.865   0.977   0.723   0.792   0.037
*FSS-2019-nCov       0.632   0.892   0.975   0.764   0.824   0.031

↑ denotes ‘higher is better’; ↓ denotes ‘lower is better’.

Ablation experiment

Impact of RR module

In this part, we investigate the ideal positions of the RR blocks for smoothing knowledge interactions between the conditioner path and the segmentation path, and we also compare the performance of FSS-2019-nCov when using the recombination block only, the recalibration block only, and both blocks together (RR). Since this experiment seeks to find the position and type of the interaction blocks, we fix all other network parameters here; they are analyzed in subsequent sections. With two types of interaction blocks (i.e., SegSE and recombination) and four possible positions for the interaction block, there are twelve model variants, termed BLK-1, BLK-2, etc. In Table 5, we report the segmentation DSC for the whole-lung scenario and the multi-class scenario for each of these twelve variants. It can be noted that BLK-3, 6, 9, and 12, which use both the Recalibration and Recombination (RR) blocks (the ones with ✓ under both the R (SegSE) and R columns), yield the highest DSC scores, demonstrating the efficiency of the RR interaction modules in effectuating the interactions between the two paths of the FSS-2019-nCov architecture. This behavior can be explained by the concurrent spatial and channel squeezing, which reduces the number of feature maps and later expands them again, thereby empowering their representational power to convey the relevant information from the conditioner path to the segmentation path. Additionally, BLK-12, with RR blocks between all encoder, CE, and decoder blocks, attained the maximum DSC, achieving a 3% improvement in infection segmentation over the best DSC obtained by the other variants (BLK-11). In the multi-class scenario, BLK-12 attained 1% and 2% improvements on GGO and consolidation, respectively. This difference is potentially associated with the complexity and size of each class.
In other words, the size and contrast of GGO facilitate its segmentation in comparison to consolidation. Also, BLK-1 to BLK-9 show poor performance in comparison to BLK-10 to BLK-12, which shows that extra interactions enable better learning. Notably, model variants with encoder interactions (i.e., BLK-1, 2, 3) show higher performance than model variants with decoder interactions (i.e., BLK-7 to BLK-9), which indicates that encoder interactions are more representative and influential than CE or decoder interactions. Nevertheless, BLK-12 yielded better performance than all other configurations, which can be explained by the encoder and decoder interactions generating complementary knowledge representations for the segmentation path, enabling more accurate segmentation of the query slices. From this discussion, we deduce that applying RR blocks at the encoder, CE, and decoder leads to better performance than applying them at any single position.
Table 5

Comparison between different variants of the model to investigate the optimal position and kind of interaction blocks.

         Position of RR block    Interaction block     DSC
         Enc    CE     Dec       R (SegSE)   R         Infection   GGO     Cons
BLK-1    ✓                       ✓                     0.661       0.475   0.405
BLK-2    ✓                                   ✓         0.414       0.274   0.314
BLK-3    ✓                       ✓           ✓         0.698       0.513   0.426
BLK-4           ✓                ✓                     0.571       0.369   0.321
BLK-5           ✓                            ✓         0.327       0.221   0.221
BLK-6           ✓                ✓           ✓         0.545       0.373   0.395
BLK-7                  ✓         ✓                     0.623       0.441   0.234
BLK-8                  ✓                     ✓         0.421       0.239   0.326
BLK-9                  ✓         ✓           ✓         0.644       0.455   0.361
BLK-10   ✓      ✓      ✓         ✓                     0.733       0.669   0.511
BLK-11   ✓      ✓      ✓                     ✓         0.770       0.632   0.514
BLK-12   ✓      ✓      ✓         ✓           ✓         0.798       0.679   0.529

R (SegSE) represents the recalibration block; R represents the recombination block.


Impact of skip connection

Skip connections have been a principal design choice in most F-CNNs: they concatenate the encoder output maps with the decoder-block input feature maps of the same spatial resolution. This connection helps the decoder capture contextual information and smooths the flow of gradients. In light of this, we started building our model by applying skip connections in both the conditioner path and the segmentation path, but the results showed a copy-over effect [18], meaning that the prediction on the query slice is almost identical to the support mask despite the differences between the support and query slices. Therefore, we conducted several experiments to investigate the impact of skip connections on model performance in terms of DSC, and hence on the copy-over effect. In this experiment, we fixed all network parameters used in BLK-12 and varied only the skip-connection configuration. The performance of FSS-2019-nCov with and without skip connections is presented in Table 6. The DSC of whole-infection segmentation decreased by 4%, and by 3% for GGO and consolidation, when applying skip connections in both paths of the network (i.e., the conditioner and segmentation paths). Applying skip connections on only the segmentation path obviously yields unsatisfactory results. Moreover, including skip connections in the conditioner path alone results in a 5% decrease in DSC across the different segmentation scenarios.
Table 6

Experimental results analyzing the impact of using skip connections in the E-D architecture.

Skip connections                        DSC
Conditioner path   Segmentation path    Infection   GGO     Cons
✓                  ✓                    0.749       0.573   0.485
✓                                       0.752       0.644   0.506
                                        0.798       0.679   0.529
                   ✓                    0.415       0.256   0.201

Impact of pre-training

In this experiment, we choose U-Net with a non-pre-trained encoder as the baseline architecture for both the segmentation and conditioner paths. We then replace the baseline encoder with a pre-trained one to obtain enhanced performance; the architecture with a pre-trained residual encoder is called the 'Backbone'. The results with and without pre-training are compared in Table 7, where it can be seen that using the pre-trained Res2Net clearly improves performance.
Table 7

Ablation experiments on the proposed FSS-2019-nCov on CT-1 dataset.

Methods                      DSC↑    Sens↑   Spec↑   Sα↑     Eϕ↑     MAE↓
Baseline w/o pre-training    0.643   0.681   0.834   0.721   0.719   0.278
Baseline w/ pre-training     0.665   0.718   0.881   0.741   0.735   0.181
Backbone + SAC (atrous)      0.701   0.737   0.929   0.769   0.815   0.166
Backbone + SAC (SS-Conv)     0.731   0.748   0.956   0.781   0.863   0.105
Backbone + MPP               0.715   0.712   0.941   0.749   0.841   0.119
*FSS-2019-nCov               0.798   0.803   0.986   0.834   0.908   0.065

Impact of SAC module

The proposed SAC block utilizes a variety of SS-Conv layers organized in the form of an Inception module to extract high-level spatial representations. To investigate the effectiveness of SS-Conv, we replace the SS-Conv in the SAC block with standard atrous convolution (denoted Backbone + SAC (atrous)). Table 7 shows that the proposed SAC block achieves a 3% DSC improvement over the traditional atrous block and reduces the MAE by 0.061 on whole-infection segmentation, with similar improvements on the other metrics. This demonstrates that SS-Conv enables improved feature fusion, extracting high-level, multi-scale contextual feature maps at high resolution and hence improving segmentation performance.

Impact of MPP

To validate the usefulness of the proposed MPP block, we experimented with our Backbone architecture with and without MPP blocks for infection segmentation, as presented in Table 7. It is clearly seen that the MPP block boosts model performance: 'Backbone + MPP' achieved a 5% improvement in DSC and reduced the MAE by 0.057. This indicates that the MPP block can effectively encode the local contextual representation from the encoder-generated feature maps.
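The exact MPP block design is the authors'; a generic multi-scale pyramid pooling step in the same spirit can be sketched as follows (the bin sizes and nearest-neighbour upsampling are illustrative assumptions):

```python
import numpy as np

def pyramid_pool(feat, bins=(1, 2, 4)):
    """Multi-scale pyramid pooling on a square 2-D feature map.

    For each bin size b (which must divide the spatial size), average-pool
    the map into a b x b grid, then nearest-neighbour upsample back to the
    input resolution; the pooled maps are stacked with the input as extra
    channels, giving context at several scales.
    """
    h, w = feat.shape
    outs = [feat]
    for b in bins:
        pooled = feat.reshape(b, h // b, b, w // b).mean(axis=(1, 3))
        up = np.repeat(np.repeat(pooled, h // b, axis=0), w // b, axis=1)
        outs.append(up)
    return np.stack(outs)  # shape: (1 + len(bins), h, w)

feat = np.arange(16, dtype=float).reshape(4, 4)
out = pyramid_pool(feat)
print(out.shape)  # (4, 4, 4)
```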

Impact of semi-supervised training

To demonstrate the efficiency of semi-supervised training of the proposed FSS-2019-nCov, we compare its performance when trained in a supervised and in a semi-supervised manner, and report the corresponding results in Table 8. It can be noted that semi-supervised training shows significant performance improvements in segmenting infection lesions (i.e., gains of 0.119 in DSC, 0.059 in Sensitivity, 0.027 in Specificity, 0.06 in Sα, and 0.105 in Eϕ, and a reduction of 0.040 in MAE). This observation provides clear evidence of the effectiveness of incorporating unannotated CT data for training FSS-2019-nCov.
Table 8

The results of evaluating the proposed FSS-2019-nCov on CT-1 using different learning paradigms.

Methods    Semi-supervised learning    Supervised learning
DSC        0.798                       0.679
Sens       0.803                       0.744
Spec       0.986                       0.959
Sα         0.834                       0.774
Eϕ         0.908                       0.803
MAE        0.065                       0.105

↑ denotes ‘higher is better’; ↓ denotes ‘lower is better’.
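The paper's exact semi-supervised scheme is not reproduced here, but a common way to incorporate unlabeled slices is confidence-thresholded pseudo-labelling; the following sketch (the threshold and shapes are illustrative assumptions) shows the idea:

```python
import numpy as np

def pseudo_label(probs, threshold=0.9):
    """Turn per-pixel class probabilities on an unlabeled slice into
    pseudo-labels, keeping only confident pixels.

    probs: (C, H, W) class probabilities. Returns (labels, mask), where
    mask flags pixels whose top-class confidence meets the threshold;
    only those pixels would contribute to the unsupervised loss term.
    """
    conf = probs.max(axis=0)       # top-class confidence per pixel
    labels = probs.argmax(axis=0)  # tentative hard labels
    mask = conf >= threshold       # trust only confident pixels
    return labels, mask

probs = np.array([[[0.95, 0.40],
                   [0.10, 0.55]],
                  [[0.05, 0.60],
                   [0.90, 0.45]]])
labels, mask = pseudo_label(probs)
print(labels)  # hard labels per pixel
print(mask)    # only the 0.95 and 0.90 pixels pass the threshold
```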

Managerial implications

COVID-19 segmentation is the task of determining the infection area within lung CT scans. This task can be addressed as a binary or a multi-class segmentation problem. In the binary scenario, we aim to distinguish infected from uninfected areas; in the multi-class scenario, we aim to distinguish between different types of infection. The key challenge of this study is the limited amount of labeled CT scans. We propose a novel architecture that integrates a pre-trained encoder, FSS, and SSL to overcome this limitation. The Res2Net50-based encoder enables improved network convergence. The FSS architecture enables learning from limited support samples and better generalization to query samples. We introduce an adaptive recombination and recalibration (RR) module between the corresponding positions in the conditioner and segmentation paths to facilitate the exchange of knowledge representations. This is established by our experiments, from which it can safely be claimed that RR significantly refines knowledge interaction and hence improves performance. Meanwhile, the CE module captures contextual information about infections at different scales, facilitating the detection of infections of different sizes. Comprehensive experiments confirmed the effectiveness of each block. As a direct implication, the proposed FSS-2019-nCov can be utilized to develop an automated lung-infection segmentation system with scarcely annotated data.
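The RR module itself is defined in the paper; as a rough illustration of recalibration between two paths, a squeeze-and-excitation-style channel gating (all names and the projection `w` below are illustrative assumptions, not the authors' implementation) could look like:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def recalibrate(seg_feat, cond_feat, w):
    """Channel-wise recalibration of segmentation-path features by the
    conditioner path, in the spirit of squeeze-and-excitation gating.

    seg_feat, cond_feat: (C, H, W); w: (C, C) learned projection.
    The conditioner features are globally average-pooled to a channel
    descriptor, projected, squashed to (0, 1), and used to rescale the
    segmentation-path channels, which keeps the added cost trivial.
    """
    desc = cond_feat.mean(axis=(1, 2))     # squeeze: (C,)
    gate = sigmoid(w @ desc)               # excite:  (C,) in (0, 1)
    return seg_feat * gate[:, None, None]  # rescale each channel

rng = np.random.default_rng(0)
seg = rng.standard_normal((3, 4, 4))
cond = rng.standard_normal((3, 4, 4))
w = np.eye(3)
out = recalibrate(seg, cond, w)
print(out.shape)  # (3, 4, 4)
```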

Shortcomings and possible remedies

Future work will address further deep-learning improvements in terms of both performance and computational-complexity reduction. We aim to investigate three crucial challenges that we regard as specifically relevant to the medical image analysis community. (1) The training configuration of FSS-2019-nCov remains challenging, since it still necessitates comprehensive parameter tuning to attain the best results; an automated tuning tool could be used for this. (2) The predictions currently lack rigorous uncertainty quantification; we aim to develop Bayesian or fuzzified variants of the proposed FSS-2019-nCov that could enable estimating the uncertainty in predictions. (3) Although extensive analysis has provided a good understanding of the behavior of FSL and FSS, accountability and interpretability remain a downside of our FSS-2019-nCov, which an attention technique could mitigate.

Conclusion and future work

In this paper, we proposed a novel semi-supervised few-shot segmentation model for COVID-19 segmentation from axial CT scans using a dual-path architecture. The two paths have a symmetric structure, each comprising an encoder-decoder architecture with a smoothed context fusion module. The encoder was based on the pre-trained ResNet34 architecture to facilitate the learning process. We proposed merging recombination and recalibration to transfer the knowledge learned from the support set for segmenting the query slices. The model was trained in a semi-supervised manner by incorporating unlabeled CT slices alongside labeled ones during training, improving generalization performance. We evaluated the proposed FSS-2019-nCov and numerous baselines on publicly available COVID-19 CT scans. The results showed that our model outperforms all compared approaches on multiple evaluation metrics. We also presented comprehensive experiments on architectural choices concerning the RR blocks, skip connections, and the proposed building blocks. However, the segmentation performance of the proposed FSS-2019-nCov was unable to achieve very precise segmentation due to the limited supervision, which could be handled with a generative learning scheme. An additional limitation was the lack of volumetric data representation, which could be alleviated by extending our model to 3D CT volumes of COVID-19. Consequently, we aim to investigate the segmentation of COVID-19 using a large amount of volumetric 3D data in the near future.

CRediT authorship contribution statement

Mohamed Abdel-Basset: Investigation, Methodology, Resources, Visualization, Software, Writing - original draft, Writing - review & editing. Victor Chang: Conceptualization, Formal analysis, Project administration, Validation, Writing - review & editing. Hossam Hawash: Investigation, Methodology, Resources, Visualization, Software, Writing - original draft, Writing - review & editing. Ripon K. Chakrabortty: Conceptualization, Methodology, Writing - review & editing. Michael Ryan: Investigation, Validation, Supervision, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.