
DMDF-Net: Dual multiscale dilated fusion network for accurate segmentation of lesions related to COVID-19 in lung radiographic scans.

Muhammad Owais, Na Rae Baek, Kang Ryoung Park

Abstract

The recent disaster of COVID-19 has brought the whole world to the verge of devastation because of its highly transmissible nature. In this pandemic, radiographic imaging modalities, particularly, computed tomography (CT), have shown remarkable performance for the effective diagnosis of this virus. However, the diagnostic assessment of CT data is a human-dependent process that requires sufficient time by expert radiologists. Recent developments in artificial intelligence have substituted several personal diagnostic procedures with computer-aided diagnosis (CAD) methods that can make an effective diagnosis, even in real time. In response to COVID-19, various CAD methods have been developed in the literature, which can detect and localize infectious regions in chest CT images. However, most existing methods do not provide cross-data analysis, which is an essential measure for assessing the generality of a CAD method. A few studies have performed cross-data analysis in their methods. Nevertheless, these methods show limited results in real-world scenarios without addressing generality issues. Therefore, in this study, we attempt to address generality issues and propose a deep learning-based CAD solution for the diagnosis of COVID-19 lesions from chest CT images. We propose a dual multiscale dilated fusion network (DMDF-Net) for the robust segmentation of small lesions in a given CT image. The proposed network mainly utilizes the strength of multiscale deep features fusion inside the encoder and decoder modules in a mutually beneficial manner to achieve superior segmentation performance. Additional pre- and post-processing steps are introduced in the proposed method to address the generality issues and further improve the diagnostic performance. Mainly, the concept of post-region of interest (ROI) fusion is introduced in the post-processing step, which reduces the number of false-positives and provides a way to accurately quantify the infected area of lung. 
Consequently, the proposed framework outperforms various state-of-the-art methods by accomplishing superior infection segmentation results with an average Dice similarity coefficient of 75.7%, Intersection over Union of 67.22%, Average Precision of 69.92%, Sensitivity of 72.78%, Specificity of 99.79%, Enhance-Alignment Measure of 91.11%, and Mean Absolute Error of 0.026.
© 2022 The Author(s).


Keywords:  COVID-19 lesions segmentation; Computer-aided diagnosis; DMDF-Net; Infection quantification; Lung segmentation

Year:  2022        PMID: 35529253      PMCID: PMC9057951          DOI: 10.1016/j.eswa.2022.117360

Source DB:  PubMed          Journal:  Expert Syst Appl        ISSN: 0957-4174            Impact factor:   8.665


Introduction

Recently, the global disaster of coronavirus disease 2019 (COVID-19) has afflicted millions of people and triggered a socioeconomic crisis worldwide. According to figures from the WHO (World Health Organization, WHO Coronavirus Disease (COVID-19) Dashboard, 2021), as of June 18, 2021, approximately 176,693,988 positive cases of COVID-19, including 3,830,304 deaths (a 2.17% mortality rate), had been reported globally. Additionally, the advent of different variants of COVID-19 has further caused an alarming situation worldwide owing to their more contagious nature. Regarding the treatment of COVID-19, different experimental vaccines (Kim, Marks, & Clemens, 2021) have completed clinical assessments and been authorized by the European Medicines Agency (EMA) and/or the Food and Drug Administration (FDA). However, mass production and worldwide distribution of COVID-19 vaccines remain challenging and time-consuming tasks. Until now, preventive measures and early diagnosis have been the only means to prevent further spread of this deadly virus. In the context of diagnosis, molecular tests, such as the nucleic acid amplification test (NAAT) and reverse transcription polymerase chain reaction (RT-PCR), are performed to identify positive cases (Ai et al., 2020). However, these subjective evaluations are performed under strict clinical conditions, which can limit the use of these testing methods in outbreak regions. Recent studies (Ai et al., 2020, Fang et al., 2020) have found that chest computed tomography (CT) is a cost-effective diagnostic tool for identifying COVID-19 infection. Fig. 1 shows a few CT images of different patients infected with the COVID-19 virus; the infected regions are indicated by red boundary lines. The quantitative results presented in (Ai et al., 2020) show that expert evaluation of lung CT images reached 97% sensitivity for COVID-19 infection, taking RT-PCR test results as the reference.
Similar results in (Fang et al., 2020) demonstrated the diagnostic potential of chest radiography in the initial evaluation of the COVID-19 virus. Moreover, the quantitative evaluation of infection progress inside lung lobes is an important measure for medical treatment (Zhang et al., 2020). Therefore, accurate segmentation of the infected regions is an important pre-processing step for assessing the severity of COVID-19 infection. However, manual evaluation of a large volume of CT scans is a time-consuming task and increases the workload requirements of healthcare professionals.
Fig. 1

Example CT images of different patients infected with COVID-19 virus from (a) MosMed data and (b) COVID-19-CT-Seg data. The infectious regions of the lungs are shown inside the red boundary lines.

Recent advances in artificial intelligence technology, especially in the field of medical diagnostics (Hou et al., 2021, Mhiri et al., 2020, Xu et al., 2020), have substituted several human-dependent diagnostic approaches with computer-aided diagnosis (CAD) tools. In the present outbreak of COVID-19, these CAD techniques can also support healthcare professionals in making timely and efficient diagnostic decisions using chest CT images. Generally, a CAD method applies a set of artificial intelligence algorithms to analyze the given data, such as CT images, and provides diagnostic results. Recently, a new set of artificial intelligence algorithms called deep learning has emerged that has significantly enhanced the diagnostic capabilities of numerous CAD techniques. These state-of-the-art algorithms can emulate the diagnostic capability of healthcare experts and make effective diagnostic decisions. Recently, convolutional neural networks (CNNs), a variant of deep learning algorithms, have attracted special attention in the development of CAD tools across various medical domains. However, such CNN-based diagnostic methods need to be trained through supervised learning, which requires a large-scale annotated dataset. In the medical domain, data annotation is accomplished by healthcare experts, which requires considerable time and resources. To relax the requirement of a large-scale training dataset, transfer learning (Krizhevsky, Sutskever, & Hinton, 2012) can be adopted to train a CNN-based CAD method. In this training approach, a pre-trained CNN (trained on a huge collection of natural images, such as ImageNet (Deng et al., 2009)) can be reused in the medical domain.
The internal structure of a CNN model consists of a series of convolutional and fully-connected (FC) layers, along with other layers such as batch normalization (BN), softmax, classification, and rectified linear unit (ReLU) layers (detailed in (Heaton, 2015)). The convolutional and FC layers include learnable parameters that are trained on the training dataset. In response to the current pandemic, various types of CAD methods exist in the literature. These existing methods mainly use chest X-ray and/or CT images to diagnose COVID-19. Initially, most of the existing methods (Li et al., 2020, Minaee et al., 2020, Owais et al., 2021, Rathod and Khanuja, 2021) used deep classification models to make diagnostic decisions. These methods (Li et al., 2020, Minaee et al., 2020, Owais et al., 2021, Rathod and Khanuja, 2021) can only classify positive and negative patients without highlighting the infectious regions in a given radiographic scan. Later, new methods were proposed based on deep segmentation networks, which localize the infected regions in a given radiographic scan. However, most of the existing methods lack cross-data analysis, which is a prime indicator for assessing the effectiveness of a CAD method under real conditions. Limited studies (Ma et al., 2021, Owais et al., 2021, Zhang et al., 2020) have performed a cross-data analysis of their methods; however, these methods showed limited results in cross-data analysis. Consequently, in this study, we address the limitations of the existing studies and develop a high-performance CAD method for the efficient and well-localized detection of COVID-19 related findings in chest CT images. The main contributions of our method are as follows. We propose a dual multiscale dilated fusion network (DMDF-Net) for the robust segmentation of small lesions in chest CT images.
Our designed model utilizes the strength of grouped convolution and multiscale deep feature fusion inside the encoder and decoder modules using multiscale dilated convolution to achieve better segmentation results with a reduced number of training parameters. Additional pre- and post-processing steps are introduced in the proposed method to address the generality issues and obtain superior performance in a real-world setting. Moreover, the post-processing step also provides a way to accurately estimate the proportion of the infected area of the lung (PIAL), which is an essential measure for quantifying the severity of COVID-19 infection. Our proposed method achieves state-of-the-art results in cross-data analysis and outperforms various existing methods and recent deep segmentation networks. Finally, we make the proposed framework (including the implementation of DMDF-Net and the pre- and post-processing steps) openly available through (https://github.com/Owais786786/DMDF-Net.git, accessed on 18 January 2022) for fair comparisons by other researchers. The remainder of this article is structured as follows. In Section 2, we briefly review the various existing CAD methods for diagnosing COVID-19 infection using chest radiographic scans. Section 3 presents our selected datasets and proposed method. The training/validation settings and quantitative results are provided in Section 4. Finally, a brief discussion and the conclusions of the proposed framework are presented in Sections 5 and 6, respectively.

Related work

In the recent literature, various types of diagnostic methods have been proposed to automatically diagnose COVID-19 from chest radiographic scans. These methods primarily rely on CNN-based classification and segmentation models to make diagnostic decisions. In (Li et al., 2020, Minaee et al., 2020, Owais et al., 2021), the authors proposed CNN-based CAD frameworks that mainly classify a given radiographic scan as either positive or negative. Additionally, different training schemes were proposed in (Li et al., 2020, Minaee et al., 2020, Owais et al., 2021) to perform optimal training of their models using a limited number of training samples. However, the methods in (Li et al., 2020, Minaee et al., 2020, Owais et al., 2021) were trained to classify only positive and negative cases of COVID-19 without detecting and localizing the actual lesion regions in a given radiographic image. In contrast, semantic segmentation networks perform well in finding the regions infected with COVID-19 in each radiographic image. However, pixel-level annotated ground truths are required to properly train and validate these segmentation networks. Such data annotation is performed by healthcare professionals, which requires considerable time and resources. To mitigate the constraint of large-scale annotated datasets, different semi-supervised learning and data synthesis methods have been proposed in the literature (Fan et al., 2020, Jiang et al., 2021, Zhang et al., 2020). These methods can effectively train deep networks with limited training data. For example, (Jiang et al., 2021) proposed an image synthesis framework based on a conditional generative adversarial network (C-GAN) that can generate radiographic data samples (including both COVID-19 positive and negative CT images) for adequate training of deep networks.
In addition, the conventional U-Net segmentation model (Ronneberger, Fischer, & Brox, 2015) was trained with and without using synthesized data to demonstrate the efficiency of their data synthesis approach. Subsequently, (Zhang et al., 2020) presented a new version of C-GAN, called CoSinGAN, that can synthesize high-quality CT images by learning from a single data sample. The experimental results show superior segmentation performance for 2D and 3D U-Net compared to previous reference methods based on the synthesized data of CoSinGAN. In addition, (Fan et al., 2020) proposed a semi-supervised training scheme that effectively trains the proposed deep segmentation model (Inf-Net) using unlabeled data. A novel randomly selected propagation algorithm was adopted to perform the training of Inf-Net using labeled and unlabeled training data. Moreover, the aggregation of high-level features was performed inside Inf-Net to exploit the diverse representations of the lesion regions. Later, (Ma et al., 2021) presented benchmarks for lung lobes and infection segmentation using two radiographic datasets, including CT images. Different segmentation models were trained and evaluated to achieve the best results. 3D U-Net was ranked as the best model among the different reference models. In a comparative study, (Oulefki, Agaian, Trongtirakul, & Laouar, 2021) presented a detailed analysis of traditional machine learning techniques in response to the automated diagnosis of COVID-19. Based on a limited number of data samples, the first-ranked machine learning method showed comparable results to a deep CNN model. However, recent comparative studies (Jiang et al., 2021, Li et al., 2020, Owais et al., 2021, Zhang et al., 2020) have proved that deep learning models outperform traditional machine learning methods using multi-source radiographic datasets. 
Furthermore, (El-Bana, Al-Kabbany, & Sharkas, 2020) developed a multitasking CAD method that comprises a classification and a segmentation model to identify and segment certain types of infections in a given CT image. Initially, a pre-trained CNN model was configured to recognize the positive and negative cases of COVID-19. Subsequently, a deep segmentation network (DeepLabV3+ (Chen, Zhu, Papandreou, Schroff, & Adam, 2018)) was included to segment the infectious regions in a given CT image. Similarly, (Zheng et al., 2020) presented a multiscale identification network (MSD-Net) to segment multiclass lesions of different sizes. In a recent study, (Abdel-Basset, Chang, Hawash, Chakrabortty, & Ryan, 2021) presented a novel segmentation network, FSS-2019-nCov, to mitigate the constraint of large-scale training datasets. FSS-2019-nCov contains a dual-path encoder–decoder design that mainly extracts high-level features without changing the channel information. A pre-trained residual network (ResNet34) was configured as the encoder. Later, (Selvaraj, Venkatesan, Mahesh, & Raj, 2021) developed a CAD framework based on the joint connectivity of a classification and a segmentation network, similar to (El-Bana et al., 2020). Additional handcrafted features (i.e., lesion texture and structure information) were also used to efficiently train both networks. Subsequently, (Owais et al., 2021) and (Zhou, Canu, & Ruan, 2021) proposed segmentation-based CAD solutions for the effective detection of minor infectious regions caused by the COVID-19 virus in CT images. To deal with multi-plane CT data, (Kesavan, Al Naimi, Al Attar, Rajinikanth, & Kadry, 2021) applied a pre-trained Res-UNet model to identify COVID-19 related lesion regions in lung CT images with various 2D planes (such as axial, coronal, and sagittal orientations).
In another study, (Munusamy, Muthukumar, Gnanaprakasam, Shanmugakani, & Sekar, 2021) proposed a novel CNN model (FractalCovNet) for detecting COVID-19 infection from heterogeneous radiographic data (i.e., X-ray and CT images). The proposed model was configured to perform the following two tasks: 1) classifying lung X-ray images into COVID-19 positive and negative cases; and 2) recognizing COVID-19 related infectious regions in lung CT images. Subsequently, (Voulodimos, Protopapadakis, Katsamenis, Doulamis, & Doulamis, 2021) performed a comparative analysis of two well-known segmentation models (U-Net and fully convolutional networks (FCNs)) using CT data from COVID-19 patients. The comparative results indicate the following distinctive aspects of FCNs over U-Net: 1) they achieve accurate segmentation despite class imbalance in the dataset; and 2) they perform well even in the case of annotation errors on the boundaries of symptom manifestation areas. (Zheng, Zheng, & Dong-Ye, 2021) performed volumetric segmentation of whole 3D chest CT scans using an enhanced version of U-Net named 3D CU-Net. An attention mechanism was included in the encoder part of the proposed 3D CU-Net to obtain different levels of feature representation. Additionally, a pyramid fusion module with dilated convolutions was introduced at the end of the encoder to combine multiscale context information from high-level features. Similarly, (Zhao et al., 2021) proposed a dilated dual attention U-Net (D2A U-Net) for accurate detection of COVID-19 related lesion regions in chest CT images. The proposed D2A U-Net utilizes a dual attention strategy to refine feature maps and reduce the semantic gap between different levels of feature maps. Additionally, hybrid dilated convolutions are included in the decoder part to achieve larger receptive fields, which improves the decoding process.
Finally, Table 1 presents a comparative summary of our proposed and various existing methods to highlight the superior aspects and limitations of each study.
Table 1

Comparative review summary of the proposed and existing methods on the infection segmentation related to COVID-19.

| Method | #Sli. (#Pat.) | Strengths | Limitations |
| --- | --- | --- | --- |
| C-GAN and U-Net (Jiang et al., 2021) | 829 (9) | C-GAN overcomes the underfitting issue | Lack of cross-data analysis; limited testing dataset; computationally expensive |
| CoSinGAN and 2D U-Net (Zhang et al., 2020) | 5,569 (70) | CoSinGAN overcomes the underfitting issue; detects infected areas of various sizes | Data synthesis requires high computation power; limited cross-data performance |
| Inf-Net and Semi-Inf-Net (Fan et al., 2020) | 100 (40) | Semi-supervised learning improves the performance | Lack of cross-data analysis; limited testing dataset |
| 3D U-Net (Ma et al., 2021) | 5,569 (70) | Detailed performance analysis and comparison | 3D U-Net requires high computation power; limited cross-data performance |
| Modified local contrast enhancement (Oulefki et al., 2021) | 275 (22) | Visualizes the progression of disease; computationally cheap | Lack of cross-data analysis; lack of ablation study |
| InceptionV3 and DeepLabV3+ (El-Bana et al., 2020) | 100 (40) | Joint segmentation and classification framework | Lack of ablation study; lack of cross-data analysis; computationally expensive |
| MSD-Net (Zheng et al., 2020) | 4,780 (36) | Improves small lesion segmentation; detailed analysis | Lack of cross-data analysis; difficult to distinguish patients with mild symptoms |
| FSS-2019-nCov (Abdel-Basset et al., 2021) | 939 (69) | Performs optimal training with a limited dataset | Lack of cross-data analysis; data synthesis requires high computation |
| CNN (Selvaraj et al., 2021) | 80 (N/A) | Performs optimal training with a limited dataset | Lack of ablation study; limited testing dataset; lack of cross-data analysis |
| DAL-Net (Owais et al., 2021) | 5,569 (70) | Addresses generality issue; improves small lesion segmentation | Includes pre-processing stage |
| U-Net + attention mechanism (Zhou et al., 2021) | 473 (69) | Improves small lesion segmentation | Lack of cross-data analysis; lack of ablation study |
| Res-UNet (Kesavan et al., 2021) | 200 (10) | Detects infected areas in various 2D planes of lung CT slices | Limited dataset; lack of cross-data analysis |
| FractalCovNet (Munusamy et al., 2021) | 473 (N/A) | Detects COVID-19 cases using both chest X-ray and CT images | Lack of ablation study; lack of cross-data analysis |
| U-Net and FCNs (Voulodimos et al., 2021) | 939 (10) | Overcome the effect of class imbalance and annotation errors | Lack of comparison with state-of-the-art models; limited dataset |
| Improved 3D CU-Net (Zheng et al., 2021) | 5,569 (70) | Performs well in case of uneven distribution of lesions | Lack of ablation study; limited cross-data performance |
| D2A U-Net (Zhao et al., 2021) | 1,765 (N/A) | Performs well in case of blurred edges of infection | Limited testing dataset; lack of cross-data analysis; large number of parameters |
| DMDF-Net (proposed) | 5,569 (70) | Computationally efficient; addresses generality issue; effectively segments small lesions | Includes pre- and post-processing steps |

#Sli.: Number of CT scan slices; #Pat.: Number of patients.


Proposed method

Datasets and experimental setup

Two openly available datasets, MosMed (Morozov, Andreychenko, Pavlov, Vladzymyrskyy, Ledikhova, & Gombolevskiy, 2020) and COVID-19-CT-Seg (Jun et al., 2021, Ma et al., 2021), were selected to assess the performance of the proposed DMDF-Net and various baseline networks for a fair comparison. The MosMed dataset was made available by municipal hospitals in Moscow, Russia, and includes a total of 50 CT scans of different patients with COVID-19 infection. The entire dataset comprises 2,049 images and the corresponding ground truths as segmentation masks. All segmentation masks were annotated by medical experts from the Moscow Health Care Department. In each CT image, all findings related to COVID-19 infection are marked as white '1' pixels in the corresponding annotated mask, whereas all remaining pixels (other than the lesion regions) are marked as black '0'. The second dataset, COVID-19-CT-Seg, includes 20 CT scans of different patients with COVID-19 infection. This dataset comprises 3,520 images and the corresponding ground truths as separate segmentation masks of the left lung region of interest (ROI), the right lung ROI, and the infectious regions. Thus, COVID-19-CT-Seg includes three separate segmentation ground truths for each CT image. All segmentation masks (including left lung ROI, right lung ROI, and infectious regions) were annotated by junior data annotators and validated by three medical professionals. Fig. 2 presents a few CT images and their corresponding ground truths for both datasets.
Fig. 2

COVID-19 positive CT images and corresponding ground truths as segmentation masks from (a) MosMed, and (b) COVID-19-CT-Seg.

MATLAB (version R2020b) was used to implement and simulate the proposed DMDF-Net and the other baseline models. All experiments were performed on a desktop computer with an Intel Core i7 CPU, an NVIDIA GeForce GTX 1070 GPU, 16 GB of RAM, and the Windows 10 operating system.

Overview of proposed method

As shown in Fig. 3, the proposed CAD framework mainly consists of the following four stages: 1) a data pre-processing step; 2) a lung segmentation network (DMDF-Net-1); 3) an infection segmentation network (DMDF-Net-2); and 4) a post-processing step. In the first stage, the color and contrast of the input CT image are adjusted according to the training dataset by applying a simple Reinhard transformation (RT) (Reinhard, Adhikhmin, Gooch, & Shirley, 2001). Mathematically, each test image I is transformed into an enhanced image I' = f(I; θ), where f denotes the RT mapping function and θ is the mapping parameter that incorporates the color and contrast information of the training images. Subsequently, the second and third stages process the enhanced image (obtained after pre-processing) using two independent DMDF-Nets, generating the segmented lung region of interest (ROI) mask M_L and the infection mask M_I, respectively. The proposed DMDF-Nets (named DMDF-Net-1 and DMDF-Net-2 in Fig. 3) perform semantic segmentation and classify each pixel of the input CT image as either black '0' or white '1'. In the output of DMDF-Net-1, the white '1' pixels represent the "lung region" and the black '0' pixels the "background." Similarly, the output of DMDF-Net-2 presents the "infectious region" and the "normal/background region" as white '1' and black '0' pixels, respectively. Finally, the post-processing stage further refines M_I (the output of DMDF-Net-2) and generates the final output M_F by performing post-ROI fusion of both network outputs (i.e., M_F = M_L ∧ M_I), as shown in Fig. 3. The final output provides well-localized information about the infectious regions inside the lung lobes, which can be further used for the severity assessment of COVID-19 infection.
The addition of the post-processing stage reduces the number of false-positive pixels in the output of DMDF-Net-2 and further provides a way to accurately quantify the severity of COVID-19 infection in terms of the PIAL score. The PIAL score is calculated by dividing the area of the infected region (i.e., the total number of red pixels in the final output image) by the total area of the lung lobes (i.e., the total number of red and green pixels in the final output image), as shown in Fig. 3. The subsequent sections present the detailed design, workflow, and selected training loss of the proposed DMDF-Net.
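The post-ROI fusion and PIAL computation described above can be sketched in a few lines of pure Python. The mask layout and helper names below are illustrative assumptions, not taken from the released implementation:

```python
# Sketch of post-ROI fusion and PIAL computation (illustrative names).
# Masks are binary 2-D grids: 1 = foreground, 0 = background.

def post_roi_fusion(lung_mask, infection_mask):
    """Keep only infection pixels that fall inside the lung ROI,
    suppressing false positives predicted outside the lungs."""
    return [[l & i for l, i in zip(l_row, i_row)]
            for l_row, i_row in zip(lung_mask, infection_mask)]

def pial_score(lung_mask, fused_mask):
    """Proportion of the infected area of the lung (PIAL):
    infected pixels divided by total lung pixels."""
    lung_px = sum(map(sum, lung_mask))
    inf_px = sum(map(sum, fused_mask))
    return inf_px / lung_px if lung_px else 0.0

# Toy 4 x 4 example: the infection network flags one pixel outside the lung.
lung = [[0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
infection = [[0, 1, 0, 0],
             [0, 0, 1, 0],
             [0, 0, 0, 1],  # false positive: outside the lung ROI
             [0, 0, 0, 0]]

fused = post_roi_fusion(lung, infection)
print(pial_score(lung, fused))  # 2 infected pixels / 6 lung pixels ~ 0.33
```

Note how the fusion step discards the out-of-lung false positive before the PIAL ratio is taken, which is exactly why the post-processing stage improves both precision and severity quantification.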
Fig. 3

Complete workflow diagram of the proposed diagnostic framework, including the pre-processing step, lung segmentation network (DMDF-Net-1), infection segmentation network (DMDF-Net-2), and post-processing step. (f: RT mapping function; N_g: total number of green pixels in the final output image; N_r: total number of red pixels in the final output image.)

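As a rough illustration of the pre-processing step, the sketch below applies Reinhard-style statistical matching to a single channel. The original method (Reinhard et al., 2001) operates on all channels of a decorrelated color space, so this single-channel version with made-up data is a simplification, not the paper's exact transform:

```python
# Simplified Reinhard-style transfer: shift/scale one channel of a test
# image so its mean and standard deviation match the training statistics.
from statistics import mean, pstdev

def reinhard_like_transfer(src, ref):
    """Match the mean and std of src to those of ref (per channel)."""
    mu_s, sd_s = mean(src), pstdev(src)
    mu_r, sd_r = mean(ref), pstdev(ref)
    scale = sd_r / sd_s if sd_s else 1.0
    return [(x - mu_s) * scale + mu_r for x in src]

# Hypothetical intensities: a test-image channel is remapped toward the
# training-set statistics before being fed to the DMDF-Nets.
test_channel = [10, 20, 30, 40]
train_channel = [100, 110, 120, 130]
out = reinhard_like_transfer(test_channel, train_channel)
print(out)  # [100.0, 110.0, 120.0, 130.0]
```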

Design of the proposed DMDF-Net

The architecture of our proposed DMDF-Net is designed to meet the following objectives: 1) efficient memory consumption; 2) a low number of trainable parameters; and 3) minimal degradation of segmentation performance. To accomplish these goals, we primarily utilize the strength of grouped-convolutional (G-Conv) and dilated convolutional (D-Conv) layers in the overall structure of the proposed network. The use of G-Conv layers results in efficient memory consumption and fast processing owing to the decreased number of learnable parameters (Heaton, 2015). In detail, a conventional convolutional layer (Heaton, 2015) processes an input tensor of size h × w × c_i and generates an output tensor of size h × w × c_o by employing kernels of size k × k, at a total processing cost proportional to h · w · k² · c_i · c_o. In contrast, a G-Conv (depthwise) layer followed by a 1 × 1 projection requires a cost proportional to h · w · c_i · (k² + c_o) for a similar operation, reducing the processing cost by a factor of 1/c_o + 1/k². In our network design, most of the G-Conv layers use a kernel size of 3 × 3 (k = 3). Consequently, the average processing cost of a G-Conv layer is approximately eight to nine times lower than that of a conventional convolutional layer. Additionally, the D-Conv layers yield better segmentation performance owing to their ability to exploit multiscale deep features without substantially increasing the computational cost (Chen et al., 2018). The complete layer-wise design of the proposed DMDF-Net is shown in Fig. 4. The network mainly comprises an encoder followed by a decoder module. The encoder exploits multiscale deep features from the given image and represents them as a 3D tensor containing the main features. Subsequently, the decoder module upsamples this 3D tensor (the encoder output) and generates a binary image as the final output. The following subsections provide a detailed explanation of the encoder/decoder structure and workflow.
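The cost accounting above follows the standard depthwise-separable convolution analysis popularized by the MobileNet family; a minimal sketch, with an example tensor size borrowed from Table 2 (the function names are ours):

```python
# Multiply-accumulate (MAC) cost of a standard conv vs. a depthwise
# (grouped) conv followed by a 1 x 1 pointwise projection.

def standard_conv_cost(h, w, k, c_in, c_out):
    """MAC count of a standard k x k convolution over an h x w output."""
    return h * w * k * k * c_in * c_out

def separable_conv_cost(h, w, k, c_in, c_out):
    """Depthwise k x k conv (one filter per channel) plus a 1 x 1
    pointwise projection, as in MobileNetV2-style blocks."""
    return h * w * c_in * (k * k + c_out)

# Example from the encoder: a 72 x 88 feature map with 144 channels, k = 3.
std = standard_conv_cost(72, 88, 3, 144, 144)
sep = separable_conv_cost(72, 88, 3, 144, 144)
print(round(std / sep, 2))  # 8.47: roughly the 8-9x saving noted above
```

The ratio equals k²·c_o / (k² + c_o), i.e., the inverse of the 1/c_o + 1/k² factor; for k = 3 and large channel counts it approaches 9.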
Fig. 4

Overall design and workflow diagram of the proposed DMDF-Net (including both encoder and decoder modules).


Encoder design and workflow

To achieve efficient memory utilization and a low number of trainable parameters, we used the basic structural units of MobileNetV2 (Sandler, Howard, Zhu, Zhmoginov, & Chen, 2018) (labeled A-Block and B-Block in Fig. 4 and Table 2) to develop an efficient encoder design. In addition, a set of four multiscale D-Conv layers (Chen et al., 2018) (labeled C-Block in Fig. 4 and Table 2) was included to exploit and fuse a more diversified representation of the input image. The encoder structure includes a total of four A-Blocks, three B-Blocks, one C-Block, and several other layers, as indicated in Fig. 4. Both the A- and B-Blocks consist of the following three layers: 1) an expansion layer, a 1 × 1 convolutional layer that expands the depth of the input tensor (of size h × w × c) by a factor of 6, producing a tensor of size h × w × 6c; 2) a feature extraction layer, a 3 × 3 G-Conv layer that exploits deep features and produces an intermediate tensor of size h × w × 6c if the stride is 1, or (h/2) × (w/2) × 6c if the stride is 2; and 3) a projection layer, a 1 × 1 convolutional layer that reduces the depth of the intermediate tensor by a factor of 6 and generates the final output tensor (whose spatial size depends on the stride of the preceding layer). In addition, a residual connection is included in the B-Block, which differentiates it from the A-Block and prevents the vanishing gradient problem during training (Sandler et al., 2018). Subsequently, the C-Block mainly comprises four parallel D-Conv layers with dilation rate (DR) factors of 1, 6, 12, and 18 (one per layer). For effective computation, each D-Conv layer is followed by a projection layer (a 1 × 1 convolutional layer) that reduces the depth of its output tensor from 320 to 256 channels. To exploit multiscale features, the four D-Conv layers process the input tensor in parallel and generate four output tensors, whose depths are then reduced by the four projection layers.
Ultimately, a depth concatenation layer performs multiscale deep feature fusion by combining these four projected tensors, providing the final output tensor of 1024 channels. Mathematically, an input tensor T undergoes the following transformations after passing through these structural blocks: F_A(T) = P(G(E(T))) (1); F_B(T) = T + P(G(E(T))) (2); and F_C(T) = ⊕_{r ∈ {1,6,12,18}} P_r(D_r(T)) (3), where F_A, F_B, and F_C represent the operations of the A-, B-, and C-Block as transfer functions, respectively. In Eqs. (1) and (2), E, G, and P denote the expansion layer (1 × 1 convolution), the feature extraction layer (3 × 3 G-Conv), and the projection layer (1 × 1 convolution), respectively, together with their training parameters. Similarly, in Eq. (3), D_r denotes a D-Conv layer with DR factor r ∈ {1, 6, 12, 18}, P_r denotes its subsequent projection layer, and ⊕ denotes the depth-wise feature concatenation operation. For r = 1, the dilated convolution D_1 performs identically to a standard convolution.
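The multiscale behavior of the C-Block can be checked numerically: a k × k convolution with dilation rate r has an effective kernel size of k + (k − 1)(r − 1), and four branches projected to 256 channels concatenate to 1024 channels, as in Table 2. A small sketch (the helper names are ours):

```python
# Effective receptive field of dilated convs and the fused channel count
# of an ASPP-style block with four parallel branches.

def effective_kernel(k, rate):
    """Effective kernel size of a k x k convolution with dilation rate."""
    return k + (k - 1) * (rate - 1)

def c_block_output_channels(rates, proj_channels):
    """Each dilated branch is projected to proj_channels, then all
    branches are depth-concatenated (multiscale feature fusion)."""
    return len(rates) * proj_channels

rates = [1, 6, 12, 18]  # dilation rates used in the C-Block
print([effective_kernel(3, r) for r in rates])  # [3, 13, 25, 37]
print(c_block_output_channels(rates, 256))      # 1024, as in Table 2
```

Note that the r = 1 branch has an effective kernel of 3, confirming the remark that it behaves like a standard convolution, while the r = 18 branch covers a 37 × 37 neighborhood of the 18 × 22 encoder map, i.e., essentially global context.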
Table 2

Complete layer-wise structure, configuration, and parametric information of the proposed DMDF-Net.

| Layer | Input Size | Kernel Size | Kernel Depth | Stride | Output Size | #Par. |
| --- | --- | --- | --- | --- | --- | --- |
| Encoder | | | | | | |
| Image Input | 288 × 352 × 3 | – | – | – | – | – |
| Conv 1 | 288 × 352 × 3 | 3² | 32 | 2 | 144 × 176 × 32 | 960 |
| G-Conv 1 | 144 × 176 × 32 | 3² | 32 | 1 | 144 × 176 × 32 | 384 |
| Conv 2 | 144 × 176 × 32 | 1² | 16 | 1 | 144 × 176 × 16 | 560 |
| A-Block 1 | 144 × 176 × 16 | 1², 3², 1² | 96, 96, 24 | 1, 2, 1 | 72 × 88 × 24 | 5,352 |
| B-Block 1 | 72 × 88 × 24 | 1², 3², 1² | 144, 144, 24 | 1, 1, 1 | 72 × 88 × 24 | 9,144 |
| A-Block 2 | 72 × 88 × 24 | 1², 3², 1² | 144, 144, 32 | 1, 2, 1 | 36 × 44 × 32 | 10,320 |
| B-Block 2 | 36 × 44 × 32 | 1², 3², 1² | 192, 192, 32 | 1, 1, 1 | 36 × 44 × 32 | 15,264 |
| A-Block 3 | 36 × 44 × 32 | 1², 3², 1² | 192, 192, 64 | 1, 2, 1 | 18 × 22 × 64 | 21,504 |
| B-Block 3 | 18 × 22 × 64 | 1², 3², 1² | 384, 384, 64 | 1, 1, 1 | 18 × 22 × 64 | 55,104 |
| A-Block 4 | 18 × 22 × 64 | 1², 3², 1² | 384, 384, 320 | 1, 1, 1 | 18 × 22 × 320 | 154,176 |
| C-Block 1 | 18 × 22 × 320 | 1², 3², 3², 3² (DR = 1, 6, 12, 18)**; 1², 1², 1², 1² | 320, 320, 320, 320; 256, 256, 256, 256 | 1, 1, 1, 1; 1, 1, 1, 1 | 18 × 22 × 1024 | 340,992 |
| Conv 3 | 18 × 22 × 1024 | 1² | 256 | 1 | 18 × 22 × 256 | 262,912 |
| Decoder | | | | | | |
| TP-Conv 1* | 18 × 22 × 256 | 8² | 256 | 4 | 72 × 88 × 256 | 4,194,560 |
| Conv 4* | 72 × 88 × 144 | 1² | 48 | 1 | 72 × 88 × 48 | 7,056 |
| Depth Concatenation | 72 × 88 × 256; 72 × 88 × 48 | – | – | – | 72 × 88 × 304 | – |
| G-Conv 2 | 72 × 88 × 304 | 3² | 304 | 1 | 72 × 88 × 304 | 3,040 |
| Conv 5 | 72 × 88 × 304 | 1² | 256 | 1 | 72 × 88 × 256 | 78,592 |
| G-Conv 3 | 72 × 88 × 256 | 3² | 256 | 1 | 72 × 88 × 256 | 2,560 |
| Conv 6 | 72 × 88 × 256 | 1² | 320 | 1 | 72 × 88 × 320 | 82,880 |
| C-Block 2 | 72 × 88 × 320 | 1², 3², 3², 3² (DR = 1, 6, 12, 18)**; 1², 1², 1², 1² | 320, 320, 320, 320; 256, 256, 256, 256 | 1, 1, 1, 1; 1, 1, 1, 1 | 72 × 88 × 1024 | 340,992 |
| Conv 7 | 72 × 88 × 1024 | 1² | 256 | 1 | 72 × 88 × 256 | 262,912 |
| Conv 8 | 72 × 88 × 256 | 1² | 2 | 1 | 72 × 88 × 2 | 514 |
| TP-Conv 2 | 72 × 88 × 2 | 8² | 2 | 4 | 288 × 352 × 2 | 258 |
| Softmax | 288 × 352 × 2 | – | – | – | 288 × 352 × 2 | – |
| Pixel Classification | 288 × 352 × 2 | – | – | – | 288 × 352 × 2 | – |

*Output tensors of these layers are fed to the depth concatenation layer; **dilation rate (DR); #Par.: total number of parameters; '–': not applicable.

The complete configuration and parametric details of the encoder module are listed in Table 2. Initially, the input image (obtained after pre-processing) is processed through a stack of multiple layers (including convolutional, BN, and ReLU layers) and transformed into a 3D tensor of size 18 × 22 × 256. In detail, the first 3 × 3 convolutional layer (labeled as Conv 1 in Table 2) explores the input image in both horizontal and vertical directions and converts it into an output tensor of size 144 × 176 × 32. The second and third convolutional layers (labeled as G-Conv 1 and Conv 2 in Table 2) then transform the output of Conv 1 into a tensor of size 144 × 176 × 16. Next, a stack of seven structural blocks (labeled as A-Blocks 1-4 and B-Blocks 1-3 in Table 2) consecutively processes the output of the previous layer/block to obtain a more diverse, high-level representation of the input image; these seven blocks convert the output tensor of Conv 2 into an output of size 18 × 22 × 320. Additionally, C-Block 1 applies multiscale dilated convolution to further explore the output of A-Block 4 at four different scales (with DR factors of 1, 6, 12, and 18) and, after multiscale deep feature fusion, provides diversified multiscale feature maps of size 18 × 22 × 1024. For efficient computation on the decoder side, a projection layer (labeled as Conv 3 in Table 2) transforms these high-level features into a low-dimensional space: it reduces the depth of the C-Block 1 output by a factor of 4 and gives a final output tensor of size 18 × 22 × 256, which contains diverse semantic information.
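As a quick sanity check on the sizes above, the encoder's four stride-2 stages (Conv 1 and A-Blocks 1-3) reduce the 288 × 352 input to 18 × 22, an overall downsampling factor of 16. A minimal, illustrative sketch:

```python
# Spatial sizes through the encoder's four stride-2 stages in Table 2;
# channel depths are omitted for brevity.
def encoder_spatial_trace(h=288, w=352, strides=(2, 2, 2, 2)):
    shapes = [(h, w)]
    for s in strides:
        h, w = h // s, w // s
        shapes.append((h, w))
    return shapes

print(encoder_spatial_trace())
# [(288, 352), (144, 176), (72, 88), (36, 44), (18, 22)]
```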

Decoder design and workflow

The decoder part of the proposed DMDF-Net mainly includes two transposed convolutional (TP-Conv) layers, one C-Block (labeled as C-Block 2 in Fig. 4 and Table 2), several other Conv and G-Conv layers, and the softmax and pixel classification layers listed in Table 2. Our main contribution in the decoder part is the addition of multiscale D-Conv layers (Chen et al., 2018) (labeled as C-Block 2 in Fig. 4 and Table 2), which capture a multiscale representation of deep features from the upsampled output of the TP-Conv 1 layer by performing multiscale deep feature fusion and provide an additional performance gain. Moreover, a residual connection (extracted from B-Block 1 of the encoder module) was added to the decoded output of the TP-Conv 1 layer (before C-Block 2) to enrich the edge information. Most existing studies (Chen et al., 2018, Owais et al., 2021) have shown that residual skip connections (from encoder to decoder) improve performance, particularly for small output objects. The detailed layer-wise configuration of the decoder module is presented in Table 2. Initially, an 8 × 8 TP-Conv layer (labeled as TP-Conv 1 in Table 2) bilinearly upsamples the encoder's final output (the tensor of size 18 × 22 × 256 after the Conv 3 layer) by a factor of 4 and generates an upsampled tensor of size 72 × 88 × 256. Subsequently, a depth concatenation layer combines a residual tensor of size 72 × 88 × 48 (obtained from B-Block 1 and further processed by Conv 4) with the output tensor of TP-Conv 1 and gives a concatenated tensor of size 72 × 88 × 304. A total of four convolutional layers (labeled as G-Conv 2, Conv 5, G-Conv 3, and Conv 6 in Table 2) then transform this output into a new tensor of size 72 × 88 × 320.
Next, C-Block 2 applies multiscale dilated convolution to process the output of the preceding layer (Conv 6) at four different scales (with DR factors of 1, 6, 12, and 18) and, after multiscale deep feature fusion, provides diversified multiscale feature maps of size 72 × 88 × 1024. These feature maps are projected into a low-dimensional space from 72 × 88 × 1024 to 72 × 88 × 2 by two convolutional layers (labeled as Conv 7 and Conv 8 in Table 2). A second 8 × 8 TP-Conv layer (labeled as TP-Conv 2 in Table 2) then upsamples the output of Conv 8 by a factor of 4 and gives a final output tensor of size 288 × 352 × 2. Finally, a pixel classification layer in conjunction with the softmax layer generates the pixel-wise prediction for the given input image as the final output of our model. The softmax layer applies a softmax function (Heaton, 2015) that transforms the output of the TP-Conv 2 layer into per-class probabilities; the pixel classification layer then assigns a class label (either black '0' or white '1') to each pixel of the input CT image and generates a binary image as the final output.
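The decoder's size bookkeeping can be sketched the same way. This is an illustrative sketch (helper names are ours), assuming the transposed convolutions exactly quadruple the spatial size, as Table 2 indicates:

```python
def tp_conv_shape(h, w, factor=4):
    """Transposed convolution (8 x 8 kernel, stride 4) upsamples by 4x."""
    return h * factor, w * factor

def concat_depth(*depths):
    """Depth concatenation sums the channel counts of its inputs."""
    return sum(depths)

print(tp_conv_shape(18, 22))   # TP-Conv 1: (72, 88)
print(concat_depth(256, 48))   # skip connection from B-Block 1: 304 channels
print(tp_conv_shape(72, 88))   # TP-Conv 2: (288, 352), the input resolution
```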

Loss function

To achieve better segmentation of small lesion regions, a balanced cross-entropy (BCE) loss was selected for training the proposed DMDF-Net. BCE performs better than the conventional cross-entropy (CE) loss (Heaton, 2015, Jadon, 2020), particularly for small segmentation objects or lesion regions (Li et al., 2021, Ni et al., 2020, Owais et al., 2021, Roth et al., 2018). Additionally, we took advantage of transfer learning (Krizhevsky et al., 2012) to perform timely and efficient training of the proposed network. The basic structural units of MobileNetV2 (labeled as A-Block and B-Block in Fig. 4) were used to develop the encoder design; therefore, the initial training parameters of our encoder module (backbone network) were obtained from a MobileNetV2 network pretrained on the ImageNet dataset (Deng et al., 2009) using the conventional CE loss function (Heaton, 2015, Jadon, 2020) and the stochastic gradient descent (SGD) optimization method (Li, 2018). Accordingly, a related variant of the conventional CE loss, named BCE, was selected to sufficiently train the proposed DMDF-Net for the target domain. The selected BCE loss function is defined as

L_BCE(θ) = −(1/N) Σ_{i=1}^{N} [ β y_i log f(x_i; θ) + (1 − β)(1 − y_i) log(1 − f(x_i; θ)) ]

where x_i and y_i are the i-th training data sample and its ground-truth mask, respectively. Subsequently, f(·; θ), N, and θ represent the proposed DMDF-Net as a transfer function, the total number of data samples, and the total initial training parameters, respectively. Finally, β is the class balancing factor between black '0' and white '1' pixels, calculated as the fraction of dominant pixels (i.e., black '0' pixels) in the entire training dataset (Jadon, 2020).
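The loss described above can be sketched in NumPy. This is a minimal sketch assuming the standard balanced cross-entropy form (Jadon, 2020); the variable names are ours, and the authors' exact implementation may differ:

```python
import numpy as np

def balanced_cross_entropy(pred, target, beta, eps=1e-7):
    """Balanced CE: rare white ('1') pixels are weighted by beta (the
    fraction of dominant black pixels); background by (1 - beta)."""
    pred = np.clip(pred, eps, 1.0 - eps)   # avoid log(0)
    per_pixel = -(beta * target * np.log(pred)
                  + (1.0 - beta) * (1.0 - target) * np.log(1.0 - pred))
    return float(per_pixel.mean())

# beta estimated from the training masks: fraction of background pixels
masks = np.array([[0, 0, 0, 1],
                  [0, 0, 0, 0]], dtype=float)
beta = 1.0 - masks.mean()                  # 7/8 of pixels are background
loss = balanced_cross_entropy(np.full_like(masks, 0.5), masks, beta)
```

With beta = 0.5 the expression reduces to a plain (halved) cross-entropy, so the balancing only matters when one class dominates, exactly the small-lesion case motivating its use here.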

Results

In this section, we present a detailed explanation of training, validation, and quantitative results of our method, including a detailed ablation study. Finally, we compare the performance of the proposed DMDF-Net (including both DMDF-Net-1 and DMDF-Net-2) with various state-of-the-art methods.

Training and validation

Based on existing studies (Kandel and Castelli, 2020, Prabowo and Herwanto, 2019), an SGD optimizer with a small learning rate of 0.001 was selected to train the proposed model efficiently. Generally, a small learning rate can reach the global minimum, although it requires many epochs to sufficiently train a segmentation network, whereas a large learning rate can overshoot the global minimum (Johnson & Zhang, 2013). Therefore, a learning rate of 0.001 was selected to achieve optimal convergence of the proposed DMDF-Net. For the other hyperparameters, we used the default settings provided by MATLAB R2020b. The overall training procedure of the proposed DMDF-Net is given in Algorithm 1 as pseudo-code. In the first experiment (hereinafter referred to as Exp#1), we performed lung segmentation using 70% (14/20), 10% (2/20), and 20% (4/20) of the COVID-19-CT-Seg data for training, validation, and testing, respectively. For a fair evaluation in Exp#1, five-fold cross-validation was performed. In the second experiment (hereinafter referred to as Exp#2), we performed COVID-19 infection segmentation using two different datasets for training, validation, and testing; such cross-data analysis highlights the generality of the proposed framework. In Exp#2, we used 80% (16/20) and 20% (4/20) of the COVID-19-CT-Seg dataset for training and validation, respectively, and 100% (50/50) of the MosMed dataset for testing. Because the MosMed dataset does not provide ground truths for lung segmentation, we did not perform cross-validation in Exp#2 (i.e., using MosMed for training and COVID-19-CT-Seg for testing). Fig. 5 shows the training/validation accuracies and losses of the proposed DMDF-Net for lung segmentation (Exp#1) and COVID-19 infection segmentation (Exp#2).
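The five-fold protocol of Exp#1 (14/2/4 volumes per fold) can be sketched as simple index bookkeeping. This is an assumed illustration of the split sizes, not the authors' exact partitioning:

```python
def five_fold_splits(n_volumes=20, n_folds=5, n_train=14, n_val=2):
    """Rotate a held-out test fold; split the remainder into train/val."""
    ids = list(range(n_volumes))
    fold = n_volumes // n_folds                 # 4 test volumes per fold
    splits = []
    for k in range(n_folds):
        test = ids[k * fold:(k + 1) * fold]
        rest = [i for i in ids if i not in test]
        splits.append((rest[:n_train], rest[n_train:n_train + n_val], test))
    return splits

splits = five_fold_splits()   # five (train, val, test) partitions
```

Every volume appears in exactly one test fold, which is what makes the five-fold average a fair estimate.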
To avoid overfitting, we included independent validation datasets in the training procedure for both Exp#1 and Exp#2 and selected the best models based on the maximum validation accuracies. Accordingly, DMDF-Net-1 was trained for 15 epochs with a mini-batch size of eight, and DMDF-Net-2 was trained for 13 epochs with the same mini-batch size. To make a fair comparison, the same datasets and training protocols (such as training hyperparameters and loss function) were adopted to evaluate the different baseline models in both experiments. Finally, the quantitative results of the proposed and baseline models were evaluated using the following seven performance metrics: the average Dice similarity coefficient (DICE), intersection over union (IoU), average precision (AP), sensitivity (SEN), specificity (SPE), enhanced-alignment measure (Eφ), and mean absolute error (MAE) (Fan et al., 2018, Fan et al., 2020, Fan et al., 2021, Margolin et al., 2014, Zhaobin et al., 2020).
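Several of these metrics reduce to simple overlap ratios on binary masks. A minimal NumPy sketch of DICE, IoU, and MAE (illustrative only; AP and the Eφ measure follow their cited definitions and are omitted here):

```python
import numpy as np

def dice_coeff(pred, gt, eps=1e-7):
    """DICE = 2|P ∩ G| / (|P| + |G|) on binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

def iou_score(pred, gt, eps=1e-7):
    """IoU = |P ∩ G| / |P ∪ G| on binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return (inter + eps) / (union + eps)

def mae(pred, gt):
    """Mean absolute error between binary prediction and ground truth."""
    return float(np.abs(pred.astype(float) - gt.astype(float)).mean())

pred = np.array([[1, 1, 0, 0]], dtype=bool)
gt   = np.array([[1, 0, 0, 0]], dtype=bool)
# Overlap of 1 pixel: DICE = 2/3, IoU = 1/2, MAE = 1/4
```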
Fig. 5

Training/validation accuracies and losses of the proposed DMDF-Net for (a) lung segmentation (Exp#1), and (b) COVID-19 infection segmentation (Exp#2).


Testing results (Ablation Studies)

Our proposed framework mainly includes a lung segmentation network (DMDF-Net-1 in Fig. 3) and an infection segmentation network (DMDF-Net-2 in Fig. 3) to segment the lung lobes and infectious regions, respectively, in a given CT image. The output of DMDF-Net-1 is used in the post-processing step to refine the output of DMDF-Net-2 and generate well-localized information about the infectious regions; accurate segmentation of the lung lobes is therefore important in the proposed framework. Thus, we evaluated the quantitative results of both DMDF-Net-1 and DMDF-Net-2, along with a comprehensive analysis of the pre- and post-processing stages, as an ablation study. Table 3 presents the average segmentation results for the left, right, and both lung lobes (Exp#1) based on DMDF-Net-1. These results also highlight the significance of multiscale deep feature fusion using multiscale dilated convolution (the addition of C-Blocks) and of transfer learning in developing and training the proposed network. In all cases (i.e., segmentation of the left, right, and both lung lobes), DMDF-Net-1 (including C-Blocks) showed superior results in terms of all performance metrics when trained through transfer learning. In particular, the use of C-Blocks and transfer learning gave average gains in DICE of 51% [94.86%–43.86%], 55.43% [98.52%–43.09%], and 50.14% [98.66%–48.52%], in IoU of 57.35% [90.59%–33.24%], 64.87% [97.11%–32.24%], and 61.63% [97.38%–35.75%], and in Eφ of 57.83% [94.67%–36.84%], 62.29% [98.78%–36.49%], and 55.52% [98.73%–43.21%] for the left, right, and both lung lobes, respectively. Beyond these gains, the segmentation results on the "both lung" dataset are higher than those on the "left lung" and "right lung" datasets.
This difference arises because the two lung lobes have similar shape and texture patterns: in the "left lung" and "right lung" datasets, the CNN must distinguish between two similar-looking lobes, whereas in the "both lung" dataset it only needs to segment both lobes together, which is a simpler task. Consequently, the performance of our model on the "both lung" dataset is higher than its individual results on the "left lung" and "right lung" datasets.
Table 3

Average five-fold performance of the proposed lung segmentation network (DMDF-Net-1) for lung segmentation (Exp#1). These results also highlight the significance of multiscale deep features fusion using multiscale dilated convolution (C-Blocks) and transfer learning in Exp#1 as an ablation study. (unit: %).

| Training Option | Option (#Par.) | Dataset | DICE (Std) | IoU (Std) | AP (Std) | SEN (Std) | SPE (Std) | Eφ (Std) | MAE (Std) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Without Transfer Learning | Without C-Blocks (4.71 M) | Left lung | 43.86 (5.42) | 33.24 (5.38) | 52.38 (1.84) | 63.39 (7.42) | 60.12 (7.93) | 36.84 (4.79) | 3.971 (0.788) |
| | | Right lung | 43.09 (2.4) | 32.24 (2.53) | 52.02 (0.87) | 62.04 (5.11) | 58.44 (4.28) | 36.49 (2.86) | 4.137 (0.428) |
| | | Both lung | 48.52 (3.85) | 35.75 (3.97) | 54.08 (1.89) | 62.34 (5.29) | 60.3 (5.87) | 43.21 (4.37) | 3.95 (0.578) |
| | With C-Blocks (5.85 M) | Left lung | 41.88 (3.61) | 31.37 (3.72) | 51.52 (1.12) | 59.0 (5.28) | 57.63 (6.12) | 35.17 (3.9) | 4.231 (0.608) |
| | | Right lung | 44.56 (5.11) | 33.82 (5.24) | 52.43 (1.74) | 62.4 (7.8) | 60.93 (8.3) | 38.2 (5.05) | 3.899 (0.826) |
| | | Both lung | 52.65 (4.69) | 39.74 (4.55) | 56.28 (2.38) | 67.32 (4.53) | 65.28 (5.8) | 47.6 (5.01) | 3.451 (0.564) |
| With Transfer Learning | Without C-Blocks (4.71 M) | Left lung | 94.45 (0.65) | 89.92 (1.08) | 90.75 (0.97) | 99.15 (0.17) | 98.89 (0.24) | 94.36 (1.7) | 0.11 (0.023) |
| | | Right lung | 97.76 (0.84) | 95.71 (1.55) | 96.3 (1.11) | 99.1 (1.41) | 99.57 (0.14) | 97.81 (1.22) | 0.046 (0.02) |
| | | Both lung | 98.41 (0.3) | 96.91 (0.57) | 97.6 (0.59) | 99.07 (1.02) | 99.48 (0.08) | 98.29 (0.5) | 0.057 (0.01) |
| | With C-Blocks (5.85 M) | Left lung | 94.86 (0.4) | 90.59 (0.66) | 91.29 (0.66) | 99.43 (0.15) | 98.98 (0.1) | 94.67 (1.54) | 0.1 (0.01) |
| | | Right lung | 98.52 (0.24) | 97.11 (0.45) | 97.42 (0.43) | 99.64 (0.28) | 99.71 (0.04) | 98.78 (0.24) | 0.03 (0.004) |
| | | Both lung | 98.66 (0.17) | 97.38 (0.33) | 97.82 (0.26) | 99.57 (0.12) | 99.51 (0.08) | 98.73 (0.35) | 0.048 (0.008) |

#Par.: Number of parameters; M: Million; Std: Standard deviation; The best results are shown in boldface.

The encoder design (backbone network) of the proposed DMDF-Net includes the basic structural units of MobileNetV2. Therefore, we also compared the performance of the proposed encoder design with that of the original MobileNetV2 as a backbone network for lung segmentation (Exp#1). Table 4 presents these comparative results. Our encoder design outperforms MobileNetV2, yielding average performance gains in DICE of 0.8% [94.86%–94.06%], 3.03% [98.52%–95.49%], and 2.4% [98.66%–96.26%], in IoU of 1.62% [90.59%–88.97%], 1.78% [97.11%–95.33%], and 1.17% [97.38%–96.21%], and in Eφ of 0.6% [94.67%–94.07%], 1.07% [98.78%–97.71%], and 0.66% [98.73%–98.07%] for the left, right, and both lung lobes, respectively. Moreover, the proposed encoder design (in DMDF-Net-1) contains fewer training parameters (specifically, 0.88 million) than the original MobileNetV2 (specifically, 2.24 million). Table 4 also includes a performance comparison of our selected BCE loss versus the conventional CE loss function. In this comparison, BCE gave average gains in DICE of 0.24% [94.86%–94.62%], 0.13% [98.52%–98.39%], and 0.15% [98.66%–98.51%] and in IoU of 0.4% [90.59%–90.19%], 0.25% [97.11%–96.86%], and 0.29% [97.38%–97.09%] for the left, right, and both lung lobes, respectively.
Owing to the slight class imbalance between the two classes (lung ROI and background), the performance difference between our BCE loss and the conventional CE loss is minimal here. However, we observed a significant performance gain from the BCE loss in COVID-19 infection segmentation (Exp#2).
Table 4

Comparative results of the proposed encoder design versus original MobileNetV2 as backbone networks and adopted BCE loss versus conventional CE loss function for lung segmentation (Exp#1). (unit: %).

| Backbone/Loss | Dataset | DICE (Std) | IoU (Std) | AP (Std) | SEN (Std) | SPE (Std) | Eφ (Std) | MAE (Std) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Backbone Networks | | | | | | | | |
| MobileNetV2 (Sandler et al., 2018) | Left lung | 94.06 (1.65) | 88.97 (1.13) | 89.97 (1.16) | 98.77 (0.25) | 98.79 (0.18) | 94.07 (1.68) | 0.121 (0.016) |
| Proposed | | 94.86 (0.4) | 90.59 (0.66) | 91.29 (0.66) | 99.43 (0.15) | 98.98 (0.1) | 94.67 (1.54) | 0.1 (0.01) |
| MobileNetV2 (Sandler et al., 2018) | Right lung | 95.49 (1.88) | 95.33 (0.7) | 95.83 (0.5) | 99.4 (0.43) | 99.5 (0.11) | 97.71 (0.51) | 0.05 (0.013) |
| Proposed | | 98.52 (0.24) | 97.11 (0.45) | 97.42 (0.43) | 99.64 (0.28) | 99.71 (0.04) | 98.78 (0.24) | 0.03 (0.004) |
| MobileNetV2 (Sandler et al., 2018) | Both lung | 96.26 (1.63) | 96.21 (0.46) | 96.86 (0.27) | 99.34 (0.39) | 99.28 (0.15) | 98.07 (0.41) | 0.072 (0.016) |
| Proposed | | 98.66 (0.17) | 97.38 (0.33) | 97.82 (0.26) | 99.57 (0.12) | 99.51 (0.08) | 98.73 (0.35) | 0.048 (0.008) |
| Loss Functions | | | | | | | | |
| CE (Heaton, 2015, Jadon, 2020) | Left lung | 94.62 (0.48) | 90.19 (0.79) | 90.94 (0.83) | 99.35 (0.23) | 98.93 (0.12) | 94.48 (1.6) | 0.105 (0.011) |
| BCE (our) | | 94.86 (0.4) | 90.59 (0.66) | 91.29 (0.66) | 99.43 (0.15) | 98.98 (0.1) | 94.67 (1.54) | 0.1 (0.01) |
| CE (Heaton, 2015, Jadon, 2020) | Right lung | 98.39 (0.24) | 96.86 (0.44) | 97.16 (0.42) | 99.69 (0.25) | 99.68 (0.04) | 98.71 (0.03) | 0.032 (0.005) |
| BCE (our) | | 98.52 (0.24) | 97.11 (0.45) | 97.42 (0.43) | 99.64 (0.28) | 99.71 (0.04) | 98.78 (0.24) | 0.03 (0.004) |
| CE (Heaton, 2015, Jadon, 2020) | Both lung | 98.51 (0.12) | 97.09 (0.22) | 97.55 (0.18) | 99.61 (0.15) | 99.44 (0.1) | 98.75 (0.22) | 0.054 (0.009) |
| BCE (our) | | 98.66 (0.17) | 97.38 (0.33) | 97.82 (0.26) | 99.57 (0.12) | 99.51 (0.08) | 98.73 (0.35) | 0.048 (0.008) |

Std: Standard deviation; The best results are shown in boldface.

Table 5 presents the quantitative results of the proposed DMDF-Net-2 for COVID-19 infection segmentation (Exp#2), along with a detailed ablation study. As shown in Table 5, the proposed framework achieves its best results (75.7%, 67.22%, 69.92%, 72.78%, 99.79%, 91.11%, and 0.026 for average DICE, IoU, AP, SEN, SPE, Eφ, and MAE, respectively) when both the pre- and post-processing stages are added after transfer learning. After training the final proposed DMDF-Net-2 (including C-Blocks) for Exp#2 via transfer learning, the pre- and post-processing stages resulted in additional gains of 39.11% [75.7%–36.59%], 38.53% [67.22%–28.69%], 19.89% [69.92%–50.03%], 22.51% [72.78%–50.27%], 42.59% [99.79%–57.2%], 46.07% [91.11%–45.04%], and 4.255 [4.281–0.026] for average DICE, IoU, AP, SEN, SPE, Eφ, and MAE, respectively. These results show that transfer learning, pre-processing, post-processing, and multiscale deep feature fusion via multiscale dilated convolution (C-Blocks) work in a mutually beneficial way to enhance the overall performance of the proposed diagnostic framework. Additionally, Fig. 6 shows the visual outputs of the proposed framework with and without the pre- and post-processing stages for COVID-19 infection segmentation (Exp#2). As can be observed in Fig. 6, both stages contribute to reducing the number of false-positive and false-negative pixels and to correctly segmenting the lesion regions in a given CT image.
Table 5

Quantitative results of the proposed infection segmentation network (DMDF-Net-2) for COVID-19 infection segmentation (Exp#2). These results also highlight the significance of transfer learning, pre-processing, post-processing, and multiscale deep features fusion using multiscale dilated convolution (C-Blocks) in Exp#2 as an ablation study. (unit: %).

| Training Option | Option (#Par. (M)) | Pre-Processing | Post-Processing | DICE | IoU | AP | SEN | SPE | Eφ | MAE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Without Transfer Learning | Without C-Blocks (4.71) | × | × | 36.59 | 28.69 | 50.03 | 50.27 | 57.2 | 45.04 | 4.281 |
| | | × | ✓ | 43.37 | 38.06 | 50.02 | 27.14 | 76 | 57 | 2.409 |
| | | ✓ | × | 36.48 | 28.53 | 50.05 | 55.5 | 56.86 | 44.87 | 4.314 |
| | | ✓ | ✓ | 43.42 | 38.11 | 50.04 | 31.84 | 76.06 | 57.04 | 2.403 |
| | With C-Blocks (5.85) | × | × | 33.62 | 25.2 | 50.01 | 53.56 | 50.23 | 40.59 | 4.976 |
| | | × | ✓ | 44.99 | 40.65 | 50.02 | 21.61 | 81.2 | 60.5 | 1.892 |
| | | ✓ | × | 33.76 | 25.34 | 50.02 | 54.7 | 50.52 | 40.75 | 4.948 |
| | | ✓ | ✓ | 44.72 | 40.12 | 50.06 | 29.62 | 80.07 | 59.92 | 2.003 |
| With Transfer Learning | Without C-Blocks (4.71) | × | × | 71.94 | 63.98 | 68.99 | 52.13 | 99.84 | 89.98 | 0.026 |
| | | × | ✓ | 71.63 | 63.73 | 69.13 | 49.94 | 99.84 | 89.98 | 0.025 |
| | | ✓ | × | 73.96 | 65.67 | 67.74 | 74.32 | 99.74 | 89.98 | 0.031 |
| | | ✓ | ✓ | 74.41 | 66.06 | 68.28 | 73.88 | 99.75 | 90.15 | 0.03 |
| | With C-Blocks (5.85) | × | × | 72.91 | 64.78 | 68.38 | 61.07 | 99.8 | 91.02 | 0.028 |
| | | × | ✓ | 73.13 | 64.96 | 68.74 | 60.66 | 99.8 | 91.13 | 0.027 |
| | | ✓ | × | 75.3 | 66.86 | 69.43 | 72.95 | 99.78 | 91.01 | 0.027 |
| | | ✓ | ✓ | 75.7 | 67.22 | 69.92 | 72.78 | 99.79 | 91.11 | 0.026 |

×: Not included; ✓: Included; #Par.: Number of parameters; M: Million; The best results are shown in boldface.

Fig. 6

Visual output results of the proposed framework with and without including the pre- and post-processing stages for COVID-19 infection segmentation (Exp#2).

We also compared the performance of the proposed encoder design (in DMDF-Net-2) with that of the original MobileNetV2 as a backbone network for COVID-19 infection segmentation (Exp#2). Table 6 presents these comparative results. As shown in Table 6, the pre- and post-processing steps improve the results for both our proposed backbone and the original MobileNetV2 relative to the configurations without them. However, our model (including both pre- and post-processing stages) still outperforms MobileNetV2, yielding average gains of 2.3% [75.7%–73.4%], 2.04% [67.22%–65.18%], 2.5% [69.92%–67.42%], 1.06% [72.78%–71.72%], 0.05% [99.79%–99.74%], 1.57% [91.11%–89.54%], and 0.005 [0.031–0.026] for average DICE, IoU, AP, SEN, SPE, Eφ, and MAE, respectively. Table 6 also compares our selected BCE loss with the conventional CE loss function for Exp#2. In this comparison, BCE gave average gains of 3.33% [75.7%–72.37%], 2.92% [67.22%–64.3%], 4.16% [69.92%–65.76%], 0.12% [99.79%–99.67%], 1.58% [91.11%–89.53%], and 0.011 [0.037–0.026] for average DICE, IoU, AP, SPE, Eφ, and MAE, respectively.
Our BCE loss addressed the class imbalance problem in the selected dataset and resulted in additional performance gains for the COVID-19 infection segmentation task (Exp#2).
Table 6

Comparative results of the proposed encoder design versus original MobileNetV2 as backbone networks and adopted BCE loss versus conventional CE loss function for COVID-19 infection segmentation (Exp#2). (unit: %).

| Backbone/Loss | Option | DICE | IoU | AP | SEN | SPE | Eφ | MAE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Backbone Networks | | | | | | | | |
| MobileNetV2 (Sandler et al., 2018) | Without Pre- and Post-Processing | 73.2 | 65.02 | 69 | 59.79 | 99.81 | 88.44 | 0.027 |
| Proposed | | 72.91 | 64.78 | 68.38 | 61.07 | 99.8 | 91.02 | 0.028 |
| MobileNetV2 (Sandler et al., 2018) | With Pre-Processing | 72.84 | 64.7 | 66.76 | 72.21 | 99.72 | 89.35 | 0.033 |
| Proposed | | 75.3 | 66.86 | 69.43 | 72.95 | 99.78 | 91.01 | 0.027 |
| MobileNetV2 (Sandler et al., 2018) | With Pre- and Post-Processing | 73.4 | 65.18 | 67.42 | 71.72 | 99.74 | 89.54 | 0.031 |
| Proposed | | 75.7 | 67.22 | 69.92 | 72.78 | 99.79 | 91.11 | 0.026 |
| Loss Functions | | | | | | | | |
| CE (Heaton, 2015, Jadon, 2020) | Without Pre- and Post-Processing | 69.37 | 61.87 | 63.26 | 72.92 | 99.61 | 88.83 | 0.044 |
| BCE (our) | | 72.91 | 64.78 | 68.38 | 61.07 | 99.8 | 91.02 | 0.028 |
| CE (Heaton, 2015, Jadon, 2020) | With Pre-Processing | 71.86 | 63.87 | 65.23 | 78.33 | 99.65 | 89.25 | 0.039 |
| BCE (our) | | 75.3 | 66.86 | 69.43 | 72.95 | 99.78 | 91.01 | 0.027 |
| CE (Heaton, 2015, Jadon, 2020) | With Pre- and Post-Processing | 72.37 | 64.3 | 65.76 | 77.89 | 99.67 | 89.53 | 0.037 |
| BCE (our) | | 75.7 | 67.22 | 69.92 | 72.78 | 99.79 | 91.11 | 0.026 |

The best results are shown in boldface.

In the post-processing step of Exp#2, the lung ROI (output of DMDF-Net-1) was applied over the output of DMDF-Net-2 to reduce the number of false-positive pixels; we refer to this as post-ROI fusion. To highlight the significance of this post-ROI fusion-based post-processing step, we also evaluated its counterpart, named pre-ROI fusion. In pre-ROI fusion, the lung ROI mask (output of DMDF-Net-1) is applied over the input CT image to obtain a lung ROI image, which is then processed by DMDF-Net-2 to segment the infection regions. Table 7 shows the quantitative comparison of pre-ROI fusion versus post-ROI fusion with and without the pre-processing step for Exp#2. After training the proposed DMDF-Net-2 through transfer learning (without the data pre-processing stage), post-ROI fusion outperforms pre-ROI fusion (Table 7), yielding average gains of 21.04% [73.13%–52.09%], 15.5% [64.96%–49.46%], 17.09% [68.74%–51.65%], 4.16% [99.8%–95.64%], 15.37% [91.13%–75.76%], and 0.412 [0.439–0.027] for average DICE, IoU, AP, SPE, Eφ, and MAE, respectively. In conjunction with the pre-processing stage, post-ROI fusion again outperforms pre-ROI fusion (Table 7), yielding average gains of 8.24% [75.7%–67.46%], 6.77% [67.22%–60.45%], 7.27% [69.92%–62.65%], 15.84% [72.78%–56.94%], 0.11% [99.79%–99.68%], 1.72% [91.11%–89.39%], and 0.015 [0.041–0.026] for average DICE, IoU, AP, SEN, SPE, Eφ, and MAE, respectively. Similar performance disparities can be observed in Table 7 when the proposed DMDF-Net-2 is trained from scratch (i.e., without transfer learning).
Table 7 also allows a comparison of no pre- and post-processing (the 1st and 7th rows), only pre-processing (the 2nd and 8th rows), only post-processing (the 4th and 10th rows), and combined pre- and post-processing (the 6th and 12th rows). From this comparison, we confirm that combined pre- and post-processing, only pre-processing, only post-processing, and no pre- and post-processing achieve the 1st to 4th highest accuracies, respectively. Fig. 7 presents the visual output results of pre-ROI fusion versus post-ROI fusion (for Exp#2) after applying the pre-processing step and training DMDF-Net-2 through transfer learning. As can be observed in Fig. 7, the post-ROI fusion-based post-processing step effectively reduces the number of false-positive and false-negative pixels and correctly segments the lesion regions in a given CT image.
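The post-ROI fusion step itself amounts to a mask intersection. A minimal NumPy sketch (illustrative names; any thresholding or morphology used in the actual pipeline is not shown):

```python
import numpy as np

def post_roi_fusion(infection_pred, lung_mask):
    """Keep only predicted infection pixels inside the lung ROI from the
    lung segmentation network, discarding false positives outside it."""
    return np.logical_and(infection_pred, lung_mask)

def infected_lung_fraction(infection_pred, lung_mask):
    """Quantify the infected area as a fraction of the lung region."""
    lung_px = int(lung_mask.sum())
    if lung_px == 0:
        return 0.0
    return float(post_roi_fusion(infection_pred, lung_mask).sum()) / lung_px

lung = np.array([[0, 1, 1, 0],
                 [0, 1, 1, 0]], dtype=bool)
pred = np.array([[1, 1, 0, 0],
                 [0, 0, 1, 1]], dtype=bool)  # two hits fall outside the lung
refined = post_roi_fusion(pred, lung)        # only in-lung pixels survive
```

Here two of the four predicted pixels lie outside the lung ROI and are suppressed, which is exactly how this step reduces false positives and enables lung-relative quantification of the infected area.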
Table 7

Quantitative performance comparison of pre-ROI fusion versus post-ROI fusion with and without applying pre-processing step for COVID-19 infection segmentation (Exp#2). (unit: %).

| Training Option | Pre-Processing | Pre-ROI Fusion | Post-ROI Fusion | DICE | IoU | AP | SEN | SPE | Eφ | MAE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Without Transfer Learning | × | × | × | 33.62 | 25.2 | 50.01 | 53.56 | 50.23 | 40.59 | 4.976 |
| | ✓ | × | × | 33.76 | 25.34 | 50.02 | 54.7 | 50.52 | 40.75 | 4.948 |
| | × | ✓ | × | 36.08 | 28.09 | 50.0 | 43.67 | 56.05 | 44.29 | 4.398 |
| | × | × | ✓ | 44.99 | 40.65 | 50.02 | 21.61 | 81.2 | 60.5 | 1.892 |
| | ✓ | ✓ | × | 36.3 | 28.35 | 50.0 | 43.38 | 56.57 | 44.59 | 4.345 |
| | ✓ | × | ✓ | 44.72 | 40.12 | 50.06 | 29.62 | 80.07 | 59.92 | 2.003 |
| With Transfer Learning | × | × | × | 72.91 | 64.78 | 68.38 | 61.07 | 99.8 | 91.02 | 0.028 |
| | ✓ | × | × | 75.3 | 66.86 | 69.43 | 72.95 | 99.78 | 91.01 | 0.027 |
| | × | ✓ | × | 52.09 | 49.46 | 51.65 | 77.89 | 95.64 | 75.76 | 0.439 |
| | × | × | ✓ | 73.13 | 64.96 | 68.74 | 60.66 | 99.8 | 91.13 | 0.027 |
| | ✓ | ✓ | × | 67.46 | 60.45 | 62.65 | 56.94 | 99.68 | 89.39 | 0.041 |
| | ✓ | × | ✓ | 75.7 | 67.22 | 69.92 | 72.78 | 99.79 | 91.11 | 0.026 |

×: Not included; ✓: Included; The best results are shown in boldface.

Fig. 7

Visual output results of pre-ROI fusion versus post-ROI fusion for COVID-19 infection segmentation (Exp#2) after applying a pre-processing step and training DMDF-Net-2 through transfer learning.

We further analyzed the effect of the pre- and post-processing stages on the same dataset (COVID-19-CT-Seg) to compare the results under two different data distributions. In this experiment, we evaluated the average performance of the proposed DMDF-Net-2 on the COVID-19-CT-Seg dataset using five-fold cross-validation. Table 8 shows the comparative results with and without the pre- and post-processing steps. As shown in Table 8, adding the pre- and post-processing stages gave only marginal gains of 0.48%, 0.5%, 0.55%, and 0.06% in average DICE, IoU, AP, and SPE, respectively, on the same dataset (COVID-19-CT-Seg). In comparison, the effect of the pre- and post-processing stages is significantly larger on the cross-dataset (MosMed), with average gains of 2.79%, 2.44%, 1.54%, and 11.71% in average DICE, IoU, AP, and SEN, respectively. These comparative results (Table 8) show the significant contribution of the pre- and post-processing stages under a different data distribution and validate the generality of our proposed solution.
Table 8

Comparative results of same versus cross datasets for COVID-19 infection segmentation (Exp#2). (unit: %).

Dataset | Pre-/Post-Processing | DICE (Std) | IoU (Std) | AP (Std) | SEN (Std) | SPE (Std) | Eφ (Std) | MAE (Std)
COVID-19-CT-Seg (Same-Dataset) | × | 81.51 (7.52) | 73.35 (7.81) | 75.63 (8.85) | 90.73 (5.28) | 98.89 (1.09) | 87.69 (5.56) | 0.121 (0.106)
COVID-19-CT-Seg (Same-Dataset) | ✓ | 81.99 (7.24) | 73.85 (7.59) | 76.18 (8.66) | 90.51 (5.21) | 98.95 (1.01) | 88.08 (5.37) | 0.115 (0.099)
COVID-19-CT-Seg/MosMed (Cross-Dataset) | × | 72.91 | 64.78 | 68.38 | 61.07 | 99.8 | 91.02 | 0.028
COVID-19-CT-Seg/MosMed (Cross-Dataset) | ✓ | 75.7 | 67.22 | 69.92 | 72.78 | 99.79 | 91.11 | 0.026

The best results are shown in boldface.

Our first MosMed dataset comprises a total of 2,049 images (785 infected and 1,264 non-infected slices). The second COVID-19-CT-Seg dataset includes a total of 3,520 images (1,843 infected and 1,677 non-infected slices). Owing to the small number of data samples, the ratio of infected to non-infected slices influenced the training of our proposed model by causing an under-fitting problem. To address this problem, we utilized the strength of transfer learning (Krizhevsky et al., 2012) to perform timely and efficient training of our model using a small dataset such as COVID-19-CT-Seg in all the experiments. Table 3 and Table 5 show the significant gains of transfer learning in Exp#1 and Exp#2, respectively. In addition, we also observed a class imbalance problem (particularly in the case of the MosMed dataset) owing to the small ratio of infected lung regions in each infected slice, which ultimately resulted in poor testing results (Table 6). This problem was further addressed by using a BCE loss in all the experiments. Table 4 and Table 6 show the comparative performance gains of our BCE loss over the conventional CE loss function in Exp#1 and Exp#2, respectively.
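The class-imbalance handling described above can be sketched as a class-balanced binary cross-entropy, where a factor β (the class balancing factor listed in Algorithm 1) re-weights the rare foreground class against the background. This is an illustrative NumPy sketch under that assumption; the paper's exact loss is given by its Eq. (4), which is not reproduced here:

```python
import numpy as np

def balanced_bce(pred, target, beta):
    """Class-balanced binary cross-entropy (illustrative sketch).
    pred: predicted foreground probabilities in (0, 1);
    target: binary ground-truth mask;
    beta: hypothetical class-balancing factor weighting the foreground
    term, with (1 - beta) weighting the background term."""
    eps = 1e-7
    pred = np.clip(pred, eps, 1 - eps)  # numerical stability
    loss = -(beta * target * np.log(pred)
             + (1 - beta) * (1 - target) * np.log(1 - pred))
    return loss.mean()
```

Setting β well above 0.5 penalizes missed infected pixels more heavily, which counters the small ratio of infected regions noted for the MosMed dataset.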

Comparisons with the State-of-the-Art methods

In this section, we perform a detailed comparative analysis of the proposed method with state-of-the-art segmentation networks proposed for COVID-19 (Ma et al., 2021, Owais et al., 2021, Zhang et al., 2020) and for general image segmentation (Badrinarayanan et al., 2017, Chen et al., 2018, Long et al., 2015, Ronneberger et al., 2015, Sandler et al., 2018). The proposed diagnostic framework includes DMDF-Net-1 and DMDF-Net-2 to extract the lung ROI (Exp#1) and segment infectious regions (Exp#2) in a given CT image, respectively. Consequently, the performance of DMDF-Net-1 and DMDF-Net-2 is separately compared with different baseline models. In (Ma et al., 2021, Owais et al., 2021, Zhang et al., 2020), the authors used the same datasets as those selected in our method; therefore, we directly compared our method with the results reported in (Ma et al., 2021, Owais et al., 2021, Zhang et al., 2020) for both lung segmentation (Exp#1) and COVID-19 infection segmentation (Exp#2). In contrast, for the other segmentation networks (Badrinarayanan et al., 2017, Chen et al., 2018, Long et al., 2015, Ronneberger et al., 2015, Sandler et al., 2018) proposed for general image segmentation applications, a direct comparison was not possible. Therefore, to make a fair comparison, we evaluated the segmentation results of these models (Badrinarayanan et al., 2017, Chen et al., 2018, Long et al., 2015, Ronneberger et al., 2015, Sandler et al., 2018) using the same datasets as those selected in this study. These baseline models include U-Net (Ronneberger et al., 2015), DeepLabV3+ (based on ResNet (Chen et al., 2018) and MobileNetV2 (Sandler et al., 2018)), SegNet (based on VGG16 and VGG19 (Badrinarayanan et al., 2017)), and FCNs (Long et al., 2015). Table 9 shows the comparative results of the proposed DMDF-Net-1 and all these baseline models for the lung segmentation task (Exp#1).
It can be observed (Table 9) that the proposed DMDF-Net-1 shows superior results (in terms of average DICE and IoU scores) with a lower number of training parameters compared to the other models. DAL-Net (Owais et al., 2021) ranked as the second-best network, with the second-highest DICE and IoU scores among all the baseline methods. The proposed DMDF-Net-1 outperforms (Owais et al., 2021) (second-best), yielding average gains of 0.36% [97.35%–96.99%], 0.64% [95.03%–94.39%], 0.45% [95.51%–95.06%], 0.37% [99.55%–99.18%], 0.07% [99.4%–99.33%], 0.25% [97.39%–97.14%], and 0.01 [0.07–0.06] for average DICE, IoU, AP, SEN, SPE, Eφ, and MAE, respectively. These figures are the average gains over the "left lung", "right lung", and "both lungs" datasets. Additionally, in a t-test analysis (proposed versus (Owais et al., 2021)), we obtained an average p-value of less than 0.05 (specifically, a p-value of 0.013), which distinguishes our model from (Owais et al., 2021) at a 95% confidence level. In addition, the number of training parameters of the proposed network is lower than that of (Owais et al., 2021). Specifically, our DMDF-Net-1 includes approximately 0.8 million fewer parameters than (Owais et al., 2021) (i.e., 5.85 million [proposed] ≪ 6.65 million (Owais et al., 2021)). Table 9 also includes the floating-point operations (FLOPs) and execution speed of our DMDF-Net-1 and the other baseline models. Our proposed DMDF-Net-1 requires 37.45 Giga FLOPs with an average execution speed of 25.64 frames per second (FPS).
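The t-test reported above compares matched per-image scores of two models evaluated on the same test images. A minimal sketch of the paired t-statistic is shown below; the score lists in the test are made-up toy values, not numbers from Table 9:

```python
import math

def paired_t_statistic(a, b):
    """Paired t-statistic over matched per-image scores a and b
    (e.g., DICE of two models on the same test images):
    t = mean(d) / sqrt(var(d) / n), with d = a - b and the
    unbiased (n - 1) variance estimate."""
    n = len(a)
    diffs = [x - y for x, y in zip(a, b)]
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)
```

The resulting statistic is compared against the Student-t distribution with n − 1 degrees of freedom to obtain the p-value (e.g., via `scipy.stats.ttest_rel` in practice).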
Table 9

Quantitative results of the proposed DMDF-Net-1 compared with different state-of-the-art segmentation networks for lung segmentation task (Exp#1). (unit: %).

Methods | #Par. (M) | FLOPs (G) | PT (FPS) | DICE (Std) | IoU (Std) | AP (Std) | SEN (Std) | SPE (Std) | Eφ (Std) | MAE (Std)

Left Lung
3D U-Net (Ma et al., 2021) | – | – | – | 85.8 (10.5) | – | – | – | – | – | –
CoSinGAN + 2D U-Net (Zhang et al., 2020) | – | – | – | 93.9 (5.6) | – | – | – | – | – | –
SegNet (VGG16) (Badrinarayanan et al., 2017) | 29.44 | 123.91 | 10.31 | 92.91 (0.78) | 87.43 (1.23) | 88.79 (1.14) | 97.78 (0.25) | 98.61 (0.29) | 93.16 (1.66) | 0.143 (0.027)
SegNet (VGG19) (Badrinarayanan et al., 2017) | 40.07 | 157.54 | 9.62 | 93.37 (0.66) | 88.16 (1.03) | 89.6 (1.14) | 97.43 (0.58) | 98.76 (0.09) | 93.98 (0.57) | 0.13 (0.01)
U-Net (Encoder Depth: 4) (Ronneberger et al., 2015) | 31.03 | 155.69 | 23.26 | 90.85 (3.98) | 84.54 (5.74) | 87.98 (6.84) | 93.38 (11.06) | 98.46 (1.08) | 90.96 (4.48) | 0.183 (0.081)
FCN (Up Sampling Factor: 32) (Long et al., 2015) | 134.29 | 187.34 | 18.52 | 91.39 (0.57) | 85.07 (0.85) | 86.97 (0.91) | 96.15 (0.58) | 98.35 (0.27) | 91.15 (1.56) | 0.175 (0.025)
DeepLabV3+ (ResNet) (Chen et al., 2018) | 20.61 | 58.42 | 31.25 | 95.78 (0.67) | 89.97 (0.81) | 90.89 (0.85) | 98.85 (0.17) | 98.93 (0.1) | 94.79 (1.12) | 0.107 (0.009)
DAL-Net (Owais et al., 2021) | 6.65 | 35.81 | 23.26 | 94.52 (0.44) | 90.03 (0.73) | 90.89 (0.67) | 99.05 (0.24) | 98.92 (0.17) | 94.81 (1.44) | 0.108 (0.017)
DeepLabV3+ (MobileNetV2) (Sandler et al., 2018) | 6.78 | 30.89 | 20.41 | 94.06 (1.65) | 88.97 (1.13) | 89.97 (1.16) | 98.77 (0.25) | 98.79 (0.18) | 94.07 (1.68) | 0.121 (0.016)
DMDF-Net-1 (Proposed) | 5.85 | 37.45 | 25.64 | 94.86 (0.4) | 90.59 (0.66) | 91.29 (0.66) | 99.43 (0.15) | 98.98 (0.1) | 94.67 (1.54) | 0.1 (0.01)

Right Lung
3D U-Net (Ma et al., 2021) | – | – | – | 87.9 (9.3) | – | – | – | – | – | –
CoSinGAN + 2D U-Net (Zhang et al., 2020) | – | – | – | 94.6 (3.8) | – | – | – | – | – | –
SegNet (VGG16) (Badrinarayanan et al., 2017) | 29.44 | 123.91 | 10.31 | 96.47 (0.3) | 93.36 (0.53) | 94.01 (0.52) | 99.27 (0.37) | 99.27 (0.08) | 96.54 (0.83) | 0.072 (0.007)
SegNet (VGG19) (Badrinarayanan et al., 2017) | 40.07 | 157.54 | 9.62 | 96.44 (0.26) | 93.3 (0.46) | 94.06 (0.5) | 98.98 (1.07) | 99.28 (0.05) | 96.46 (0.46) | 0.072 (0.008)
U-Net (Encoder Depth: 4) (Ronneberger et al., 2015) | 31.03 | 155.69 | 23.26 | 94.78 (5.21) | 91.01 (7.84) | 96.69 (1.65) | 87.89 (16.73) | 99.67 (0.24) | 93.3 (6.51) | 0.091 (0.074)
FCN (Up Sampling Factor: 32) (Long et al., 2015) | 134.29 | 187.34 | 18.52 | 94.04 (0.44) | 89.22 (0.7) | 90.4 (0.85) | 98.35 (0.53) | 98.74 (0.05) | 93.56 (0.82) | 0.128 (0.006)
DeepLabV3+ (ResNet) (Chen et al., 2018) | 20.61 | 58.42 | 31.25 | 96.74 (0.73) | 95.94 (0.28) | 96.44 (0.27) | 99.32 (0.56) | 99.59 (0.03) | 98.15 (0.21) | 0.043 (0.005)
DAL-Net (Owais et al., 2021) | 6.65 | 35.81 | 23.26 | 98.13 (0.32) | 96.39 (0.59) | 96.91 (0.49) | 99.21 (0.94) | 99.65 (0.02) | 98.33 (0.44) | 0.037 (0.007)
DeepLabV3+ (MobileNetV2) (Sandler et al., 2018) | 6.78 | 30.89 | 20.41 | 95.49 (1.88) | 95.33 (0.7) | 95.83 (0.5) | 99.4 (0.43) | 99.5 (0.11) | 97.71 (0.51) | 0.05 (0.013)
DMDF-Net-1 (Proposed) | 5.85 | 37.45 | 25.64 | 98.52 (0.24) | 97.11 (0.45) | 97.42 (0.43) | 99.64 (0.28) | 99.71 (0.04) | 98.78 (0.24) | 0.03 (0.004)

Both Lungs
SegNet (VGG16) (Badrinarayanan et al., 2017) | 29.44 | 123.91 | 10.31 | 97.1 (0.26) | 94.45 (0.47) | 95.38 (0.47) | 99.07 (0.34) | 98.92 (0.09) | 97.34 (0.5) | 0.103 (0.009)
SegNet (VGG19) (Badrinarayanan et al., 2017) | 40.07 | 157.54 | 9.62 | 97.16 (0.53) | 94.57 (0.96) | 95.46 (1.01) | 99.12 (0.55) | 98.95 (0.13) | 97.09 (0.92) | 0.102 (0.012)
U-Net (Encoder Depth: 4) (Ronneberger et al., 2015) | 31.03 | 155.69 | 23.26 | 85.76 (18.93) | 80.99 (19.36) | 90.18 (3.79) | 80.73 (42.08) | 98.04 (1.62) | 82.37 (22.69) | 0.4 (0.411)
FCN (Up Sampling Factor: 32) (Long et al., 2015) | 134.29 | 187.34 | 18.52 | 94.21 (0.58) | 89.42 (0.96) | 91.12 (1.06) | 98.33 (0.38) | 97.73 (0.19) | 93.31 (0.61) | 0.221 (0.015)
DeepLabV3+ (ResNet) (Chen et al., 2018) | 20.61 | 58.42 | 31.25 | 96.68 (0.89) | 96.42 (0.27) | 97.12 (0.34) | 99.17 (0.59) | 99.36 (0.02) | 98.28 (0.22) | 0.067 (0.007)
DAL-Net (Owais et al., 2021) | 6.65 | 35.81 | 23.26 | 98.33 (0.25) | 96.75 (0.47) | 97.37 (0.25) | 99.28 (0.55) | 99.41 (0.09) | 98.28 (0.51) | 0.061 (0.014)
DeepLabV3+ (MobileNetV2) (Sandler et al., 2018) | 6.78 | 30.89 | 20.41 | 96.26 (1.63) | 96.21 (0.46) | 96.86 (0.27) | 99.34 (0.39) | 99.28 (0.15) | 98.07 (0.41) | 0.072 (0.016)
DMDF-Net-1 (Proposed) | 5.85 | 37.45 | 25.64 | 98.66 (0.17) | 97.38 (0.33) | 97.82 (0.26) | 99.57 (0.12) | 99.51 (0.08) | 98.73 (0.35) | 0.048 (0.008)

#Par.: Number of parameters; M: Million; G: Giga; PT: Processing time; FPS: Frame per second;’–‘: Not available; Std: Standard deviation; The best results are shown in boldface.

Figure 8 shows the final segmentation results of the proposed lung segmentation model (DMDF-Net-1) versus different state-of-the-art segmentation networks (Badrinarayanan et al., 2017, Chen et al., 2018, Long et al., 2015, Owais et al., 2021, Ronneberger et al., 2015, Sandler et al., 2018). It can be observed in Fig. 8 that our proposed DMDF-Net-1 and the methods of (Chen et al., 2018, Owais et al., 2021, Sandler et al., 2018) show comparable visual results and outperform the other three baseline methods (Badrinarayanan et al., 2017, Long et al., 2015, Ronneberger et al., 2015). However, the overall quantitative performance of our DMDF-Net-1 (Table 9) is better than that of all the baseline methods.
Fig. 8

Visual comparison of lung segmentation results of the proposed framework with the other state-of-the-art deep segmentation models.

Table 10

Quantitative results of the proposed DMDF-Net-2 compared with different state-of-the-art segmentation networks for infection segmentation task (Exp#2). (unit: %).

Methods | #Par. (M) | FLOPs (G) | PT (FPS) | DICE | IoU | AP | SEN | SPE | Eφ | MAE

Without Pre- and Post-Processing
3D U-Net (Ma et al., 2021) | – | – | – | 58.8 | – | – | – | – | – | –
CoSinGAN + 2D U-Net (Zhang et al., 2020) | – | – | – | 47.4 | – | – | – | – | – | –
U-Net (Encoder Depth: 4) (Ronneberger et al., 2015) | 31.03 | 155.69 | 23.26 | 59.49 | 55.13 | 58.93 | 20.44 | 99.82 | 82.41 | 0.083
SegNet (VGG19) (Badrinarayanan et al., 2017) | 40.07 | 157.54 | 9.62 | 66.32 | 59.61 | 61.64 | 55.4 | 99.65 | 85.35 | 0.044
FCN (Up Sampling Factor: 32) (Long et al., 2015) | 134.29 | 187.34 | 18.52 | 68.16 | 60.97 | 63.35 | 57.3 | 99.7 | 83.72 | 0.039
SegNet (VGG16) (Badrinarayanan et al., 2017) | 29.44 | 123.91 | 10.31 | 66.81 | 60 | 63.97 | 42.44 | 99.79 | 84.65 | 0.032
DeepLabV3+ (ResNet) (Chen et al., 2018) | 20.61 | 58.42 | 31.25 | 68.12 | 60.97 | 65.81 | 42.63 | 99.82 | 88.26 | 0.029
DAL-Net (Owais et al., 2021) | 6.65 | 35.81 | 23.26 | 69.76 | 62.18 | 63.7 | 71.88 | 99.63 | 88.13 | 0.042
DeepLabV3+ (MobileNetV2) (Sandler et al., 2018) | 6.78 | 30.89 | 20.41 | 73.2 | 65.02 | 69 | 59.79 | 99.81 | 88.44 | 0.027
DMDF-Net-2 (Proposed) | 5.85 | 37.45 | 25.64 | 72.91 | 64.78 | 68.38 | 61.07 | 99.8 | 91.02 | 0.028

With Pre-Processing
U-Net (Encoder Depth: 4) (Ronneberger et al., 2015) | 31.03 | 155.69 | 9.9 | 64.36 | 58.26 | 61.34 | 39.52 | 99.74 | 86.95 | 0.078
SegNet (VGG19) (Badrinarayanan et al., 2017) | 40.07 | 157.54 | 6.17 | 64.5 | 58.24 | 59.04 | 77.28 | 99.32 | 83.27 | 0.072
FCN (Up Sampling Factor: 32) (Long et al., 2015) | 134.29 | 187.34 | 8.93 | 66.14 | 59.44 | 60.48 | 72.29 | 99.47 | 82.54 | 0.058
SegNet (VGG16) (Badrinarayanan et al., 2017) | 29.44 | 123.91 | 6.45 | 66.14 | 59.46 | 60.92 | 63.17 | 99.56 | 85.24 | 0.051
DeepLabV3+ (ResNet) (Chen et al., 2018) | 20.61 | 58.42 | 11.11 | 70.1 | 62.46 | 64.64 | 64.72 | 99.7 | 89.18 | 0.037
DAL-Net (Owais et al., 2021) | 6.65 | 35.81 | 9.9 | 72.5 | 64.41 | 66.01 | 76.42 | 99.69 | 89.65 | 0.036
DeepLabV3+ (MobileNetV2) (Sandler et al., 2018) | 6.78 | 30.89 | 9.35 | 72.84 | 64.7 | 66.76 | 72.21 | 99.72 | 89.35 | 0.033
DMDF-Net-2 (Proposed) | 5.85 | 37.45 | 10.31 | 75.3 | 66.86 | 69.43 | 72.95 | 99.78 | 91.01 | 0.027

With Pre- and Post-Processing (applying the same DMDF-Net-1)
U-Net (Encoder Depth: 4) (Ronneberger et al., 2015) | 36.88 | 193.14 | 7.14 | 67.64 | 60.62 | 66.06 | 39.3 | 99.84 | 85.07 | 0.057
SegNet (VGG19) (Badrinarayanan et al., 2017) | 45.92 | 194.99 | 4.98 | 66.9 | 59.98 | 60.93 | 76.83 | 99.47 | 84.68 | 0.058
FCN (Up Sampling Factor: 32) (Long et al., 2015) | 140.14 | 224.79 | 6.62 | 68.95 | 61.55 | 62.95 | 71.87 | 99.6 | 84.08 | 0.045
SegNet (VGG16) (Badrinarayanan et al., 2017) | 35.29 | 161.36 | 5.15 | 69.54 | 62.03 | 64.25 | 62.77 | 99.7 | 86.53 | 0.038
DeepLabV3+ (ResNet) (Chen et al., 2018) | 26.46 | 95.87 | 7.75 | 70.53 | 62.81 | 65.11 | 64.56 | 99.71 | 89.24 | 0.036
DAL-Net (Owais et al., 2021) | 12.5 | 73.26 | 7.14 | 73.17 | 64.98 | 66.72 | 76.09 | 99.71 | 89.87 | 0.034
DeepLabV3+ (MobileNetV2) (Sandler et al., 2018) | 12.63 | 68.34 | 6.85 | 73.56 | 65.32 | 67.57 | 71.98 | 99.74 | 89.54 | 0.031
DMDF-Net-2 (Proposed) | 11.7 | 74.9 | 7.35 | 75.7 | 67.22 | 69.92 | 72.78 | 99.79 | 91.11 | 0.026

#Par.: Number of parameters; M: Million; G: Giga; PT: Processing time; FPS: Frame per second; ‘–‘: Not available; The best results are shown in boldface.

Furthermore, Table 10 shows the comparative results of the proposed DMDF-Net-2 versus different baseline models for the infection segmentation task (Exp#2). It can be observed from Table 10 that the proposed DMDF-Net-2 also provides superior performance with a lower number of training parameters compared to the other methods. In contrast, DeepLabV3+ (MobileNetV2) (Sandler et al., 2018) is ranked as the second-best network among the other networks. With the addition of only the pre-processing step, the proposed DMDF-Net-2 outperforms (Sandler et al., 2018) (second-best), yielding average gains of 2.46% [75.3%–72.84%], 2.16% [66.86%–64.7%], 2.67% [69.43%–66.76%], 0.74% [72.95%–72.21%], 0.06% [99.78%–99.72%], 1.66% [91.01%–89.35%], and 0.006 [0.033–0.027] for average DICE, IoU, AP, SEN, SPE, Eφ, and MAE, respectively. Similarly, after including both pre- and post-processing steps (applying the same lung segmentation network, DMDF-Net-1), the proposed DMDF-Net-2 outperforms (Sandler et al., 2018) (second-best), yielding average gains of 2.14% [75.7%–73.56%], 1.9% [67.22%–65.32%], 2.35% [69.92%–67.57%], 0.8% [72.78%–71.98%], 0.05% [99.79%–99.74%], 1.57% [91.11%–89.54%], and 0.005 [0.031–0.026] for average DICE, IoU, AP, SEN, SPE, Eφ, and MAE, respectively.
However, without the pre- and post-processing steps, (Sandler et al., 2018) shows better results than our DMDF-Net-2, with small margins of 0.29% [73.2%–72.91%], 0.24% [65.02%–64.78%], 0.62% [69%–68.38%], 0.01% [99.81%–99.8%], and 0.001 [0.028–0.027] for average DICE, IoU, AP, SPE, and MAE, respectively. Still, our model shows higher SEN and Eφ in comparison with (Sandler et al., 2018) (second-best), yielding an average gain in the SEN score of 1.28% [61.07%–59.79%] and in the Eφ score of 2.58% [91.02%–88.44%], respectively. Additionally, in a t-test analysis (proposed versus (Sandler et al., 2018)), we obtained an average p-value of less than 0.01 (specifically, a p-value of 0.0072), which distinguishes our model from (Sandler et al., 2018) at a 99% confidence level. Similarly, the number of training parameters of the proposed network is also lower than that of (Sandler et al., 2018). Specifically, our DMDF-Net-2 includes approximately 1.6 million fewer parameters than (Sandler et al., 2018) (i.e., 11.7 million [proposed] ≪ 13.56 million (Sandler et al., 2018)). Consequently, these results (Table 9, Table 10) highlight the superior performance of our model. Table 10 also includes the FLOPs and execution speed of our DMDF-Net-2 and the other baseline models. After including both pre- and post-processing steps, our final infection segmentation framework (DMDF-Net-1 and DMDF-Net-2) requires 74.9 Giga FLOPs with an average execution speed of 7.35 frames per second. Figure 9 presents the visual comparative results of our proposed framework with the other state-of-the-art deep segmentation models. Fig. 9a presents the comparative results without including the pre- and post-processing stages, Fig. 9b shows the visual outputs of all the methods when including only the pre-processing stage, and Fig. 9c visualizes the comparative performance when including both pre- and post-processing stages (applying the same lung segmentation network). It can be observed (Fig. 9) that the proposed network generates well-localized segmentation outputs for the input CT images. However, several reference models (Badrinarayanan et al., 2017, Chen et al., 2018, Long et al., 2015, Ronneberger et al., 2015) generate inadequate segmentation results, which are marked as false-positive (i.e., normal regions incorrectly recognized as infectious) and false-negative (i.e., infected regions not recognized) pixels in Fig. 9. Nevertheless, (Owais et al., 2021) and (Sandler et al., 2018) showed better performance than (Badrinarayanan et al., 2017, Chen et al., 2018, Long et al., 2015, Ronneberger et al., 2015). However, the average segmentation results (Table 9, Table 10) show the higher performance of our method compared with (Owais et al., 2021) and (Sandler et al., 2018). Primarily, the superior performance of our method is attained by the addition of C-Blocks in both the encoder and decoder modules, which exploit diverse representations of lung/lesion patterns from the given data by performing multiscale deep feature fusion. Moreover, a residual connection (extracted from B-Block 1 of the encoder module) further contributes low-level contextual information to the decoding part to refine the edge information of the final output.
Fig. 9

Visual comparison of COVID-19 infection segmentation results of the proposed framework with the other state-of-the-art deep segmentation models (a) without pre- and post-processing, (b) with pre-processing, and (c) with pre- and post-processing (applying the same lung segmentation network).


Discussion

This section describes the distinctive aspects of the proposed method, with possible limitations that can influence the diagnostic performance of our system. Finally, it includes a brief roadmap for future work to address these constraints and improve the overall performance.

Principal findings

This study leveraged the strengths of recent deep learning techniques in chest CT image analysis to identify lung lesions associated with COVID-19 infection. The proposed framework mainly includes a lung segmentation network (DMDF-Net-1) and an infection segmentation network (DMDF-Net-2) to extract the lung ROIs and infected areas from a CT image. The output of DMDF-Net-1 is mainly used in a later post-processing step to improve the infection segmentation results of DMDF-Net-2 and to provide a quantitative evaluation of the infected area in the CT image. Accurate detection and quantification of infected lung regions are essential for measuring infection severity in individual lung lobes and for finding suitable personalized treatments (Zhang et al., 2020). Fig. 10 presents the infection quantification results of our proposed diagnostic framework for some typical CT images, including both positive (Fig. 10a) and negative (Fig. 10b) data samples. Additionally, the intermediate outputs (in Fig. 10, after pre-processing, DMDF-Net-1, DMDF-Net-2, and post-processing) further present the diagnostic workflow of the proposed framework. In Fig. 10, the PIAL score represents the quantification of the infectious regions in each CT image, which is calculated by dividing the area of the infected region by the total area of the lung lobes (i.e., PIAL = (infected area / total lung area) × 100).
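The PIAL computation described above reduces to a ratio of mask areas. A minimal sketch, assuming binary NumPy masks for the infection output (DMDF-Net-2) and the lung ROI (DMDF-Net-1):

```python
import numpy as np

def pial_score(infection_mask, lung_mask):
    """Proportion of the infected area of the lung (PIAL), in percent:
    the area of the infected region (restricted to the lung ROI)
    divided by the total lung area."""
    lung_area = np.count_nonzero(lung_mask)
    if lung_area == 0:
        return 0.0  # no lung pixels detected
    infected = np.count_nonzero(np.logical_and(infection_mask, lung_mask))
    return 100.0 * infected / lung_area
```

Restricting the infection mask to the lung ROI mirrors the post-ROI fusion idea: infection pixels predicted outside the lung are discarded as false positives before quantification.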
Fig. 10

Infection quantification results of the proposed diagnostic framework for some CT images, including both (a) infectious and (b) normal data samples. (PIAL: Proportion of the infected area of the lung).

Furthermore, in our network design, we mainly utilized the strength of grouped convolution and multiscale deep feature fusion using multiscale dilated convolutions (C-Blocks) to achieve better segmentation results with a reduced number of training parameters (specifically, 5.85 million). The encoder of the proposed DMDF-Net contains fewer training parameters (specifically, 0.88 million) than the original MobileNetV2 (specifically, 2.24 million). Owing to the optimal size of the encoder design, the average execution speed of our model is higher than that of the original MobileNetV2: our DMDF-Net processes approximately 25.64 frames per second, while the original MobileNetV2 (which showed the second-best accuracies in Table 10) processes 20.41 frames per second. The average execution speed (in terms of the number of processed frames per second) was measured using the computing environment described in Section 2.1. Consequently, the optimal design of our model achieves state-of-the-art performance and utilizes low-cost hardware resources without compromising the overall diagnostic performance. To visualize the internal workflow of the proposed DMDF-Net, we also show the multiscale class activation maps (CAMs) (Zhou, Khosla, Lapedriza, Oliva, & Torralba, 2016) extracted from five different layers (labeled as Conv 2, B-Block 1, B-Block 2, Conv 3, and Conv 8 in Table 2) of the network. The input image of size 288 × 352 is downsampled into four spatial sizes (i.e., 144 × 176, 72 × 88, 36 × 44, and 18 × 22) while passing through the encoder module. Subsequently, the encoded output of size 18 × 22 is upsampled into two spatial sizes (i.e., 72 × 88 and 288 × 352) while passing through the decoder module.
The decoded output of size 288 × 352 is the final output of the proposed network. Therefore, a total of five layers were selected for multiscale CAM visualization based on the distinctive spatial sizes of their outputs inside the encoder and decoder modules. Fig. 11 shows the multiscale CAM visualization of the proposed DMDF-Net-1 (lung segmentation network) and DMDF-Net-2 (infection segmentation network) for test CT images. It can be observed (Fig. 11) that the class-specific regions (lung ROIs or infectious regions) become increasingly discriminative after passing through successive layers. Finally, a binary image is obtained as the final output, which presents the "lung/infectious region" and "normal/background region" as white ('1') and black ('0') pixels, respectively.
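Following the original CAM formulation (Zhou et al., 2016), each map of the kind shown above is a class-weighted sum of a layer's feature maps, normalized for display. The sketch below is illustrative only; the feature tensor and weights are assumed inputs, not the paper's implementation:

```python
import numpy as np

def class_activation_map(features, class_weights):
    """Class activation map: weighted sum of C feature maps
    (features: C x H x W) using class-specific weights (length C),
    then min-max normalized to [0, 1] for visualization."""
    cam = np.tensordot(class_weights, features, axes=1)  # -> (H, W)
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()
    return cam
```

Applying this at layers with different output resolutions (e.g., 144 × 176 down to 18 × 22, as listed above) yields the multiscale CAM panels of Fig. 11 after upsampling each map to the input size.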
Fig. 11

Multiscale class activation maps (CAM) visualization of the proposed (a) DMDF-Net-1 for lung segmentation (Exp#1) and (b) DMDF-Net-2 for COVID-19 infection segmentation (Exp#2).

Although several online datasets related to COVID-19 are available, most of them target the classification problem. The few publicly available COVID-19 segmentation datasets typically include segmentation masks only for the infectious regions. This study presents a CAD framework for the automatic detection and quantification of COVID-19 related findings in lung CT scans. The proposed method includes an additional post-processing step that also requires a lung segmentation mask for accurate segmentation and quantification of infected lung areas. Therefore, we selected the COVID-19-CT-Seg dataset, which includes the ground truths for both the lung and infection regions of each slice. Secondly, we aimed to develop a CAD tool that can efficiently segment trivial infected regions in the lung. Therefore, we selected the MosMed dataset, which includes trivial lung tissue abnormalities with COVID-19 (pulmonary parenchymal involvement < 25%) (Morozov et al., 2020). In our future work, we will explore additional segmentation datasets related to the detection and quantification of COVID-19 related findings and develop a more efficient CAD solution.

Limitations and future work

Despite the promising results of the proposed method compared to existing methods, the current research still has some limitations. First, the performance of cross-datasets is still limited and can be further improved. Therefore, in future work, we will strive to increase the cross-data performance of the method, including multi-source CT data. Second, the proposed network can only segment lesions associated with COVID-19. In future work, we will collect more datasets, including those for multiple diseases, and propose a new CAD method that can detect and distinguish between COVID-19 and different types of diseases, such as other viral and bacterial infections.

Conclusions

In this study, we proposed a fully automated CAD framework for the effective recognition and quantification of COVID-19 related findings in chest CT images. We mainly proposed a deep segmentation network (named DMDF-Net) that includes additional pre- and post-processing steps for accurate segmentation of infectious regions in CT images. The pre-processing step was included to address the generality issues considering a real-world scenario. The post-processing step generates a well-localized ROI of the infectious regions and further provides the quantification of lesion regions in terms of the PIAL score. In detail, our designed network utilizes the strength of grouped convolution and multiscale deep feature fusion using multiscale dilated convolution to achieve better segmentation results with a reduced number of learnable parameters (specifically, 5.85 million). The optimal size of our model utilizes low-cost hardware resources and provides effective diagnostic results. The first network, DMDF-Net-1, exhibited average DICE scores of 94.86%, 98.52%, and 98.66%, IoU scores of 90.59%, 97.11%, and 97.38%, and Eφ scores of 94.67%, 98.78%, and 98.73% for the segmentation of the left, right, and both lung lobes (Exp#1), respectively. Similarly, the second network, DMDF-Net-2 (including both pre- and post-processing steps), exhibited an average performance of 75.7%, 67.22%, 69.92%, 72.78%, 99.79%, 91.11%, and 0.026 for average DICE, IoU, AP, SEN, SPE, Eφ, and MAE, respectively, for COVID-19 infection segmentation (Exp#2). Finally, a detailed comparative study for both Exp#1 and Exp#2 validates the superior results of our method over various state-of-the-art deep segmentation models.

CRediT authorship contribution statement

Muhammad Owais: Methodology, Writing – original draft. Na Rae Baek: Data curation. Kang Ryoung Park: Supervision, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Algorithm 1: Training procedure of our DMDF-Net (pseudo-code)
Input: {I_k, M_k} (k = 1, …, N): N training data samples; {I_k, M_k} (k = 1, …, V): V validation data samples; θ: initial training parameters; α: initial learning rate; K: maximum number of epochs; B: mini-batch size; β: class balancing factor
Initialization: initialize the training parameters θ (ImageNet-pretrained weights)
1 // Proceed with the network training process
  for n = 1, 2, 3, …, K do // number of epochs
2   // Split the whole training data into N/B mini-batches of size B: {I_k, M_k}_1, {I_k, M_k}_2, …, {I_k, M_k}_(N/B)
3   for i = 1, 2, 3, …, N/B do // number of iterations
4     update: θ = θ − α · ∇θ L_BCE({I_k, M_k}_i, β) // loss as in Eq. (4)
5   end
6   obtain: {M′_k} (k = 1, …, V) = ψ({I_k} (k = 1, …, V), θ) // ψ(·): transfer function of our DMDF-Net
7   // Check validation accuracy after every epoch to avoid overfitting
    if validationAccuracy({M′_k}, {M_k}) has converged (i.e., is not increasing) then stop the training process for the remaining epochs end
8 end
9 Output: finally trained parameters θ′ // optimal weights of our model
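As a sanity check of the control flow, Algorithm 1 can be transcribed into plain Python. The `loss_grad` and `val_accuracy` callables are hypothetical stand-ins for the BCE loss of Eq. (4) and the validation measure; only the epoch/mini-batch structure and early stopping are taken from the algorithm:

```python
import random

def train(data, val_data, theta, alpha, max_epochs, batch_size,
          beta, loss_grad, val_accuracy):
    """Sketch of Algorithm 1: mini-batch gradient descent with early
    stopping once validation accuracy stops increasing. theta is a list
    of parameters; loss_grad returns the matching gradient list."""
    best_acc = float("-inf")
    for epoch in range(max_epochs):
        random.shuffle(data)  # re-split into N/B mini-batches each epoch
        batches = [data[i:i + batch_size]
                   for i in range(0, len(data), batch_size)]
        for batch in batches:
            grad = loss_grad(theta, batch, beta)          # line 4: gradient of L_BCE
            theta = [t - alpha * g for t, g in zip(theta, grad)]
        acc = val_accuracy(theta, val_data)               # lines 6-7
        if acc <= best_acc:                               # converged: stop early
            break
        best_acc = acc
    return theta                                          # line 9: trained parameters
```

With a toy convex objective (e.g., minimizing (θ − 3)²), the loop converges to the optimum and the early-stopping branch triggers once the validation measure plateaus.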