Jasjit S Suri1,2, Sushant Agarwal3,4, Luca Saba5, Gian Luca Chabert5, Alessandro Carriero6, Alessio Paschè5, Pietro Danna5, Armin Mehmedović7, Gavino Faa8, Tanay Jujaray3,9, Inder M Singh10, Narendra N Khanna11, John R Laird12, Petros P Sfikakis13, Vikas Agarwal14, Jagjit S Teji15, Rajanikant R Yadav16, Ferenc Nagy17, Zsigmond Tamás Kincses18, Zoltan Ruzsa19, Klaudija Viskovic7, Mannudeep K Kalra20. 1. Stroke Diagnostic and Monitoring Division, AtheroPoint™, Roseville, CA, USA. jasjit.suri@atheropoint.com. 2. Advanced Knowledge Engineering Centre, GBTI, Roseville, CA, USA. jasjit.suri@atheropoint.com. 3. Advanced Knowledge Engineering Centre, GBTI, Roseville, CA, USA. 4. Department of Computer Science Engineering, Pranveer Singh Institute of Technology, Kanpur, Uttar Pradesh, India. 5. Department of Radiology, Azienda Ospedaliero Universitaria (A.O.U.), Cagliari, Italy. 6. Depart of Radiology, "Maggiore Della Carità" Hospital, University of Piemonte Orientale, Via Solaroli 17, 28100, Novara, Italy. 7. University Hospital for Infectious Diseases, Zagreb, Croatia. 8. Department of Pathology - AOU of Cagliari, Cagliari, Italy. 9. Dept of Molecular, Cell and Developmental Biology, University of California, Santa Cruz, CA, USA. 10. Stroke Diagnostic and Monitoring Division, AtheroPoint™, Roseville, CA, USA. 11. Department of Cardiology, Indraprastha APOLLO Hospitals, New Delhi, India. 12. Heart and Vascular Institute, Adventist Health St. Helena, St Helena, CA, USA. 13. Rheumatology Unit, National Kapodistrian University of Athens, Athens, Greece. 14. Dept. of Immunology, SGPIMS, Lucknow, UP, India. 15. Ann and Robert H. Lurie Children's Hospital of Chicago, Chicago, USA. 16. SGPIMS, Uttar Pradesh, Lucknow, India. 17. Internal Medicine Department, University of Szeged, Szeged, 6725, Hungary. 18. Department of Radiology, University of Szeged, Szeged, 6725, Hungary. 19. Invasive Cardiology Division, University of Szeged, Budapest, Hungary. 20. 
Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, USA.
Abstract
Variations in COVID-19 lesions such as ground-glass opacities (GGO), consolidations, and crazy paving can compromise the ability of solo deep learning (SDL) or hybrid deep learning (HDL) artificial intelligence (AI) models to predict automated COVID-19 lung segmentation in Computed Tomography (CT) on unseen data, leading to poor clinical manifestations. As the first study of its kind, "COVLIAS 1.0-Unseen" proves two hypotheses: (i) contrast adjustment is vital for AI, and (ii) HDL is superior to SDL. In a multicenter study, 10,000 CT slices were collected from 72 Italian (ITA) patients with low-GGO and 80 Croatian (CRO) patients with high-GGO. Hounsfield Units (HU) were automatically adjusted to train the AI models and predict on test data, leading to four combinations. These were two Unseen sets, (i) train-CRO:test-ITA and (ii) train-ITA:test-CRO, and two Seen sets, (iii) train-CRO:test-CRO and (iv) train-ITA:test-ITA. COVLIAS used three SDL models (PSPNet, SegNet, and UNet) and six HDL models (VGG-PSPNet, VGG-SegNet, VGG-UNet, ResNet-PSPNet, ResNet-SegNet, and ResNet-UNet). Two trained, blinded senior radiologists conducted ground truth annotations. Five types of performance metrics were used to validate COVLIAS 1.0-Unseen, which was further benchmarked against MedSeg, an open-source web-based system. After HU adjustment, for DS and JI, HDL (Unseen AI) > SDL (Unseen AI) by 4% and 5%, respectively. For CC, HDL (Unseen AI) > SDL (Unseen AI) by 6%. The COVLIAS-MedSeg difference was < 5%, meeting regulatory guidelines. Unseen AI was successfully demonstrated using automated HU adjustment. HDL was found to be superior to SDL.
Coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is an infectious disease that had infected 385 million individuals and killed 5.7 million people globally as of 3rd February 2022 [1]. On March 11th, 2020, the World Health Organization (WHO) declared COVID-19, the novel coronavirus disease, a global pandemic [2]. COVID-19 [3, 4] has proven to be worse in individuals with comorbidities such as coronary artery disease [3, 5], diabetes [6], atherosclerosis [7], fetal [8], etc. [9-11]. It has also caused architectural distortion through the interactions between alveolar and vascular changes [12] and has affected aspects of daily life such as nutrition [13]. Pathology has shown that even after vaccine immunization (ChAdOx1 nCoV-19), vaccine-induced immune thrombotic thrombocytopenia (VITT) was triggered [14]. It was also observed that adults who were born small, so-called intrauterine growth restriction (IUGR), are also likely to be affected by COVID-19 [8].
One of the gold standards for COVID-19 detection is the "reverse transcription-polymerase chain reaction", commonly known as the RT-PCR test. Nonetheless, the RT-PCR test takes time and has low sensitivity [15-17]. This is where image-based analysis of COVID-19 patients, using chest radiographs and Computed Tomography (CT) [18-20], is used to diagnose the disease and serves as a reliable complement to RT-PCR [21]. In the general diagnosis of COVID-19 and body imaging, CT has shown high sensitivity and reproducibility [20, 22–24]. The primary benefit of CT [25, 26] is its capacity to identify anomalies such as ground-glass opacity (GGO), consolidation, and other opacities [27-29] seen in COVID-19 patients [30–35].
Deep learning (DL) is a branch of AI that employs deep layers to provide fully automatic feature extraction, classification, and segmentation of the input data [36, 37]. Our team has developed the COVLIAS system, which uses deep learning models for lung segmentation [38-40].
In these previous studies, only one cohort was used when applying cross-validation, leading to bias in the performance since both the training and testing data were taken from the same CT machine, the same hospital settings, and the same geographical region [41-43]. To overcome this weakness, we introduce a multicentre study in which training is conducted on one set of data coming from Croatia and testing is conducted on another data set coming from Italy (or vice-versa). This is the so-called “Unseen AI”, which is one of the innovations of the proposed study. Just recently, there has been more visibility on “Unseen AI” [38, 44].
Due to variations in COVID-19 lesions such as GGO, consolidations, and crazy paving, AI models that predict automated COVID-19 lung segmentation on Unseen CT data can lead to poor clinical manifestations (see Fig. 1). This happens when the Hounsfield Units (HU) [45] of the CT images are not consistent between the training and testing paradigms, which leads to over- and under-estimation of the predicted region. This can be prevented via normalization right before AI deployment [46, 47]. We embed such normalization in our AI framework automatically, which is another innovation besides the Unseen AI model design.
Fig. 1
Overlay of segmentation results (red) from the ResNet-SegNet HDL models trained without adjusting the HU level. The white arrow represents the region where the ResNet-SegNet HDL model under-estimates the lung region
Recent advances in deep learning, such as hybrid deep learning (HDL), have shown promising results [38–40, 48–52]. Using this premise, we hypothesize that HDL models are superior to solo DL (SDL) models for segmentation. In this study, we have designed nine SDL and HDL models that are trained and tested for COVID-19-based lung segmentation on multicentre databases. We further offer insight into how the nine AI models respond to COVID-19 data sets, which is another unique contribution of the proposed study. The analysis includes attributes such as (i) the size of the model, (ii) the number of layers in the AI architecture, (iii) the segmentation model utilized, and (iv) the encoder part of the AI model. These can be used for a comparison between the nine AI models. Lastly, to prove the effectiveness of the AI models, we present performance evaluation using (i) Dice Similarity (DS), (ii) Jaccard Index (JI) [53], (iii) Bland–Altman plots (BA) [54, 55], (iv) Correlation coefficient (CC) plots [56, 57], and (v) Figure of Merit. Finally, as part of scientific validation, we compare the performance of COVLIAS 1.0-Unseen against MedSeg [58], a web-based lung segmentation tool.
Literature Survey
Artificial intelligence (AI) has been in existence for a while, especially in the field of medical imaging [59, 60]. AI can play a vital role in the investigation of CT and X-ray images, assisting in the detection of COVID-19 type and overcoming the shortage of expert workers. It started with machine learning, which moved into different applications of point-based models such as diabetes [61, 62], neonatal and infant mortality [63], and gene analysis [64], and image-based machine learning models such as carotid plaque classification [65-69], thyroid [70], liver [71], stroke [24], coronary [72], ovarian [73], prostate [74], skin cancer [75, 76], Wilson disease [77], ophthalmology [78], etc. The major challenge with these models is the feature extraction process, which is ad hoc in nature and, therefore, very time-consuming [79]. It has recently been shown that this weakness is being overcome by deep learning (DL) models [59, 60].
Paluru et al. [80] proposed AnamNet, a hybrid of UNet and ENet, to segment COVID-19-based lesions using 4,300 images from 69 patients at a resolution of 512² [81]. The authors compared the model against ENet [82], UNet++ [83], SegNet, and LEDNet [84]. The DS for lesion detection turned out to be 0.77. Saood et al. [85] used a set of 100 images downscaled to 256² to compare two models, namely UNet and SegNet, and reported DS scores of 0.73 and 0.74, respectively. Cai et al. [86] established a tenfold cross-validation (CV) protocol on 250 images from 99 patients and adopted the UNet model with a DS of 0.77. They also suggested a method for predicting the duration of an intensive care unit (ICU) stay. Suri et al. [40] benchmarked NIH [87] (a conventional model) against three AI models, namely SegNet, VGG-SegNet, and ResNet-SegNet, using nearly 5000 CT scans from 72 patients at an image resolution of 768², and concluded that ResNet-SegNet was the best performing model. In an inter-variability study by Suri et al. [39], three models, namely PSPNet, VGG-SegNet, and ResNet-SegNet, were used. The authors showed that HDL models outperformed SDL models by ~ 5% for all the performance evaluation metrics using 5000 CT slices (taken from 72 patients) at an image resolution of 768². A recent study by the same authors [38] compared VGG-SegNet and ResNet-SegNet in their COVLIAS 1.0 system against MedSeg. This study used HDL models and applied the standard Mann–Whitney, Paired t-Test, and Wilcoxon tests to demonstrate the system's stability.
Method and Methodology
Demographics and Data Acquisition
The proposed study utilizes two different cohorts from different countries. The first dataset contains 72 adult Italian patients (approximately 5000 images, Fig. 2), of whom 46 were male and the remainder female. A total of 60 patients tested positive by RT-PCR, for which broncho-alveolar lavage [88] was used in 12 individuals. This Italian cohort had an average GGO of 2.1, which was considered low. The second cohort consisted of 80 Croatian patients (approximately 5000 images, Fig. 3), of whom 57 were male and the rest female. This cohort had a mean age of 66 and an average GGO of 4.1, which was considered high.
Fig. 2
Sample CT scans taken from raw CRO data sets
Fig. 3
Sample CT scans taken from raw ITA data sets
For the patients in the Italian cohort, CT data were acquired using Philips' automatic tube current modulation (Z-DOM), while Croatia's CT volumes were acquired using the FCT Speedia HD 64-detector MDCT scanner (Fujifilm Corporation, Tokyo, Japan, 2017). The exclusion criteria consisted of patients having metallic items, or poor image quality with artifacts or blurriness induced by patient movement during scan execution [38].
AI Architectures Adapted
The proposed study uses a total of nine AI models, of which (i) PSPNet (see Supplemental A.1), (ii) SegNet, and (iii) UNet are SDL models and (iv) VGG-PSPNet (Fig. 4), (v) ResNet-PSPNet (Fig. 5), (vi) VGG-SegNet (see Supplemental A.2), (vii) ResNet-SegNet (see Supplemental A.3), (viii) VGG-UNet (Fig. 6), and (ix) ResNet-UNet (Fig. 7) are the HDL models. The difference between SDL and HDL is that the traditional backbone, or encoder part, of the SDL model is replaced with a new model such as VGG or ResNet. Recent findings by Suri et al. [39, 40, 48, 49, 89] show that employing HDL models over SDL models in the medical sector helps learn complicated imaging features rapidly and reliably. Using this knowledge that the performance of HDL > SDL, we here introduce four new HDL models, namely VGG-PSPNet, ResNet-PSPNet, VGG-UNet, and ResNet-UNet, for lung segmentation of COVID-19-based CT images.
Fig. 4
VGG-PSPNet architecture
Fig. 5
ResNet-PSPNet architecture
Fig. 6
VGG-UNet architecture
Fig. 7
ResNet-UNet architecture
UNet [90] was the first medical segmentation model that consisted mainly of two sections: (i) the encoder, where the model tries to learn the features in the images, and (ii) the decoder, the part of the model that up-samples the image to produce the desired output, such as the segmented binary lung mask in this study. Another model used in this paper is SegNet [91], which transfers only the pooling indices from the compression (encoder) path to the expansion (decoder) path, thereby using little memory. The Pyramid Scene Parsing Network (PSPNet) [92] is a semantic segmentation network that considers the full context of an image using its pyramid pooling module. PSPNet extracts the feature map from an input image using a pretrained CNN and the dilated network technique. The size of the resulting feature map is 1/8 that of the input image. Finally, the collection of these features is used to generate the output binary mask.
Residual networks (ResNet) [93] use "skip connections" and "batch normalization" to train deep layers without sacrificing efficiency, permitting gradients to bypass a set number of levels. This mitigates the vanishing gradient problem, a mechanism that is not present in VGGNet [94]. The primary attributes of the AI models, such as the backbone used in the architecture, the number of layers in the training models, the total number of parameters in the architecture, and the final size of the trained models, are further discussed and compared in the discussion section.
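The pooling-index transfer described for SegNet can be illustrated with a small one-dimensional sketch. This is not the authors' implementation: the function names `max_pool_with_indices` and `max_unpool`, the window size of 2, and the sample feature values are all assumptions chosen for illustration. The point is only that the decoder needs the positions of the maxima, not the full encoder feature map.

```python
def max_pool_with_indices(signal, size=2):
    """Non-overlapping 1-D max pooling that also records where each maximum was."""
    pooled, indices = [], []
    for start in range(0, len(signal) - size + 1, size):
        window = signal[start:start + size]
        # Position of the (first) maximum inside the current window.
        offset = max(range(size), key=lambda k: window[k])
        pooled.append(window[offset])
        indices.append(start + offset)
    return pooled, indices


def max_unpool(pooled, indices, length):
    """Put each pooled value back at its recorded position; fill the rest with zeros."""
    restored = [0.0] * length
    for value, index in zip(pooled, indices):
        restored[index] = value
    return restored


# Hypothetical encoder features (illustrative values only).
features = [0.1, 0.9, 0.4, 0.3, 0.7, 0.2]
pooled, indices = max_pool_with_indices(features)
restored = max_unpool(pooled, indices, len(features))
print(pooled, indices, restored)
```

Only the short `indices` list has to be carried from the encoder to the decoder, which is the memory saving the text refers to.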
Experimental Protocol
This study involves two datasets from different centers, each of ~ 5000 lung CT images of COVID-19 patients. We utilized a fivefold cross-validation [95, 96] technique for the training of the AI models, without overlap between the training and testing sets. The training and testing performance was determined by the accuracy score computed between the binary output of the trained AI model and the gold standard [39, 40]. The accuracy of the system was computed using a standardized protocol based on the true positive, true negative, false negative, and false positive counts. Finally, to assess the model's training during backpropagation, the cross-entropy loss function was employed. The plots of the accuracy and loss function can be seen in Figs. 8 and 9.
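The two quantities named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the pixel labels and probabilities are made up, the 0.5 decision threshold is an assumption, and the binary form of the cross-entropy is assumed because the output is a two-class lung mask.

```python
import math


def confusion_counts(predicted, truth):
    """Count TP, TN, FP, FN for two flat lists of 0/1 pixel labels."""
    tp = tn = fp = fn = 0
    for p, t in zip(predicted, truth):
        if p == 1 and t == 1:
            tp += 1
        elif p == 0 and t == 0:
            tn += 1
        elif p == 1 and t == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of pixels on which the prediction and the gold standard agree."""
    return (tp + tn) / (tp + tn + fp + fn)


def binary_cross_entropy(probabilities, truth, eps=1e-7):
    """Mean binary cross-entropy; eps keeps log() away from zero."""
    total = 0.0
    for p, t in zip(probabilities, truth):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(truth)


# Hypothetical pixels: gold-standard labels and predicted lung probabilities.
truth = [1, 1, 0, 0, 1, 0]
probabilities = [0.9, 0.6, 0.2, 0.7, 0.4, 0.1]
predicted = [1 if p >= 0.5 else 0 for p in probabilities]  # assumed threshold

counts = confusion_counts(predicted, truth)
acc = accuracy(*counts)
loss = binary_cross_entropy(probabilities, truth)
print(counts, round(acc, 4), round(loss, 4))
```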
Fig. 8
Accuracy and loss plot for the nine AI models for the training on the CRO dataset
Fig. 9
Accuracy and loss plot for the nine AI models for the training on the ITA dataset
Results and Performance Evaluations
Results
To prove our hypothesis that the performance of the HDL > SDL models in the proposed study, we present a comparison between (i) SDL and HDL models and (ii) the models trained using high-GGO and low-GGO lung CT images. The accuracy and loss plots of the nine AI models for the CRO and ITA datasets are presented in Figs. 8 and 9. Using overlays (Figs. 10, 11, 12 and 13), we present a visual representation of the results from the AI models under four different scenarios. For Seen analysis, these are (i) train on Croatia data (CRO) and test on CRO, and (ii) train on Italy data (ITA) and test on ITA. Similarly, for Unseen analysis, these are (iii) train on CRO and test on ITA, and (iv) train on ITA and test on CRO. This study makes use of two different datasets: (i) CRO, with ~ 5000 CT images of COVID-19 patients who are considered high-GGO patients, and (ii) ITA, with ~ 5000 COVID-19 CT images of patients regarded as low-GGO.
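The four train/test scenarios and a non-overlapping fivefold split can be sketched as follows. This is an illustration only: the helper name `k_fold_indices`, the use of contiguous folds, and splitting by image index (rather than by patient) are assumptions, since the text does not specify how the folds were drawn.

```python
def k_fold_indices(n_items, k=5):
    """Split range(n_items) into k disjoint test folds; training is the complement."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_items) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds


COHORTS = ("CRO", "ITA")

# Seen: train and test cohort coincide. Unseen: they differ.
scenarios = [
    {"train": train, "test": test, "setting": "Seen" if train == test else "Unseen"}
    for train in COHORTS
    for test in COHORTS
]

folds = k_fold_indices(10, k=5)  # 10 stands in for the ~5000 images of one cohort
for scenario in scenarios:
    print(scenario)
```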
Fig. 10
Visual overlays (set 1) showing the AI (green) output against the GT (red) for Seen analysis
Fig. 11
Visual overlays (set 2) showing the AI (green) output against the GT (red) for Seen analysis
Fig. 12
Visual overlays (set 1) showing the AI (green) output against the GT (red) for Unseen analysis
Fig. 13
Visual overlays (set 2) showing the AI (green) output against the GT (red) for Unseen analysis
Performance Evaluation
This study presents (i) DS, (ii) JI, (iii) BA, (iv) CC plots, and (v) Figure of Merit (FoM) as part of the performance evaluation of the nine AI models under Seen and Unseen settings. The cumulative frequency distribution (CFD) plots for DS and JI are presented in Figs. 14, 15, 16 and 17 at a threshold cutoff of 80%. Figures 18 and 19 show the BA plots with mean and standard deviation (SD) lines for the lung area estimated by the AI models against the ground truth tracings. Similarly, CC plots with a cutoff of 80% are displayed in Figs. 20 and 21. We present a summary of the mean, SD, and percentage improvement of the AI models for DS, JI, and CC values in Tables 1, 2 and 3. Comparing HDL against SDL across the four scenarios (CRO-CRO, ITA-ITA, CRO-ITA, and ITA-CRO), the DS score is better by 1%, 1%, 3%, and 1%, the JI score is better by 3%, 3%, 5%, and 2%, and the CC is better by 2%, 1%, 1%, and 6%, thus supporting the hypothesis that, for COVID-19 lungs, the performance of HDL > SDL. With the exception of the SDL CC in the ITA-CRO scenario (0.12), the standard deviation across the AI models is at most 0.06, which is considered stable because the values are in the second decimal place.
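For reference, DS and JI on a pair of binary masks can be computed as in the sketch below. The masks are tiny made-up examples and the function name `dice_and_jaccard` is hypothetical; the formulas are the standard definitions, DS = 2|A∩B| / (|A| + |B|) and JI = |A∩B| / |A∪B|, which are assumed to match the ones used in the study.

```python
def dice_and_jaccard(predicted, truth):
    """Return (Dice, Jaccard) for two equally sized binary masks (nested lists of 0/1)."""
    intersection = pred_area = truth_area = 0
    for pred_row, truth_row in zip(predicted, truth):
        for p, t in zip(pred_row, truth_row):
            intersection += p & t
            pred_area += p
            truth_area += t
    union = pred_area + truth_area - intersection
    if union == 0:
        # Both masks are empty: treat this as perfect agreement.
        return 1.0, 1.0
    dice = 2 * intersection / (pred_area + truth_area)
    jaccard = intersection / union
    return dice, jaccard


# Hypothetical 3 x 4 masks: AI output and ground truth tracing.
ai_mask = [[0, 1, 1, 0],
           [0, 1, 1, 0],
           [0, 0, 1, 0]]
gt_mask = [[0, 1, 1, 0],
           [0, 1, 1, 1],
           [0, 0, 0, 0]]

ds, ji = dice_and_jaccard(ai_mask, gt_mask)
print(round(ds, 4), round(ji, 4))
```

The two scores are linked by JI = DS / (2 - DS), which is why JI is always the lower of the two for imperfect overlaps, as seen when comparing Tables 1 and 2.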
Fig. 14
Cumulative frequency plot for Dice using Seen analysis
Fig. 15
Cumulative frequency plot for Dice using Unseen analysis
Fig. 16
Cumulative frequency plot for Jaccard using Seen analysis
Fig. 17
Cumulative frequency plot for Jaccard using Unseen analysis
Fig. 18
BA plot for Seen analysis
Fig. 19
BA plot for Unseen analysis
Fig. 20
CC plot for Seen analysis
Fig. 21
CC plot for Unseen analysis
Table 1
Dice Similarity table for the nine AI models
Dice Similarity: Solo Deep Learning
Model | Seen-AI CRO-CRO | Seen-AI ITA-ITA | Unseen-AI CRO-ITA | Unseen-AI ITA-CRO
PSPNet | 0.93 | 0.93 | 0.93 | 0.88
SegNet | 0.95 | 0.96 | 0.89 | 0.91
UNet | 0.93 | 0.95 | 0.92 | 0.90
µ | 0.94 | 0.95 | 0.91 | 0.90
σ | 0.01 | 0.02 | 0.02 | 0.02
Dice Similarity: Hybrid Deep Learning
Model | Seen-AI CRO-CRO | Seen-AI ITA-ITA | Unseen-AI CRO-ITA | Unseen-AI ITA-CRO
VGG-PSPNet | 0.93 | 0.94 | 0.94 | 0.90
VGG-SegNet | 0.94 | 0.96 | 0.95 | 0.92
VGG-UNet | 0.96 | 0.95 | 0.93 | 0.84
ResNet-PSPNet | 0.95 | 0.95 | 0.95 | 0.91
ResNet-SegNet | 0.96 | 0.97 | 0.95 | 0.94
ResNet-UNet | 0.96 | 0.97 | 0.94 | 0.93
µ | 0.95 | 0.96 | 0.94 | 0.91
σ | 0.01 | 0.01 | 0.01 | 0.04
% Improvement | 1% | 1% | 3% | 1%
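The "% Improvement" row of Table 1 can be reproduced with a short calculation. The sketch below assumes that the improvement is the relative difference between the unrounded HDL and SDL mean scores, rounded to the nearest percent; the text does not state this formula, so it is an inference that happens to match the reported row. The scores are copied from Table 1.

```python
from statistics import mean

SCENARIOS = ("CRO-CRO", "ITA-ITA", "CRO-ITA", "ITA-CRO")

# Dice Similarity scores from Table 1, one tuple per model, in SCENARIOS order.
sdl_dice = {
    "PSPNet": (0.93, 0.93, 0.93, 0.88),
    "SegNet": (0.95, 0.96, 0.89, 0.91),
    "UNet": (0.93, 0.95, 0.92, 0.90),
}
hdl_dice = {
    "VGG-PSPNet": (0.93, 0.94, 0.94, 0.90),
    "VGG-SegNet": (0.94, 0.96, 0.95, 0.92),
    "VGG-UNet": (0.96, 0.95, 0.93, 0.84),
    "ResNet-PSPNet": (0.95, 0.95, 0.95, 0.91),
    "ResNet-SegNet": (0.96, 0.97, 0.95, 0.94),
    "ResNet-UNet": (0.96, 0.97, 0.94, 0.93),
}


def percent_improvement(sdl, hdl):
    """Relative gain of the HDL mean over the SDL mean, per scenario (assumed formula)."""
    result = {}
    for column, scenario in enumerate(SCENARIOS):
        sdl_mean = mean(scores[column] for scores in sdl.values())
        hdl_mean = mean(scores[column] for scores in hdl.values())
        result[scenario] = round(100 * (hdl_mean - sdl_mean) / sdl_mean)
    return result


improvement = percent_improvement(sdl_dice, hdl_dice)
print(improvement)
```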
Table 2
Jaccard Index table for the nine AI models
Jaccard Index: Solo Deep Learning
Model | Seen-AI CRO-CRO | Seen-AI ITA-ITA | Unseen-AI CRO-ITA | Unseen-AI ITA-CRO
PSPNet | 0.86 | 0.87 | 0.87 | 0.80
SegNet | 0.90 | 0.93 | 0.80 | 0.83
UNet | 0.87 | 0.92 | 0.87 | 0.83
µ | 0.88 | 0.91 | 0.85 | 0.82
σ | 0.02 | 0.03 | 0.04 | 0.02
Jaccard Index: Hybrid Deep Learning
Model | Seen-AI CRO-CRO | Seen-AI ITA-ITA | Unseen-AI CRO-ITA | Unseen-AI ITA-CRO
VGG-PSPNet | 0.85 | 0.95 | 0.90 | 0.81
VGG-SegNet | 0.89 | 0.93 | 0.85 | 0.86
VGG-UNet | 0.92 | 0.90 | 0.88 | 0.74
ResNet-PSPNet | 0.89 | 0.91 | 0.90 | 0.83
ResNet-SegNet | 0.93 | 0.94 | 0.91 | 0.88
ResNet-UNet | 0.93 | 0.95 | 0.89 | 0.88
µ | 0.90 | 0.93 | 0.89 | 0.83
σ | 0.03 | 0.02 | 0.02 | 0.05
% Improvement | 3% | 3% | 5% | 2%
Table 3
Correlation Coefficient (P < 0.0001) for the nine AI models
CC: Solo Deep Learning
Models | Seen-AI CRO-CRO | Seen-AI ITA-ITA | Unseen-AI CRO-ITA | Unseen-AI ITA-CRO
PSPNet | 0.98 | 0.98 | 0.97 | 0.77
SegNet | 0.99 | 0.99 | 0.98 | 0.97
UNet | 0.95 | 0.97 | 0.95 | 0.97
µ | 0.97 | 0.98 | 0.97 | 0.90
σ | 0.02 | 0.01 | 0.02 | 0.12
CC: Hybrid Deep Learning
Models | Seen-AI CRO-CRO | Seen-AI ITA-ITA | Unseen-AI CRO-ITA | Unseen-AI ITA-CRO
VGG-PSPNet | 0.99 | 0.98 | 0.98 | 0.92
VGG-SegNet | 0.98 | 0.99 | 0.96 | 0.98
VGG-UNet | 0.99 | 0.98 | 0.98 | 0.85
ResNet-PSPNet | 0.99 | 1.00 | 0.99 | 0.99
ResNet-SegNet | 0.99 | 1.00 | 0.98 | 0.99
ResNet-UNet | 0.99 | 1.00 | 0.97 | 0.99
µ | 0.99 | 0.99 | 0.98 | 0.95
σ | 0.00 | 0.01 | 0.01 | 0.06
% Improvement | 2% | 1% | 1% | 6%
Scientific Validation
The results from the MedSeg tool were compared against the gold standard tracings of the two datasets used in the study. Figure 22 shows the cumulative frequency plots of DS for the lungs segmented with the MedSeg tool on the Italian and Croatian datasets used in COVLIAS. Similarly, Figs. 23 and 24 show the JI and CC plots of the MedSeg results compared to the ground truth tracings of the two datasets, with ITA on the left and CRO on the right. The percentage difference between the DS, JI, and CC scores of the COVLIAS AI models and those of MedSeg is < 5%, thus supporting the applicability of the proposed AI models in the clinical domain. The mean and standard deviation of the lung area error are presented in Fig. 25 using BA plots, again with ITA on the left and CRO on the right. For the determination of the system’s error, Table 4 presents the Figure of Merit for the nine AI models for Seen and Unseen analysis. Finally, to demonstrate the reliability of the AI-based segmentation system COVLIAS, statistical tests, namely the Mann–Whitney, Paired t-Test, and Wilcoxon tests, are presented for Seen (Table 5) and Unseen (Table 6) analysis. MedCalc software (Ostend, Belgium) was used to carry out all the tests.
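The quantities behind the BA and CC plots can be sketched in a few lines. This is an illustration under stated assumptions: the lung areas are made-up numbers, the function names are hypothetical, the CC is taken to be Pearson's correlation coefficient, and the limits of agreement use the conventional mean ± 1.96 SD, which the text does not state explicitly.

```python
import math


def bland_altman(estimated_areas, reference_areas):
    """Return (bias, SD of differences, limits of agreement) for paired area estimates."""
    differences = [e - r for e, r in zip(estimated_areas, reference_areas)]
    n = len(differences)
    bias = sum(differences) / n
    sd = math.sqrt(sum((d - bias) ** 2 for d in differences) / (n - 1))
    return bias, sd, (bias - 1.96 * sd, bias + 1.96 * sd)


def pearson_cc(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical lung areas (arbitrary units): ground truth vs. segmentation output.
gt_areas = [100.0, 120.0, 140.0, 160.0]
seg_areas = [98.0, 121.0, 137.0, 162.0]

bias, sd, limits = bland_altman(seg_areas, gt_areas)
cc = pearson_cc(seg_areas, gt_areas)
print(round(bias, 3), round(sd, 3), round(cc, 4))
```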
Fig. 22
Cumulative frequency plot of DS for MedSeg for ITA (left) and CRO (right) data sets
Fig. 23
Cumulative frequency plot of JI for MedSeg for ITA data (left) and CRO data (right)
Fig. 24
CC plot for MedSeg vs. GT for ITA (left) and CRO (right)
Fig. 25
BA plot for MedSeg vs. GT for ITA (left) and CRO (right)
Table 4
The Figure of Merit for the nine AI models for Seen-AI vs. Unseen-AI
Models | Seen-AI CRO-CRO | Seen-AI ITA-ITA | Unseen-AI CRO-ITA | Unseen-AI ITA-CRO
PSPNet | 90.93 | 94.41 | 91.84 | 96.47
SegNet | 92.76 | 96.51 | 95.89 | 82.25
UNet | 87.51 | 94.21 | 99.12 | 80.85
VGG-PSPNet | 85.67 | 96.84 | 96.68 | 99.06
VGG-SegNet | 92.48 | 98.79 | 81.33 | 91.56
VGG-UNet | 98.74 | 91.63 | 88.60 | 72.49
ResNet-PSPNet | 95.19 | 95.86 | 93.21 | 82.96
ResNet-SegNet | 95.99 | 97.24 | 92.06 | 85.38
ResNet-UNet | 99.85 | 99.26 | 86.83 | 94.77
Table 5
Statistical tests for Seen-AI analysis on nine AI models
Models | CRO-CRO Paired t-Test | CRO-CRO Mann–Whitney | CRO-CRO Wilcoxon | ITA-ITA Paired t-Test | ITA-ITA Mann–Whitney | ITA-ITA Wilcoxon
PSPNet | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001
SegNet | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001
UNet | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001
VGG-PSPNet | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001
VGG-SegNet | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001
VGG-UNet | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001
ResNet-PSPNet | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001
ResNet-SegNet | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001
ResNet-UNet | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001
Table 6
Statistical tests for Unseen-AI analysis on nine AI models
Models | CRO-ITA Paired t-Test | CRO-ITA Mann–Whitney | CRO-ITA Wilcoxon | ITA-CRO Paired t-Test | ITA-CRO Mann–Whitney | ITA-CRO Wilcoxon
PSPNet | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001
SegNet | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001
UNet | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001
VGG-PSPNet | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001
VGG-SegNet | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001
VGG-UNet | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001
ResNet-PSPNet | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001
ResNet-SegNet | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001
ResNet-UNet | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001 | P < 0.0001
Discussion
The proposed study presented nine automated CT lung segmentation techniques in an AI framework using three SDL models, namely (i) PSPNet, (ii) SegNet, and (iii) UNet, and six HDL models, namely (iv) VGG-PSPNet, (v) VGG-SegNet, (vi) VGG-UNet, (vii) ResNet-PSPNet, (viii) ResNet-SegNet, and (ix) ResNet-UNet. To prove our hypothesis, we use automated HU adjustment with optimized values of (1600, -400) and train our AI models to predict on test data (Fig. 26). After HU adjustment, for DS, JI, and CC, the percentage improvement for Seen AI is 1%, 3%, and 6%, and for Unseen AI it is ~ 4%, ~ 5%, and 6%, respectively. We concluded that Unseen AI is possible using automated HU adjustment. Further, HDL was found to be superior to SDL (Tables 1, 2 and 3).
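A common way to apply such an HU adjustment is intensity windowing, sketched below. Interpreting the pair (1600, -400) as a window width of 1600 HU and a window level of -400 HU is an assumption, as are the function name `apply_hu_window` and the rescaling to [0, 1]; the text gives only the two values and not the exact procedure.

```python
def apply_hu_window(hu_values, width=1600.0, level=-400.0):
    """Clip Hounsfield Units to [level - width/2, level + width/2] and rescale to [0, 1]."""
    low = level - width / 2.0
    high = level + width / 2.0
    return [(min(max(value, low), high) - low) / width for value in hu_values]


# Hypothetical HU samples spanning values below, inside, and above the window.
sample_hu = [-2000.0, -1200.0, -400.0, 400.0, 1000.0]
normalized = apply_hu_window(sample_hu)
print(normalized)
```

With these assumed settings the window covers -1200 HU to 400 HU, so scans from different scanners are mapped to the same intensity range before training and prediction.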
Fig. 26
Overlay of segmentation results from the ResNet-SegNet model trained without adjusting the HU level (red) and after adjusting the HU level (green). The white arrow represents the under-estimated region and the red arrows represent the same region estimated accurately by the ResNet-SegNet model
Comparison and Contrast of the Nine AI Models
The proposed study uses a total of nine AI architectures with three SDL models (PSPNet, SegNet and UNet) and six HDL models (VGG-PSPNet, VGG-SegNet, VGG-UNet, ResNet-PSPNet, ResNet-SegNet, and ResNet-UNet). ResNet-PSPNet was the AI model with both the highest number of NN layers and the largest model size. The training of all the AI models was implemented on an NVIDIA DGX V100 using Python [97], with multiple GPUs used to reduce the training time (Table 7 and Fig. 27).
Table 7
Nine AI architectures and their comparison
SN | Attributes | PSPNet (SDL) | SegNet (SDL) | UNet (SDL) | VGG-PSPNet (HDL) | VGG-SegNet (HDL) | VGG-UNet (HDL) | ResNet-PSPNet (HDL) | ResNet-SegNet (HDL) | ResNet-UNet (HDL)
1 | Backbone | NA | VGG-19 | VGG-19 | VGG-16 | VGG-16 | VGG-16 | Res-50 | Res-50 | Res-50
2 | Loss Function | CE | CE | CE | CE | CE | CE | CE | CE | CE
3 | # Parameters | ~ 4.4 M | ~ 3.8 M | ~ 4.6 M | ~ 18.2 M | ~ 11.6 M | ~ 12.4 M | ~ 31 M | ~ 15 M | ~ 16.5 M
4 | # NN Layers | 54 | 39 | 42 | 47 | 33 | 36 | 202 | 160 | 165
5 | Size (MB) | 50 | 43 | 52 | 209 | 133 | 142 | 355 | 171 | 188
6 | # Epoch | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50
7 | Batch Size | 8 | 8 | 8 | 2 | 4 | 4 | 2 | 4 | 4
8 | Training Time* | ~ 17 | ~ 15 | ~ 16 | ~ 60 | ~ 50 | ~ 50 | ~ 70 | ~ 60 | ~ 60
9 | Prediction Time | < 2 s | < 2 s | < 2 s | < 2 s | < 2 s | < 2 s | < 2 s | < 2 s | < 2 s
MB MegaBytes, M Million, NN Neural Network
*in minutes
Fig. 27
Left: Number of NN layers. Right: Size of the final AI models used in COVLIAS 1.0
Benchmarking
Table 8 shows the benchmarking table using CT imaging. Our proposed study (row #7) took 10,000 CT scans of 152 patients and implemented 9 different models that consisted of three SDL, namely, PSPNet, SegNet, UNet, and six HDL models, namely, VGG-PSPNet, VGG-SegNet, VGG-UNet, ResNet-PSPNet, ResNet-SegNet, ResNet-UNet. The four scenarios (CRO-CRO, ITA-ITA, CRO-ITA, and ITA-CRO) correspond to SDL and HDL.
Table 8
Benchmarking table
- | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | C11 | C12 | C13
R# | Author | # Patients | # Images | Image Dim | #M | Model Types | Solo vs. HDL | Dim | AE | DS | JI | BA | ACC
R1 | Paluru et al. [80] | 69 | ~ 4339 | 512² | 1 | AnamNet | Solo | 2D | ✖ | ✔ | ✖ | ✖ | ✔
R2 | Saood and Hatem [85] | - | ~ 100 | 256² | 2 | UNet, SegNet | Solo | 2D | ✖ | ✔ | ✖ | ✖ | ✔
R3 | Cai et al. [86] | 99 | ~ 250 | - | 1 | UNet | Solo | 2D | ✖ | ✔ | ✔ | ✖ | ✖
R4 | Suri et al. [40] | 72 | ~ 5000 | 768² | 4 | NIH, SegNet, VGG-SegNet, ResNet-SegNet | Both | 2D | ✔ | ✔ | ✔ | ✔ | ✔
R5 | Suri et al. [39] | 72 | ~ 5000 | 768² | 3 | PSPNet, VGG-SegNet, ResNet-SegNet | Both | 2D | ✔ | ✔ | ✔ | ✔ | ✔
R6 | Suri et al. [38] | 79 | ~ 5500 | 768² | 2 | VGG-SegNet, ResNet-SegNet | HDL | 2D | ✔ | ✔ | ✔ | ✔ | ✔
R7 | Suri et al. (Proposed) | 152 | > 10,000 | 512² | 9 | PSPNet, SegNet, UNet, VGG-PSPNet, VGG-SegNet, VGG-UNet, ResNet-PSPNet, ResNet-SegNet, ResNet-UNet | Both | 2D | ✔ | ✔ | ✔ | ✔ | ✔
# number, HDL Hybrid Deep Learning, AE Area Error, DS Dice Similarity, JI Jaccard Index, BA Bland–Altman, ACC Accuracy, Dim Dimension (2D vs. 3D), R# Row number, #M number of AI models
A Special Note on Tissue Characterization
Lung segmentation can be considered a tissue characterization (TC) process. TC has previously been attempted using ML, for example in plaque TC [66, 98], lung TC [99], coronary artery disease characterization [100], and liver TC [101], as well as in cancer applications such as skin cancer [102] and ovarian cancer [103]. More advanced TC can be performed using hybrid models such as [24, 36, 51].
Strength, Weakness, and Extensions
The proposed study, COVLIAS 1.0-Unseen, proves our two hypotheses: (i) contrast adjustment is vital for AI, and (ii) HDL is superior to SDL, using nine models and ~5,000 CT scans from each of the two centres. The system was validated against MedSeg and tested for reliability and stability. It can also be noted that when training an AI model on COVID-19 infected lungs, the HU levels must be adjusted to obtain accurate segmentation results. Although we used HU adjustment, the study can be extended in several directions: (i) further preprocessing by adjusting the contrast, removing noise, and adjusting the window level [104]; (ii) multimodality cross-validation, for example with ultrasound [105]; (iii) integration of more advanced image processing tools, such as level sets [106], stochastic segmentation [107], and computer-aided diagnostic tools [108, 109], with the AI models for lung segmentation; (iv) evaluation of bias in the AI models using AP(ai)Bias (AtheroPoint, Roseville, CA, USA) and other competing models [42], since recent studies have begun to compute the bias in AI; and (v) CVD assessment of patients during CT imaging [110].
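As an illustration of the window-level adjustment mentioned above, the following is a minimal sketch of HU windowing: intensities are clipped to a window and rescaled to [0, 1]. The default centre of -600 HU and width of 1500 HU are a commonly used lung display window and are an assumption here, not the settings used in COVLIAS. The function name `apply_hu_window` is hypothetical.

```python
def apply_hu_window(hu_values, center=-600.0, width=1500.0):
    """Clip HU values to a window and rescale them linearly to [0, 1].

    Assumption: center/width default to a typical lung display window;
    the study's own HU adjustment parameters are not given in this section.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    low = center - width / 2.0   # lower edge of the window
    high = center + width / 2.0  # upper edge of the window
    # Clip each value into [low, high], then map that range onto [0, 1].
    return [(min(max(v, low), high) - low) / width for v in hu_values]


if __name__ == "__main__":
    sample = [-2000, -1350, -600, 150, 1000]  # illustrative HU values
    print(apply_hu_window(sample))
```

Values below the window map to 0, values above it map to 1, and the window centre maps to 0.5, so the available intensity range is spent on the tissue of interest.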
Conclusions
The proposed research compared three SDL models (PSPNet, SegNet, and UNet) and six HDL models (VGG-PSPNet, VGG-SegNet, VGG-UNet, ResNet-PSPNet, ResNet-SegNet, and ResNet-UNet) for CT lung segmentation and benchmarked all nine against MedSeg. The multicentre CT data were collected from hospitals in Italy (ITA), with low GGO, and Croatia (CRO), with high GGO, each contributing ~ 5000 COVID-19 images. These CT images were annotated by two trained, blinded senior radiologists, thus creating an inter-variable multicentre dataset. To prove our hypothesis, we used an automated Hounsfield Units (HU) adjustment methodology to train the AI models, leading to four combinations: two Unseen sets (train-CRO:test-ITA and train-ITA:test-CRO) and two Seen sets (train-CRO:test-CRO and train-ITA:test-ITA). To keep the test set unique for each fold, we adopted a five-fold cross-validation technique. Five types of performance metrics were used, namely, (i) DS, (ii) JI, (iii) BA plots, (iv) CC plots, and (v) Figure-of-Merit. For DS and JI, HDL (Unseen AI) > SDL (Unseen AI) by 4% and 5%, respectively. For CC, HDL (Unseen AI) > SDL (Unseen AI) by 6%. The COVLIAS-MedSeg difference was < 5%, thus proving the hypothesis and making the system suitable for clinical settings. Statistical tests such as the paired t-test, Mann–Whitney, and Wilcoxon were used to demonstrate the stability and reliability of the AI system.
Electronic supplementary material: Supplementary file1 (DOCX 255 KB)
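For reference, the two overlap metrics reported above, DS and JI, can be computed from a predicted and a ground-truth binary mask as sketched below. This is a minimal pure-Python illustration using the standard definitions DS = 2|A∩B| / (|A| + |B|) and JI = |A∩B| / |A∪B|. The function name `dice_and_jaccard` and the convention of returning 1.0 for two empty masks are assumptions and are not taken from the COVLIAS implementation.

```python
def _flatten(mask):
    """Flatten a 2D mask (list of rows) into a list of 0/1 integers."""
    return [int(bool(pixel)) for row in mask for pixel in row]


def dice_and_jaccard(pred, truth):
    """Return (Dice similarity, Jaccard index) for two binary 2D masks."""
    p, t = _flatten(pred), _flatten(truth)
    if len(p) != len(t):
        raise ValueError("masks must contain the same number of pixels")
    intersection = sum(a & b for a, b in zip(p, t))
    size_p, size_t = sum(p), sum(t)
    union = size_p + size_t - intersection
    if union == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0, 1.0
    dice = 2.0 * intersection / (size_p + size_t)
    jaccard = intersection / union
    return dice, jaccard


if __name__ == "__main__":
    predicted = [[0, 1, 1],
                 [0, 1, 0]]
    reference = [[0, 1, 0],
                 [0, 1, 1]]
    ds, ji = dice_and_jaccard(predicted, reference)
    print(f"DS = {ds:.3f}, JI = {ji:.3f}")
```

The two metrics are monotonically related through JI = DS / (2 - DS), so a model that ranks higher on one also ranks higher on the other for a single image pair; averaged over a dataset, the two can still differ slightly in ranking.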