SARS-CoV-2 has infected ∼165 million people worldwide, causing Acute Respiratory Distress Syndrome (ARDS), and has killed ∼3.4 million people. Artificial Intelligence (AI) has been shown to benefit biomedical imaging, such as X-ray and Computed Tomography, in the diagnosis of ARDS, but there are few AI-based systematic reviews (aiSR). The purpose of this study is to understand the Risk-of-Bias (RoB) in non-randomized AI trials for handling ARDS using the novel AtheroPoint-AI-Bias (AP(ai)Bias) paradigm. Our hypothesis is that a study must have a mean score of at least 80% to be accepted as low RoB. Using the PRISMA model, the 42 best AI studies were analyzed to understand the RoB. Using the AP(ai)Bias paradigm, the top 19 studies were then chosen using a raw cutoff of 1.9, obtained from the intersection of the cumulative plot of "mean score vs. study" and the score distribution. Finally, these studies were benchmarked against the ROBINS-I and PROBAST paradigms. We observed that AP(ai)Bias, ROBINS-I, and PROBAST placed only 32%, 16%, and 26% of the studies, respectively, in low-moderate RoB (cutoff>2.5); however, none of them met the RoB hypothesis. Further, the aiSR analysis offers six primary and six secondary recommendations for non-randomized AI for ARDS. The primary recommendations for improving AI-based ARDS design are the inclusion of (i) comorbidity, (ii) inter- and intra-observer variability studies, (iii) large data size, (iv) clinical validation, (v) granularity of COVID-19 risk, and (vi) cross-modality scientific validation. AI is an important component in the diagnosis of ARDS, and the recommendations must be followed to lower the RoB.
COVID-19, or coronavirus disease, was declared a “public health emergency of international concern” or “pandemic” by the International Health Regulations Emergency Committee of the World Health Organization (WHO) on January 30, 2020. As of 20th May 2021, the WHO statistics showed that more than 165 million people had been infected, causing Acute Respiratory Distress Syndrome (ARDS), and nearly 3.4 million had lost their lives due to this virus [1]. There is a dire necessity to flatten the pandemic curve and prevent this severe illness during the “long-COVID-19” (beyond the COVID-19 era). The SARS-CoV-2 virus directly affects the human lungs and travels through the respiratory system into the body [2]. However, the mutated ribonucleic acid (RNA) present in the virus makes it difficult to treat the infected patient. As per the Journal of the American College of Cardiology (JACC), cardiac troponin may help determine the risk of myocarditis [3], signaling a positive COVID-19 diagnosis [4]. Imaging, therefore, also plays a vital role in predicting and validating the severity of the infection [5]; however, the patient's vital clinical information further improves the ability to predict the severity and lowers the mortality rate [6].
Artificial Intelligence (AI) has been helpful in combating such diseases because of its ability to model extensive and non-linear covariates against COVID-19 deaths in a big data framework. In previous pandemics, models were developed to flatten their mortality curves. For example, the Zika epidemic [7], Influenza type A, the H1N1 pandemic [8], and the Chikungunya epidemic [9] showed a correlation between their data streaming using telemedicine and the pandemic curve. More recently, in China, similar telemedicine models have been adopted [10].
Therefore, we firmly believe that predicting the severity of COVID-19 using AI through computational models will be significantly useful to address the lack of software “verification, scientific and clinical validation” (discussed in section VII.B) capabilities worldwide. Note that AI-based COVID-19 severity is determined by either (i) classifying the COVID-19 pneumonia patient scans against controls or other kinds of pneumonia or (ii) locating the diseased region in the lung scan(s). Ground Glass Opacities (GGO) can be used to validate the COVID-19 severity. This study is focused on (a) ARDS, the lung gas exchange disorder caused by SARS-CoV-2, and (b) imaging of infected lungs using Computed Tomography (CT) and Chest X-rays (CXR).
In 2020, there were six AI-based systematic reviews (aiSR) on ARDS [11]–[16]. However, they are incomplete, not well focused, and lack practical recommendations for safe and effective AI design for ARDS analysis. A detailed comparison among the six aiSR is given in the benchmarking section VII.A. In general, there are few aiSR that rank these studies, compute their mean scores, and determine the AI studies with low RoB. Further, this aiSR establishes a link between AtheroPoint's artificial intelligence-based Bias (AP(ai)Bias) and previous RoB paradigms such as the Risk of Bias in Non-randomized Studies of Interventions (ROBINS-I) or the Prediction model Risk Of Bias ASsessment Tool (PROBAST) for handling ARDS via CT or CXR.
This aiSR uses the Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA) model for study selection. These studies are then analyzed to understand the role of AI in the detection of COVID-19 severity in radiological images and to evaluate the RoB using AI attributes. It then presents the pathophysiology of COVID-19 leading to ARDS [17], followed by the different AI techniques used in lung segmentation and classification of the disease severity.
Furthermore, to help understand the non-randomized trials’ outcome, we have proposed AP(ai)Bias, a novel bias estimation, that rates each of the ten AI attributes, computes cumulative and mean scores, and ranks the selected studies. Further, AP(ai)Bias is compared against qualitative paradigms such as ROBINS-I and PROBAST. Lastly, the aiSR presents several significant recommendations for lowering the RoB in the AI design for ARDS.
Methodology
Search Strategy
A detailed search was performed using PubMed, IEEE Xplore, ScienceDirect, ArXiv and Google Scholar. The keywords used for selecting studies were COVID-19, ARDS, deep learning, lung segmentation, classification, lung CT, X-ray, and AI. Fig. 1 shows the PRISMA model consisting of the studies used in this review. A total of 339 studies were identified, and duplicates were removed using the feature called “Find Duplicates” in EndNote software by Clarivate Analytics [18], thus, retaining 304 records. The three exclusion criteria were (i) studies not related to AI, (ii) non-relevant articles, and (iii) having insufficient data. This excluded 55, 56, and 104 studies (marked as E1, E2 (non-AI, but COVID-19), and E3 in Fig. 1), leading to the final selection of 89 studies.
Fig. 1.
Search strategy using the PRISMA model.
Hypothesis and the Acceptability Criteria
We hypothesize that “non-randomized Artificial Intelligence-based attributes can (a) detect, (b) classify, (c) estimate severity of the COVID-19 risk, and (d) meet the performance standards in lung infected ARDS patients.” There were three acceptability criteria. (i) For the AP(ai)Bias-based ranking method, the mean score must be greater than or equal to 80% for an AI-based study while taking into consideration all the AI attributes [19]. This threshold was set by consensus of the experienced team and reflects the five classes assigned to each AI attribute based on its strength (low, moderate, high-moderate, low-of-a-high, and high-of-a-high, ranging from 1 to 5). Similarly, for the (ii) ROBINS-I and (iii) PROBAST paradigms, an AI-based study must score 80% or above to be in the low-RoB zone [19], [20].
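As a worked reading of criterion (i), and assuming the 80% threshold is taken relative to the five-point maximum per attribute over the ten AI attributes (an interpretation; the symbols $s_i$ and $\bar{s}$ are introduced here only for illustration), the criterion could be written as:

```latex
\bar{s} = \frac{1}{10}\sum_{i=1}^{10} s_i, \qquad s_i \le 5, \qquad
\frac{\bar{s}}{5} \ge 0.80
\;\Longleftrightarrow\; \bar{s} \ge 4.0
\;\Longleftrightarrow\; \sum_{i=1}^{10} s_i \ge 40 .
```

Under this reading, a study would need a cumulative score of at least 40 out of 50 to be accepted as low RoB.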
Pathophysiology of ARDS
The number of publications and cases discussing ARDS caused by SARS-CoV-2 has been increasing over time [21]. The first set of cases in Wuhan, China, involved COVID-19 patients hospitalized for lower respiratory tract (LRT) complaints [22]. These reports also indicated that the symptoms of COVID-19 are incredibly diverse, ranging from minimal LRT symptoms to significant hypoxia due to ARDS [22]. Further, Huang et al. [23] reported that the time gap between the onset of minimal LRT symptoms and ARDS was as short as nine days, suggesting that minimal LRT symptoms could progress rapidly in these patients. Recent studies support that in COVID-19 patients, ARDS has higher rates than extrapulmonary complications [24]–[26]. The pathophysiology of ARDS in COVID-19 patients begins with SARS-CoV-2 entering the lung by aerosol transmission [27], as outlined in Fig. 2 (note that numbers 1-12 correspond to labels D1-D12). The attachment of SARS-CoV-2 to the host cells occurs via the anchoring of its virion spike proteins (S1 and S2) to the angiotensin-converting enzyme 2 (ACE2) receptor on the surface of the type 2 pneumocytes on the pulmonary alveolar epithelium (D1). This causes respiratory symptoms to appear as the earliest clinical presentation of COVID-19 [28].
Fig. 2.
Stages of ARDS (Courtesy of AtheroPoint, CA, USA).
Due to infection, the inflammatory process begins and leads to the production of inflammatory mediators [29], [30] (D2). Moreover, these inflammatory mediators stimulate alveolar macrophages to produce polymorphonuclear neutrophils (PMNs) and cytokines such as IL-1, IL-6, and TNF-α (D3, D4a, and D4b). Additionally, cytokine hyperproduction causes a cytokine storm. The sequence of steps in the systemic inflammatory response, cytokine storm, and multiple organ failure plays a critical role in ARDS development [31].
The same process of cytokine storm and ARDS development was also observed with previous coronaviruses [32]. The produced PMNs recruit platelets and form neutrophil extracellular traps (NETs) that cause endothelial dysfunction [33]. NETs are web-like structures of deoxyribonucleic acid (DNA) and proteins created after PMN activation and infection. The key enzymes that help develop NETs are neutrophil elastase, type 4 peptidyl arginine deiminase, and gasdermin-D [34]. Although NETs have a beneficial role in the host defense against pathogens, they also have a detrimental role: they facilitate microthrombosis, resulting in permanent organ damage to the lung, heart, and kidney [35]. Additionally, the cytokine storm causes endothelial dysfunction and creates gaps between cells, increasing vascular permeability [22], [36] (D5, D7, and D8). Increased vascular permeability causes fluid leakage, resulting in increased alveolar space and diffuse inflammatory alveolar exudate, causing alveolar edema [37]–[39] (D8 and D9), typically seen in CT lung scans. Alveolar edema then leads to increased alveolar surface tension, a critical feature of diffuse alveolar atelectasis (i.e., damage or collapse) [40], [41] (D10 and D11). Following the alveolar collapse and ventilation-to-perfusion mismatch, ARDS occurs, resulting in an impairment of carbon dioxide excretion due to the increased alveolar dead space [42] (D12).
Artificial Intelligence Building Blocks for ARDS: Lung Segmentation and Classification
For a comprehensive RoB analysis, it is customary to investigate the basic building blocks of the ARDS pipeline. As shown in Fig. 3, the two major components are lung segmentation and COVID-19 severity classification. We will briefly study AI models’ statistical distribution and AI architectures for these two components.
Fig. 3.
AI-based ARDS pipeline in lung CT for COVID-19 severity prediction (Courtesy of AtheroPoint, Roseville, CA, USA).
Statistical Distribution of AI Models
Even though the PRISMA model selected 89 studies, only 42 were AI-based studies [43]–[84]. AI-based image classification occurred in 85% [43]–[50], [52], [53], [55], [56], [58-62], [64-70], [73-80], [83] of the selected studies, while lung segmentation with and without classification accounted for 3% [51] and 12% [57], [63], [71], [72], [84], respectively (Fig. 4(a)). In terms of risk granularity, the binary classification (BC) paradigm accounted for 46%, the multiclass (MC) paradigm for 39%, and the hybrid paradigm (combination of segmentation and classification) for 15% (Fig. 4(b)). Most of the studies used 2-D (82%) [43-52], [54], [57-62], [64-69], [71-80], while others used 3-D (12%) [53], [63], [70], [83], [84], and 6% [55] used both 2-D and 3-D (Fig. 5(a)). The total images used were 51% in 2-D, only 1% in 3-D, and 48% in both (Fig. 5(b)).
Fig. 4.
(a) #studies for lung segmentation vs. classification; (b) #studies for BC and MC frameworks.
Fig. 5.
(a) #studies in 2-D vs. 3-D; (b) #images in 2-D vs. 3-D.
The landscape of AI models for image classification consisted mainly of machine learning (ML), deep learning (DL), transfer learning (TL), recurrent learning (RL) [4], and an amalgamation of these learning paradigms [85] (Fig. 6). The granular division of the AI-based classification paradigm consisted of DL (50%) [43], [46], [47], [49], [50], [52], [53], [55], [56], [59-61], [64-66], [69], [78-80], [83], DL and TL (12%) [67], [68], [75-77], TL (10%) [44], [46], [58], [74], DL and ML (5%) [48], [73], ML (5%) [45], [62], and RL combined with DL (3%) [70]. Note that only six studies (15%) [51], [57], [63], [71], [72], [84] were focused on lung segmentation.
Fig. 6.
Four kinds of AI models (with and without segmentation) in the selected 42 AI-based studies.
Lung Segmentation Architectures for COVID-19 ARDS
Lung segmentation is the process of extracting the lung region in CXR projections or in CT slices [86]. In the 2-D paradigm, (i) Rajaraman et al. [72] implemented a UNet architecture with Gaussian dropout in CXR (Fig. 7). As the shape resembles the letter U, the contracting path (also known as the encoder) is the left arm of the U and captures the context of the image using traditional convolutions and max-pooling layers. The expanding path (also known as the decoder) is the right arm of the U and enables precise localization using transposed convolutions. The depth of the UNet is the number of layers in the UNet architecture and governs its performance, with a tradeoff between accuracy and computational cost [72]. (ii) Oh et al. [57] implemented a Fully Convolutional (FC)-DenseNet103 transfer learning model (architecture shown in Fig. 8 and results shown in Figure A(left), see Supplemental D), which offered the advantage of saving time by using pre-trained weights. In the 3-D paradigm, Wang et al. [63] implemented DenseNet121-FPN (Figure A(right), see Supplemental D), the only study that performed 3-D lung segmentation using CT volumes (Fig. 9). In FC-DenseNet models, 2-D or 3-D, the feature maps created by the preceding dense block are up-sampled to prevent a large number of computations and parameters. Note that all the above architectures adopted skip connections between the down-sampling and up-sampling paths for transferring feature maps. 2-D CXR is preferred over 3-D CT volume imaging due to lower costs. The in-depth comparison between 2-D and 3-D segmentation is shown in Table I.
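The encoder-decoder shape described above can be sketched with simple size bookkeeping. The following is a minimal illustration, not the configuration used in [72]: it assumes size-preserving convolutions, 2x2 max-pooling in the contracting path, stride-2 transposed convolutions in the expanding path, and a hypothetical base width of 64 channels. The function name `unet_feature_sizes` is introduced here only for illustration.

```python
def unet_feature_sizes(input_size, depth, base_channels=64):
    """Trace (stage, spatial size, channels) through a U-shaped network.

    Assumptions (illustrative, not taken from [72]): convolutions keep the
    spatial size, each encoder level ends with 2x2 max-pooling, and each
    decoder level starts with a stride-2 transposed convolution.
    """
    if input_size % (2 ** depth) != 0:
        raise ValueError("input_size must be divisible by 2**depth")

    stages = []
    size, channels = input_size, base_channels
    # Contracting path (left arm of the U): resolution halves, channels double.
    for level in range(depth):
        stages.append((f"enc{level}", size, channels))
        size //= 2
        channels *= 2
    stages.append(("bottleneck", size, channels))
    # Expanding path (right arm of the U): resolution doubles, channels halve.
    # A skip connection would concatenate the encoder map of the same size here.
    for level in reversed(range(depth)):
        size *= 2
        channels //= 2
        stages.append((f"dec{level}", size, channels))
    return stages


stages = unet_feature_sizes(input_size=256, depth=4)
for name, size, channels in stages:
    print(f"{name:>10}: {size}x{size}, {channels} channels")
```

Each decoder stage ends at the same resolution as the encoder stage with the same index, which is what allows the skip connection to pair them; a deeper network adds more such pairs at the cost of more computation.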
Fig. 7.
UNet architecture for lung segmentation (Courtesy of AtheroPoint™, Roseville, CA, USA).
Fig. 8.
Lung segmentation using DenseNet [57] using CXR (Reproduced with permission).
Fig. 9.
(a) 3-D segmentation of the CT lung images and (b) heat map using a DL system [63] (Reproduced with permission).
TABLE I
2-D vs. 3-D Lung Segmentation in COVID-19 ARDS
| SN | Attributes | 2-D ARDS Segmentation | 3-D ARDS Segmentation |
|---|---|---|---|
| 1 | Acquisition | 2-D CXR and 2-D CT slices | 3-D Volume |
| 2 | Modality | CXR and CT slice | 3-D CT or PET |
| 3 | Frame Processing | Single frame processing at a time | Spatially oriented joint processing |
| 4 | Statistical Processing | Pixel processing | Voxel processing |
| 5 | Segmented Coverage | Limited | Complete |
| 6 | Architecture Type | FC-DenseNet103 [57], UNet [72] | DenseNet121-FPN [63] |
| 7 | Arch. Complexity | Less complex (fewer #layers) | More complex (more #layers) |
| 8 | Parameters Adapted | ∼11.6 million | ∼8 million |
| 9 | Convolution Filters | 2-D Convolution filter | 3-D Convolution filter |
| 10 | Processing Speed | Faster, due to low volume | Slower, due to the large volume |
| 11 | Memory Size | Consumes low memory | Consumes high memory |
Lung Classification Architectures for COVID-19 ARDS
The pipeline for ARDS diagnosis consists of the classification of lung scans based on COVID-19 risk severity. An AI model using ground truth with two classes leads to the binary class (BC) framework, while models using multiple ground truths yield the multiclass (MC) framework. Studies that adopted both segmentation and classification (using either BC or MC) were categorized as hybrid class (HC). The statistical distribution between the three types of classes was 46%, 39%, and 15%, respectively. Most of the studies that conducted BC used CXR [44], [75], [76], [87], while MC studies (such as DenseNet [52] and truncated Inception Net for CXR [46], COVNet (a modified Resnet-50 architecture) [83] and COVIDNet-CT [79] for CT, and VGG-16 for Ultrasound [74]) adopted transfer learning (TL) architectures with cross-validation paradigms ranging from 5-fold to 10-fold (Figure B, see Supplemental D). The best accuracies of 96% and 99% in BC and MC were obtained by Brunese et al. [44] and Gunraj et al. [79], respectively. A unique approach was seen in Ozturk et al. [59], who validated the COVID-19 output by superimposing colored heat maps on grayscale CXR using GRAD-CAM [59]. In HC, three studies used 2-D CXR [57], [59], [72], while two studies used 3-D CT [53], [63]. Wu et al. [64] implemented a new HC multi-view fusion CT technique using the modified Resnet-50 on a patient size of 495, demonstrating an accuracy of 83.33% and AUC∼0.905 (p<0.001). This was 6% better than the single-view paradigm. Oh et al. [57] also showed that better classification performance could be achieved when training FC-DenseNet103 on the segmented infected lung region in 2-D CXR. It is, however, worthwhile to explore the factors that affect the performance of AI models, which we discuss next.
Performance Evaluation of AI Techniques
Image Quality Assessment: CT vs. CXR
Image quality plays an important role in computer-aided diagnosis (CAD) performance [86], [88], [89]. The CT image quality can be fuzzy or degraded due to the radiation dosage [90]. Further, with high COVID-19 severity, there is fluid leakage or alveolar edema, which causes lung scans to show a hyperintensity distribution (due to high consolidation and ground-glass opacity). Digital CXR is preferred over conventional CXR due to its high-resolution imaging [92]. Several methods were designed to assess the signal-to-noise ratio (SNR) and contrast-to-noise ratio (CNR) in CAD-based static and motion imagery [93]. Image registration is essential during the image quality assessment [94], [95]. However, the effect of image quality assessment on AI for COVID-19 has not been well researched. The AI studies used in this aiSR did not report how AI performance changes with image degradation. The lack of image quality testing, or of an analysis of its effect on AI performance, will lower the overall score estimation during the comprehensive evaluation.
AI-Based Performance Parameters
Comparison of the AI models in the studies was based on accuracy, sensitivity (SEN), specificity (SPE), F1-score, precision, and recall metrics. Missing data were calculated from each study's existing data, such as SEN, SPE, patient size, and sometimes the true positive rate. The mean accuracy of the AI models extracted was 93.05±5.7%. The best accuracy was 98% [59], while the sensitivity and specificity had a mean of 92.01±10.25% and 89.80±15.42%, respectively. All the studies combined achieved a mean F1-score and precision of 93.72±6.36% and 93.51±4.45%, respectively. Because these performance metrics show high values, these attributes will contribute strongly to the overall score when computing AP(ai)Bias. Note that the accuracies computed above considered the augmentation technique used during the data preparation process (also shown in recent studies [96-98]).
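The relationship between these metrics, and the way a missing metric can be recovered from reported ones, can be sketched as follows. This is a minimal illustration using the standard confusion-matrix definitions; the helper names and the example numbers (sensitivity 0.90, specificity 0.80, 100 positive and 50 negative patients) are hypothetical and not drawn from any reviewed study.

```python
def counts_from_rates(sensitivity, specificity, n_positive, n_negative):
    """Approximate confusion-matrix counts (tp, fp, tn, fn) from reported rates."""
    tp = round(sensitivity * n_positive)
    tn = round(specificity * n_negative)
    return tp, n_negative - tn, tn, n_positive - tp


def classification_metrics(tp, fp, tn, fn):
    """Standard metrics; note that sensitivity and recall are the same quantity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical study reporting only SEN, SPE, and cohort sizes.
counts = counts_from_rates(0.90, 0.80, n_positive=100, n_negative=50)
metrics = classification_metrics(*counts)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Rounding to whole patients means the recovered counts are approximations, so metrics derived this way should be treated as estimates rather than reported values.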
Three Paradigms for the Risk-of-Bias Estimation
Our novel ranking strategy, AP(ai)Bias, helps to identify AI-based studies that are comprehensive, complete, error-free, safe, and effective. In contrast, ROBINS-I [99] and PROBAST [100] help to identify the factors that contribute to RoB. This section is therefore focused on three kinds of paradigms for RoB in non-randomized AI for ARDS.
AP(ai)Bias: Ranking Paradigm for RoB
We considered 10 attributes for AI evaluation in the AP(ai)Bias model: (i) segmentation and classification, (ii) cross-validation (CV) protocol, (iii) inter- & intra-observer variability (IIV), (iv) benchmarking, (v) scientific validation (SV) and clinical evaluation (CE), (vi) learning paradigm (LP), (vii) data preparation (DP), (viii) data size, (ix) performance evaluation (PE), and (x) innovation. Note that the DP attribute was further subdivided into (vii.a) augmentation protocol [96], [97], [101], (vii.b) region-of-volume selection, given the CT volume, (vii.c) format conversion declaration, (vii.d) manual tracings for ground truth or gold standard to generate binary shapes, and (vii.e) baseline characteristics demonstration. Each attribute (i-x) can earn up to a maximum of five points, as presented in the section “Hypothesis and the Acceptability Criteria”. The value obtained for each attribute is pro-rated based on the threshold adopted for that attribute. The values of all attributes are then added, giving a cumulative score. Finally, the 42 scores (corresponding to 42 studies) were ranked in decreasing order (low-bias to high-bias). The raw cutoff (1.9) for the selection of AI studies was determined from the intersection of the “cumulative plot of mean score vs. studies” and the “descending order of distribution of the scores of the studies” (Fig. 10). Table V shows the AP(ai)Bias model-based ranking of the best 19 AI-based studies taken from a pool of 42 studies. Colors were assigned according to rank: green for the high-rank group (low-bias), yellow for mid-rank (moderate-bias), and red for low-rank (high-bias). Note that the maximum cumulative score a study can obtain is 50 (10 attributes multiplied by a maximum of 5 points for each attribute). Using the above strategy and applying all 10 attributes to the 19 studies, the top three contenders were Wu et al. [64], Alberto et al. [71], and Ouchicha et al. [58].
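The scoring and selection steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the three studies and their attribute scores are invented, the function name `rank_studies` is hypothetical, and only the raw cutoff of 1.9 on the mean score is taken from the text.

```python
from statistics import mean
from typing import Dict, List, Tuple

RAW_CUTOFF = 1.9  # raw cutoff on the mean score, as stated in the text


def rank_studies(
    scores: Dict[str, List[float]], raw_cutoff: float = RAW_CUTOFF
) -> Tuple[list, list]:
    """Return (all studies ranked by cumulative score, studies passing the cutoff)."""
    rows = []
    for study, attributes in scores.items():
        if len(attributes) != 10:
            raise ValueError(f"{study}: expected 10 attribute scores")
        if any(not 0 <= a <= 5 for a in attributes):
            raise ValueError(f"{study}: each attribute earns at most 5 points")
        rows.append((study, sum(attributes), mean(attributes)))
    # Highest cumulative score first, i.e. low-bias to high-bias.
    rows.sort(key=lambda row: row[1], reverse=True)
    selected = [row for row in rows if row[2] >= raw_cutoff]
    return rows, selected


# Hypothetical attribute scores (C1..C10) for three invented studies.
example_scores = {
    "study_B": [3, 2, 0, 1, 2, 4, 2, 2, 3, 1],
    "study_A": [5, 4, 1, 3, 4, 5, 3, 4, 4, 3],
    "study_C": [2, 1, 0, 1, 1, 3, 1, 1, 2, 1],
}
ranked, selected = rank_studies(example_scores)
for study, cumulative, avg in ranked:
    print(f"{study}: cumulative={cumulative}, mean={avg:.2f}")
```

In this toy input, the third study has a mean score below 1.9 and is dropped, mirroring how the pool of 42 studies was reduced to 19.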
Fig. 10.
Selected 19 studies out of 42 based on the raw cutoff of a mean score greater than or equal to 1.9.
TABLE V
AP(ai)Bias Methodology Based Ranking of the AI Studies for ARDS
Interpretation of the AP(ai)Bias Strategy
To interpret these results, we analyze the mean scores for each of the ten attributes over all 19 studies (see Table V: C1 to C10). The mean and the standard deviation of these attributes are shown in Fig. 11, which can also be used for computing the percentage contribution of these attributes. (i) Learning paradigm (C6) was the highest scorer amongst all the attributes, since 86% of the studies were supervised. (ii) Inter- & intra-observer variability (C3, labelled IIV) showed the lowest mean value amongst all the studies. This demonstrated that the evaluation of AI techniques was undermined and inconclusive. (iii) Data size (C8) attained a mean value of 2.7, meaning that most of the data sizes were in the range of 100 to 500 subjects. (iv) Four attributes scored mediocre mean values, namely segmentation and classification (C1), benchmarking (C4), scientific validation and clinical evaluation (C5), and performance evaluation (C9), contributing 44%, 26%, 45.5%, and 30%, respectively. This shows that the segmentation and classification were mostly binary. Only 26% of the studies implemented benchmarking. Amongst all the studies, 45.5% performed the clinical evaluation with a radiologist or used a pre-evaluated cohort for training. The attribute performance evaluation (C9) scored 31.5%, indicating that the accuracy of the AI models was around 93%. We observed that only 6 studies ([49], [58], [59], [64], [66], [71]) (Table V) were in the low-bias pool (bias-cutoff greater than 2.5, C12, Table V) ((6/19) * 100∼32%) and passed the hypothesis.
Fig. 11.
Mean score using ten attributes on top 19 studies.
ROBINS-I
The objective of ROBINS-I is to mimic the randomization of the non-randomized studies. It covers seven distinct attributes (domains) divided into three intervention factors (marked parameters) through which bias can be studied: (a) “Pre-Intervention,” (b) “At-Intervention,” and (c) “Post-Intervention.” Table II shows the bias due to (i) confounding factors (data size and data source); (ii) selection of participants (training and testing protocols); (iii) classification of interventions (data augmentation and use of imaging features); (iv) deviations from intended interventions (demographics, multicenter data, and comorbidity); (v) missing data (SWAB test, patient follow-up); (vi) measurement of outcomes (innovation, optimization, validation by the radiologist, COVID-19 data size); and (vii) selection of the reported result (prevents it from being included in the meta-analysis). Table II (seven columns marked C1 to C7) presents the outcome for the 19 studies (“Study” column). A three-color scheme is adopted to depict the outcome of the qualitative analysis. Red means high-bias, indicating a severe issue in the study concerning the factors taken into consideration. Yellow means moderate-bias, indicating that the study holds good on the given set of non-randomized data. Green means low-bias, as the study performs comparatively well on the testing parameters and the input data. We conclude from ROBINS-I (Table II, “Study” column shaded in pink) that ∼73% ((14/19) * 100) of the studies have at least one attribute with high-bias. Only three studies ([64], [71], [80]) passed the hypothesis test ((3/19) * 100 ∼16%). Further, we also conclude that 60% ((79/132) * 100) of the cells were low-moderate bias (green and yellow) using the cell division approach.
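TABLE II
ROBINS-I Method
Legend: L = Low Risk, M = Moderate Risk, H = High Risk, U = Unclear Risk
| Study | C1 | C2 | C3 | C4 | C5 | C6 | C7 |
|---|---|---|---|---|---|---|---|
| [64] | L | M | L | L | M | L | L |
| [46] | H | H | M | M | H | M | H |
| [58] | H | H | H | M | H | M | H |
| [59] | M | L | L | L | H | L | M |
| [49] | H | H | M | M | H | M | H |
| [66] | H | H | L | M | H | M | H |
| [43] | H | H | L | M | H | M | H |
| [47] | L | L | L | M | L | U | U |
| [71] | L | L | L | M | M | L | L |
| [75] | H | H | H | M | H | M | H |
| [62] | H | H | L | L | H | H | H |
| [48] | H | H | H | U | U | H | H |
| [50] | L | L | H | L | L | U | M |
| [45] | L | L | L | U | L | U | M |
| [73] | L | H | L | M | L | M | M |
| [77] | M | H | H | M | H | M | M |
| [80] | L | L | M | M | L | M | M |
| [79] | L | H | H | U | L | L | M |
| [81] | M | L | M | M | L | L | H |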
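The study-level count quoted for ROBINS-I (studies with at least one high-bias attribute) can be reproduced from the ratings in Table II. The sketch below transcribes each row as a seven-letter string for C1 to C7; the variable names are hypothetical, and the letter codes are assumed to map to the table legend (L, M, H, U).

```python
# Ratings per study for C1..C7, transcribed from Table II (keys are reference numbers).
ROBINS_I = {
    64: "LMLLMLL", 46: "HHMMHMH", 58: "HHHMHMH", 59: "MLLLHLM",
    49: "HHMMHMH", 66: "HHLMHMH", 43: "HHLMHMH", 47: "LLLMLUU",
    71: "LLLMMLL", 75: "HHHMHMH", 62: "HHLLHHH", 48: "HHHUUHH",
    50: "LLHLLUM", 45: "LLLULUM", 73: "LHLMLMM", 77: "MHHMHMM",
    80: "LLMMLMM", 79: "LHHULLM", 81: "MLMMLLH",
}

# A study is counted when any of its seven domains is rated high-bias.
with_high_bias = [study for study, row in ROBINS_I.items() if "H" in row]
share = 100 * len(with_high_bias) / len(ROBINS_I)
print(f"{len(with_high_bias)} of {len(ROBINS_I)} studies ({share:.1f}%)")
```

With the rows as transcribed, 14 of the 19 studies contain at least one "H", which agrees with the 14/19 figure given in the text.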
PROBAST
PROBAST is a popular RoB assessment tool for AI-based studies. It uses four attributes, as shown in Table III: (a) participants, covering the source of the image database and whether a radiologist verified the images; (b) predictors, covering demographic data availability (Yes/No) and imaging features; (c) outcomes, covering whether different datasets were combined and whether the reverse transcription-polymerase chain reaction (RT-PCR) test was conducted for the cohort; and (d) analysis, covering the cohort size, number of COVID-19 patients, optimization techniques used, validation, and the innovation in the design. We used the same 19 studies adopted for the AP(ai)Bias ranking analysis (Table V). Using PROBAST, we conclude that ∼47% (9 out of 19) of the studies were high-bias (marked as H in red); some even had an unclear RoB. Five studies [64], [71], [79-81] out of the 19 selected studies passed the hypothesis ((5/19) * 100 ∼26%). Using the cell division approach, PROBAST showed that 67% ((51/76) * 100) of the cells were low-moderate bias.
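TABLE III
PROBAST Method
Legend: L = Low Risk, M = Moderate Risk, H = High Risk, U = Unclear Risk
| ID | Participants | Predictors | Outcomes | Analysis |
|---|---|---|---|---|
| [64] | L | L | L | L |
| [46] | H | M | M | M |
| [58] | H | H | M | H |
| [59] | M | L | M | U |
| [49] | H | U | M | H |
| [43] | H | M | M | H |
| [66] | H | M | M | U |
| [47] | L | L | M | U |
| [71] | L | M | M | L |
| [75] | H | H | M | U |
| [62] | H | L | M | H |
| [48] | H | H | M | H |
| [50] | L | M | L | U |
| [45] | L | M | L | U |
| [73] | M | L | M | M |
| [77] | M | H | M | U |
| [80] | L | M | L | M |
| [79] | M | M | L | M |
| [81] | M | L | L | M |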
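The cell division approach for PROBAST can be illustrated by counting individual table cells rather than studies. The sketch below transcribes each row of Table III as a four-letter string (participants, predictors, outcomes, analysis); the variable names are hypothetical, and "cell division" is read here as the share of low or moderate cells among all cells, which is an interpretation of the text.

```python
from collections import Counter

# Ratings per study, transcribed from Table III (keys are reference numbers).
PROBAST = {
    64: "LLLL", 46: "HMMM", 58: "HHMH", 59: "MLMU", 49: "HUMH",
    43: "HMMH", 66: "HMMU", 47: "LLMU", 71: "LMML", 75: "HHMU",
    62: "HLMH", 48: "HHMH", 50: "LMLU", 45: "LMLU", 73: "MLMM",
    77: "MHMU", 80: "LMLM", 79: "MMLM", 81: "MLLM",
}

# Tally every cell of the table by its rating letter.
cell_counts = Counter("".join(PROBAST.values()))
total_cells = sum(cell_counts.values())
low_moderate_cells = cell_counts["L"] + cell_counts["M"]
share = 100 * low_moderate_cells / total_cells
print(f"{low_moderate_cells} of {total_cells} cells ({share:.0f}%) are low or moderate")
```

With the rows as transcribed, 51 of the 76 cells are rated low or moderate, which agrees with the 51/76 figure given in the text.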
Analysis of RoB Using Venn Diagram
The following steps were adopted to generate the Venn diagram (Fig. 12). (i) Conversion of ROBINS-I (Supplemental A) and PROBAST (Supplemental C) from a qualitative scheme to a quantitative scheme, using the conversion scores of low-bias to 5, moderate-bias to 3, high-bias to 1, and unclear-bias to 0. (ii) Selection of common studies between (a) ROBINS-I and PROBAST (left: [43], [48], [49], [58], [66], [75], [77], right: [43], [46], [48], [49], [58], [62], [66], [75], [77]), (b) PROBAST and AP(ai)Bias ranking (left: [48], [75], [77], right: [43], [46], [48], [62], [75], [77]), (c) ROBINS-I and AP(ai)Bias (left: [48], [62], [75], [77], right: [43], [46], [48], [62], [75], [77]), and (d) all three paradigms (top: [48], [75], [77], bottom: [43], [46], [48], [62], [75], [77]).
Fig. 12.
Venn diagram shows the three bias paradigms using: (a) high-bias; (b) moderate-high bias scenarios.
These digital counts are shown in the Venn diagram (Fig. 12) for high-bias and moderate-high bias. The counts are 7, 3, 4, 3 for high-bias, and 9, 6, 6, 6 for moderate-high bias, respectively. (iii) These digital counts are then converted into percentages by normalizing them by the total studies (19), shown in Fig. 12. These percentages are 39%, 17%, 23%, and 17%, and 50%, 33%, 33%, 33% for high-bias and moderate-high bias, respectively. The top two low-bias studies [64], [71] were the same in all three paradigms. It was interesting to note that reference [66] was low-bias using AP(ai)Bias, while it was in moderate-high bias for ROBINS-I and PROBAST. This is because AP(ai)Bias takes into consideration attributes like CE, LP and DP. Note that Fig. 12 uses bias-cutoffs of 2.2 and 2.5 for high-bias and moderate-high bias, respectively.
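Steps (i) and (ii) can be sketched in a few lines. The conversion scores (5, 3, 1, 0) and the overlap lists are taken from the text above; the function name, the dictionary names, and the example rating strings are hypothetical and serve only to illustrate the mechanics.

```python
# Step (i): qualitative-to-quantitative conversion, scores as given in the text.
SCORE = {"L": 5, "M": 3, "H": 1, "U": 0}


def mean_bias_score(ratings):
    """Mean converted score for one study's ratings, e.g. 'LLMM'."""
    return sum(SCORE[r] for r in ratings) / len(ratings)


# Hypothetical rating strings, used only to show the conversion.
print(mean_bias_score("LLMM"))  # (5 + 5 + 3 + 3) / 4
print(mean_bias_score("HHMU"))  # (1 + 1 + 3 + 0) / 4

# Step (ii): overlap lists transcribed from the text (reference numbers).
HIGH_BIAS = {
    "ROBINS-I & PROBAST": {43, 48, 49, 58, 66, 75, 77},
    "PROBAST & AP(ai)Bias": {48, 75, 77},
    "ROBINS-I & AP(ai)Bias": {48, 62, 75, 77},
    "all three": {48, 75, 77},
}
MODERATE_HIGH_BIAS = {
    "ROBINS-I & PROBAST": {43, 46, 48, 49, 58, 62, 66, 75, 77},
    "PROBAST & AP(ai)Bias": {43, 46, 48, 62, 75, 77},
    "ROBINS-I & AP(ai)Bias": {43, 46, 48, 62, 75, 77},
    "all three": {43, 46, 48, 62, 75, 77},
}

high_counts = [len(studies) for studies in HIGH_BIAS.values()]
moderate_high_counts = [len(studies) for studies in MODERATE_HIGH_BIAS.values()]
print(high_counts, moderate_high_counts)

# Sanity check: the three-way overlap must sit inside every pairwise overlap.
for scenario in (HIGH_BIAS, MODERATE_HIGH_BIAS):
    assert all(scenario["all three"] <= overlap for overlap in scenario.values())
```

The list lengths give the digital counts placed in the Venn diagram regions, and the subset check is a quick way to catch transcription errors in the overlap lists.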
Discussions
As its main contributions, the aiSR first showed the basic pipeline of the ARDS framework (Fig. 3). The study then showed visually the statistical distribution of (a) different AI models, (b) image modalities, and (c) image dimension type (2-D vs. 3-D). The crux of the aiSR was to understand the Risk-of-Bias (RoB) in a non-randomized AI trial for handling ARDS using three paradigms: AP(ai)Bias, ROBINS-I, and PROBAST. AP(ai)Bias consisted of ten main attributes and several sub-attributes, while ROBINS-I and PROBAST consisted of seven and four attributes, respectively. These two qualitative assessments were quantified using the same strategy as in AP(ai)Bias. This framework further allowed us to show how the AP(ai)Bias-based ranking strategy can be compared against ROBINS-I and PROBAST using high-RoB and low-RoB cutoffs, represented pictorially using a Venn diagram. The study further presented 12 recommendations (six primary and six secondary) for high-RoB studies, aimed at better AI designs for ARDS. Finally, the study correlated the pathophysiology of ARDS with the lung damage process represented by GGO.
In terms of imaging modality, CXR [46], [57], [59], [72], [80] is more economical, and more than 50% of the studies used it. GGO has been one of the most common manifestations in CT image volumes, but it may vary from person to person. It has been noted that data from patients with rapid progression of the disease show a faster change in lung lesions; incorporating this in the AI framework requires more training data for the AI model, as well as follow-up on the patient's condition. The AI models have not been evaluated using randomized trials; to overcome this, we have proposed the use of novel methods for RoB analysis using ROBINS-I and PROBAST.
Benchmarking: Comparative Study on aiSR
Table IV compares the proposed study with previous aiSRs, where 12 types of attributes were chosen to compare the five studies [11-15]. The proposed study is in the last column, labelled “Suri.” Note that “✓” marks the unique contributions of the “Suri model.” We also offer recommendations on clinical validation, inter- and intra-observer variability, comorbidity, and risk granularity.
TABLE IV
Benchmarking Table for Systematic Reviews for COVID-19 ARDS
| SN | Attributes | Nagendran et al. [13] | Roberts et al. [14] | Wynants et al. [15] | Bao et al. [12] | Albahari et al. [11] | Montazeri et al. [16] | Suri et al. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | Dates | February 2020 | October 2020 | April 2020 | June 2020 | June 2020 | April 2021 | Proposed 2021 |
| 1 | ROBINS | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| 2 | PROBAST | ✓ | ✓ | ✓ | ✗ | ✗ | ✓ | ✓ |
| 3 | AP(ai)Bias Ranking | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| 4 | AP(ai)Bias interpretation | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| 5 | PRISMA | 91 | 45 | 27 | 13 | 11 | 44 | 89 |
| 6 | Study Considered | 91 | 45 | 27 | 13 | 11 | 44 | 42 |
| 7 | References | 41 | 84 | 71 | 30 | 109 | 80 | 89 |
| 8 | Patho section | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| 9.1 | Recommendations – R1 | Transparency | Input data | CRF | Multiethnic | MCDA | Fine-tuning | Data Size |
| 9.2 | Recommendations – R2 | High Quality | Robust training | PTC | - | AHP | Data augm | Clinical Validate |
| 9.3 | Recommendations – R3 | - | Reproducibility | VTEM | - | - | - | IIV |
| 9.4 | Recommendations – R4 | - | High Q doc | - | - | - | - | Comorbidity |
| 9.5 | Recommendations – R5 | - | Peer review | - | - | - | - | Risk Granularity |
| 9.6 | Recommendations – R6 | - | - | - | - | - | - | Reproducible |
| 10 | Segmentation | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ |
| 11 | AI arch layout | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ |
| 12 | Performance of AI studies | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
*CRF: Conventional risk factor consists of age, body temperature, and (respiratory) signs and symptoms; for prognostic models, age, sex, C reactive protein, lactic dehydrogenase, lymphocyte count, and potentially features derived from CT scoring; PTC: Point the challenges in design; VTEM: validate the existing models; IIV: Inter- and intra-observer.
*Segm: Segmentation; class: Classification; IIV: Intra- and inter-variability; Innov: Innovation; Bench: Benchmarking; PE: Performance evaluation; CE: Clinical Evaluation; LP: Learning Paradigm; DP: Data Preparation.
Recommendations
The primary set of six-point recommendations is as follows.
(i) Comorbidity: This recommendation focuses on an intuitive approach that can improve on the prevailing methods for COVID-19 diagnosis. Studies have shown that factors such as age, ethnicity, hypertension, diabetes, higher BMI, respiratory disorders, hyperlipidemia, and obesity lead to worsening of ARDS [102], [103]. In these studies, ARDS is more prominent among older patients and, given a timely prognosis, can be controlled to reduce the risk and improve the efficacy of COVID-19 care. Validating this requires a randomized controlled trial in which the data is clinically validated. Combining AI techniques with comorbidity factors can make the prognosis and diagnosis of this disease more accurate.
(ii) Inter- and intra-observer variability: Our observation shows that only 12% of the AI-based studies attempted an inter- and intra-observer variability analysis. Without it, the robustness of the AI results is not assured. Examples of comprehensive IIV can be seen in [104], [105]. Similarly, inter- and intra-operator variability can be computed, ensuring further reliability in clinical settings.
(iii) Data Size: About 50% of the AI studies for ARDS had fewer than 500 subjects in total; this could be improved by targeting 1,000 and above. Clinical trials included in meta-analyses typically range to many thousands of subjects (>15,000). This also requires conducting a “power analysis” for the AI system [106-108].
(iv) Scientific Validation and Clinical Evaluation: The engineering AI-based design needs to be validated by the clinical community in terms of reliability, accuracy, and reproducibility. Therefore, the medical, scientific, and engineering communities need to collaborate more closely. This could also incur more costs for system design, should the medical community participate for a longer duration [109], [110].
(v) Risk Granularity: Risk assessment in several fields of medicine is binary; however, such a strategy poses a challenge for drug prescription and patient care. For this reason, new strategies have evolved recently in which a multiclass framework provides a finer granularity of risk, leading to better control of medications and monitoring [111], [112]. This requires multiple classes in the ground-truth design for COVID-19 severity, which means a careful examination of the CT lung images by the radiologist in conjunction with the pulmonologist. The second alternative is to stratify the risk from the AI system's output as continuous values between 0 and 1, thereby partitioning the output into multiple classes. Note that the class thresholds are based on the baseline characteristics of the input cohort's demographics.
(vi) Scientific validation using cross-modality fusion: Scientific validation is crucial for ensuring the AI system's correct functioning. One way is to look at COVID-19 from two different angles, such as imaging the lung using CT and positron emission tomography (PET). With the advanced technology of combined PET/CT, one can visualize the metabolic distribution in the PET images corresponding to the spatial CT [113]. Fig. 13 below shows a PET image with the functional metabolic distribution of COVID-19. The selected studies did not consider PET/CT fusion as part of a systematic review.
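The second alternative described under Risk Granularity, partitioning a continuous AI output between 0 and 1 into several risk classes, might look like the sketch below. The thresholds and class labels are placeholders chosen purely for illustration and are not taken from the reviewed studies; as noted above, real thresholds would be derived from the baseline characteristics of the input cohort.

```python
from bisect import bisect_right

# Hypothetical cut points and labels; not taken from the reviewed studies.
THRESHOLDS = [0.25, 0.50, 0.75]
LABELS = ["low", "moderate", "high", "severe"]

def risk_class(probability):
    """Map a model output in [0, 1] to one of len(LABELS) ordered classes."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    # bisect_right counts how many thresholds are <= probability.
    return LABELS[bisect_right(THRESHOLDS, probability)]

print([risk_class(p) for p in (0.10, 0.25, 0.60, 0.95)])
# ['low', 'moderate', 'high', 'severe']
```

Keeping the thresholds in a single sorted list makes it straightforward to recalibrate them for a different cohort without touching the mapping logic.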
Fig. 13.
PET-CT pair of a COVID-19 lung ([113]).
The secondary set of recommendations includes (i) solid model design (training and prediction) [86], [89], [114], (ii) reproducibility [104], (iii) process of peer-review, (iv) high-quality documentation, (v) multi-ethnic and multi-regional data collection, and (vi) multiple ground-truth events for clinical validation [11]. Since the hypothesis (section II.B) was not met under any of the three paradigms, we conclude that the above concrete recommendations are required to improve AI design so that it meets the requirements of the hypothesis while maintaining optimal AI performance. Note that in our study the AI solutions did not consider socio-economic causes during the design. Recently, a study [115] elaborated on the need to include socio-economic conditions for COVID-19. This could possibly be adapted as an extension to the current work. The role of socio-economic conditions during ML design was explored by our group for neonatal deaths in different counties of Bangladesh [116]. A similar approach can also be adapted for COVID-19 based on geography and socio-economic conditions.
Strength, Weakness, and its Extension
The main strength of this review is the selection of the best 19 AI studies for analyzing RoB. Most of the studies adopted DL as their base architecture because the source data was medical imaging, for which DL was best suited. We observed that data augmentation was also used in 43% of the studies where the data was lacking. For the selected set of studies, we have successfully analyzed the risk factors using two schemes, ROBINS-I and PROBAST. A gap in the studies was the absence of any link between how they handled COVID severity, clinical validation, variability analysis, and cross-modality [117]. Some of the studies did not mention hardware constraints. Above all, none of them showed any 510(K) FDA approval. There is much inconsistency in the studies, which can be fixed by adopting some initial clinical validation on the patients using cross-modality. Even RL and TL with cross-modality and comorbidity can introduce innovation to this lifesaving study worldwide. Understandably, this may take more time for the medical community but could cut corners on patient care.
Conclusion
This study used the PRISMA model to select 89 studies, which were then AI-filtered to 42. Based on the mean score ranking, the selection was further refined to 19 studies using a raw cutoff of 1.9. Further, the three RoB paradigms were analyzed using the Venn diagram. Only 32%, 16%, and 26% of the studies fell in the low-moderate RoB range for the “non-randomized Artificial Intelligence-based” hypothesis for ARDS-based COVID-19-infected lungs, corresponding to AP(ai)Bias, ROBINS-I, and PROBAST, respectively. These percentages were obtained using a high-bias cutoff of 2.2 and a moderate-high bias cutoff of 2.5. None of the three RoB models met the requirement of the hypothesis. Overall, the aiSR presents a set of six-point primary and six-point secondary recommendations for improving the AI design for ARDS. The primary recommendations are (i) the inclusion of comorbidity in AI design, (ii) an increase in data size (to meet the performance standards for ARDS in COVID-19 lung-infected patients), (iii) scientific validation using cross-modality, (iv) conducting clinical validations, (v) improved inter- and intra-observer variability studies, and (vi) risk granularity for better drug prescription. The secondary set of recommendations includes (i) solid model design (training and prediction), (ii) reproducibility of the proposed model, (iii) process of peer-review, (iv) high-quality documentation, (v) multi-ethnic and multi-regional data collection, and (vi) multiple ground-truth events for clinical validation.