Ashley G Gillman (1), Febrio Lunardo (2, 3), Joseph Prinable (4), Gregg Belous (2), Aaron Nicolson (2), Hang Min (2), Andrew Terhorst (5), Jason A Dowling (2)
1. Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, Surgical Treatment and Rehabilitation Service, 296 Herston Road, Brisbane, QLD, 4029, Australia. Ashley.Gillman@csiro.au.
2. Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, Surgical Treatment and Rehabilitation Service, 296 Herston Road, Brisbane, QLD, 4029, Australia.
3. College of Science and Engineering, James Cook University, Australian Tropical Science Innovation Precinct, Townsville, QLD, 4814, Australia.
4. ACRF Image X Institute, University of Sydney, Level 2, Biomedical Building (C81), 1 Central Ave, Australian Technology Park, Eveleigh, Sydney, NSW, 2015, Australia.
5. Data61, Commonwealth Scientific and Industrial Research Organisation, College Road, Sandy Bay, Hobart, TAS, 7005, Australia.
Abstract
OBJECTIVES: To conduct a systematic survey of published techniques for automated diagnosis and prognosis of COVID-19 diseases using medical imaging, assessing the validity of reported performance and investigating the proposed clinical use-case. To conduct a scoping review into the authors publishing such work. METHODS: The Scopus database was queried and studies were screened for article type, and minimum source normalized impact per paper and citations, before manual relevance assessment and a bias assessment derived from a subset of the Checklist for Artificial Intelligence in Medical Imaging (CLAIM). The number of failures of the full CLAIM was adopted as a surrogate for risk-of-bias. Methodological and performance measurements were collected from each technique. Each study was assessed by one author. Comparisons were evaluated for significance with a two-sided independent t-test. FINDINGS: Of 1002 studies identified, 390 remained after screening and 81 after relevance and bias exclusion. The ratio of exclusion for bias was 71%, indicative of a high level of bias in the field. The mean number of CLAIM failures per study was 8.3 ± 3.9 [1,17] (mean ± standard deviation [min,max]). 58% of methods performed diagnosis versus 31% prognosis. Of the diagnostic methods, 38% differentiated COVID-19 from healthy controls. For diagnostic techniques, area under the receiver operating curve (AUC) = 0.924 ± 0.074 [0.810,0.991] and accuracy = 91.7% ± 6.4 [79.0,99.0]. For prognostic techniques, AUC = 0.836 ± 0.126 [0.605,0.980] and accuracy = 78.4% ± 9.4 [62.5,98.0]. CLAIM failures did not correlate with performance, providing confidence that the highest results were not driven by biased papers. Deep learning techniques reported higher AUC (p < 0.05) and accuracy (p < 0.05), but no difference in CLAIM failures was identified. 
INTERPRETATION: A majority of papers focus on the less clinically impactful diagnosis task, contrasted with prognosis, with a significant portion performing a clinically unnecessary task of differentiating COVID-19 from healthy. Authors should consider the clinical scenario in which their work would be deployed when developing techniques. Nevertheless, studies report superb performance in a potentially impactful application. Future work is warranted in translating techniques into clinical tools.
The novel coronavirus, SARS-CoV-2, and its associated disease, COVID-19, have presented a significant and urgent threat to public health while simultaneously disrupting healthcare systems. More than 2 years after the beginning of the pandemic, outbreaks continue to threaten to overwhelm healthcare systems, and viral variants continue to introduce uncertainty [1]. Fast and accurate diagnostic and prognostic capabilities help to quickly determine which patients need to be isolated and inform the triage of patients. Reverse-transcription polymerase chain reaction (RT-PCR) is the current clinical standard for diagnosis of COVID-19; however, its low sensitivity often necessitates repeat testing [2], which takes additional time. This has led to the suggestion that there is a role for radiology in diagnosing COVID-19.
Radiological professional bodies have generally recommended against the use of imaging for screening in COVID-19 but recognise the role of incidental findings and the use of imaging for disease staging. Early in the pandemic, the use of computed tomography (CT) for diagnosis and screening was discussed in the context of shortages of RT-PCR test kits and poor sensitivity [3]. In March of 2020, a consensus report was released [4], endorsed by the Society of Thoracic Radiology, the American College of Radiology and the Radiological Society of North America (RSNA), recommending against the use of chest CT for screening due to a low negative predictive value, but also partly due to a lack of evidence early in the pandemic. The Royal Australian and New Zealand College of Radiologists released their advice in April of 2020, which remains current, recommending against the use of chest radiography for screening but in favour of the use of CT for staging [5]. The report, however, stops short of recommending a severity scale.
By June of 2020, the World Health Organisation recommended the use of radiological imaging: (1) for diagnostic purposes in symptomatic patients when RT-PCR is not available, when it is available but results are delayed, or when RT-PCR is negative but there is high clinical suspicion of COVID-19; (2) for triage purposes when deciding to admit to hospital and/or intensive care unit (ICU); and (3) for staging purposes when deciding appropriate therapeutic management [6]. The most recent version of the Cochrane review on the topic suggests that CT and chest X-ray (CXR) are moderately sensitive and specific to the diagnosis of COVID-19, whereas ultrasound is sensitive but not specific to the diagnosis of COVID-19 [7]. This novel application of radiology has spurred an interest in the application of machine learning techniques to automate the image interpretation tasks.
Many investigators have proposed techniques to automate image interpretation in imaging of COVID-19 across a wide range of applications, including segmentation of COVID-19 related lesions (typically ground-glass opacities, GGOs), diagnosis, staging of the current disease progression and prognosis of likely future disease progression. However, the field has inspired controversy. DeGrave et al. [8] demonstrated that combining data from multiple sources, in particular where data from different classes have different acquisition and pre-processing parameters, led to a significant bias that artificially improved the measured performance in many studies. Garcia Santa Cruz et al. [9] presented a review of public CXR datasets, concluding that the most popular datasets used in the literature were at a high risk of introducing bias into reported results.
Many other reviews have been published on the topic; we now introduce the seminal ones. Shi et al. [10] presented a narrative review very early in the pandemic (published April of 2020) of machine learning techniques for segmentation of COVID-19-related lesions and for diagnosis, staging and prognosis of COVID-19 using CXR and CT. However, this early review did not consider potential study bias in its papers. Others have presented systematic reviews [11, 12] that, while following a more rigorous approach to inclusion, also failed to assess bias when assessing results. Wynants et al. [13] present a broadly-scoped systematic review for prediction models in COVID-19, leveraging the prediction model risk of bias assessment tool (PROBAST) [14]. They reported high risk of bias across the field. Roberts et al. [15] presented a systematic review of machine learning techniques applied to CXR and CT imaging, published up to the 3rd of October, 2020, assessing bias using the Checklist for Artificial Intelligence in Medical Imaging (CLAIM) [16], Radiomics Quality Score (RQS) [17] and PROBAST [14] and reporting methodological and dataset trends. They use this to develop a set of recommendations for authors in the field.
In this review, we use similar techniques to those presented by Roberts et al. [15]. Rather than assessing papers on separate criteria, RQS and CLAIM, we assess all papers with CLAIM. We also aim to present a richer analysis of techniques and their performance, and to provide an update, including publications until 31st October, 2021. We also introduce an analysis of authors and institutions in the field, in the hope that it encourages and facilitates further collaboration.
Research questions:
(1) Which techniques are most successful in differentiating COVID-19?
(2) What are the clinical requirements driving the development of these tools? How would such techniques be implemented clinically?
(3) Who is publishing in this field?
Methodology
Study selection
The inclusion criteria for the review are:
(1) Studies that aim to automatically diagnose, stage or prognose COVID-19, or to segment lesions associated with COVID-19 (manual contouring is allowed as a preprocessing step, under the assumption that this could be automated); and
(2) Studies that use medical imaging or signals, including CXR, CT, ultrasound, magnetic resonance imaging (MRI) or electrocardiograph (ECG), as input to their model.
rscopus version 0.6.6 [18] was used to retrieve articles according to the search criteria outlined in Panel 1. The search was performed on the 19th November, 2021. Papers meeting the inclusion criteria that were identified during the investigation but not identified in the search were also included in the study.
Panel 1: Scopus search criteria
TITLE-ABS-KEY ( ( covid OR coronavirus ) AND ( ( chest W/5 xray ) OR “computed tomography” OR ultrasound OR “magnetic resonance” OR mr OR mri OR ecg OR electrocardiograph* ) AND ( diagnos* OR staging OR identif* OR response OR prognos* OR segment* ) AND ( learn* OR convolutional OR network OR radiomic*) )
Exclusion criteria were also imposed to eliminate studies that exhibited or were likely to exhibit a high risk of bias:
(1) Studies from journals with a source normalized impact per paper (SNIP), as measured in 2021, of less than 1 were excluded. SNIP is a metric introduced by Scopus that measures contextual impact, normalising between fields with different citation rates. This process was manually checked by two of the authors, and journals that would otherwise have been eliminated, but that were reputable within their fields and likely to publish relevant studies, were retained.
(2) Studies that were more than 90 days old and had not attracted any citations were excluded. This criterion was included to automatically filter articles that the scientific community has deemed uninteresting, under the assumption that in such a fast-moving field, 90 days should be adequate to have attracted at least one citation.
(3) Studies with metadata indicating that they were Editorials, Reviews, Notes or Letters were excluded.
(4) Studies where application to COVID-19 is secondary and not the primary focus of the paper were excluded.
(5) Studies not meeting the minimum risk of bias assessment (see “Bias assessments” section) were excluded.
Remaining studies were assigned amongst reviewing authors, and each study was reviewed by one author, who assessed for minimum risk-of-bias and extracted data. Studies were not de-identified before analysis.
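The automated part of the screening described above (article type, journal SNIP and the 90-day citation rule) can be sketched as a simple filter. This is an illustrative sketch only, not the authors' pipeline, which used rscopus in R: the `Record` fields, the `journal_whitelisted` flag (standing in for the manual journal check) and the choice to treat a missing SNIP as failing are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

SEARCH_DATE = date(2021, 11, 19)  # search date stated in the text
EXCLUDED_TYPES = {"Editorial", "Review", "Note", "Letter"}


@dataclass
class Record:
    """Hypothetical minimal metadata for one retrieved study."""
    title: str
    doc_type: str
    snip: Optional[float]              # journal SNIP (2021); None if unranked
    published: date
    citations: int
    journal_whitelisted: bool = False  # assumed flag for the manual journal check


def passes_screening(r: Record, today: date = SEARCH_DATE) -> bool:
    # Article-type exclusion
    if r.doc_type in EXCLUDED_TYPES:
        return False
    # SNIP < 1 exclusion, unless the journal was manually retained.
    # Treating a missing SNIP as a failure is an assumption of this sketch.
    if not r.journal_whitelisted and (r.snip is None or r.snip < 1.0):
        return False
    # Older than 90 days with no citations
    if (today - r.published).days > 90 and r.citations == 0:
        return False
    return True
```

The remaining criteria (primary focus on COVID-19 and the minimum risk-of-bias check) require manual review and are not captured by a metadata filter of this kind.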
Bias assessments
Due to reports of a high risk-of-bias in the field [9, 13, 15], we include a bias assessment. Improper study design, data collection, data partitioning and statistical methods can lead to misleading reported results [14]. This commonly manifests as a positive bias because authors (rightly) attempt to improve the performance of their proposed techniques.
The CLAIM checklist was completed for all included papers [16]. Each of the 42 checklist items was scored as a pass, a fail or, where the item did not apply to the paper, “not applicable”; “not applicable” scores did not count towards the failure count. The number of failure scores was used as a measure of bias. Similar to Roberts et al. [15], we impose a subset of CLAIM, items 7, 9, 20, 21, 22, 25, 26 and 28, as a minimum risk of bias. Any papers that did not meet all subset checklist items were excluded. CLAIM checklist reports from Roberts et al. [15] were merged and used where available to avoid duplication.
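The scoring rule above can be sketched in a few lines. The item numbers of the required subset come from the text; the string labels (`"pass"`, `"fail"`, `"na"`), the dictionary layout and the decision to let a “not applicable” score satisfy a required item are assumptions of this sketch.

```python
from typing import Dict

N_ITEMS = 42
# Required subset stated in the text
REQUIRED_ITEMS = (7, 9, 20, 21, 22, 25, 26, 28)


def claim_failures(scores: Dict[int, str]) -> int:
    """Count failed items; 'na' items do not count towards the total."""
    return sum(1 for item in range(1, N_ITEMS + 1) if scores[item] == "fail")


def meets_minimum(scores: Dict[int, str]) -> bool:
    """True if no item of the required subset is failed.

    Assumption: an 'na' score on a required item is not treated as a failure.
    """
    return all(scores[item] != "fail" for item in REQUIRED_ITEMS)


# Hypothetical study: fails items 3 and 26, item 13 is not applicable
example = {item: "pass" for item in range(1, N_ITEMS + 1)}
example[3] = "fail"
example[13] = "na"
example[26] = "fail"
```

Here `claim_failures(example)` counts two failures, and `meets_minimum(example)` is false because item 26 belongs to the required subset, so this hypothetical study would be excluded.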
Extracted data
Methodological and performance results were collected per technique, where each study presents one or more techniques (Table 1). When multiple techniques were introduced in a study, only the highest performing technique was surveyed, unless the techniques served different purposes (e.g., one study presenting both a segmentation and a diagnostic technique) or different contexts (e.g., different available clinical data to augment image input).
Table 1
Data collected during survey
| Field | Definition |
| --- | --- |
| Task | Diagnosis (differentiating COVID-19 from healthy or other diseases), prognosis (this included staging, differentiating within COVID-19 for the severity or expected disease trajectory) and segmentation of COVID-19 related lesions, including GGO. |
| Output classes | For diagnosis tasks, whether COVID-19 was differentiated from other pneumonia, healthy controls or both. For prognosis tasks, the number of classes, or whether the task is a continuous regression, as well as the derivation of the class. Derivations were classified as clinical assessment, where the severity is measured based on clinical features at the time of imaging; progression, where the severity is measured either by time spent in hospital or by required interventions; or survival, where the severity is measured by whether the infection proved lethal. |
| Imaging type | Input to model, including modality and whether additional clinical or demographic information was passed into the model. |
| Model information | Including the machine learning or deep learning model, optimiser, parameters and augmentation (if deep learning) and manually extracted features (if radiomics). |
| Number of centres | Number of separate institutions from which data was sourced. |
| Performance | Performance measures for proposed technique, as reported. |
| Reproducibility | Whether data and code were made available. |
| Bias | A CLAIM checklist was completed for each study and the number of failures was used as a measure of potential bias. |
Analysis of studies
Accuracy and area under the curve (AUC) of the receiver operating characteristic (ROC), where reported, were used for performance comparison. Statistical significance was measured throughout this review using two-sided independent t-tests, with a significance threshold of p < 0.05. No adjustments were made for multiple comparisons.
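As an illustration of the comparison described above, the following is a minimal standard-library sketch of a two-sided independent two-sample t-test. The text does not state whether equal variances were assumed; this sketch assumes the pooled-variance (Student) form, and the function names are hypothetical. In practice a statistics library routine such as `scipy.stats.ttest_ind` would normally be used instead of the hand-rolled numerical integration shown here.

```python
import math
from statistics import mean, variance


def t_two_sided_p(t: float, df: float, steps: int = 2000) -> float:
    """Two-sided p-value of Student's t, by Simpson integration of the pdf on [0, |t|]."""
    t = abs(t)
    if t == 0:
        return 1.0
    # Log of the normalising constant, via lgamma to avoid overflow for large df
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(x: float) -> float:
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

    h = t / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3  # P(0 < T < |t|)
    return max(0.0, 1.0 - 2.0 * area)


def pooled_t_test(a, b):
    """Return (t, degrees of freedom, two-sided p) assuming equal variances."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))  # zero if both groups are constant
    t = (mean(a) - mean(b)) / se
    return t, df, t_two_sided_p(t, df)
```

A comparison would be called significant under the threshold above when the returned p-value is below 0.05. Because no adjustment is made for multiple comparisons, some significant results are expected by chance when many comparisons are run.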
Analysis of authors and publishers
Author, institution and publication metadata were extracted using rscopus 0.6.6 [18] and used to compute author h-indices. A co-author network was generated with tidygraph 1.2.0 [19] by linking authors that had published together, and the most central authors identified using the betweenness centrality.
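The co-author network analysis can be illustrated with a small standard-library sketch. The review itself used tidygraph in R; the code below is not that implementation. It builds an undirected co-authorship graph, finds its separate connected graphs, and computes unnormalised betweenness centrality with Brandes' algorithm for unweighted graphs. Identifying authors by plain name strings is an assumption made for the example; author identifiers would be more robust.

```python
from collections import defaultdict, deque
from itertools import combinations


def coauthor_graph(papers):
    """papers: iterable of author lists. Returns {author: set of co-authors}."""
    adj = defaultdict(set)
    for authors in papers:
        for a in authors:
            adj[a]  # touch the key so single-author papers still create a node
        for a, b in combinations(set(authors), 2):
            adj[a].add(b)
            adj[b].add(a)
    return dict(adj)


def components(adj):
    """Connected components, found by breadth-first search."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        seen.add(start)
        while queue:
            v = queue.popleft()
            comp.add(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        comps.append(comp)
    return comps


def betweenness(adj):
    """Brandes' algorithm: unweighted, undirected, unnormalised."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack, queue = [], deque([s])
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0.0)  # number of shortest paths from s
        dist = dict.fromkeys(adj, -1)
        sigma[s], dist[s] = 1.0, 0
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:  # first visit
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:  # accumulate dependencies in order of decreasing distance
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair of endpoints is counted twice in an undirected graph
    return {v: c / 2 for v, c in bc.items()}
```

For three hypothetical papers with author lists `["A", "B"]`, `["B", "C"]` and `["D", "E"]`, the graph has two components and author B has the highest betweenness, because B is the only link between A and C.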
Results
Of 1002 studies identified, 282 were assessed against the required subset of the CLAIM checklist for exclusion, after which 81 studies were included in the review (Fig. 1). A list of identified and included studies is available in Supplementary 1, Table S1, and the full set of studies identified and collected data is available in Supplementary 2. CLAIM 26, which pertains to the evaluation of the best-performing model, eliminated the most studies (Fig. 2, left). Most papers failing this item did not evaluate against a separate test set after presenting multiple models. CLAIM 25, which requires an adequate description of hyperparameter selection, eliminated the next most studies. Only approximately one in four papers met the inclusion criteria, and approximately one in four papers failed half or more of the required CLAIM subset (Fig. 2, right). From the 81 studies included, a total of 103 separate techniques were included.
Fig. 1
PRISMA flow diagram of search
Fig. 2
Studies excluded for bias. The percentage of total studies that failed each of the required subset of the CLAIM checklist for inclusion (left), and a histogram of the number of failures (right), where only studies with 0 failures met the inclusion criteria
Bias
Remaining CLAIM failures in the included articles are depicted in Fig. 3 (left). The count of failures for each article was used as the risk-of-bias surrogate; a histogram over all papers is shown in Fig. 3 (right). The number of failures per study was 8.3 ± 3.9 (mean ± standard deviation).
Fig. 3
CLAIM results of studies included: the number of included studies that failed each of the CLAIM items (left), and a histogram of the number of failures (right)
Methodologies
The majority, 58%, of techniques sought to solve a diagnosis task, attempting to classify COVID-19 disease from healthy patients and/or non-COVID-19 pneumonia (Fig. 4, left), versus 31% performing prognosis (where techniques performing both are counted in both). Of the 31% of techniques attempting to solve a prognosis task, the majority used an objective prognostic outcome measure (46% progression and 16% survival) rather than matching a clinical assessment.
Fig. 4
(Left) Machine learning tasks attempted to be solved by techniques. (Top Right) A breakdown of Diagnosis and Diagnosis & Prognosis approaches by diagnostic outcome variable classes. (Bottom Right) A breakdown of Prognosis and Diagnosis & Prognosis approaches by prognostic outcome variable. The inner ring represents the number of classes, or continuous for regression tasks, and the outer ring represents the derivation of the outcome variable. See Table 1 for definitions of derivations
Most papers used CT images, either in 3D or as 2D slices, as model input, followed by CXR and ultrasound (US) (Fig. 5, left). Only a small minority of papers included clinical features as input. Although MRI and ECG were explicitly included within the scope of the review, no techniques using these modalities were included. No MRI papers were identified, and none of the 3 identified ECG papers that progressed beyond screening met the inclusion criteria.
Fig. 5
(Left) The distribution of modalities used for input to techniques. (Middle) The reported AUC and (Right) accuracy of techniques by modality. Only techniques reporting AUC or accuracy are included, respectively. Results of a two-sided independent t-test are given as ‘*’ for significance or ‘ns’ for no significance
The majority of papers used a deep learning approach; the most common deep learning models used are listed in Fig. 6.
Fig. 6
(Left) The distribution of techniques using traditional machine learning and radiomics approaches versus deep learning and (Right) the distribution of the most popular deep learning networks
Performance
Performance is only reported here for studies where AUC or accuracy were described. The top-performing diagnostic and prognostic techniques are listed in Tables 2 and 3, respectively. Neither AUC (Fig. 7, left) nor accuracy (Fig. 7, right) correlated significantly with the number of CLAIM failures, for either diagnosis or prognosis. There were no statistically significant differences in performance between input modalities (Fig. 5, middle and right), although CXR appeared to provide a higher AUC than CT, and US appeared to provide a lower accuracy than CT and CXR. Deep learning approaches had higher reported AUC (p = 0.04) and accuracy (p = 0.01), but no significant difference in bias was identified (Fig. 8).
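The test of whether performance varies with the number of CLAIM failures amounts to asking whether the slope of a regression line differs from zero. The following is a minimal sketch of an ordinary least-squares fit and the associated t statistic, assuming simple linear regression of the performance measure on the failure count; the function name is hypothetical and the data in the usage note are made up for illustration.

```python
import math
from statistics import mean


def slope_test(x, y):
    """Least-squares fit of y on x.

    Returns (slope, intercept, Pearson r, t statistic for slope = 0).
    The t statistic has n - 2 degrees of freedom.
    """
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / math.sqrt(sxx * syy)
    if abs(r) >= 1.0:
        # Perfect linear fit: the statistic is unbounded
        t = math.copysign(math.inf, r)
    else:
        t = r * math.sqrt((n - 2) / (1 - r * r))
    return slope, intercept, r, t
```

For the made-up points x = [1, 2, 3, 4] and y = [2, 4, 5, 4], the fitted slope is 0.7 and the intercept is 2.0. The two-sided p-value follows from comparing the t statistic with a Student's t distribution with n - 2 degrees of freedom.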
Table 2
Union of top 5 performing diagnostic techniques by AUC and accuracy. Techniques performing binary classification between healthy and COVID-19 were excluded
| Refs. | CLAIM failures | Classes | Modality | Datasets | Method | AUC | Accuracy |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Zheng et al. [20] | 7/42 | 3 class | CT | Bespoke, COVID-CT [21] | DenseNet-121 | 0.991 | 98.6% |
| Han et al. [22] | 10/42 | 3 class | CT | Bespoke | Attention DL | 0.99 | 97.9% |
| Das et al. [23] | 14/42 | 3 class | CXR | Cohen [24], Montgomery County X-ray [25], Kermany [26] | InceptionNet | 0.99 | 99.0% |
| Wang et al. [27] | 3/42 | 3 class | CT | 3DLSC-COVID [27] | U-Net, ResNet | 0.983 | |
| Liu et al. [28] | 6/42 | 2 class (pneumonia) | CT, clinical features | Bespoke | LASSO Radiomics | 0.98 | 93.0% |
| Krakansis et al. [29] | 7/42 | 3 class | CXR | Kermany [26], Cohen [24] | CNN | | 98.3% |
| Jin et al. [30] | 4/42 | 3 class | CT | NIH Chest X-ray [31] | AlexNet | | 96.86% |
CNN convolutional neural network, DL deep learning, LASSO least absolute shrinkage and selection operator
Table 3
Union of top 5 performing prognostic techniques by AUC and accuracy
| Refs. | CLAIM failures | Classes | Modality | Datasets | Method | AUC | Accuracy |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Tang et al. [32] | 10/42 | 2 (clinical assessment) | CT | Bespoke | RF | 0.98 | 89% |
| Wu et al. [33] (CrrScore) | 9/42 | 2 (progression) | CT, clinical features | Bespoke | LASSO | 0.977 | |
| Wu et al. [33] (RadScore) | 9/42 | 2 (progression) | CT | Bespoke | LASSO | 0.976 | |
| Li et al. [34] | 11/42 | 2 (clinical assessment) | CT | Bespoke | LR | 0.97 | |
| Elsharkawy et al. [35] | 12/42 | 2 (clinical assessment) | CXR | Cohen [24], CORD-19 [36] | NN | | 98% |
| Wang et al. [27] | 7/42 | 3 (survival) | CT | Bespoke | LR | | 88.5% |
| Meng et al. [37] | 5/42 | 2 (survival) | CT | Bespoke | De-DOVID19-Net | | 87.5% |
| Zhu et al. [38] | 10/42 | 2 (progression) | CT | Bespoke | LR | | 85.69% |
RF random forest, LASSO least absolute shrinkage and selection operator, LR logistic regression, NN neural network
Fig. 7
Performance of techniques, as measured by AUC (left) and accuracy (right), plotted against CLAIM failures. Hue represents tasks, as indicated in the legend. Dashed lines indicate the mean regression for each of the tasks, and shading indicates the 95% confidence interval. All regression lines were compared with a two-sided independent t-test against a null hypothesis that gradient = 0, none of which reached significance
Fig. 8
Comparison of (Left) AUC, (middle) accuracy and (Right) number of CLAIM fails between techniques leveraging deep learning and those leveraging classical machine learning and radiomics approaches. Results of a two-sided independent t-test are represented as ‘*’ for significance or ‘ns’ for no significance
Authors
The country of residence of authors tended to correlate with countries that were affected the most by the pandemic in early 2020 (Fig. 9).
Fig. 9
Number of articles published by author country. Articles with authors from multiple countries, indicated by hue, are counted in duplicate for each country
A network analysis of connectivity between authors yielded 48 separate graphs from the 81 publications, depicted in Supplementary 1 Figure S1, with a subset shown in Fig. 10. The most productive research groups are summarised in Table 4.
Fig. 10
Authorship graph, where nodes represent authors and edges represent co-authorship. Depicted are the 5 largest clusters
Table 4
20 most productive groups
| Research group | City | Country | Studies |
| --- | --- | --- | --- |
| Department of Radiology, Tongji Hospital, Huazhong University of Science and Technology | Wuhan | China | 16 |
| Institute of Automation, Chinese Academy of Sciences | Beijing | China | 10 |
| Department of Radiology, The First Affiliated Hospital of Jinan University | Guangzhou | China | 7 |
| Beihang University | Beijing | China | 7 |
| NVIDIA | Santa Clara | United States | 7 |
| Department of Radiology, Université de Paris | Paris | France | 6 |
| School of Electrical and Computer Engineering, University of Oklahoma | Norman | United States | 5 |
| College of Intelligence Science and Technology, National University of Defense Technology | Changsha | China | 5 |
| Huazhong University of Science and Technology | Wuhan | China | 5 |
| Tencent | Shenzhen | China | 5 |
| Department of Radiology, Renmin Hospital of Wuhan University | Wuhan | China | 4 |
| Department of Radiology, Xinhua Hospital | Shanghai | China | 4 |
| Department of Radiology, Xiangya Hospital | Changsha | China | 4 |
| Universidad de Granada | Granada | Spain | 4 |
| Hubei Province Key Laboratory of Molecular Imaging | Wuhan | China | 6 |
| Department of Bioengineering, University of Louisville | Louisville | United States | 3 |
| School of Public Health, Capital Medical University | Beijing | China | 3 |
| Department of Radiology, Shanghai Jiao Tong University | Shanghai | China | 3 |
| Department of Radiology, Zhongnan Hospital of Wuhan University | Wuhan | China | 3 |
| The University of Adelaide | Adelaide | Australia | 3 |
Discussion
In this work, we present a systematic review of automated techniques for diagnosis, prognosis and segmentation of COVID-19 disease. Because the field has proven both popular and controversial, we used liberal exclusion criteria to reduce the number of lower-quality papers for manual review. In formulating the criteria, we assumed that impactful papers are likely to be published in highly cited publications and are likely to attract citations themselves. Studies published in journals with a SNIP below 1 were eliminated, which risks eliminating journals that are not ranked by Scopus. To reduce this risk, the list of eliminated journals was reviewed by all authors, and a consensus was reached on which non-indexed journals to include. Further, studies that had been published for more than 90 days yet had not attracted any citations were eliminated, which risks eliminating unnoticed studies. Even after screening, 71% of papers were excluded during bias assessment (Figs. 1, 2), indicating that the majority of work in the field is at high risk of bias, including work published in reputable peer-reviewed publications.
Sources of bias
Datasets
Many studies use data from sources with minimal provenance and metadata, and often use data that was not intended for training diagnostic or prognostic tools. A number of datasets aggregate data from different sources, some of which may be aggregates themselves [9]; and many studies aggregate a number of datasets, either to increase their training size or to provide an independent test set. However, this produces a complex set of participants and leads to a high risk that the same images are present in both the training and evaluation sets. Other datasets present a series of CT slices without metadata indicating which images belong to which participants, leading to a high risk that adjacent axial slices from a single participant lie in both the training and evaluation sets. Any studies exhibiting these risks failed CLAIM 21.
Although it did not lead to exclusion in this review, some datasets also aggregate different classes from different sources. It has been established that this presents a high risk of bias, as networks are able to distinguish between classes using non-disease-related domain effects.
Data handling
Studies that did not split training and evaluation sets at the patient level also failed CLAIM 21. This mostly occurred in papers dealing with CT as 2D axial slices, some of which randomly allocated all 2D images between the training and evaluation sets. CLAIM 26 was responsible for the most failures (45%; Fig. 2, left), which often indicated a failure to reserve an evaluation set for use after model selection.
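To make the patient-level requirement concrete, the following is a minimal sketch of a split that keeps all slices from a participant on the same side of the partition. It is an illustration rather than a method taken from any reviewed study; the function name, the `(patient_id, slice_id)` tuple layout and the default fraction are assumptions.

```python
import random
from collections import defaultdict


def patient_level_split(slices, test_fraction=0.2, seed=0):
    """Split (patient_id, slice_id) pairs so that no patient spans both sets."""
    by_patient = defaultdict(list)
    for patient_id, slice_id in slices:
        by_patient[patient_id].append((patient_id, slice_id))

    # Shuffle patients, not slices, so adjacent slices stay together
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])

    train = [s for p in patients if p not in test_patients for s in by_patient[p]]
    test = [s for p in patients if p in test_patients for s in by_patient[p]]
    return train, test
```

The same idea extends to a three-way partition: reserving a further group of patients that is used only once, after model selection, addresses the evaluation issue described under CLAIM 26.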
Description of methods
The remaining CLAIM checklist items, 7, 9, 20, 22, 25 and 28, each relate to adequately documenting methodology. Adequate documentation matters not only for reproducibility, which technical publications need in order to advance the field, but also because poorly documented methods could conceal bias. Machine learning requires attention to detail in implementation to prevent overfitting, data dredging or otherwise accidentally positively biasing results.
Study demographics
The majority (58%) of techniques sought to solve a diagnosis task. Although there has been limited need for diagnosis of COVID-19 using imaging, it offers the potential for faster analysis compared with RT-PCR, especially considering that consecutive negative RT-PCR tests are required for exclusion when the pre-test probability is high [39]. However, within this set, 38% only demonstrated an ability to differentiate COVID-19 from healthy individuals. Any clinically realistic scenario for deployment of such an algorithm would need to demonstrate an ability to aid in a differential diagnosis between similar diseases. Regardless, most professional bodies recommend the use of radiographic imaging in COVID-19 only for triage purposes [5, 6, 40], and it is therefore likely to be more impactful for investigators to explore prognostic techniques.
CT scanning was the most popular modality, likely due to the image quality of tomographic imaging and the availability of public datasets. The additional context a 3D image can give may also have motivated the use of the modality, although many techniques only considered 2D axial sections. Given the clinical context and the fact that techniques are likely to be most useful during an outbreak, the use of CXR may be more convenient and practical. For example, clinical practice dictates that imaging rooms require an hour between patients for cleaning, a requirement that can be obviated with portable CXR that can move to the patient's room [41]. Therefore, we suggest that future investigations may be more impactful in delivering a technique using CXR data, especially as no significant performance differences were seen between CXR and CT (Fig. 5).
It has been proposed that ultrasound analysis for COVID-19 could be valuable in rural and remote regions, and as a tool to facilitate social distancing in urban regions [42]. The relatively niche requirement means that systems for automated analysis of ultrasound are likely to be less impactful. This may be offset by the low cost of ultrasound, and the potential to deploy systems to developing countries.
Other modalities, including MRI and even ECG, were explicitly included in the scope of this review; however, no papers met the inclusion criteria for either. MRI generally yields poor contrast within the lung and provides few benefits over CT in this application. Some studies investigating ECG remained after the screening process, but were excluded either because they were not automated or because they did not meet the bias assessment requirements.
Study performance
Studies tended to report excellent diagnostic and prognostic performance based on imaging features. The top diagnostic techniques all reported AUC ≥ 0.98 and accuracy ≥ 96.8% (Table 2), while the prognostic techniques reported AUC ≥ 0.97 and accuracy ≥ 85.7% (Table 3). Further, these results were relatively stable across the number of CLAIM failures (Fig. 7), providing some confidence that the top results are not dominated by biased studies. Notably, though, the top performing prognostic techniques in Table 3 are binary classification tasks, which naturally yield higher metrics than those with more classes.
Observations
Many studies used image storage formats that don’t meet medical imaging standards. Images may be stored at lower bit depth resolution, be stored using lossy compression, or be stored without requisite metadata. If these traits are consistent between classes, these issues are less likely to lead to a positive bias in reported results but may lead to lower performance. Similarly, many CT studies reported using per-image intensity normalisation for pre-processing. For quantitative modalities such as CT, this leads to a loss of information that the network is likely having to account for internally.
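The information loss from per-image intensity normalisation can be illustrated with a toy example. In this minimal sketch, two hypothetical sets of CT values in Hounsfield units (HU) become indistinguishable after per-image min-max scaling, whereas a fixed HU window keeps them apart. The function names, the sample values and the window bounds of -1000 to 400 HU are illustrative assumptions, not values taken from the reviewed studies.

```python
from typing import List, Sequence


def per_image_minmax(pixels: Sequence[float]) -> List[float]:
    """Scale one image to [0, 1] using its own minimum and maximum."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return [0.0 for _ in pixels]
    return [(p - lo) / (hi - lo) for p in pixels]


def fixed_window(
    pixels: Sequence[float], lo: float = -1000.0, hi: float = 400.0
) -> List[float]:
    """Clip to a fixed HU window, then scale to [0, 1]."""
    return [(min(max(p, lo), hi) - lo) / (hi - lo) for p in pixels]


# Illustrative HU values only: same relative pattern, different absolute level.
aerated = [-900.0, -800.0, -700.0]
consolidated = [-300.0, -200.0, -100.0]

# Per-image scaling erases the absolute difference between the two regions.
same_after_minmax = per_image_minmax(aerated) == per_image_minmax(consolidated)

# A fixed window preserves it, because both images share one scale.
differ_after_window = fixed_window(aerated) != fixed_window(consolidated)
```

Here `same_after_minmax` and `differ_after_window` are both true, which is the sense in which per-image normalisation discards the quantitative meaning of CT intensities.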
Input data
Studies that presented techniques under identical conditions with and without clinical data reported superior performance with the clinical data [28, 43]. This may reflect reporting bias, but some combination of demographic, symptomatic and imaging data is likely to provide additional discrimination of disease progression. Much of this information is relatively easily acquired, so there is little cost to including it.
Ethics
The majority of studies presenting novel datasets reported detail on the ethical approval. However, far fewer provided information on the consent given by participants, as required by CLAIM item 7. To be consistent with the analysis of Roberts et al., we have ignored this requirement; however, we note that this is an area to be improved in the medical imaging literature. Further, no studies that sourced data from public datasets reported any ethical approval. The National Statement on Ethical Conduct in Human Research [44] defines human data to include data sourced from public datasets.
Clinical translation
Few of the reviewed papers realistically considered clinical deployment. As Roberts et al. [15] highlight, no developed systems are ready to be deployed clinically, with one reason being the need to work with clinicians to ensure the developed algorithms are clinically relevant and implementable. This is highlighted by a review by Born et al. [45], who found that although 84% of clinical studies report the use of CT (with CXR comprising only 10% of studies), a much larger proportion of the AI papers were focused on X-ray. The same paper also emphasises the need for additional stakeholder engagement, including patients, ethics committees, regulatory bodies, hospital administrators and clinicians. For clinical deployment, medical imaging software generally requires validation through randomised controlled trials, regulatory certification (generally the software would be developed within an ISO 13485 and IEC 62304 environment), and integration with existing clinical workflow (aligning with agreed standards for interoperability and upgradability, particularly the DICOM standard and required vendor tags).
Author demographics
We provide data on the authors (Fig. 9) and institutions (Table 4) publishing in the field as a landscape map for new authors. Most of the authors are located in China, the United States of America and Italy, and most of the most productive groups are in China. Collaboration between groups predominantly occurred within the same country, except for a cluster of collaboration between Italy and the United States (Fig. 10).
Review limitations
Studies were filtered automatically using the SNIP of the publishing journal and, for studies older than 90 days at the time of search, the number of citations. This was required in order to regulate the scope of the manually reviewed articles. It risked omitting rigorous papers that have not attracted scientific interest or are published in less circulated or newer journals. We believe this risk is low enough that the results presented are generalisable to the field.
In this work, we collect studies primarily from the Scopus database. While Scopus and Web of Science are historically the most widely used databases in bibliometric analysis, their coverage is not complete. Notwithstanding, Scopus shares 99.11% of its indexed journals with Web of Science and 96.61% with Dimensions. For this reason, we believe the methods in this review were valid and fit-for-purpose.
In this work, we use CLAIM as a surrogate measure for bias. CLAIM provides a prescriptive and objective criterion, well-suited to having a range of reviewers quickly and consistently assess a large number of papers. However, CLAIM is designed as a checklist of best practices, as opposed to an assessment of bias. The number of CLAIM failures should be interpreted by the reader as only an approximate measure of bias.
Conclusion
In this systematic review, we collected 1002 studies and have included 82 in the analysis after screening, relevance and bias assessment. A 71% exclusion ratio for bias despite extensive screening was indicative of a high level of risk-of-bias in the field. Commonly, publications sought to solve tasks with lower potential clinical impact, focusing on diagnosis rather than prognosis and differentiation of COVID-19 from controls rather than from other likely candidate diseases in a differential diagnosis. Similarly, clinical considerations and deployment were seldom discussed. Medical imaging standards were also regularly not met, with data sourced online without provenance and in compressed formats. Nevertheless, studies reported superb prognostic and diagnostic performance, and these results were robust amongst studies regardless of risk-of-bias or modality. Deep learning studies tended to report improved performance but did not report higher risk-of-bias compared with traditional machine learning approaches. We therefore conclude that the field has proven itself as a concept and that future work should focus on developing clinically useful and robust tools.