Automated COVID-19 diagnosis and prognosis with medical imaging and who is publishing: a systematic review.

Ashley G Gillman¹, Febrio Lunardo²,³, Joseph Prinable⁴, Gregg Belous², Aaron Nicolson², Hang Min², Andrew Terhorst⁵, Jason A Dowling².

Abstract

OBJECTIVES: To conduct a systematic survey of published techniques for automated diagnosis and prognosis of COVID-19 diseases using medical imaging, assessing the validity of reported performance and investigating the proposed clinical use-case. To conduct a scoping review into the authors publishing such work.
METHODS: The Scopus database was queried and studies were screened for article type, and minimum source normalized impact per paper and citations, before manual relevance assessment and a bias assessment derived from a subset of the Checklist for Artificial Intelligence in Medical Imaging (CLAIM). The number of failures of the full CLAIM was adopted as a surrogate for risk-of-bias. Methodological and performance measurements were collected from each technique. Each study was assessed by one author. Comparisons were evaluated for significance with a two-sided independent t-test.
FINDINGS: Of 1002 studies identified, 390 remained after screening and 81 after relevance and bias exclusion. The exclusion rate for bias was 71%, indicative of a high level of bias in the field. The mean number of CLAIM failures per study was 8.3 ± 3.9 [1,17] (mean ± standard deviation [min,max]). 58% of methods performed diagnosis versus 31% prognosis. Of the diagnostic methods, 38% differentiated COVID-19 from healthy controls. For diagnostic techniques, area under the receiver operating characteristic curve (AUC) = 0.924 ± 0.074 [0.810,0.991] and accuracy = 91.7% ± 6.4 [79.0,99.0]. For prognostic techniques, AUC = 0.836 ± 0.126 [0.605,0.980] and accuracy = 78.4% ± 9.4 [62.5,98.0]. CLAIM failures did not correlate with performance, providing confidence that the highest results were not driven by biased papers. Deep learning techniques reported higher AUC (p < 0.05) and accuracy (p < 0.05), but no difference in CLAIM failures was identified.
INTERPRETATION: A majority of papers focus on the less clinically impactful diagnosis task, contrasted with prognosis, with a significant portion performing a clinically unnecessary task of differentiating COVID-19 from healthy. Authors should consider the clinical scenario in which their work would be deployed when developing techniques. Nevertheless, studies report superb performance in a potentially impactful application. Future work is warranted in translating techniques into clinical tools.

Keywords: Chest X-ray; Computed tomography; Coronavirus; Diagnosis; Prognosis; Staging

Year: 2021 PMID: 34919204 PMCID: PMC8678975 DOI: 10.1007/s13246-021-01093-0

Source DB: PubMed Journal: Phys Eng Sci Med ISSN: 2662-4729

Introduction

The novel coronavirus, SARS-CoV-2, and its associated disease, COVID-19, have presented a significant and urgent threat to public health while simultaneously disrupting healthcare systems. More than 2 years after the beginning of the pandemic, outbreaks continue to threaten to overwhelm healthcare systems, and viral variants continue to introduce uncertainty [1]. Fast and accurate diagnostic and prognostic capability helps to quickly determine which patients need to be isolated and informs triage of patients. Reverse-transcription polymerase chain reaction (RT-PCR) is the current clinical standard for diagnosis of COVID-19; however, its low sensitivity often necessitates repeat testing [2], which takes additional time. This has led to the suggestion that there is a role for radiology in diagnosing COVID-19. Radiological professional bodies have generally recommended against the use of imaging for screening in COVID-19 but recognise the role of incidental findings and of imaging for disease staging. Early in the pandemic, the use of computed tomography (CT) for diagnosis and screening was discussed in the context of shortages of RT-PCR test kits and poor sensitivity [3]. In March of 2020, a consensus report was released [4], endorsed by the Society of Thoracic Radiology, the American College of Radiology and the Radiological Society of North America (RSNA), recommending against the use of chest CT for screening due to a low negative predictive value, but also partly due to a lack of evidence early in the pandemic. The Royal Australian and New Zealand College of Radiologists released their advice in April of 2020, which remains current, recommending against the use of chest radiograph for screening but recommending the use of CT for staging [5]. The report, however, stops short of recommending a severity scale.
By June of 2020, the World Health Organization recommended the use of radiological imaging: (1) for diagnostic purposes in symptomatic patients when RT-PCR is not available, is available but results are delayed, or is negative but there is high clinical suspicion of COVID-19; (2) for triage purposes when deciding to admit to hospital and/or intensive care unit (ICU); and (3) for staging purposes when deciding appropriate therapeutic management [6]. The most recent version of the Cochrane review on the topic suggests that CT and chest X-ray (CXR) are moderately sensitive and specific to the diagnosis of COVID-19, whereas ultrasound is sensitive but not specific to the diagnosis of COVID-19 [7]. This novel application of radiology has spurred an interest in the application of machine learning techniques to automate the image interpretation tasks. Many investigators have proposed techniques in a wide range of applications to automate image interpretation in imaging of COVID-19, including segmentation of COVID-19 related lesions, typically ground-glass opacities (GGOs), diagnosis, staging of the current disease progression and prognosis of likely future disease progression. However, the field has inspired controversy. DeGrave et al. [8] demonstrated that combining data from multiple sources, in particular where data from different classes have different acquisition and pre-processing parameters, led to a significant bias that artificially improved the measured performance in many studies. Garcia Santa Cruz et al. [9] presented a review of public CXR datasets, concluding that the most popular datasets used in the literature were at a high risk of introducing bias into reported results. Many other reviews have been published on the topic; we now introduce the seminal ones.
Shi et al. [10] presented a narrative review very early in the pandemic (published April of 2020) of machine learning techniques for segmentation of COVID-19-related lesions and for diagnosis, staging and prognosis of COVID-19 using CXR and CT. However, this early review did not consider potential study bias in its papers. Others have presented systematic reviews [11, 12] that, while following a more rigorous approach to inclusion, also failed to assess bias when assessing results. Wynants et al. [13] present a broadly-scoped systematic review for prediction models in COVID-19, leveraging the prediction model risk of bias assessment tool (PROBAST) [14]. They reported high risk of bias across the field. Roberts et al. [15] presented a systematic review of machine learning techniques applied to CXR and CT imaging, published up to the 3rd of October, 2020, assessing bias using the Checklist for Artificial Intelligence in Medical Imaging (CLAIM) [16], Radiomics Quality Score (RQS) [17] and PROBAST [14] and reporting methodological and dataset trends. They use this to develop a set of recommendations for authors in the field.
In this review, we use similar techniques to those presented by Roberts et al. [15]. Rather than assessing papers on separate criteria, RQS and CLAIM, we assess all papers with CLAIM. We also aim to present a richer analysis of techniques and their performance, and to provide an update, including publications until 31st October, 2021. We also introduce an analysis of authors and institutions in the field, in the hope that it encourages and facilitates further collaboration.
Research questions: Which techniques are most successful in differentiating COVID-19? What are the clinical requirements driving the development of these tools? How would such techniques be implemented clinically? Who is publishing in this field?

Methodology

Study selection

The inclusion criteria for the review are:
Studies that aim to automatically (allowing for manual contouring as a preprocessing step, under the assumption this could be automated) diagnose, stage or prognose COVID-19, or segment lesions associated with COVID-19; and
Studies that use medical imaging or signals, including CXR, CT, ultrasound, magnetic resonance imaging (MRI) or electrocardiograph (ECG), as input to their model.
rscopus version 0.6.6 [18] was used to retrieve articles according to the search criteria outlined in Panel 1. The search was performed on the 19th November, 2021. Papers meeting the inclusion criteria that were identified during the investigation but not identified in the search were also included in the study.
Panel 1: Scopus search criteria
TITLE-ABS-KEY ( ( covid OR coronavirus ) AND ( ( chest W/5 xray ) OR “computed tomography” OR ultrasound OR “magnetic resonance” OR mr OR mri OR ecg OR electrocardiograph* ) AND ( diagnos* OR staging OR identif* OR response OR prognos* OR segment* ) AND ( learn* OR convolutional OR network OR radiomic*) )
Exclusion criteria were also imposed to eliminate studies that exhibited or were likely to exhibit a high risk of bias:
Studies from journals with a source normalized impact per paper (SNIP), as measured in 2021, of less than 1 were excluded. SNIP is a metric introduced by Scopus that measures contextual impact, normalising between fields with different citation rates. This process was manually checked by two of the authors, and journals that would otherwise have been eliminated, but that were reputable within their fields and likely to publish relevant studies, were retained.
Studies that were more than 90 days old and had not attracted any citations were excluded. This criterion was included to automatically filter articles which the scientific community has deemed uninteresting, under the assumption that in such a fast moving field, 90 days should be adequate to have attracted at least one citation.
Studies with metadata indicating that they were Editorials, Reviews, Notes or Letters were excluded.
Studies where application to COVID-19 is secondary and not the primary focus of the paper were excluded.
Studies not meeting the minimum risk of bias assessment (see “Bias assessments” section) were excluded.
Remaining studies were assigned amongst reviewing authors, and each study was reviewed by one author, who assessed for minimum risk-of-bias and extracted data. Studies were not de-identified before analysis.
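The automated exclusion criteria above can be read as a simple filter over bibliographic records. The following is a minimal sketch only, not the authors' rscopus pipeline: the record fields (`doc_type`, `journal`, `snip`, `published`, `citations`), the `whitelist` of manually retained journals, and the choice to measure article age from the search date are all assumptions made for illustration.

```python
from datetime import date

SEARCH_DATE = date(2021, 11, 19)  # search date stated in the text
EXCLUDED_TYPES = {"Editorial", "Review", "Note", "Letter"}


def passes_screening(record, whitelist=frozenset(), today=SEARCH_DATE):
    """Return True if a record survives the automated exclusion criteria.

    `record` is a hypothetical dict; the field names are illustrative only.
    """
    if record["doc_type"] in EXCLUDED_TYPES:
        return False
    # A SNIP below 1 (or a missing SNIP) excludes the journal unless it was
    # manually retained by the reviewers (assumed to be a whitelist here).
    snip = record.get("snip")
    if (snip is None or snip < 1.0) and record["journal"] not in whitelist:
        return False
    # More than 90 days old with zero citations is excluded.
    age_days = (today - record["published"]).days
    if age_days > 90 and record["citations"] == 0:
        return False
    return True
```

The relevance and minimum risk-of-bias exclusions are manual judgements and are deliberately left out of this filter.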

Bias assessments

Due to reports of a high risk-of-bias in the field [9, 13, 15], we include a bias assessment. Improper study design, data collection, data partitioning and statistical methods can lead to misleading reported results [14]. This commonly manifests as a positive bias because authors (rightly) attempt to improve the performance of their proposed techniques. The CLAIM checklist was completed for all included papers [16]. All 42 checklist items were given either a pass or fail score, or a “not applicable” score which did not count towards the failure count in cases where the checklist item was not applicable to the paper. The number of failure scores was used as a measure for bias. Similar to Roberts et al. [15], we impose a subset of CLAIM, items 7, 9, 20, 21, 22, 25, 26 and 28, as a minimum risk of bias. Any papers that did not meet all subset checklist items were excluded. CLAIM checklist reports from Roberts et al. [15] were merged and used where available to avoid duplication.
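The scoring rule described above can be sketched in a few lines. This is an illustrative reading under stated assumptions: each study is represented as a hypothetical dict mapping CLAIM item numbers 1 to 42 to `"pass"`, `"fail"` or `"na"`, and a "not applicable" score on a required item is assumed not to trigger exclusion.

```python
REQUIRED_ITEMS = (7, 9, 20, 21, 22, 25, 26, 28)  # subset named in the text


def claim_failures(scores):
    """Count failed items; 'na' items do not count toward the total."""
    return sum(1 for result in scores.values() if result == "fail")


def meets_minimum(scores):
    """A study is retained only if no required item is scored 'fail'."""
    return all(scores[item] != "fail" for item in REQUIRED_ITEMS)
```

Under this reading, `claim_failures` gives the risk-of-bias surrogate for included studies, while `meets_minimum` decides inclusion.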

Extracted data

Methodological and performance results were collected per technique, where each study presents one or more techniques. When multiple techniques were introduced in a study, only the highest performing technique was surveyed, unless the techniques served different purposes (e.g., one study presenting a segmentation and a diagnostic technique) or different contexts (e.g., different available clinical data to augment image input) (Table 1).

Table 1

Data collected during survey

Field	Definition
Task	Diagnosis (differentiating COVID-19 from healthy or other diseases), prognosis (this included staging, differentiating within COVID-19 for the severity or expected disease trajectory) and segmentation of COVID-19 related lesions, including GGO.
Output classes	For diagnosis tasks, whether COVID-19 was differentiated for either or both of other pneumonia and/or healthy controls. For prognosis tasks, the number of classes or if the task is a continuous regression one, as well as the derivation of the class. Derivations were classified as either clinical assessment, where the severity is measured based on clinical features at the time of imaging, progression, where the severity is measured either by time spent in hospital or by required interventions, and survival, where the severity is measured by whether the infection proved lethal.
Imaging type	Input to model, including modality and whether additional clinical or demographic information was passed into the model.
Model information	Including the machine learning or deep learning model, optimiser, parameters and augmentation (if deep learning) and manual extracted features (if radiomics).
Number of centres	Number of separate institutions from which data was sourced.
Performance	Performance measures for proposed technique, as reported.
Reproducibility	Whether data and code were made available.
Bias	A CLAIM checklist was completed for each study and the number of failures was used as a measure of potential bias.

Analysis of studies

Accuracy and area under the curve (AUC) of the receiver operating characteristic (ROC), where reported, were used for performance comparison. Statistical significance was measured throughout this review using two-sided independent t-tests, with a significance threshold of p < 0.05. No adjustments were made for multiple comparisons.
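For readers unfamiliar with the test, a two-sided independent t-test can be sketched with the standard library alone. The text does not state whether variances were pooled, so this sketch assumes the classical pooled-variance (Student) form; the p-value is obtained by numerically integrating the t density with Simpson's rule rather than by a library call.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=2000):
    """P(|T| >= |t|), using Simpson's rule on the density over [0, |t|]."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3)


def independent_t_test(a, b):
    """Pooled-variance t-test; returns (t statistic, df, two-sided p)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, df, two_sided_p(t, df)
```

A result would be called significant here when the returned p-value is below 0.05. In practice a library routine such as SciPy's `scipy.stats.ttest_ind` would normally be preferred over a hand-rolled integration.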

Analysis of authors and publishers

Author, institution and publication metadata were extracted using rscopus 0.6.6 [18] and used to compute author h-indices. A co-author network was generated with tidygraph 1.2.0 [19] by linking authors that had published together, and the most central authors identified using the betweenness centrality.
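To illustrate the network analysis, the sketch below builds an undirected co-author graph from lists of author names, counts its connected components, and computes betweenness centrality with Brandes' algorithm for unweighted graphs. It is a standard-library stand-in for illustration, not the tidygraph code used in the study, and the input format (one list of author names per paper) is an assumption.

```python
from collections import deque
from itertools import combinations


def coauthor_graph(papers):
    """Link every pair of authors who appear on the same paper."""
    graph = {}
    for authors in papers:
        for a in authors:
            graph.setdefault(a, set())
        for a, b in combinations(set(authors), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def count_components(graph):
    """Number of disconnected sub-graphs (separate collaboration clusters)."""
    seen, count = set(), 0
    for start in graph:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            for w in graph[queue.popleft()]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
    return count


def betweenness(graph):
    """Brandes' algorithm: shortest-path betweenness for each node."""
    score = dict.fromkeys(graph, 0.0)
    for s in graph:
        stack, queue = [], deque([s])
        preds = {v: [] for v in graph}
        sigma = dict.fromkeys(graph, 0)   # number of shortest paths from s
        dist = dict.fromkeys(graph, -1)   # BFS distance from s
        sigma[s], dist[s] = 1, 0
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(graph, 0.0)
        while stack:  # accumulate dependencies in reverse BFS order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair is visited from both ends in an undirected graph.
    return {v: c / 2 for v, c in score.items()}
```

Authors with the highest betweenness are those who most often sit on the shortest path between otherwise distant collaborators.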

Results

Of 1002 studies identified, 282 were assessed against the required subset of the CLAIM checklist for exclusion, after which 81 studies were included in the study (Fig. 1). A list of identified and included studies is available in Supplementary 1, Table S1, and the full set of studies identified and collected data is available in Supplementary 2. CLAIM 26, which pertains to the evaluation of the best-performing model, eliminated the most studies (Fig. 2, left). Most papers failing this item did not evaluate against a separate test set after presenting multiple models. CLAIM 25, which requires an adequate description of hyperparameter selection, eliminated the next most studies. Only one in four papers met the inclusion criteria, and approximately one in four papers failed half or more of the required CLAIM subset (Fig. 2, right). From the 81 studies included, a total of 103 separate techniques were included.

Fig. 1

PRISMA flow diagram of search

Fig. 2

Studies excluded for bias. The percentage of total studies that failed each of the required subset of the CLAIM checklist for inclusion (left), and a histogram of the number of failures (right), where only studies with 0 failures met the inclusion criteria

Bias

Remaining CLAIM failures in the included articles are depicted in Fig. 3 (left). The count of failures for each article became the risk-of-bias surrogate; a histogram over all papers is shown in Fig. 3 (right). The mean number of failures was 8.3 ± 3.9 (mean ± standard deviation).

Fig. 3

CLAIM results of studies included: the number of included studies that failed each of the CLAIM items (left), and a histogram of the number of failures (right)

Methodologies

The majority, 58%, of techniques sought to solve a diagnosis task, attempting to classify COVID-19 disease from healthy patients and/or non-COVID-19 pneumonia (Fig. 4, left), versus 31% performing prognosis (where techniques performing both are counted in both). Of the 31% of techniques attempting to solve a prognosis task, the majority used an objective prognostic outcome measure (46% progression and 16% survival) rather than matching a clinical assessment.

Fig. 4

(Left) Machine learning tasks attempted to be solved by techniques. (Top Right) A breakdown of Diagnosis and Diagnosis & Prognosis approaches by diagnostic outcome variable classes. (Bottom Right) A breakdown of Prognosis and Diagnosis & Prognosis approaches by prognostic outcome variable. The inner ring represents the number of classes, or continuous for regression tasks, and the outer ring represents the derivation of the outcome variable. See Table 1 for definitions of derivations

Most papers used CT images, either in 3D or as 2D slices, as model input, followed by CXR and US (Fig. 5, left). Only a small minority of papers included clinical features as input. Although MRI and ECG were explicitly included within the scope of the review, no techniques using these modalities were included. No MRI papers were identified, and none of the 3 identified ECG papers that progressed beyond screening met the inclusion criteria.

Fig. 5

(Left) The distribution of modalities used for input to techniques. (Middle) The reported AUC and (Right) accuracy of techniques by modality. Only techniques reporting AUC or accuracy are included, respectively. Results of a two-sided independent t-test are given as ‘*’ for significance or ‘ns’ for no significance

Fig. 6

(Left) The distribution of techniques using traditional machine learning and radiomics approaches versus deep learning and (Right) the distribution of the most popular deep learning networks

Performance

Performance is only reported here for studies where AUC or accuracy were described. The top-performing diagnostic and prognostic techniques are listed in Tables 2 and 3, respectively. Neither AUC (Fig. 7, left) nor accuracy (Fig. 7, right) correlated significantly with the number of CLAIM failures for either diagnosis or prognosis. There were no statistically significant differences between input modalities on performance (Fig. 5, middle and right), although CXR appeared to provide a higher AUC than CT, and US appeared to provide a lower accuracy than CT and CXR. Deep learning approaches had higher reported AUC (p = 0.04) and accuracy (p = 0.01), but no significant difference in bias was identified (Fig. 8).

Table 2

Union of top 5 performing diagnostic techniques by AUC and accuracy. Techniques performing binary classification between healthy and COVID-19 were excluded

Refs.	CLAIM failures	Classes	Modality	Datasets	Method	AUC	Accuracy
Zheng et al. [20]	7/42	3 class	CT	Bespoke, COVID-CT [21]	DenseNet-121	0.991	98.6%
Han et al. [22]	10/42	3 class	CT	Bespoke	Attention DL	0.99	97.9%
Das et al. [23]	14/42	3 class	CXR	Cohen [24], Montgomery County X-ray [25], Kermany [26]	InceptionNet	0.99	99.0%
Wang et al. [27]	3/42	3 class	CT	3DLSC-COVID [27]	U-Net, ResNet	0.983
Liu et al. [28]	6/42	2 class (pneumonia)	CT, clinical features	Bespoke	LASSO Radiomics	0.98	93.0%
Krakansis et al. [29]	7/42	3 class	CXR	Kermany [26], Cohen [24]	CNN		98.3%
Jin et al. [30]	4/42	3 class	CT	NIH Chest X-ray [31]	AlexNet		96.86%

CNN convolutional neural network, DL deep learning, LASSO least absolute shrinkage and selection operator

Table 3

Union of top 5 performing prognostic techniques by AUC and accuracy

Refs.	CLAIM failures	Classes	Modality	Datasets	Method	AUC	Accuracy
Tang et al. [32]	10/42	2 (clinical assessment)	CT	Bespoke	RF	0.98	89%
Wu et al. [33] (CrrScore)	9/42	2 (progression)	CT, clinical features	Bespoke	LASSO	0.977
Wu et al. [33] (RadScore)	9/42	2 (progression)	CT	Bespoke	LASSO	0.976
Li et al. [34]	11/42	2 (clinical assessment)	CT	Bespoke	LR	0.97
Elsharkawy et al. [35]	12/42	2 (clinical assessment)	CXR	Cohen [24], CORD-19 [36]	NN		98%
Wang et al. [27]	7/42	3 (survival)	CT	Bespoke	LR		88.5%
Meng et al. [37]	5/42	2 (survival)	CT	Bespoke	De-COVID19-Net		87.5%
Zhu et al. [38]	10/42	2 (progression)	CT	Bespoke	LR		85.69%

RF random forest, LASSO least absolute shrinkage and selection operator, LR logistic regression, NN neural network

Fig. 7

Performance of techniques, as measured by AUC (left) and accuracy (right), plotted against CLAIM failures. Hue represents tasks, as indicated in the legend. Dashed lines indicate the mean regression for each of the tasks, and shading indicates the 95% confidence interval. All regression lines were compared with a two-sided independent t-test against a null hypothesis that gradient = 0, none of which reached significance

Fig. 8

Comparison of (Left) AUC, (middle) accuracy and (Right) number of CLAIM fails between techniques leveraging deep learning and those leveraging classical machine learning and radiomics approaches. Results of a two-sided independent t-test are represented as ‘*’ for significance or ‘ns’ for no significance

Authors

The country of residence of authors tended to correlate with countries that were affected the most by the pandemic in early 2020 (Fig. 9).

Fig. 9

Number of articles published by author country. Articles with authors from multiple countries, indicated by hue, are counted in duplicate for each country

A network analysis of connectivity between authors yielded 48 separate graphs of the 81 publications, depicted in Supplementary 1 Figure S1, and a subset in Fig. 10. The most productive research groups are summarised in Table 4.

Fig. 10

Authorship graph, where nodes represent authors and edges represent co-authorship. Depicted are the 5 largest clusters

Table 4

20 most productive groups

Research group	City	Country	Studies
Department of Radiology, Tongji Hospital, Huazhong University of Science and Technology	Wuhan	China	16
Institute of Automation Chinese Academy of Sciences	Beijing	China	10
Department of Radiology, The First Affiliated Hospital of Jinan University	Guangzhou	China	7
Beihang University	Beijing	China	7
NVIDIA	Santa Clara	United States	7
Department of Radiology, Université de Paris	Paris	France	6
School of Electrical and Computer Engineering, University of Oklahoma	Norman	United States	5
College of Intelligence Science and Technology, National University of Defense Technology	Changsha	China	5
Huazhong University of Science and Technology	Wuhan	China	5
Tencent	Shenzhen	China	5
Department of Radiology, Renmin Hospital of Wuhan University	Wuhan	China	4
Department of Radiology, Xinhua Hospital	Shanghai	China	4
Department of Radiology, Xiangya Hospital	Changsha	China	4
Universidad de Granada	Granada	Spain	4
Hubei Province Key Laboratory of Molecular Imaging	Wuhan	China	6
Department of Bioengineering, University of Louisville	Louisville	United States	3
School of Public Health, Capital Medical University	Beijing	China	3
Department of Radiology, Shanghai Jiao Tong University	Shanghai	China	3
Department of Radiology, Zhongnan Hospital of Wuhan University	Wuhan	China	3
The University of Adelaide	Adelaide	Australia	3

Discussion

In this work, we present a systematic review of automated techniques for diagnosis, prognosis and segmentation of COVID-19 disease. Because the field has proven both popular and controversial, we used liberal exclusion criteria to reduce the number of lower-quality papers for manual review. In formulating the criteria, we assumed that impactful papers are likely to be published in highly cited publications and are likely to attract citations themselves. Studies published in journals with a SNIP below 1 were eliminated, which risks eliminating journals that are not ranked by Scopus. In order to reduce this risk, the list of eliminated journals was reviewed by all authors, and a consensus on non-indexed journals to include was reached. Further, studies that had been published for more than 90 days yet had not attracted any citations were eliminated, which risks eliminating unnoticed studies. Even after screening, 71% of papers were excluded during bias assessment (Figs. 1, 2), indicating that the majority of work in the field is at high risk of bias, including work published in reputable peer-reviewed publications.

Sources of bias

Datasets

Many studies use data from sources with minimal provenance and metadata, and often use data that was not intended for training diagnostic or prognostic tools. A number of datasets aggregate data from different sources, some of which may be aggregates themselves [9]; and many studies aggregate a number of datasets, either to increase their training size or to provide an independent test set. However, this creates a complex set of participants and leads to a high risk that the same images are present in the training and evaluation sets. Other datasets present a series of CT slices without metadata indicating which images belong to which participants, leading to a high risk that adjacent axial slices from a participant may lie in both the training and evaluation sets. Any studies exhibiting these risks failed CLAIM 21. Although it did not lead to exclusion in this review, some datasets also aggregate different classes from different sources. It has been established that this presents a high risk of bias, as networks are able to distinguish between classes using non-disease-related domain effects.

Data handling

Studies that did not split training and evaluation sets at the patient level also failed CLAIM 21. This mostly occurred in papers dealing with CT as 2D axial slices, some of which randomly allocated all 2D images between the training and evaluation sets. CLAIM 26 was responsible for the most failures (45%; Fig. 2, left), which often indicated a failure to allocate an evaluation set for use after model selection.
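A patient-level split avoids the leakage described above by assigning whole patients, rather than individual slices, to each set. The following is a minimal sketch under assumptions: the input is a list of hypothetical `(patient_id, slice_id)` pairs, and the 20% evaluation fraction and fixed seed are arbitrary illustrative choices.

```python
import random
from collections import defaultdict


def split_by_patient(slices, test_fraction=0.2, seed=0):
    """Split (patient_id, slice_id) pairs so no patient spans both sets."""
    by_patient = defaultdict(list)
    for patient_id, slice_id in slices:
        by_patient[patient_id].append(slice_id)
    # Shuffle patients, not slices, so adjacent slices stay together.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test = [(p, s) for p in patients[:n_test] for s in by_patient[p]]
    train = [(p, s) for p in patients[n_test:] for s in by_patient[p]]
    return train, test
```

When several candidate models are compared, the same approach can be applied twice to carve out a further held-out set that is used only once, after model selection.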

Description of methods

The remaining CLAIM checklist items, 7, 9, 20, 22, 25 and 28, each relate to adequately documenting methodology. Adequate documentation matters not only for reproducibility, which technical publications need in order to advance the field, but also because undocumented steps could conceal bias. The field of machine learning requires attention to detail in implementation to prevent overfitting, data dredging or otherwise accidentally positively biasing results.

Study demographics

The majority (58%) of techniques sought to solve a diagnosis task. Although there has been limited need for diagnosis of COVID-19 using imaging, there is potential for faster analysis compared with RT-PCR, especially when considering that consecutive negative RT-PCR testing is required for exclusion when the pre-test probability is high [39]. However, within this set, 38% only demonstrated an ability to differentiate COVID-19 from healthy individuals. Any clinically realistic scenario for deployment of such an algorithm would need to demonstrate an ability to aid in a differential diagnosis between similar diseases. Regardless, most professional bodies recommend the use of radiographic imaging in COVID-19 only for triage purposes [5, 6, 40] and therefore it is most likely more impactful for investigators to explore prognostic techniques. CT scanning was the most popular modality, likely due to the image quality of tomographic imaging and the availability of public datasets. The additional context a 3D image can give may also have motivated the use of the modality, although many techniques only considered 2D axial sections. Given the clinical context and the fact that techniques are likely to be most useful during an outbreak, the use of CXR may be more convenient and practical. For example, clinical practice dictates that imaging rooms require an hour between patients for cleaning, a requirement that can be obviated with portable CXR that can move to the patient’s room [41]. Therefore, we suggest that future investigations may be more impactful in delivering a technique using CXR data, especially as no significant performance differences were seen between CXR and CT (Fig. 5). It has been proposed that ultrasound analysis for COVID-19 could be valuable in rural and remote regions, and as a tool to facilitate social distancing in urban regions [42]. The relatively niche requirement means that systems for automated analysis of ultrasound are likely to be less impactful.
This may be offset by the low cost of ultrasound, and the potential to deploy systems to developing countries. Other modalities, including MRI and even ECG, were explicitly included in the scope of this review; however, no papers met the inclusion criteria for either. MRI generally yields poor contrast within the lung and provides few benefits over CT in this application. Some studies investigating ECG remained after the screening process, but were excluded either because they were not automated or because they did not meet the bias assessment requirements.

Study performance

Studies tended to report excellent diagnostic and prognostic performance based on imaging features. The top diagnostic techniques all reported AUC ≥ 0.98 and accuracy ≥ 96.8% (Table 2), while the prognostic techniques reported AUC ≥ 0.97 and accuracy ≥ 85.7% (Table 3). Further, these results were relatively stable across the number of CLAIM failures (Fig. 7), providing some confidence that the top results are not dominated by biased studies. Notably, though, the top performing prognostic techniques in Table 3 are binary classification tasks, which naturally yield higher metrics than those with more classes.

Observations

Many studies used image storage formats that don’t meet medical imaging standards. Images may be stored at lower bit depth resolution, be stored using lossy compression, or be stored without requisite metadata. If these traits are consistent between classes, these issues are less likely to lead to a positive bias in reported results but may lead to lower performance. Similarly, many CT studies reported using per-image intensity normalisation for pre-processing. For quantitative modalities such as CT, this leads to a loss of information that the network is likely having to account for internally.
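The information loss from per-image normalisation can be shown with a toy example. In the sketch below, the intensity values and the fixed window of -1000 to 400 Hounsfield units are hypothetical choices made purely for illustration; two regions with very different absolute densities become indistinguishable after per-image min-max scaling, but remain distinct under a fixed window.

```python
def per_image_minmax(image):
    """Scale each image to [0, 1] using its own minimum and maximum."""
    lo, hi = min(image), max(image)
    if hi == lo:  # constant image: avoid division by zero
        return [0.0 for _ in image]
    return [(v - lo) / (hi - lo) for v in image]


def fixed_window(image, lo=-1000.0, hi=400.0):
    """Scale with one fixed intensity window (assumed values), clipped to [0, 1]."""
    return [min(max((v - lo) / (hi - lo), 0.0), 1.0) for v in image]


# Hypothetical intensities for two regions of differing density.
aerated = [-900.0, -800.0, -700.0]
opacified = [-500.0, -400.0, -300.0]

# Per-image scaling discards the absolute scale: both become [0.0, 0.5, 1.0].
same_after_minmax = per_image_minmax(aerated) == per_image_minmax(opacified)

# A fixed window keeps the two regions distinguishable.
differ_after_window = fixed_window(aerated) != fixed_window(opacified)
```

The design point is simply that a shared, fixed mapping preserves the quantitative meaning of CT intensities across images, whereas a per-image mapping does not.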

Input data

Studies that presented techniques under identical conditions with and without clinical data reported superior performance with the clinical data [28, 43]. This may reflect reporting bias, but it is likely that some combination of demographic, symptomatic and imaging data provides additional discrimination of disease progression. Much of this information is relatively easily acquired, so there is little cost to including it.

Ethics

The majority of studies presenting novel datasets reported detail on the ethical approval. However, far fewer provided information on the consent given by participants, as required by CLAIM item 7. To be consistent with the analysis of Roberts et al., we have ignored this requirement; however, we note that this is an area to be improved in the medical imaging literature. Further, no studies that sourced data from public datasets reported any ethical approval. The National Statement on Ethical Conduct in Human Research [44] defines human data to include data sourced from public datasets.

Clinical translation

Few of the reviewed papers realistically considered clinical deployment. As Roberts et al. [15] highlight, no developed systems are ready to be deployed clinically, with one reason being the need to work with clinicians to ensure the developed algorithms are clinically relevant and implementable. This is highlighted by a review by Born et al. [45], who found that although 84% of clinical studies report the use of CT (with CXR comprising only 10% of studies), a much larger proportion of the AI papers focused on X-ray. The same paper also emphasizes the need for additional stakeholder engagement, including patients, ethics committees, regulatory bodies, hospital administrators and clinicians. For clinical deployment, medical imaging software generally requires validation through randomised controlled trials, regulatory certification (generally the software would be developed within an ISO 13485 and IEC 62304 environment), and integration with existing clinical workflow (aligning with agreed standards for interoperability and upgradability, particularly the DICOM standard and required vendor tags).

Author demographics

We provide data on the authors (Fig. 9) and institutions (Table 4) publishing in the field as a landscape map for new authors. Most of the authors are located in China, the United States of America and Italy, and most of the most productive groups are in China. Collaboration between groups predominantly occurred within the same country, except for a cluster of collaboration between Italy and the United States (Fig. 10).

Review limitations

Studies were filtered automatically using the SNIP of the publishing journal and, for studies older than 90 days at the time of search, the number of citations. This was required in order to regulate the scope of the manually reviewed articles. This risked omitting rigorous papers that have not attracted scientific interest or are published in less circulated or newer journals. We believe this risk is low enough that the results presented are generalisable to the field.

In this work, we collect studies primarily from the Scopus database. While Scopus and Web of Science are historically the most widely used databases in bibliometric analysis, their coverage is not complete. Nevertheless, Scopus shares 99.11% of its indexed journals with Web of Science and 96.61% with Dimensions. For this reason, we believe the methods in this review were valid and fit-for-purpose.

In this work, we use CLAIM as a surrogate measure for bias. CLAIM provides prescriptive and objective criteria, well-suited to having a range of reviewers quickly and consistently assess a large number of papers. However, CLAIM is designed as a checklist of best practices, as opposed to an assessment of bias. The number of CLAIM failures should be interpreted by the reader as only an approximate measure of bias.

Conclusion

In this systematic review, we collected 1002 studies and included 81 in the analysis after screening, relevance and bias assessment. A 71% exclusion ratio for bias despite extensive screening was indicative of a high level of risk-of-bias in the field. Commonly, publications sought to solve tasks with lower potential clinical impact, focusing on diagnosis rather than prognosis and differentiation of COVID-19 from controls rather than from other likely candidate diseases in a differential diagnosis. Similarly, clinical considerations and deployment were seldom discussed. Medical imaging standards were also regularly not met, with data sourced online without provenance and in compressed formats. Nevertheless, studies reported superb prognostic and diagnostic performance, and these results were robust amongst studies regardless of risk-of-bias or modality. Deep learning studies tended to report improved performance but did not report higher risk-of-bias compared with traditional machine learning approaches. We therefore conclude that the field has proven itself as a concept and that future work should focus on developing clinically useful and robust tools.

Electronic supplementary material: Supplementary file 1 (DOCX, 541 kB); Supplementary file 2 (XLSX, 890 kB).
