
Multi-modal trained artificial intelligence solution to triage chest X-ray for COVID-19 using pristine ground-truth, versus radiologists.

Tao Tan¹, Bipul Das², Ravi Soni³, Mate Fejes⁴, Hongxu Yang¹, Sohan Ranjan², Daniel Attila Szabo⁴, Vikram Melapudi², K S Shriram², Utkarsh Agrawal², Laszlo Rusko⁴, Zita Herczeg⁴, Barbara Darazs⁴, Pal Tegzes⁴, Lehel Ferenczi⁴, Rakesh Mullick², Gopal Avinash³.

Abstract

The front-line imaging modalities computed tomography (CT) and X-ray play important roles in triaging COVID patients. Thoracic CT is generally accepted to have higher sensitivity than chest X-ray for COVID diagnosis. However, considering the limited access to resources (both hardware and trained personnel) and issues related to decontamination, CT may not be ideal for triaging suspected subjects. An artificial intelligence (AI) assisted X-ray application for triaging and monitoring, which identifies COVID patients in a timely manner (a task that otherwise requires experienced radiologists) and can additionally delineate and quantify the disease region, is seen as a promising solution for widespread clinical use. Our proposed solution differs from existing solutions presented by industry and academic communities. We demonstrate a functional AI model that triages by classifying and segmenting a single chest X-ray image, while the model is trained using both X-ray and CT data. We report how such a multi-modal training process improves the solution compared with single-modality (X-ray only) training. The multi-modal solution increases the AUC (area under the receiver operating characteristic curve) from 0.89 to 0.93 for binary classification between COVID-19 and non-COVID-19 cases. It also improves the Dice coefficient (0.59 to 0.62) for localizing the COVID-19 pathology. To compare the performance of experienced readers with the AI model, a reader study was also conducted. The AI model showed good consistency with the radiologists: the Dice score between the two radiologists on the COVID group was 0.53, while the AI had Dice values of 0.52 and 0.55 when compared with the segmentations of the two radiologists separately. From a classification perspective, the AUCs of the two readers were 0.87 and 0.81, while the AUC of the AI was 0.93 on the reader study dataset. We also conducted a generalization study comparing our method with state-of-the-art methods on independent datasets.
The results show better performance from the proposed method. Leveraging multi-modal information during development benefits single-modal inference.


Keywords: Artificial intelligence; COVID-19; Multi-modal; Reader study

Year: 2022 PMID: 35185296 PMCID: PMC8847079 DOI: 10.1016/j.neucom.2022.02.040

Source DB: PubMed Journal: Neurocomputing ISSN: 0925-2312 Impact factor: 5.719

Introduction

Coronavirus disease 2019 (COVID-19) is extremely contagious and has become a pandemic [1], [2]. It has spread inter-continentally in third and fourth waves [3], and various countries are suspected to be currently entering a second wave [4], [5], [6]; by May 2021 it had infected more than 30 million people and caused nearly 3.5 M deaths [7]. The mortality rate of this disease differs from country to country, ranging from 2.5% to 7%, compared with 1% for influenza [8], [9], [10]. Across age groups, elderly people and patients with comorbidities are the most vulnerable and the most likely to progress to a life-threatening condition [11]. To prevent the spread of the disease, governments have implemented strict containment measures [12], [13] aimed at minimizing transmission. Because of the high infection rate of COVID-19, rapid and accurate diagnostic methods are urgently required to identify, isolate and treat patients, especially considering that effective vaccines are still under development. The diagnosis of COVID-19 relies on the reverse-transcriptase-polymerase chain reaction (RTPCR) test [14], [15], [16]. However, this test has several drawbacks. RTPCR tests often require 5 to 6 h to yield results. The sensitivity of RTPCR depends on the stage of the infection [16] and can be as low as 71% [17], so care must be taken when interpreting RTPCR results. More importantly, the cost of RTPCR prevents large populations from being tested in developing and highly populated economies. In the imaging domain, chest CT may be considered a primary tool for COVID-19 detection [18], [15], [19], and the sensitivity of chest CT can be greater than that of RTPCR (98% vs 71%) [17], but the cost, time and risks of imaging, including radiation dose and the need for system decontamination, can be prohibitive in most markets.
In contrast, portable X-ray (XR) units are accessible, cost- and time-effective and deliver a lower radiation dose, and are thus the de facto imaging modality used in diagnosis and disease management. Since X-ray imaging has limited capability to provide detailed 3D structure of the anatomy or pathology of the chest cavity, it is not regarded as an optimal tool for quantitative analysis [20]. Also, due to the imaging apparatus and the projective nature of X-ray imaging, it is challenging for radiologists to identify relevant disease regions for accurate interpretation and quantification [21], [22]. Early detection of COVID-19 on XR images is particularly challenging, even for expert radiologists, as the disease causes only subtle changes in the projected image. To compensate for the shortage of experienced radiologists, minimize human effort in managing an exponentially growing pandemic and address the impending task of triaging suspected COVID-19 subjects, the academic and industry communities have proposed various systems for diagnosing COVID-19 patients using X-ray imaging [23], [24], [25], [26]. Some AI systems detect COVID-19 pneumonia with performance comparable to that of radiologists in identifying the presence or absence of COVID-19 infection [24]. The common outputs of these diagnostic or triage systems are classification probabilities for COVID-19 and non-COVID-19, accompanied by heatmaps localizing the suspected pathological regions or areas of attention. Artificial intelligence algorithms can be used to discern subtle changes and aid radiologist analysis in various applications [27], [28], [29], [30], [31], [32], [33], [34], [35]. Although the performance of a host of systems approaches the level of radiologists for chest X-ray classification, very few studies have verified the detection and segmentation of disease regions against human annotation on X-rays.
Segmentation is critical for (a) assessing disease severity and (b) follow-up for treatment monitoring or tracking the progression of the patient's condition. Although human annotations can be obtained for COVID-19 regions on X-rays, the certainty of these regions is low compared with annotations derived from CT. Development in the field suffers from the lack of COVID-19 X-ray images with corresponding region annotations. Our work builds on establishing a multi-modal protocol for the analysis and downstream classification of X-ray images; the core of this study is the transformation between multi-modal data. Previously, [36] showed that convolutional neural networks trained on propagated MR contours significantly outperform both networks trained on CT contours and experts contouring manually on CT for hippocampus segmentation, probably because of the poor visibility of the hippocampus on CT. For COVID-19, [37] leverages existing annotated CT images to generate frontal projection X-ray images for training COVID-19 chest X-ray models. Their model far outperforms baselines with limited supervised training and may assist in automated COVID-19 severity quantification on chest X-rays. Barbosa Jr. et al. [38] have also leveraged CT ground truth to generate synthetic X-rays; their approach computes the ratio of disease region to lung region on both synthetic X-ray (SXR) and regular X-ray images and forces these ratios to equal the ratio of disease volume to lung volume in CT. It might be challenging to incorporate such a definition into clinical practice. Our deep learning model is also trained using CT data, by generating synthetic X-rays from paired CT scans to complement the original X-ray images. At inference, only X-ray images are used.
Our work differs from existing studies in three ways. First, we established a synthetic X-ray generation scheme that produces a multitude of realistic synthetic X-rays, significantly augmenting the X-ray images in our training pool. Second, we use synthetic X-rays as a bridge to transfer the ground truth from CT to the original X-ray geometry. Third, our study highlights the serious domain shift issue that arises when images are collected from multiple data sources, where image data for all three classes from the same source is not always available for training and testing. Although a preliminary version of this work has been published as a conference paper [39], this paper differs substantially in that it presents a comprehensive study of the proposed approach and includes an observer study for validation.

Method

Solution Design Overview

The key idea is to learn the disease patterns jointly from multi-modal (CT and XR) data but to perform inference using only a single modality (XR). In this study, we do not focus on the design of deep-learning networks, but rather on improving data sufficiency and ground-truth quality to improve the visibility of abnormal tissue on a lower-dimensional modality. To make maximal use of the available X-ray data, CT scans, and paired X-ray and CT images, we designed the pipeline illustrated in Fig. 1. Disease regions in X-ray and CT are annotated independently by trained staff with varying levels of experience. From each CT volume, we generate multiple SXR images and the corresponding projected 2D masks of the diseased region (Section 2.2). For patients whose X-ray and CT (paired XR and CT data) were acquired within a short time window (48 h, since COVID-19 is a fast-evolving disease and pathology regions in the lungs may change dramatically between exams), we automatically project the CT masks and transfer them, using the SXR, to match the corresponding XR. The transferred annotations on XR are then manually reviewed and adjusted by trained staff to build our pristine mask ground truth (Section 2.3). For patients where the time window between the CT and XR exams exceeds 48 h, no ground-truth transfer is performed across the two modalities, but the data are still added to the training pool. With all available XR and SXR images and their corresponding disease masks, we train a deep convolutional neural network for diagnosing and segmenting COVID-19 disease regions on X-ray.
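The data-routing rule described above can be summarised as a small decision function. This is a minimal sketch, assuming a hypothetical `Study` record holding the X-ray acquisition time and an optional CT acquisition time; the field names and the inclusive treatment of a gap of exactly 48 h are assumptions rather than details from the text. CT-only patients, which contribute synthetic X-rays only, are not modelled here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# 48 h pairing window from the text; treating exactly 48 h as "paired" is an assumption.
PAIRING_WINDOW = timedelta(hours=48)


@dataclass
class Study:
    """Hypothetical per-patient record (illustrative field names)."""
    patient_id: str
    xr_time: datetime
    ct_time: Optional[datetime] = None  # None when the patient has no CT exam


def route(study: Study) -> str:
    """Decide how one patient's X-ray enters the training pool."""
    if study.ct_time is None:
        return "xr_only"  # trained with direct X-ray annotations (XMA) only
    gap = abs(study.xr_time - study.ct_time)
    if gap <= PAIRING_WINDOW:
        return "transfer_ct_masks"  # CT mask -> SXR -> XR, then manual review (PMA)
    return "unpaired"  # still used for training, but no mask transfer


t0 = datetime(2020, 4, 1, 8, 0)
print(route(Study("p1", t0)))
print(route(Study("p2", t0, t0 + timedelta(hours=30))))
print(route(Study("p3", t0, t0 + timedelta(days=5))))
```

Using the absolute time gap means the rule does not care whether the CT or the X-ray was acquired first, which matches the symmetric "within a small time window" wording.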

Fig. 1

The training scheme and inferencing design.

Synthetic X-ray Generation

Original X-ray images are generated by shining an X-ray beam with initial intensity $I_0$ on the subject and measuring, with a 2D X-ray detector array, the intensity $I$ of the beam at different positions after it has passed through an attenuating medium. The attenuation of X-ray beams in matter follows the Beer-Lambert law, which states that the decrease in the beam's intensity is proportional to the intensity itself ($I$) and to the linear attenuation coefficient ($\mu$) of the material being traversed:

$$\frac{dI(x)}{dx} = -\mu(x)\, I(x) \tag{1}$$

From this we can derive the beam intensity as a function of distance, which is an exponential decay:

$$I(x) = I_0 \exp\!\left(-\int_0^{x} \mu(s)\, ds\right) \tag{2}$$

$$-\ln\frac{I(x)}{I_0} = \int_0^{x} \mu(s)\, ds \tag{3}$$

In medical X-ray images, the values saved to file are proportional to the expression in Eqn. 3, which is an integral of the attenuation values along the beam. If the subject is made up of smaller homogeneous cells, this can be formulated as a sum,

$$\int_0^{x} \mu(s)\, ds = \sum_{i} \mu_i\, \Delta x_i,$$

where $\mu_i$ is the linear attenuation coefficient of the $i$-th cell and $\Delta x_i$ is the path length of the beam within it.

3D CT images are reconstructed from many 2D X-ray images of the subject taken at different angles and positions (helical or other movement of the source-detector pair) using a tomographic reconstruction algorithm such as iterative reconstruction or the inverse Radon transform. The voxel values of the reconstructed CT image are the linear attenuation coefficients specific to the energy spectrum of the X-ray beam used. These values are converted to Hounsfield units (HU) as in Eqn. 4,

$$\mathrm{HU} = 1000 \times \frac{\mu - \mu_{\mathrm{water}}}{\mu_{\mathrm{water}} - \mu_{\mathrm{air}}} \tag{4}$$

where $\mu_{\mathrm{water}}$ and $\mu_{\mathrm{air}}$ are the linear attenuation coefficients of water and air, respectively, for the given X-ray source. This unit of measure maps images acquired with different X-ray beam energies to similar intensity values for the same tissues, scaling water to 0 HU and air to −1000 HU.

An inverse process can create X-ray images from CT images by projecting the reconstructed 3D volume back onto a 2D plane along virtual rays originating from a virtual source. Since the X-ray image (Eqn. 3) is an integral/summation of attenuation values along the projection rays, established methods create projection images by converting the CT voxel values from HU to linear attenuation coefficients and summing them along the virtual rays, weighted by the path length travelled by the ray within each voxel. To achieve realistic synthetic X-rays, we used a point source for the projection, which models the X-ray source of an actual X-ray imaging apparatus. For the projection itself we used the ASTRA Toolbox [40], which allows parameterization of the beam's angle (vertical and horizontal) as well as the distances between the virtual point source, the origin of the subject and the virtual detector (see Fig. 2). Because X-ray images have the same pixel spacing in both dimensions, while CT images usually have a larger spacing along the z direction, we resampled the 3D volumes to an isotropic spacing of 0.4 mm before computing the projection. In clinical X-ray imaging, the radiation source is typically behind the patient and the detector in front of the patient, which is known as the posterior-anterior (PA) view, and the resulting images appear horizontally flipped. To conform with this protocol, we also flipped the PA projections. For each CT volume we perform projections at different angles: the angle between the center of the CT image and the virtual X-ray point source ranges from −10 to 10 degrees around the longitudinal axis and from −5 to 5 degrees around the mediolateral axis of the patient. Based on what is traditionally used in clinical practice, we used two distance settings between the center of the CT volume and the virtual X-ray source, 1 m and 1.8 m, while the distance between the CT volume center and the virtual detector was set to 0.25 m.
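The HU conversion and projection steps described above can be illustrated with a simplified sketch. It inverts the HU definition to recover linear attenuation coefficients, sums them along rays weighted by the voxel size (the line integral of Eqn. 3), and applies the horizontal PA flip. This is not the ASTRA point-source (cone-beam) projection used in the paper: it assumes parallel rays, a volume already resampled to isotropic spacing with (z, y, x) axes where y is anterior-posterior, and illustrative attenuation values for water and air. The angle step sizes in `projection_settings` are also assumptions; only the ranges and source distances come from the text.

```python
import itertools

import numpy as np

# Assumed linear attenuation coefficients (1/mm) at the beam energy; illustrative only.
MU_WATER = 0.02
MU_AIR = 0.0


def hu_to_mu(hu):
    """Invert HU = 1000 * (mu - mu_water) / (mu_water - mu_air)."""
    mu = MU_WATER + np.asarray(hu, dtype=float) * (MU_WATER - MU_AIR) / 1000.0
    return np.clip(mu, 0.0, None)  # padding below -1000 HU would otherwise give mu < 0


def parallel_projection(hu_volume, voxel_mm=0.4, pa_view=True):
    """Frontal line-integral image of a (z, y, x) HU volume using parallel rays along y."""
    line_integral = hu_to_mu(hu_volume).sum(axis=1) * voxel_mm  # sum of mu_i * dx_i
    # Horizontal flip (reverse the x axis) to follow the PA convention described above.
    return line_integral[:, ::-1] if pa_view else line_integral


def projection_settings(step_long=5.0, step_med=2.5):
    """Enumerate (longitudinal angle, mediolateral angle, source distance in m) triples."""
    longitudinal = np.arange(-10.0, 10.0 + 1e-9, step_long)
    mediolateral = np.arange(-5.0, 5.0 + 1e-9, step_med)
    return list(itertools.product(longitudinal.tolist(), mediolateral.tolist(), [1.0, 1.8]))


toy = np.zeros((4, 10, 6))  # water-filled (0 HU) toy volume
img = parallel_projection(toy)
print(img.shape, len(projection_settings()))
```

The parallel-ray simplification ignores the magnification and angular divergence that a point source produces, which is exactly why the paper models a point source with explicit source and detector distances; the sketch is only meant to make the HU-to-line-integral arithmetic concrete.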

Fig. 2

The illustration of synthetic X-ray and its mask generation.

It is important to point out that for small field-of-view CT images, where the patient does not fit into the reconstruction circle in the horizontal plane, coronal or sagittal projections will not yield lifelike X-ray images, because body parts are missing from the virtual rays' paths. We also projected the CT segmentation ground-truth masks to 2D, applying the same transformations to these binary masks as to the corresponding CT volumes. In our workflow we use both a binary format of the 2D projected masks, created by setting all non-zero values to 1, and a depth-mask format that retains the depth information of the original 3D mask; the latter is scaled so that its values have the physical meaning of depth. One important novelty of our study is that our segmentation ground truth is based on CT ground truth, which means that the disease regions originate in 3D. When these 3D disease regions are projected onto an X-ray, they can fall outside the traditionally defined 2D X-ray lung field, because the real lung is much larger than the dark lung region visible on the X-ray (see Fig. 3).
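The two projected mask formats can be sketched as follows. This minimal example assumes the same simplified parallel-ray geometry and (z, y, x) axis order as a frontal projection along y, and it interprets "depth" as the lesion thickness (in millimetres) traversed by each ray; that interpretation, the function name and the 0.4 mm default are assumptions for illustration.

```python
import numpy as np


def project_mask(mask_3d, voxel_mm=0.4, pa_view=True):
    """Project a binary (z, y, x) disease mask along y into binary and depth masks.

    binary: 1 wherever the ray sum is non-zero.
    depth:  number of lesion voxels on the ray times the voxel size, i.e. lesion
            thickness in millimetres (assumed meaning of the depth-mask values).
    """
    depth = mask_3d.astype(np.float64).sum(axis=1) * voxel_mm
    if pa_view:
        depth = depth[:, ::-1]  # same horizontal flip as applied to the image projection
    binary = (depth > 0).astype(np.uint8)
    return binary, depth


mask = np.zeros((2, 5, 3), dtype=np.uint8)
mask[0, 1:4, 0] = 1  # a lesion three voxels thick along the ray direction
binary, depth = project_mask(mask)
print(binary)
print(depth)
```

Keeping the depth channel preserves information that the binary mask discards: a thin peripheral lesion and a thick central one can cover the same 2D footprint but differ strongly in traversed thickness.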

Fig. 3

A synthetic X-ray with its corresponding projected disease mask as an overlay.

Pristine Annotation Generation

For paired XR and CT images of the same patient, our objective is to transfer the pixel-wise ground truth from CT to SXR and from SXR to XR; SXR thus serves as a bridge between CT and XR. For each CT volume, we generate a number of SXRs and corresponding disease masks by varying the imaging parameters as described in Section 2.2. Each SXR candidate is registered to the XR so that the disease regions of SXR and XR are optimally aligned; the same transformation is then applied to the SXR disease mask to obtain a registered disease mask for the paired XR. Because multiple SXRs are generated from the same CT, each yielding a different registered mask, we simply select the one with the highest mutual information between the registered SXR and the XR. An alternative approach for selecting the optimal SXR compares lung mask regions: the pair with the largest lung mask overlap is selected. One issue in registering SXR to XR is that the fields of view of the body differ between the two. Furthermore, SXR images show the hands and arms raised, while XR images show the hands and arms hanging downward in the normal body position. To alleviate this problem during registration, instead of using the original images, we generate a lung region of interest (ROI) image by applying lung segmentation to both the synthetic X-ray and the X-ray. The lung segmentation solution is an in-house method using a 2D U-Net [41], previously trained on a different dataset outside the scope of this study. We use affine registration to maximize the mutual information between the lung ROIs from SXR and XR. Once the registration error converges or the registration step limit is reached, the obtained transformation is applied to the annotations of the SXR to generate the transferred annotations. Fig. 4 shows an example of how we transfer the ground truth from CT to synthetic X-ray and from synthetic X-ray to real X-ray.
We observe moderate differences between annotations made directly on the original X-ray and annotations derived from the transformed and transferred CT data, implying that COVID-19 related pneumonia may not be fully or clearly visible on X-ray.
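The candidate-selection step described above, choosing among several registered SXRs the one with the highest mutual information to the XR, can be sketched with a joint-histogram MI estimate. This is a minimal sketch, assuming the candidates are already affinely registered lung ROI images of the same shape as the XR ROI (the registration itself is not shown); the histogram estimator, the 32-bin default and the function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def mutual_information(a, b, bins=32):
    """Histogram-based estimate of the mutual information (in nats) of two images."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)  # marginal of a, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)  # marginal of b, shape (1, bins)
    nz = pxy > 0  # skip empty joint bins to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px * py)[nz])))


def select_best_candidate(registered_sxr_rois, xr_roi, bins=32):
    """Index of the registered SXR lung ROI with the highest MI to the XR lung ROI."""
    scores = [mutual_information(s, xr_roi, bins) for s in registered_sxr_rois]
    return int(np.argmax(scores)), scores


rng = np.random.default_rng(0)
xr = rng.random((64, 64))
# Toy candidates: two unrelated images and one close match to the X-ray ROI.
candidates = [rng.random((64, 64)), xr + 0.01 * rng.random((64, 64)), rng.random((64, 64))]
best, scores = select_best_candidate(candidates, xr)
print(best, [round(s, 3) for s in scores])
```

Mutual information is a natural criterion here because SXR and XR intensities are related but not identical; the alternative lung-overlap criterion mentioned in the text would replace `mutual_information` with an overlap score (for example Dice) between the registered lung masks.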

Fig. 4

An example of transferring a synthetic X-ray mask to an X-ray mask. Top: a representative synthetic X-ray generated from CT, the corresponding lung image and the disease mask; middle: the paired X-ray, the corresponding lung image and the direct disease annotation on the X-ray; bottom: the X-ray with annotations transferred from CT shown as a red contour, the registered lung image from the synthetic X-ray and the disease annotations transferred from the synthetic X-ray.

Although we obtain an approximate spatial match between SXR and XR, we cannot blindly use the transferred ground truth (GT) on XR for training: the disease could have evolved rapidly between the first and second scans, even within 48 h. In our study, the transferred ground truth is therefore used only as directional guidance for the human expert annotating the 2D X-rays. The general instruction to the annotators is to erase any transferred regions where no underlying lesion is visible on the X-ray image, and to keep minimally visible regions, even if they appear in the heart/diaphragm region.

COVID-19 Modeling and Evaluation Strategy

During this extraordinary COVID-19 pandemic, AI systems have been investigated to identify anomalies in the lungs and to assist in the detection, triage, quantification and stratification (e.g. mild, moderate and severe) of COVID-19 stages. To help radiologists with triage, our deep-learning model takes a frontal (anterior-posterior (AP) or posterior-anterior (PA) view) X-ray as input and outputs two types of information (see Fig. 5): (i) the location of the disease regions and (ii) a classification. For localization, the network generates a low-resolution (480 x 480) segmentation mask identifying disease pixels on XR and SXR images related to both COVID-19 and regular pneumonia cases. For classification, a fully connected neural network (FCN) outputs the probability of (i) COVID-19, (ii) regular pneumonia or (iii) a negative finding for each input AP/PA XR image. The classification branch consists of one max-pooling layer, a dense layer of 10 nodes and a dense layer of 3 nodes. The segmentation branch consists of 5 upscale blocks, each consisting of one residual block and one transpose convolution layer; each upscale block doubles the spatial dimensions of the feature channels. Both XR and SXR images are first resized to 1024 x 1024 pixels and normalized with the Z-score method before being fed into the model. In this study, we formulate COVID-19 classification as a three-class problem, hypothesizing that the distribution of abnormality in the lung may differentiate COVID-19 from regular pneumonia patients. Note, however, that the three-class output can be converted to a two-class output at inference time by taking the maximum of the COVID-19 and pneumonia class probabilities. For training, we use a combination of the cross-entropy loss from the classification branch and the Dice loss from the segmentation branch. The model is trained for 50 epochs with a batch size of 3 using the Adam optimizer [42].
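The input normalization and the combined training objective described above can be sketched in NumPy for a single image. This is a minimal sketch, assuming an unweighted sum of the cross-entropy and soft Dice terms (the text does not state a weighting, so `seg_weight=1.0` is an assumption), a class order of (COVID-19, pneumonia, negative), and a small epsilon for numerical stability; the network itself and the Adam optimisation loop are not shown.

```python
import numpy as np

EPS = 1e-7  # assumed small constant for numerical stability


def zscore(image):
    """Z-score normalisation of one (already resized) input image."""
    return (image - image.mean()) / (image.std() + EPS)


def cross_entropy(probs, label):
    """Cross-entropy for one image; probs are (COVID-19, pneumonia, negative) probabilities."""
    return float(-np.log(probs[label] + EPS))


def dice_loss(pred, target):
    """Soft Dice loss between a predicted probability map and a binary mask."""
    inter = np.sum(pred * target)
    return float(1.0 - (2.0 * inter + EPS) / (np.sum(pred) + np.sum(target) + EPS))


def combined_loss(probs, label, seg_pred, seg_target, seg_weight=1.0):
    """Classification loss plus (assumed equally weighted) segmentation loss."""
    return cross_entropy(probs, label) + seg_weight * dice_loss(seg_pred, seg_target)


probs = np.array([0.7, 0.2, 0.1])
seg_target = np.zeros((4, 4))
seg_target[1:3, 1:3] = 1
seg_pred = 0.8 * seg_target
print(combined_loss(probs, 0, seg_pred, seg_target))
```

The epsilon in the Dice term also keeps the loss defined for negative images whose target mask is empty.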

Fig. 5

The schematic overview of our proposed classification and segmentation deep-learning model.

Many groups have released their algorithms in open online forums or are marketing them as additional pneumonia indicators. However, the robustness of these algorithms and their clinical value remain somewhat unproven. A few studies have characterized systems for COVID-19 prediction with stand-alone performance approaching that of human experts. However, existing works either have no established pixel-wise ground truth or are evaluated using pixel-wise ground truth derived purely from X-ray annotations, which carry annotator uncertainty. Our deep learning model was trained with and without the multi-modal data from CT cases to investigate the benefits of multi-modal learning. We evaluated our approach in three separate aspects. First, AI model predictions were compared for the accuracy of the image-level classification labels (COVID-19 pneumonia, other pneumonia or negative). Second, the model segmentation of disease regions for the COVID-19 class was evaluated against direct human X-ray pixel-wise pathology annotations (masks). Third, the model segmentation of disease regions for COVID-19 pneumonia was compared against human pristine pixel-wise annotations (masks).

Data and Groundtruth

In this study, we formed a large experimental dataset consisting of real X-ray images and synthetic X-ray images derived from CT volumes. The data were sourced from in-house/internal collections as well as publicly available data sources, including Kaggle Pneumonia RSNA [43], the Kaggle Chest Dataset [44], the PadChest Dataset [45], the IEEE GitHub dataset [46] and the NIH dataset [47]. Image databases whose licenses restrict them to non-commercial use were excluded from our train/test cohorts. Representative paired and unpaired CT and XR datasets from US, African and European populations were included, sourced through our data partnerships. Outcomes were derived from information aggregated from radiological and laboratory reports. The database and the selected categories used for our experiments are summarized in Table 2, where the in-house test dataset is a subset of the complete test dataset. As the trained model will be used for inference on real X-ray images, we removed synthetic X-rays from our validation and testing cohorts. A dedicated in-house testing dataset was used in our study because complete ground truth for all three classes (COVID-19, regular pneumonia and negative) is available for these cases. In the general testing dataset, some of the data sources do not contain images of all three classes, which may cause domain-shift-based bias. Table 1 details the image composition of the different data sources.

Table 2

Dataset breakdown for our experiments.

Dataset	X-rays (# XMA, # PMA)	Synthetic X-rays (# SMA)
train COVID-19	974 (247, 77)	21487 (8322)
train pneumonia	10175 (6108, 17)	11312 (5380)
train negative	14859 (NA, NA)	12542 (NA)
val COVID-19	113 (37, 2)	NA
val pneumonia	531 (473, 8)	NA
val negative	3301 (NA, NA)	NA
test COVID-19	307 (68, 52)	NA
test pneumonia	1006 (345, 33)	NA
test negative	2271 (NA, NA)	NA
in-house test COVID-19	266 (68, 52)	NA
in-house test pneumonia	116 (45, 33)	NA
in-house test negative	37 (NA, NA)	NA

Table 1

Data source details.

Data source	COVID-19 (train/val/test)	Pneumonia (train/val/test)	Negative (train/val/test)
Kaggle Pneumonia RSNA	NA	5412/300/300	NA
Kaggle Pneumonia Chest	NA	3875/8/390	1341/8/234
PadChest Dataset	NA	694/200/200	4925/2000/2000
IEEE github dataset RSNA	122/29/41	NA	NA
NIH dataset	NA	NA	6018/757/0
In-house negative data source	NA	NA	2379/497/0
In-house three-class source	852/84/266	194/23/116	196/39/37

Two levels of ground truth are associated with each image: image-level ground truth and pixel-level (segmentation) ground truth. For image-level ground truth, each image is assigned a label of COVID-19, pneumonia or negative. All in-house X-rays and CTs and the COVID-19 images from the public data sources were confirmed by RTPCR tests. The labels of pneumonia and normal (negative) images from public data sources were given by radiologists. For the pixel-wise (segmentation) ground truth (masks), we have four types of annotations: (i) X-ray manual Mask Annotations (XMA), made by annotators purely based on X-rays without any information from CT; (ii) Synthetic Mask Annotations (SMA), generated by the projection algorithms from CT annotations for synthetic X-rays; (iii) Transferred CT Mask Annotations (TMA), automatically generated by registration algorithms that transfer annotations from CT to X-ray using SXR as a bridge; (iv) Pristine Mask Annotations (PMA), generated by trained human annotators by adjusting the TMA. All voxel- and pixel-wise annotations on CT and X-ray, except for the RSNA dataset, were performed by internal annotators. The RSNA pixel annotations were generated by fitting ellipses to the bounding boxes provided by the data source.
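Converting a bounding box to an elliptical pixel mask, as done for the RSNA annotations, can be sketched as follows. This is a minimal sketch, assuming the ellipse is the axis-aligned one inscribed in the box and that boxes are given as top-left corner plus width and height in pixels; both the fitting rule and the box convention are assumptions, since the text only says that ellipses were fitted to the bounding boxes.

```python
import numpy as np


def ellipse_mask_from_box(shape, x, y, w, h):
    """Binary mask of the axis-aligned ellipse inscribed in a bounding box.

    shape: (rows, cols) of the image; (x, y): top-left corner; (w, h): box size.
    Pixel centres are taken at integer coordinates + 0.5.
    """
    rows, cols = np.ogrid[:shape[0], :shape[1]]
    cy, cx = y + h / 2.0, x + w / 2.0  # ellipse centre
    ry, rx = h / 2.0, w / 2.0  # semi-axes
    inside = ((cols + 0.5 - cx) / rx) ** 2 + ((rows + 0.5 - cy) / ry) ** 2 <= 1.0
    return inside.astype(np.uint8)


mask = ellipse_mask_from_box((100, 100), x=20, y=30, w=40, h=20)
print(mask.sum(), np.pi * 20 * 10)  # pixel count vs. analytic ellipse area
```

An ellipse discards the box corners, which for roughly rounded opacities is usually a closer approximation of the lesion outline than the full rectangle.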

Experiment Settings

To show the benefits of multi-modal training for developing a COVID-19 model, we conducted training with the 4 different training datasets summarized in Table 3, where S3 contains the paired images twice: once with XMA and once with PMA.

Table 3

Different training sets.

training dataset setting	Description
S1	X-ray images with XMA
S2	S1 + synthetic X-ray images with SMA
S3	S1 + X-ray images with PMA
S4	S1 + synthetic X-ray images with SMA + X-ray images with PMA

To further improve robustness beyond a single model, we also perform ensemble learning with three classifiers trained on the same training data setting (S4) with different weight initializations. The output of the ensemble is the average of the outputs of the three trained models.
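The ensemble rule above, averaging the outputs of three models trained from different initializations, amounts to an element-wise mean. This minimal sketch applies it to hypothetical class probabilities in the order (COVID-19, pneumonia, negative); the same function works unchanged for segmentation probability maps of equal shape.

```python
import numpy as np


def ensemble_average(outputs):
    """Element-wise mean of same-shaped model outputs (probabilities or probability maps)."""
    return np.mean(np.stack([np.asarray(o, dtype=float) for o in outputs], axis=0), axis=0)


# Hypothetical outputs of three S4 models for one image.
model_probs = [
    np.array([0.6, 0.3, 0.1]),
    np.array([0.5, 0.4, 0.1]),
    np.array([0.7, 0.2, 0.1]),
]
print(ensemble_average(model_probs))
```

Averaging probabilities (rather than taking a majority vote of predicted labels) keeps a continuous score, which is what the AUC evaluation needs.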

Reader Study

To assess the usability of this AI system and validate its performance, we conducted a reader study. Two certified radiologists were invited to read 50 X-ray images (26 COVID-19, 8 pneumonia and 23 negative). Both radiologists have over 10 years' experience in thoracic imaging. In addition to classifying the images, the radiologists performed pixel-wise annotations of the disease regions. For classification, five options of increasing suspicion were available: no pathology at all, no pneumonia sign, non-COVID-19, indeterminate COVID-19 and probable COVID-19.

Evaluation Metrics

To evaluate the performance of our inferred classification on the test subjects, we used the area under the receiver operating characteristic curve (AUC) for different combinations of positive and negative classes, namely COVID-19 pneumonia vs other pneumonia, COVID-19 pneumonia vs other pneumonia + negative, and COVID-19 pneumonia vs negative, corresponding to different deployment scenarios. The Dice coefficient was used to evaluate the accuracy of our pathology localization.
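The evaluation scheme above can be sketched as follows: select the negative classes for each scenario, compute the AUC as the Mann-Whitney probability that a COVID-19 image outscores a negative image, and compute Dice between binary masks. This is a minimal sketch, assuming integer labels 0 = COVID-19, 1 = other pneumonia, 2 = negative and the COVID-19 probability as the score; the label encoding and function names are illustrative assumptions.

```python
import numpy as np

COVID, PNEUMONIA, NEGATIVE = 0, 1, 2  # assumed label encoding

# Negative classes for each deployment scenario listed in the text.
SCENARIOS = {
    "C vs P": (PNEUMONIA,),
    "C vs P + N": (PNEUMONIA, NEGATIVE),
    "C vs N": (NEGATIVE,),
}


def auc_mann_whitney(pos_scores, neg_scores):
    """AUC as P(positive score > negative score), counting ties as 1/2."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))


def scenario_auc(labels, covid_scores, scenario):
    """AUC with COVID-19 as positive class and the scenario's classes as negatives."""
    labels = np.asarray(labels)
    scores = np.asarray(covid_scores, dtype=float)
    negatives = np.isin(labels, SCENARIOS[scenario])
    return auc_mann_whitney(scores[labels == COVID], scores[negatives])


def dice(a, b):
    """Dice coefficient between two binary masks (defined as 1.0 when both are empty)."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    total = a.sum() + b.sum()
    return 1.0 if total == 0 else float(2.0 * np.logical_and(a, b).sum() / total)


labels = [0, 0, 1, 2]
scores = [0.9, 0.8, 0.85, 0.1]
for name in SCENARIOS:
    print(name, scenario_auc(labels, scores, name))
print(dice([1, 1, 0], [1, 0, 0]))
```

The pairwise formulation makes explicit why the same model can score very differently across scenarios: adding easy negative images to the negative set raises the fraction of correctly ordered pairs without changing the model.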

Results

Pristine annotation creation

With three different pixel-wise annotations on X-ray, we evaluated the overlap between XMA, PMA and TMA. As Table 4 shows, with TMA (transferred CT annotations) as guidance, the agreement between human annotations and CT annotations improves substantially, from a Dice coefficient of 0.28 (XMA vs TMA) to 0.47 (PMA vs TMA). The Dice coefficient between XMA (X-ray manual mask annotations) and PMA (pristine mask annotations) is also moderate (0.50), which means that PMA is reasonably consistent with both TMA and XMA, whereas the consistency between TMA and XMA is poor. Fig. 6 shows a number of examples with TMA, XMA and PMA. The X-ray annotations show large inconsistencies with the automated CT-transferred annotations.

Table 4

Area overlapping between different annotations.

Comparisons	Dice
XMA vs TMA	0.28
PMA vs TMA	0.47
XMA vs PMA	0.50

Fig. 6

Examples with large annotation inconsistencies, with TMA shown as red contours, XMA as blue regions and PMA as green regions.


Model Evaluation

The model is evaluated in terms of both classification and segmentation of COVID-19 disease regions. To test the effect of domain shift, we report results on both the full X-ray test dataset and the in-house test dataset, in which COVID-19, regular pneumonia and negative cases are all available. Regarding classification, Table 5 shows the AUC for different combinations of positive and negative classes. On the in-house test set, adding synthetic X-rays increases the AUC for COVID-19 pneumonia vs other pneumonia + negative from 0.89 to 0.93. Adding pristine ground truth does not further increase the AUC. A similar increase (from 0.87 to 0.92) is observed when the AUC is computed with COVID-19 pneumonia as the positive class and other pneumonia as the negative class. We formulate the triage of COVID-19 patients as a three-class classification problem to cope with different application scenarios: at inference, clinicians can adapt how the AI outputs are used to the use case. For example, if clinicians want to minimize the number of regular pneumonia and negative patients recalled, the COVID-19 probability alone is a sufficient indicator. If clinicians prefer high sensitivity and recalling regular pneumonia patients is not considered a clinical burden, the maximum of the COVID-19 and pneumonia class probabilities can be used as the indicator.
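The two triage indicators described above can be sketched as a single scoring function over the three-class output. This is a minimal sketch, assuming probabilities ordered as (COVID-19, pneumonia, negative); the mode names are illustrative, and the choice of recall threshold is left to the clinical use case.

```python
import numpy as np


def triage_score(probs, mode="covid"):
    """Turn (COVID-19, pneumonia, negative) probabilities into one triage score.

    mode="covid":     COVID-19 probability alone (fewer pneumonia/negative recalls).
    mode="sensitive": max(COVID-19, pneumonia) probability (higher sensitivity;
                      regular pneumonia cases are recalled as well).
    """
    probs = np.asarray(probs, dtype=float)
    if mode == "covid":
        return probs[..., 0]
    if mode == "sensitive":
        return np.maximum(probs[..., 0], probs[..., 1])
    raise ValueError(f"unknown mode: {mode}")


p = np.array([[0.2, 0.7, 0.1],   # image that looks like regular pneumonia
              [0.6, 0.1, 0.3]])  # image that looks like COVID-19
print(triage_score(p, "covid"), triage_score(p, "sensitive"))
```

With a recall threshold of 0.5, the first mode would flag only the second image, while the sensitive mode would flag both, which illustrates the trade-off between specificity and sensitivity described in the text.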

Table 5

Classification results: AUC measures of different training schemes on different datasets with different positive and negative compositions (C: COVID-19 pneumonia; P: other pneumonia; N: negative).

Training setting	AUC C vs P + N, XR test set	AUC C vs P, XR test set	AUC C vs P + N, in-house XR test set	AUC C vs P, in-house XR test set
S1	0.98	0.98	0.89	0.87
S2	0.99	0.98	0.93	0.92
S3	0.99	0.98	0.91	0.90
S4	0.99	0.99	0.93	0.92
S4 ensemble	0.99	0.99	0.93	0.93

We measure Dice coefficients to estimate segmentation accuracy. Because PMA were obtained only for the in-house dataset, we measured Dice on all test images with XMA, on in-house test images with XMA, and on in-house test images with PMA (Table 6). Adding PMA substantially improves the Dice measures across the different test settings, up to 0.70. Fig. 7 shows examples of AI detection and segmentation of COVID-19 regions, with manual annotations overlaid, for one unilateral and one bilateral case. In the bilateral case, the AI missed the consolidation at the bottom of the left lung, possibly because it interpreted this consolidation as pleural effusion and therefore dismissed the region.

Table 6

Segmentation results: Dice measures of different training schemes on different datasets.

Training setting	Dice, XR test set with XMA	Dice, in–house XR test set with XMA	Dice, in–house XR test set with PMA
S1	0.58	0.59	0.59
S2	0.57	0.56	0.58
S3	0.60	0.62	0.70
S4	0.57	0.58	0.62
S4 ensemble	0.58	0.59	0.64
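Tables 5 and 6 include an "S4 ensemble" row, but this section does not state how the ensemble members are combined. A minimal sketch, assuming the common choice of averaging the members' class probabilities and per-pixel segmentation probabilities and then thresholding the averaged map at a hypothetical 0.5, is shown below; it is not necessarily the authors' combination rule.

```python
from typing import List, Tuple

import numpy as np


def ensemble_predict(class_probs: List[np.ndarray], mask_probs: List[np.ndarray],
                     mask_threshold: float = 0.5) -> Tuple[np.ndarray, np.ndarray]:
    """Combine ensemble members by averaging (an assumed rule).

    class_probs: one array of shape (3,) per member, ordered [COVID-19, pneumonia, negative]
    mask_probs:  one per-pixel probability map of shape (H, W) per member
    """
    mean_class = np.mean(np.stack(class_probs), axis=0)
    mean_mask = np.mean(np.stack(mask_probs), axis=0)
    # Binarize the averaged map to obtain the ensemble segmentation.
    return mean_class, mean_mask >= mask_threshold


members_cls = [np.array([0.6, 0.3, 0.1]), np.array([0.8, 0.1, 0.1])]
members_seg = [np.array([[0.9, 0.2], [0.6, 0.4]]), np.array([[0.7, 0.4], [0.2, 0.8]])]
probs, mask = ensemble_predict(members_cls, members_seg)
print(probs)  # averaged class probabilities, approximately [0.7, 0.2, 0.1]
print(mask)   # True where the averaged pixel probability is at least 0.5
```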

Fig. 7

Segmentation examples: original X-ray images (left), PMA (middle) and AI segmentations (right).

Reader Study Results

In the reader study, the AI showed good consistency with the radiologists in detection/segmentation (see Table 7), given that the Dice between the two radiologists on the COVID group is 0.53. For classification, the two readers achieved AUCs of 0.87 and 0.81, while the AI achieved an AUC of 0.93 on this reader study dataset. Fig. 8 shows the ROC curves of the radiologists and the AI system for COVID-19 versus non-COVID-19 classification.
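Reader ROC curves such as those in Fig. 8 are typically built from ordinal confidence ratings. The section does not describe the rating scale the radiologists used, so the 5-point scale below is a hypothetical illustration. The sketch sweeps a threshold over every distinct score to obtain (false-positive rate, true-positive rate) operating points and integrates them with the trapezoidal rule; the same functions apply unchanged to the AI's continuous probabilities.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """(FPR, TPR) pairs from sweeping a threshold over every distinct score.

    labels: 1 = COVID-19, 0 = non-COVID-19.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_auc(points: List[Tuple[float, float]]) -> float:
    """Area under the piecewise-linear ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical 5-point reader ratings (5 = definitely COVID-19) for six cases.
ratings = [5, 4, 2, 3, 1, 2]
truth = [1, 1, 1, 0, 0, 0]
pts = roc_points(ratings, truth)
print(trapezoid_auc(pts))  # 5/6 ≈ 0.833
```

With ties handled this way, the trapezoidal area equals the Mann-Whitney AUC, so reader and AI AUCs computed like this are directly comparable.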

Table 7

Area overlap (Dice) between readers and AI.

Comparisons	Dice
Radiologist 1 vs Radiologist 2	0.53
Radiologist 1 vs AI	0.52
Radiologist 2 vs AI	0.55

Fig. 8

ROC curves of radiologists and the AI system in the use case of COVID-19 versus non-COVID-19 classification.

Generalization Study

To evaluate the generalization of the proposed method, we compared it with a few state-of-the-art methods on public datasets, as analysed in this section. Since most current models focus on a single task, i.e. segmentation or classification only, we compare our method with two methods, one for each task. For segmentation, we consider the CT2X-ray method [37], which uses CT to X-ray transformation as the multi-modality approach for model training in the X-ray domain. For classification, Covid-Net [48] is adopted for comparison. For a fair comparison, additional public datasets are used for both tasks. For segmentation, we adopt the BIMCV dataset [49], which provides 14 X-ray images with carefully annotated ground truth (lateral views are excluded to fit our method). For classification, 300 COVID-19-positive X-ray images are obtained from [50] (carefully excluding the RSNA images to avoid information leakage), and 300 non-COVID-19 negative images are obtained from the CheXpert dataset [51]. In each case, these are the first 300 images of the dataset. The Dice scores of the two multi-modality segmentation methods are 0.57 for the proposed method and 0.51 for the CT2X-ray method. For classification, the proposed method also outperforms Covid-Net on the independent datasets, as shown in Table 8. A drop in classification performance is observed, which we attribute to the extremely low resolution (256x256) of the available images.

Table 8

Classification performance on independent dataset.

Method	AUC	Specificity	Sensitivity
Covid-Net	0.55	0.55	0.55
Proposed	0.81	0.73	0.72
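Table 8 reports sensitivity and specificity alongside the AUC, which implies a fixed decision threshold on the COVID-19 score; the threshold is not stated in this section. A minimal sketch of both metrics at a hypothetical threshold of 0.5, on made-up toy scores with balanced classes like the 300 + 300 external set, is:

```python
from typing import Sequence, Tuple


def sens_spec(scores: Sequence[float], labels: Sequence[int],
              threshold: float) -> Tuple[float, float]:
    """Sensitivity and specificity when predicting COVID-19 for score >= threshold.

    labels: 1 = COVID-19 positive, 0 = non-COVID-19.
    """
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return tp / (tp + fn), tn / (tn + fp)


scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]
print(sens_spec(scores, labels, threshold=0.5))  # (2/3, 2/3)
```

Unlike the AUC, both values shift with the chosen threshold, so they should be read together with the operating point at which they were measured.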

Conclusion and Discussion

In this study, from a multi-modal perspective, we have developed an artificial intelligence system that learns from a mix of higher-dimensional CT and X-ray data but performs inference only on single-modality X-ray for COVID-19 diagnosis and segmentation/localization of the diseased regions. The system classifies a given image into three categories: COVID-19, pneumonia and negative. We show that learning from CT appears to improve both classification and segmentation of the pathology. Our AI system achieves a classification AUC of 0.99 and 0.93 for COVID-19 pneumonia versus other pneumonia plus negative on the full test dataset and the in–house subset, respectively. Dice scores of 0.57 and 0.58 are obtained for COVID-19 disease regions on the full test dataset and the in–house subset using X-ray direct annotations, respectively. The Dice increases to 0.62 when pristine ground-truth transferred from CT is used for training and testing. We also observe that ensemble modelling can further improve classification and segmentation performance over a single S4 model. We also conducted a reader study to benchmark the performance of the AI. The Dice of the AI (0.52–0.55) is comparable to the inter-radiologist Dice (0.53), and the AI outperformed both radiologists in triaging COVID-19 patients. Accurate classification can aid physicians in triaging patients and making appropriate clinical decisions. Accurate segmentation enhances confidence in the triage and supports quantitative reporting and monitoring of disease progression. The groundtruth transferred from CT is used as a hint to guide the annotation and to generate pristine markings of the pathology. One can regard TMA as computer-aided detection (CAD) markers that help radiologists improve disease region detection/delineation.
In mainstream FDA or CE reader studies [52], [53], CAD markers are often used in a similar way to improve the accuracy and consistency of disease detection. Learning from a second modality (CT) in our multi-modal approach has two main implications. First, synthetic X-rays and corresponding disease masks generated with different projection parameters substantially augment the training image pool and ensure data diversity. Second, a higher-sensitivity imaging modality provides additional pathology evidence: we transfer the CT annotations to the original X-ray using the synthetic X-ray as a bridge, which allows manual adjustment of the annotations used for training. The first addition contributes mostly to the gain in classification accuracy, and the second contributes substantially to the gain in disease localization. Although CT resolution is generally lower than X-ray resolution, the diagnosis of COVID-19 depends on the distribution of disease over the left and right lungs. We found that adding synthetic X-rays (S2) brings limited benefit for pathology segmentation compared with using only real X-rays with pristine ground-truth (S3). This may be because synthetic X-rays have different intensity and contrast from real X-rays, and the segmentation annotations of synthetic X-rays are derived directly from CT, where some lesions may not be visible in the original X-rays. In the S2 setting, the mask annotations for synthetic X-rays come directly from CT annotations without manual adjustment, so they can be inconsistent with the visual appearance of abnormality in the synthetic X-rays. This may explain the smaller segmentation benefit of the S2 setting. The quality of the synthetic X-ray may play an important role and is worth further investigation, for example using generative networks to make synthetic X-rays more realistic.
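The projection geometry used to create the synthetic X-rays is not described in this section. The sketch below is a deliberately simplified, assumed stand-in: a parallel-beam projection that converts Hounsfield units to linear attenuation, applies the Beer-Lambert law along one axis, and projects the 3D lesion mask by marking a pixel as diseased if any voxel on its ray is diseased. The value of `mu_water` and the choice of projection axis are hypothetical; varying the axis, or rotating the volume before projecting, would be one way to obtain "different projection parameters".

```python
from typing import Tuple

import numpy as np


def synthetic_xray(ct_hu: np.ndarray, lesion_mask: np.ndarray, axis: int = 1,
                   mu_water: float = 0.02) -> Tuple[np.ndarray, np.ndarray]:
    """Parallel-beam projection of a CT volume and its 3D lesion mask (simplified sketch).

    ct_hu:       CT volume in Hounsfield units, shape (z, y, x)
    lesion_mask: binary 3D mask with the same shape as ct_hu
    axis:        projection direction (assumed here to be anterior-posterior)
    mu_water:    hypothetical attenuation of water per voxel step
    """
    # HU definition inverted: mu = mu_water * (1 + HU / 1000); air (-1000 HU) maps to 0.
    mu = np.clip(mu_water * (1.0 + ct_hu / 1000.0), 0.0, None)
    # Beer-Lambert: transmitted fraction is exp(-line integral of mu along the ray).
    line_integral = mu.sum(axis=axis)
    image = 1.0 - np.exp(-line_integral)  # bright where attenuation is high, as on a radiograph
    # A 2D pixel is diseased if any voxel along its ray is diseased.
    mask_2d = lesion_mask.astype(bool).any(axis=axis)
    return image, mask_2d


ct = np.full((2, 3, 2), -1000.0)  # toy volume filled with air
ct[:, 1, :] = 0.0                 # a one-voxel-thick slab of water along y
lesion = np.zeros(ct.shape, dtype=bool)
lesion[0, 1, 0] = True            # a single diseased voxel
drr, drr_mask = synthetic_xray(ct, lesion, axis=1)
print(drr.shape, drr_mask.sum())  # (2, 2) 1
```

The projected mask illustrates the issue discussed above: it marks every pixel whose ray crosses a CT lesion, regardless of whether that lesion would actually be visible on the resulting X-ray.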
Our study shows substantial inconsistency between X-ray direct annotations and automatically transferred CT annotations. When the transferred annotations are used as hints, the second version of the X-ray annotations (pristine groundtruth) is more consistent with the automatically transferred CT annotations. This also indicates that the automatically transferred CT annotations cannot be used directly for training, because some lesions visible in CT are not visible in X-ray and because the disease can change between the two exams. The Dice between direct annotations and pristine groundtruth is 0.50, which may serve as a reference for the attainable performance of AI-based segmentation. One important difference between XMA and TMA is that radiologists usually annotate lung disease regions only within the dark areas assumed to be lung parenchyma on X-ray. However, Fig. 3 clearly shows that, after projection of the 3D disease from the CT lung images, the disease can actually lie outside the dark lung region, for example in the cardiac region. Readers can then adjust their annotations to cover possible and visible disease regions. We do not plan to use the TMA directly, as some projected regions are obscured on X-ray images. Readers could also adjust annotations using only the CT annotations for guidance; however, because we aim for pixel-level annotations, we use the transfer approach to generate TMA as pixel-level guidance for the manual annotators. SMI also supplements the training cohort along with the synthetic X-ray as an augmentation opportunity.
It should further be noted that the automatically transferred annotations cannot be used directly in training, as registration error can cause incorrect definition of pathology. We recognize that COVID-19 is a fast-changing disease, and since we pair X-ray and CT acquired within a pre-defined time window of 48 h, the disease status in CT might be quite different from that when the X-ray was taken. Manual adjustment is therefore an important consideration. One important observation is that domain shift can occur when developing an AI solution using data collected from different sources, especially for a fast-evolving disease or when extraordinary limitations prevent access to large volumes of data. This is consistent with the observation of DeGrave et al. [54], who showed that recent deep learning systems for detecting COVID-19 from chest radiographs rely on confounding factors rather than medical pathology, creating an alarming situation in which the systems appear accurate but fail when tested in new hospitals. In our study, excellent performance is achieved on the general test dataset, but a prominent performance drop is observed on the in–house test set. On one hand, the COVID-19 and pneumonia cases are confirmed with RT-PCR tests, and the limited sensitivity of RT-PCR makes the ground-truth less dependable. On the other hand, the public datasets do not contain all three classes of images from the same source, which confounds the trained model into recognizing both disease and data source at the same time in the classification task. Although both normalization and extensive augmentation are applied to balance the data pools during training, when the model is tested on the general test dataset, the recognized data source may help it achieve unrealistically good classification results owing to data-source bias.
In the research and industry community, efforts have been made to apply AI in imaging-based pipelines for COVID-19 applications. However, many existing AI studies for segmentation and diagnosis are based on small samples from a single data source, which may lead to overfitting. To make the results clinically applicable, a large amount of data from different sources should be collected for evaluation. Moreover, many studies provide only a classification prediction without a segmentation or heatmap, which leaves AI systems lacking explainability. By providing the segmentation, we aim to fill this void and support the adoption of AI in clinical practice. On the other hand, imaging-based diagnosis has limitations, and clinicians also consider clinical symptoms when making a diagnosis. An AI system can be substantially enhanced by incorporating patient clinical parameters [55], such as blood oxygen level and body temperature, to further improve its capability to accurately diagnose pathological conditions. WHO has recommended [56] a few scenarios in which chest imaging can play an important role in care delivery. From the triage perspective, WHO suggests using chest imaging for the diagnostic workup of COVID-19 when RT-PCR testing is not available in a timely manner or is negative while patients have relevant symptoms. In this case, the classification support from our AI can help radiologists identify COVID-19 patients. From the monitoring or temporal perspective, for patients with suspected or confirmed COVID-19, WHO suggests using chest imaging in addition to clinical and laboratory assessment to decide between hospital admission and home discharge, to decide between regular admission and intensive care unit (ICU) admission, and to inform therapeutic management. In these scenarios, accurate segmentation of disease regions is essential for the evaluations. COVID-19 continues to spread around the world.
Medical imaging and corresponding artificial intelligence applications, together with clinical indicators, provide solutions for triage, risk analysis and temporal analysis. This study provides a multi-modal solution that leverages CT information for training but performs inference on X-ray, avoiding the need for CT imaging given its limited accessibility, radiation dose and decontamination concerns. Future work will focus on leveraging paired CT information to estimate severity and other higher-dimensional measures. We believe that our study introduces a new trend of combining multi-modal training with single-modality inference. Although this study tries to leverage CT information as much as possible to aid the data-driven AI solution on X-rays, the extraction of relevant information is limited by the nature of X-ray imaging, owing to its 2D projection and the impaired visibility in the presence of thick non-lung tissue. In addition, different vendors may apply varying post-processing algorithms that suppress information in those thick tissue regions. To avoid information loss, our AI may be deployed directly on X-ray hardware with direct access to raw X-ray images for ideal translation to the clinic.

Compliance with Ethical Standards

This study follows the principles outlined in the Declaration of Helsinki of 1975, as revised in 2000.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
