
TL-med: A Two-stage transfer learning recognition model for medical images of COVID-19.

Jiana Meng1, Zhiyong Tan1, Yuhai Yu1, Pengjie Wang1, Shuang Liu1.   

Abstract

The recognition of medical images with deep learning techniques can assist physicians in clinical diagnosis, but the effectiveness of recognition models relies on massive amounts of labeled data. With the rapid worldwide spread of the novel coronavirus (COVID-19), rapid COVID-19 diagnosis has become an effective measure to combat the outbreak. However, labeled COVID-19 data are scarce. Therefore, we propose a two-stage transfer learning recognition model for medical images of COVID-19 (TL-Med) based on the concept of "generic domain-target-related domain-target domain". First, we use the Vision Transformer (ViT) pretraining model to obtain generic features from massive heterogeneous data and then learn medical features from large-scale homogeneous data. Two-stage transfer learning uses the learned primary features and the underlying information for COVID-19 image recognition, solving the problem that insufficient data prevent the model from learning the underlying information of the target dataset. The experimental results obtained on a COVID-19 dataset using the TL-Med model produce a recognition accuracy of 93.24%, which shows that the proposed method is more effective in detecting COVID-19 images than other approaches and may greatly alleviate the problem of data scarcity in this field.
© 2022 Published by Elsevier B.V. on behalf of Nalecz Institute of Biocybernetics and Biomedical Engineering of the Polish Academy of Sciences.


Keywords:  COVID-19; Pretrained Model; Transfer Learning; ViT

Year:  2022        PMID: 35506115      PMCID: PMC9051950          DOI: 10.1016/j.bbe.2022.04.005

Source DB:  PubMed          Journal:  Biocybern Biomed Eng        ISSN: 0208-5216            Impact factor:   5.687


Introduction

Since the outbreak of the novel coronavirus (COVID-19) in late 2019, the virus has ravaged the world, greatly affecting global politics, the economy, and culture and causing immeasurable economic and property losses. Due to its extremely high infection and mortality rates, the World Health Organization declared the COVID-19 outbreak a global health emergency within a very short period [1]. As of 6 September 2021, more than 220 million cumulative diagnoses and more than 4.5 million cumulative deaths had been confirmed worldwide. These numbers may be even higher due to asymptomatic cases and flawed tracking policies. Some researchers have modeled COVID-19 through a fractal approach to the epidemic curve to help the medical community understand the dynamics and evolution of the outbreak and thus control its spread [2]. Other researchers have used mice to study the pathogenesis of COVID-19 in order to evaluate the effectiveness of new treatments and vaccines and to find more effective ways to combat the disease [3]. Despite global efforts to prevent rapid outbreaks, hundreds of thousands of COVID-19 cases continue to be confirmed worldwide every day, making rapid COVID-19 detection an effective measure with which governments can respond; it can help health departments and government authorities develop effective resource allocation strategies and break the chain of transmission. The most typical symptoms of COVID-19 include fever, dry cough, myalgia, dyspnea, and headache, but some cases show no symptoms at all (asymptomatic cases), making this disease a greater public health threat than other diseases. Currently, reverse transcription-polymerase chain reaction (RT-PCR) is viewed as the gold standard for the diagnosis of COVID-19.
However, the rapid and effective screening of suspected cases is limited by a lack of resources and stringent testing environment requirements. In addition, RT-PCR tests are time-consuming and have high false negative (FN) rates [4]. The spread of COVID-19 has produced many variants of the strain, such as the Delta variant, which is hundreds of times more virulent and infectious than common strains; these mutations undoubtedly add to the pressure of COVID-19 detection. However, computed tomography (CT) scan images are considered a better method for detecting COVID-19 [5], with a sensitivity of 98 % (compared to 71 % for RT-PCR) [4]. In COVID-19 cases, CT scan images show specific manifestations, including multilobular ground-glass opacities (GGOs) distributed bilaterally, peripherally or posteriorly [6], predominantly in the lower lobes and less frequently in the middle lobes. Diffuse distribution, vascular thickening, and fine reticular clouding are other common features reported for patients with COVID-19 pneumonia. For example, Fig. 1 shows two CT scans: one from a patient with COVID-19 and one from a non-COVID-19 patient [7]; the CT scan of the lungs of a patient infected with COVID-19 is on the left of the figure, where distinct GGOs are indicated by red arrows, and a normal lung CT scan is shown on the right. The use of computers to classify medical images and thus assist physicians in diagnosis is now a common and effective method [[8], [9]], and the use of artificial intelligence to classify CT images has become a widespread concern in medical image analysis [[10], [11]].
Fig. 1

A CT scan of the lungs of a patient infected with COVID-19 and a CT scan of a normal lung [7].

During the pandemic, hospitals generate thousands of CT scan images every day. Relying on the naked eye of a professional physician to detect and identify such a large number of scans is a great challenge: the human eye tends to become fatigued and easily overlooks details, leading to misdiagnosis, and the cost of misdiagnosis is unbearable. In contrast, machines do not become fatigued and can detect details that humans easily overlook. Therefore, a deep learning approach can effectively help us to quickly detect and identify COVID-19. Based on these facts, our approach introduces a Vision Transformer (ViT) into the COVID-19 detection and classification task. The proposed model distinguishes between normal CT scans and the CT scans of patients. The generic-domain dataset used for the ViT pretrained model and the dataset used in this experiment differ in their feature spaces and dimensions, whereas the tuberculosis (TB) dataset is highly similar to the COVID-19 dataset in both respects. Therefore, in the first stage, we use heterogeneous transfer learning to learn the generic features of the images: the ViT model, trained on a large proportion of generic-domain data, is used to detect the five types of TB. Then, in the second stage, the domain features of the images are learned through homogeneous transfer learning: the model obtained on the TB dataset is used as the pretrained model and is then fine-tuned to classify patients as COVID-19 or non-COVID-19.
The performance of the proposed technique is evaluated by comparing the results derived from the model with those of Residual Network 34 (ResNet34) [12], ResNet101 [12], and DenseNet169 [13], which are based on convolutional neural network (CNN) technology. The experimental results show that the developed model outperforms the existing models based on CNN techniques. The remainder of the paper is organized as follows. Section 2 introduces the related works on COVID-19 detection. Section 3 describes the proposed method in detail. Section 4 gives the related experimental results, and Section 5 draws conclusions.

Related work

In this section, we discuss the research area relevant to our work: the detection of COVID-19. Since the COVID-19 outbreak, rapid diagnostic testing has become one of the most effective methods for interrupting the spread of COVID-19. Because deep learning-based detection methods are more convenient and faster than traditional approaches, researchers have developed many effective deep learning models for detecting COVID-19. Heidarian et al. [14] developed a two-stage fully automated CT framework (COVID-FACT), which is mainly composed of a capsule network. COVID-FACT can capture spatial information without extensive data augmentation and large datasets and operates in two stages: the first stage detects infected slices, and the second stage classifies the patient's CT scan. COVID-FACT requires less supervision and annotation of the input data than models of the same type. Chaudhary et al. [15] developed a two-stage classification framework based on CNNs. The authors first used a pretrained DenseNet [13] for COVID-19 or community-acquired pneumonia (CAP) detection in the first stage and then used the EfficientNet [16] network for three-way classification among COVID-19, CAP, and normal controls. Its classification effectiveness ranked first in the IEEE ICASSP 2021 Signal Processing Grand Challenge (SPGC) evaluation. Data are expensive resources; in particular, datasets in the medical field are extremely scarce. For this reason, He et al. [17] collected and collated hundreds of COVID-19 CT scan images and made them publicly available. They also proposed a self-supervised transfer learning method that reduces the bias generated by the source images and their class labels by conducting an auxiliary task on the CT images; its performance was superior to that of several state-of-the-art methods. Polsinelli et al.
[18] designed a lightweight CNN based on SqueezeNet [19], whose processing speed surpassed that of other models and whose processing time without GPU acceleration also surpassed that of many models that require GPU acceleration; however, its performance was not greatly improved over that of other networks, and its accuracy was not high. Sani et al. [20] constructed a new network structure for COVID-19 recognition by using a high-precision Hopfield neural network (HNN) to find symptoms and using a mathematical model to improve the accuracy of masking. Scarpiniti et al. [21] proposed a novel unsupervised method that used deep denoising convolutional autoencoders to provide compact and meaningful hidden representations. The experimental results show that the method has high reliability and low computational cost for the recognition of COVID-19. Similarly, Pathak et al. [22] designed a transfer learning network based on ResNet-50 [23] that extracts the latent features of CT COVID-19 images through ResNet-50 and uses transfer learning to train the classification model; finally, based on the training results, optimized hyperparameters are obtained by using a CNN. Jaiswal et al. [24] used DenseNet201 [13] to classify patients with COVID-19. They used transfer learning techniques to extract the image features learned by the pretrained DenseNet201 and then fed these features into a CNN for classification. Loey et al. [25] generated additional images via classical data augmentation with conditional generative adversarial networks (CGANs) [26] and then trained networks for classification via deep transfer learning methods. Muhammet et al. [27] proposed two architectures to classify COVID-19, using AlexNet as the backbone network for transfer learning. Compared with architecture 1, which directly used AlexNet for transfer learning, architecture 2 combined AlexNet with a BiLSTM to account for the temporal order of the data.
There are also many researchers working on the detection of COVID-19 through the study of X-ray modal data. Xiao et al. [28] adopted a local phase-based image enhancement method to obtain a multi-feature CXR image, which was fed into the network together with the original image for fusion, further improving the classification performance of the model. For the tuning problem in deep learning, Mohammad et al. [29] optimized the CNN structure in multiple stages by iteratively applying heuristic optimization methods to evolve the best-performing network with the smallest number of convolutional layers. Govardhan et al. [30] utilized two CNN models (ResNet50 and ResNet101) for a two-stage detection task: the first-stage ResNet50 distinguished bacterial pneumonia, viral pneumonia, and X-rays of normal healthy people, and the detected viral pneumonia samples were used as the input data of the second-stage ResNet101 network to distinguish COVID-19 patients from other viral pneumonia patients. Bejoy et al. [31] used multiple pretrained CNN models for feature extraction, selected significant features through correlation-based feature selection and subset size forward selection, and finally classified them with a BayesNet classifier. Ruochi et al. [32] proposed the COVID19XrayNet model, which was designed using two-step transfer learning: the first step fine-tuned a pretrained ResNet34 on a common pneumonia dataset, and the model weights trained in the first step were transferred to the corresponding network modules in the second step and trained on the COVID-19 dataset. Shervin et al. [33] fine-tuned four pretrained network models (ResNet18, ResNet50, SqueezeNet, DenseNet-169) with a small COVID-19 dataset, and the models performed well. Shayan et al. [34] directly utilized a standard CNN to classify lung images, and the model performed well. Gupta et al. [35] proposed the COVID-WideNet capsule network.
Compared with other CNN models, its parameter count is greatly reduced, and it can detect COVID-19 quickly and efficiently. Goel et al. [36] proposed the Multi-COVID-Net model, an ensemble network composed of InceptionV3 and ResNet50 pretrained networks, with hyperparameters optimized through the Multi-Objective Grasshopper Optimization Algorithm (MOGOA). Since fine-tuning the architecture and hyperparameters of deep learning models is a complex and time-consuming process, Jalali et al. [37] proposed a novel deep neuroevolutionary algorithm, mainly achieved by modifying the competitive swarm optimizer algorithm to adjust the hyperparameters and architecture of convolutional neural networks. Compared with the CNN structure, ViT can attend to more global information at lower layers, and more training data can help ViT further learn local information, thereby improving model performance [38]. Therefore, many researchers use the ViT architecture to detect COVID-19. To accurately classify and quantify the severity of COVID-19, Sangjoon et al. [39] proposed a multi-task Vision Transformer model. The method obtained low-level corpus features by pretraining on a large, general CXR dataset and then fed the acquired features into a Transformer for the classification and severity quantification of COVID-19. To address the problem that the Vision Transformer requires large-scale data to achieve good performance, Li et al. [40] used ResNet as a teacher network and, through knowledge distillation, transferred the knowledge it learned on COVID-19 CT images into a student Vision Transformer. Debaditya et al. [41] proposed the COVID-Transformer model, which fine-tuned the Vision Transformer on a large collected and merged dataset and performed well compared with other standard CNN models. Sangjoon et al. [42] proposed the FESTA framework.
By dividing the Vision Transformer into three parts (head, tail, and Transformer) for training, it reduced the consumption of client resources and bandwidth through joint training and improved the performance of the model on single-task processing. By fine-tuning the Vision Transformer, Koushik et al. [43] achieved better performance than other standard CNN models. Sangjoon et al. [44] used DenseNet to extract abnormal lung features on a large low-level dataset and then used them as a corpus for Vision Transformer training to obtain a better feature embedding and improve the performance of the model. Arnab et al. [45] proposed the xViTCOS framework, which employs a Vision Transformer as the backbone network in a multi-stage transfer learning approach: the model is pretrained and fine-tuned on ImageNet datasets at multiple scales and then fine-tuned on the COVID-19 CXR and CT datasets, respectively, for classification. Sangjoon et al. [46] trained a backbone network with a large number of carefully curated public records to generate generic CXR representations, thereby maximizing the performance of the Transformer model using the low-level CXR corpus from the backbone network.

Proposed method

In this paper, we propose a two-stage transfer learning approach based on the ViT model to classify COVID-19 by using different transfer learning methods. The overall model structure is shown in Fig. 3. In the figure, ① indicates that the ImageNet dataset is trained through the ViT model, ② indicates that the model trained in ① is used as a pretraining model to complete heterogeneous transfer learning on the TB dataset, and ③ represents the transfer of the relevant medical feature knowledge learned in ② (shown by the dotted line in the figure) and the completion of homogeneous transfer learning on the COVID-19 dataset. The main process of the model is as follows. First, a multimodal image is input and undergoes a series of preprocessing steps that convert it to a uniform resolution of (224, 224). The image $x \in \mathbb{R}^{H \times W \times C}$ is then passed through a convolution layer and flattened into a sequence of patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $P$ is the resolution of each square patch and $N = HW/P^2$ is the number of patches. The patches are projected by a trainable linear mapping $E \in \mathbb{R}^{(P^2 \cdot C) \times D}$ onto $D$-dimensional vectors and concatenated with a trainable class token $x_{class}$ used for classification, so the dimensionality becomes $(N+1) \times D$; a positional embedding with the same dimensions is then added: $z_0 = [x_{class}; x_p^1 E; \dots; x_p^N E] + E_{pos}$, where $E_{pos} \in \mathbb{R}^{(N+1) \times D}$. The result is fed into the transformer encoder module for computation, and finally, the class token (of dimension $1 \times D$) is extracted and sent to the multilayer perceptron (MLP) head for classification.
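The embedding pipeline described above can be sketched in PyTorch. This is a minimal illustration of the patch split, class token, and positional embedding, not the authors' code; the layer sizes follow the standard ViT-Base configuration (16x16 patches, D = 768), which we assume here.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split a (3, 224, 224) image into 16x16 patches, embed them, prepend a
    class token, and add a positional embedding (standard ViT-Base sizes)."""
    def __init__(self, img_size=224, patch_size=16, in_ch=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2   # N = HW / P^2 = 196
        # A strided convolution implements the patch split + linear projection E
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))            # class token
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):                       # x: (B, 3, 224, 224)
        x = self.proj(x)                        # (B, D, 14, 14)
        x = x.flatten(2).transpose(1, 2)        # (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)          # (B, N + 1, D)
        return x + self.pos_embed               # add positional embedding

z0 = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(z0.shape)  # torch.Size([2, 197, 768])
```

The resulting sequence of 197 tokens is what the transformer encoder consumes; only the first (class) token is read out for classification.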
Fig. 3

Overall structure of the model.

In this work, several main parts of the model include a self-attention (SA) mechanism, a multiheaded self-attention (MSA) mechanism, and an MLP.

Attention mechanisms

Since the transformer [47] was proposed in 2017, it has developed in the field of natural language processing (NLP) at an astonishing rate, and its attention model soon received great attention from NLP researchers. In the following years, numerous transformer-based models emerged, and some of the stronger ones, such as the bidirectional encoder representations from transformers (BERT) model [48], have occupied a central position in NLP in recent years. Inspired by the success of transformers in NLP, researchers introduced them into the computer vision (CV) field and conducted experiments showing that transformers have great potential to surpass pure CNN architectures in some areas. Some researchers have used combinations of CNNs and transformers [[49], [50]], while others have directly used a pure transformer architecture instead of a CNN architecture [[51], [52]]. The main reason for the success of transformers is their attention mechanism. For translation tasks, the transformer model proposed by Google replaced long short-term memory (LSTM) with an attention mechanism and was a great success. The principle of an attention mechanism is that each layer of a network contains different features, which can lie in different channels and locations and have different levels of importance; later layers should attend to the most important information and suppress the less important information. In other words, an attention mechanism should increase the presence of the areas that need attention and give them extra weight, while reducing the presence of less important areas and thus their influence on the overall result. The ViT model proposed in [51] has greatly stimulated interest in transformer research.
Similar to the sequence processing approach in NLP, ViT uses a pure transformer structure: the input images are split into fixed-size patches, and the embedded sequences of these image patches are used as inputs to the transformer. A series of experiments has shown that ViT has great potential for image processing, especially large-scale image processing. Nonlocal neural networks [53] are viewed as early work on attention mechanisms in computer vision and an early attempt by researchers to use transformer ideas in the CV field. The self-attention mechanism builds long-range dependencies and directly captures the interactions between any two locations without being restricted to adjacent points; this is equivalent to constructing a convolution kernel as large as the feature map and can thus retain more information. The drawbacks of this network model are its large time and space complexity and the expensive GPU devices required to train it on massive data. Bahdanau et al. [54] first introduced an attention mechanism into the field of NLP. Previously, the main architecture for neural machine translation was the sequence-to-sequence (Seq2Seq) framework, whose main bottleneck was its intermediate transformation through fixed-dimensional vectors. To solve this problem, a Seq2Seq + attention model was used for machine translation, achieving excellent results. Vaswani et al. [47] completely discarded common network structures such as recurrent neural networks (RNNs) and CNNs, used the attention mechanism alone for machine translation tasks, and achieved excellent results. Since then, the attention mechanism has become a hot research topic.

Transfer learning

With the rapid development of machine learning, an increasing number of machine learning application scenarios have emerged. The good performance of supervised learning is driven by large amounts of data, but in some domains such data are often hard to obtain or too scarce to support the training of a good model; thus, transfer learning was born. Transfer learning models are mostly pretrained on large-scale datasets (e.g., ImageNet [55]) and then fine-tuned for downstream tasks. This idea has been successfully applied to many scenarios, such as dense pose recognition [56], image classification [57], and language understanding [48]. Transfer learning also has many applications in the medical field, such as tuberculosis detection [58], chest X-ray pneumonia classification [28], and breast cancer classification [59]. The main transfer learning methods include sample-based transfer learning [60], feature-based transfer learning [61], model-based transfer learning [62], homogeneous transfer learning [63], heterogeneous transfer learning [64], and adversarial transfer learning [65]. Among them, homogeneous and heterogeneous transfer learning are the main methods used in this paper for the classification of COVID-19. Domains and tasks are two basic concepts in transfer learning. A domain $\mathcal{D}$ contains two parts, a feature space $\mathcal{X}$ and a marginal probability distribution $P(X)$; the domain is given by $\mathcal{D} = \{\mathcal{X}, P(X)\}$. A task $\mathcal{T}$ consists of two parts, a label space $\mathcal{Y}$ and a target prediction function $f(\cdot)$, which can also be viewed as the conditional probability $P(y|x)$; the task is given by $\mathcal{T} = \{\mathcal{Y}, f(\cdot)\}$. Of the source domain and the target domain in transfer learning, the former is the domain used to train the model or the tasks, and the latter is the domain in which the former model is used to predict, classify, and cluster the data. A generalized definition of transfer learning, schematically shown in Fig. 2, is as follows:
Fig. 2

Schematic diagram of transfer learning.

Condition: Given a source domain $\mathcal{D}_S$ and a learning task $\mathcal{T}_S$ on the source domain, and a target domain $\mathcal{D}_T$ and a learning task $\mathcal{T}_T$ on the target domain. Goal: Use the knowledge of $\mathcal{D}_S$ and $\mathcal{T}_S$ to improve the learning of the prediction function $f_T(\cdot)$ on the target domain $\mathcal{D}_T$. Restrictions: $\mathcal{D}_S \neq \mathcal{D}_T$ or $\mathcal{T}_S \neq \mathcal{T}_T$.

Homogeneous transfer learning

If the source and target domain data have the same or similar representation structures but obey different probability distributions, i.e., the feature spaces of the source and target domains are the same ($\mathcal{X}_S = \mathcal{X}_T$) and have the same dimensionality ($d_S = d_T$), homogeneous transfer learning is applicable. Both data-based and model-based transfer learning belong to homogeneous transfer learning. The shortcoming of homogeneous transfer learning is that it can improve generalization in the target domain only with the help of a source domain with a homogeneous representation space.

Heterogeneous transfer learning

In this case, the source and target domains have different feature spaces or different feature representations, i.e., $\mathcal{X}_S \neq \mathcal{X}_T$. For example, if the source domain is a dataset from the generic domain and the target domain is a proprietary dataset, the feature spaces have different dimensions, i.e., $d_S \neq d_T$. In the scenario of medical image classification, suppose that the source domain is the ImageNet dataset, which depicts aspects of daily life, and that the target domain consists of CT scan images of tuberculosis. In general, these domains contain different features with different feature dimensions. In heterogeneous transfer learning, the source domain generally possesses abundant labeled samples, while the target domain is unlabeled or has few labeled samples; this approach overcomes the shortcoming of homogeneous transfer learning by allowing the domains to differ.

SA and MSA

The attention mechanism is essentially derived from the visual attention mechanism of humans. Humans do not focus their attention on the whole of something when viewing it, but rather on the particular parts that interest them. Moreover, when people find that a scene often shows a certain part they want to observe, they learn to focus their attention on that part when a similar scene occurs again. SA is a class of attention mechanism that goes through different parts of the same sample to determine which part should be focused on. SA has various forms, and the generic transformer relies on the scaled dot-product form shown in Fig. 4. In the SA layer, the input vector is first converted into three different matrices: a query matrix Q, a key matrix K, and a value matrix V. The weights are obtained from the dot products of Q with each key in K; after these weights are normalized with a softmax function, they are used to form a weighted sum of the corresponding values in V, yielding the final attention value:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimensionality of the key vectors and the softmax function normalizes the weight vector.
Fig. 4

SA structure.
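The scaled dot-product computation described above can be illustrated with a small NumPy sketch; the shapes and random data are illustrative, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)   # pairwise similarities
    weights = softmax(scores, axis=-1)               # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, np.allclose(w.sum(-1), 1.0))   # (4, 8) True
```

The division by the square root of $d_k$ keeps the dot products from growing with the key dimension, which would otherwise push the softmax into a low-gradient regime.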

MSA is the core component of the transformer. Its structure is shown in Fig. 5; it differs from SA in that the input of MSA is divided into many small parts. The scaled dot-product attention of each part is computed in parallel, and all the attention outputs are concatenated to obtain the final result:

$$\mathrm{MSA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^{O}$$

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$$

where $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ are trainable parameter matrices.
Fig. 5

Multiheaded SA structure.

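A minimal NumPy sketch of the multi-head computation follows; the head count and dimensions are illustrative choices, not the paper's settings.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.swapaxes(-2, -1) / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(x, Wq, Wk, Wv, Wo, h):
    """MSA: project Q/K/V, split them into h heads, attend in parallel,
    concatenate the head outputs, and mix them with W_O."""
    N, D = x.shape
    d = D // h                                    # per-head dimension
    Q, K, V = x @ Wq, x @ Wk, x @ Wv              # each (N, D)
    heads = [attention(Q[:, i*d:(i+1)*d], K[:, i*d:(i+1)*d], V[:, i*d:(i+1)*d])
             for i in range(h)]                   # h outputs of shape (N, d)
    return np.concatenate(heads, axis=-1) @ Wo    # (N, D)

rng = np.random.default_rng(1)
D, h = 64, 8
x = rng.standard_normal((10, D))
Wq, Wk, Wv, Wo = (rng.standard_normal((D, D)) * 0.1 for _ in range(4))
print(multi_head_attention(x, Wq, Wk, Wv, Wo, h).shape)  # (10, 64)
```

Splitting into heads lets each head attend to a different subspace of the representation at no extra cost relative to a single full-width attention.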

MLP and embedding layers

In [51], the MLP head immediately follows the transformer encoder and is composed of a fully connected (Linear) layer; the final class token passes through it to output the predicted class. An MLP block is also present in the transformer encoder. The transformer encoder consists of a stack of encoder blocks, each composed of alternating MSA and MLP blocks. In the encoder, layer normalization (LN) is applied before each block, and a residual connection is added after each block. An MLP block is composed of two fully connected layers with a nonlinear activation function (a Gaussian error linear unit (GELU)) and a dropout layer. The structural composition is shown in Fig. 6.
Fig. 6

Transformer block structure.

The input image is split into fixed-size patches and then convolved, and the resulting vector is flattened and mapped to the corresponding size dimension by using a trainable linear projection.
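The pre-LN encoder block described above (LN before each sub-block, residual connection after, MLP with GELU and dropout) can be sketched in PyTorch; the hyperparameters follow common ViT-Base defaults, which we assume here rather than take from the paper.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One ViT encoder block: LayerNorm before each sub-block (pre-LN),
    a residual connection after each, and MLP = Linear -> GELU -> Dropout -> Linear."""
    def __init__(self, dim=768, heads=12, mlp_ratio=4, drop=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, dropout=drop, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Dropout(drop),
            nn.Linear(dim * mlp_ratio, dim), nn.Dropout(drop))

    def forward(self, x):
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]  # MSA + residual
        return x + self.mlp(self.norm2(x))                 # MLP + residual

z = torch.randn(2, 197, 768)          # a batch of token sequences
out_block = EncoderBlock()(z)
print(out_block.shape)  # torch.Size([2, 197, 768])
```

Stacking several such blocks and reading out the class token reproduces the encoder side of the architecture in Fig. 6.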

Experiments

Datasets

COVID-19 dataset

The COVID-19 dataset required for the experiments in this paper is obtained from [6], with a total of 746 CT scans. Of these, 349 COVID-19-positive CT images were collected from papers on COVID-19 in medRxiv and bioRxiv, and the other 397 COVID-19-negative CT images were obtained from PubMed Central (a search engine) and MedPix (a publicly available online medical image database containing CT scans of various diseases). The data distributions for the two categories are shown in Table 1.
Table 1

COVID-19 dataset table.

Category         Quantity
COVID-19         349
NonCOVID-19      397

TB dataset

The TB dataset is obtained from the ImageCLEF2021 challenge, which was used to classify TB cases into five main types: (1) infiltrative; (2) focal; (3) tuberculoma; (4) miliary; and (5) fibro-cavernous. We use 917 3D CT scans, which are stored in the NIFTI file format. Each slice has an image size of 512 × 512 pixels, and the number of slices is approximately 100; the dataset distribution is shown in Table 2 below.
Table 2

TB dataset.

Type              Amount
Infiltrative      420
Focal             226
Tuberculoma       101
Miliary           100
Fibro-cavernous   70
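Splitting a 3D NIFTI volume into 2D slices, as done for this dataset, can be sketched as follows. The nibabel call in the comment is the usual way to read NIFTI files; here a synthetic NumPy volume with the sizes stated above stands in for real data.

```python
import numpy as np

# A 3D TB scan as loaded from a NIFTI file, e.g. (hypothetical path):
#   volume = nibabel.load("scan.nii.gz").get_fdata()
# We stand in a synthetic volume with the sizes stated above:
volume = np.zeros((512, 512, 100))            # 512x512 slices, ~100 per scan

def axial_slices(vol, keep_every=1):
    """Split a (H, W, S) volume into S two-dimensional axial slices."""
    return [vol[:, :, i] for i in range(0, vol.shape[2], keep_every)]

slices = axial_slices(volume)
print(len(slices), slices[0].shape)  # 100 (512, 512)
```

Applied to all 917 scans, roughly 100 slices per scan yields a slice pool on the order of the 110,000 CT images used for stage-one training.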

Preprocessing

The images are first matched one by one with the category labels, and then each image is resampled to (224, 224); thus, the final 746 CT scans are all processed to a size of (3, 224, 224). The data are augmented using horizontal flipping, vertical flipping, rotation, brightening, and darkening operations. The distributions of the data before and after augmentation are shown in Table 3, where the values in brackets indicate the data after augmentation.
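The five augmentation operations listed above can be sketched with NumPy on a channel-first image; the rotation angle and brightness factors below are illustrative choices, not values reported in the paper.

```python
import numpy as np

def augment(img):
    """Apply the five operations listed above to a (C, H, W) image in [0, 1].
    The 90-degree rotation and the 1.2 / 0.8 brightness factors are
    illustrative, not the paper's settings."""
    return {
        "hflip":    img[:, :, ::-1],                  # horizontal flip
        "vflip":    img[:, ::-1, :],                  # vertical flip
        "rot90":    np.rot90(img, k=1, axes=(1, 2)),  # rotation
        "brighten": np.clip(img * 1.2, 0.0, 1.0),     # brightening
        "darken":   np.clip(img * 0.8, 0.0, 1.0),     # darkening
    }

img = np.random.default_rng(0).random((3, 224, 224))  # one preprocessed scan
aug = augment(img)
print(len(aug), all(a.shape == img.shape for a in aug.values()))
```

Each training image plus its augmented copies accounts for the roughly ninefold growth of the training set shown in Table 3 (different parameter settings per operation yield multiple copies).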
Table 3

Distributions before and after data enhancement.

Type      COVID-19      NonCOVID-19     Total
Train     280 (2520)    318 (2862)      598 (5382)
Test      69            79              148
Total     349 (2589)    397 (2941)      746 (5530)

Experimental setup and evaluation criteria

In our experiments, we randomly divide the dataset into a training set (598) and a test set (148) based on a ratio of 8:2, and data augmentation is applied to the training set. The distributions of the dataset before and after augmentation are shown in Table 3. The training set is used to optimize the parameters. We use the stochastic gradient descent (SGD) optimizer with a learning rate of 0.01, a momentum of 0.9, and weight decay, and training is conducted for a total of 30 rounds. The main objective of this paper is to classify COVID-19. A classification result can be positive (infected with COVID-19) or negative (not infected with COVID-19), and the predicted category may or may not align with the actual category. In this setup, a true positive (TP) indicates an actual COVID-19 case whose CT image is correctly classified as COVID-19. A false positive (FP) indicates an actual non-COVID-19 case incorrectly classified as COVID-19. A true negative (TN) indicates an actual non-COVID-19 case whose CT image is correctly classified as non-COVID-19. Finally, an FN indicates an actual COVID-19 case incorrectly classified as non-COVID-19. The performance of the model is evaluated with several commonly used criteria: accuracy (Acc), precision, recall, and the F1 value. These metrics are defined as follows:

$$Acc = \frac{TP + TN}{TP + TN + FP + FN}, \qquad Precision = \frac{TP}{TP + FP}$$

$$Recall = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

In medical research, especially for major infectious diseases such as COVID-19, it is very important to reduce the numbers of FP and FN results. In particular, FNs should be minimized to avoid classifying any COVID-19 patients as non-COVID-19 pneumonia patients, as the coronavirus may cause more significant damage. Of course, it is also important to minimize the FP rate to avoid the unnecessary waste of manpower and resources.
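The four metrics can be computed directly from the confusion-matrix counts; the counts in the example below are illustrative, not the paper's experimental results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Acc, Precision, Recall, and F1 from confusion-matrix counts,
    with COVID-19 as the positive class."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return acc, precision, recall, f1

# Illustrative counts for a 148-image test set (not the paper's numbers):
acc, p, r, f1 = classification_metrics(tp=62, fp=6, tn=73, fn=7)
print(f"Acc={acc:.4f} P={p:.4f} R={r:.4f} F1={f1:.4f}")
```

Note that recall is the complement of the FN rate, so maximizing recall directly addresses the priority on minimizing FNs discussed above.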

Classification model comparison

We evaluate DenseNet169 [12], ResNet101 [22] and ResNet34 [22] on the COVID-19 dataset and compare them with the ViT model; all of them are pretrained on the ImageNet dataset. The experimental results are shown in Table 4. The number of training rounds is set to 30. From the experimental results, ViT achieves accuracy improvements of 3.37 %, 4.05 %, and 5.4 % over DenseNet169, ResNet101, and ResNet34, respectively. Moreover, compared to these models, the ViT converges more quickly, and its area under the receiver operating characteristic curve (Auc) is 0.9545, which means that the COVID-19 classification model based on the ViT performs better. These comparative experiments show that the ViT model remains more effective than the other classification models on this image classification task.
Table 4

Comparison among the pretrained models.

Model | Acc. | Precision | Recall | F1 | Auc | Time (min)
DenseNet169 | 0.8649 | 0.9538 | 0.7848 | 0.8611 | 0.9274 | 8
ResNet101 | 0.8581 | 0.9028 | 0.8228 | 0.8609 | 0.9176 | 15
ResNet34 | 0.8446 | 0.8500 | 0.8608 | 0.8553 | 0.9439 | 10
ViT | 0.8986 | 0.9103 | 0.8987 | 0.9045 | 0.9545 | 9.4
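The Auc values in Table 4 summarize ranking quality independently of any decision threshold. A minimal sketch (not the authors' code) of how ROC Auc can be computed via the equivalent Mann-Whitney formulation, using hypothetical predicted probabilities:

```python
# Minimal sketch: ROC Auc as the probability that a randomly chosen
# positive sample is scored higher than a randomly chosen negative one
# (ties count as 0.5), i.e. the Mann-Whitney U statistic normalized.

def roc_auc(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical probabilities for 4 COVID-19 (label 1) and
# 4 nonCOVID-19 (label 0) CT images:
y = [1, 1, 1, 1, 0, 0, 0, 0]
p = [0.9, 0.8, 0.7, 0.35, 0.6, 0.4, 0.3, 0.1]
print(roc_auc(y, p))
```

This pairwise formulation makes clear why Auc is threshold-free: only the relative ordering of scores matters, not their absolute values.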

Comparison of transfer models

We first use the TB dataset, whose construction is shown in Table 2. The pretrained model is obtained by training on the ImageNet dataset [27], which belongs to the general-purpose domain, whereas the dataset used in this experiment consists of COVID-19 CT images from the specialized medical domain, so this scenario involves heterogeneous transfer learning. Since the TB CT volumes are stored in the NIfTI file format, we first slice them, train the pretrained model on the resulting 110,000 CT slice images, and save the best-performing parametric model. Several types of slices are shown in Fig. 7.
Fig. 7

Five types of TB slices.

We use the best-performing model obtained through heterogeneous transfer learning as a pretrained model on the COVID-19 dataset, whose distribution is shown in Table 1. The comparison results obtained in our experiments are shown in Table 5. Among the tested methods, ViT-HTL denotes heterogeneous transfer learning only, and TL-Med denotes further homogeneous transfer learning on top of heterogeneous transfer learning; ResNet34-TL, ResNet101-TL and DenseNet169-TL denote models pretrained on the TB dataset and then fine-tuned on the COVID-19 dataset. The results show that homogeneous transfer learning yields better performance. The main reason is that the TB and COVID-19 datasets have similar structures, i.e., the feature spaces of the source and target domains are similar, so the features learned by the models are also similar.
Table 5

Transfer learning comparison.

Model | Acc. | Precision | Recall | F1 | Auc | Time (min)
ResNet34-TL | 0.7770 | 0.8833 | 0.6709 | 0.7626 | 0.8556 | 9
ResNet101-TL | 0.7635 | 0.7683 | 0.7975 | 0.7826 | 0.7666 | 10
DenseNet169-TL | 0.8919 | 0.8795 | 0.9241 | 0.9012 | 0.9532 | 10
ViT-HTL | 0.8986 | 0.9103 | 0.8987 | 0.9045 | 0.9545 | 9.4
TL-Med | 0.9122 | 0.9459 | 0.8861 | 0.9150 | 0.9606 | 9.3

Ablation experiments

In this subsection, we explore the impact of the corresponding settings on the overall TL-Med model performance; in other words, we examine whether the pretrained model is frozen, whether the prelogit module (PL), which consists of a linear layer and a nonlinear activation function (Tanh), is added, and the before-and-after effect of data augmentation on the resulting model. From Table 6, concerning the freezing of the pretrained model, we can see that the models trained without freezing are generally better than those trained with freezing. Under otherwise identical experimental settings, changing only the freeze setting improves the accuracy by 6.08 %, 4.06 %, and 7.43 %, which shows that this setting has a great impact on model performance. This is not difficult to understand: the freezing operation freezes all the earlier layers and trains only the MLP head module, i.e., the classification output layer, so the model has fewer adjustable parameters and the achievable performance improvement is limited. Training without freezing uses the pretrained weights as the initial weights, and all of the weights participate in training, so the overall effect is better.
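The freeze/no-freeze ablation can be sketched in PyTorch. This is an illustrative stand-in, not the authors' code: a tiny two-module model replaces the ViT backbone and MLP head, and the function below toggles which parameters are trainable and counts them:

```python
# Sketch (assumed, not from the paper): "freeze" trains only the head,
# "no-freeze" starts from the pretrained weights and updates everything.
import torch.nn as nn

def set_freeze(model: nn.Module, head_name: str, freeze: bool) -> int:
    """Freeze the backbone (or unfreeze everything) and return the
    number of trainable parameters."""
    for name, p in model.named_parameters():
        p.requires_grad = (not freeze) or name.startswith(head_name)
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Tiny stand-in model: a "backbone" plus a classification "head".
model = nn.Sequential()
model.add_module("backbone", nn.Linear(768, 768))
model.add_module("head", nn.Linear(768, 2))

frozen = set_freeze(model, "head", freeze=True)   # only the head trains
full = set_freeze(model, "head", freeze=False)    # all weights train
print(frozen, full)
```

Even in this toy, freezing shrinks the trainable parameter count by two orders of magnitude, which mirrors the paper's observation that a frozen pretrained model has too few adjustable parameters to benefit fully from extra data.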
Table 6

Ablation experiments.

Method | Model | Acc. | Precision | Recall | F1 | Auc | Time (min)
I | TL-Med + freeze | 0.8514 | 0.8800 | 0.8354 | 0.8571 | 0.9066 | 7.2
II | TL-Med + no-freeze | 0.9122 | 0.9459 | 0.8861 | 0.9150 | 0.9606 | 9.3
III | TL-Med + freeze + PL | 0.8716 | 0.9054 | 0.8481 | 0.8758 | 0.9197 | 7.1
IV | TL-Med + no-freeze + PL | 0.9122 | 0.9342 | 0.8987 | 0.9161 | 0.9576 | 9.3
V | TL-Med + freeze + PL + dataAug | 0.8581 | 0.8919 | 0.8354 | 0.8627 | 0.9145 | 337
VI | TL-Med + no-freeze + PL + dataAug | 0.9324 | 0.9600 | 0.9114 | 0.9351 | 0.9686 | 364.9
To give the network model greater learning and fitting capability and to strengthen its representation power, we add the PL module before the last (linear) layer, where the PL and the linear layer together constitute the MLP head, denoted as linear + Tanh activation. That is, the MLP head consists of two linear layers and a nonlinear activation function. Comparing Method I with Method III, adding the PL module improves the experimental results, with a 2.02 % gain in accuracy. Comparing Method II with Method IV, the accuracy is not obviously improved, but the recall rises by 1.26 % and the F1 value by 0.11 %. The improvement in recall indicates a reduction in FNs, which minimizes the risk of the model diagnosing a COVID-19 patient as a nonCOVID-19 patient during test identification, an especially important property for a major infectious disease such as COVID-19.

The distribution of the data after augmentation is shown in Table 3. Pretraining is a major modeling technique in CV, in which a model pretrained on one dataset is applied to another dataset. From Table 6, we can see that when the pretrained weights are frozen, data augmentation decreases the model performance by 1.35 %. The reason may be that once the pretrained part of the network is frozen, data augmentation adds more labeled data to the training process, which weakens the effect of pretraining. In Fig. 8, we can see that the test set accuracy rises faster before data augmentation; after augmentation, it rises slowly or even plateaus, and the overall increase is smaller than before augmentation.
We can therefore observe that data augmentation adds more labeled data but has some negative effects on pretraining, which directly reduces model accuracy; that is, data augmentation and more labeled data do not necessarily improve model performance in the pretraining regime. Zoph et al. [66] studied pretraining in detection and segmentation tasks and suggested that stronger data augmentation and more labeled data reduce the value of pretraining. However, comparing Method IV with Method VI, data augmentation does improve model performance by increasing the amount of data. Thus, with the pretrained model frozen, the limited number of trainable parameters restricts the model's ability to learn additional data-relevant features from augmentation; the number of trainable parameters clearly affects how much data augmentation helps. After applying data augmentation without freezing the model, the accuracy, precision, recall, F1 value and Auc are all substantially improved, which implies that increasing the number of trainable parameters and combining them with more labeled data may effectively improve model performance.
Fig. 8

Test set accuracies before and after data enhancement experiments.


Discussion

In recent years, many researchers have studied the detection of COVID-19 and developed deep learning-based COVID-19 detection models. Table 7 compares our proposed method with the existing literature. Because the datasets, validation methods, and reported evaluation metrics differ, a fully fair comparison with the results reported in the literature cannot be made. It is nevertheless worth noting that our method achieves relatively good performance on a dataset of only 746 CT images; the number of images we use is smaller than that used by the other methods. The method of Xiao et al. [28] achieved high accuracy on two datasets. The method of Narendra et al. [67] achieved 99.12 % accuracy, 99 % recall and a 99 % F-score on a balanced dataset of 400 COVID-19 and 400 normal images. The method of Nayeeb et al. [68] achieved 99.39 % precision, 99.39 % recall and a 99.19 % F-score using a limited dataset. Mahesh et al. [69] achieved an accuracy of 98.30 %, a recall of 98.31 % and an F-score of 98 % on a relatively large dataset. The method of Bejoy et al. [31] achieved an accuracy of 91.15 % on its first dataset and 97.43 % on its second dataset, which contained only 71 COVID-19 and 7 nonCOVID-19 images. Shome et al. [41] achieved 93.2 % precision and 96.09 % recall. Model training is a process of continuous adjustment and optimization: Ruochi et al. [32] achieved 91.08 % accuracy on a small dataset, and Mohammad et al. [29] achieved an accuracy of 99.11 % by continuously optimizing the model to obtain the best combination of hyperparameters. Panwar et al. [70], in contrast, achieved an accuracy of 88.1 % on a balanced but relatively small dataset.
From the table, we can also conclude that dataset size has a certain impact on model performance. The studies in [67], [69] and [70] were based on popular convolutional neural network architectures such as VGG16, ResNet50 and Xception. The studies in [30] and [68] used a two-stage training scheme, which differed from the scheme used in this paper in that they both employ separate network architectures, and both perform transfer learning on datasets closely related to COVID-19.
Table 7

Comparison of our proposed method with the existing literature.

Method | Dataset | Acc. | Precision | Recall | F1 | Auc
Xiao et al. [28] | Test data 1: 2567 COVID-19, 2567 normal, 2567 pneumonia | 0.9557 | 0.99 | 0.99 | 0.99 | -
Xiao et al. [28] | Test data 2: 756 COVID-19, 6284 normal, 3478 pneumonia | 0.9444 | 0.95 | 0.95 | 0.95 | -
Narendra et al. [67] | 400 COVID-19, 400 normal | 0.9912 | 0.99 | 0.99 | 0.99 | -
Nayeeb et al. [68] | 408 COVID-19, 816 non-COVID | 0.9939 | 0.9919 | 0.9939 | 0.9919 | -
Mahesh et al. [69] | 2249 COVID-19, 2396 no-findings | 0.9830 | - | 0.9831 | 0.98 | 0.999
Bejoy et al. [31] | Test data 1: 453 COVID-19, 497 non-COVID | 0.9115 | 0.853 | 0.985 | 0.914 | 0.963
Bejoy et al. [31] | Test data 2: 71 COVID-19, 7 non-COVID | 0.9743 | 0.986 | 0.986 | 0.986 | 0.911
Govardhan et al. [30] | 250 COVID-19, 965 other | 0.9777 | 0.9714 | 0.9714 | - | -
Shome et al. [41] | 10819 COVID-19, 10314 normal | - | 0.9320 | 0.9609 | - | -
Ruochi et al. [32] | 189 COVID-19, 63 other, 235 normal | 0.9108 | - | - | - | -
Mohammad et al. [29] | 184 COVID-19, 5000 normal | 0.9911 | - | - | - | -
Panwar et al. [70] | 142 COVID-19, 142 non-COVID | 0.881 | 0.881 | - | - | -
Proposed method | 349 COVID-19, 397 non-COVID | 0.9324 | 0.96 | 0.9114 | 0.9351 | 0.9686
The model proposed in this paper is developed with the aim of detecting COVID-19 patients from their chest CT images under clinical conditions. On this basis, the model can assist specialist practitioners in rapid diagnostic screening during an outbreak; its purpose is to rapidly distinguish COVID-19 from other diseases. The experimental results show that our model can be used for the initial rapid screening of suspected cases and can provide effective assistance to front-line medical staff, further improving the efficiency of COVID-19 detection and helping to stop its spread. As shown in the confusion matrix in Fig. 9, the proposed model produces comparatively more false negatives, which will be a direction for future improvement. As to whether the results of a deep learning method constitute a reliable diagnosis, given the rigorous nature of medicine, we will conduct experiments on additional COVID-19 datasets in the future to further demonstrate the performance of our model.
Fig. 9

The confusion matrix of the proposed TL-Med framework.


Conclusion

In the past few decades, machine learning has developed rapidly and has been applied in many industries and fields. Since medical image recognition is one of the most fundamental and difficult problems in the medical field, using computers to assist doctors in identifying and detecting cases is a common application of machine learning. In this paper, a two-stage transfer learning model (TL-Med) based on the ViT is proposed to detect and identify CT data. To detect COVID-19 effectively, we first pretrain on the TB dataset to obtain medical features, take the best resulting model as a pretrained model, and then train and test on the COVID-19 dataset. To overcome the problem of data scarcity, we use a data augmentation technique, which effectively improves the training of TL-Med by enabling it to learn more classification features and parameters from the enriched training set. The classification model can effectively aid clinicians in the detection and identification of COVID-19. Since data scarcity is a common phenomenon in the medical field, we plan to apply TL-Med to other medical image detection and classification scenarios in the future, such as breast cancer, brain tumors, and diabetic retinopathy.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Related articles (33 in total)

1.  Multi-task vision transformer using low-level chest X-ray feature corpus for COVID-19 diagnosis and severity quantification.

Authors:  Sangjoon Park; Gwanghyun Kim; Yujin Oh; Joon Beom Seo; Sang Min Lee; Jin Hwan Kim; Sungjun Moon; Jae-Kwang Lim; Jong Chul Ye
Journal:  Med Image Anal       Date:  2021-11-04       Impact factor: 8.545

2.  Multi-COVID-Net: Multi-objective optimized network for COVID-19 diagnosis from chest X-ray images.

Authors:  Tripti Goel; R Murugan; Seyedali Mirjalili; Deba Kumar Chakrabartty
Journal:  Appl Soft Comput       Date:  2021-12-09       Impact factor: 6.725

3.  A novel unsupervised approach based on the hidden features of Deep Denoising Autoencoders for COVID-19 disease detection.

Authors:  Michele Scarpiniti; Sima Sarv Ahrabi; Enzo Baccarelli; Lorenzo Piazzo; Alireza Momenzadeh
Journal:  Expert Syst Appl       Date:  2021-12-16       Impact factor: 6.954

4.  COVID-Transformer: Interpretable COVID-19 Detection Using Vision Transformer for Healthcare.

Authors:  Debaditya Shome; T Kar; Sachi Nandan Mohanty; Prayag Tiwari; Khan Muhammad; Abdullah AlTameem; Yazhou Zhang; Abdul Khader Jilani Saudagar
Journal:  Int J Environ Res Public Health       Date:  2021-10-21       Impact factor: 3.390

5.  A novel algorithm for detection of COVID-19 by analysis of chest CT images using Hopfield neural network.

Authors:  Saeed Sani; Hossein Ebrahimzadeh Shermeh
Journal:  Expert Syst Appl       Date:  2022-02-24       Impact factor: 6.954

6.  COVID-WideNet-A capsule network for COVID-19 detection.

Authors:  P K Gupta; Mohammad Khubeb Siddiqui; Xiaodi Huang; Ruben Morales-Menendez; Harsh Pawar; Hugo Terashima-Marin; Mohammad Saif Wajid
Journal:  Appl Soft Comput       Date:  2022-03-29       Impact factor: 8.263

7.  COVID19XrayNet: A Two-Step Transfer Learning Model for the COVID-19 Detecting Problem Based on a Limited Number of Chest X-Ray Images.

Authors:  Ruochi Zhang; Zhehao Guo; Yue Sun; Qi Lu; Zijian Xu; Zhaomin Yao; Meiyu Duan; Shuai Liu; Yanjiao Ren; Lan Huang; Fengfeng Zhou
Journal:  Interdiscip Sci       Date:  2020-09-21       Impact factor: 2.233

8.  AutoCovNet: Unsupervised feature learning using autoencoder and feature merging for detection of COVID-19 from chest X-ray images.

Authors:  Nayeeb Rashid; Md Adnan Faisal Hossain; Mohammad Ali; Mumtahina Islam Sukanya; Tanvir Mahmud; Shaikh Anowarul Fattah
Journal:  Biocybern Biomed Eng       Date:  2021-10-20       Impact factor: 4.314

