Literature DB >> 36160764

Ensemble multimodal deep learning for early diagnosis and accurate classification of COVID-19.

Santosh Kumar1, Sachin Kumar Gupta2, Vinit Kumar3, Manoj Kumar4, Mithilesh Kumar Chaube5, Nenavath Srinivas Naik1.   

Abstract

Over the past few years, the COVID-19 pandemic has grown into a lethal disease. The medical diagnostic equipment, methodologies, and clinical testing procedures used for the early diagnosis of infected individuals require extra time to process the gathered samples. As a solution to this issue, an innovative multimodal paradigm for the early diagnosis and precise categorization of COVID-19 is proposed. Chest X-ray-based and cough-based models are used to extract distinguishing features from a prepared chest X-ray image and cough (audio) database, together with other public chest X-ray image datasets and the Coswara cough (audio) dataset containing 92 COVID-19-positive and 1079 healthy subjects, using a deep Uniform-Net and a Convolutional Neural Network (CNN). The weighted sum-rule fusion method and ensemble deep learning algorithms are then used to combine the extracted features. For the early diagnosis of patients, the framework offers an accuracy of 98.67%.
© 2022 Elsevier Ltd. All rights reserved.

Entities:  

Keywords:  COVID-19; Deep learning; Ensemble Learning; Fusion; Machine Learning

Year:  2022        PMID: 36160764      PMCID: PMC9485428          DOI: 10.1016/j.compeleceng.2022.108396

Source DB:  PubMed          Journal:  Comput Electr Eng        ISSN: 0045-7906            Impact factor:   4.152


Introduction

Recently, the COVID-19 pandemic, caused by infection with the SARS-CoV-2 virus, has continued to pose a substantial threat to global health [1]. Infections of the human respiratory system, such as chest infection due to Tuberculosis (TB) and other severe infections, are significant and have spread globally. The pandemic has exposed medical difficulties in several respects, including sharp rises in the requirements for medical facilities, hospital resources, and diagnostic kits, and significant shortages of all three [2]. Moreover, frontline workers, various healthcare staff members, and health workers have been infected and have died from severe COVID-19 infection. Furthermore, existing statistical learning models and unimodal systems cannot classify COVID-19 accurately because the labeled datasets themselves are not accurate. These unimodal systems therefore cannot perform early diagnosis of patients based on features extracted from scanned images of different body parts of infected people. Pre-processing unlabeled medical datasets to extract prominent features for further analysis of patients' symptoms requires colossal computation. Statistical learning models are therefore not effective for the early diagnosis of COVID-19 patients in the general population. Moreover, statistical learning models and unimodal learning systems also need annotated lesions, especially for disease diagnosis in Computerized Tomography (CT) volumes, to predict COVID-19 accurately [1]. To predict chest infection, the annotated lesions on chest CT scan images are segmented into distinct regions from which discriminatory features are extracted for the early diagnosis of COVID-19 using machine learning techniques.
This requires much effort, massive computation, and high costs for experts such as radiologists and other medical COVID-19 testing departments, which is neither manageable nor acceptable while COVID-19 is spreading fast across the world. Significant shortages of radiologists and other medical staff are the major problems in monitoring massive numbers of patients when no vaccine or specific treatment is available [1]. There is therefore a need to design and develop a multimodal framework for the early diagnosis and accurate prediction of COVID-19 patients [1], [2], [3], [4]. Such a system realizes a multimodal health assessment methodology by monitoring multiple vital conditions and correlating the collected medical data to provide continual, real-time assessment of the patient's health. Deep multimodal learning techniques are proliferating due to their wide applicability to the early diagnosis of COVID-19. Thus, performing COVID-19 diagnosis in a multimodal framework is of great importance for improving the system's overall performance in accurately predicting COVID-19 [1], [2], [3]. One of the most straightforward labels for COVID-19 diagnosis is the patient label based on the cough and chest X-ray image database.

Motivation

Traditional clinical procedures are used for the early diagnosis of infected people [1], [2], [3], but they take considerable time to diagnose COVID-19 patients [5]. Pathological testing methods are also very time-consuming. To overcome this problem, the real-time polymerase chain reaction (RT-PCR) method is used to diagnose infected people; however, RT-PCR takes a couple of hours to return diagnosis results for COVID-19 [6], [7], [8], [9], [10], [11]. To enable early diagnosis of COVID-19, several researchers have contributed significantly to alleviating early diagnostic procedures for COVID-19 patients using deep learning techniques [1], [2], [3], [12]. Owing to the availability of massive large-scale annotated image datasets, deep learning techniques have proliferated thanks to their great success across different applications. Deep Convolutional Neural Networks (CNNs) and other deep models have been used successfully for image classification and object recognition. However, annotating image data for medical analysis, classifying medical images for early diagnosis, and accurately predicting COVID-19 patients remain the biggest challenges in medical diagnosis. Interdisciplinary researchers have recently proposed frameworks offering efficient solutions for detecting COVID-19. Among these, deep multimodal learning-based and Artificial Intelligence (AI) learning models are proliferating due to their broad applicability, playing an essential role in fighting the COVID-19 pandemic by drawing on several datasets, such as chest X-ray and cough sound datasets, for better analysis [5], [13], [14], [15]. To the best of our knowledge, no literature has yet been published on multimodal early diagnosis of the spread of COVID-19.
We therefore hypothesize that cough samples, chest X-rays, and a diagnostic model can be combined for the early diagnosis of COVID-19 patients. Integrating the discriminatory features of multiple modalities, and their possible interactions, is essential for an accurate prediction model of disease spread. In this paper, we address the problem: how can COVID-19 patients be diagnosed early and classified accurately? To solve it, we propose a novel multimodal framework for the accurate classification of COVID-19 patients based on chest X-ray images and cough (audio) sample datasets using deep multimodal learning techniques. The framework extracts discriminatory features from chest X-ray images and cough (audio) samples of COVID-19 patients using a convolutional neural network and speech signal processing techniques. It then builds a classification model that separates COVID-19 from non-COVID-19 cases based on the combined features extracted from the chest X-ray and cough datasets, using the weighted sum-rule fusion method and deep multimodal learning techniques.

Contribution

The major contributions of this work are as follows. A novel multimodal framework is proposed for the early diagnosis and accurate prediction of COVID-19 patients based on chest X-ray images and cough (audio) datasets using ensemble learning-based fusion techniques. In the proposed framework, a chest X-ray-based model and a cough (audio) diagnostic model are integrated: the former extracts discriminatory features such as texture and shape features from the chest X-ray image dataset using deep learning techniques, while the cough diagnostic model processes the cough (audio) samples to extract discriminatory features such as Mel-Frequency Cepstral Coefficients (MFCCs) using speech signal processing and machine learning techniques. The framework integrates the features extracted by the chest X-ray model and the cough model using the weighted sum-rule fusion method to predict COVID-19 patients accurately. The proposed framework combines the weighted sum-rule fusion method with a weighted average ensemble (WAE) strategy to fuse the outputs of the chest X-ray model and the cough diagnostic model for accurate prediction of COVID-19. The integrated framework provides early diagnosis while accounting for the varied class-level accuracy of the different ML models trained on the chest X-ray image dataset and the cough (audio) samples of COVID-19-infected people. The performance of the proposed framework is evaluated against existing methods and current state-of-the-art methods under different benchmark settings. The rest of the paper is organized as follows. Section 2 reviews the literature on COVID-19 detection. Section 3 describes the proposed framework and its steps. Section 4 presents the experimental results and performance evaluation under different existing benchmark protocols and methods. Finally, conclusions and prospective future directions are drawn in Section 5.

Literature work

In this section, the literature work is divided into different subsections describing the literature survey of each part of the system separately.

Chest-X ray based model

Various works on COVID-19 detection using chest X-rays or similar inputs are reviewed in this subsection. Wang et al. [1] proposed a deep learning-based model named COVID-Net, for which they claimed 93.30% accuracy; it uses a projection-expansion-projection-extension (PEPX) pattern. Tulin et al. [2] proposed classification techniques that classify diseases into multiple classes without transfer learning; their model provided good accuracy without over-fitting. They implemented two kinds of classifiers: a binary one for COVID-19 vs. no findings, and a ternary one for COVID-19, pneumonia, and no findings. Yujin et al. [3] proposed a deep learning algorithm with lung segmentation for the early diagnosis of COVID-19 and achieved better results. Lung segmentation is performed with the FC-DenseNet technique, and the segmented images are then fed to a deep CNN classification network based on ResNet, achieving 88.90% accuracy with segmentation and 79.8% without, for accurate prediction of COVID-19. In [5], the authors provided a comprehensive study of work done so far on COVID-19 detection using deep learning techniques and frameworks, defining the various components of these works and comparing the existing pre-trained models and datasets used across studies. According to this study, ResNet-50 is the most used learning model for COVID-19 detection [5], [13], [14], [15]; here too, the reported accuracy is not sufficient to predict COVID-19 patients. In [14], the well-known CheXNet model, a 121-layer convolutional neural network, was trained on the ChestX-ray14 dataset of 100,000 frontal-view chest X-ray images covering 14 diseases. This model has been deployed to detect coronavirus, including pneumonia, based on chest features for localizing crucial segmented lung regions. The literature on the chest X-ray modality is summarized in Table 1.
Table 1

Literature work based on chest X-ray modality system.

Ref.          Pros.                  Cons   Tech.
[1]           NA                     LA     VG
[3]           Lung segmentation      LA     FC-DenseNet
[2]           Effective              NS     DarkNet
CheXNet [15]  Precise localization   LA     CNN
VP [14]       RL                     SL     CAAD model

Abbreviation: Ref.=Reference, Tech.=Techniques, LA=Low Accuracy (%), VG=VGG-19, ResNet-50, NA=Not Available, VP=Viral pneumonia screening, RL=Reinforces one-class model, SL=Single class, so not useful in COVID detection, NS=No segmentation technique used.


Cough diagnostic based model

Cough-based COVID-19 diagnosis has been reported in the literature, with several researchers proposing different frameworks for classifying COVID-19 patients. Imran et al. [8] collected cough samples of COVID-19 cases from open sources, together with bronchitis and pertussis patients and healthy individuals, and created an AI engine for COVID-19 classification [16]. Their system combines a cough detector and classifiers that use deep transfer learning and classical machine learning approaches. They developed a working prototype app that reports one of three outcomes: (i) COVID-19 likely, (ii) COVID-19 unlikely, and (iii) test inconclusive, with an overall accuracy of 92.64%. The major shortcoming of this method is that processing the cough sample datasets is time-consuming. Brown et al. [9] proposed a framework using machine learning techniques for classifying COVID-19 patients. They used a crowd-sourced dataset of 4352 cough samples, of which 235 declared having tested positive for COVID-19. They analyzed handcrafted features as well as features obtained through transfer learning, and tested classifiers such as Logistic Regression (LR), Gradient Boosting Trees (GBT), and Support Vector Machine (SVM) on various classification tasks, namely COVID-19-positive vs. not-declared-positive and COVID-19-positive cough vs. COVID-19-positive without cough, achieving an area under the curve (AUC) of 80% for early diagnosis. In a similar direction, Laguarta et al. [10] collected variable-length cough audio recordings at a bit-rate of 16 kbps. The dataset consists of 2660 COVID-19-positive samples with a 1:10 ratio of positive to control subjects; each sample is split into 6 s audio chunks and processed with a Mel-frequency cepstral coefficients (MFCCs) feature extraction technique.
Pramono [304] proposed a method based on features extracted from cough audio signals, using a logistic regression model to classify cough events for early diagnosis. The major shortcoming of this work is that the method was tested on a small cough dataset and hence suffers from overfitting [304]. Table 2 summarizes the literature on cough (audio)/sound modality systems and frameworks for the early diagnosis of COVID-19.
Table 2

Literature work based on cough sound modality system/frameworks.

Ref.   Advantage   Dis               Tech.
[8]    MD          HSR               CNN
[9]    HF          A                 DL
[10]   Bio         Low-D             Bio-Model
[11]   TM          Not implemented   ML

Abbreviation: AD=Advantages, Dis=Disadvantages, Tech=Technique used, MD=Mediator-based architecture, HSR=High sample rate & Mel-spectrogram, CNN=Convolutional Neural Network, Low-D=Requires more data, Bio=Use of multiple biomarkers, TM=Simple theoretical model and proof methods, FD=Faulty way of data collection, DL=Deep/machine learning & pattern recognition, A=Low AUC, EA=Easy availability, HF=Easy availability.

In [17], the authors note that the electromyography (EMG) signal is widely employed in the medical field, including biomedical and clinical domains, owing to its ability to differentiate neuromuscular diseases. Numerous neuromuscular disorders arise naturally from the human nerves, muscles, and spinal cord. Table 3 lists current state-of-the-art works on the early detection of COVID-19 based on chest X-ray and cough sample datasets.
Table 3

Current state-of-the-art works in the early detection of COVID-19 using ML techniques based on chest and cough samples.

Study  Year  Rs         RSS    NS       NR    PS                 TM            Accuracy
[18]   2020  Sp         Cough  CO       3621  ST                 ResNet-18     AUC (0.72)
[10]   2020  Web based  Cough  COD      5320  –                  ResNet-50     97.10%
[19]   2021  Web based  Cough  CODD     1502  MFCC               ENCNN         77%
[20]   2021  ES         B      COVID10  10    2DFT               Inception-v3  80%
[8]    2022  Sound      Chest  –        200   HSR, SFs           CNN           70%
[9]    2020  –          Cough  –        4352  Speech Pro., MFCC  DL, CNN       80%

Abbreviation: Speech Pro=Speech processing, T=dataset of 4352 unique people from the web app + 2261 unique people from the Android app (4352 and 5634 samples), Cough=Crowdsourced respiratory sound data, SF=Shape chest features, 2DFT=Two-dimensional (2D) Fourier transformation, COVID10=5 COVID-19 and 5 healthy (10 total), ES=Electronic stethoscope, B=Breathing, Sp=Smartphone app, TM=Trained Model, Rs=Recordings Source, RSS=Respiratory Sound, NS=Number of Subjects, NR=Number of Recordings, PS=Pre-processing Steps, ST=Short-term magnitude spectrogram, CO=COVID-19 and 1620 healthy, COD=2660 COVID-19 and 2660 healthy, CODD=114 COVID-19 and 1388 healthy, MFCC=Mel-frequency cepstral coefficients (spectrogram, Mel spectrum, power spectrum, tonal), ENCNN=Ensemble CNN.

Based on overall observations, many coronavirus disease 2019 (COVID-19) and post-COVID-19 patients experience muscle fatigue and severe infections. There is a need to design an early detection model that diagnoses muscle fatigue and muscular paralysis to predict and prevent complications in COVID-19 and post-COVID-19 patients. Clinical experts and examination play a considerable role in the early finding and diagnosis of these diseases; this research study provides an early diagnosis of the post-effects of COVID-19 [21] based on chest X-ray and cough samples, predicting COVID-19 effects using machine learning-based diagnostic techniques.

Proposed framework

This section describes the proposed multimodal learning framework. It consists of the following steps: (1) data preparation; (2) pre-processing and normalization; (3) feature extraction and representation from the pre-processed chest X-rays and cough (audio) samples; (4) feature selection and fusion using the weighted sum-rule fusion method; and (5) classification of COVID-19. The working flowchart of the proposed framework is shown in Fig. 1.
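As a minimal sketch of step (4), the weighted sum-rule fusion of the two modality models' class scores can be written as follows. The weights and score values here are illustrative assumptions, not values from the paper:

```python
def weighted_sum_fusion(p_xray, p_cough, w_xray=0.6, w_cough=0.4):
    """Weighted sum-rule fusion of per-class scores from two modality models.
    The weights are illustrative; in practice they would be derived from each
    model's validation accuracy and must sum to 1."""
    assert abs(w_xray + w_cough - 1.0) < 1e-9
    return [w_xray * a + w_cough * b for a, b in zip(p_xray, p_cough)]

# Hypothetical softmax scores for [COVID-19, non-COVID-19] from each model
fused = weighted_sum_fusion([0.9, 0.1], [0.7, 0.3])
label = "COVID-19" if fused[0] > fused[1] else "non-COVID-19"
```

The fused score for each class is simply the weight-scaled sum of the per-modality scores, so the better-performing modality can be given the larger weight.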
Fig. 1

Working flowchart of proposed framework using deep learning model.


Data preparation and description

We prepared a database of chest X-rays and cough (audio) samples of people infected with COVID-19 from the Department of Pulmonary Medicine & TB, AIIMS, Raipur, India. The database covers 20,000 patients with COVID-19 symptoms and non-COVID-19 cases, collected from different centers established in local regions to expedite the early diagnosis and accurate prediction of COVID-19. Fig. 2 shows the prepared chest X-ray image database.
Fig. 2

CXR images for normal and ill people from the prepared chest dataset, which categorizes chest X-ray images into three classes: (a) normal case, (b) pneumonia case, and (c) COVID-19 case.


Cough based COVID-19 detection model

The cough-based model consists of several steps: (1) input cough sample collection, (2) pre-processing and normalization of the collected data, (3) cough burst detection, (4) segmentation, (5) feature extraction, and (6) classification.

Cough data collection

For the detection of COVID-19 patients, we considered cough, breathing, and speech sound samples of individual patients (speakers/users) to train the proposed framework. We used several datasets from different open-source platforms. The Coswara cough database [16], provided by the Coswara project research group at IISc Bangalore, India [16], is used for the diagnosis of COVID-19 patients (shown in Table 4). The cough audio recordings were collected via worldwide crowd-sourcing using web applications from different speakers. The dataset comprises cough recordings (two kinds: heavy and shallow) and breathing recordings (two kinds: heavy and shallow), along with sustained vowel phonation (three kinds: a, e, o) and digit counting (two kinds: fast and regular). Metadata, namely age, gender, location (country, state), and presence of co-morbidity (pre-existing medical conditions), for both non-COVID-19 and COVID-19 patients is required to prepare the database for the early diagnosis and accurate prediction of COVID-19.
Table 4

Description of cough (audio) dataset.

Database        Size    Class ratio   Data types
Coswara [16]    8000    10:1          C+B+S
Coughvid [22]   20000   7:3           Cough sounds
DetectNow [23]  6500    8:5           Cough sounds
Virufy [13]     16      9:7           Cough sounds

Abbreviation: C+B+S = Cough (C), breathing (B), and speech sound (S).

We also recorded a cough dataset from patients who met the COVID-19 screening criteria. They were interviewed using a predefined questionnaire to accumulate a demographic dataset, with the primary objective of capturing disease-related and other infection-related information to help speed up COVID-19 diagnosis. Three voice recordings were collected from each participant using human-computer interface-enabled devices, including the microphones of their smartphones, with prepared sentences and individual words. The aim is to express the speech characteristics of infected COVID-19 patients and to study cough sound units in terms of the excitation source and the positions and movements of the articulators for early diagnosis of COVID-19. The recording protocol for cough voice consists of a continuous 'ah' sound for 5 s, a Thai polysyllabic sentence selected by a voice specialist for vocal apparatus analysis, and a cough sound. We also recorded the sentence "Uuunt Eekh Khata Hai" five times to measure the Place of Articulation (POA) and Manner of Articulation (MOA). MOA concerns how the vocal tract restricts airflow: completely stopping airflow with an occlusion creates a stop consonant, vocal tract constrictions occur in fricatives and vowels, and lowering the velum produces nasal sounds. POA, on the other hand, offers finer discrimination of phonemes: it refers to the point of the narrowest vocal tract constriction. Languages differ considerably in which places of articulation are used within the various MOAs; vowels, stops, nasals, and fricatives are present in almost every language, and beyond POA, certain phonemes further distinguish languages. All cough voice recordings are mono-channel, sampled at 44,100 Hz, with a maximum duration of 30 s.
Both the training and testing datasets were binary-labeled for COVID-19. The Coughvid [22], DetectNow [23], and Virufy [13] datasets are publicly available. The audio recordings of cough samples were collected in a separate folder and compressed along with their metadata into multiple files. The files were then uncompressed, and one folder per user was extracted, each containing audio recordings in .wav format (44.1 kHz). We used all the positive samples from the dataset, and the same number of negative samples was randomly chosen for a balanced distribution.
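The balancing step, keeping equal numbers of positive and negative samples, can be sketched as random undersampling of the majority class; the file names below are hypothetical placeholders mirroring the 92 positive / 1079 healthy split:

```python
import random

def balance_classes(positives, negatives, seed=0):
    """Randomly undersample the larger class so both classes are equal in size."""
    rng = random.Random(seed)
    k = min(len(positives), len(negatives))
    return rng.sample(positives, k), rng.sample(negatives, k)

# Hypothetical file lists for the two classes
pos = [f"covid_{i}.wav" for i in range(92)]
neg = [f"healthy_{i}.wav" for i in range(1079)]
pos_bal, neg_bal = balance_classes(pos, neg)
```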

Processing of audio cough data

The collected cough audio database is downsampled at 44.10 kHz. The audio samples use pulse-code modulation for transfer in mono-channel audio format. The samples are processed to remove artifacts and noise using low-pass filtering. A Chebyshev filter with a transition frequency of 10 Hz is used to maintain the high pitch of the cough audio while simultaneously attenuating environmental sounds. The pipeline of the working model is shown in Fig. 3.
Fig. 3

Pipeline for proposed cough based diagnostic model.
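A low-pass Chebyshev filtering step of this kind can be sketched with SciPy as below. The filter order, ripple, and cutoff frequency are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np
from scipy.signal import cheby1, filtfilt

FS = 44_100  # sampling rate of the mono PCM recordings (Hz)

def lowpass_cough(x, cutoff_hz, fs=FS, order=4, ripple_db=1.0):
    """Zero-phase type-I Chebyshev low-pass filter for a 1-D audio signal."""
    b, a = cheby1(order, ripple_db, cutoff_hz, btype="low", fs=fs)
    return filtfilt(b, a, x)  # filtfilt avoids phase distortion

# Synthetic test signal: a 300 Hz "cough" tone plus 15 kHz interference
t = np.arange(FS) / FS
noisy = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 15_000 * t)
clean = lowpass_cough(noisy, cutoff_hz=4_000)
```

Zero-phase filtering (`filtfilt`) is used so that the timing of cough bursts is preserved for the later segmentation step.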

The pre-processing of the cough (audio) is shown in Algorithm 1. The low-pass filter method is used to smooth spectrogram images. Short-term frequency analysis is applied to the filtered audio signals using empirical mode decomposition, which splits the cough audio sequence into smaller sequences, or simple modes; each mode contains the energy associated with different vowel and digit utterances at a particular scale. We applied down-sampling to smooth the collected cough samples, which are recorded at a 16 kHz sample rate. The significant challenges with these collected cough datasets were that they contain a lot of dead space (segments with no significant features) and that the recordings vary in length. To deal with the former, we created an amplitude envelope with a threshold of 100 to discard dead space and faint background noise falling below an amplitude of 100, leaving only the desired recordings. After obtaining the proper audio recordings, we divided the cough audio database into chunks of four seconds each, padding as needed. We used 80% of the dataset for training the proposed cough model, and the remaining 20% for testing and validation. Fig. 4 shows a wide-band spectrogram of a heavy cough sample.
Fig. 4

Cough (speech) signal for volume and its power spectrum.

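The envelope thresholding and four-second chunking described above can be sketched in NumPy as follows; the sample rate and the synthetic recording are assumptions for illustration:

```python
import numpy as np

SR = 16_000  # assumed sample rate after downsampling (Hz)

def trim_silence(x, threshold=100):
    """Keep only the span between the first and last samples whose absolute
    amplitude reaches the threshold, discarding dead space at the edges."""
    keep = np.flatnonzero(np.abs(x) >= threshold)
    return x[keep[0]:keep[-1] + 1] if keep.size else x[:0]

def chunk_and_pad(x, sr=SR, seconds=4):
    """Split into fixed-length chunks, zero-padding the final partial chunk."""
    n = sr * seconds
    x = np.pad(x, (0, (-len(x)) % n))
    return x.reshape(-1, n)

# Synthetic recording: quiet lead-in, loud cough region, quiet tail
rec = np.concatenate([np.zeros(5_000), 500 * np.ones(70_000), np.zeros(3_000)])
chunks = chunk_and_pad(trim_silence(rec))
```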

Feature extraction

Coughing is one of the most significant symptoms for the early diagnosis of COVID-19. The cough (audio) signal is represented as a sequence of spectral vectors. This time-vs.-frequency representation, the spectrogram, shows that the cough signal contains high-energy bands. The spectrogram provides significant information on the cough signal's strength over time at various frequencies, which appears as characteristic high-energy bands in the waveform. We computed Mel-Frequency Cepstral Coefficients (MFCCs) of the IMFs, and the windowed cough signal samples from the spectrograms are used as MFCC features to classify coughs with a deep convolutional neural network model. The proposed framework first extracts MFCC features from the prepared cough audio datasets, resulting in a 25000 × 36 feature matrix. We then select the principal components of the extracted features using the Principal Component Analysis (PCA) method. PCA finds the prominent features among the linearly projected features in the feature space; it is a linear projection method that computes the covariance matrix of the MFCC features, keeps 95% of the principal component information, and integrates it into a feature vector of dimension 5000 (500*10). Combining MFCC feature extraction with PCA lets us find the principal components of the cough sample, improving the accuracy of automatic speech recognition systems while reducing the feature dimension. The computation of MFCC features is illustrated in Algorithm 3. The performance of voice-based classification tasks is greatly enhanced by adding time derivatives to the basic static MFCC parameters of the cough samples; these are popularly known as the delta (Δ) and delta-delta (ΔΔ) coefficients.
These coefficients describe the vocal tract system in the human body based on computed acceleration values of the vocal cough system; we therefore compute the derivatives of the MFCC features with a regression formula. The detailed step-by-step procedure is as follows. The Discrete Fourier Transform (DFT) is applied to move the signal from the time domain to the frequency domain after dividing the speech signal into short frames. The power spectrum is then obtained and mapped onto the Mel scale; Fig. 4 depicts the power spectrum of a cough audio input volume. Next, the log outputs are transformed with the DCT. Finally, the delta (Δ) and delta-delta (ΔΔ) coefficients are calculated as follows. Let c_t denote the MFCCs of window frame t. The delta coefficient d_t is computed by the regression formula (Eq. (7)):

d_t = Σ_{i=1}^{I} i (c_{t+i} − c_{t−i}) / (2 Σ_{i=1}^{I} i²),

where I is the delta window, usually set to 6 to 10 frames. Because speech input from consumer devices may have different durations, the MFCC feature vectors would also vary in length; the proposed system therefore normalizes the feature vector by padding the MFCCs with silence for shorter signals. MFCC feature extraction includes windowing the signal and taking the Fast Fourier Transform (FFT) of the input cough signal. The system then applies Mel warping, mapping the powers of the cough spectrum onto the Mel scale, and computes the coefficients by taking the logs of the powers. The array of Mel log powers is then transformed with the DCT for further analysis; the amplitudes of the resulting signal are the MFCC features used for the early diagnosis of COVID-19.
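The delta regression above can be sketched in NumPy as below; the window size and the random MFCC matrix are illustrative assumptions:

```python
import numpy as np

def delta(features, I=2):
    """Delta coefficients d_t = sum_i i*(c_{t+i} - c_{t-i}) / (2*sum_i i^2)
    over a +/-I frame window; edge frames are handled by repeating the
    first/last frame. `features` has shape (num_frames, num_ceps)."""
    T = len(features)
    padded = np.pad(features, ((I, I), (0, 0)), mode="edge")
    denom = 2 * sum(i * i for i in range(1, I + 1))
    num = sum(i * (padded[I + i:T + I + i] - padded[I - i:T + I - i])
              for i in range(1, I + 1))
    return num / denom

mfcc = np.random.randn(100, 13)  # hypothetical static MFCC matrix
d = delta(mfcc)                  # delta coefficients
dd = delta(d)                    # delta-delta coefficients
```

Applying the same routine twice yields the delta-delta (acceleration) coefficients.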
The proposed system comprises the following steps: (1) input the cough (audio) sample; (2) transform the input cough sample into the frequency domain using the Discrete Fourier Transform (DFT); (3) evaluate the spectrum using a power log transformation; (4) map the processed cough signal onto the Mel scale; and (5) compute the coefficients using the DCT, whose amplitudes are the MFCC features, as shown in Fig. 5.
Fig. 5

Feature extraction from the cough (voice) samples.

After frame blocking and windowing the cough audio signal, the DFT is applied to each windowed frame to move from the time domain to the frequency domain; the resulting power spectrum tells us how strongly each frequency component is present in the original waveform (shown in Eq. (8)):

X(k) = Σ_{n=0}^{N−1} x(n) e^{−j2πkn/N},  k = 0, 1, …, N−1   (8)

where N is the number of feature points used to compute the DFT and k ranges between 0 and N−1. After obtaining the spectrum, we apply the logarithm to obtain the log power spectrum, which gives the magnitudes in decibels. The cepstrum represents how these frequencies are present in the log power spectrum and is obtained as (shown in Eq. (9)):

C = F^{−1}[ log |F{x(n)}| ]   (9)

where C is the obtained cepstrum, F is the DFT, and F^{−1} is the inverse DFT. The next step is the computation of the Mel spectrum. The Mel is a unit of measure based on how the human ear perceives frequency; human auditory systems do not perceive pitch linearly on the physical frequency scale. The Mel approximation of the physical frequency is expressed as (shown in Eq. (10)):

f_Mel = 2595 log10(1 + f/700)   (10)

where f_Mel denotes the perceived frequency and f the physical frequency. The physical frequency scale is partitioned into bins and, using overlapping triangular filters, each bin is transformed into the corresponding bin on the Mel scale. A Mel spectrogram is computed by multiplying each triangular Mel weighting filter with the magnitude spectrum. We consider only the first 13 MFCC coefficients, truncating the higher-order DCT coefficients to make the system more robust. The zeroth coefficient is excluded since it represents the average log energy of the signal and holds little information about the signal. The MFCC features are computed using the following expression (shown in Eq. (11)):

c(n) = Σ_{m=1}^{M} log(s(m)) cos(πn(m − 0.5)/M),  n = 1, 2, …, C   (11)

where C is the number of MFCCs per cough sample, M is the number of Mel filter bands, s(m) is the output of the m-th Mel filter, and c(n) are the cepstral coefficients.
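The pipeline above (DFT, log power spectrum, triangular Mel filterbank, DCT, keeping 13 coefficients and dropping the zeroth) can be sketched in plain NumPy. The frame length, hop size, and filter count below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def hz_to_mel(f):
    # Mel approximation of physical frequency (Eq. (10))
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_filters=26, n_ceps=13):
    # Frame blocking and Hamming windowing
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # DFT -> power spectrum (Eq. (8))
    spec = np.abs(np.fft.rfft(frames, frame_len)) ** 2
    # Triangular Mel filterbank between 0 Hz and the Nyquist frequency
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((frame_len + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, frame_len // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log Mel spectrum, then DCT-II (Eq. (11)); n starts at 1 to drop the
    # zeroth (average log energy) coefficient, keeping n_ceps coefficients
    logmel = np.log(spec @ fbank.T + 1e-10)
    n = np.arange(1, n_ceps + 1)[:, None]
    m = np.arange(1, n_filters + 1)[None, :]
    dct = np.cos(np.pi * n * (m - 0.5) / n_filters)
    return logmel @ dct.T  # shape: (n_frames, n_ceps)
```

With a one-second 16 kHz signal this yields a 98 × 13 feature matrix, one 13-coefficient MFCC vector per frame.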

Classification

We build the classification model in this section using the extracted MFCC feature vectors. To classify the cough audio signal, the proposed method incorporates a deep CNN whose basic structure comprises convolution, pooling, and fully connected layers. The model begins with an input layer and passes a spectrogram of the collected MFCC features (the feature map) to the convolution layers, which use the ReLU non-linear activation function to build feature maps. The extracted MFCC features are used as input for training the CNN. The model is made up of two blocks: each block has two convolutional layers followed by a max-pooling layer, with a batch normalization layer and a 0.20 dropout to prevent overfitting. The max-pooling layer attenuates the feature maps through down-sampling, summarizing the retrieved features, while the batch normalization layer standardizes its inputs and stabilizes the learning process. The convolutional layers in the first block use a kernel size of 5 × 5 with 100 filters; the second block uses a kernel size of 3 × 3 with 100 filters in each convolutional layer, so that complex features are learned across these four convolutional layers. Finally, the output layer with a softmax activation function classifies COVID-19 cases for the given input cough samples. The ReLU activation function is used for all convolutional layers. We used the Adam optimizer with a learning rate of 0.0001, chosen for its relatively better flexibility and efficiency, and the cross-entropy loss with the softmax activation function to measure the loss rates.
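A quick sanity check on the feature-map sizes implied by the stack just described (two blocks of two convolutional layers each, 5 × 5 then 3 × 3 kernels, each block closed by max pooling) can be done with the standard output-size formula. The 64 × 64 input and 2 × 2 pooling windows are illustrative assumptions, since the paper does not state them.

```python
def conv_out(size, kernel, stride=1, pad=0):
    # Standard convolution output-size formula: floor((n + 2p - k) / s) + 1
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, window=2, stride=2):
    # Max pooling uses the same sliding-window arithmetic
    return (size - window) // stride + 1

h = w = 64                                  # illustrative input size
h, w = conv_out(h, 5), conv_out(w, 5)       # block 1, conv 5x5 (100 filters)
h, w = conv_out(h, 5), conv_out(w, 5)       # block 1, conv 5x5 (100 filters)
h, w = pool_out(h), pool_out(w)             # block 1, 2x2 max pooling
h, w = conv_out(h, 3), conv_out(w, 3)       # block 2, conv 3x3 (100 filters)
h, w = conv_out(h, 3), conv_out(w, 3)       # block 2, conv 3x3 (100 filters)
h, w = pool_out(h), pool_out(w)             # block 2, 2x2 max pooling
print("feature map entering the dense layers:", h, "x", w)
```

Tracing the arithmetic shows how each valid convolution and pooling step shrinks the map before it reaches the softmax output.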

Chest X-ray based COVID-19 detection model

We collected a chest X-ray image database for extracting rich and distinct information. We focus on diagnosing COVID-19 and community-acquired pneumonia by characterizing the relationships between multiple types of discriminatory features extracted from the captured chest X-ray images and these diseases, which offers a possible pipeline for automatic diagnosis and investigation of COVID-19 using deep learning techniques. In the proposed multimodal system for COVID-19 detection, the system predicts infected and non-infected persons from the collected chest X-ray database. The chest X-ray model consists of several steps: (1) pre-processing, (2) segmentation of images, (3) feature extraction and matching, and (4) classification (Fig. 6).

Pre-processing step

As mentioned earlier, we obtained the database from openly available platforms (Kaggle and GitHub) and separated the collected chest X-ray images into 500 COVID-19-positive and 1600 COVID-19-negative images. After data collection, the model pre-processes the images using image processing techniques, including image interpolation methods that downsample the images and remove noise and other artifacts before they are provided as input samples to the chest X-ray classification model (shown in Fig. 6, Algorithm 4).
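A minimal sketch of this kind of pre-processing, block-average downsampling plus intensity normalization in NumPy; the 2× factor and [0, 1] scaling are illustrative assumptions, as the paper does not specify its interpolation settings.

```python
import numpy as np

def preprocess(img, factor=2):
    """Downsample a grayscale chest X-ray by block averaging and scale to [0, 1]."""
    h, w = img.shape
    h, w = h - h % factor, w - w % factor          # crop to a multiple of the factor
    blocks = img[:h, :w].reshape(h // factor, factor, w // factor, factor)
    small = blocks.mean(axis=(1, 3))               # average each factor x factor block
    # Min-max normalization; block averaging also suppresses pixel-level noise
    small = (small - small.min()) / (small.max() - small.min() + 1e-8)
    return small
```

For example, a 513 × 513 image would come out as a 256 × 256 array with values in [0, 1], ready to feed to the classification model.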
Fig. 6

Working of chest X-ray model.

Segmentation of chest X-ray image

Segmentation is the process of dividing a given image into distinct regions of interest. In this model we are concerned with the lungs, so we performed lung segmentation. The main objective of segmentation is the proper extraction of information from the collected data, which is essential for classifying COVID-19 versus no-finding cases. The chest X-ray dataset is segmented to mark the lungs and regions of interest and thereby identify the COVID-19-infected parts of the given image. Fig. 7 shows the segmentation of chest X-ray images using the proposed model.
Fig. 7

Segmentation of chest X-ray image using the proposed model.

This algorithm helps localize the lungs, the primary region of study in the chest X-ray model. To improve the accuracy of the proposed system, we use a deep learning U-Net-based segmentation model designed for medical image segmentation. It consists of a contracting path and an expansion path. The contracting path captures context from the image using convolutional layers followed by ReLU and max-pooling layers. The expansion path then localizes the regions to be segmented using transposed convolution operations. Table 6 reports the sensitivity, specificity, precision, accuracy, and F1-score of the proposed framework based on the segmented chest X-ray images.
Table 6

Metrics of the proposed cough classification model.

Sensitivity | Specificity | Precision | Accuracy | F1-Score
0.8345 | 0.8126 | 0.8141 | 0.8461 | 0.8653

Classification model

In this section, the working of the classification model is given; the CNN-based classification model is shown in Fig. 8. The classification model uses 14 convolution layers and three max-pooling layers, with the softmax and rectified linear (ReLU) activation functions, to classify the features extracted from the chest X-ray images as COVID-19 or non-COVID-19. Each convolutional layer has a different number of filters, and the number of filters increases as we go deeper into the architecture. The convolutional operation is mathematically defined as (shown in Eq. (12)):

(I ∗ K)(i, j) = Σ_m Σ_n I(i + m, j + n) K(m, n)   (12)

where I is the input image and K is the convolution kernel. The CNN-based classification model compares chest X-ray images using a patch-by-patch (piece-by-piece) matching technique: the pieces it looks for are called features, which are chosen and matched against the input image; if they match, the image is classified correctly. The max-pooling layer shrinks the image stack to a smaller size for a better representation of the extracted features: a window of size 3 × 3 is chosen, the maximum value in the window replaces the whole window, and the window is moved across the filtered chest X-ray image by the value of the stride.
Fig. 8

Chest classification model using CNN model with convolution and max pooling layer operations on chest X-ray images.

After the convolution layers, the max-pooling layers and the softmax activation function are used. For pooling we use the max-pooling method, a down-sampling technique within the CNN framework that keeps the maximum value from each cluster of neurons at the preceding layer. The final layer performs a binary classification, separating COVID-19 from non-COVID-19 cases.
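The pooling operation used throughout both models, take the maximum inside each window and slide by the stride, can be sketched directly; the non-overlapping 3 × 3 window with stride 3 below is an assumption based on the window size mentioned above.

```python
import numpy as np

def max_pool(fmap, k=3, stride=3):
    """Slide a k x k window over the feature map and keep each window's maximum."""
    h = (fmap.shape[0] - k) // stride + 1
    w = (fmap.shape[1] - k) // stride + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            # Replace the whole window with its maximum value
            out[i, j] = fmap[i * stride:i * stride + k,
                             j * stride:j * stride + k].max()
    return out
```

On a 6 × 6 feature map this produces a 2 × 2 summary, each entry being the strongest activation in its 3 × 3 region.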

Results and discussion

In this section, we present the results of the proposed system. First, experimental results are computed under different benchmark settings and protocols to evaluate the performance of the classification models using the CNN on chest X-ray images. The performance of the proposed system is compared with existing methods for the classification of patient diseases on the chest X-ray and cough datasets. The experimental results are computed with and without fusion, using the weighted sum-rule fusion method, to classify the three disease classes (COVID-19, pneumonia, and healthy). All experimental results are computed and reported five times through random sub-sampling-based cross-validation. The performance of the proposed framework is evaluated using the following confusion-matrix measures:

Sensitivity: the model's ability to accurately predict the true positives of each category, TP/(TP + FN).

Accuracy: the percentage of images in the testing data correctly classified by the proposed system, (TP + TN)/(TP + FP + TN + FN).

Precision (P): the percentage of predicted positive cases that are truly positive, TP/(TP + FP).

Recall (R)/Specificity: the percentage of actual cases the model predicts correctly (recall = TP/(TP + FN); specificity = TN/(TN + FP)).

F1-score: an integrated score combining precision and recall, computed as 2PR/(P + R).

The overall results of the proposed chest X-ray classification model are given in Table 5, which reports the 5-fold cross-validation performance metrics of the classification model on the segmented images. The overall accuracy of the proposed chest X-ray classification model is 98.35%, which is higher than the other state-of-the-art approaches listed in Table 8.
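The five measures listed above all follow directly from the confusion-matrix counts; a small helper using the usual definitions:

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute the five reported measures from raw confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # recall: true positives found
    specificity = tn / (tn + fp)            # true negatives found
    precision   = tp / (tp + fp)            # predicted positives that are correct
    accuracy    = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, precision, accuracy, f1
```

For example, `confusion_metrics(90, 10, 85, 15)` gives an accuracy of 0.875 with precision 0.90 and sensitivity about 0.857; the illustrative counts are not taken from the paper.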
Table 5

Performance matrices of proposed chest X-ray model of 5-fold cross-validation results.

Fold | Sensitivity | Specificity | Precision | Accuracy | F1
1 | 0.9459 | 1.0000 | 1.0000 | 0.9890 | 0.9722
2 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000
3 | 0.9453 | 0.9123 | 0.9367 | 0.9578 | 0.9646
4 | 0.9354 | 0.9428 | 0.9669 | 0.9677 | 0.9879
5 | 0.9891 | 0.9768 | 0.9556 | 0.9576 | 0.9789
Table 8

Comparisons with existing techniques for COVID-19 detection.

Ref. | Database used | Technique used | Accuracy (%)
[1] | CXR | COVID-Net | 92.4
[12] | CXR | ResNet50 | 95.4
[2] | CXR | DarkCovidNet | 98
[4] | Chest CT scan | ResNet | 86.7
[6] | CXR | VGG-19 | 93.48
[7] | 1,065 CT images | M-Inception | 82.90
Proposed | Chest X-ray images | Deep fusion + CNN + U-Net | 98.33

Statistical analysis of cough samples

The experimental results demonstrate that our cough detection algorithm can classify COVID-19-positive and non-COVID-19 cases with an accuracy of 86.53%. Fig. 12 shows the mean loss versus epochs and the accuracy versus epochs for both the training and testing sets; the corresponding metrics are given in Table 6.
Fig. 12

(a) accuracy (%) vs.epoch graph for cough classification model and (b) loss (%) vs. epoch for cough classification model.


Performance analysis with multiclass classification

We computed the performance of the proposed system on the datasets using fused features. The primary objective of feature fusion is to use discriminatory features from various layers, because different layers encode distinct types of information in images. Hence, fusing features from the chest X-rays and cough samples, which encode different types of information, enhances the system's accuracy in classifying the three classes (COVID-19, pneumonia, and healthy) for early diagnosis and accurate prediction. Fig. 9 shows the confusion matrices for COVID-19, pneumonia, and healthy (non-COVID-19) with and without the fusion technique. Fig. 10 shows the training and validation accuracy for COVID-19 using the chest X-ray dataset.
Fig. 9

Confusion matrix of the proposed system with fusion method.

Fig. 10

Training accuracy vs. validation accuracy for chest X-ray based COVID-19 classification.

The confusion matrix is measured with and without fusion. Fig. 9, Fig. 10, and Fig. 13 illustrate the proposed system's confusion matrices with and without fusion and those of ResNet50, Darknet, and SqueezeNet, respectively. It can be observed that the performance of the proposed system increases with fusion: the precision and recall achieved with the fusion technique are higher than without it. The optimal features are selected from the discriminatory features extracted from both modalities, and the classification accuracy with fusion is higher in all three classes (COVID-19, pneumonia, and healthy) than without fusion. Fig. 11(a) shows the confusion matrix for the chest X-ray image-based diagnosis and Fig. 11(b) shows the cough-based model's performance for early diagnosis of COVID-19. For example, for the COVID-19 class, the precision increased from 0.898 to 0.952, and the recall increased from 0.855 to 0.902. The proposed system with fusion has the best precision and recall values among the compared systems for all three classes. Table 6 shows the average accuracy, precision, and recall of the systems, from which we can see that the proposed system with fusion performs best.
Fig. 11

Confusion matrix for (a) chest X-ray images and (b) cough based diagnosis.

Fig. 13

Confusion matrix of the proposed system without fusion method.


Comparison with existing work

In this section, we present a comparative analysis against existing methodologies for COVID-19 detection. Each learning-based modality of the multimodal framework, chest X-ray-based and cough sound-based, is considered to validate the accuracy of the proposed framework against the current state-of-the-art methods discussed. Table 7 compares the classification accuracy (%) of different machine learning techniques.
Table 7

Classification (%) based on different ML models.

Model | Accuracy (%) | Precision (%) | Recall (%)
SqueezeNet | 84.40 | 83.40 | 84.30
Darknet | 92.5 | 91.8 | 93.2
ResNet50 | 90.0 | 89.78 | 90.8
REs-A (chest CT, MI) | 82.90 | – | –
DLT (chest X-ray, DTL) | 92.1 | – | –
Proposed without fusion | 86.6 | 85.7 | 86.86
Proposed with fusion | 92.5 | 91.8 | 93.89

Abbreviation: Ref.=Reference, Data=database used, Tech.=Techniques, A=Accuracy(%), Prop.=Proposed, DLT=DLT-based classifier, ED=Ensembled Deep Neural Network (DNN), TL=Transfer Learning with VGGish, chest-CT=Chest CT Scan, REs-A=ResNet and Location Attention, MI=M-Inception techniques, CNN=Convolutional neural Network.

Chest X-ray model: To demonstrate the importance of our proposed model for processing chest X-ray images for early classification, we compared the chest X-ray model with existing work in Table 8, covering experiments of different X-ray models for COVID-19 diagnosis on different datasets and current state-of-the-art methods. Cough (audio) detection model: We compared the performance of existing methods with the cough (audio) model for classifying COVID-19 patients, analyzing the proposed cough model against various prior work in this field across different cough-based COVID-19 diagnosis datasets. Table 9 lists the accuracies obtained by different machine learning (ML) models on different chest X-ray image and cough sample datasets. The proposed method without fusion provides 86.6% accuracy, 85.70% precision, and 86.86% recall for early diagnosis of COVID-19, while the presented ensemble fusion method yields higher accuracy than the other models. The proposed framework uses weighted-average ensembling and the weighted sum-rule fusion technique to fuse the accuracies obtained from the chest X-ray model and the cough (audio) model, based on discriminatory features selected from the chest X-ray image dataset and the cough (audio) sample dataset using the linear discriminant analysis method. The overall performance of the proposed framework with the ensemble method is therefore 92.50% accuracy, 91.80% precision, and 93.89% recall for early diagnosis and accurate prediction of COVID-19. Table 11 summarizes current state-of-the-art work on early COVID-19 detection using ML techniques based on chest images and cough samples.
Table 9

Comparisons with existing techniques on cough for COVID-19 detection.

Ref. | Dataset used | Technique used | Accuracy | Remark
[4] | Chest CT | REs-A | 86.7% | Takes more time, less accuracy
[7] | Chest CT | MI | 82.90% | Smaller dataset
[8] | X-ray | DTL | 92.1% | Overfitting; longer processing time
[13] | CC | ED | 77.1% | Less accuracy
[15] | AudioSet/ESC-50 | TL | 70.58% | Non-normalized dataset used
Prop. | Coswara | CNN | 84% | Less processing time

Abbreviation: Ref.=Reference, Prop.=Proposed, DLT=DLT-based classifier, ED=Ensembles DNN, TL=Transfer Learning with VGGish, chest-CT=Chest CT Scan, REs-A=ResNet and Location Attention, MI=M-Inception techniques, CNN=Convolutional neural Network, Coswara=Coswara cough audio database, CC=Coswara/Coughvid.

Table 11

Summary of current state-of-the-art works on early COVID-19 detection using ML techniques based on chest images and cough samples.

Study | Year | Rs | RSS | NS | NR | PS | TM | Per (%)
[8] | 2022 | Sound | Chest | 200 | – | HSR/SFs | CNN | 70%
[9] | 2020 | – | Cough | 4352 | – | Speech/MFCC | DL CNN | 80%
[10] | 2020 | Web | Cough | COD | 5320 | – | DL R2 | 97.10%
[18] | 2020 | Sp | Cough | CO | 3621 | ST | R-18 | AUC (0.72)
[20] | 2021 | ES | B | COVID10 | 10 | 2DFT | In-v3 | 80%
[19] | 2021 | Web-based | Cough | CODD | 1502 | MFCC | ENCNN | 77%
[24] | 2021 | Web-based | Cough | COT | 16 | MFCCs | SFT SVM | 94.21%
[10] | 2022 | BB | B | – | 220 | MFCCs | DL Low-D | 70%
Prop. | 2022 | CT & cough | – | Total | 36610 | DL/ML | MultM | 98.33%

Abbreviation: MultM=Multimodal, In-v3=Inception v3, R=(ResNet-18), R1=ResNet-50 COT=7 COVID-19+9 Healthy, SFT+Singlet time Fourier transformation, SVM=Support vector machine, Speech Pro= Speech processing, T= dataset consists of 4352 unique people collected from the web app + 2261 unique people from the Android app+4352 and 5634 samples. Cough=Crowd sourced Respiratory Sound Data, SF=Shape chest Features, 2DFT=Two-dimensional (2D) Fourier transformation, COVID10= [5 COVID19 5 healthy, ES=Electronic stethoscope, BB=120 COVID+100 healthy) B=Breathing, Sp=Smartphone app, TM=Trained Model, Per.=Performance, Rs=Recordings Source, RSS=Respiratory Sound, NS=Number of Subjects, NR=Number of Recordings, PS=Pre-processing Steps, ST=Short-term magnitude spectrogram, CO=COVID-19 1620 healthy, Mel=Mel-frequency cepstral coefficients (MFCC), COD=2660 COVID-19 2660 healthy, CODD=114 COVID-19, 1388 healthy, MFCC=Spectrogram Mel spectrum, power spectrum Tona, Mel-frequency cepstral coefficients (MFCC), ENCNN=Ensemble CNN, Total=(COVID-19 34516 samples+ 2100(1500 (chest X-ray positive+600(Chest X-ray: Negative), proposed method=MFCC+Darknet+ensemble ML+weighted sum rule fusion method, Model=Unet+Handcrafted features.


Weighted sum-rule based fusion method

In the past few years, much attention has been devoted to multimedia indexing by fusing multimodal information. There are two kinds of ensemble fusion techniques: (1) early fusion and (2) late fusion [14]. In this work, a late ensemble fusion technique based on the weighted sum-rule method is used to combine the accuracies of the chest and cough models for the classification of COVID-19 cases. To analyze the fusion accuracy, we used a 5-fold validation process to compute and validate the accuracies of the cough model and the chest X-ray model for early diagnosis based on predicted cases (shown in Table 10). In the sum-rule-based fusion, Wi stands for the weight of the i-th model and Si for the mean 5-fold cross-validation accuracy of the corresponding modality. Ensemble learning techniques are applied to the cough- and chest-based early diagnosis models: this hybrid learning paradigm produces more reliable prediction results than unimodal learning techniques by effectively fusing multiple learners and models.
Table 10

Classification accuracy (%) of 5-fold cross-validation of Proposed Framework.

Fold | Chest X-ray | Cough
1 | 0.9890 | 0.7776
2 | 0.9778 | 0.8278
3 | 0.9833 | 0.8116
4 | 1.0000 | 0.7926
5 | 0.9832 | 0.8453
Let M = [M1, …, Mm] denote the set of classification models for the chest X-ray and cough (audio) modalities. The term Pi(k|x) denotes the probability of the COVID-19 or non-COVID-19 class k output by classifier i for the input sample x. The average probability of class k is given as (shown in Eq. (14)):

P(k|x) = (1/m) Σ_{i=1}^{m} Pi(k|x),  k ∈ [1, K]   (14)

for input data x with K classes, where Pi gives the likelihood probability of each model. The plain sum-rule ensembling approach used in statistical machine learning rests on the strong hypothesis that all models deserve the same weight. Based on our overall observations, however, the chest X-ray and cough (audio) models cannot be treated equally when ensembling them for early diagnosis and classification of COVID-19 cases, which motivates the weighted sum rule.
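The plain sum rule above, averaging each model's class-probability outputs, is a one-liner in NumPy; the two-model, two-class probabilities in the usage example are illustrative, not the paper's outputs.

```python
import numpy as np

def average_probabilities(prob_matrices):
    """Sum-rule ensembling (Eq. (14)): average class probabilities across models.

    Each element of prob_matrices is an (n_samples, n_classes) array of one
    model's softmax outputs; the result has the same shape.
    """
    return np.mean(np.stack(prob_matrices), axis=0)

# Two hypothetical models scoring one sample over [COVID-19, non-COVID-19]
avg = average_probabilities([np.array([[0.9, 0.1]]),
                             np.array([[0.7, 0.3]])])
```

The fused decision is then the class with the highest averaged probability; the weighted variant below replaces the uniform 1/m factor with per-model weights.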

Weights estimation for weight sum-rule method

We calculated the 5-fold cross-validation accuracy over different splits of the training and testing sets, from which we infer that the mean deviation (Md) value gives an idea of the system's error rate. To analyze the fusion accuracy, we used the 5-fold validation process to compute and validate the accuracies of the cough model and the chest X-ray model for early diagnosis based on predicted cases (shown in Table 10). Here, Si is the mean accuracy of the chest X-ray model or the cough audio model for accurate prediction. The mean accuracy of each model (S1 and S2) is computed as the average over the five folds, and the weight Wi of each model is derived from these mean accuracies (shown in Eqs. (16)–(19)). The absolute accuracy for classifying COVID-19 patients based on cough and chest X-ray is then achieved using the weighted sum-rule method (shown in Table 12), and the final fusion-based classification accuracy of the multimodal framework is obtained as the weighted sum Σi Wi·Si.
Table 12

Weighted Sum rule fusion method based accuracy for classification of COVID-19 patients.

Modality | Weight (Wi) | Mean accuracy (Si) | Weighted sum score
Chest X-ray | 0.54 | 98.67 | 53.35
Cough audio | 0.46 | 86.53 | 39.80
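Under one plausible reading of the weight estimation, each modality's weight proportional to its mean cross-validation accuracy, the weighted sum-rule fusion can be sketched as follows. This is an illustrative reconstruction: the paper's exact weight equations (Eqs. (16)–(19)) are not recoverable from the text, and this choice only approximately reproduces Table 12's rounded weights.

```python
def weighted_sum_fusion(mean_accs):
    """Fuse per-modality mean accuracies S_i with weights W_i = S_i / sum_j S_j."""
    total = sum(mean_accs)
    weights = [s / total for s in mean_accs]
    # Weighted sum rule: final score is sum_i W_i * S_i
    fused = sum(w * s for w, s in zip(weights, mean_accs))
    return weights, fused

# Mean 5-fold accuracies of the chest X-ray and cough models (Table 12)
weights, fused = weighted_sum_fusion([98.67, 86.53])
```

With these inputs the weights come out near 0.53 and 0.47 and the fused score lies between the two single-modality accuracies, consistent with the fusion acting as an accuracy-weighted compromise.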

Conclusion and future directions

The proposed framework includes two modalities, namely chest X-ray and cough sound models, to diagnose patients. We used different deep learning architectures, such as the CNN and U-Net models and deep learning-based segmentation methods, to build models for extracting features from chest X-ray images and cough (audio) samples for accurate prediction. To improve the performance of the proposed framework, we fuse the extracted features from both models using the weighted sum-rule fusion technique for the accurate classification of patients. For chest X-rays we achieved an accuracy of 98.67%, and for the cough sound model 87.53%; the proposed model further provided 79.50% accuracy for the classification of patients. Based on overall observations, the accuracy of the proposed framework is 92.03% for the early diagnosis of patients. The future directions for the proposed multimodal framework are as follows. We will collect larger chest X-ray image and cough sound sample databases to train our proposed model and analyze the experimental results. We will develop and deploy an Android application as a smart early-diagnosis system for common patients and users. We will include statistical machine learning techniques for computing significant features from several modalities and add more testing features to improve the overall accuracy. The proposed system will be handy (remotely) in different places, especially where there is a lack of medical staff and other basic facilities.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Related articles (15 in total)

1.  COVID-19 Artificial Intelligence Diagnosis Using Only Cough Recordings.

Authors:  Jordi Laguarta; Ferran Hueto; Brian Subirana
Journal:  IEEE Open J Eng Med Biol       Date:  2020-09-29

2.  AI4COVID-19: AI enabled preliminary diagnosis for COVID-19 from cough samples via an app.

Authors:  Ali Imran; Iryna Posokhova; Haneya N Qureshi; Usama Masood; Muhammad Sajid Riaz; Kamran Ali; Charles N John; Md Iftikhar Hussain; Muhammad Nabeel
Journal:  Inform Med Unlocked       Date:  2020-06-26

3.  Deep Learning COVID-19 Features on CXR Using Limited Training Data Sets.

Authors:  Yujin Oh; Sangjoon Park; Jong Chul Ye
Journal:  IEEE Trans Med Imaging       Date:  2020-05-08       Impact factor: 10.048

4.  Prediction of muscular paralysis disease based on hybrid feature extraction with machine learning technique for COVID-19 and post-COVID-19 patients.

Authors:  Prabu Subramani; Srinivas K; Kavitha Rani B; Sujatha R; Parameshachari B D
Journal:  Pers Ubiquitous Comput       Date:  2021-03-03       Impact factor: 3.006

5.  A novel unsupervised approach based on the hidden features of Deep Denoising Autoencoders for COVID-19 disease detection.

Authors:  Michele Scarpiniti; Sima Sarv Ahrabi; Enzo Baccarelli; Lorenzo Piazzo; Alireza Momenzadeh
Journal:  Expert Syst Appl       Date:  2021-12-16       Impact factor: 6.954

6.  Machine learning for detecting COVID-19 from cough sounds: An ensemble-based MCDM method.

Authors:  Nihad Karim Chowdhury; Muhammad Ashad Kabir; Md Muhtadir Rahman; Sheikh Mohammed Shariful Islam
Journal:  Comput Biol Med       Date:  2022-03-17       Impact factor: 6.698

7.  A deep-learning based multimodal system for Covid-19 diagnosis using breathing sounds and chest X-ray images.

Authors:  Unais Sait; Gokul Lal K V; Sanjana Shivakumar; Tarun Kumar; Rahul Bhaumik; Sunny Prajapati; Kriti Bhalla; Anaghaa Chakrapani
Journal:  Appl Soft Comput       Date:  2021-05-26       Impact factor: 6.725

8.  Covid-19: automatic detection from X-ray images utilizing transfer learning with convolutional neural networks.

Authors:  Ioannis D Apostolopoulos; Tzani A Mpesiana
Journal:  Phys Eng Sci Med       Date:  2020-04-03

9.  An ensemble learning approach to digital corona virus preliminary screening from cough sounds.

Authors:  Emad A Mohammed; Mohammad Keyhani; Amir Sanati-Nezhad; S Hossein Hejazi; Behrouz H Far
Journal:  Sci Rep       Date:  2021-07-28       Impact factor: 4.379

