Accurate detection of COVID-19 patients based on distance biased Naïve Bayes (DBNB) classification strategy.

Warda M Shaban1, Asmaa H Rabie2, Ahmed I Saleh2, M A Abo-Elsoud3.   

Abstract

COVID-19, an infectious disease, has shocked the world and still threatens the lives of billions of people. Early detection of COVID-19 patients is important for treating the disease and controlling its spread. In this paper, a new strategy for detecting COVID-19 infected patients is introduced, called Distance Biased Naïve Bayes (DBNB). The novelty of DBNB as a proposed classification strategy lies in two contributions. The first is a new feature selection technique called Advanced Particle Swarm Optimization (APSO), which selects the most informative and significant features for diagnosing COVID-19 patients. APSO is a hybrid method based on both filter and wrapper methods to provide accurate and significant features for the next classification phase. The considered features are extracted from laboratory findings for different cases, some of whom are COVID-19 infected while others are not. APSO consists of two sequential feature selection stages, namely the Initial Selection Stage (IS2) and the Final Selection Stage (FS2). IS2 uses filter techniques to quickly select the most important features for diagnosing COVID-19 patients while removing the redundant and ineffective ones. This minimizes the computational cost in FS2, which is the next stage of APSO. FS2 uses Binary Particle Swarm Optimization (BPSO) as a wrapper method for accurate feature selection. The second contribution of this paper is a new classification model, which combines evidence from statistical and distance-based classification models. The proposed classification technique avoids the problems of the traditional NB and consists of two modules: the Weighted Naïve Bayes Module (WNBM) and the Distance Reinforcement Module (DRM). The proposed DBNB tries to accurately detect infected patients with the minimum time penalty based on the most effective features selected by APSO. DBNB has been compared with recent COVID-19 diagnosis strategies. Experimental results have shown that DBNB outperforms recent COVID-19 diagnosis strategies, as it achieves the maximum accuracy with the minimum time penalty.
© 2021 Elsevier Ltd. All rights reserved.


Keywords:  COVID-19; Classification; Feature selection; NB; Optimization; Particle swarm; Wrapper

Year:  2021        PMID: 34149100      PMCID: PMC8205562          DOI: 10.1016/j.patcog.2021.108110

Source DB:  PubMed          Journal:  Pattern Recognit        ISSN: 0031-3203            Impact factor:   7.740


Introduction

The new coronavirus disease (COVID-19) has resulted in a global epidemic due to its quick spread from one individual to another [1]. The terrifying spread of COVID-19 is the greatest challenge humanity has faced since the Second World War. The World Health Organization (WHO) declared COVID-19 a global pandemic in March 2020 [2]. The most common symptoms of COVID-19 are dry cough, sore throat, and fever [2,3]. Symptoms can progress to a severe form of pneumonia with critical complications, including septic shock and pulmonary edema [1]. Unfortunately, clinical characteristics alone cannot determine the diagnosis of COVID-19, especially for patients at the early onset of symptoms. A recent study has shown that once the coronavirus begins to spread, it quickly overwhelms the medical system (e.g., hospitals) [4]. Hence, early detection of COVID-19 patients is vital for quarantining infected people. Machine learning refers to software that learns automatically how to solve a problem or execute a task and that becomes more accurate over time [5,6]. All machine learning techniques work on the same principle [7,8]: they receive training data as input, build a mathematical model based on that data, and then use the model to solve the problem at hand [9]. Several machine learning techniques have been introduced for COVID-19 diagnosis [10]. However, they suffer from several drawbacks, such as (i) low diagnostic accuracy, (ii) long prediction time, and (iii) high complexity. The Naïve Bayes (NB) classifier is a simple but surprisingly powerful machine learning technique [11,12]. Despite its naive design and apparently oversimplified assumptions, NB has worked quite well in many complex real-world situations, such as real-time prediction, spam filtering, weather forecasting, and medical diagnosis [13,14,15].
However, the performance of NB is sometimes poor due to the unrealistic assumption that all features are independent and equally important given the class value [11]. To overcome this hurdle, several solutions have been introduced, such as feature selection and feature weighting [11,12]. The originality of this paper lies in introducing a Distance Biased Naïve Bayes (DBNB) classification strategy for accurate diagnosis of COVID-19 patients. DBNB consists of two phases, namely (i) the Feature Selection Phase (FSP) and (ii) the Classification Phase (CP). During the former (i.e., FSP), the input features are extracted from patients' laboratory findings, and then the effective features are selected from those extracted features using Advanced Particle Swarm Optimization (APSO). APSO is a newly proposed method that combines filter and wrapper approaches. It is composed of two stages: IS2, which uses several filter methods, and FS2, which uses Binary Particle Swarm Optimization (BPSO) as a wrapper method. APSO aims to utilize the benefits of both filter and wrapper methods while overcoming their drawbacks. Filter methods provide fast selection, but they cannot give high performance because they ignore feature dependencies. On the other hand, BPSO as a wrapper method provides accurate selection because it accounts for feature dependencies and the interaction with the classifier in use, but it is not fast. Consequently, APSO can select the most informative subset of features because (i) it provides fast selection by using filter methods, (ii) it provides accurate selection by using a wrapper method, and (iii) it takes into consideration the feature dependencies and the interaction with the classifier. During the second phase (i.e., CP), fast and accurate detection of COVID-19 patients based on the selected features is provided through a new classification model.
The proposed classification model retains the benefits of the traditional NB and overcomes its problems. It aims to improve performance by (i) assigning weights to the selected features, so that the result is a Weighted Naïve Bayes (WNB) classifier, and (ii) fine-tuning the decision of WNB using distance-based biasing between the input item to be classified and the centers of the target classes in the employed feature space. Consequently, the proposed classification model consists of two modules, namely (i) the Weighted Naïve Bayes Module (WNBM), in which the WNB classifier takes the initial decision based on the degree to which the input item belongs to each of the considered classes, and (ii) the Distance Reinforcement Module (DRM), in which the final decision is taken. The proposed DBNB classification strategy has been compared against recent COVID-19 diagnosis strategies. Experimental results have shown that DBNB outperforms all competitors, as it achieves the maximum diagnostic accuracy as well as the minimum error. The rest of the paper is organized as follows. Section 2 defines the COVID-19 problem. Section 3 discusses the applicability of DBNB for COVID-19 diagnosis. Section 4 discusses the Naïve Bayes haste problem. Section 5 reviews previous efforts on COVID-19 patient classification. Section 6 presents the proposed Distance Biased Naïve Bayes (DBNB) classification strategy. Section 7 reports the experimental results. Finally, conclusions and future work are presented in Section 8.

Problem definition

Coronavirus pneumonia, subsequently termed COVID-19, is a new disease that appeared in Wuhan, China. Once the coronavirus appeared, it spread at a rapid rate around the whole world [16,17]. Detecting and isolating infected cases is the only way to protect the healthcare system from becoming overwhelmed, and accordingly it flattens the epidemic curve, as shown in Fig. 1. Social and physical distancing measures aim to slow the spread of the disease by stopping chains of transmission of COVID-19 and preventing new ones from appearing, in order to keep hospitals and doctors' offices from becoming overcrowded with patients.
Fig. 1

A graphic representation of the rapid spike in infections.


DBNB applicability for COVID-19 diagnose

The terrifying spread of the COVID-19 virus represents a true pandemic. No doubt, it is the greatest challenge humanity has faced since World War Two. However, COVID-19 is much more than a health crisis; it has the potential to create devastating economic, political, and social crises that will certainly leave deep scars [17]. Generally, COVID-19 diagnosis can be accomplished via three different techniques, as illustrated in Fig. 2: (i) using Real-Time reverse transcriptase-Polymerase Chain Reaction (RT-PCR), (ii) using chest CT imaging scans, and (iii) using numerical laboratory tests. Among nucleic acid tests, the polymerase chain reaction (PCR) laboratory test, and more precisely RT-PCR, is currently used as the 'gold standard' for confirming COVID-19 positive patients.
Fig. 2

Different COVID-19 diagnosis techniques.

RT-PCR tests are fairly quick, sensitive, and reliable. A sample is collected from a person's nose or throat, and chemicals are used to remove any fats, proteins, and other molecules, leaving only RNA behind [3]. The separated RNA is a mixture of the person's own genetic material and, if present, the coronavirus' RNA. However, the RT-PCR test carries the risk of false-positive and false-negative results, and accordingly it does not pick up all infections [18]. Thus, a negative RT-PCR result does not rule out COVID-19 infection. Due to the exponential spread of COVID-19, such undiagnosed cases can have catastrophic effects. Accordingly, RT-PCR should not be used as the only criterion for detecting COVID-19 patients [19]. Chest CT has become a critical diagnostic tool for COVID-19; it detects hazy, patchy, "ground glass" white spots in the lung, a telltale sign of COVID-19. Several studies observed that the sensitivity of CT in diagnosing COVID-19 is significantly higher than that of RT-PCR [20]. However, current evidence suggests that CT scans and X-rays are not specific enough to either diagnose or rule out COVID-19, for the following reasons: (i) CT scans sometimes fail to detect coronary lung tissue; like ultrasound, a CT scan is unable to differentiate coronary tissue from non-coronary tissue. (ii) CT scans lack detail, as they cannot identify the most aggressive tumors, and hence they are unable to differentiate between cancerous tissue, cysts (or fibroids), and coronary tissue. (iii) Although a CT scan can give a rapid diagnosis of COVID-19, rapid results mean rapid false negatives and rapid false reassurance. This also means the rapid release of people with COVID-19, allowing them to mingle with uninfected people who may be vulnerable.
Based on the above discussion, we claim that COVID-19 diagnoses and treatment plans based on a CT scan or RT-PCR are far less effective than those based on more accurate Numerical Laboratory Tests (NLTs). CT and RT-PCR are not recommended as primary screening tools. On the other hand, NLTs can be considered the most accurate method for diagnosing COVID-19. NLTs are the only method that the Centers for Disease Control (CDC) currently endorses [21]. Hence, it makes sense that the use of NLTs will provide a more accurate diagnosis with less waiting time. To the best of our knowledge, Distance Biased Naïve Bayes (DBNB), the diagnosis strategy proposed in this paper, is the first to use NLTs as the main criterion for detecting COVID-19 patients. It relies on data mining techniques, and more precisely on classification, for diagnosing COVID-19. Although several classification techniques can be used, DBNB relies on Naïve Bayes (NB), which is a supervised learning classification method based on probability. We claim that NB is the most applicable classifier for COVID-19 diagnosis for the following reasons: (i) NB is simple, flexible, fast, and appropriate to real-world scenarios; (ii) NB requires only a small amount of training data to estimate the parameters necessary for building the classification model, and hence it can make accurate predictions even with a small amount of training data; (iii) it is suitable for incremental training, which means that NB can learn from new samples in real time; (iv) NB is less sensitive to missing data and is also resistant to noisy data, which helps to avoid over-fitting the dataset; (v) it depends on a set of pre-computed probabilities, so the prediction time is very small: once the model has been constructed, classifying one instance is O(1), which makes NB suitable for real-time applications such as COVID-19 diagnosis [11,12].
DBNB not only inherits the advantages of traditional NB but is also enhanced by a novel feature selection methodology and by distance biasing. The combination is done in a logical way, which is expected to increase performance over the traditional NB while remaining consistent in nature. As will be seen in the experimental results, the implementation of DBNB confirms this and demonstrates the applicability of the proposed DBNB as the first COVID-19 diagnosis strategy that relies completely on accurate NLTs rather than chest CT imaging or the RT-PCR test.

Naïve Bayes haste problem

Naïve Bayes (NB) is a popular classifier in machine learning applications. It has been applied to different domains such as image and pattern recognition, intrusion detection, weather forecasting, bioinformatics, and COVID-19 patient diagnosis. NB allows each feature to contribute towards the classification decision both equally and independently of other features. Although such simplicity promotes computational efficiency, it sometimes makes NB incompatible with real-world conditions. Consider $F=\{f_1, f_2, \dots, f_m\}$ to be the feature vector of a new item $I$ to be classified and $C=\{c_1, c_2, \dots, c_k\}$ to be the set of target classes. The probability of the new item being in class $c_j$ using NB is given by (1).

$$P(c_j \mid F) = \frac{P(F \mid c_j)\, P(c_j)}{P(F)} \qquad (1)$$

Here, $P(c_j \mid F)$ is the conditional probability of class $c_j$ given the feature vector $F$ (also called the posterior probability), $P(F \mid c_j)$ is the conditional probability of the feature vector $F$ given the class $c_j$ (also called the likelihood), and $P(c_j)$ is the prior probability of class $c_j$. Since the features are assumed to be independent given the class, this yields

$$P(F \mid c_j) = \prod_{i=1}^{m} P(f_i \mid c_j)$$

Substituting in (1) yields (2).

$$P(c_j \mid F) = \frac{P(c_j) \prod_{i=1}^{m} P(f_i \mid c_j)}{P(F)} \qquad (2)$$

Since the denominator in (2) remains constant for a given input across all target classes, it can be removed, as illustrated in (3).

$$P(c_j \mid F) \propto P(c_j) \prod_{i=1}^{m} P(f_i \mid c_j) \qquad (3)$$

However, the performance of NB is sometimes poor due to the unrealistic assumption that all features are independent and equally important given the class value. The performance of NB can be improved by relaxing this assumption. Several enhancements have been proposed to resolve this problem, including feature selection and feature weighting. Generally, feature selection can be applied to improve the performance of the traditional Naïve Bayes classifier. Hence, the target class can be identified by (4).

$$\text{Target class} = \arg\max_{c_j \in C} \; P(c_j) \prod_{i=1}^{m} P(f_i \mid c_j) \qquad (4)$$

However, assigning equal weight to all considered features violates the nature of real-world applications. Accordingly, a different weight can be assigned to each feature as a generalization of feature selection, as illustrated in (5).

$$\text{Target class} = \arg\max_{c_j \in C} \; P(c_j) \prod_{i=1}^{m} P(f_i \mid c_j)^{w_i} \qquad (5)$$

As depicted in (5), unlike traditional NB, each feature $f_i$ has its own weight $w_i$, which can be any positive number representing the significance of the feature. However, both traditional and Weighted Naïve Bayes (WNB) classifiers depend mainly on probabilities, namely the conditional probabilities of the input features given the considered target classes as well as the classes' prior probabilities. From another point of view, the performance of the WNB classifier can be improved by complementing it with another heuristic besides the conditional and prior probabilities. In this paper, a distance-based heuristic is employed to bias the decision of the weighted Naïve Bayes classifier, with the aim of improving its performance.
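To make the weighted decision rule in (5) concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes categorical (already discretized) features, Laplace smoothing with a hypothetical parameter `alpha`, and log-space scoring so that each weight w_i multiplies the log-likelihood of its feature. The toy records, the labels "pos" and "neg", and the function names `train_nb` and `predict_wnb` are illustrative assumptions only.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods (categorical features)."""
    # Number of distinct values observed for each feature (used by the smoothing term).
    n_values = [len({row[i] for row in rows}) for i in range(len(rows[0]))]
    class_counts = Counter(labels)
    value_counts = defaultdict(int)  # (class, feature index, value) -> count
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i, value)] += 1
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}

    def log_likelihood(c, i, value):
        return math.log((value_counts[(c, i, value)] + alpha)
                        / (class_counts[c] + alpha * n_values[i]))

    return log_prior, log_likelihood


def predict_wnb(x, log_prior, log_likelihood, weights):
    """Return the arg-max class of log P(c) + sum_i w_i * log P(f_i | c), plus all scores."""
    scores = {
        c: prior + sum(w * log_likelihood(c, i, v)
                       for i, (v, w) in enumerate(zip(x, weights)))
        for c, prior in log_prior.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy data: two discretized laboratory features per patient.
rows = [("high", "low"), ("high", "low"), ("high", "high"),
        ("low", "high"), ("low", "high"), ("low", "low")]
labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

log_prior, log_likelihood = train_nb(rows, labels)
label, scores = predict_wnb(("high", "low"), log_prior, log_likelihood, [1.0, 1.0])
print(label, scores)
```

With all weights equal to 1 the rule reduces to the traditional NB of (4); setting a weight to 0 removes that feature from the decision, which is why feature weighting can be read as a generalization of feature selection.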

Related work

In this section, previous research efforts on COVID-19 patient classification are reviewed. In [22], an automated COVID-19 diagnosis method based on a convolutional neural network (CNN) was introduced as a new classification method. The proposed CNN was developed using the EfficientNet architecture to perform binary and multi-class classification using X-ray images. Experimental results in [22] showed that the average accuracy values for binary and multi-class classification are 99.62% and 96.70%, respectively. In [23], the Group Method of Data Handling (GMDH) was used as a binary classification model. GMDH is a type of artificial neural network that was used to predict the number of confirmed COVID-19 cases in Hubei province. Many different features were used as inputs to GMDH to predict the confirmed number of COVID-19 patients in the next 30 days. These features (factors) include the maximum, minimum, and average daily temperature, the density of the city, humidity, and wind speed. The results in [23] demonstrated that the proposed model offered a high capacity for predicting the confirmed number of COVID-19 patients. In [24], a new Corona Patients Detection Strategy (CPDS) was introduced to detect COVID-19 patients. CPDS consists of two phases, called Data Preprocessing (DP) and Patient Detection Phase (PDP). During DP, two main processes, feature extraction and feature selection, were performed to extract and then select the most informative features from CT images. During PDP, fast and accurate detection of COVID-19 patients based on the selected features was provided by the proposed Enhanced KNN (EKNN) classifier. Experimental results in [24] showed that CPDS outperforms recent strategies, as it provides the best detection accuracy with the minimum time penalty.
In [25], an automated COVID-19 detection model called DarkCovidNet was introduced as a new detection method based on chest X-ray images. The DarkCovidNet model is a deep learning technique developed to perform binary and multi-class classification. The experimental results in [25] showed that the proposed model performs binary tasks better than multi-class tasks, with higher accuracy in the binary case. In [26], machine learning techniques were used to distinguish pneumonia caused by COVID-19 from other types of pneumonia and from healthy lungs using only chest X-ray (CXR) images in flat and hierarchical classification scenarios. The proposed Classification Schema (CS) consists of a feature extraction process, early fusion techniques, and data resampling. According to the results in [26], the proposed approach achieved the best nominal rate obtained for COVID-19 identification in an unbalanced environment with more than three classes.

The proposed distance biased Naïve Bayes (DBNB) classification strategy

In this section, the proposed Distance Biased Naïve Bayes (DBNB) classification strategy is explained in detail. The main aim of DBNB is to quickly and accurately detect COVID-19 cases. Automatic medical diagnosis of COVID-19 patients has become very important, especially when rapid decisions are needed for such a serious infectious disease [27,28,29]. Quick detection of COVID-19 cases allows rapid treatment and isolation of patients and accordingly slows the spread of the infection. In this paper, an intelligent classification strategy called Distance Biased Naïve Bayes (DBNB) is introduced for healthcare systems to provide more accurate and rapid diagnostic results. As illustrated in Fig. 3, the proposed DBNB classification strategy is composed of two phases: (i) the Feature Selection Phase (FSP) and (ii) the Classification Phase (CP). These phases are discussed in the next subsections.
Fig. 3

The proposed DBNB classification strategy


Feature selection phase (FSP)

The existence of irrelevant features in the input dataset is one of the main causes of overfitting, especially in the domain of medical diagnosis of COVID-19 patients [30,31,32]. The main task during FSP in the proposed DBNB classification strategy is to select the most effective features for COVID-19 diagnosis. It is important to eliminate the features with the least effect on the output because they can decrease the accuracy of the diagnostic model. Accordingly, feature selection should be performed before the diagnostic model is learned, so that the model becomes faster and more cost-effective [33,34,35,36,37]. Initially, patient features are extracted from the input dataset, and then feature selection is performed on those extracted features to select the most informative ones. The features extracted from the input dataset include white blood cell, lymphocytes, D-dimer, C-reactive protein, etc. In this section, a simple but effective feature selection methodology called Advanced Particle Swarm Optimization (APSO) is presented as a new feature selection method. APSO is a hybrid technique that integrates filter and wrapper methods to quickly and accurately select the subset of features that is most effective for COVID-19 diagnosis. It is composed of two stages: (i) the Initial Selection Stage (IS2), which uses several filter methods as fast selection methods, and (ii) the Final Selection Stage (FS2), which uses Binary Particle Swarm Optimization (BPSO) as a wrapper method that can accurately select the best subset of features. Although BPSO can accurately select the informative features, it suffers from a high computational time, and its convergence depends strongly on the initial population of the particles in the swarm.
For this reason, the main objective of IS2 is to determine the initial population of BPSO: the results of the fast selection methods in IS2 are used as the initial population in FS2, which reduces BPSO's computational time and helps it select an optimal subset of features. Finally, the best subset of features is used to improve the performance of the COVID-19 classification model. Particle Swarm Optimization (PSO) was initially designed to tackle problems in a continuous search space, but many optimization problems, such as feature selection, occur in binary search spaces [38,39]. Consequently, PSO was modified into Binary PSO (BPSO) to solve discrete optimization problems. BPSO extends the original PSO by using the sigmoid transfer function, which transforms the velocity value from the continuous search space into the discrete space. Under this transformation, the velocity indicates the probability that a bit in the particle's position vector takes the value 1. Thus, a particle's velocity in BPSO is still updated in the same manner as in the original PSO, but the particle's current position, the particle's best position, and the global best position in the swarm can only have binary values (0 or 1). Although BPSO can accurately select the most significant features for COVID-19 diagnosis in the binary space, it is slow, and it initializes the population of particles in the swarm randomly, which makes its convergence difficult. Accordingly, APSO is presented as a new selection method that quickly and optimally selects the most effective features for COVID-19 diagnosis by utilizing the benefits of the BPSO algorithm and tackling its problems. Before BPSO is run in FS2, the number of particles and their initial values in the initial population of the swarm are generated from the filter methods in IS2.
In other words, the number of particles equals the number of filter methods, and the particles take the filter results as their initial values, which enables BPSO in FS2 to provide a fast and accurate subset of informative features for COVID-19 diagnosis. Fig. 4 illustrates the sequential steps of the APSO method using 'g' filter methods.
Fig. 4

The sequential steps of APSO method.

Firstly, after feature extraction has been performed on the laboratory findings, the COVID-19 dataset is passed to IS2, where 'g' filter methods are applied to it in parallel. Then, the results of these filter methods are passed to FS2 to generate the initial population of BPSO. In Fig. 4, note that the number of particles in the initial population of the swarm equals 'g', the number of filter methods in IS2. Additionally, the values of the particles are the results of the filter methods in IS2. Secondly, BPSO iterations are performed until a termination condition is satisfied. At the end, the global best position in the swarm provides the best subset of features, which is evaluated using a classifier such as Naïve Bayes (NB) as a standard classifier [34,40]. Generally, BPSO is a biologically inspired optimization algorithm that was motivated by the social behavior of bird flocking or fish schooling and that solves an optimization problem according to its fitness value [38,39]. Hence, BPSO can provide near-optimal solutions for the fitness function of an optimization problem. Initially, BPSO begins with a group of particles (or "birds") as solutions, called a Swarm (S). In BPSO, each particle represents a potential solution (i.e., a subset of informative features) in an m-dimensional search space (e.g., m=12, the number of features extracted from the laboratory findings). Thus, a subset of features is represented in each particle as a binary string whose length equals the number of features in the COVID-19 dataset. Each bit of the particle is either zero or one. A zero in the j-th position of the particle denotes the elimination of the j-th feature from the subset, while a one denotes the selection of the j-th feature. As an example, a single particle is represented in Table 1, assuming m=12 and thus $FS=\{f_1, f_2, \dots, f_{12}\}$.
Table 1

An example of single particle.

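| f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 |
|----|----|----|----|----|----|----|----|----|-----|-----|-----|
| 0  | 1  | 0  | 1  | 1  | 0  | 1  | 0  | 1  | 1   | 0   | 0   |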
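The binary encoding of a particle can be illustrated with a short Python sketch. The helper names `decode_particle` and `encode_subset` are hypothetical and only serve to show how a bit string maps to a feature subset and back; the bit string is the one from Table 1.

```python
def decode_particle(bits, feature_names):
    """Return the features whose bit is 1 in the particle."""
    if len(bits) != len(feature_names):
        raise ValueError("particle length must equal the number of features")
    return [name for bit, name in zip(bits, feature_names) if bit == "1"]


def encode_subset(subset, feature_names):
    """Return the bit string that selects exactly the features in `subset`."""
    chosen = set(subset)
    return "".join("1" if name in chosen else "0" for name in feature_names)


feature_names = [f"f{j}" for j in range(1, 13)]  # f1 .. f12, i.e. m = 12
selected = decode_particle("010110101100", feature_names)
print(selected)  # features kept by the particle in Table 1
```

Under this encoding, the particle in Table 1 keeps six of the twelve features (f2, f4, f5, f7, f9, and f10) and discards the rest.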
Each particle is represented in m dimensions (m = number of features) by the pair $(P_i, VP_i)$, where $P_i$ is the position of the i-th particle, $P_i=(P_{i1}, P_{i2}, \dots, P_{im})$, and $VP_i$ is the velocity of the i-th particle, $VP_i=(VP_{i1}, VP_{i2}, \dots, VP_{im})$. Additionally, $P_{personal}(p_i)$ represents the best previous position of the i-th particle, i.e., the position that possesses the best fitness value. The global best position among all the particles, determined by competition and cooperation among them in the swarm, is called $P_{Global}$. Through iterations, each particle adjusts itself based on its own flying experience ($P_{personal}(p_i)$) and its companions' flying experience ($P_{Global}$). Finally, the swarm converges to the global optimum solution. Hence, implementing BPSO as a feature selection technique requires several essential steps, as shown in Fig. 4. In FS2, 'g' particles are represented in S, and then the fitness (evaluation) function of BPSO is applied to measure the fitness degree of each particle $P_i$ (a subset of the input features) based on the accuracy of the classifier. In other words, the fitness function is the accuracy of the employed classifier, such as the NB classifier, when selecting the most effective features for COVID-19 diagnosis. The fitness value of each particle can be calculated using (6).

$$Fit(p_i) = Accuracy(p_i) \qquad (6)$$

Here, $Accuracy(p_i)$ represents the classification accuracy obtained with the subset of features in the i-th particle. The algorithm searches for the best particle with the aim of maximizing $Fit(p_i)$. According to the fitness values of the particles in S, $P_{personal}(p_i)$ and $P_{Global}$ in each particle's memory are updated using (7) and (8) [36].

$$P_{personal}(p_i) = \begin{cases} P_i & \text{if } Fit(P_i) > Fit(P_{personal}(p_i)) \\ P_{personal}(p_i) & \text{otherwise} \end{cases} \qquad (7)$$

$$P_{Global} = \begin{cases} P_{personal}(p_i) & \text{if } Fit(P_{personal}(p_i)) > Fit(P_{personal}(p_{i+1})) \\ P_{personal}(p_{i+1}) & \text{otherwise} \end{cases} \qquad (8)$$

Here, $P_{personal}(p_i)$ represents the best solution (personal best position) of the i-th particle and $P_i$ represents the current position of the i-th particle. $Fit(P_i)$ represents the fitness value of the i-th particle based on its current position, and $Fit(P_{personal}(p_i))$ represents the fitness value of the i-th particle based on its best position.
$P_{Global}$ is the best particle in the whole swarm S, $Fit(P_{personal}(p_{i+1}))$ represents the fitness value of the (i+1)-th particle based on its best position, and $P_{personal}(p_{i+1})$ represents the personal best position of the (i+1)-th particle. Furthermore, $P_{personal}(p_i)$ and $P_{Global}$ are used for updating every particle's velocity $VP_i$ in the next iteration (t+1) using (9) [36].

$$VP_i(t+1) = w \cdot VP_i(t) + c_1 r_1 \left(P_{personal}(p_i)(t) - P_i(t)\right) + c_2 r_2 \left(P_{Global}(t) - P_i(t)\right) \qquad (9)$$

Here, t represents the current iteration and $VP_i(t+1)$ represents the velocity of the i-th particle at the next iteration. $VP_i(t)$ is the velocity of the i-th particle at the current iteration, and $P_{personal}(p_i)(t)$ represents the personal best position of the i-th particle at the current iteration. Additionally, $P_{Global}(t)$ represents the global best position in the swarm S at the current iteration, and $P_i(t)$ represents the current position of the i-th particle at the current iteration. $w$ is the inertia weight, $w \in [0.9, 1.2]$ [36]; it controls the impact of the previous history of velocities on the current velocity. $c_1$ and $c_2$ are the cognitive and social acceleration constants, $c_1, c_2 \in [2, 4]$. Additionally, $r_1$ and $r_2$ are uniformly distributed random numbers, $r_1, r_2 \in [0, 1]$. Consequently, the adjusted velocity $VP_i(t+1)$ of the i-th particle depends on three main terms. The first term, $w \cdot VP_i(t)$, is the current motion term; the second term, $c_1 r_1 (P_{personal}(p_i)(t) - P_i(t))$, is the cognitive term; and the third term, $c_2 r_2 (P_{Global}(t) - P_i(t))$, is the social term. After the velocity $VP_i(t+1)$ has been calculated for every particle in S, the velocity acts as a probability distribution from which the particle position is randomly produced. Hence, the particle position is adjusted by applying the sigmoid function, which identifies the new binary particle position using (10).

$$P_{ij}(t+1) = \begin{cases} 1 & \text{if } rand(0,1) < sig(VP_{ij}(t+1)) \\ 0 & \text{otherwise} \end{cases} \qquad (10)$$

Here, $P_{ij}(t+1)$ represents the value of the i-th particle at the j-th position in the next iteration t+1. In other words, $P_{ij}$ is the value of the j-th feature in the i-th particle, j=1,2,3,…,m. rand(0,1) is a random value in [0,1]. Additionally, $sig(VP_{ij}(t+1))$ is the sigmoid transfer function, which indicates the probability that the j-th bit takes the value 1. $sig(VP_{ij}(t+1))$ is calculated by using (11).

$$sig(VP_{ij}(t+1)) = \frac{1}{1 + e^{-VP_{ij}(t+1)}} \qquad (11)$$
Here, e is the base of the natural logarithm. Based on the new position $P_i(t+1)$ of every particle in S, every particle is evaluated using the fitness function in (6). These calculations continue until the specified number of generations is completed. At the end, the best particle of the whole swarm, $P_{Global}$, is the output and the algorithm terminates. All features denoted by 1 in this particle represent the most effective features for COVID-19 diagnosis. After the APSO algorithm is applied to the COVID-19 dataset, six features are selected as the best subset of features: White Blood Cell (WBC), Lymphocyte (LYM), D-Dimer (D-D), C-Reactive Protein (CRP), Procalcitonin (PCT), and Lactate Dehydrogenase (LDH). To implement APSO, assume an 'm'-dimensional feature space $FS=\{f_1, f_2, \dots, f_m\}$. Additionally, the input training data of 'n' patients can be expressed by $D=\{T_1, T_2, \dots, T_n\}$ and the testing data of 'q' patients can be expressed by $Q=\{E_1, E_2, \dots, E_q\}$. Each item of D and Q is expressed as an ordered set of 'm' feature values. Hence, each item $T_i$ and $E_j$ can be expressed in an 'm'-dimensional space of features. For COVID-19 diagnosis, it is important to reduce the m dimensions by eliminating non-informative features in the COVID-19 dataset, in order to avoid overfitting and enhance the performance of the classification model. The sequential steps of the APSO method using 'g' filter methods are illustrated in Algorithm 1.
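The BPSO loop described by (6) to (11) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' Algorithm 1: the function name `bpso`, the fixed random seed, and the toy fitness function (which stands in for the accuracy of an NB classifier) are hypothetical, and the initial swarm is assumed to come from the filter methods of IS2.

```python
import math
import random


def sigmoid(v):
    """Sigmoid transfer function of (11), written to avoid overflow for large |v|."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    z = math.exp(v)
    return z / (1.0 + z)


def bpso(initial_swarm, fitness, generations=20, w=1.1, c1=2.0, c2=2.0, seed=0):
    """Binary PSO seeded with `initial_swarm`; returns (global best bits, its fitness)."""
    rng = random.Random(seed)
    swarm = [list(p) for p in initial_swarm]          # current positions P_i
    m = len(swarm[0])
    velocity = [[0.0] * m for _ in swarm]             # VP_i starts at 0
    personal = [list(p) for p in swarm]               # personal bests start at P_i
    personal_fit = [fitness(p) for p in swarm]
    best = max(range(len(swarm)), key=personal_fit.__getitem__)
    g_best, g_fit = list(personal[best]), personal_fit[best]

    for _ in range(generations):
        for i, p in enumerate(swarm):
            r1, r2 = rng.random(), rng.random()
            for j in range(m):
                # Velocity update (9): inertia + cognitive term + social term.
                velocity[i][j] = (w * velocity[i][j]
                                  + c1 * r1 * (personal[i][j] - p[j])
                                  + c2 * r2 * (g_best[j] - p[j]))
                # Position update (10): bit j becomes 1 with probability sig(velocity).
                p[j] = 1 if rng.random() < sigmoid(velocity[i][j]) else 0
            f = fitness(p)
            if f > personal_fit[i]:                   # personal best update, as in (7)
                personal[i], personal_fit[i] = list(p), f
                if f > g_fit:                         # keep the best particle of the swarm
                    g_best, g_fit = list(p), f
    return g_best, g_fit


# Initial swarm taken from four (assumed) filter results over m = 6 features.
initial_swarm = [[1, 0, 1, 0, 1, 1], [0, 0, 1, 1, 0, 1],
                 [1, 1, 1, 1, 0, 1], [1, 1, 0, 0, 1, 1]]
target = [1, 0, 1, 1, 0, 1]  # hypothetical "ideal" subset used only by the toy fitness


def toy_fitness(bits):
    """Stand-in for classifier accuracy: fraction of bits that agree with `target`."""
    return sum(b == t for b, t in zip(bits, target)) / len(target)


best_bits, best_fit = bpso(initial_swarm, toy_fitness, generations=2)
print(best_bits, best_fit)
```

In a real run, `toy_fitness` would be replaced by a function that trains and evaluates the classifier on the features selected by the particle. Because the global best is initialized from the best seeded particle and is replaced only by a strictly fitter one, the returned subset is never worse than the best filter result.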
Algorithm 1.

Feature selection using APSO algorithm.

To illustrate the idea, assume that there are four filter methods in IS2: Information Gain (IG) [31,32], Chi-square (CHI) [41,42,43], Fisher score (F) [44], and Correlation Based Feature Selection (CBFS) [45]. Additionally, consider that the number of features in the COVID-19 dataset is six (m=6); FS={f1, f2, f3, f4, f5, f6}. After implementing IG, CHI, F, and CBFS on the dataset, it is assumed that the subsets of selected features according to these methods are {f1, f3, f5, f6}, {f3, f4, f6}, {f1, f2, f3, f4, f6}, and {f1, f2, f5, f6} respectively (encoded as the binary particles P1 to P4 in Table 2, where a 1 in position j means that feature fj is selected). Hence, these four subsets of features are used as the four particles of the initial swarm (S) of BPSO in FS2. Then, BPSO is run under the assumptions listed in Table 2.
Table 2

The assumptions for employing BPSO in FS2.

| No. | Assumption | Value |
|---|---|---|
| 1 | No. of generations to process | 2 |
| 2 | Swarm size (no. of particles) | 4 (no. of filter methods in IS2, g) |
| 3 | Initial P_personal(p_i) | P_i |
| 4 | Initial P_Global | 0 |
| 5 | Initial VP_i | 0 |
| 6 | Particle size "P" | 6 (no. of features, m) |
| 7 | Fitness function | Accuracy of NB classifier |
| 8 | Initial swarm | P1={1, 0, 1, 0, 1, 1}; P2={0, 0, 1, 1, 0, 1}; P3={1, 1, 1, 1, 0, 1}; P4={1, 1, 0, 0, 1, 1} |
| 9 | w | 1.1 |
| 10 | c1=c2 | 2 |
| 11 | r1=r2 | 0.6 |
According to these assumptions, BPSO is run for two iterations, producing a new swarm with updated values for the four particles P1, P2, P3, and P4. After the four particles are evaluated, the one that achieves the highest fitness value is taken as P_Global, the best particle in the swarm, and it provides the best subset of features. Finally, the most effective features in the COVID-19 dataset are those denoted by 1 in P_Global.
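The two-generation run described above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the constants and the initial swarm come from Table 2, while `toy_fitness` is a hypothetical stand-in for the NB accuracy used as the real fitness function, and seeding the global best with the fittest initial particle is an assumption (Table 2 lists the initial P_Global as 0).

```python
import math
import random

# Constants and initial swarm taken from Table 2.
W, C1, C2, R1, R2 = 1.1, 2.0, 2.0, 0.6, 0.6
INITIAL_SWARM = [
    [1, 0, 1, 0, 1, 1],  # P1
    [0, 0, 1, 1, 0, 1],  # P2
    [1, 1, 1, 1, 0, 1],  # P3
    [1, 1, 0, 0, 1, 1],  # P4
]


def sigmoid(v):
    """Sigmoid transfer function: probability that a bit is set to 1."""
    return 1.0 / (1.0 + math.exp(-v))


def toy_fitness(particle):
    """HYPOTHETICAL fitness: agreement with an arbitrary mask.

    The paper uses the accuracy of an NB classifier trained on the
    selected features; that needs data, so a stand-in is used here.
    """
    target = [1, 0, 1, 1, 0, 1]
    return sum(1 for bit, want in zip(particle, target) if bit == want)


def run_bpso(swarm, fitness, generations=2, seed=0):
    rng = random.Random(seed)
    positions = [list(p) for p in swarm]
    velocities = [[0.0] * len(p) for p in swarm]     # initial VP_i = 0
    personal_best = [list(p) for p in swarm]         # initial P_personal(p_i) = P_i
    global_best = list(max(positions, key=fitness))  # ASSUMPTION: fittest initial particle
    for _ in range(generations):
        for i, pos in enumerate(positions):
            # Velocity update: inertia + cognitive + social terms.
            velocities[i] = [
                W * v + C1 * R1 * (pb - x) + C2 * R2 * (gb - x)
                for v, x, pb, gb in zip(velocities[i], pos, personal_best[i], global_best)
            ]
            # Position update: each bit is resampled through the sigmoid.
            positions[i] = [1 if rng.random() < sigmoid(v) else 0 for v in velocities[i]]
            if fitness(positions[i]) > fitness(personal_best[i]):
                personal_best[i] = list(positions[i])
        best_now = max(personal_best, key=fitness)
        if fitness(best_now) > fitness(global_best):
            global_best = list(best_now)
    return global_best


best = run_bpso(INITIAL_SWARM, toy_fitness)
selected = [f"f{j + 1}" for j, bit in enumerate(best) if bit == 1]
```

Because r1 and r2 are fixed at 0.6 as in Table 2, the only randomness left is the bit resampling, which the seed makes repeatable. The returned particle is never worse than the best initial particle under the chosen fitness.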

Classification phase (CP)

Owing to its simplicity and good performance, Naïve Bayes (NB) is widely employed to address classification problems in various real-world applications. However, NB sometimes suffers from degraded performance due to the unrealistic assumption that all features are independent and equally important. To compensate for these drawbacks, two measures are taken: (i) assigning weights to the elected COVID-19 features, which yields a Weighted Naïve Bayes (WNB) classifier, and (ii) fine-tuning the decision of WNB using distance-based biasing. Such biasing is based on the distance between the input item to be classified and the center of each target class in the employed feature space [46]. The Classification Phase (CP) is divided into two modules. The first is the Weighted Naïve Bayes Module (WNBM), in which a WNB classifier takes the initial decision on the degree to which the input item belongs to each of the considered classes. A feature weight vector is generated and then employed to drive the decision taken by the WNB classifier. The second module is the Distance Reinforcement Module (DRM), in which the belonging degree estimated by WNBM is fine-tuned to take the final decision. Hence, the new item can be classified into one of the considered target classes. The next subsections explain the details of both modules of the classification phase.

Weighted Naïve Bayes module (WNBM)

Despite its simplicity, the NB classifier has exhibited surprisingly good performance on a variety of data mining and machine learning problems, including the diagnosis of COVID-19 patients. However, due to the assumption that all features are independent and equally important, the predictions estimated by NB are sometimes poor. For example, when predicting whether a patient has COVID-19, the White Blood Cell count is expected to be much more important than the patient's height. The performance of NB can be improved by relaxing this assumption and giving a weight to each elected feature. In this subsection, an initial belonging score is assigned to the input item to be classified, given each class label, based on a WNB classifier. As efficiency is essential in a COVID-19 diagnosis system, each elected feature is weighted based on the performance of a base classifier. The weight of feature f_i, denoted w_i, indicates the feature's impact and is defined as the degradation in model accuracy after discarding f_i from the input feature set. Several base classifiers can be used to implement the underlying model, such as classical Naïve Bayes (NB), K-Nearest Neighbors (KNN), and Support Vector Machines (SVM). Feature weighting is a critical task that can improve the diagnosis accuracy. The weight of a feature can be defined as the positive effect of the feature on the overall system accuracy. It can be modeled as the difference between the accuracy of the model in the presence of the feature and in its absence. The feature weight can be calculated by (12), where w_i is the weight (impact) of feature f_i, accuracy(+f_i) is the accuracy of the model when the feature f_i is included in the feature set, and accuracy(-f_i) is the accuracy of the model when f_i is removed. The normalized weight of each feature is calculated using (13). A feature weight vector is then constructed that stores the normalized weights of all the features elected by the feature selection phase.
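The weighting step can be illustrated with a short sketch. The raw weight follows the accuracy-difference definition given above; the normalization is an assumption, since (13) is not reproduced here: dividing by the largest weight is used because it is consistent with Table 3, where the top normalized weight is 1. The accuracy figures are hypothetical and were chosen only so that the output matches the weights reported in Table 3.

```python
def feature_weights(acc_with_all, acc_without):
    """Raw weight of each feature: accuracy drop when that feature is removed."""
    return {f: acc_with_all - acc for f, acc in acc_without.items()}


def normalize_by_max(weights):
    """ASSUMED normalization: divide every weight by the largest one."""
    top = max(weights.values())
    if top <= 0:
        raise ValueError("no feature improves accuracy; cannot normalize")
    return {f: w / top for f, w in weights.items()}


# HYPOTHETICAL accuracies of a base classifier (not results from the paper).
ACC_ALL = 0.95
ACC_WITHOUT = {"f1": 0.85, "f2": 0.88, "f3": 0.87, "f4": 0.90, "f5": 0.89}

raw = feature_weights(ACC_ALL, ACC_WITHOUT)
normalized = normalize_by_max(raw)
```

A feature whose removal improves accuracy would receive a negative weight under this definition; how such a case should be handled is not stated in the text.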
The Belonging Score (BS) of an input item I to a class c_i can be calculated by (14), where BS(I|c_i) is the belonging score for the input item I given the class label c_i, P(c_i) is the prior probability of the class c_i, w_j is the normalized weight of the jth feature, and P(f_j|c_i) is the conditional probability of the feature f_j given the class c_i.
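A weighted Naïve Bayes score built from exactly these ingredients can be sketched as follows. The form used here, with each conditional probability raised to the power of its normalized weight, is a common WNB formulation and is an assumption; it is not confirmed to be the authors' equation (14). Priors, conditional probabilities, and weights are copied from Tables 3 to 9; the input item is hypothetical.

```python
WEIGHTS = {"f1": 1.0, "f2": 0.7, "f3": 0.8, "f4": 0.5, "f5": 0.6}  # Table 3
PRIOR = {"T": 6 / 10, "F": 4 / 10}                                  # Table 9
COND = {                                                            # Tables 4 to 8
    "T": {"f1": {"L": 2 / 6, "M": 3 / 6, "H": 1 / 6},
          "f2": {"L": 4 / 6, "H": 2 / 6},
          "f3": {"L": 5 / 6, "H": 1 / 6},
          "f4": {"L": 4 / 6, "M": 1 / 6, "H": 1 / 6},
          "f5": {"L": 2 / 6, "H": 4 / 6}},
    "F": {"f1": {"L": 1 / 4, "M": 1 / 4, "H": 2 / 4},
          "f2": {"L": 1 / 4, "H": 3 / 4},
          "f3": {"L": 1 / 4, "H": 3 / 4},
          "f4": {"L": 1 / 4, "M": 2 / 4, "H": 1 / 4},
          "f5": {"L": 3 / 4, "H": 1 / 4}},
}


def belonging_score(item, label, weights=WEIGHTS):
    """ASSUMED form: BS(I|c) = P(c) * product over j of P(f_j|c) ** w_j."""
    score = PRIOR[label]
    for feature, value in item.items():
        score *= COND[label][feature][value] ** weights[feature]
    return score


# HYPOTHETICAL input item (the paper's example vector is not reproduced here).
item = {"f1": "M", "f2": "L", "f3": "L", "f4": "L", "f5": "H"}
scores = {label: belonging_score(item, label) for label in PRIOR}
```

With all weights set to 1 the function reduces to the classical NB product, which is a quick sanity check on the weighting.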

Distance reinforcement module (DRM)

To take the final decision, the input item should be classified into one of the target classes. To accomplish this, all items (of the different target classes) are first projected into the considered n-dimensional feature space. The center of each class containing t examples in the n-dimensional feature space can be calculated using (15), where C is the class center in the considered n-dimensional feature space, t is the number of examples within the class, and each example q contributes the value of its ith dimension. The input item to be classified (e.g., I) is also projected into the n-dimensional feature space. Then, the Affiliation Degree (AD) of the input item given each target class is determined using (16), where AD(I|c_i) is the affiliation degree of the input item I given the class c_i, BS(I|c_i) is the belonging score for the input item I given the class label c_i, and the remaining term is the Euclidean distance between the input item I and the center of class c_i in the feature space. The distance between two points p1 and p2 in the n-dimensional feature space can be calculated using (17), in which the two coordinates are the values of the ith dimension of the points p1 and p2 respectively, as illustrated in Fig. 5 for 3 target classes. Finally, the target class of the input item I, denoted Target(I), can be identified using (18), where P(c_i) is the prior probability of the class c_i, w_j is the normalized weight of the jth feature, and P(f_j|c_i) is the conditional probability of the feature f_j given the class c_i.
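The distance reinforcement step can be sketched as below. The class center is taken as the per-dimension mean and the distance is Euclidean, as described above. The combination rule is an assumption: the affiliation degree is modeled here as the belonging score divided by the distance to the class center, so that a closer center raises the degree; the authors' exact (16) may differ. All numbers are hypothetical.

```python
import math


def class_center(examples):
    """Per-dimension mean of the t examples of one class."""
    t = len(examples)
    return [sum(column) / t for column in zip(*examples)]


def euclidean(p1, p2):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))


def affiliation_degree(belonging_score, distance):
    """ASSUMED form of (16): belonging score biased by inverse distance."""
    return belonging_score / distance


def target_class(item, centers, belonging_scores):
    """Return the label with the highest affiliation degree, plus all degrees."""
    degrees = {
        label: affiliation_degree(belonging_scores[label], euclidean(item, center))
        for label, center in centers.items()
    }
    return max(degrees, key=degrees.get), degrees


# HYPOTHETICAL two-dimensional training points and belonging scores.
train = {"T": [[1.0, 2.0], [3.0, 4.0]], "F": [[8.0, 9.0], [10.0, 11.0]]}
centers = {label: class_center(rows) for label, rows in train.items()}
label, degrees = target_class([2.0, 2.0], centers, {"T": 0.04, "F": 0.01})
```

An item that coincides exactly with a class center would cause a division by zero under this assumed form, so a real implementation would need a guard for that case.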
Fig. 5

Calculating the distance to class centers.


Illustrative example

This subsection presents an illustrative example showing how the diagnosis decision is taken in the classification phase of the proposed Distance Biased Naïve Bayes (DBNB) classification strategy. As illustrated in Table 3, consider a COVID-19 diagnosis database for 10 persons with 5 features labeled f1 to f5 and two target classes, namely "True" and "False" diagnosis. The symbols L, M, and H represent Low, Medium, and High respectively, while T and F represent a True or False diagnosis of COVID-19. The normalized weight of each feature is reported in the last row of Table 3. The conditional probability of each feature value given the different classes, as well as the prior probability of each class, are given in Tables 4–9.
Table 3

An example of single particle.

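| Case number | f1 | f2 | f3 | f4 | f5 | Diagnose |
|---|---|---|---|---|---|---|
| 1 | L | L | H | M | L | T |
| 2 | H | H | H | M | L | F |
| 3 | M | L | L | L | H | T |
| 4 | L | H | L | L | H | T |
| 5 | M | L | L | L | H | T |
| 6 | H | H | H | M | L | F |
| 7 | M | H | L | H | H | F |
| 8 | L | L | H | L | L | F |
| 9 | M | H | L | H | L | T |
| 10 | H | L | L | L | H | T |
| Normalized weight | 1 | 0.7 | 0.8 | 0.5 | 0.6 | |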
Table 4

Conditional probabilities for feature f1.

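| Values | Class T (count) | Class F (count) | P(f1\|T) | P(f1\|F) |
|---|---|---|---|---|
| L | 2 | 1 | 2/6 | 1/4 |
| M | 3 | 1 | 3/6 | 1/4 |
| H | 1 | 2 | 1/6 | 2/4 |
| Total | 6 | 4 | 100% | 100% |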
Table 9

Prior probabilities of the target classes.

| Diagnose | Count | Prior probability |
|---|---|---|
| T | 6 | P(T)=6/10 |
| F | 4 | P(F)=4/10 |
| Total | 10 | 1 |
Assume an input case I that represents a probable COVID-19 patient. Such an input case can be expressed as a point in the n-dimensional feature space using the numerical values of the considered features. The distances from the input case to the centers of the "True Diagnosed" and "False Diagnosed" classes are assumed to be 6.3 and 12.7 respectively. The numerical feature values of I can be mapped to a feature vector F that holds the corresponding linguistic values for features f1, f2, f3, f4, and f5 respectively. The task is to determine the target class of I. Initially, the belonging score of I to each class ("T" or "F") is determined using (14). Then, the Affiliation Degree (AD) of the input item given the "True" and "False" diagnosis classes is determined using (16); the values are found to be 0.006 and 0.00065 respectively. Since AD(I|T) > AD(I|F), the input case is classified as a COVID-19 infected patient.
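The prior and conditional probabilities in Tables 4–9 follow directly from counting the rows of Table 3. The sketch below recomputes them with exact fractions; it uses plain relative frequencies with no smoothing, which is what the tabulated values imply. The function names are illustrative.

```python
from collections import Counter
from fractions import Fraction

# Rows of Table 3: (f1, f2, f3, f4, f5, diagnose).
CASES = [
    ("L", "L", "H", "M", "L", "T"),
    ("H", "H", "H", "M", "L", "F"),
    ("M", "L", "L", "L", "H", "T"),
    ("L", "H", "L", "L", "H", "T"),
    ("M", "L", "L", "L", "H", "T"),
    ("H", "H", "H", "M", "L", "F"),
    ("M", "H", "L", "H", "H", "F"),
    ("L", "L", "H", "L", "L", "F"),
    ("M", "H", "L", "H", "L", "T"),
    ("H", "L", "L", "L", "H", "T"),
]


def priors(cases):
    """P(class) as the relative frequency of each diagnosis label."""
    counts = Counter(row[-1] for row in cases)
    return {label: Fraction(n, len(cases)) for label, n in counts.items()}


def conditional(cases, feature_index):
    """P(feature value | class) for one feature column (0 means f1)."""
    class_counts = Counter(row[-1] for row in cases)
    pair_counts = Counter((row[feature_index], row[-1]) for row in cases)
    return {
        (value, label): Fraction(n, class_counts[label])
        for (value, label), n in pair_counts.items()
    }


prior = priors(CASES)
cond_f1 = conditional(CASES, 0)  # Table 4
cond_f5 = conditional(CASES, 4)  # Table 8
```

Fractions are reduced automatically, so 6/10 is stored as 3/5 and 2/4 as 1/2, but the values are equal to those in the tables.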

Experimental results

In this section, the proposed Distance Biased Naïve Bayes (DBNB) classification strategy is evaluated. DBNB is implemented through two sequential phases: the Feature Selection Phase (FSP) and the Classification Phase (CP). In FSP, APSO is proposed as a new feature selection method to select the most significant features extracted from patients' laboratory findings. Then, the elected features are weighted in the feature weighting module, which uses the classical NB classifier as a base classifier to assign a weight to each feature based on its effect on the classification accuracy. The weighted features are used to enable the WNB classifier to take the initial decision. Then, depending on the distance between the testing item and the class centers in DRM, the final decision is taken. During the following experiments, four evaluation criteria, namely accuracy, precision, sensitivity (recall), and inference time, are used to evaluate each part of the proposed strategy [36]. Our implementation is based on a set of laboratory findings for COVID-19 patients collected from Mansoura University Hospital in Egypt. This collected data (patients dataset) has been employed to allow reproduction of the results introduced in this paper. Because the available dataset is small, cross-validation is used to validate the classification model. In this paper, 10-fold cross-validation divides the dataset into 10 equal partitions; in each round, one partition is used as the testing set and the remaining nine as the training set. Hence, in each round the numbers of training and testing patients are 2700 (90%) and 300 (10%) respectively. The applied parameters with the corresponding implemented values are listed in Table 10.
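The 10-fold split described above can be sketched without any library. This is a minimal illustration that assigns contiguous index ranges to folds; whether the authors shuffled or stratified the data before splitting is not stated, so neither is done here.

```python
def k_fold_indices(n_items, k=10):
    """Return k (train_indices, test_indices) pairs over range(n_items)."""
    fold_size, remainder = divmod(n_items, k)
    folds = []
    start = 0
    for i in range(k):
        # The first `remainder` folds get one extra item when n_items % k != 0.
        stop = start + fold_size + (1 if i < remainder else 0)
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_items))
        folds.append((train_idx, test_idx))
        start = stop
    return folds


folds = k_fold_indices(3000, k=10)
sizes = [(len(train), len(test)) for train, test in folds]
```

For 3000 patients every fold has 2700 training and 300 testing indices, and each patient appears in exactly one testing set.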
Table 10

The applied parameters with the corresponding used values.

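| Parameter | Description | Applied value |
|---|---|---|
| m | No. of extracted features | 12 |
| w | Inertia weight | 1.1 |
| c1 | The cognitive acceleration | 2 |
| c2 | The social acceleration | 2 |
| r1 | Uniformly distributed random number | 0.6 |
| r2 | Uniformly distributed random number | 0.6 |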
To implement the proposed DBNB in the training phase, (i) the dataset contains 3000 patients, including both COVID-19 and non COVID-19 patients, and (ii) twelve features from routine blood exams are used in both the training and testing datasets. These features are {White Blood Cell (WBC), Aspartate aminotransferase (AST), Alanine Aminotransferase (ALT), Lactate Dehydrogenase (LDH), C-reactive protein (CRP), Procalcitonin (PCT), Creatinine (Cr), Fibrinogen (FIB), Neutrophils (NEU), Lymphocytes (LYM), Interleukin-6 (IL-6), D-Dimer (D-D)}. APSO is used to select the most significant of these features. Firstly, the patients' laboratory findings pass in parallel through a set of filter methods: Information Gain (IG) [31,32], Chi-square (CHI) [41,42,43], Fisher score (F) [44], and Correlation Based Feature Selection (CBFS) [45]. After implementing IG, CHI, F, and CBFS on the patients' laboratory findings, the subsets of selected features according to these methods are {WBC, LDH, CRP, Albumin, NEU, IL-6}, {WBC, AST, ALT, PCT}, {WBC, LDH, LYM, NEU, D-D}, and {WBC, LDH, LYM, NEU, D-D} respectively. Hence, these four subsets of features are used as the initial swarm (S) of BPSO in FS2 to select the best subset of features. The resulting subset contains six features, which are {WBC, LDH, LYM, CRP, NEU, D-D}.

Dataset description

The dataset consists of medical records collected from patients at Mansoura University Hospital in Egypt [47]. These records contain laboratory findings from cases of different ages, sexes (male or female), and diseases. The collected dataset contains 3000 cases. The cases are categorized into COVID-19 patients and non COVID-19 patients, as presented in Table 11. COVID-19 patients are people who suffer from COVID-19. Non COVID-19 patients are people who do not suffer from COVID-19 but may suffer from other diseases. Thus, the non COVID-19 cases have been categorized into normal people who do not suffer from any disease, patients who suffer from other lung diseases, and patients who have chronic diseases such as diabetes and blood pressure diseases. The distribution of the cases in the collected dataset according to "Age", "Sex", and "type of disease" is shown in Figs. 6–8.
Table 11

Dataset description.

| Criteria | Value / Description |
|---|---|
| Total number of cases | Male: 1969; Female: 1031 |
| Not sick (ordinary) cases | 410 |
| Sick cases | COVID-19: 1990; Other: 600 |
| COVID-19 patients by age | <15: 20; 15-25: 98; 25-35: 170; 35-45: 287; 45-55: 395; 55-65: 420; >65: 600 |
Fig. 6

The total number of cases according to age.

Fig. 7

The total number of COVID-19 cases according to age and sex.

Fig. 8

The presentation of COVID-19 patient and non COVID-19 patient distribution.

Statistical analysis

COVID-19 patient characteristics were compared using t tests for continuous variables and chi-squared or Fisher exact tests for categorical variables. Descriptive statistics are expressed as mean (SD). P ≤ .05 was considered statistically significant. All statistical analyses were performed using SPSS, version 23.0, software (SPSS, Chicago, IL). Table 12 shows the statistical analysis of the COVID-19 patients' laboratory findings.
Table 12

Clinical laboratory data for 1990 COVID-19 patients.

| Features | Normal range | Severe group (n=696) | Mild group (n=1294) |
|---|---|---|---|
| WBC, ×10^9 per L | 3.5-9.5 | 4.96 ± 1.85 | 4.26 ± 1.64 |
| AST, U/L | 15-40 | 33.21 ± 18.24 | 27.80 ± 11.42 |
| ALT, U/L | 9-50 | 24.50 (15.75, 37.75) | 27.00 (21.00, 41.00) |
| LDH, U/L | 120-250 | 360–540 | 183–360 |
| CRP, mg/L | 0-10 | 18.76 ± 22.20 | 39.37 ± 27.68 |
| PCT, ng/ml | <0.1 | 0.02 (0.01, 0.04) | 0.04 (0.02, 0.09) |
| FIB, g/L | 2-4 | 3.11 ± 0.83 | 3.84 ± 1.00 |
| Cr, µmol/L | 74.3-107 | 66.96 ± 13.38 | 65.33 ± 15.55 |
| NEU, ×10^9 per L | 1.8-6.3 | 3.43 ± 1.63 | 2.65 ± 1.49 |
| LYM, ×10^9 per L | 1.1-3.2 | 1.07 ± 0.40 | 1.20 ± 0.42 |
| IL-6, pg/L | ≤ 20 | 10.60 (5.13, 24.18) | 36.10 (23.00, 59.20) |
| D-D, μg/L | 0-0.55 | 0.21 (0.19, 0.27) | 0.49 (0.29, 0.91) |

Testing the proposed advanced particle swarm optimization (APSO)

In this section, the proposed Advanced Particle Swarm Optimization (APSO) is evaluated using the NB classifier as a standard classifier. Results are shown in Table 13. Also, to demonstrate the effectiveness of the proposed method, APSO is compared with several recent feature selection techniques: Hybrid Fuzzy ARTMAP and Brain Storm Optimization (FAM-BSO) [48], Opposition-based Crow Search (OCS) algorithm [49], Filter-Wrapper Feature Subset Selection (FWFSS) [50], and parallelized Hybrid Feature Selection (HFS) [51]. Results are given in Table 14.
Table 13

Performance of APSO in terms of accuracy, precision, and recall.

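| Fold | Accuracy | Precision | Recall |
|---|---|---|---|
| 1 | 93.4% | 91.68% | 92.28% |
| 2 | 94.5% | 95.5% | 95.5% |
| 3 | 92.63% | 94.2% | 94% |
| 4 | 94.72% | 95.01% | 94% |
| 5 | 95.52% | 95.8% | 94% |
| 6 | 95.36% | 94.54% | 95.76% |
| 7 | 94.52% | 94.98% | 95.61% |
| 8 | 94.52% | 95.98% | 96.79% |
| 9 | 93% | 91.5% | 95.29% |
| 10 | 94.54% | 94.6% | 91.5% |
| Average | 94.271% | 94.379% | 94.473% |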
Table 14

Comparison between APSO and the existing feature selection techniques in terms of accuracy, precision, recall, and inference time.

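| Used technique | Accuracy | Precision | Recall | Inference time (sec) |
|---|---|---|---|---|
| FAM-BSO | 92.8% | 92% | 92.2% | 14 |
| OCS | 89% | 89.234% | 89.98% | 11 |
| FWFSS | 91.12% | 90.58% | 90.8% | 12.5 |
| HFS | 93.3% | 92.89% | 92.5% | 12 |
| APSO | 94.271% | 94.379% | 94.473% | 9 |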
As shown in Table 13, the results are presented for each fold together with their averages. The average accuracy, precision, and recall for APSO are 94.271%, 94.379%, and 94.473% respectively. Table 14 shows that APSO has the highest accuracy (94.271%, against 93.3% for HFS, the closest competitor) and the shortest inference time (9 seconds, against 11 to 14 seconds for the other techniques). Consequently, APSO outperforms FAM-BSO, OCS, FWFSS, and HFS on all four criteria.
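For clarity, the three quality metrics can be computed from confusion-matrix counts, and the averages follow from the per-fold values. The sketch below uses the standard definitions, which are assumed to match those used in the paper. The confusion counts are hypothetical; the per-fold accuracies are copied from Table 13.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all cases that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of true positives that were detected (sensitivity)."""
    return tp / (tp + fn)


def mean(values):
    return sum(values) / len(values)


# HYPOTHETICAL confusion counts for one 300-patient testing fold.
tp, fp, tn, fn = 190, 10, 90, 10
fold_metrics = (accuracy(tp, fp, tn, fn), precision(tp, fp), recall(tp, fn))

# Per-fold accuracies (%) copied from Table 13.
APSO_ACCURACY = [93.4, 94.5, 92.63, 94.72, 95.52, 95.36, 94.52, 94.52, 93.0, 94.54]
apso_average = mean(APSO_ACCURACY)
```

Averaging the ten accuracies reproduces the 94.271% reported in the last row of Table 13.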

Testing the proposed weighted Naïve Bayes module (WNBM)

In this subsection, the proposed Weighted Naïve Bayes Module (WNBM) is evaluated. WNBM is compared against recently used classification methods: (i) Enhanced K-Nearest Neighbors (EKNN) [52], (ii) Naïve Bayes-Probabilistic Kernel Classifier (NB-PKC) [53], and (iii) Whale Optimization Algorithm-SVM (WOA-SVM) [54]. Results are shown in Tables 15 and 16.
Table 15

Performance of WNBM in terms of accuracy, precision, and recall.

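| Fold | Accuracy | Precision | Recall |
|---|---|---|---|
| 1 | 96.5% | 95.2% | 96.6% |
| 2 | 95.98% | 94.9% | 93.98% |
| 3 | 97.5% | 96.8% | 96.99% |
| 4 | 97.5% | 96.8% | 96.99% |
| 5 | 95.36% | 94.6% | 93.6% |
| 6 | 96.365% | 95.87% | 94.12% |
| 7 | 97% | 96.8% | 94.2% |
| 8 | 95.98% | 94.6% | 93.9% |
| 9 | 96.78% | 95% | 94.9% |
| 10 | 96.89% | 95.5% | 94.99% |
| Average | 96.585% | 95.607% | 95.027% |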
Table 16

Comparison between WNBM and the existing classification techniques in terms of accuracy, precision, recall, and inference time.

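| Used technique | Accuracy | Precision | Recall | Inference time (sec) |
|---|---|---|---|---|
| EKNN | 93.5% | 90.9% | 92.3% | 18 |
| NB-PKC | 94.02% | 91.68% | 90.78% | 13 |
| WOA-SVM | 92.6% | 90.89% | 91% | 14 |
| WNBM | 96.585% | 95.607% | 95.027% | 11 |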
Table 15 presents the accuracy, precision, and recall results for WNBM. As shown in Table 15, the lowest performance values of WNBM are reported for folds 2, 5, and 8, while the best values are reported for folds 1, 3, 4, 6, 7, 9, and 10. The average accuracy, precision, and recall for WNBM are 96.585%, 95.607%, and 95.027% respectively. As illustrated in Table 16, WNBM achieves the highest accuracy (96.585%, against 94.02% for NB-PKC, the closest competitor) and the shortest inference time (11 seconds, against 13 to 18 seconds for the other methods). It is therefore concluded that WNBM is both more accurate and faster than EKNN, NB-PKC, and WOA-SVM.

Testing the proposed distance biased Naïve Bayes (DBNB) classification strategy

In this subsection, the proposed DBNB is tested. All the proposed components are used in DBNB: APSO is employed for feature selection, and the proposed classification model, which contains the two modules WNBM and DRM, is used for classification. Results are shown in Table 17. Also, to demonstrate the effectiveness of the proposed strategy for diagnosing COVID-19 patients, it is compared against some recently used COVID-19 classification methods: CNN [22], GMDH [23], CPDS [24], DarkCovidNet [25], and CS [26]. Results are shown in Table 18.
Table 17

Performance of DBNB in terms of accuracy, precision, and recall.

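| Fold | Accuracy | Precision | Recall |
|---|---|---|---|
| 1 | 97.78% | 96% | 96.5% |
| 2 | 96.86% | 95.9% | 95% |
| 3 | 96.86% | 95.5% | 95.5% |
| 4 | 97.78% | 96% | 96.5% |
| 5 | 97.78% | 96% | 96.5% |
| 6 | 97.78% | 96.6% | 96.85% |
| 7 | 96.86% | 95.86% | 95% |
| 8 | 97.78% | 96.5% | 96.5% |
| 9 | 97.78% | 96% | 96.85% |
| 10 | 97.78% | 97.1% | 97.2% |
| Average | 97.504% | 96.146% | 96.24% |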
Table 18

Comparison between DBNB and the existing classification techniques in terms of accuracy, precision, and recall.

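| Used technique | Accuracy | Precision | Recall |
|---|---|---|---|
| CNN | 84.2% | 85.3% | 82.12% |
| GMDH | 92.4% | 93% | 91% |
| CPDS | 94.6% | 90.06% | 91.63% |
| DarkCovidNet | 85% | 87.2% | 85.21% |
| CS | 90.2% | 89.19% | 89.4% |
| DBNB | 97.504% | 96.146% | 96.24% |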
Table 17 presents the accuracy, precision, and recall of DBNB in detecting COVID-19 patients. The lowest performance values of the DBNB model are obtained for folds 2, 3, and 7, while the best values are obtained for folds 1, 4, 5, 6, 8, 9, and 10. The average accuracy, precision, and recall are 97.504%, 96.146%, and 96.24% respectively. According to Table 18, DBNB outperforms CNN, GMDH, CPDS, DarkCovidNet, and CS on all three metrics; its average accuracy of 97.504% exceeds that of the closest competitor, CPDS (94.6%). The reason is that DBNB gives fast and accurate detection of infected COVID-19 patients by using the proposed CP modules, WNBM and DRM, on the most effective and significant features for COVID-19 diagnosis, which are selected through FSP. DBNB is also simpler and more flexible, and it can be applied to the detection of other diseases. DBNB has proven to be a safe decision-making system for detecting COVID-19 patients. Consequently, it can help protect the healthcare system from exhaustion.

Conclusions and future work

COVID-19 infection has grown at a rapid rate and still threatens the lives of billions of people. Therefore, early detection of COVID-19 patients is vital for disease cure and control. The literature review shows that no optimal technique has been determined yet. In this work, we have presented an accurate and intelligent classification strategy which can potentially provide smart medical diagnosis. In our classification strategy, DBNB is built upon two essential parts: feature selection and a new classification model. The proposed feature selection methodology, called APSO, combines the benefits of both filter and wrapper selection methods. APSO elects the most informative and effective features from the features extracted from patients' laboratory findings. Then, the elected features are weighted to feed the proposed classification model, which contains two modules called WNBM and DRM, to make an accurate decision. Experimental results showed that the proposed DBNB provides fast and accurate results compared to other recent methods in terms of accuracy, precision, sensitivity, and inference time. In spite of its effectiveness in diagnosing COVID-19 patients, the proposed DBNB relies only on numerical data. However, nominal data such as the data extracted by analyzing CT images may be valuable for confirming the infection. With regard to future research, more work can be done to combine the proposed diagnosis strategy (DBNB) with other diagnosis strategies that depend on nominal data. This may improve the diagnosis accuracy. Moreover, a test-cost analysis should be performed on the proposed APSO to achieve the maximum accuracy as well as the minimum cost.

Declaration of Competing Interest

The authors declare that they have no conflict of interest. "This paper does not contain any studies with human participants or animals performed by any of the authors."
Table 5

Conditional probabilities for feature f2.

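| Values | Class T (count) | Class F (count) | P(f2\|T) | P(f2\|F) |
|---|---|---|---|---|
| L | 4 | 1 | 4/6 | 1/4 |
| H | 2 | 3 | 2/6 | 3/4 |
| Total | 6 | 4 | 100% | 100% |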
Table 6

Conditional probabilities for feature f3.

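| Values | Class T (count) | Class F (count) | P(f3\|T) | P(f3\|F) |
|---|---|---|---|---|
| L | 5 | 1 | 5/6 | 1/4 |
| H | 1 | 3 | 1/6 | 3/4 |
| Total | 6 | 4 | 100% | 100% |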
Table 7

Conditional probabilities for feature f4.

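| Values | Class T (count) | Class F (count) | P(f4\|T) | P(f4\|F) |
|---|---|---|---|---|
| L | 4 | 1 | 4/6 | 1/4 |
| M | 1 | 2 | 1/6 | 2/4 |
| H | 1 | 1 | 1/6 | 1/4 |
| Total | 6 | 4 | 100% | 100% |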
Table 8

Conditional probabilities for feature f5.

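| Values | Class T (count) | Class F (count) | P(f5\|T) | P(f5\|F) |
|---|---|---|---|---|
| L | 2 | 3 | 2/6 | 3/4 |
| H | 4 | 1 | 4/6 | 1/4 |
| Total | 6 | 4 | 100% | 100% |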