Dimple Tiwari1, Bhoopesh Singh Bhati1, Fadi Al-Turjman2, Bharti Nagpal1. 1. Ambedkar Institute of Advanced Communication Technologies and Research, Govt of NCT of Delhi Delhi India. 2. Artificial Intelligence Engineering Department, Research Center for AI and IoT Near East University Nicosia Turkey.
Abstract
The novel Coronavirus disease (Covid-19) is an infectious disease that spreads primarily through droplets of nasal discharge when sneezing and saliva when coughing; it was first reported in Wuhan, China, in December 2019. Covid-19 became a global pandemic with a harmful impact on the world. Academic researchers around the world are proposing many predictive models of Covid-19 to support key decisions and the enforcement of appropriate control measures. Because accurate Covid-19 records are lacking and uncertainty is high, standard techniques fail to predict the global effects of the epidemic correctly. To address this issue, we present an Artificial Intelligence (AI)-based meta-analysis to predict the trend of the Covid-19 epidemic across the world. Three machine learning algorithms, namely Naïve Bayes, Support Vector Machine (SVM) and Linear Regression, were applied to a real time-series dataset that holds the global record of confirmed, recovered, death and active cases of the Covid-19 outbreak. Statistical analysis was also conducted to present various facts regarding observed Covid-19 symptoms, a list of the Top-20 Coronavirus-affected countries and the number of active cases across the world. Among the three machine learning techniques investigated, Naïve Bayes produced promising results in predicting Covid-19 future trends, with lower Mean Absolute Error (MAE) and Mean Squared Error (MSE). The lower MAE and MSE values indicate the effectiveness of the Naïve Bayes regression technique, although the global footprint of this pandemic is still uncertain. This study demonstrates the various trends and future growth of the global pandemic to support a proactive response from citizens and governments. This paper sets an initial benchmark to demonstrate the capability of machine learning for outbreak prediction.
Covid‐19 is caused by Severe Acute Respiratory Syndrome Coronavirus 2 (SARS‐CoV‐2) and became a global public health concern in 2020. Person‐to‐person transmission of SARS‐CoV‐2 led to the isolation of patients. Most people infected with Covid‐19 report mild to moderate respiratory illness, but the disease can progress to critical illness with multiorgan failure (MOF) and acute respiratory distress syndrome (ARDS). Specific treatment and vaccines are not yet available for this disease, which makes it a grave concern worldwide. The current Covid‐19 pandemic poses a serious threat to global health and has spread quickly from its origin in Wuhan city, Hubei Province, China, to the rest of the world (Wang, Horby, et al., 2020). As of 25 May 2020, around 5,520,684 confirmed cases, 2,313,167 recovered cases and 347,013 deaths had been reported around the globe (Coronavirus Outbreak, 2020). On 31 December 2019, China informed the World Health Organization (WHO) of the outbreak, and on 01 January 2020 the Huanan Seafood market was closed. On 7 January 2020, the virus was identified as a Coronavirus with >95% similarity to bat Coronavirus and >70% similarity to SARS‐CoV. Environmental samples taken from the Huanan Seafood market also tested positive, indicating that the virus originated there (World Health Organization Situation reports, 2020). To control the outbreak, a lockdown was announced in all the cities of China from 23 January 2020. Travel was strictly limited, social gatherings were restricted, national holidays were extended, public places were closed and rigorous nationwide temperature screening was started. It is unclear to what extent these controls help, but as the disease has spread globally, most countries have applied these measures as the only available solution.
Coronavirus has spread continuously and rapidly across the world, with more than 5.3 million confirmed cases in about 188 countries as of May 2020. Apart from the US, most of the affected countries are in Europe, the Middle East and North Africa. They include the US, Italy, France, Spain, Mexico, Russia, Germany, UK, Japan, Algeria, Egypt, Israel, Iran, Iraq, Bahrain, Turkey, Romania, Greece, Belgium, Norway and Sweden. The affected countries and regions in Asia include China, India, South Korea, Japan, Thailand, Hong Kong, Vietnam, the Philippines, Malaysia, Singapore, Indonesia and Hubei.
Other human Coronaviruses, namely NL63, HKU1, OC43 and 229E, have circulated among humans and usually cause mild respiratory problems (Singhal, 2020). Covid‐19 is less harmful to young and healthy people but causes severe symptoms in old and sick people, in whom it can progress to pneumonia, multiorgan dysfunction and acute respiratory distress syndrome. Laboratory findings show that Covid‐19 patients have normal or low white cell counts and elevated C‐reactive protein (CRP). A computerized tomographic scan of the chest generally looks abnormal even in those who have mild disease or no symptoms at all. Home isolation of suspected cases is an important measure to prevent spread of the disease. The virus spreads fast but has a comparatively low fatality rate. Initial symptoms of this disease include cough, fever, headache, sore throat, fatigue, breathlessness and myalgia. Coronavirus can cause systemic and respiratory disorders in a patient's body. The virus has an incubation period of approximately 5.2 days. Covid‐19 has been found to share many similarities with earlier beta‐Coronaviruses. However, some other symptoms have also been observed in Covid‐19 patients, such as sneezing and rhinorrhoea. Intestinal symptoms such as diarrhoea have also been found in Covid‐19 patients.
Figure 1 depicts the list of some common systemic and respiratory disorders in the body of a Covid‐19 patient (Rothan & Byrareddy, 2020).
FIGURE 1
Covid‐19 symptoms (systemic disorders VS respiratory disorders)
Covid‐19 symptoms start to appear after the incubation period, which is approximately 5 days (Bai et al., 2020). The period from the initial symptoms of the disease to the death of the patient ranges from 6 to 41 days, with a median of 14 days. This period depends strongly on the patient's age and the status of their immune system (Wang, Tang, & Wei, 2020). The platelet count in the blood is a biomarker that is directly associated with disease severity and mortality risk in the Intensive Care Unit (ICU) (Khurana & Deoke, 2017). Moreover, low platelet counts correspond to higher disease severity scores such as the Multiple Organ Dysfunction Score (MODS), Acute Physiology and Chronic Health Evaluation (APACHE) and Simplified Acute Physiology Score (SAPS) (Vanderschueren et al., 2000).
Information Technology and Artificial Intelligence are playing an essential role in the prediction and analysis of Covid‐19 trends. Powerful machine learning algorithms have become a handy tool for Covid‐19 prediction. Mardani et al. (2020) extended the Hesitant Fuzzy Set (HFS) approach using the Weighted Aggregated Sum Product Assessment (WASPAS) and Stepwise Weight Assessment Ratio Analysis (SWARA) methods to rank the issues and challenges of Digital Technologies intervention to control the Covid‐19 pandemic. Data Mining techniques applied to medical science have gained popularity owing to their performance in predicting outcomes and supporting real‐time decisions (Asri et al., 2016). Using various machine learning algorithms and statistical techniques, we try to uncover hidden trends, unknown facts and their relationships from the real‐time time‐series dataset of the Covid‐19 epidemic.
Data Mining applications are helpful for making better health policies and preventing hospital errors (Patel et al., 2015). We selected three algorithms, Naïve Bayes, Support Vector Machine and Linear Regression, to predict the future trends of the spread of Coronavirus in the world on the basis of the current records of this disease. The WHO has maintained a large number of real‐time records of confirmed Covid‐19 cases from which unknown facts can be discovered. Machine learning techniques can help health care professionals take further decisions for the prevention and control of this pandemic. This paper suggests an intelligent prediction system for the Covid‐19 pandemic that incorporates the benefits of (1) real‐time Covid‐19 pandemic time‐series data, (2) visualization of facts related to the pandemic across the world and (3) automatic future prediction for Covid‐19. The major contributions of this work are as follows:
A meta‐analysis to predict and analyse the trend of the Covid‐19 epidemic across the world, with a graphical representation of Covid‐19 symptoms, active cases and a list of the top‐20 Coronavirus‐affected countries.
A deep literature survey regarding prediction, screening, contact tracing, forecasting, medication and treatment of Covid‐19 using AI techniques.
AI‐based prediction and forecasting for analysing the trends and growth of the novel Covid‐19 outbreak.
A comparative analysis of Naïve Bayes (NB), Linear Regression (LR) and Support Vector Machine (SVM) techniques on the real‐time epidemiological dataset.
The rest of this study is organized as follows: Section 2 presents the literature on Covid‐19 and machine learning‐based predictions, Section 3 visualizes the facts related to Covid‐19 from a time‐series dataset, Section 4 presents the machine learning algorithms applied to analyse the facts and trends of the Covid‐19 pandemic, Section 5 presents the methodology, experiments and results of this analytical study and, finally, Section 6 presents the conclusion of this meta‐analysis.
LITERATURE REVIEW
The outbreak of the Covid‐19 pandemic has created a need for research in this area, and various researchers have presented their views and ideas on it. Although the outbreak began only at the end of 2019, it has spread across many countries, and a large number of papers have proposed theories and research related to it. This section presents the research related to Covid‐19 and to machine learning‐based clinical prediction.
Pandemic novel Coronavirus (Covid‐19) effects
Clinical mortality prediction and analysis of Covid‐19 was carried out on the records of 150 Chinese Covid‐19 patients (Ruan et al., 2020). Rothan and Byrareddy (2020) highlighted the transmission, symptoms, epidemiology, pathogenesis and future directions for controlling this epidemic, and concluded that reducing person‐to‐person transmission is the only solution to control the current outbreak. Kucharski et al. (2020) presented a mathematical model for the early transmission and control of Coronavirus; the model was combined with four datasets of SARS‐CoV‐2 cases from within and outside Wuhan to assess the potential for human‐to‐human transmission of the disease. Yang, Zheng, et al. (2020) presented a meta‐analysis of the prevalence of comorbidities and their effects on Covid‐19 patients and found that the most prevalent symptoms are fever, cough and fatigue, whereas the most prevalent comorbidities are hypertension and diabetes. Lippi et al. (2020) investigated whether the platelet count in blood samples of patients with non‐severe Covid‐19 differs from that of patients with severe disease. The relation of thrombocytopenia with severe Covid‐19 was also evaluated, and the results showed that low platelet counts correspond to the severity of Covid‐19. Srivastava et al. (2020) predicted the effects of Covid‐19 using a parameter estimation method; the effects of lockdown, the speed of Coronavirus spread, the reproduction number and the contact ratio were also analysed. Rahman et al. (2020) proposed a clustering‐based framework to analyse the economic impact of the Covid‐19 outbreak, using the Malaysian context as a case study to validate the proposed algorithm. Karmore et al. (2020) focused on developing a cost‐effective Medical Diagnosis Humanoid (MDH) for testing the symptoms of Coronavirus in the human body.
A systematic review and meta‐analysis was performed using three datasets to assess the clinical, laboratory and imaging features of confirmed Covid‐19 cases (Rodriguez‐Morales et al., 2020). Fang et al. (2020) found that patients with diabetes and hypertension are prone to Coronavirus infection and suggested that cardiac patients, hypertension patients, diabetic patients and people treated with ACE2‐increasing drugs are at higher risk of Covid‐19. Alimadadi et al. (2020) suggested that Machine Learning and Artificial Intelligence are powerful techniques for fighting the Covid‐19 epidemic that can help in prevention, therapeutics, diagnosis and in‐hospital operations. Wynants et al. (2020) presented a critical appraisal and systematic review of prediction models for Coronavirus infection and concluded that prediction models have gained a place in the literature for supporting medical decisions. Al‐Turjman and Deebak (2020) presented a Privacy‐Aware Energy‐Efficient Framework (P‐AEEF) protocol for securing the information of Covid‐19 patients; the proposed protocol improved energy efficiency and security against malicious access. Yang, Zeng, et al. (2020) predicted Covid‐19 epidemic trends by integrating data from before and after 23 January 2020 into a Susceptible‐Exposed‐Infectious‐Removed (SEIR) model to generate the epidemic curve. They concluded that the epidemic in China peaked in late February, with a gradual decline by the end of April. Peng et al. (2020) presented dynamic modelling to analyse the Covid‐19 epidemic in China.
Covid‐19 time‐series forecasting
Researchers have presented time‐series forecasts of Coronavirus trends for different countries. Chimmula et al. (2020) presented Covid‐19 time‐series prediction for Canada. Various features were evaluated to predict the trends of the pandemic, and an approximate end time of the outbreak was estimated for Canada and for the world. A Long Short‐Term Memory (LSTM) model was used to forecast future Coronavirus cases along with the transmission rates of Canada, the UK and Italy. Melin et al. (2020) used ensemble Neural Network models with fuzzy response aggregation to forecast Covid‐19 time‐series trends in Mexico. Fuzzy logic aggregates the predictions of the ensembles and handles the uncertainty of their forecasts. The simulated results on the Mexican Coronavirus time‐series dataset showed good predictions and low error rates. Maleki et al. (2020) presented a forecasting model for recovered and confirmed Coronavirus cases to support outbreak control and efficient management of health care resources. A statistical methodology was used for accurate forecasting of time‐indexed data: the Autoregressive Two‐Piece Scale Mixture Normal Distributions (TP‐SMN‐AR) model, a family of symmetric/asymmetric and light/heavy‐tailed models, was used to forecast Covid‐19 cases.
Petropoulos and Makridakis (2020) introduced an objective approach for the continuous prediction of Covid‐19. Their forecasts suggest a continued increase in confirmed Coronavirus cases, with associated uncertainty. Models from the exponential smoothing family, which can capture short‐duration patterns with additive and multiplicative combinations, were used to produce the forecasts. Hu et al. (2020) presented AI‐based forecasting of Covid‐19 to identify the trends and effects of the pandemic in China.
The approach estimates the length, size and ending time of the Coronavirus outbreak across China. A modified stacked encoder was developed for the prediction, which can forecast confirmed Covid‐19 cases in real time. Ceylan (2020) formulated various ARIMA models with different parameters; the forecasts and predictions made by the models help in deciding precautions and formulating policy for the outbreak. Salgotra et al. (2020) provided genetic programming‐based forecasting of Covid‐19 trends in India. Various statistical parameters and explicit formulas were used to evaluate the effectiveness of the forecasting model. It was concluded that genetic programming‐based models rely on simple linkage functions and provide highly reliable time‐series forecasting results.
Lalmuanawma et al. (2020) presented a comprehensive review of the role of AI and machine learning in predicting, forecasting, screening and drug development for Covid‐19 and related epidemics. They stated that AI and machine learning have remarkably improved medication, screening, prediction and forecasting for Covid‐19 and reduced human intervention in medical practice. Tuli et al. (2020) applied a machine learning‐based mathematical model to measure the threat of Covid‐19 across the world. An iterative weighting‐based generalized framework was developed for real‐time prediction of the epidemic. The proposed model achieved higher accuracy and can help in taking Covid‐19‐related decisions. Vaishya et al. (2020) presented the role of AI as a decisive technology in the fight against Coronavirus and concluded that healthcare departments need AI technology to handle the Covid‐19 outbreak and require proper real‐time suggestions to reduce the spread. Wang, Zheng et al. (2020) integrated the most up‐to‐date Covid‐19 epidemiological dataset and fitted it to a Logistic model to analyse the epidemic trends.
The cap value was then fed into the Fbprophet model to draw the pandemic curve and make predictions. The proposed mathematical model estimated that the global pandemic would peak in late October, with approximately 14.12 million people infected cumulatively. Tiwari and Bhati (2020) presented a prediction of Covid‐19 for India using Gradient‐Boost, Extra‐Tree, AdaBoost and Random‐Forest and concluded that machine learning is an efficient approach to predicting the outbreak.
Machine‐learning
Machine learning is a practical tool for prediction and classification problems; it helps decision‐makers in various fields and provides good results in medical diagnosis and disease‐related prediction. A. R. Mishra et al. (2020) proposed a novel approach based on intuitionistic fuzzy sets to assess health‐care waste disposal techniques, using new parametric divergence measures. Asri et al. (2016) used machine learning algorithms to predict and diagnose the effects and risk of breast cancer using the Wisconsin breast cancer dataset, and reported that SVM outperformed Naïve Bayes, k Nearest Neighbour and Decision Tree with 97.13% accuracy. Kourou et al. (2015) stated that machine learning tools can reveal key features from complex datasets, and that techniques such as Decision Trees (DTs), SVMs, Bayesian Networks (BNs) and Artificial Neural Networks (ANNs) are widely applicable to the prediction and prognosis of disease. It is also evident that ML improves the understanding of cancer detection, resulting in effective decision making. Bhatla and Jyoti (2012) analysed heart disease prediction with various machine learning techniques and found that a Neural Network with 15 attributes outperformed the other techniques, whereas a Decision Tree also provided good accuracy in combination with feature subset selection and genetic algorithms.
Nilashi et al. (2017) proposed an analytical method for disease prediction that uses Expectation–Maximization (EM), Principal Component Analysis (PCA), Classification and Regression Trees (CART) and a Fuzzy rule‐based technique to extract rules from medical datasets for the disease prediction task.
The results showed that the combination of CART, the Fuzzy rule‐based technique and noise‐removal clustering outperformed other methods for disease prediction on medical datasets. Patel et al. (2015) applied Random Forest, the J48 algorithm and the Logistic model tree algorithm to the Cleveland database of the UCI repository for the diagnosis of heart disease, and concluded that J48 performs best in terms of accuracy and takes the least total time to build. Rani et al. (2020) applied a fuzzy assessment with a new score and entropy function to type 2 diabetes pharmacological therapy selection. Chen et al. (2017) presented machine learning techniques for predicting outbreaks of chronic disease in communities. To overcome the problem of missing values in the dataset, a latent factor model was used, and a new Convolutional Neural Network‐based disease risk prediction (CNN‐MDRP) model was proposed with 94.8% accuracy. Gokul et al. (2013) proposed the application of the Fully Complex‐Valued Radial Basis Function (FC‐RBF), the Metacognitive Fully Complex‐Valued Radial Basis Function Network (Mc‐FCRBF) and the Extreme Learning Machine (ELM) for predicting Parkinson's disease. Nilashi et al. (2018) proposed a hybrid intelligent system for predicting the Unified Parkinson's Disease Rating Scale (UPDRS) that takes advantage of incremental machine learning and Incremental SVM. The model achieved Mean Absolute Error (MAE) = 0.4656 for total UPDRS and MAE = 0.4967 for Motor UPDRS.
Książek et al. (2019) proposed a novel machine learning‐based approach to detect hepatocellular carcinoma at an early stage. A 5‐fold Genetic Algorithm, SVM, Feature Selection and Normalization were applied to obtain the best prediction results, with F1‐Scores of 0.8849 and 0.8762. Long et al. (2015) proposed a heart disease diagnosis system using an Interval type‐2 Fuzzy Logic System (IT2FLS) and a Rough sets‐based reduction system, which handles the uncertainties and high‐dimensional challenges of the dataset. This literature on machine learning‐based prediction in medical diagnosis motivates us to predict the facts, effects and future trends of the Covid‐19 outbreak across the world using machine learning techniques. Medhekar et al. (2013) presented Naïve Bayes heart disease prediction using five basic categories: low, avg, high, very high and no. It achieved accuracies of 88.76, 89.58 and 88.96 for heart disease risk prediction. Pattekari and Parveen (2012) developed a Naïve Bayes‐based intelligent system to predict the risk of heart disease, which can answer complex queries related to heart disease diagnosis and assist medical practitioners in taking decisions; they concluded that the Naïve Bayes system is the most effective model for predicting the disease. M. W. Huang et al. (2017) stated that, among various statistical and machine learning techniques, SVM is one of the best for disease prediction. The prediction performance of SVM and SVM ensemble models was assessed on small‐ and large‐scale datasets, and it was concluded that a linear kernel‐based SVM ensemble with bagging performs well on small‐scale datasets and an RBF kernel‐based SVM ensemble with boosting performs better on large‐scale datasets. Hamzenejad et al. (2020) used the k‐Nearest Neighbour approach to diagnose and classify brain disease and introduced a new robust algorithm. Dolatabadi et al. (2017) presented an optimized SVM‐based automated diagnosis of coronary artery disease; the proposed model provides 99.2% accuracy, 98.43% sensitivity and 100% specificity. Z. Y. Huang et al. (2020) presented weighted Linear Regression‐based prediction of the morbidity of chronic obstructive pulmonary disease.
The efficiency of the model was measured by the Mean Absolute Percentage Error (MAPE); the Linear Regression experiments yielded a minimum prediction error of 9.03. V. K. Mishra et al. (2019) used Linear Regression for dengue disease forecasting and achieved a mean square error of 19.81, the lowest among the machine learning techniques compared, which included Neural Network, Support Vector Machine, Random Forest, Boosted Tree and XGBoost.
STATISTICAL ANALYSIS OF COVID‐19 FACTS IN THE WORLD
Coronaviruses are a large family of viruses that can affect animals or humans. In humans, Coronaviruses affect the respiratory system, causing illness that ranges from the common cold to severe diseases such as Middle East Respiratory Syndrome (MERS) and Severe Acute Respiratory Syndrome (SARS). Covid‐19 is a recent outbreak, caused by a recently discovered Coronavirus, that has affected the entire world. This novel disease was unknown before it first surfaced in Wuhan city, China, in December 2019. In this section, we focus on the symptoms of Covid‐19 and how it has affected the entire world in terms of confirmed, recovered, death and active cases. Two real‐time datasets were collected from Kaggle.com. The first dataset contains cumulative counts of worldwide recovered, confirmed and death cases of Covid‐19 from 22 January 2020 to 19 May 2020, and the second dataset stores the global time‐series records of Covid‐19 over the same period. Table 1 lists the symptoms usually found in Covid‐19 patients, ordered from higher to lower frequency (Coronavirus Symptoms information, 2020).
TABLE 1
Symptoms of Covid‐19 pandemic
S. No. | Symptom | Percentage
0 | Fever | 87%
1 | Dry cough | 67%
2 | Fatigue | 38%
3 | Sputum production | 33%
4 | Shortness of breath | 18%
5 | Muscle pain | 14%
6 | Sore throat | 13%
7 | Headache | 13%
8 | Chills | 11%
9 | Nausea or vomiting | 5%
10 | Nasal congestion | 4%
11 | Diarrhoea | 3%
12 | Hemoptysis | 0.9%
13 | Conjunctival congestion | 0.8%
Figure 2 depicts a Covid‐19 symptoms percentage chart, whereas Figure 3 presents the word cloud of Covid‐19 symptoms. According to Table 1, fever is the most common symptom in Covid‐19 patients, and dry cough, fatigue, sputum production and shortness of breath are the other primary symptoms. Muscle pain, sore throat, headache, chills, nausea or vomiting, nasal congestion, diarrhoea, hemoptysis and conjunctival congestion are found less frequently. The word cloud (Figure 3) shows the high‐frequency words present in the Covid‐19 symptoms dataset.
FIGURE 2
Percentage chart of Covid‐19 symptoms
FIGURE 3
Word cloud of Covid‐19 common symptoms
Table 2 lists the Country/Region‐wise record of confirmed, active and death cases from 22 January 2020 to 19 May 2020, arranged in descending order of confirmed cases. As the table shows, the US is the country most affected by the Covid‐19 pandemic, and Russia, Brazil, the UK and Spain complete the top 5. Among the listed countries, Mainland China has the lowest number of active cases and Saudi Arabia has the lowest number of deaths as of 19 May 2020. Figure 4 depicts the active Covid‐19 cases by country from 22 January 2020 to 19 May 2020. Active cases are calculated by subtracting the number of recovered cases and the number of death cases from the total number of confirmed cases, and darker shades represent a higher number of active cases. The colours of the geographical map are classified as >1, >200, >400, >600, >800 and >1000, where >1000 marks the high‐alert countries of the Covid‐19 outbreak.
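The active‐case calculation and the map legend described above can be sketched in Python as follows. This is a minimal illustration and not the authors' code: the names `active_cases` and `alert_band`, the fallback label `<=1` and the sample figures are hypothetical, and the thresholds simply mirror the legend classes given in the text.

```python
def active_cases(confirmed: int, recovered: int, deaths: int) -> int:
    """Active = confirmed - recovered - deaths, as defined in the text."""
    return confirmed - recovered - deaths


# Legend classes of the map (>1, >200, ..., >1000), checked from highest to lowest.
THRESHOLDS = (1000, 800, 600, 400, 200, 1)


def alert_band(active: int) -> str:
    """Return the legend class (for example '>1000') that a count falls into."""
    for threshold in THRESHOLDS:
        if active > threshold:
            return f">{threshold}"
    return "<=1"  # hypothetical label for counts outside every legend class


# Hypothetical figures, for illustration only.
example_active = active_cases(confirmed=1500, recovered=400, deaths=50)
example_band = alert_band(example_active)
```

Checking the thresholds from highest to lowest ensures that each count is assigned to the most severe class it exceeds.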
TABLE 2
Top 20 Covid‐19 affected countries record (confirmed, active and deaths) collected from 22 January 2020 to 19 May 2020
S. No. | Country/Region | Confirmed | Active | Deaths
1 | US | 1,528,568 | 1,147,255 | 91,921
2 | Russia | 299,941 | 220,974 | 2,837
3 | Brazil | 271,885 | 147,108 | 17,983
4 | UK | 250,138 | 213,617 | 35,422
5 | Spain | 232,037 | 204,259 | 27,778
6 | Italy | 226,699 | 65,129 | 32,169
7 | France | 180,933 | 90,230 | 28,025
8 | Germany | 177,778 | 14,016 | 8,081
9 | Turkey | 151,615 | 34,521 | 4,199
10 | Iran | 124,603 | 20,311 | 7,119
11 | India | 106,475 | 60,864 | 3,302
12 | Peru | 99,483 | 60,045 | 2,914
13 | Mainland China | 82,963 | 88 | 4,634
14 | Canada | 80,493 | 34,396 | 6,028
15 | Saudi Arabia | 59,854 | 27,891 | 329
16 | Belgium | 55,791 | 31,996 | 9,108
17 | Mexico | 54,346 | 11,355 | 5,666
18 | Chile | 49,579 | 27,563 | 509
19 | The Netherlands | 44,449 | 38,548 | 5,734
20 | Pakistan | 43,966 | 30,538 | 939
FIGURE 4
Active cases of Covid‐19 in the world
Figure 5 presents the Covid‐19 confirmed, recovered, death and active cases of the entire world; these graphs are drawn from the Covid‐19 time‐series dataset from 22 January 2020 to 19 May 2020. Figure 6 shows the daily increase and decrease in confirmed, recovered and death cases over the same period. Finally, Figure 7 presents graphs of confirmed, recovered and death cases for the Top‐5 Covid‐19 affected countries.
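A daily increase or decrease of the kind plotted in Figure 6 can be derived from a cumulative time series by differencing consecutive days. The sketch below assumes the series is a plain list of cumulative counts; the function name `daily_increase` and the sample values are hypothetical and do not come from the dataset.

```python
from typing import List


def daily_increase(cumulative: List[int]) -> List[int]:
    """Day-over-day change of a cumulative series.

    The first day has no predecessor, so the result is one element shorter
    than the input.
    """
    return [today - yesterday for yesterday, today in zip(cumulative, cumulative[1:])]


# Hypothetical cumulative counts for five consecutive days.
cumulative_confirmed = [100, 130, 170, 165, 220]
changes = daily_increase(cumulative_confirmed)
print(changes)  # [30, 40, -5, 55]
```

A negative value appears whenever the cumulative count drops from one day to the next, for example after a data correction.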
FIGURE 5
Covid‐19 cases in the entire world (a) Represents Coronavirus confirmed cases in the world (b) Represents Coronavirus recovered cases in the world (c) Represents Coronavirus death cases in the world (d) Represents Coronavirus active cases in the world
FIGURE 6
Daily increase in Covid‐19 pandemic cases in the world (a) The daily increase in confirmed cases (b) The daily increase in recovered cases in the world (c) The daily increase in death cases in the world
FIGURE 7
Top‐5 countries (US, Russia, Brazil, UK, Spain) Covid‐19 cases (a) The number of confirmed Covid‐19 cases in top‐5 countries (b) The number of recovered Covid‐19 cases in top‐5 countries (c) The number of deaths Covid‐19 cases in top‐5 countries
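The daily increases plotted in Figure 6 can be derived from a cumulative time series by differencing consecutive days. The following is a minimal Python sketch of that idea, not the authors' code; the function name `daily_increase` and the sample counts are hypothetical.

```python
def daily_increase(cumulative):
    """Return day-over-day differences of a cumulative count series."""
    return [today - yesterday
            for yesterday, today in zip(cumulative, cumulative[1:])]

# Hypothetical cumulative confirmed counts for five consecutive days.
cumulative_confirmed = [100, 130, 180, 180, 260]
increases = daily_increase(cumulative_confirmed)
print(increases)  # [30, 50, 0, 80]
```

The result has one element fewer than the input because the first day has no previous day to compare against.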
MACHINE‐LEARNING PREDICTION ALGORITHMS
This section presents the machine learning algorithms used to predict the worldwide effects and trends of the Covid‐19 outbreak. Naïve Bayes, SVM and Linear Regression are widely used machine learning algorithms that various researchers have applied to disease prediction and diagnosis (Ak et al., 2006; Pattekari and Parveen, 2012; Vijayarani and Dhayanand, 2015; Dulhare, 2018). The literature demonstrates the efficiency of these three techniques in predicting various diseases, which motivated their application to novel Coronavirus prediction. Each algorithm is discussed briefly below.
Naïve Bayes
Naïve Bayes is a simple yet robust prediction algorithm. In machine learning, we are frequently interested in selecting the best hypothesis (h) given the data (d). Naïve Bayes is based on Bayes' Theorem, which provides a way to calculate the probability of a hypothesis from our prior knowledge:

$$P(h \mid d) = \frac{P(d \mid h)\, P(h)}{P(d)}$$

where P(h|d) is the probability of hypothesis h given the data d (the posterior), P(d|h) is the probability of the data d given that hypothesis h is true, P(h) is the prior probability of hypothesis h and P(d) is the prior probability of the data d. The posterior P(h|d) is thus calculated from P(h), P(d) and P(d|h), and predictions for new data can be made using Bayes' Theorem. The mathematics behind Naïve Bayes is quite deep, but the implementation is relatively simple. The probability of class k given the predictor values X is one over Z times the probability of class k, times the product of the probabilities of each x given class k (Naïve Bayes information, 2020):

$$P(C_k \mid X) = \frac{1}{Z}\, P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)$$

where C_k denotes class k, X = (x_1, ..., x_n) is the vector of predictor values and Z is a normalizing constant. Naïve Bayes captures uncertainty about the model through the probabilities of the outcomes, and it can be helpful for solving predictive and diagnostic problems (Medhekar et al., 2013).
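The class posterior described above (prior times the per-feature likelihoods, divided by the normalizer Z) can be illustrated with a small Python sketch. This is an illustration of the formula only, not the model used in the experiments; the class labels, feature values and probabilities are all hypothetical.

```python
def naive_bayes_posterior(priors, likelihoods, observation):
    """Posterior P(class | observation) under the naive independence assumption.

    priors:      {class: P(class)}
    likelihoods: {class: [{value: P(x_i = value | class)} for each feature i]}
    observation: list of feature values, one per feature
    """
    unnormalized = {}
    for cls, prior in priors.items():
        score = prior
        for feature_index, value in enumerate(observation):
            score *= likelihoods[cls][feature_index][value]
        unnormalized[cls] = score
    z = sum(unnormalized.values())  # normalizing constant Z
    return {cls: score / z for cls, score in unnormalized.items()}

# Hypothetical two-class, two-feature example (all numbers are illustrative).
priors = {"rising": 0.6, "falling": 0.4}
likelihoods = {
    "rising":  [{"high": 0.8, "low": 0.2}, {"yes": 0.7, "no": 0.3}],
    "falling": [{"high": 0.3, "low": 0.7}, {"yes": 0.4, "no": 0.6}],
}
posterior = naive_bayes_posterior(priors, likelihoods, ["high", "yes"])
print(posterior)  # rising: 0.336 / 0.384, falling: 0.048 / 0.384
```

Dividing by Z only rescales the scores so that they sum to one; the class with the largest unnormalized score is also the class with the largest posterior.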
Support vector machine (SVM)
SVM is a supervised algorithm that uses a nonlinear mapping to transform the training data into a higher dimension, in which it searches for the linear optimal separating hyperplane (Sonavane et al., 2013). The SVM positions the hyperplane with the help of margins and support vectors. SVM has the advantage that it is less prone to overfitting than other methods and provides a compact description of the learned model (Vijayarani et al., 2015). SVM is based on finding the best hyperplane. Hyperplanes are the decision boundaries in multi‐dimensional space: in two dimensions the boundary is a line, in three dimensions it is a plane, and in higher dimensions it is called a hyperplane. The function of a line can be formulated as:

$$y = ax + b$$

where x and y are treated as features and renamed x_1, x_2, ..., x_n. The equation of the hyperplane is written, in its standard form with weight vector w and bias b, as:

$$w^{T}x + b = 0$$

SVM works on a hypothesis, and the hypothesis function can be defined as:

$$h(x) = \begin{cases} +1 & \text{if } w^{T}x + b \ge 0 \\ -1 & \text{if } w^{T}x + b < 0 \end{cases}$$

The margin of the hyperplane, for support vectors scaled so that |w^T x + b| = 1, is computed as:

$$\text{margin} = \frac{2}{\lVert w \rVert}$$
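To make the hyperplane, hypothesis function and margin concrete, here is a minimal Python sketch under the standard formulation (sign of w·x + b, margin 2/||w|| for a canonically scaled hyperplane). The weight vector, bias and test points are hypothetical, and no training is performed.

```python
import math

def decision_value(w, b, x):
    """Signed value w.x + b for a point x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Hypothesis function: +1 on or above the hyperplane, -1 below it."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def margin_width(w):
    """Margin width 2 / ||w|| for a canonically scaled hyperplane."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane 3*x1 + 4*x2 - 12 = 0.
w, b = [3.0, 4.0], -12.0
print(classify(w, b, [4.0, 3.0]))  # 3*4 + 4*3 - 12 = 12, so +1
print(classify(w, b, [1.0, 1.0]))  # 3 + 4 - 12 = -5, so -1
print(margin_width(w))             # 2 / 5 = 0.4
```

A smaller ||w|| gives a wider margin, which is why maximizing the margin is equivalent to minimizing the norm of the weight vector.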
Linear regression
Linear Regression is a popular predictive technique. It searches for the best set of variables for prediction and then identifies the most suitable variables from that set for predicting the outcome. It is based on the sign and magnitude of the beta estimates; these regression estimates explain the relationship between one dependent variable (y) and one or more independent variables (x). The Linear Regression equation is as follows:

$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$

where y is the dependent variable, x_1, x_2, ..., x_n are the independent variables, b_0 is the intercept, b_1, b_2, ..., b_n are the coefficients and n is the number of independent variables. Linear regression models are accessible and practical for solving prediction problems (Aghdaei et al., 2017). A model with a single input variable is called simple linear regression, and a model with multiple input variables is called multiple regression. Ordinary Least Squares is a common technique for training a linear regression model.
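For the single-variable case, Ordinary Least Squares has a closed-form solution, which the following Python sketch implements. It is an illustration only; the function name `fit_simple_ols` and the day/case values are hypothetical and chosen so that the data lie exactly on a line.

```python
def fit_simple_ols(xs, ys):
    """Fit y = b0 + b1 * x by Ordinary Least Squares (one input variable)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x  # intercept
    return b0, b1

# Hypothetical data lying exactly on y = 10 + 2x.
days = [0, 1, 2, 3, 4]
cases = [10.0, 12.0, 14.0, 16.0, 18.0]
b0, b1 = fit_simple_ols(days, cases)
print(b0, b1)       # 10.0 2.0
print(b0 + b1 * 5)  # prediction for day 5: 20.0
```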
METHODOLOGY AND EXPERIMENTAL PREDICTIONS ANALYSIS
Experiments were conducted in Python using Jupyter Notebook on the cumulative‐count and time‐series datasets of the Covid‐19 pandemic. Our aim is to evaluate and predict future cases of Covid‐19 from the previous trend using machine learning algorithms. To achieve this goal, Naïve Bayes, SVM and Linear Regression, which are among the most potent predictive techniques, were applied and comparatively tested. The framework given in Figure 8 represents the flow and procedure of implementing the prediction model on the Covid‐19 pandemic dataset. The first phase is domain understanding, where the problem is analysed and its objective is discussed. The second phase is data understanding; before working on any problem, it is necessary to understand the structure of the data. The third phase, feature selection, is very important: it decides which features of the data the future predictions are based on and which attributes are directly related to the prediction. Before implementation, the dataset is also pre‐processed to obtain effective results; only then is our real‐time Covid‐19 pandemic dataset ready for use. In the fourth phase, the data is split into a training part and a testing part, where 42% of the data (a test fraction of 0.42) is selected for testing predictions. In the fifth phase, the prediction algorithms Naïve Bayes, SVM and Linear Regression are applied to the Covid‐19 dataset. The sixth and final phase is a comparative study of the algorithms to obtain the predictive results for the worldwide spread of Covid‐19.
FIGURE 8
The procedure of Covid‐19 analytical study using Machine‐Learning techniques
Data collection
The forecasting process starts with data collection. An accurate dataset is essential for trustworthy forecasting results. The actual time‐series dataset of the Covid‐19 outbreak was used to predict the worldwide effects and trends. The datasets were collected from Kaggle.com, a popular website that provides useful datasets. Several datasets were used to perform the Covid‐19 prediction experiments; Table 3 describes them.
TABLE 3
Dataset information
| S. No. | Name | Columns |
| --- | --- | --- |
| 1 | Covid‐19 World Cases Data | Observation Date, Province/State, Country/Region, Last Update, Confirmed, Deaths, Recovered Cases |
| 2 | Symptoms Data | Symptom, Percentage |
| 3 | Confirmed Cases | Province/State, Country/Region, Lat, Long, Dates |
| 4 | Recovered Cases | Province/State, Country/Region, Lat, Long, Dates |
| 5 | Death Cases | Province/State, Country/Region, Lat, Long, Dates |
Feature selection
Feature selection is the selection of appropriate variables from the dataset. It plays a significant role in boosting the performance and accuracy of prediction techniques. Feature selection is a form of dimensionality reduction that helps to extract the needed information from a large dataset and reduces processing time while improving performance. From the Confirmed, Recovered and Death cases datasets, the columns from the fourth to the last were selected; these hold the Covid‐19 case counts from the first date (22 January 2020) to the last date (19 May 2020). From the Covid‐19 world cases dataset, the Country/Region, Confirmed and Deaths columns were selected as key features. Active cases were calculated as:

$$\text{Active} = \text{Confirmed} - \text{Recovered} - \text{Deaths}$$
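The active-case calculation can be sketched in Python as follows, assuming the usual definition (confirmed minus recovered minus deaths) applied day by day to three aligned series. The function name and the sample counts are hypothetical.

```python
def active_cases(confirmed, recovered, deaths):
    """Active cases per day: confirmed minus recovered minus deaths."""
    return [c - r - d for c, r, d in zip(confirmed, recovered, deaths)]

# Hypothetical cumulative counts for three consecutive days.
confirmed = [100, 150, 210]
recovered = [10, 30, 60]
deaths = [2, 5, 9]
print(active_cases(confirmed, recovered, deaths))  # [88, 115, 141]
```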
Model training and testing
The model is trained on the training set; this is known as the learning phase. Once the machine has learned the features and attributes of the data, it is applied to a test set for future predictions. Here, 42% of the dataset was used for testing and 58% for training. A larger testing set gives a more reliable estimate of prediction accuracy than a smaller one. Hyperparameter tuning is required to optimize the performance of AI algorithms, and various hyperparameters were selected for the Naïve Bayes, SVM and Linear Regression algorithms. Table 4 presents the selected hyperparameters of the applied techniques.
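The 58%/42% split can be sketched in Python as below. The sketch assumes a chronological split (earliest days for training, latest days for testing); the text does not state whether the data were shuffled, so this ordering is an assumption, and the function name and placeholder series are hypothetical.

```python
def split_time_series(values, test_fraction=0.42):
    """Split an ordered series into a leading training part and a trailing test part."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    n_test = round(len(values) * test_fraction)
    split_at = len(values) - n_test
    return values[:split_at], values[split_at:]

# 22 January 2020 to 19 May 2020 inclusive covers 119 daily observations.
series = list(range(119))  # placeholder values standing in for daily counts
train, test = split_time_series(series)
print(len(train), len(test))  # 69 50
```

Keeping the test period at the end of the series means the model is always evaluated on days that come after the days it was trained on.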
The outcomes of these predictive algorithms are measured in terms of Mean Absolute Error (MAE) and Mean Squared Error (MSE). The future prediction of Covid‐19 cases around the world is also depicted by graphs of actual versus predicted cases. The absolute error is the difference between the actual and predicted values with the sign ignored, and it is calculated as:

$$AE_i = \lvert y_i - \hat{y}_i \rvert$$

MAE averages the absolute error over all samples of the dataset:

$$MAE = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$$

where AE represents the absolute error, y_i the true values, \hat{y}_i the predicted values and n the number of samples. MAE is a very natural measure for predictions (Willmott & Matsuura, 2005). MSE is calculated by averaging the squares of the errors, that is, the average squared difference between the actual and predicted values. The MSE of an ensemble mean is never larger than the arithmetic mean of the MSEs of the individual simulators (Rougier, 2016). The equation of MSE is as follows:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

MAE and MSE are calculated for all the prediction algorithms applied in this study. They quantify the average difference between the actual and predicted Covid‐19 cases and thereby indicate the effectiveness of each prediction model.
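Both error measures are short to compute directly. The Python sketch below implements them from their definitions; the actual and predicted values are hypothetical and unrelated to the Covid‐19 dataset.

```python
def mean_absolute_error(actual, predicted):
    """Average of |y_i - y_hat_i| over all samples."""
    return sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted)) / len(actual)

def mean_squared_error(actual, predicted):
    """Average of (y_i - y_hat_i)^2 over all samples."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / len(actual)

# Hypothetical values: the errors are 10, 10 and 30.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]
mae = mean_absolute_error(actual, predicted)  # (10 + 10 + 30) / 3
mse = mean_squared_error(actual, predicted)   # (100 + 100 + 900) / 3
print(mae, mse)
```

Because MSE squares each error, the single error of 30 contributes 900 of the 1100 total, which is why MSE penalizes large deviations far more heavily than MAE does.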
Naïve Bayes prediction
The Naïve Bayes prediction algorithm was applied to predict future cases of Covid‐19, and the best parameters were selected for the experiment. Naïve Bayes produces MAE = 488806.7492 and MSE = 400919367451.7439 on the testing set of the Covid‐19 pandemic dataset, the lowest errors among the three techniques. Figure 9 depicts the graph of the test confirmed cases versus the Bayesian predictions.
FIGURE 9
The test confirmed cases vs Bayesian prediction of Covid‐19 in the world
Figure 10 represents the total confirmed cases of Covid‐19 in the world from 22 January 2020 to 19 May 2020 together with the cases predicted by the Naïve Bayes prediction algorithm, where the x‐axis shows the number of cases and the y‐axis shows the date of occurrence of the confirmed cases. The graph shows that Naïve Bayes effectively predicted the confirmed Covid‐19 cases for the world. Table 5 presents the 10‐day forecast of the Covid‐19 pandemic from 20 May 2020 to 29 May 2020 for the entire world by Naïve Bayes.
FIGURE 10
Total confirmed cases vs Bayesian predictions for Covid‐19 in the world
TABLE 5
Covid‐19 pandemic future forecasting of confirmed cases by Bayesian model
| S. No. | Date | Bayesian prediction |
| --- | --- | --- |
| 0 | 20/05/2020 | 6586004 |
| 1 | 21/05/2020 | 6814990 |
| 2 | 22/05/2020 | 7049893 |
| 3 | 23/05/2020 | 7290814 |
| 4 | 24/05/2020 | 7537856 |
| 5 | 25/05/2020 | 7791120 |
| 6 | 26/05/2020 | 8050712 |
| 7 | 27/05/2020 | 8316734 |
| 8 | 28/05/2020 | 8589292 |
| 9 | 29/05/2020 | 8868493 |
Support vector machine prediction
SVM works with hyperplanes and support vectors, where the support vectors are the data points closest to the hyperplane that influence its position and orientation. The main objective of this technique is to find the hyperplane with the maximum margin. We applied this technique to the Covid‐19 pandemic dataset with tuned values of the hyperparameters gamma, epsilon, shrinking and degree; tuning these hyperparameters boosted the performance of the prediction technique. The SVM experiment on the Covid‐19 time‐series data gives MAE = 718150.1344 and MSE = 565545811024.1667, both greater than the Naïve Bayes MAE and MSE, which shows that Naïve Bayes produced better predictions than SVM. Figure 11 shows the graph of the test confirmed cases versus the SVM prediction. As the MAE and MSE values are greater, the gap between the test data and the SVM prediction is also larger, which indicates that the SVM prediction technique is less effective. Figure 12 shows the total number of Coronavirus confirmed cases all over the world versus the SVM prediction, and it likewise shows a somewhat larger difference between the actual and predicted values.
FIGURE 11
The test confirmed cases Vs SVM prediction of Covid‐19 in the world
FIGURE 12
Total confirmed cases Vs SVM predictions for Covid‐19 in the world
Table 6 presents the 10‐day forecast of Covid‐19 from 20 May 2020 to 29 May 2020 using the SVM technique.
TABLE 6
Covid‐19 pandemic future forecasting of confirmed cases by SVM model
| S. No. | Date | SVM prediction |
| --- | --- | --- |
| 0 | 20/05/2020 | 4828123 |
| 1 | 21/05/2020 | 4991561 |
| 2 | 22/05/2020 | 5159136 |
| 3 | 23/05/2020 | 5330918 |
| 4 | 24/05/2020 | 5506976 |
| 5 | 25/05/2020 | 5687381 |
| 6 | 26/05/2020 | 5872204 |
| 7 | 27/05/2020 | 6061516 |
| 8 | 28/05/2020 | 6255390 |
| 9 | 29/05/2020 | 6453898 |
Linear regression prediction
The regression technique finds the relation between the input (x) and the output (y); it is a very popular technique for future forecasting. In our experiment, Linear Regression was used to predict future confirmed cases of Covid‐19 from the trend of previously confirmed cases. Linear Regression produces MAE = 648733.0991 and MSE = 913583889578.4996 on the Covid‐19 pandemic dataset: its MAE is lower than that of SVM but higher than that of Naïve Bayes, whereas its MSE is the highest of the three techniques. Figure 13 depicts the test confirmed cases of Covid‐19 versus the Regression predicted cases in the world. Figure 14 shows the graph of the total number of Coronavirus confirmed cases in the world versus the Regression prediction. In the Linear Regression graph, the predicted values are almost the same as the actual values for about 80 percent of the time, after which small differences become visible. In the SVM prediction, by contrast, the values match the actual values for about 60 percent of the time, and for the remaining 40 percent the predicted values differ from the actual confirmed cases.
FIGURE 13
The test confirmed cases Vs Regression prediction of Covid‐19 in the world
FIGURE 14
Total confirmed cases Vs Regression predictions for Covid‐19 in the world
Table 7 presents the 10‐day forecast of Covid‐19 by Linear Regression from 20 May 2020 to 29 May 2020. The Regression model predicts a higher number of confirmed Covid‐19 cases on 29 May 2020 than SVM and Naïve Bayes.
TABLE 7
Covid‐19 pandemic future forecasting of confirmed cases by the Regression model
| S. No. | Date | Regression prediction |
| --- | --- | --- |
| 0 | 20/05/2020 | 7656549 |
| 1 | 21/05/2020 | 7907640 |
| 2 | 22/05/2020 | 8164164 |
| 3 | 23/05/2020 | 8426178 |
| 4 | 24/05/2020 | 8693741 |
| 5 | 25/05/2020 | 8966909 |
| 6 | 26/05/2020 | 9245742 |
| 7 | 27/05/2020 | 9530296 |
| 8 | 28/05/2020 | 9820629 |
| 9 | 29/05/2020 | 10116800 |
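The three 10‐day forecasts can be compared side by side with a short Python sketch. It uses only the first (20 May 2020) and last (29 May 2020) forecast values reported in Tables 5 to 7; the average daily growth rate is a derived quantity introduced here for illustration and is not a metric reported in the study.

```python
# First (20 May 2020) and last (29 May 2020) forecast values from Tables 5-7.
forecasts = {
    "Naive Bayes": (6586004, 8868493),
    "SVM": (4828123, 6453898),
    "Linear Regression": (7656549, 10116800),
}
DAYS_BETWEEN = 9  # 20 May to 29 May spans nine daily steps

def average_daily_growth(first, last, steps):
    """Constant daily growth rate that takes `first` to `last` in `steps` days."""
    return (last / first) ** (1.0 / steps) - 1.0

# Rank the techniques by their 29 May forecast, largest first.
ranking = sorted(forecasts, key=lambda name: forecasts[name][1], reverse=True)
growth = {name: average_daily_growth(first, last, DAYS_BETWEEN)
          for name, (first, last) in forecasts.items()}

print(ranking)
for name in ranking:
    print(f"{name}: {forecasts[name][1]:,} cases on 29 May, "
          f"{growth[name]:.2%} average daily growth")
```

The ranking confirms that Linear Regression gives the highest 29 May forecast, followed by Naïve Bayes and then SVM.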
Comparative discussion
The sections above present the predictive results for Covid‐19 calculated by the various AI‐based techniques. Although all the techniques achieve reasonable accuracy, each has its own benefits and drawbacks. Table 8 shows the calculated MAE and MSE of Naïve Bayes, Linear Regression and SVM.
TABLE 8
The MAE and MSE score of techniques
| S. No. | Technique | MAE | MSE |
| --- | --- | --- | --- |
| 1 | Naïve Bayes | 488806.7492 | 400919367451.7439 |
| 2 | Linear Regression | 648733.0991 | 913583889578.4996 |
| 3 | SVM | 718150.1344 | 565545811024.1667 |
Table 8 shows that Naïve Bayes produced the lowest MAE (488806.7492) and MSE (400919367451.7439) of the three techniques, which indicates the better performance of the Naïve Bayes technique. Naïve Bayes is a simple approach that does not require much training to learn the model. It performs well on both discrete and continuous datasets, with fast prediction speed. The MAE of SVM (718150.1344) is greater than that of Linear Regression, whereas the MSE of SVM (565545811024.1667) is lower than that of Linear Regression. Figures 10, 12 and 14 show the total confirmed cases versus the cases predicted by Naïve Bayes, SVM and Linear Regression, where the Naïve Bayes predictions follow the actual confirmed cases more closely than those of SVM and Linear Regression. It is concluded that Naïve Bayes outperforms the other two techniques in predicting the Covid‐19 outbreak.
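The comparison in Table 8 can be reproduced with a short Python sketch that ranks the techniques by each metric. The MAE and MSE values are copied from Table 8; the RMSE (square root of MSE) is added here only to express the squared error in units of cases and is not reported in the study.

```python
import math

# (MAE, MSE) per technique, copied from Table 8.
scores = {
    "Naive Bayes": (488806.7492, 400919367451.7439),
    "Linear Regression": (648733.0991, 913583889578.4996),
    "SVM": (718150.1344, 565545811024.1667),
}

by_mae = sorted(scores, key=lambda name: scores[name][0])
by_mse = sorted(scores, key=lambda name: scores[name][1])
rmse = {name: math.sqrt(mse) for name, (_, mse) in scores.items()}

print("Ranked by MAE:", by_mae)
print("Ranked by MSE:", by_mse)
for name in by_mse:
    print(f"{name}: RMSE = {rmse[name]:,.0f} cases")
```

Naïve Bayes ranks first on both metrics, while Linear Regression and SVM swap places between the MAE and MSE rankings, which suggests that Linear Regression makes fewer but larger errors than SVM.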
CONCLUSION
Since the outbreak of Covid‐19, researchers and medical organizations around the world have been urged to find alternative prediction methods and rapid screening processes to fight the epidemic. Machine learning and AI are favourable techniques adopted by healthcare organizations. Hence, we implemented the machine learning techniques Naïve Bayes, SVM and Linear Regression on the real‐time dataset of Covid‐19 to predict the future growth and effects of the outbreak. The results show that Naïve Bayes performs better and predicts global Covid‐19 confirmed cases more accurately than Linear Regression and SVM, with the minimum MAE and MSE values. Linear Regression produces a lower MAE than SVM, although its MSE is higher (Table 8). The predicted outcomes of Naïve Bayes are close to the actual confirmed cases of Coronavirus, so the future forecasting of Covid‐19 cases by Naïve Bayes can be considered more trustworthy than that of SVM and Regression.
Further, a meta‐analysis has been presented that shows various perspectives of the novel Coronavirus. The graphs of Section 3 plot the statistics on the major symptoms, active cases and the list of top‐20 Covid‐19 affected countries up to 19 May 2020. Figure 6 shows the daily increase in Covid‐19 confirmed, recovered and death cases from 22 January 2020 to 19 May 2020. The US, Russia, Brazil, the UK and Spain were the five countries most affected by the Covid‐19 outbreak up to 19 May 2020. This paper also reviews previous research on Covid‐19 trend prediction and conveys that machine learning and AI are rapidly gaining popularity in forecasting, screening, drug development and contact tracing. AI is not only useful for treating Covid‐19 patients but also helps governments take appropriate decisions. Although most AI techniques are not yet ready to work in real environments, they remain remarkable tools for tackling the outbreak.