Rayner Alfred, Joe Henry Obit.
Abstract
Machine learning (ML) methods can be leveraged to limit the spread of deadly infectious disease outbreaks (e.g., COVID-19) by predicting and detecting such diseases. Most existing reviews do not discuss the machine learning algorithms, datasets and performance measures used for the various applications in predicting and detecting deadly infectious diseases. In contrast, this paper organizes the literature around two major ways of limiting the spread of deadly disease outbreaks: prediction and detection. The study aims to investigate the state of the art, challenges and future work in leveraging ML methods to detect and predict deadly disease outbreaks according to these two categories. Specifically, it reviews the approaches (e.g., individual and ensemble models), types of datasets, parameters or variables, and performance measures used in previous work. The literature review included all articles from journals and conference proceedings published from 2010 through 2020 in Scopus-indexed databases, using the search terms "Predicting Disease Outbreaks" and/or "Detecting Disease using Machine Learning". The findings focus on commonly used machine learning approaches, challenges and future work to limit the spread of deadly disease outbreaks through prevention and detection.
Keywords: Detection; Disease outbreak; Infectious disease; Machine learning; Prediction
Year: 2021 PMID: 34179541 PMCID: PMC8219638 DOI: 10.1016/j.heliyon.2021.e07371
Source DB: PubMed Journal: Heliyon ISSN: 2405-8440
Figure 1. Five primary stages of the systematic literature review.
Research question.
| ID | Research Question |
|---|---|
| RQ1 | What are the roles of machine learning models in limiting the spread of deadly disease outbreaks? |
| RQ2 | What disease datasets in the literature have been used to build the models? |
| RQ3 | What type of parameters or variables have been used? |
| RQ4 | What type of problems are addressed using these machine learning models? |
| RQ5 | What are the individual models used? |
| RQ5.1 | What are the best performing individual models? |
| RQ6 | What are the evaluation measures and approaches used to assess the performance of the machine learning models? |
| RQ7 | What type of ensemble models are used in the machine learning models? |
| RQ7.1 | Do the ensemble models outperform the individual models? |
Online digital libraries.
| No | Online Digital Libraries | Websites |
|---|---|---|
| 1 | Elsevier | |
| 2 | Springer | |
| 3 | IEEE Xplore | |
| 4 | ACM Digital Library | |
| 5 | Wiley Online Library | |
| 6 | Medline (life sciences and biomedicine) | |
Number of studies screened and reviewed.
| No | Online Digital Libraries | Retrieved | Screened | Reviewed | Average Score | Quality |
|---|---|---|---|---|---|---|
| 1 | Elsevier | 987 | 54 | 18 | 0.889 | Excellent |
| 2 | Springer | 559 | 46 | 6 | 0.817 | Good |
| 3 | IEEE Xplore | 456 | 15 | 7 | 0.771 | Good |
| 4 | ACM | 380 | 13 | 7 | 0.814 | Good |
| 5 | Wiley | 28 | 8 | 1 | 0.800 | Good |
| 6 | Medline (PubMed) | 158 | 25 | 8 | 0.825 | Good |
| | Overall | | | | 0.838 | |
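The overall score of 0.838 in the last row appears consistent with an average of the per-library scores weighted by the number of reviewed studies, rather than a plain mean of the six library scores. The sketch below checks that arithmetic. The weighting scheme is an assumption inferred from the numbers, not a method the source states.

```python
# (reviewed studies, average quality score) per library, copied from the table above.
libraries = {
    "Elsevier": (18, 0.889),
    "Springer": (6, 0.817),
    "IEEE Xplore": (7, 0.771),
    "ACM": (7, 0.814),
    "Wiley": (1, 0.800),
    "Medline (PubMed)": (8, 0.825),
}

total_reviewed = sum(n for n, _ in libraries.values())

# Assumption: the overall score weights each library by its reviewed-study count.
weighted_mean = sum(n * s for n, s in libraries.values()) / total_reviewed

# Plain mean of the six library scores, shown for comparison.
simple_mean = sum(s for _, s in libraries.values()) / len(libraries)

print(total_reviewed, round(weighted_mean, 3), round(simple_mean, 3))
```

The reviewed-study total computed here also matches the sum of the per-year counts reported further down.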
Quality Assessment Question.
| ID | Ten Assessment Questions |
|---|---|
| AQ1 | Does the study define a main research objective or problem related to the spread of deadly disease outbreaks (e.g., prediction, detection, responses)? |
| AQ2 | Does the study specify the relevant disease datasets used? |
| AQ3 | Does the study specify the availability of these datasets (e.g. public datasets, private datasets)? |
| AQ4 | Does the study define the parameters or variables used or learnt by the machine learning algorithms? |
| AQ5 | Does the study define the type of parameters used or learnt by the machine learning algorithms? |
| AQ6 | Does the study specify the type of machine learning models used (e.g. classification, regression, clustering) in solving the problem? |
| AQ7 | Does the study specify the individual models explicitly (e.g., neural network, linear regression)? |
| AQ8 | Does the study specify the evaluation measures (e.g., Accuracy, Precision, Recall, F-Measure, ROC) used to assess the performance of the proposed machine learning approach? |
| AQ9 | Does the study specify the evaluation approaches (e.g., cross-validation, holdout) used to assess the performance of the proposed machine learning approach? |
| AQ10 | Does the study specify the ensemble models (e.g., bagging, boosting) used and compare the performance with individual models? |
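AQ9 names two evaluation approaches, holdout and cross-validation. As a minimal sketch of how the two differ, the following pure-Python helpers produce index splits for each. The function names, the 20% test fraction and the fixed seed are illustrative assumptions, not settings taken from any reviewed study.

```python
import random

def holdout_split(n_samples, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) for a single shuffled holdout split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(round(n_samples * test_fraction)))
    return indices[n_test:], indices[:n_test]

def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

train, test = holdout_split(10, test_fraction=0.2)
folds = list(k_fold_splits(10, k=5))
```

Both helpers shuffle the samples. For time-ordered outbreak data a shuffled split can leak future information into training, so a chronological split is often the safer choice for forecasting tasks.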
Number of studies reviewed based on year (2010 - 2020).
| | 2010 - 2015 | 2016 | 2017 | 2018 | 2019 | 2020 |
|---|---|---|---|---|---|---|
| Studies | 7 | 3 | 2 | 9 | 19 | 7 |
Type of Machine Learning Problems and Related Studies.
| Problems | Roles | Related Studies |
|---|---|---|
| Regression | Predict disease outbreaks | |
| Classification | Detect disease outbreaks | |
Structured Data: Datasets and Parameters Used.
| Databases (Frequency) | Features |
|---|---|
| Epidemiology Data (18) | Number of Disease Outbreak Incidences, Signs and Symptoms of Diseases, Treatment Information, Seasonal Information |
| Spatial Data (4) | GPS Coordinates, Topology, Distance, Area |
| Remotely Sensed Data (2) | Normalized Difference Vegetation Index, Normalized Difference Water Index, Land Surface Temperature |
| Meteorological Data (24) | Temperature, Humidity, Precipitation, Air Pressure, Solar Radiation, Wind Speed |
| Physiological Data (3) | Blood Pressure, Cholesterol, Obesity, Heart Rate, Risk Factor (e.g., Smoking) |
| Demographic Data (6) | Age, Gender, Race, Ethnicity, Marital Status, Income, Education, Occupation, Employment |
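The Normalized Difference Vegetation Index listed among the remotely sensed features is conventionally computed as (NIR - Red) / (NIR + Red) from near-infrared and red reflectance. A minimal sketch follows. The reflectance values are hypothetical, and returning 0.0 when both bands are zero is a choice made here for illustration.

```python
def normalized_difference(band_a, band_b):
    """Return (a - b) / (a + b), or 0.0 when both bands are zero."""
    total = band_a + band_b
    return 0.0 if total == 0 else (band_a - band_b) / total

def ndvi(nir, red):
    """NDVI from near-infrared and red reflectance."""
    return normalized_difference(nir, red)

# Hypothetical reflectance values for a single pixel.
value = ndvi(nir=0.5, red=0.1)
```

The same helper can serve for the Normalized Difference Water Index once the band pair is chosen. More than one band definition of that index is in use, and the source does not say which one the reviewed studies applied.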
Unstructured Data: Datasets and Parameters Used.
| Databases (Frequency) | Features |
|---|---|
| Social Media Data (12) | Posted Text, Post Time, Post Date, Post Geo-Location, Number of Comments, Number of Likes |
| Search Keywords (9) | Keywords Searched, Keywords Volumes, Keywords Trends |
| News Articles (1) | Original News Texts, News Published Date, Symptoms Detected |
Regression: Types of Machine Learning approaches and Individual Models Used.
| Study | Objectives | Models Applied | Best Model |
|---|---|---|---|
| | Predicting the number of new outbreaks of diseases | ARMA(1,1), ARMA(1,0), ARMA(0,1) | ARMA(0,1) (MAE = 1.257) |
| | Incidence prediction of communicable diseases using remote sensing | BPNN | BPNN⁎ (MSE = 0.100) |
| | Predicting dengue outbreak | HNN, ANN, NLR | HNN⁎ (MSE = 0.239) |
| | Prediction of province-level outbreaks of foot-and-mouth disease | ZI | ZI |
| | Forecasting influenza like illness | ARIMA, LASSO, LSTM, FNN, MARS | LSTM⁎ (MAPE = 0.320) |
| | Antibiotic resistance outbreaks prediction | GPR, SVM, | SVM (MAE = 0.100) |
| | Forecasting the endemic infectious diseases | LASSO | LASSO (MAPE = 0.404) |
| | Modeling Dengue vector population using remotely sensed data and machine learning | LR, RR, SVR, MLP, DTR, | MLP, |
| | Predicting influenza outbreaks | ARIMA, SVM, RF, ANN | ANN⁎ (MAE = 0.119) |
| | Predict infectious diseases | XGBoost, LSTM, RR, ARIMA | LSTM⁎ (MAPE = 0.099) |
| | Prediction of Malaria disease outbreak | ARIMA, SARIMA, BPNN, LSTM | LSTM⁎ (RMSE = 0.072) |
| | Time Series Analysis of Dengue Fever | SARIMA | SARIMA(1,2,2) (MAPE = 0.050) |
| | Prediction of avian influenza H5N1 outbreaks | ARIMA, RF | RF (MSE = 0.248) |
| | Predicting new and urgent trends in epidemiological data | RNN, LSTM | LSTM⁎ (RMSE = 0.140) |
| | Predicting the spread of influenza epidemics by analyzing twitter messages | ARX, ARMAX, NARX, DeepMLP, CNN | CNN⁎ (MAE = 0.250) |
| | Prediction of Dengue outbreaks | | |
| | Influenza Trends Prediction | LSTM | LSTM⁎ (RMSE = 0.015) |
| | Forecast of Dengue Cases in China | LSTM-TL, LSTMs, BPNN, GAM, SVR, GBM | LSTM-TL⁎ (RMSE = 0.322) |
| | Predicting Infectious Disease in Korea | OLS, ARIMA, NN, LSTM | LSTM⁎ (RMSE = 0.179) |
| | Forecasting Hepatitis incidence | ARIMA, RNN, ARIMA + RNN | ARIMA + RNN⁎ (MAPE = 0.045) |
| | Prediction of Haemorrhagic fever with renal syndrome in China | ARIMA, RNN, ARIMA + RNN | ARIMA + RNN⁎ (MAPE = 0.178) |
| | Forecasting dengue incidence in Guadeloupe, French West Indies | SARIMA | SARIMA (RMSE = 0.850) |
| | Dengue prediction model based on climate | SARIMA | SARIMA (MSE = 0.839) |
| | Forecasting incidence of hand, foot & mouth disease | ARIMA, BPNN | BPNN⁎ (MAPE = 0.200) |
Models: Autoregressive with Exogenous Inputs (ARX), Autoregressive Moving Average with Exogenous Inputs (ARMAX), Auto Regressive Integrated Moving Average (ARIMA), Autoregressive Moving Average (ARMA), Artificial Neural Network (ANN), Back Propagation Neural Network (BPNN), Convolutional Neural Network (CNN), Decision Tree Regression (DTR), Feedforward Neural Network (FNN), Gradient Boosting Machine (GBM), Gaussian Process Regression (GPR), Hybrid Neural Network (HNN), k-Nearest Neighbour (k-NN), k-Nearest Neighbour Regression (k-NNR), Least Absolute Shrinkage and Selection Operators (LASSO), Linear Regression (LR), Long Short Term Memory (LSTM), Multilayer Perceptron (MLP), Multivariate Adaptive Regression Splines (MARS), Nonlinear Autoregressive Exogenous (NARX), Non-Linear Regression (NLR), Random Forest (RF), Recurrent Neural Network (RNN), Ridge Regression (RR), Seasonal Autoregressive Integrated Moving Average (SARIMA), Support Vector Machine (SVM), Support Vector Regression (SVR), Zero-Inflated (ZI). Note: ⁎ Belongs to the neural network family.
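The regression table reports best models by MAE, MSE, RMSE or MAPE. As a reminder of how these four error measures relate, here is a minimal sketch using their standard definitions. The case counts are hypothetical, and skipping zero-valued actuals in MAPE is a choice made here, not something the source specifies.

```python
import math

def regression_errors(actual, predicted):
    """Return MAE, MSE, RMSE and MAPE (as a fraction) for paired values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    # MAPE is undefined when an actual value is zero, so those points are skipped.
    pct = [abs(e / a) for e, a in zip(errors, actual) if a != 0]
    mape = sum(pct) / len(pct) if pct else float("nan")
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse), "MAPE": mape}

# Hypothetical weekly case counts and forecasts, for illustration only.
scores = regression_errors([100, 120, 80, 90], [110, 115, 70, 95])
```

MAE, MSE and RMSE depend on the scale of the target variable, and the studies use different datasets, so the values in the table are best read within a row rather than compared across rows.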
Classification: Types of Machine Learning approaches and Individual Models Used.
| Study | Objectives | Models Applied | Best Model |
|---|---|---|---|
| | Predicting influenza outbreaks in Iran | SVM, RF, ANN | SVM (MAE = 0.132) |
| | Detecting Disease Outbreaks among Physiological Variables | FL | FL ( |
| | Predicting outbreak of hand-foot-mouth diseases | RR, | LSTM⁎ (ROC = 0.841) |
| | Predicting death and cardiovascular diseases in dialysis patients | LR, | SVC-RBF (ACC = 0.953) |
| | Event detection and Situational Awareness of disease outbreaks | NB, SVM, LSTM | LSTM⁎ ( |
| | Modelling disease outbreak events | CRF | CRF ( |
| | Infection detection using physiological and social data in social environments | | |
| | Detection and prevention of mosquito-borne diseases | NB, RDT, J48, F | F |
| | Detecting the occurrence of Zika | BPNN, GBM, RF | BPNN⁎ (ROC = 0.966) |
| | Influenza Detection and Surveillance | NB, ME, DLM | NB (ACC = 0.700) |
| | Detection on Dengue Diseases | MAA | MAA (ACC = 0.750) |
| | Detection of Meningitis Outbreaks in Nigeria | RF, ANN, | NN⁎ (ACC = 0.951) |
| | Detecting global African swine fever outbreaks | RF | RF (ACC = 0.847) |
| | Detecting disease epidemics using a symptom-based approach | M | M |
Models: Artificial Neural Network (ANN), Back Propagation Neural Network (BPNN), Dynamic Language Model (DLM), Fuzzy k-Nearest Neighbor (FkNN), Fuzzy Logic (FL), Gradient Boosting Machine (GBM), Long Short Term Memory (LSTM), Classification Decision Tree (CART), Conditional Random Field (CRF), J48 classifier (J48), Linear Regression (LR), k-Nearest Neighbour (k-NN), Random Forest (RF), Maximum Entropy (ME), Modified Apriori Algorithm (MAA), Modified k-Nearest Neighbor (MkNN), Naive Bayes (NB), Random Decision Tree (RDT), Ridge Regression (RR), Support Vector Classifier RBF kernel (SVC-RBF), Support Vector Machine (SVM). Note: ⁎ Belongs to the neural network family.
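The classification studies are mostly compared by accuracy (ACC), and the quality assessment questions also mention precision, recall and F-measure. A minimal sketch of these measures for a binary outbreak label is shown below, using their standard definitions. The labels are hypothetical, and the ROC-based score reported by some studies is not computed here.

```python
def classification_scores(actual, predicted, positive=1):
    """Return accuracy, precision, recall and F-measure for binary labels."""
    pairs = list(zip(actual, predicted))
    if not pairs:
        raise ValueError("inputs must be non-empty")
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"ACC": (tp + tn) / len(pairs), "Precision": precision,
            "Recall": recall, "F-Measure": f_measure}

# Hypothetical labels: 1 = outbreak week, 0 = no outbreak.
scores = classification_scores([1, 0, 1, 1, 0, 0, 1, 0],
                               [1, 0, 0, 1, 0, 1, 1, 0])
```

Outbreak weeks are usually rare, so accuracy alone can look high for a model that never predicts an outbreak. Precision and recall make that failure visible.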
Ensemble Methods Used for Regression Problems.
| Study | Objectives | Models Applied | Best Model |
|---|---|---|---|
| | Forecasting influenza activity | SAAIM, LSTM, LASSO | SAAIM (MAPE = 0.104) |
| | Predicting Influenza-like-illness (ILI) using multiple open data sources | AR, VAR, GPR, RNN, RNN-CNN, CNN-RNN-ResNet | CNN-RNN-ResNet (RMSE = 0.259) |
| | Prediction of Malaria disease outbreak | ARIMA, SARIMA, BPNN, LSTM, ARIMA + SARIMA + BPNN + LSTM | ARIMA + SARIMA + BPNN + LSTM (RMSE = 0.068) |
| | Prediction of dengue outbreak | EPRA, LASSO, RR, ENet | EPRA (MAE = 1.069) |
| | Forecasting Ebola disease epidemic | GGM, GLM, GGM + GLM | GGM + GLM (RMSE = 0.374) |
| | Forecasting respiratory syncytial virus outbreaks | Superensemble | Superensemble (MAE = 0.1011) |
| | Forecasting seasonal influenza epidemic | XGBoost, LASSO, SAAIM | SAAIM (RMSE = 0.374) |
Models: Autoregression (AR), Auto Regressive Integrated Moving Average (ARIMA), Back Propagation Neural Network (BPNN), Convolutional Neural Network (CNN), Elastic Net (ENet), Ensemble Penalized Regression Algorithm (EPRA), Generalized-Growth Model (GGM), Generalized Logistic Model (GLM), Gaussian Process Regression (GPR), Long Short Term Memory (LSTM), Residual Neural Network (ResNet), Seasonal Autoregressive Integrated Moving Average (SARIMA), SARIMA + XGBoost (SAAIM), Least Absolute Shrinkage and Selection Operators (LASSO), Recurrent Neural Network (RNN), Ridge Regression (RR), Vector Autoregression (VAR).
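Several rows in the ensemble regression table combine individual forecasters, for example ARIMA + SARIMA + BPNN + LSTM. The source does not say how each study combines its members, so the sketch below shows only one simple option: a weighted average of per-model forecasts. The model names, forecast values and weights are hypothetical.

```python
def average_forecasts(forecasts, weights=None):
    """Combine per-model forecast lists into one by (weighted) averaging."""
    names = list(forecasts)
    horizon = len(forecasts[names[0]])
    if any(len(forecasts[m]) != horizon for m in names):
        raise ValueError("all models must forecast the same horizon")
    if weights is None:
        weights = {m: 1.0 for m in names}  # equal weights by default
    total = sum(weights[m] for m in names)
    return [sum(weights[m] * forecasts[m][t] for m in names) / total
            for t in range(horizon)]

# Hypothetical three-step forecasts from two already-fitted models.
forecasts = {"ARIMA": [10.0, 12.0, 14.0], "LSTM": [14.0, 12.0, 10.0]}
combined = average_forecasts(forecasts)
weighted = average_forecasts(forecasts, weights={"ARIMA": 1.0, "LSTM": 3.0})
```

In practice the weights would typically be chosen on held-out data, for instance in proportion to each member's inverse validation error.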
Ensemble Methods Used for Classification Problems.
| Study | Objectives | Models Applied | Best Model |
|---|---|---|---|
| | Detecting and Classifying diseases | RKRE, SKRE, KG_ResNet | RKRE (ACC = 0.886) |
| | Predicting Disease Risk | DPMM, COOC, CBC, eDPMM, eCOOC, eCBC | eCBC (ACC = 0.765) |
| | Classification of risk areas using an ensembled bootstrap-aggregated | Ensemble DTs with bootstrap aggregating | eDT (ROC = 0.91) |
Models: Residual Neural Network (ResNet), ResNet + KG_ResNet (RKRE), Knowledge Graph + Residual Neural Network (KG_ResNet), SVM + KG_ResNet (SKRE), Dirichlet Process Mixture Model (DPMM), DPMM trained on disease occurrence (COOC), Co-occurrence Based Clustering (CBC), Ensemble Dirichlet Process Mixture Model (eDPMM), Ensemble DPMM trained on disease occurrence (eCOOC), Ensemble Co-occurrence Based Clustering (eCBC), Ensemble Decision Tree (eDT).
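One row above uses decision trees with bootstrap aggregating (bagging). As a minimal sketch of that general technique, the code below trains one-split decision stumps on bootstrap resamples and predicts by majority vote. It is not the cited study's implementation: the single feature, the data and all names are hypothetical.

```python
import random
from collections import Counter

def fit_stump(points):
    """Pick the threshold on a 1-D feature that misclassifies the fewest points."""
    best = None
    for threshold in sorted({x for x, _ in points}):
        for above in (0, 1):  # label predicted when x >= threshold
            errors = sum(1 for x, y in points
                         if (above if x >= threshold else 1 - above) != y)
            if best is None or errors < best[0]:
                best = (errors, threshold, above)
    _, threshold, above = best
    return lambda x: above if x >= threshold else 1 - above

def bagged_stumps(points, n_models=25, seed=0):
    """Train stumps on bootstrap resamples and predict by majority vote."""
    rng = random.Random(seed)
    # Each model sees a resample of the data drawn with replacement.
    models = [fit_stump([rng.choice(points) for _ in points])
              for _ in range(n_models)]

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict

# Hypothetical data: (risk score, label) with label 1 = high-risk area.
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0),
        (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]
predict = bagged_stumps(data)
```

An odd number of models avoids tied votes for a binary label. Full decision trees over several features follow the same resample-and-vote pattern.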
Diseases, Database Sources and Studies.
| Diseases | Database Sources or Parameters |
|---|---|
| Dengue | Meteorological Data |
| | Epidemiology Data |
| | Demographic Data |
| | Social Media Data |
| | Remotely Sensed Data |
| | Spatial Data |
| Zika | Epidemiology Data |
| | Meteorological Data |
| | Demographic Data |
| HFMD | Meteorological Data |
| | Spatial Data |
| | Search Keywords |
| ILI | Social Media Data |
| | Meteorological Data |
| | Search Keywords |
| | Epidemiology Data |
| | Spatial Data |
| Others | Epidemiology Data |
| | Demographic Data |
| | Meteorological Data |
| | Spatial Data & Remotely Sensed Data |
| | Social Media Data |
| | News Articles |
| | Search Keywords |
⁎Dependent variable: Number of disease outbreak incidences (EP1) (see Table 7).