Reza Safdari, Sorayya Rezayi, Soheila Saeedi, Mozhgan Tanhapour, Marsa Gholamzadeh.
Abstract
The main objective of this survey is to review published articles in order to identify the most frequently used data mining (DM) methods and the remaining gaps in knowledge. Because the threat of pandemics has raised public health concerns, researchers have applied data mining techniques to reveal hidden knowledge. The Web of Science, Scopus, and PubMed databases were searched systematically. All retrieved articles were then screened stepwise according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) checklist to select appropriate articles. The results were analyzed and summarized by category. Of the 335 citations retrieved, 50 articles were judged eligible for this scoping review. The most frequently used DM technique was natural language processing (22%), and the most common approach was revealing disease characteristics (22%). The most frequently addressed disease was COVID-19. The studies predominantly applied supervised learning techniques (90%). Among healthcare scopes, infectious disease (36%) was the most frequent, closely followed by epidemiology. The most commonly used software packages were SPSS (22%) and R (20%). The results show that valuable research has been conducted using knowledge discovery methods to understand the unknown dimensions of diseases in pandemics. However, more research is needed on treatment and disease control. © IUPESM and Springer-Verlag GmbH Germany, part of Springer Nature 2021.
Keywords: COVID-19; Data mining; Pandemics; Review
Year: 2021 PMID: 33977022 PMCID: PMC8102070 DOI: 10.1007/s12553-021-00553-7
Source DB: PubMed Journal: Health Technol (Berl) ISSN: 2190-7196
Fig. 1 The PRISMA diagram for the identification, screening, and eligibility of studies
Characteristics of papers based on publication years
| Years | Frequency | Percentage |
|---|---|---|
| 2011 | 2 | 4% |
| 2012 | 1 | 2% |
| 2014 | 1 | 2% |
| 2015 | 1 | 2% |
| 2019 | 1 | 2% |
| 2020 | 44 | 88% |
| Total | 50 | 100% |
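The percentages in this table are each year's share of all included articles. As a minimal sketch (the variable and function names below are illustrative, not taken from the paper), the yearly counts listed in the rows above can be re-tallied to confirm that they sum to the 50 eligible articles reported in the abstract:

```python
# Yearly counts copied from the rows of the table above.
year_counts = {2011: 2, 2012: 1, 2014: 1, 2015: 1, 2019: 1, 2020: 44}

# Total number of included articles implied by the rows.
total = sum(year_counts.values())


def percentage(count, total):
    """Return count as a percentage of total, rounded to one decimal."""
    return round(100 * count / total, 1)


# Share of each publication year among all included articles.
shares = {year: percentage(n, total) for year, n in year_counts.items()}

for year, n in year_counts.items():
    print(f"{year}: {n} ({shares[year]:.0f}%)")
print(f"Total: {total}")
```

Recomputing the shares this way is a quick consistency check on a hand-built frequency table before it is reported.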
Fig. 2 The distribution of papers by their month of publication in 2020
The characteristics of reviewed articles
| Author | Main approaches | Clinical scope | The applied method of data mining | Software (Environment) | Data source |
|---|---|---|---|---|---|
| Abd-Alrazaq A et al. [ | Infoveillance | Social behavior | Text mining | Python | |
| Ahamad MM [ | Disease characteristics | Infectious disease | Decision Tree, Random Forest, Gradient Boosting Machine, SVM | SPSS | Github repository |
| Ren X et al. [ | Treatment | Pharmacology | Association rule mining method, and association knowledge network | R | Traditional Chinese medicine system pharmacology database |
| Sudirman ID et al. [ | Risk factors | Infectious disease | Random forest and AdaBoost algorithm | Python | Kaggle |
| Zhang Y et al. [ | Infoveillance | Psychology | Time series, NLP, and deep learning | Python | Weibo social network |
| Sudirman ID and Nugraha DY [ | Risk factors | Infectious disease | Naive Bayes method | RapidMiner | Ministry of Public Health Thailand |
| Huang C et al. [ | Disease characteristics | Infectious disease | Text mining | Python | Sina Weibo social network |
| Han X et al. [ | Infoveillance | Infectious disease | Time series, Random forest, Spatial Distribution | Python | Sina Weibo social network |
| Qin L et al. [ | Infoveillance | Infectious disease | Regression, Forward selection, subset selection, Elastic net | Personal software | The Baidu index |
| Maram B et al. [ | Disease characteristics | Respiratory medicine | Random forest, Decision tree, SVM, KNN | Python | Kaggle |
| Ibrahim et al. [ | Tracing transmission | Epidemiology | ANN | Not mentioned | CDC |
| Fan Q et al. [ | Risk factors | Cardiology | Logistic regression | SPSS | Wuhan Tongji hospital |
| Martin-Rodriguez F et al. [ | Prevention and management | Virology | Logistic regression | XLSTATO and Excel | Valladolid university |
| Ketu S and Mishra PK [ | Prevention and management | Epidemiology | Support Vector Regression (SVR), Random forest, LR | Not mentioned | WHO |
| Foieni F et al. [ | Disease characteristics | Respiratory medicine | Multivariate regression | SPSS | WHO |
| Ma XX et al. [ | Disease characteristics | Infectious disease | Random forest | R | Hospitals in China |
| Masand VH et al. [ | Virus characteristic | Virology | Genetic algorithm–multi-linear regression | QSARINS | Not mentioned |
| Zhao ZR et al. [ | Patient monitoring and follow-up | Respiratory medicine | Regression model | SPSS | COVID-19 PUI registry |
| Luo Y et al. [ | Disease characteristics | Infectious disease | Logistic Regression model | SPSS | Tongji hospital |
| Ciucurel C and Iconaru EI [ | Disease characteristics | Infectious disease | Cluster analysis, logistic regression | SPSS | Online questionnaire |
| Lei MT et al. [ | Tracing transmission | Epidemiology | CART, Linear regression | SPSS | Macao Meteorological and Geophysical Bureau |
| Alzahrani SI et al. [ | Active case prediction | Epidemiology | Autoregressive Model, Time series | Python | Saudi Ministry of Health |
| Dong YL et al. [ | Patient monitoring and follow-up | Infectious disease | Logistic regression | SPSS | Wuhan union hospital |
| Roland LT et al. [ | Disease characteristics | Respiratory medicine | Logistic regression | SPSS | San Francisco (USF) institutional review board |
| Pinter G et al. [ | Active case prediction | Epidemiology | ANFIS, Time series | R | Statistical reports |
| Cheng FY et al. [ | Patient monitoring and follow-up | Respiratory medicine | Time series, Random forest | Not mentioned | Mount Sinai hospital |
| Zhou YW et al. [ | Early diagnosis | Infectious disease | Logistic regression, Nomograms | R | 47 locations in Sichuan province |
| Yan L et al. [ | Early diagnosis | Infectious disease | XGBOOST classifier, Decision tree | Not mentioned | Tongji hospital |
| Jiang X et al. [ | Early diagnosis | Infectious disease | Predictive analytics and decision tree | Not mentioned | China hospitals |
| Li S et al. [ | Early diagnosis | Psychology | Text mining | Text mind system and SPSS | Weibo posts |
| Ayyoubzadeh SM et al. [ | Infoveillance | Epidemiology | Linear regression and long short-term memory (LSTM) models | Python | Google data |
| Qiang X et al. [ | Active case prediction | Infectious disease | Random forest (RF) method | R | China national genomics data center |
| Moftakhar L et al. [ | Prevention and management | Epidemiology | Statistical model building: the autoregressive integrated moving average (ARIMA) model and time series | R | Iranian Ministry of Health |
| Yongjian Z et al. [ | Prevention and management | Epidemiology | Generalized additive model (GAM) with a Gaussian distribution family | R | National meteorological information center |
| Chintalapudi N et al. [ | Outbreak prediction | Epidemiology | The auto-regressive integrated moving average (ARIMA) time-series analysis | R | Italian health ministry |
| Ghosal S et al. [ | Patient monitoring and follow-up | Epidemiology | Multiple regression, linear regression, and auto-regression technique | Python | WHO |
| Liu Q et al. [ | Disease characteristics | Infectious disease | Logistic regression | SPSS | Union Hospital, Tongji Medical College, Huazhong University of Science and Technology |
| Khan MA et al. [ | Outbreak prediction | Epidemiology | Deep extreme learning machine (DELM): ANN | Matlab | WHO |
| Kargarfard F et al. [ | Virus characteristic | Virology | CBA (classification based on association rule mining), Ripper and Decision tree algorithms | Not mentioned | Influenza research database (IRD) |
| Kargarfard F et al. [ | Outbreak prediction | Virology | Integrated classification and association rule mining algorithm (CBA) | MUSCLE software | Influenza research database (IRD) |
| Kostkova P et al. [ | Outbreak prediction | Public health | Text mining | Not mentioned | |
| Kostoff RN [ | Infoveillance | Informatics | Text mining | Not mentioned | Medical literature |
| Szomszor M et al. [ | Infoveillance | Informatics | Text mining, linked resource analysis | Not mentioned | |
| Mudunuri M et al. [ | Outbreak prediction | Virology | Apriori algorithm | Not mentioned | Not mentioned |
| Neuraz A et al. [ | Treatment | Infectious disease | Text mining, NLP | R | EHR |
| Li D et al. [ | Disease characteristics | Psychology | Text mining | Not mentioned | |
| Sarker A et al. [ | Disease characteristics | Infectious disease | Text mining | Not mentioned | |
| Wahbeh A et al. [ | Infoveillance | Infectious disease | Unsupervised and supervised machine learning techniques and text analysis | Others | |
Fig. 3 Word cloud of the most frequently used words in the reviewed articles
Fig. 4 The distribution of papers based on countries
Frequency of main approaches
| Main objectives | Frequency | Percentage | References |
|---|---|---|---|
| Disease Characteristics | 11 | 22.00% | [ |
| Infoveillance | 8 | 16.00% | [ |
| Outbreak Prediction | 5 | 10.00% | [ |
| Patient monitoring and follow-up | 5 | 10.00% | [ |
| Active case prediction | 4 | 8.00% | [ |
| Early diagnosis | 4 | 8.00% | [ |
| Prevention and Management | 4 | 8.00% | [ |
| Risk factors | 3 | 6.00% | [ |
| Tracing transmission | 2 | 4.00% | [ |
| Treatment | 2 | 4.00% | [ |
| Virus characteristic | 2 | 4.00% | [ |
Frequency of data mining techniques in reviewed studies
| DM techniques | Frequency | Percentage | Studies |
|---|---|---|---|
| NLP techniques | 11 | 22.00% | [ |
| Logistic regression | 10 | 20.00% | [ |
| Time series | 7 | 14.00% | [ |
| Random forest | 7 | 14.00% | [ |
| Regression models | 7 | 12.00% | [ |
| Decision tree | 6 | 12.00% | [ |
| ANN | 5 | 10.00% | [ |
| Naive Bayes | 3 | 6.00% | [ |
| SVM | 2 | 4.00% | [ |
| Association rule mining | 2 | 4.00% | [ |
| Clustering | 2 | 4.00% | [ |
| Apriori algorithm | 1 | 2.00% | [ |
| Genetic algorithm | 1 | 2.00% | [ |
| Fuzzy algorithm | 1 | 2.00% | [ |
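Because one study can apply several techniques, the frequencies in this table sum to more than 50, and each percentage appears to be the frequency divided by the 50 included studies. The sketch below rests on that assumption (the denominator and all names are illustrative, not stated in the table) and flags any row whose reported percentage differs from the recomputed one:

```python
# Assumption: percentages are computed against the 50 included studies.
N_STUDIES = 50

# (technique, frequency, reported percentage) copied from the table above.
techniques = [
    ("NLP techniques", 11, 22.0),
    ("Logistic regression", 10, 20.0),
    ("Time series", 7, 14.0),
    ("Random forest", 7, 14.0),
    ("Regression models", 7, 12.0),
    ("Decision tree", 6, 12.0),
    ("ANN", 5, 10.0),
    ("Naive Bayes", 3, 6.0),
    ("SVM", 2, 4.0),
    ("Association rule mining", 2, 4.0),
    ("Clustering", 2, 4.0),
    ("Apriori algorithm", 1, 2.0),
    ("Genetic algorithm", 1, 2.0),
    ("Fuzzy algorithm", 1, 2.0),
]

# Collect rows where the reported and recomputed percentages disagree.
mismatches = []
for name, freq, reported in techniques:
    expected = 100 * freq / N_STUDIES
    if abs(expected - reported) > 1e-9:
        mismatches.append((name, freq, reported, expected))

for name, freq, reported, expected in mismatches:
    print(f"{name}: frequency {freq} implies {expected:.2f}%, "
          f"table reports {reported:.2f}%")
```

Under this assumption the check flags only the "Regression models" row, where the frequency and the percentage do not agree; the table alone does not show which of the two values is the intended one.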
Fig. 5 Distribution of employed DM techniques regarding main approaches
Fig. 5 The frequency of main health disciplines in reviewed articles