Md Saiful Islam, Md Mahmudul Hasan, Xiaoyi Wang, Hayley D Germack, Md Noor-E-Alam.
Abstract
The growing healthcare industry is generating a large volume of useful data on patient demographics, treatment plans, payment, and insurance coverage, attracting the attention of clinicians and scientists alike. In recent years, a number of peer-reviewed articles have addressed different dimensions of data mining application in healthcare. However, the lack of a comprehensive and systematic narrative motivated us to construct a literature review on this topic. In this paper, we present a review of the literature on healthcare analytics using data mining and big data. Following Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, we conducted a database search between 2005 and 2016. Critical elements of the selected studies (healthcare sub-areas, data mining techniques, types of analytics, data, and data sources) were extracted to provide a systematic view of development in this field and possible future directions. We found that the existing literature mostly examines analytics in clinical and administrative decision-making. Use of human-generated data is predominant, considering the wide adoption of Electronic Medical Records in clinical care. However, analytics based on website and social media data has been increasing in recent years. The lack of prescriptive analytics in practice and of integration of domain expert knowledge in the decision-making process underscores the need for future research.
Keywords: big data; data analytics; data mining; healthcare; healthcare informatics; literature review
Year: 2018 PMID: 29882866 PMCID: PMC6023432 DOI: 10.3390/healthcare6020054
Source DB: PubMed Journal: Healthcare (Basel) ISSN: 2227-9032
Characteristics of existing review/conceptual studies on the related topics.
| Paper | Scope | Timeframe Considered | Number of Papers Reviewed |
|---|---|---|---|
| [ | Awareness effect in type 2 diabetes | 2001–2005 | 18 |
| [ | Fraud detection | N/A | N/A |
| [ | Data mining techniques and guidelines for clinical medicine | N/A | N/A |
| [ | Text mining, Ontologies | N/A | N/A |
| [ | Challenges and future direction | N/A | N/A |
| [ | Data mining algorithms and their performance in clinical medicine | 1998–2008 | 84 |
| [ | Clinical medicine | N/A | N/A |
| [ | Skin diseases | N/A | N/A |
| [ | Clinical medicine | N/A | 84 |
| [ | Algorithms, and guideline | N/A | N/A |
| [ | Data mining process and algorithms | N/A | N/A |
| [ | Algorithms for locally frequent disease in healthcare administration, clinical care and research, and training | N/A | N/A |
| [ | Electronic Medical Record (EMR) and Visual analytics | N/A | N/A |
| [ | Big data, Level of data usage | N/A | N/A |
| [ | MapReduce architectural framework based big data analytics | 2007–2014 | 32 |
| [ | Big data analytics and its opportunities | N/A | N/A |
| [ | Big data analytics in image processing, signal processing, and genomics | N/A | N/A |
| [ | Social media data mining to detect Adverse Drug Reaction, Natural language processing (NLP) techniques | 2004–2014 | 39 |
| [ | Text mining, Adverse Drug Reaction detection | N/A | N/A |
| [ | Big data analytics in critical care | N/A | N/A |
| [ | Methodology of big data analytics in healthcare | N/A | N/A |
N/A represents Not Reported.
Keywords for database search.
| 1 | Healthcare, Health care | Data analysis | |
| 2 | Healthcare, Health care, Cancer 2, Disease, Genomics | Data mining, Big data |
1 A logical operator used between the keywords during database search. 2 Cancer was listed independently because other dominant associations have the word “disease” associated with them (i.e., heart disease, skin disease, mental disease etc.).
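
The keyword table above pairs a set of healthcare terms with a set of analytics terms, joined by a logical operator. As a rough illustration of how such pairs could be expanded into individual search strings, here is a minimal Python sketch. The operator `AND`, the quoting style, and the names `KEYWORD_SETS` and `build_queries` are assumptions made for this example; the source does not state which operator or query syntax was used.

```python
from itertools import product

# Keyword groups copied from the two table rows above. Treating the first
# group as the left side and the second as the right side of the operator
# is an assumption based on the table layout.
KEYWORD_SETS = [
    (["Healthcare", "Health care"], ["Data analysis"]),
    (["Healthcare", "Health care", "Cancer", "Disease", "Genomics"],
     ["Data mining", "Big data"]),
]


def build_queries(keyword_sets, operator="AND"):
    """Return one quoted query string per (left, right) keyword pair.

    `operator` is hypothetical: the source only says that a logical
    operator was placed between the keywords.
    """
    queries = []
    for left_terms, right_terms in keyword_sets:
        for left, right in product(left_terms, right_terms):
            queries.append(f'"{left}" {operator} "{right}"')
    return queries


if __name__ == "__main__":
    for query in build_queries(KEYWORD_SETS):
        print(query)
```

Each row is expanded independently, so terms from the first row are never combined with terms from the second.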
Figure 1. Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow chart [28] illustrating the literature search process.
Figure 2. Three stages of effective literature review process, adapted from Levy and Ellis [31].
Figure 3. Classification scheme of the literature.
Operational definition of the classes.
| Class | Operational Definition * |
|---|---|
| Analytics | Knowledge discovery by analyzing, interpreting, and communicating data |
| 3A. Types of Analytics | Data Interpretation and Communication method |
| Descriptive | Exploration and discovery of information in the dataset [ |
| Predictive | Prediction of upcoming events based on historical data [ |
| Prescriptive | Utilization of scenarios to provide decision support [ |
| 3B. Types of Data | Type or nature of data used in the study |
| Web/social media data (WS) | Data extracted from websites, blogs, social media like Facebook, Twitter, LinkedIn [ |
| Sensor data (SD) | Readings from medical devices and sensors [ |
| Biometric data (BM) | “Finger prints, genetics, handwriting, retinal scans, X-ray and other medical images, blood pressure, pulse and pulse-oximetry readings, and other similar types of data” [ |
| Big transaction data (BT) | Healthcare bills, insurance claims and transactions [ |
| Human generated data (HG) | Semi-structured and unstructured documents like prescriptions, Electronic Medical Records (EMR), notes and emails [ |
| 3C. Data mining techniques | Techniques applied to extract and communicate information from the dataset |
| Regression | Relationship estimation between variables |
| Association | Finding relations between variables |
| Classification | Mapping to a predefined class based on shared characteristics |
| Clustering | Identification of groups and categories in data |
| Anomaly detection | Detection of out-of-pattern events or incidents |
| Data warehousing | A large storage of data to facilitate decision-making |
| Sequential pattern mining | Identification of statistically significant patterns in a sequence of data |
| 3D. Application Area | Different areas in healthcare where data mining is applied for knowledge discovery and/or decision support |
| Clinical decision support | Analytics applied to analyze, extract and communicate information about diseases and risk for clinical use |
| Healthcare administration | Application of analytics to improve quality of care, reduce the cost of care and improve overall system dynamics |
| Privacy and fraud detection | Privacy: Protection of patient identity in the dataset; Fraud detection: Deceptive and unauthorized activity detection |
| Mental health | Analytical decision support for psychiatric patients or patients with mental disorders |
| Public health | Analysis of problems which affect a mass population, a region, or a country |
| Pharmacovigilance | Post market monitoring of Adverse Drug Reaction (ADR) |
| 3E. Theoretical study | Discusses impact, challenges, and future of data mining and big data analytics in healthcare |
* Most of the definitions listed in this table are well established in the literature and well known. Therefore, we did not use any specific reference. However, for some classes, specifically for types of analytics and data, varying definitions are available in the literature. We cited the sources of those definitions.
Figure 4. Visualization of high-frequency keywords of the reviewed papers. The white circles symbolize the articles and the blue circles represent keywords. Keywords that occurred only once are eliminated, as are the corresponding articles. The size of the blue circles and the texts represents how often that keyword is found. The size of the white circles is proportional to the number of keywords used in that article. The links represent the connections between the keywords and the articles. For example, if a blue circle has three links (e.g., Decision-Making), that keyword was used in three articles. The diagram is created with the open source software Gephi [34].
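
The filtering and sizing rules in the Figure 4 caption can be sketched in a few lines of plain Python. This is only an illustration of the described logic, not the authors' Gephi workflow. The function name `keyword_network` and the sample data are hypothetical, and sizing articles by the number of keywords that survive the filter is an assumption, since the caption does not say whether dropped keywords still count.

```python
from collections import Counter


def keyword_network(article_keywords):
    """Build the filtered article-keyword links described in the caption.

    article_keywords maps an article id to the list of its keywords.
    Returns (links, keyword_size, article_size).
    """
    # Count in how many articles each keyword appears.
    counts = Counter(
        kw for kws in article_keywords.values() for kw in set(kws)
    )
    # Keep only keywords that occur more than once.
    kept = {kw for kw, n in counts.items() if n > 1}
    # One link per (article, kept keyword). Articles whose keywords were
    # all dropped end up with no links and so disappear from the network.
    links = [
        (article, kw)
        for article, kws in article_keywords.items()
        for kw in sorted(set(kws))
        if kw in kept
    ]
    # Node sizes: number of links attached to each node (assumption:
    # article size counts only the keywords that survived the filter).
    keyword_size = Counter(kw for _, kw in links)
    article_size = Counter(article for article, _ in links)
    return links, keyword_size, article_size


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the reviewed papers.
    sample = {
        "A1": ["Decision-Making", "Big data"],
        "A2": ["Decision-Making", "Data mining"],
        "A3": ["Decision-Making", "Big data"],
        "A4": ["Genomics"],
    }
    links, keyword_size, article_size = keyword_network(sample)
    print(keyword_size.most_common())
    print(sorted(article_size.items()))
```

In the sample, "Decision-Making" ends up with three links, matching the caption's example of a keyword used in three articles.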
Figure 5. Distribution of publication by year (117 articles).
Top 10 journals on application of data mining in healthcare.
| Journal | Number of Articles |
|---|---|
| Expert Systems with Applications | 7 |
| IEEE Transactions on Information Technology in Biomedicine | 6 |
| Journal of Medical Internet Research | 5 |
| Journal of Medical Systems | 4 |
| Journal of the American Medical Informatics Association | 4 |
| Health Affairs | 4 |
| Journal of Biomedical Informatics | 4 |
| Healthcare Informatics Research | 3 |
| Journal of Digital Imaging | 3 |
| PLoS ONE | 3 |
Figure 6. Types of analytics used in literature. (a) Percentage of analytics type; (b) Analytics type by application area.
Figure 7. Percentage of data type used (a) and type of data used by application area (b).
Figure 8. Utilization of data mining techniques, (a) by percentage and (b) by application area.
Figure 9. Word cloud [39] with classification algorithms.
Figure 10. Percentage of papers that utilized healthcare analytics by application area (92 articles out of 117).
Topics and data sources of papers using clinical decision-making, organized by major disease category.
| Reference | Major Disease | Topic Investigated | Data Source |
|---|---|---|---|
| [ | Cardiovascular disease (CVD) | Risk factors associated with Coronary heart disease (CHD) | Department of Cardiology, at the Paphos General Hospital in Cyprus |
| [ | | Diagnosis of CHD | Invasive Cardiology Department, University Hospital of Ioannina, Greece |
| [ | | Classification of uncertain and high dimensional heart disease data | UCI machine learning laboratory repository |
| [ | | Risk prediction of Cardiovascular adverse event | U.S. Midwestern healthcare system |
| [ | | Cardiovascular event risk prediction | HMO Research Network Virtual Data Warehouse |
| [ | | Mobile based cardiovascular abnormality detection | MIT BIH ECG database |
| [ | | Management of infants with hypoplastic left heart syndrome | The University of Iowa Hospital and Clinics |
| [ | Diabetes | Identification of pattern in temporal data of diabetic patients | Synthetic and real world data (not specified) |
| [ | | Exploring the examination history of Diabetic patients | National Health Center of Asti Province, Italy |
| [ | | Important factors to identify type 2 diabetes control | The Ulster Hospital, UK |
| [ | | Comparison of classification accuracy of algorithms for diabetes | Iranian national non-communicable diseases risk factors surveillance |
| [ | | Type 2 diabetes risk prediction | Independence Blue Cross Insurance Company |
| [ | | Evaluation of HTCP algorithm in classifying type 2 diabetes patients from non-diabetic patients | Olmsted Medical Center and Mayo Clinic in Rochester, Minnesota, USA |
| [ | | Prediction and risk diagnosis of patients likely to be affected by diabetes | 1991 National Survey of Diabetes data |
| [ | Cancer | Survival prediction of prostate cancer patients | The Surveillance, Epidemiology, and End Results (SEER) Program of the National Cancer Institute, USA |
| [ | | Classification of breast cancer patients with novel algorithm | Wisconsin Breast cancer data set, UCI machine learning laboratory repository |
| [ | | Classification of uncertain and high dimensional breast cancer data | UCI machine learning laboratory repository |
| [ | | Visualization tool for cancer | Taiwan National Health Insurance Database |
| [ | | Lung cancer survival prediction with the help of a predictive outcome calculator | SEER Program of the National Cancer Institute, USA |
| [ | Emergency Care | Classification of chest pain in emergency department | Hospital (unspecified) emergency department EMR |
| [ | | Grouping of emergency patients based on treatment pattern | Melbourne’s teaching metropolitan hospital |
| [ | Intensive care | Mortality rate of ICU patients | University of Kentucky Hospital |
| [ | | Prediction of 30 day mortality of ICU patients | MIMIC-II database |
| [ | Other applications | Treatment plan in respiratory infection disease | Various health centers throughout Malaysia |
| [ | | Pressure ulcer prediction | Cathay General Hospital (06–07), Taiwan |
| [ | | Pressure ulcer risk prediction | Military Nursing Outcomes Database (MilNOD), US |
| [ | | Association of medication, laboratory and problem | Brigham and Women’s Hospital, US |
| [ | | Chronic disease (asthma) attack prediction | Blue Angel 24 h Monitoring System, Tainan; Environmental Protection Administration, Executive Yuan; Central Weather Bureau Tainan, Taiwan |
| [ | | Personalized care, predicting future disease | Not specified |
| [ | | Correlation between disease | Sct. Hans Hospital |
| [ | | Glaucoma prediction using Fundus image | Kasturba Medical college, Manipal, India |
| [ | | Reducing follow-up delay from image analysis | Department of Veterans Affairs health-care facilities |
| [ | | Disease risk prediction in imbalanced data | National Inpatient Sample (NIS) data, available at |
| [ | | Survival prediction of kidney disease patients | University of Iowa Hospital and Clinics |
| [ | | Comparison of surveillance techniques for health care associated infection | University of Alabama at Birmingham Hospital |
| [ | | Parkinson disease prediction based on big data analytics | Big data archive by Parkinson’s Progression Markers Initiative (PPMI) |
| [ | | Hospitalization prediction of Hemodialysis patients | Hemodialysis center in Taiwan |
| [ | | 5 year Morbidity prediction | Northwestern Medical Faculty Foundation (NMFF) |
| [ | | Algorithm development for real-time disease diagnosis and prognosis | Not specified |
Problem analyzed and data sources in healthcare administration.
| Reference | Focusing Area | Problem Analyzed | Data Source |
|---|---|---|---|
| [ | Data warehousing and cloud computing | Developing a platform to analyze the causes of readmission | Emory Hospital, US |
| [ | | Development of a clinical data warehouse and analytical tools for traditional Chinese medicine | Traditional Chinese Medicine hospitals/wards |
| [ | | Cloud and big data analytics based cyber-physical system for patient-centric healthcare applications and services | Not specified |
| [ | | Repository of radiology reports | Not specified |
| [ | | Creation of large data repository and knowledge discovery with unsupervised learning | University of Virginia University Health System |
| [ | | Development of a mobile application to gather, store and provide data for rural healthcare | Not specified |
| [ | Healthcare cost, quality and resource utilization | Treatment error prevention to improve quality and reduce cost | National Taiwan University Hospital |
| [ | | Healthcare cost prediction | US health insurance company |
| [ | | Healthcare resource utilization by lung cancer patients | Medicare beneficiaries for 1999, US |
| [ | | Length of stay prediction of Coronary Artery Disease (CAD) | Rajaei Cardiovascular Medical and Research Center, Tehran, Iran |
| [ | | Methodology for structured development of monitoring systems and a primary HC network resource allocation monitoring model | National Institute of Public Health; Health Care Institute, Celje; Slovenian Social Security Database, and Slovenian Medical Chamber |
| [ | | Assess the ability of regression tree boosting to risk-adjust health care cost predictions | Thomson Medstat’s Commercial Claims and Encounters database |
| [ | | Evidence based recommendation in prescribing drugs | Dalhousie University Medical Faculty |
| [ | | Efficient pathology ordering system | Pathology company in Australia |
| [ | | Identifying people with or without insurance based on demographic and socio-economic factors | Behavioral Risk Factor Surveillance System 2004 Survey Data |
| [ | | Predicting care quality from patient experience | English National Health Service website |
| [ | Patient management | Scheduling of patients | A south-east rural U.S. clinic |
| [ | | Care plan recommendation system | A community hospital in the Mid-West U.S. |
| [ | | Examination of risk factors to predict persistent healthcare frequent attendance | Tampere Health Centre, Finland |
| [ | | Forecasting number of patient visits for administrative tasks | Health care center in Jaen, Spain |
| [ | | Critical factors related to falls | 1000 bed hospital in Taiwan |
| [ | | Verification of structured data and codes in EMR of fall related injuries from unstructured data | Veterans Health Administration database, US |
| [ | Other applications | Relation between medical school training and practice | Centers for Medicare and Medicaid Services (CMS) |
| [ | | Analysis of physician reviews from online platform | Good Doctor Online health community |
| [ | | Evaluation of Key Performance Indicators (KPIs) of hospitals | Greek National Health Systems for the year of 2013 |
| [ | | Post market performance evaluation of medical devices | HCUPNet data (2002–2011) |
| [ | | Feasibility of measuring drug safety alert response from HC professionals’ information seeking behavior | UpToDate, an online medical resource |
| [ | | Influencing factors of home healthcare service outcome | U.S. home and hospice care survey (2000) |
| [ | | Compilation of various data types for tracing and analyzing temporal events and facilitating the use of NoSQL and cloud computing techniques | Taiwan’s National Health Insurance Research Database (NHIRD) |
List of papers in healthcare privacy and fraud detection.
| Reference | Problem Analyzed | Data Source |
|---|---|---|
| [ | Cloud based big data framework to ensure data security | Not specified |
| [ | Weakness in de-identification or anonymization of health data | MedHelp and Mp and Th1 (Medicare social networking sites) |
| [ | Automatic and systematic detection of fraud and abuse | Bureau of National Health Insurance (BNHI) in Taiwan. |
| [ | Novel algorithm to protect data privacy | Hong Kong Red Cross Blood Transfusion Service (BTS) |
List of data mining applications in mental health with data sources.
| Reference | Problem Analyzed | Data Source |
|---|---|---|
| [ | Identification and intervention of developmental delay of children | Yunlin Developmental Delay Assessment Center |
| [ | Personalized treatment for anxiety disorder | Volunteer participants |
| [ | Abnormal behavior detection | Through experiment with human subject |
| [ | Mental health diagnosis and exploration of psychiatrist’s everyday practice | Queensland Schizophrenia Research center |
List of data mining applications in public health with data sources.
| Reference | Problem Analyzed | Data Source |
|---|---|---|
| [ | Designing preventive healthcare programs | World Health Organization (WHO) |
| [ | Predicting the peak of health center visit due to influenza | Military Influenza case data provided by US Armed Forces Health Surveillance Center and Environmental data from US National Climate Data Center |
| [ | Contrast patient and customer loyalty, estimating Customer lifetime value, and identifying the targeted customer | Iranian Public Hospital data extracted from Hospital information system |
| [ | Understanding the information seeking behavior of public and professionals on infectious disease | National electronic Library of Infection and National Resource of Infection Control, Google Trends, and relevant media coverage (LexisNexis). |
| [ | Knowledge extraction for non-expert user through automation of data mining process | Brazilian health ministry |
| [ | Innovative use of data mining and visualization techniques for decision-making | Slovenian national Institute of Public Health |
| [ | Real-time emergency response method using big data and Internet of Things | UCI machine learning repository |
List of data mining applications in pharmacovigilance with data sources.
| Reference | Problem Analyzed | Data Source |
|---|---|---|
| [ | Sentiment and network analysis based on social media data to find ADR signal | Cancer discussion forum websites |
| [ | ADR signal detection from multiple data sources | Food and Drug Administration (FDA) database and publicly available electronic health record (EHR) in US |
| [ | ADR detection from EPR through temporal data analysis | Danish psychiatric hospital |
| [ | ADR (hypersensitivity) signal detection of six anticancer agents | FDA released AERS reports (2004–2009), US |
| [ | ADR caused by multiple drugs | FDA released AERS reports, US |
| [ | ADR due to Statins used in Cardiovascular disease (CVD) and muscular and renal failure treatment | FDA released AERS reports, US |
| [ | Creating a ranked list of Adverse Events (AEs) | EHR from European Union |
| [ | Detecting ADR signals of Rosuvastatin compared to other statin users | Health Insurance Review and Assessment Service claims database (Seoul, Korea) |
| [ | Unexpected and rare ADR detection technique | Medicare Benefits Scheme (MBS) and Queensland Linked Data Set (QLDS) |
Problem analyzed in theoretical studies.
| Sector Highlight | Reference | Problem Analyzed |
|---|---|---|
| Disease Control, Current situation of different diseases (infection, epidemic, cancer, mental health) | [ | Proposed an idea for dynamic clinical decision support |
| | [ | Described current situation of infection control and predicted future challenges in this sector |
| | [ | Described activities taken by national organization to control disease and provide better health care |
| | [ | Reviewed efficient collection and aggregation of big data and proposed an intelligence based learning framework to help prevent cancer |
| Data quality, database framework and uncertainty quantification | [ | Considered the management of uncertainty originating from data mining |
| | [ | Contemplated the quality of the data when collected from multimodal sources |
| | [ | Provided the structure of the database of CancerLinQ that comprises 4 key steps |
| | [ | Described five major problems that need to be tackled in order to have an effective integration of big data analytics and VPH modeling in healthcare |
| | [ | Discussed the issues of data quality in the context of big data health care analytics |
| | [ | Discussed the necessity of proper management and confidentiality of healthcare data along with the benefit of big data analytics |
| Healthcare policy making | [ | Addressed the challenges faced in implementing health care policies and considered the ethical and legal issues of performing predictive analysis on health care big data |
| | [ | Focused on the US federal regulatory pathway by which CancerLinQ will have legislative authority to use the patients’ records and the approach of ASCO toward organizing and supervising the information |
| Patient Privacy | [ | Focused on ensuring patient privacy while collecting data, storing them and using them for analysis aimed to eliminate discrimination in the health care provided to patients |
| | [ | Shed light on ensuring privacy and security while collecting Personal Health care Information (PHI) |
| | [ | Highlighted those strategies appropriate for data mining from physicians’ prescriptions while maintaining the patient’s privacy |
| Personalized health care | [ | Transforming big data into computational models to provide personalized health care |
| | [ | Development of informed decision-making frameworks for person centered health care |
| | [ | Looked into the availability of big data and the role of biomedical informatics in personalized medicine. Also emphasized the ethical concerns related to personalized medicine |
| Others | [ | Finding the aspects of big data that are most relevant to health care |
| | [ | Selecting dynamic simulation modeling approach based on the availability and type of big data |
| | [ | Quantifying performance in the delivery of medical services |
| | [ | Identifying high risk patients to ensure better care, and explored the analytics procedure, algorithms and challenges to implement analytics |
| | [ | Addressed barriers for the exploitation of health data in Europe |
| | [ | Analyzed the opportunity and obstacles in applying predictive analytics based on big data in case of evaluating emergency care |
| | [ | Provided an overview of the uses of the Person-Event Data Environment to perform command surveillance and policy analysis for Army leadership |
| | [ | Development of big data analytics in healthcare and future challenges |