Ioannis Kavakiotis, Olga Tsave, Athanasios Salifoglou, Nicos Maglaveras, Ioannis Vlahavas, Ioanna Chouvarda.
Abstract
The remarkable advances in biotechnology and health sciences have led to a significant production of data, such as high-throughput genetic data and clinical information generated from large Electronic Health Records (EHRs). To this end, the application of machine learning and data mining methods in biosciences is presently, more than ever before, vital and indispensable in efforts to intelligently transform all available information into valuable knowledge. Diabetes mellitus (DM) is defined as a group of metabolic disorders exerting significant pressure on human health worldwide. Extensive research in all aspects of diabetes (diagnosis, etiopathophysiology, therapy, etc.) has led to the generation of huge amounts of data. The aim of the present study is to conduct a systematic review of the applications of machine learning and data mining techniques and tools in the field of diabetes research with respect to a) Prediction and Diagnosis, b) Diabetic Complications, c) Genetic Background and Environment, and d) Health Care and Management, with the first category appearing to be the most popular. A wide range of machine learning algorithms were employed. In general, 85% of those used were supervised learning approaches and 15% were unsupervised ones, more specifically association rules. Support vector machines (SVM) emerge as the most successful and widely used algorithm. Concerning the type of data, clinical datasets were mainly used. The applications in the selected articles demonstrate the usefulness of extracting valuable knowledge, leading to new hypotheses targeting deeper understanding and further investigation in DM.
Keywords: Biomarker(s) identification; Data mining; Diabetes mellitus; Diabetic complications; Disease prediction models; Machine learning
Year: 2017 PMID: 28138367 PMCID: PMC5257026 DOI: 10.1016/j.csbj.2016.12.005
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Fig. 1. The basic steps of the KDD process.
Fig. 2. Literature selection and classification process.
Fig. 3. Articles per year in the collection employed.
Fig. 4. Distribution of articles in scientific journals.
Comparison of different ML algorithms.
| Publication | Type of DM | Type of data | No. of subjects | Compared algorithms | Validation method | Best accuracy |
|---|---|---|---|---|---|---|
| Cai et al. | T2D | Gut microbiota | Dataset A: 344 | Logistic regression (LR), linear discriminant analysis (LDA), naïve Bayes (NB) and support vector machine (SVM) | 10-fold cross-validation | SVM on several different experiments |
| Malik et al. | Both types (hyperglycemia) | Electrochemical measurements of saliva | 175 | Logistic regression (LR), support vector machine (SVM) and artificial neural network (ANN) | 3-fold cross-validation | SVM ACC = 84.09% |
| Farran et al. | T2D | Demographic, anthropometric, vital | 10,632 | Logistic regression (LR), k-nearest neighbors (k-NN), multifactor dimensionality reduction (MDR) and support vector machines (SVM) | 5-fold cross-validation | SVM ACC = 81.3% |
| Mani et al. | T2D | Demographic, clinical lab values | 2280 distributed in three datasets | Gaussian naïve Bayes (NB), logistic regression (LR), k-nearest neighbor (k-NN), CART, random forests (RF) and support vector machine (SVM) | 5-fold cross-validation | RF AUC = 0.803/0.807/0.877 |
| Tapak et al. | Nonspecific | Demographic, anthropometric, diagnostic and clinical laboratory measurements | 6500 | Artificial neural networks (ANN), support vector machines (SVM), fuzzy c-means and random forests (RF) | 10-fold cross-validation | SVM ACC = 0.986, AUC = 0.979 |
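
Every study in the table compares algorithms by k-fold cross-validation: the data are split into k folds, each fold is held out once for testing while the model is fit on the rest, and the held-out scores are averaged. The following is a minimal sketch of that protocol only, not a reproduction of any listed study. It assumes synthetic two-feature data (`make_data` is a hypothetical generator) and uses two simple stand-in classifiers, nearest centroid and 5-nearest-neighbors, rather than the SVM, RF or ANN models the studies evaluated.

```python
import random
from collections import Counter

def make_data(n=200, seed=0):
    """Synthetic two-class data (an assumption; not from any reviewed study)."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        center = 0.0 if label == 0 else 2.0
        features = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        data.append((features, label))
    rng.shuffle(data)
    return data

def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def nearest_centroid(train, x):
    """Predict the class whose feature mean is closest to x."""
    sums, counts = {}, Counter()
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            acc[j] += value
        counts[label] += 1
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
    return min(centroids, key=lambda c: sq_dist(centroids[c], x))

def knn(train, x, k=5):
    """Predict by majority vote among the k nearest training points."""
    nearest = sorted(train, key=lambda item: sq_dist(item[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cross_val_accuracy(data, predict, k=10):
    """Mean held-out accuracy over k folds."""
    scores = []
    for fold in k_fold_indices(len(data), k):
        held_out = set(fold)
        train = [d for i, d in enumerate(data) if i not in held_out]
        test = [data[i] for i in fold]
        correct = sum(predict(train, x) == y for x, y in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)

data = make_data()
results = {
    "nearest centroid": cross_val_accuracy(data, nearest_centroid),
    "5-NN": cross_val_accuracy(data, knn),
}
for name, acc in results.items():
    print(f"{name}: mean 10-fold accuracy = {acc:.3f}")
```

Because both classifiers are scored on the same folds, the comparison is paired, which is the property that makes a "best accuracy" column meaningful within a single study. Scores from different studies in the table are still not directly comparable, since the datasets, the number of folds and the reported metric (accuracy versus AUC) all differ.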