| Literature DB >> 35408133 |
Susel Góngora Alonso, Gonçalo Marques, Deevyankar Agarwal, Isabel De la Torre Díez, Manuel Franco-Martín.
Abstract
New computational methods have emerged from science and technology to support the diagnosis of mental health disorders. Predictive models developed with machine learning algorithms can identify disorders such as schizophrenia and support clinical decision making. This research compares the performance of six machine learning algorithms: Decision Tree, AdaBoost, Random Forest, Naïve Bayes, Support Vector Machine, and k-Nearest Neighbor in the prediction of hospitalized patients with schizophrenia. The dataset used in the study contains a total of 11,884 electronic admission records corresponding to 6933 patients with various mental health disorders; these records belong to the acute units of 11 public hospitals in a region of Spain. Of the total, 5968 records correspond to patients diagnosed with schizophrenia (3002 patients) and 5916 records to patients with other mental health disorders (3931 patients). The results identify Random Forest as the best-performing algorithm, with an accuracy of 72.7%; it also achieves 79.6% AUC, 72.8% precision, 72.7% F1-score, and 72.7% recall. These results suggest that machine learning algorithms can classify hospitalized patients with schizophrenia in this population and support the hospital management of this type of disorder, reducing the costs associated with hospitalization.
Keywords: hospitalization; machine learning algorithms; predictive models; random forest; schizophrenia
Year: 2022 PMID: 35408133 PMCID: PMC9003328 DOI: 10.3390/s22072517
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1Flow diagram of the study. The diagram shows the first phase of pre-processing the database. Subsequently, the machine learning algorithms described are applied to the pre-processed dataset. In the final phase, the performance metrics obtained from the algorithms are compared.
Metrics and ranking of the features.
| Variables | Information Gain | Gain Ratio | Gini | χ² | ReliefF |
|---|---|---|---|---|---|
| Diag_Sec02_Code | 0.047 | 0.023 | 0.032 | 128.578 | 0.012 |
| Diag_Sec03_Code | 0.014 | 0.007 | 0.010 | 0.044 | 0.006 |
| Diag_Sec04_Code | 0.011 | 0.006 | 0.008 | 91.753 | 0.004 |
| Diag_Sec05_Code | 0.016 | 0.010 | 0.011 | 269.514 | 0.003 |
| Diag_Sec06_Code | 0.014 | 0.012 | 0.010 | 331.083 | 0.019 |
| Stays_Days | 0.009 | 0.005 | 0.007 | 128.946 | −0.0003 |
| Age | 0.025 | 0.012 | 0.017 | 310.541 | 0.012 |
| Gender | 0.069 | 0.070 | 0.047 | 623.212 | - |
| Admission_Type | 0.0004 | 0.002 | 0.0003 | 0.238 | −0.002 |
| Proc_Ppal_Code | 0.005 | 0.003 | 0.004 | 49.338 | 0.015 |
| Proc_Sec02_Code | 0.005 | 0.003 | 0.003 | 100.140 | −0.014 |
| Proc_Sec03_Code | 0.004 | 0.004 | 0.003 | 96.206 | −0.005 |
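The ranking above can be approximated with standard feature-scoring utilities. A minimal sketch, assuming scikit-learn and using synthetic stand-ins for the encoded admission features (the real column values are not available here, so the scores will not match the table):

```python
# Sketch: scoring categorical features by information gain (mutual
# information) and chi-squared, as in the ranking table above.
# Data and feature values below are hypothetical.
import numpy as np
from sklearn.feature_selection import mutual_info_classif, chi2

rng = np.random.default_rng(0)
# Toy stand-ins for label-encoded admission features.
X = rng.integers(0, 10, size=(200, 3)).astype(float)
y = rng.integers(0, 2, size=200)  # 1 = schizophrenia record, 0 = other

info_gain = mutual_info_classif(X, y, discrete_features=True, random_state=0)
chi2_scores, _ = chi2(X, y)

ranking = sorted(
    zip(["Diag_Sec02_Code", "Age", "Stays_Days"], info_gain, chi2_scores),
    key=lambda r: r[1], reverse=True)
for name, ig, x2 in ranking:
    print(f"{name}: IG={ig:.3f}, chi2={x2:.2f}")
```

Note that ReliefF is not in scikit-learn; the study's table presumably comes from a toolkit that bundles all five scorers.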
Parameters of the machine learning algorithms used in the study.
| Algorithms | Parameters |
|---|---|
| Random Forest | Number of trees = 10; maximum number of considered features: unlimited; maximum tree depth: unlimited; stop splitting nodes with maximum instances = 5 |
| AdaBoost | Base estimator: tree; number of estimators = 50 |
| Decision Tree | Minimum number of instances in leaves = 2; minimum number of instances in internal nodes = 5; maximum depth = 100 |
| kNN | Number of neighbours = 5; distance metric: Euclidean; weight: uniform |
| Naïve Bayes | fL = 0; usekernel: False; adjust = 0 |
| SVM | C = 1.0; sigma = 0.5; numerical tolerance = 0.001; iteration limit = 100 |
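These hyperparameters can be mapped onto scikit-learn estimators, with caveats: the parameter names in the table (e.g. `fL`, `usekernel`, `adjust` for Naïve Bayes) suggest other toolkits, so the mapping below is an assumption, and scikit-learn's `GaussianNB` has no direct equivalents for the Naïve Bayes options. Treating `sigma` as the RBF kernel width parameter `gamma` is likewise an assumption:

```python
# Sketch: approximate scikit-learn equivalents of the hyperparameters
# listed in the table above. The mapping is an assumption, not the
# study's actual configuration.
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

models = {
    "Random Forest": RandomForestClassifier(
        n_estimators=10,        # number of trees = 10
        max_features=None,      # considered features: unlimited
        max_depth=None,         # tree depth: unlimited
        min_samples_split=5,    # stop splitting nodes with <= 5 instances
        random_state=0),
    "AdaBoost": AdaBoostClassifier(   # default base estimator is a tree
        n_estimators=50, random_state=0),
    "Decision Tree": DecisionTreeClassifier(
        min_samples_leaf=2, min_samples_split=5, max_depth=100),
    "kNN": KNeighborsClassifier(
        n_neighbors=5, metric="euclidean", weights="uniform"),
    "Naive Bayes": GaussianNB(),      # no fL/usekernel/adjust equivalents
    "SVM": SVC(C=1.0, gamma=0.5, tol=1e-3, max_iter=100),
}
```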
Figure 2. Diagram of the Random Forest algorithm. The algorithm generates multiple trees; in the figure, each set represents a tree. Each tree votes for a class, and the final prediction is the class with the most votes.
Analysis of clinical data.
| Variables | With Schizophrenia (N = 5968 Records) | Without Schizophrenia (N = 5916 Records) |
|---|---|---|
| Gender (%) | | |
| Male | 71.0 | 40.6 |
| Female | 29.0 | 59.4 |
| Age, mean (years) | 43 | 49 |
| <18 years | 10 | 36 |
| 18–30 years | 1048 | 737 |
| 31–45 years | 2493 | 1756 |
| 46–60 years | 1624 | 1844 |
| >60 years | 793 | 1543 |
| Days of stay, mean (days) | 17 | 14 |
| Main diagnoses of the predictive variable Diag_Sec02_Code for records with schizophrenia | | |
| Non-compliance with medical treatment | 473 | 130 |
| Tobacco abuse disorders | 353 | 111 |
| Family record of psychiatric disease | 229 | 118 |
| Abuse of continuous cannabis | 200 | 70 |
| Alcohol abuse | 159 | 86 |
| Main diagnoses of the predictive variable Diag_Sec02_Code for records without schizophrenia | | |
| Dysthymic disorder | 19 | 687 |
| Personality disorder | 75 | 265 |
| Neom arterial hypertension | 140 | 177 |
| Personality histrionic disorder | 4 | 167 |
| Psychosis | 40 | 162 |
Performance metrics applying 10-fold stratified cross-validation.
| Algorithms | AUC | Accuracy | Precision | F1-Score | Recall |
|---|---|---|---|---|---|
| Random Forest | 0.796 | 0.727 | 0.728 | 0.727 | 0.727 |
| AdaBoost | 0.765 | 0.708 | 0.708 | 0.708 | 0.708 |
| Decision Tree | 0.682 | 0.682 | 0.682 | 0.681 | 0.681 |
| k-NN | 0.729 | 0.677 | 0.676 | 0.676 | 0.676 |
| Naïve Bayes | 0.729 | 0.670 | 0.671 | 0.669 | 0.670 |
| SVM | 0.641 | 0.657 | 0.657 | 0.657 | 0.657 |
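The evaluation protocol behind this table can be sketched with scikit-learn's cross-validation utilities. A minimal sketch, assuming weighted averaging for precision/F1/recall and using synthetic data in place of the admission records, so the numbers will not reproduce the table:

```python
# Sketch: 10-fold stratified cross-validation reporting AUC, accuracy,
# precision, F1-score, and recall, as in the table above. Synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

X, y = make_classification(n_samples=1000, n_features=12, random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
metrics = ["roc_auc", "accuracy", "precision_weighted",
           "f1_weighted", "recall_weighted"]
scores = cross_validate(
    RandomForestClassifier(n_estimators=10, random_state=0),
    X, y, cv=cv, scoring=metrics)

for name in metrics:
    print(f"{name}: {scores['test_' + name].mean():.3f}")
```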
Figure 3. ROC curve for target = 0 with FP = 500, FN = 500, and target probability = 50.0%. The graph shows the ROC curves created by the false positive and false negative values with 10-fold stratified cross-validation. Each ROC curve is represented by a different color (see legend). The Random Forest algorithm shows the best value of AUC = 0.796 (see Supplementary Materials Table S1) for class 0 (non-schizophrenia records).
Figure 4. ROC curve for target = 1 with FP = 500, FN = 500, and target probability = 50.0%. The graph shows the ROC curves created by the false positive and false negative values with 10-fold stratified cross-validation. Each ROC curve is represented by a different color (see legend). The Random Forest algorithm shows the best value of AUC = 0.796 (see Supplementary Materials Table S2) for class 1 (schizophrenia records).
Comparison of results obtained with other studies.
| Reference | Method | Validation | Dataset | AUC | Accuracy (%) |
|---|---|---|---|---|---|
| [ | Random Forest | Cross-Validation k = 10 | N = 345 patients | 0.67 | 66.00 |
| [ | Random Forest | Cross-Validation k = 10 | N = 86 patients | - | 90.69 |
| [ | SVM | Leave-One-Out Cross-Validation (LOOCV) | N = 68 patients | - | 78.24 |
| [ | Random Forest | Cross-Validation k = 10 | N = 72 patients | 0.68 | 68.60 |
| [ | Random Forest | Cross-Validation | N = 466 patients | - | 85.10 |
| Our study | Random Forest | Cross-Validation k = 10 | N = 6933 patients | 0.79 | 72.74 |