Yamid Fabián Hernández-Julio, Martha Janeth Prieto-Guevara, Wilson Nieto-Bernal, Inés Meriño-Fuentes, Alexander Guerrero-Avendaño.
Abstract
Clinical decision support systems (CDSSs) have been designed, implemented, and validated to help clinicians and practitioners make decisions when diagnosing diseases. Fuzzy inference systems are one kind of CDSS. The objective of this study was to design, implement, and validate a methodology for developing data-driven Mamdani-type fuzzy clinical decision support systems using clusters and pivot tables. To validate the proposed methodology, we applied our algorithms to five public datasets (Wisconsin breast cancer, Coimbra breast cancer, wart treatment with immunotherapy, wart treatment with cryotherapy, and caesarean section) and compared the results with related works from the literature. The Kappa statistics and accuracies were close to 1.0 and 100%, respectively, for each output variable, which is better than some results reported in the literature. The proposed framework could be considered a deep learning technique because it is composed of various processing layers that learn representations of data with multiple levels of abstraction.
Keywords: clusters; deep learning; fuzzy sets; knowledge base; rule base
Year: 2019 PMID: 31075973 PMCID: PMC6628283 DOI: 10.3390/diagnostics9020052
Source DB: PubMed Journal: Diagnostics (Basel) ISSN: 2075-4418
Features in the original Coimbra Breast Cancer dataset.
| Feature | Attribute | Patients | Controls | p-value |
|---|---|---|---|---|
| 1 | Age (years) | 53.0 (23.0) | 65 (33.2) | 0.479 |
| 2 | BMI (kg/m2) | 27 (4.6) | 28.3 (5.4) | 0.202 |
| 3 | Glucose (mg/dL) | 105.6 (26.6) | 88.2 (10.2) | <0.001 |
| 4 | Insulin (µU/mL) | 12.5 (12.3) | 6.9 (4.9) | 0.027 |
| 5 | HOMA | 3.6 (4.6) | 1.6 (1.2) | 0.003 |
| 6 | Leptin (ng/mL) | 26.6 (19.2) | 26.6 (19.3) | 0.949 |
| 7 | Adiponectin (µg/mL) | 10.1 (6.2) | 10.3 (7.6) | 0.767 |
| 8 | Resistin (ng/mL) | 17.3 (12.6) | 11.6 (11.4) | 0.002 |
| 9 | MCP-1 (pg/dL) | 563 (384) | 499.7 (292.2) | 0.504 |
Features in the original immunotherapy and cryotherapy datasets.
| Immunotherapy | |||
|---|---|---|---|
| Feature | Attribute | Type | Values |
| 1 | Sex | Numeric | 41 males and 49 females |
| 2 | Age | Numeric | 15–56 |
| 3 | Time elapsed before treatment (months) | Numeric | 0–12 |
| 4 | Number of warts | Numeric | 1–19 |
| 5 | Type of warts (count) | Numeric | 47 common, 22 plantar, and 21 both |
| 6 | Surface area (mm2) | Numeric | 6–900 |
| 7 | Induration diameter of initial test (mm) | Numeric | 5–70 |
| 8 | Response to treatment | Nominal | Yes or no |
| Cryotherapy | |||
| 1 | Sex | Numeric | 47 males and 43 females |
| 2 | Age | Numeric | 15–67 |
| 3 | Time elapsed before treatment (months) | Numeric | 0–12 |
| 4 | Number of warts | Numeric | 1–12 |
| 5 | Type of warts (count) | Numeric | 54 common, 9 plantar, and 27 both |
| 6 | Surface area (mm2) | Numeric | 4–750 |
| 7 | Response to treatment | Nominal | Yes or no |
Figure 1. Mamdani-type fuzzy model components. Source: Reference [38].
First design steps for the framework for the development of data-driven Mamdani-type CDSS.
| DDMTFCDSS Activity Steps | Activity Description |
|---|---|
| 1. To identify the dataset. | This stage is related to identifying the source of the collected data to make the fuzzy inference system. These data usually belong to experiments that seek to observe the behavior of some dependent variables through the interaction of independent variables. Generally, the first variables are known as output variables and the second are known as input variables. This section describes the context and the adopted methodology to obtain the database that will serve as an input to work with the other framework components. |
| 2. Data Preparation (Crisp inputs). | This step, according to the methodology proposed by Palit and Popovic [ |
| 3. Reviewing existing models. | In this stage, an academic and scientific search of the different works related to the problem is carried out. For this, different indexed databases such as Scopus, Science Direct, Web of Science, Scielo, Google Scholar, ACM, etc. are used. |
Previous steps to the iterative design for the framework for the development of Data-driven Mamdani-type CDSS.
| DDMTFCDSS Activity Steps | Activity Description |
|---|---|
| 4. Evaluating the optimal number of clusters. | This stage aims to find the optimal number of clusters for each input and output variable. It gives a reference (pivot) number from which groups can be added or removed, depending on whether you want to increase or reduce the number of fuzzy sets for each variable. The number of clusters equals the number of fuzzy sets for the selected variable. For input variables, the interaction between their clusters determines the number of rules that the fuzzy system will have. For output variables, the number of clusters is the number of fuzzy sets of that variable, but it does not affect the number of rules. The optimal value therefore gives an initial idea of the maximum number of rules and fuzzy sets needed for each variable. To determine the optimal number of clusters for each variable (inputs and outputs), we applied pivot tables to establish the maximum number of clusters that each one could have. The pivot table produces a unique table (without repeated values). If the resulting number of clusters is greater than 20, it is recommended to take the square root of this value as the optimal number of clusters. The recommended minimum number of clusters is two (2)— |
| 5. Setting a number of clusters (minimum and maximum) according to the previous evaluation. | Once the optimal number of clusters is known, we have an idea of how many fuzzy sets and rules the fuzzy system can have. Based on that optimal number, we can establish a range (minimum and maximum) that contains the result of the previous step. For example, if the previous step yielded five (5) clusters for the first input variable, we could set the range between a minimum of two (2) and a maximum of five (5). This makes it possible to check whether the number of clusters, which directly affects the number of rules, can be reduced or, on the contrary, whether performance improves with a higher number of rules. The principal idea is to optimize the effectiveness of the fuzzy system with a reduced number of rules— |
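
The counting rule in steps 4 and 5 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names `optimal_cluster_count` and `cluster_range` are hypothetical, the threshold of 20 and the minimum of two come from the step descriptions, and rounding to the nearest integer is an assumption inferred from the "Rounded Square root" rows reported for the breast cancer datasets.

```python
import math

def optimal_cluster_count(values, threshold=20, minimum=2):
    """Reference number of clusters (fuzzy sets) for one variable."""
    # A pivot ("unique") table keeps each distinct value once.
    n_unique = len(set(values))
    # Above the threshold, the text recommends taking the square root.
    if n_unique > threshold:
        n_unique = int(math.sqrt(n_unique) + 0.5)  # round half up (assumption)
    return max(minimum, n_unique)

def cluster_range(optimal, minimum=2):
    """Range of cluster counts to explore in step 5, as (min, max)."""
    return (minimum, max(minimum, optimal))

# 116 distinct values, the row count reported for several Coimbra variables.
print(optimal_cluster_count(range(116)))  # 11
# 10 distinct values stay unchanged because 10 <= 20.
print(optimal_cluster_count(range(10)))   # 10
print(cluster_range(5))                   # (2, 5)
```

With 51, 110, and 113 distinct values the sketch returns 7, 10, and 11, which agrees with the rounded square roots listed for the Coimbra dataset.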
Iterative design steps for the framework for the development of Data-driven Mamdani-type CDSS.
| DDMTFCDSS Activity Steps | Activity Description |
|---|---|
| 6. Random permutations | This stage applies random permutations to the selected dataset(s). Shuffling avoids keeping the original order of the inputs and output(s), so that different classes or attribute values can be drawn when the two or three subsets for training, validation, and testing are chosen through random sampling or cross-validation. |
| 7. Cluster analysis: fuzzification process. | At this stage, the main idea is to classify the data values of the respective input and output variables, according to the criteria set out in the previous section. For the individual analyses, it is recommended to use at least one of the two types of algorithms recognized in the academic and scientific field. Non-hierarchical (K-means) and hierarchical clusters (nearest neighbor and the Ward method) can be used to evaluate the performance of each one of the algorithms through the next phase of the methodology— |
| 8. Sampling: Cross-validation datasets. | This stage randomly divides the dataset into two or three subsets. The framework proposes two kinds of data partition. The first is random sampling: the user selects two or three subsets (training, validation, and testing) and chooses the percentage for each one, for example 70:30:0, 70:20:10, 70:15:15, or 70:10:20. See |
| 9. Pivot tables | Knowing the optimal number of clusters for each variable, we can calculate the number of rules that the fuzzy system can have through the interaction between the input variables (rule base). According to Hernández-Julio, Hernández, Guzmán, Nieto-Bernal, Díaz, and Ferraz [ |
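
Steps 6 to 8 can be illustrated with a short, dependency-free Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, the one-dimensional K-means (one of the clustering algorithms the framework names) starts from evenly spread centers, which is my choice and not something the text specifies, the `ages` list is made-up data, and only the random-sampling partition is shown.

```python
import random

def shuffle_rows(rows, seed=0):
    """Step 6: return a randomly permuted copy of the dataset rows."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled

def kmeans_1d(values, k, iterations=50):
    """Step 7: cluster one variable; returns a label in 1..k per value."""
    distinct = sorted(set(values))
    # Assumption: start from centers spread evenly over the distinct values.
    step = (len(distinct) - 1) / max(k - 1, 1)
    centers = [float(distinct[round(i * step)]) for i in range(k)]

    def nearest(v):
        return min(range(k), key=lambda c: abs(v - centers[c]))

    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            groups[nearest(v)].append(v)
        # Move each center to the mean of its members (empty groups stay put).
        centers = [sum(g) / len(g) if g else centers[c]
                   for c, g in enumerate(groups)]
    return [nearest(v) + 1 for v in values]

def split_rows(rows, ratios=(70, 15, 15)):
    """Step 8: split rows into training, validation, and test subsets."""
    n_train = len(rows) * ratios[0] // 100
    n_val = len(rows) * ratios[1] // 100
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])

ages = [15, 17, 18, 33, 35, 36, 52, 55, 56]  # made-up values
print(kmeans_1d(ages, 3))  # [1, 1, 1, 2, 2, 2, 3, 3, 3]

train, val, test = split_rows(shuffle_rows(range(20)), (70, 15, 15))
print(len(train), len(val), len(test))  # 14 3 3
```

Each cluster label stands for one fuzzy set of the variable, so replacing the raw values by their labels gives the cluster datasets that the pivot-table stage works on.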
Pivot tables substages and decision support systems’ implementation steps.
| DDMTFCDSS Activity Steps | Activity Description |
|---|---|
| 9.1 Combining different input variable cluster datasets. | The objective of this stage is to combine the input-variable cluster datasets generated in the previous step in order to find the best performance. To do this, we can use the nchoosek command available in matrix laboratory software— |
| 9.2 Establishing the fuzzy rules. | This stage builds on the previous one. If the operations carried out with the pivot tables find one or several combinations that guarantee good results (no overlapping clusters, or minimal differences between the values), the rule base of the fuzzy system is built. This is done following the recommendations of the previous stage (using the command Unique). The rules will be easily detectable. For an example, refer to the case studies— |
| 10. Elaborating the Decision Support System based on a fuzzy set theory (Inference engine). | This stage refers to the implementation in specific software for the elaboration of fuzzy model systems such as Matlab® [ |
| 11. Evaluating the fuzzy inference system performance (defuzzification and Crisp Outputs). | This stage measures the performance of the system designed and implemented so far. To do this, we use the evaluation functions of each specific program and run the simulations with the observed data and with their mean values or the test data subset ( |
| 12. Communication. | This stage refers to the paper or documentation preparation. In this case, the modeler may show the results through a user manual or an academic and scientific journal article. If the main aim is to publish a journal article, the target population must be researchers or practitioners within the interest domain. |
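
Substage 9.1 mentions the nchoosek command. A rough Python analogue, given here only as a sketch, is `itertools.combinations`. The variable abbreviations are the Wisconsin inputs used elsewhere in this record, the subset size of five mirrors the "Five Variables" configurations reported for that dataset, and the helper name `input_subsets` is hypothetical.

```python
from itertools import combinations
from math import comb

# Input variables of the Wisconsin breast cancer dataset (abbreviations).
inputs = ["CT", "UCSi", "UCSh", "MA", "SECS", "BN", "BC", "NN", "MI"]

def input_subsets(names, size):
    """All ways of choosing `size` input variables, like nchoosek(names, size)."""
    return list(combinations(names, size))

subsets = input_subsets(inputs, 5)
print(len(subsets), comb(len(inputs), 5))  # 126 126
print(subsets[0])  # ('CT', 'UCSi', 'UCSh', 'MA', 'SECS')
```

Each subset would then be evaluated with the pivot-table stage to see which combination of input clusters performs best.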
Optimal clusters number for breast cancer datasets.
| | Inputs | | | | | | | | | Output |
|---|---|---|---|---|---|---|---|---|---|---|
| Wisconsin Breast Cancer Dataset | CT | UCSi | UCSh | MA | SECS | BN * | BC | NN | MI | |
| Rows Number | 10 | 10 | 10 | 10 | 10 | 11 | 10 | 10 | 10 | 2 |
| Rounded Square root | 10 | 10 | 10 | 10 | 10 | 11 | 10 | 10 | 10 | - |
| Coimbra Breast Cancer Dataset | Age | BMI | Glucose | Insulin | HOMA | Leptin | Adiponectin | Resistin | MCP-1 | |
| Rows Number | 51 | 110 | 50 | 113 | 116 | 116 | 115 | 116 | 113 | 2 |
| Rounded Square root | 7 | 10 | 7 | 11 | 11 | 11 | 11 | 11 | 11 | - |
CT = Clump Thickness. UCSi = Uniformity of Cell Size. UCSh = Uniformity of Cell Shape. MA = Marginal Adhesion. SECS = Single Epithelial Cell Size. BN = Bare Nuclei. BC = Bland Chromatin. NN = Normal Nucleoli. MI = Mitoses. BMI = Body Mass Index. HOMA = Homeostasis Model Assessment for insulin resistance. MCP-1 = Monocyte Chemoattractant Protein-1. * indicates missing values, which were replaced by zero.
Example of some values obtained from a unique table for the caesarean dataset.
| Inp_Var 1 | Inp_Var 2 | Inp_Var 3 | Inp_Var 4 | Inp_Var 5 | Out_Var |
|---|---|---|---|---|---|
| 1 | 2 | 3 | 2 | 1 | 2 |
| 2 | 2 | 2 | 3 | 1 | 2 |
| 3 | 4 | 3 | 1 | 2 | 2 |
| 4 | 4 | 1 | 3 | 1 | 1 |
| 4 | 4 | 3 | 2 | 2 | 1 |
| 5 | 4 | 1 | 2 | 2 | 1 |
| 6 | 3 | 2 | 2 | 2 | 2 |
Inp_Var = Input variable. Out_Var = Output variable.
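
The rows of the unique table above can be turned into rules in a few lines of Python. This is a hedged sketch, not the authors' procedure: it assumes each cell is a cluster (fuzzy set) index, the rule wording is illustrative, and the helper names `unique_rows`, `to_rule`, and `conflicting_antecedents` are hypothetical.

```python
# Rows copied from the example table: five input clusters, one output cluster.
rows = [
    (1, 2, 3, 2, 1, 2),
    (2, 2, 2, 3, 1, 2),
    (3, 4, 3, 1, 2, 2),
    (4, 4, 1, 3, 1, 1),
    (4, 4, 3, 2, 2, 1),
    (5, 4, 1, 2, 2, 1),
    (6, 3, 2, 2, 2, 2),
]

def unique_rows(table):
    """Drop repeated rows while keeping first-seen order (a 'unique' table)."""
    seen, result = set(), []
    for row in table:
        if row not in seen:
            seen.add(row)
            result.append(row)
    return result

def to_rule(row):
    """Write one table row as an IF-THEN rule (illustrative wording)."""
    *inputs, output = row
    antecedent = " AND ".join(
        f"Inp_Var {i} is cluster {c}" for i, c in enumerate(inputs, start=1)
    )
    return f"IF {antecedent} THEN Out_Var is cluster {output}"

def conflicting_antecedents(table):
    """Input combinations that map to more than one output cluster."""
    outputs = {}
    for *inputs, output in table:
        outputs.setdefault(tuple(inputs), set()).add(output)
    return [a for a, outs in outputs.items() if len(outs) > 1]

rules = [to_rule(r) for r in unique_rows(rows + rows[:2])]
print(len(rules))                     # 7
print(rules[3])
print(conflicting_antecedents(rows))  # []
```

An empty conflict list means that no combination of input clusters leads to two different outputs, which is one way to read the "no overlapping clusters" condition of substage 9.2.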
Performance metrics obtained with our proposed framework and other classifiers obtained from the literature for WBCD.
| Main Aspects | References | | | | | | | | DDFCDSS (This Work) | |
|---|---|---|---|---|---|---|---|---|---|---|
| | [ | [ | [ | [ | [ | [ | [ | [ | Five Variables (RS) | Five Variables (CV) |
| Num of Variables | 3 | 9 | - | 6 | 9 | 9 | 7 | 3 | 5 | 5 |
| Num of Rules or Hidden neurons/technique | 39/FIS | DBN/4:2 | DNN | DNN | SMO | FCLF and CNN | FRNN | WT and IT2FLS | 192/DDFCDSS | 233/DDFCDSS |
| Performance | | | | | | | | | | |
| Accuracy (%) | 98.57% | 99.68% | 98.62% | 96.2%:96.6% | 72.70% | 98.71% | 99.72% | 97.88% | | |
| Sensitivity | 0.9793 | 1.000 | - | - | - | 0.976 | 1.000 | 0.9850 | | |
| Specificity | 0.9891 | 0.9947 | - | - | - | 0.9943 | 0.9947 | 0.9650 | | |
| F-Measure | 0.9793 | - | - | - | 0.71 | - | 0.9970 | - | | |
| Area under curve | 0.9901 | - | - | - | 0.63 | 0.9816 | 1.000 | 0.9750 | | |
| Kappa statistics | 0.9683 | - | - | - | - | - | 0.9943 | - | | |
FIS: Fuzzy Inference System. DBN: Deep Belief Network. DNN: Deep Neural Network. SMO: Sequential Minimal Optimization. CNN: Convolutional Neural Network. FCLF: Fully Connected Layer First. FRNN: Fuzzy-Rough Nearest Neighbor. WT: Wavelet transformation. IT2FLS: interval type-2 fuzzy logic system. DDFCDSS: Data-driven Fuzzy Decision Support System. RS: Random Sampling. CV: Cross Validation. -: not mentioned in the literature. Bold values indicate the best performance with fewer input variables.
Performance metrics obtained with our proposed framework and other classifiers obtained from the literature for CBCD.
| Main Aspects | Reference [ | | | | | | Reference [ | DDFCDSS (This Work) | |
|---|---|---|---|---|---|---|---|---|---|
| | | | | | | | | CV | RS |
| Number of Variables | V1:V2 | V1:V3 | V1:V4 | V1:V5 | V1:V6 | V1:V9 | V1:V8 | V2, V3, V6, V8, V9 | V1:V9 |
| Number of Rules or Hidden neurons/technique | SVM | | | | | | AdaBoostM1 and MAD | 97 | 81 |
| Performance | | | | | | | | | |
| Accuracy (%) | - | - | - | - | - | - | 91.37 | | |
| Sensitivity | 0.81:0.86 | 0.87:0.92 | 0.82:0.88 | 0.84:0.9 | 0.81:0.86 | 0.75:0.81 | - | | |
| Specificity | 0.7:0.76 | 0.78:0.83 | 0.84:0.9 | 0.81:0.87 | 0.8:0.86 | 0.78:0.84 | - | | |
| F-Measure | - | - | - | - | - | - | 0.914 | | |
| Area under curve | 0.76:0.81 | 0.82:0.86 | 0.87:0.91 | 0.86:0.9 | 0.83:0.88 | 0.81:0.85 | 0.938 | | |
| Kappa statistics | - | - | - | - | - | - | 82.76% | | |
| Precision | - | - | - | - | - | - | 0.919 | | |
| Recall | - | - | - | - | - | - | 0.914 | | |
CBCD: Coimbra Breast Cancer Dataset. V: Variable. SVM: Support Vector Machine. MAD: Mean Absolute Deviation. DDFCDSS: Data-driven fuzzy clinical decision support system. -: not mentioned in the literature. Bold values indicate the best performance.
Performance metrics obtained with our proposed framework and other classifiers obtained from the literature for Wart treatment (Cryotherapy) dataset.
| Main Aspects | References | This Work | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| [ | [ | [ | [ | [ | [ | [ | RS | CV | |||
| Number of Variables | 4 | 3 | 6 * | 6 | 3/Merged dataset | 3 | 2 | 6 | 4 | 4 | |
| Number of Rules or Hidden neurons/technique | ANFIS | NB, C4.5, DT, LR, K-NN | SVM | RF, BGSA + RF | C4.5 + RFFW | DT | J48, J48 + GA | 46/DDFCDSS | 52/DDFCDSS | |
| Performance |
| Accuracy (%) | 80% | 95.40% | 85.46% | 94.81% | 87.22% | 93.33% | 94.4% | 93.3% |
| Sensitivity | 0.820 | 0.976 | 0.474 | - | 0.805 | 0.885 | - | 0.989 |
| Specificity | 0.770 | - | 0.958 | - | 0.908 | 0.980 | - | 0.130 |
| F-Measure | - | 0.933 | - | - | 0.799 | 0.919 | - | 0.989 |
| Area under curve | - | - | - | - | 0.830 | 0.617 | - | 0.988 |
| Kappa statistics | - | - | - | - | - | - | - | 0.977 |
| Precision | - | 0.937 | - | - | - | - | - | 0.989 |
| Recall | - | - | - | - | - | - | - | 0.989 |
ANFIS: Adaptive Neuro-Fuzzy Inference System. NB: Naive Bayes. DT: Decision Tree. LR: Logistic Regression. K-NN: K-Nearest Neighbor. SVM: Support Vector Machine. GA: Genetic Algorithm. C4.5: decision tree algorithm. RF: Random Forest. BGSA: Binary Gravitational Search Algorithm. RFFW: Random Forest Feature Weighting. DDFCDSS: Data-Driven Fuzzy Decision Support System. RS: Random Sampling. CV: Cross Validation. * Both datasets. -: not mentioned in the literature. Bold values indicate the best performance.
Performance metrics obtained with our proposed framework and other classifiers obtained from the literature for wart treatment (Immunotherapy) dataset.
| Main Aspects | References | This Work | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| [ | [ | [ | [ | [ | [ | [ | RS | CV | |||
| Number of Input Variables | 3 | 3 | 6 | 6 | 3/Merged dataset | 3 | 2 | 7 | 3 | 5 | |
| Number of Rules or Hidden Neurons/Technique | ANFIS | NB, C4.5, DT, LR, K-NN | SVM | RF, BGSA + RF | C4.5 + RFFW | DT | J48, J48 + GA | 57/DDFCDSS | 78/DDFCDSS | ||
| Performance |
| Accuracy (%) | 83.33% | 84% | 85.46% | 88.14% | 87.22% | 84.44% | 90% | 96.66% |
| Sensitivity | 0.870 | 0.832 | 0.474 | - | 0.805 | 0.55 | - | 0.967 |
| Specificity | 0.710 | - | 0.958 | - | 0.908 | 0.9143 | - | 0.086 |
| F-Measure | - | 0.851 | - | 0.799 | 0.611 | - | 0.966 |
| Area under curve | - | - | - | 0.830 | 0.707 | - | 0.972 |
| Kappa statistics | - | - | - | - | - | - | 0.898 |
| Precision | - | 0.901 | - | - | - | - | 0.966 |
| Recall | - | - | - | - | - | - | 0.967 |
ANFIS: Adaptive Neuro-Fuzzy Inference System. NB: Naive Bayes. DT: Decision Tree. LR: Logistic Regression. K-NN: K-Nearest Neighbor. SVM: Support Vector Machine. C4.5: decision tree algorithm. RFFW: Random Forest Feature Weighting. GA: Genetic Algorithm. DDFCDSS: Data-driven Fuzzy Decision Support System. RS: Random Sampling. CV: Cross Validation. -: not mentioned in the literature. Bold values indicate the best performance.
Performance metrics obtained with our proposed framework and other classifiers obtained from the literature for a caesarean section dataset.
| Main Aspects | References | | This Work | |
|---|---|---|---|---|
| | [ | [ | RS | CV |
| Number of Input Variables | 5 | 5 | 5 | 5 |
| Number of Rules or Hidden neurons/technique | C4.5 DT/31 and 21 leaf nodes | k-nearest neighbors and Random forest | 74/DDFCDSS | 67/DDFCDSS |
| Performance | | | | |
| Accuracy (%) | 86.25% | 95%:95% | 95% | 93.4% |
| Sensitivity | 0.8630 | 0.95:0.95 | 1.0000 | 0.934 |
| Specificity | 0.1090 | 0.037:0.052 | 0.8947 | 0.934 |
| F-Measure | 0.9460 | - | 0.9545 | 0.943 |
| Area under the curve | - | 0.995:0.994 | 0.9565 | 0.93 |
| Kappa statistics | 0.7281 | 0.8992:0.8977 | 0.8992 | 0.864 |
| Precision | 0.8860 | 0.955:0.950 | 0.9130 | 0.952 |
| Recall | 0.8630 | - | 1.0000 | 0.934 |
DT: Decision tree. DDMTFCDSS: Data-driven Mamdani-Type Fuzzy Clinical Decision Support System. RS: Random Sampling. CV: Cross Validation. -: not mentioned in the literature.
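
The performance measures reported in the tables above (accuracy, sensitivity, specificity, precision, F-measure, and the Kappa statistic) can all be derived from a binary confusion matrix. The sketch below uses the standard textbook definitions, including Cohen's kappa. The function name `binary_metrics` and the counts in the usage example are hypothetical and are not taken from any of the datasets.

```python
def binary_metrics(tp, fn, fp, tn):
    """Standard metrics from a 2x2 confusion matrix (counts, not rates)."""
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)   # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_positive = ((tp + fn) / n) * ((tp + fp) / n)
    p_negative = ((fp + tn) / n) * ((fn + tn) / n)
    expected = p_positive + p_negative
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
        "kappa": kappa,
    }

# Hypothetical counts for 100 test cases.
metrics = binary_metrics(tp=45, fn=5, fp=3, tn=47)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Kappa is a unitless coefficient whose maximum is 1.0 (perfect agreement), whereas accuracy is usually reported as a percentage.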