Ricardo Sánchez-de-Madariaga, Juan Martinez-Romo, José Miguel Cantero Escribano, Lourdes Araujo.
Abstract
BACKGROUND: Association rules are one of the main ways to represent structural patterns underlying raw data. They represent dependencies between sets of observations contained in the data. The associations established by these rules are very useful in the medical domain, for example in the predictive health field. Classic algorithms for association rule mining give rise to huge numbers of candidate rules that must be filtered in order to select those most likely to be true. Most of the techniques proposed for this task are unsupervised; however, the accuracy provided by unsupervised systems is limited. Conversely, resorting to annotated data for training supervised systems is expensive and time-consuming. The purpose of this research is to design a new semi-supervised algorithm that performs like supervised algorithms but uses an affordable amount of training data.
Keywords: Association rules discovery; Machine learning; Medical records; Semi-supervised approach
Year: 2022 PMID: 35073885 PMCID: PMC8785547 DOI: 10.1186/s12911-022-01755-3
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1 Interaction between the dataset and the unsupervised and supervised modules
Fig. 2 Evolution of the p value and the performance (F-measure) of the system depending on the threshold used
Fig. 3 Flow diagram of incremental learning. Rounded rectangles show the beginning and the end of the iterations, rectangles are the rule sets, the broken-line rectangle represents the seed set performance, ovals are processes, and the diamond represents a condition
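The incremental learning loop in Fig. 3 can be sketched as a standard self-training procedure: train on the seed set, label the unlabeled rules the model is confident about, absorb them into the labeled set, and repeat until no confident predictions remain or a maximum number of iterations is reached. The toy one-feature "classifier" below is purely illustrative; the paper's system works on association rules with a Random Forest model, and all names here are hypothetical.

```python
def train(labeled):
    """Toy one-feature 'classifier': the decision threshold is the
    midpoint between the mean positive and mean negative feature.
    Stands in for the paper's Random Forest purely for illustration."""
    pos = [x for x, y in labeled if y]
    neg = [x for x, y in labeled if not y]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def self_train(seed, pool, margin=0.2, max_iters=10):
    """Self-training loop of Fig. 3: train on the labeled set, label
    the pool examples the model is confident about (far from the
    decision threshold), absorb them, and repeat until nothing
    confident is left or max_iters is reached (the diamond)."""
    labeled, pool = list(seed), list(pool)
    for _ in range(max_iters):
        if not pool:
            break                 # everything has been labeled
        t = train(labeled)
        confident = [x for x in pool if abs(x - t) >= margin]
        if not confident:
            break                 # no confident predictions: stop iterating
        labeled += [(x, x >= t) for x in confident]
        pool = [x for x in pool if abs(x - t) < margin]
    return train(labeled)         # final decision threshold
```

The seed set plays the same role as in the paper: a small hand-labeled starting point that the loop grows automatically.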
F-measure of the unsupervised module using different thresholds for the p value, evaluated on a 20% test set
| p value threshold | F-Measure |
|---|---|
| 5E-2 | 0.384 |
| 1E-2 | 0.396 |
| 1E-3 | 0.492 |
| 1E-4 | 0.526 |
| 1E-5 | 0.557 |
| 1E-6 | 0.611 |
| 1E-7 | 0.615 |
| 1E-8 | 0.615 |
| 1E-9 | 0.623 |
| 1E-10 | 0.626 |
| 1E-11 | |
| 1E-12 | 0.623 |
| 1E-13 | 0.619 |
| 1E-14 | 0.611 |
Best results appear in boldface
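The thresholds above filter candidate rules by the statistical significance of the antecedent-consequent association. Assuming a one-sided Fisher's exact test on each rule's 2x2 contingency table (one plausible choice of significance test; the record above does not specify which test is used, and `filter_rules` is a hypothetical helper), the filter can be sketched as:

```python
from math import comb

def fisher_one_sided_p(a, b, c, d):
    """One-sided Fisher's exact p value for the 2x2 table
        [[a, b],   a = transactions with antecedent and consequent,
         [c, d]]   b/c = one side only, d = neither.
    P(X >= a) under the hypergeometric null of no association."""
    n, row1, col1 = a + b + c + d, a + b, a + c
    return sum(
        comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)
        for x in range(a, min(row1, col1) + 1)
    )

def filter_rules(rules, threshold=1e-10):
    """Keep only rules significant at `threshold`; 1E-10 gave the
    best unsupervised F-measure in the table above."""
    return [r for r in rules if fisher_one_sided_p(*r["table"]) <= threshold]
```

A perfectly associated table such as (10, 0, 0, 10) yields a very small p value and survives the filter, while a balanced table such as (5, 5, 5, 5) does not.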
Supervised module: results for the Random Forest algorithm using different training set sizes and the same 20% test set in all cases, reported as F-Measure, AUC-ROC, and AU-PRC
| Train/test % | F-Measure | AUC-ROC | AU-PRC |
|---|---|---|---|
| 5/20 | 0.63 | 0.67 | 0.68 |
| 10/20 | 0.64 | 0.67 | 0.68 |
| 20/20 | 0.66 | 0.68 | 0.69 |
| 30/20 | 0.66 | 0.69 | 0.71 |
| 40/20 | 0.67 | 0.70 | 0.72 |
| 50/20 | 0.66 | 0.71 | 0.70 |
| 60/20 | 0.68 | 0.71 | 0.73 |
| 70/20 | 0.70 | 0.72 | |
| 80/20 | |||
Best results appear in boldface
Results for different classification algorithms on an 80-20% train/test split, reported as F-Measure, AUC-ROC, and AU-PRC
| Algorithm | F-Measure | AUC-ROC | AU-PRC |
|---|---|---|---|
| NaiveBayesMultinomial | 0.59 | 0.66 | 0.68 |
| SimpleLogistic | 0.57 | 0.62 | 0.63 |
| MultilayerPerceptron | 0.65 | 0.66 | 0.66 |
| Logistic | 0.62 | 0.67 | 0.69 |
| VotedPerceptron | 0.61 | 0.60 | 0.60 |
| SVM | 0.63 | 0.60 | 0.59 |
| IBK | 0.66 | 0.63 | 0.61 |
| AdaBoostM1 | 0.58 | 0.65 | 0.63 |
| ClassificationViaRegression | 0.62 | 0.67 | 0.69 |
| PART | 0.66 | 0.67 | 0.65 |
| Bagging+REPTree | 0.70 | 0.69 | 0.69 |
| RandomForest | |||
| J48 | 0.68 | 0.69 | 0.67 |
| EXTRA Tree | 0.69 | 0.66 | 0.63 |
Best results appear in boldface
Results of the EXTRAE algorithm on the HUF corpus using different seed sizes, reported as F-Measure, AUC-ROC, and AU-PRC
| Seed size | Iterations | p value | F-Measure | AUC-ROC | AU-PRC |
|---|---|---|---|---|---|
| 5 | 3 | 4.79E-13 | 0.73 | | |
| 10 | 7 | 4.79E-13 | | | |
| 15 | 8 | 3.67E-13 | 0.72 | 0.79 | 0.80 |
| 20 | 14 | 3.67E-13 | 0.73 | | |
| 25 | 8 | 5.34E-10 | 0.74 | 0.79 | 0.80 |
| 35 | 6 | 3.3E-6 | 0.73 | 0.78 | 0.80 |
| 50 | 13 | 3.3E-6 | 0.74 | 0.78 | 0.80 |
| 75 | 4 | 3.35E-9 | 0.72 | 0.79 | |
| 100 | 5 | 3.35E-9 | 0.69 | 0.79 | |
| 125 | 5 | 3.35E-9 | 0.74 | 0.79 | 0.80 |
| 150 | 5 | 3.67E-13 | 0.72 | 0.79 | 0.80 |
| 175 | 4 | 3.67E-13 | 0.72 | 0.79 | 0.80 |
| 200 | 6 | 3.67E-13 | 0.74 | | |
Iterations is the maximum number of iterations reached, and the p value threshold is obtained automatically by applying the filter approach to the seed set. Best results appear in boldface
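One plausible reading of obtaining the p value "automatically using the filter approach on the seed set" is a sweep over candidate cut-offs, keeping the one whose filter best reproduces the seed labels. The sketch below assumes exactly that; `best_threshold` and its interface are hypothetical, and the F-measure is computed from scratch for self-containment.

```python
def f_measure(y_true, y_pred):
    """Harmonic mean of precision and recall for boolean labels."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(seed_p_values, seed_labels,
                   candidates=tuple(10.0 ** -k for k in range(2, 15))):
    """Pick the p value cut-off whose filter (p <= cut-off means
    'true rule') best reproduces the seed labels, by F-measure.
    Candidates span 1E-2 .. 1E-14, as in the unsupervised table."""
    return max(candidates,
               key=lambda t: f_measure(seed_labels,
                                       [p <= t for p in seed_p_values]))
```

Because the sweep only needs the small seed set, the threshold adapts to each corpus without extra annotation effort.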
Evolution of learning from a seed set of 10 rules, reported as F-Measure, AUC-ROC, AU-PRC, and Accuracy
| Iteration | Coincident rules | F-Measure | AUC-ROC | AU-PRC | Accuracy (%) |
|---|---|---|---|---|---|
| 0 | – | 0.55 | 0.61 | 0.61 | 58 |
| 1 | 793 | 0.70 | 0.76 | 0.78 | 74 |
| 2 | 159 | 0.71 | 0.78 | 0.80 | 76 |
| 3 | 19 | 0.74 | 0.78 | | 77 |
| 4 | 3 | 0.73 | | | 77 |
| 5 | 3 | 0.71 | 0.79 | 0.80 | 77 |
| 6 | 2 | 0.72 | 0.79 | | 78 |
| 7 | 4 | | | | |
Coincident rules are those rules from the development set whose classifier prediction matches the label assigned by the p value filter. Best results appear in boldface
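Under that definition, counting coincident rules reduces to comparing the classifier's prediction with the p value filter's label on each development-set rule. A minimal sketch, with a hypothetical rule representation (each rule as a `(features, p_value)` pair and `classifier` as any callable returning a boolean):

```python
def coincident_rules(dev_rules, classifier, threshold):
    """Count development-set rules whose classifier prediction agrees
    with the p value filter label (p <= threshold means 'true rule').
    Both the pair representation and `classifier` are hypothetical."""
    return sum(
        classifier(features) == (p_value <= threshold)
        for features, p_value in dev_rules
    )
```

When this count stops changing across iterations, learning has stabilized, which matches the shrinking numbers in the table above.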