Serkan Ucer, Tansel Ozyer, Reda Alhajj.
Abstract
We propose a new type of supervised visual machine learning classifier, GSNAc, based on graph theory and social network analysis techniques. In a previous study, we employed social network analysis techniques and introduced a novel classification model (called Social Network Analysis-based Classifier-SNAc) which efficiently works with time-series numerical datasets. In this study, we have extended SNAc to work with any type of tabular data by showing its classification efficiency on a broader collection of datasets that may contain numerical and categorical features. This version of GSNAc simply works by transforming traditional tabular data into a network where samples of the tabular dataset are represented as nodes and similarities between the samples are reflected as edges connecting the corresponding nodes. The raw network graph is further simplified and enriched by its edge space to extract a visualizable 'graph classifier model-GCM'. The concept of the GSNAc classification model relies on the study of node similarities over network graphs. In the prediction step, the GSNAc model maps test nodes into GCM, and evaluates their average similarity to classes by employing vectorial and topological metrics. The novel side of this research lies in transforming multidimensional data into a 2D visualizable domain. This is realized by converting a conventional dataset into a network of 'samples' and predicting classes after a careful and detailed network analysis. We exhibit the classification performance of GSNAc as an effective classifier by comparing it with several well-established machine learning classifiers using some popular benchmark datasets. GSNAc has demonstrated superior or comparable performance compared to other classifiers. Additionally, it introduces a visually comprehensible process for the benefit of end-users. 
As a result, a spin-off contribution of GSNAc lies in the interpretability of the prediction task, since the process is human-comprehensible and highly visual.
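The core transformation the abstract describes — rows of a tabular dataset become nodes, and pairwise similarities become weighted edges — can be sketched as follows. This is a hypothetical illustration, not the paper's exact procedure: the similarity measure (cosine) and the edge-pruning threshold are assumptions.

```python
# Hypothetical sketch of the tabular-data-to-graph step: rows become
# nodes, pairwise similarities become weighted edges. The cosine
# similarity and the 0.99 pruning threshold are illustrative choices,
# not the paper's exact ones.
import numpy as np
import networkx as nx
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity

X, y = load_iris(return_X_y=True)
X = MinMaxScaler().fit_transform(X)        # scale features to [0, 1]
S = cosine_similarity(X)                   # adjacency (similarity) matrix

G = nx.Graph()
G.add_nodes_from((i, {"cls": int(y[i])}) for i in range(len(X)))

threshold = 0.99                           # assumed pruning threshold
rows, cols = np.where(np.triu(S, k=1) > threshold)
G.add_weighted_edges_from((i, j, S[i, j]) for i, j in zip(rows, cols))

print(G.number_of_nodes(), G.number_of_edges())
```

The resulting weighted graph is what the paper then simplifies and enriches into the visualizable graph classifier model (GCM).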
Year: 2022 PMID: 36075941 PMCID: PMC9458666 DOI: 10.1038/s41598-022-19419-7
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Figure 1The brief pipeline of the GSNAc model.
Figure 2Illustrative box diagram of a general machine learning classification model.
Figure 3Tabular data (i) converted to an adjacency (similarity) matrix (ii), and a graph (iii) constructed and enriched with node degrees and edge weights, and colored by communities (classes). The graph layout intuitively shows the importance of some nodes (e.g., H. sapiens and T-Rex have the highest number of weighted connections and are therefore larger) and visually depicts how similar nodes are (e.g., Shark has the smallest similarity to the other nodes, so it lies distant from the core).
Figure 4General overview of the GSNAc model, illustrated on the Iris dataset.
Figure 5From raw data to GCM, demonstrated on the C. elegans dataset.
Figure 6(a) Prediction Phase 1 sample diagram on the C. elegans dataset, illustrating the prediction attempt for the test node ADFL (not displayed in the diagram). The raw list of ADFL's neighbors is sorted by vectorial similarity, grouped by class, and finally the similarities are averaged per class. No prediction is made, since the two highest-averaged classes have very close scores (under the margin). (b) Prediction Phase 2 sample diagram on the C. elegans dataset, illustrating the prediction of the test node ADFL (not displayed in the diagram). This time, the prediction is made based on topological similarity, in the same fashion as the first round but without seeking a distinctive margin: the class with the highest average topological similarity is assigned as the prediction.
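The two-phase prediction logic in the Figure 6 caption can be sketched as below. This is a minimal illustration under stated assumptions: the neighbor tuples, class labels, and the margin value are invented for the example, and the paper's actual similarity metrics differ.

```python
# Minimal sketch of the two-phase prediction described in Figure 6.
# Phase 1 averages vectorial similarity per class and only decides when
# the top class beats the runner-up by a margin; Phase 2 falls back to
# topological similarity without a margin. Values here are illustrative.
from collections import defaultdict

def predict(neighbors, margin=0.05):
    """neighbors: list of (class_label, vectorial_sim, topological_sim)."""
    def class_averages(index):
        sums, counts = defaultdict(float), defaultdict(int)
        for cls, *sims in neighbors:
            sums[cls] += sims[index]
            counts[cls] += 1
        return {c: sums[c] / counts[c] for c in sums}

    # Phase 1: vectorial similarity, decide only with a distinctive margin.
    ranked = sorted(class_averages(0).items(), key=lambda kv: -kv[1])
    if len(ranked) == 1 or ranked[0][1] - ranked[1][1] >= margin:
        return ranked[0][0]

    # Phase 2: topological similarity, take the top class unconditionally.
    return max(class_averages(1).items(), key=lambda kv: kv[1])[0]

# Illustrative neighbors of a test node: (class, vectorial, topological).
nbrs = [("CL-A", 0.80, 0.30), ("CL-A", 0.78, 0.35), ("CL-B", 0.79, 0.60)]
print(predict(nbrs))  # vectorial averages tie, so Phase 2 picks CL-B
```

Here the per-class vectorial averages (0.79 vs. 0.79) fall under the margin, so the decision defers to topological similarity, mirroring the ADFL example in the caption.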
The datasets used in the experiments.
| Dataset name | Domain | Sample size | CV folds | Numerical features | Categorical features | Total features | Class labels | Class distribution |
|---|---|---|---|---|---|---|---|---|
| Caravan insurance | Finance | 983 | 5 | 81 | 4 | 85 | 2 | Imbalanced |
| Iris | Biology | 149 | 5 | 4 | 0 | 4 | 3 | Balanced |
| Connectome | Biology | 279 | 2 | 10 | 2 | 12 | 10 | Imbalanced |
| Colon | Medicine | 62 | 2 | 1988 | 0 | 1988 | 2 | Imbalanced |
| Lymphoma | Medicine | 96 | 2 | 4026 | 0 | 4026 | 9 | Imbalanced |
| PBC | Medicine | 276 | 2 | 16 | 2 | 18 | 4 | Imbalanced |
| Heart Kaggle | Medicine | 299 | 5 | 10 | 2 | 12 | 2 | Imbalanced |
| Heart UCI | Medicine | 302 | 5 | 6 | 7 | 13 | 2 | Balanced |
| COVID | Medicine | 436 | 2 | 36 | 1 | 37 | 2 | Imbalanced |
| Breast cancer wisconsin | Medicine | 569 | 5 | 30 | 0 | 30 | 2 | Imbalanced |
| Pima diabetes | Medicine | 768 | 5 | 8 | 0 | 8 | 2 | Imbalanced |
| Voice | Signal processing | 474 | 5 | 20 | 0 | 20 | 2 | Balanced |
| Forest type | Signal processing | 497 | 5 | 10 | 44 | 54 | 7 | Imbalanced |
| Digits | Signal processing | 1797 | 5 | 64 | 0 | 64 | 10 | Balanced |
| Wine | Statistics | 178 | 5 | 13 | 0 | 13 | 3 | Balanced |
| Titanic | Statistics | 183 | 5 | 4 | 3 | 7 | 2 | Balanced |
| weather rain | Statistics | 226 | 5 | 16 | 5 | 21 | 2 | Imbalanced |
| Pokerhand | Statistics | 250 | 2 | 0 | 10 | 10 | 6 | Imbalanced |
| Make blobs | Synthetic | 300 | 5 | 2 | 0 | 2 | 2 | Balanced |
| Make moons | Synthetic | 500 | 5 | 2 | 0 | 2 | 2 | Balanced |
Performance comparison of GSNAc with other classifiers. F1 stands for the weighted F1 score.
| Dataset | Decision Tree F1 | Rank | Gaussian RBF F1 | Rank | NB Gaussian F1 | Rank | AdaBoost F1 | Rank | kNN F1 | Rank | Random Forest F1 | Rank | XGBoost F1 | Rank | ANN-MLP F1 | Rank | SVM Linear F1 | Rank | SVM RBF F1 | Rank | GSNAc F1 | Rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Colon | 0.7913 | 6 | 0.2828 | 11 | 0.7979 | 5 | 0.7163 | 8 | 0.6696 | 9 | 0.8193 | 4 | 0.7892 | 7 | 0.8234 | 3 | 0.6262 | 10 | ||||
| Connectome | 0.7887 | 3 | 0.5919 | 7 | 0.5133 | 10 | 0.3932 | 11 | 0.5391 | 8 | 0.7795 | 4 | 0.8032 | 2 | 0.7430 | 5 | 0.6118 | 6 | 0.5191 | 9 | ||
| COVID | 0.7074 | 10 | 0.7420 | 7 | 0.7296 | 9 | 0.7394 | 8 | 0.6506 | 11 | 0.7558 | 4 | 0.7660 | 2 | 0.7514 | 5 | 0.7642 | 3 | 0.7509 | 6 | ||
| Forest type | 0.5713 | 2 | 0.5657 | 6 | 0.4116 | 11 | 0.4319 | 10 | 0.5458 | 9 | 0.5675 | 5 | 0.5545 | 8 | 0.5596 | 7 | 0.5692 | 3 | 0.5677 | 4 | ||
| Iris | 0.9329 | 9 | 0.9530 | 3 | 0.9530 | 4 | 0.9259 | 11 | 0.9530 | 2 | 0.9396 | 8 | 0.9262 | 10 | 0.9465 | 7 | 0.9530 | 5 | 0.9466 | 6 | ||
| Make blobs | 0.7100 | 11 | 0.7867 | 3 | 0.7733 | 5 | 0.7499 | 9 | 0.7533 | 8 | 0.7599 | 7 | 0.7266 | 10 | 0.7833 | 4 | 0.7667 | 6 | 0.7867 | 2 | ||
| Make moons | 0.9920 | 8 | 0.9740 | 9 | 0.8800 | 11 | 0.9960 | 7 | 0.9980 | 6 | 0.8800 | 10 | ||||||||||
| PBC | 0.3949 | 10 | 0.4530 | 5 | 0.2544 | 11 | 0.4290 | 8 | 0.4772 | 3 | 0.4332 | 7 | 0.4241 | 9 | 0.4522 | 6 | 0.4898 | 2 | 0.4707 | 4 | ||
| Pokerhands | 0.3729 | 10 | 0.4415 | 2 | 0.3890 | 9 | 0.4020 | 7 | 0.4157 | 6 | 0.4188 | 4 | 0.3609 | 11 | 0.3942 | 8 | 0.4288 | 3 | 0.4169 | 5 | ||
| Weather rain | 0.7364 | 10 | 0.7419 | 9 | 0.5808 | 11 | 0.8076 | 4 | 0.7454 | 8 | 0.8463 | 2 | 0.8279 | 3 | 0.7916 | 6 | 0.7476 | 7 | 0.8030 | 5 | ||
| Titanic | 0.7388 | 7 | 0.7339 | 8 | 0.6380 | 11 | 0.7596 | 3 | 0.7561 | 4 | 0.7426 | 5 | 0.7070 | 10 | 0.7397 | 6 | 0.7315 | 9 | 0.7700 | 2 | ||
| Heart UCI | 0.7618 | 11 | 0.7810 | 9 | 0.7859 | 8 | 0.8040 | 7 | 0.8145 | 4 | 0.8047 | 5 | 0.8044 | 6 | 0.7713 | 10 | 0.8172 | 3 | 0.8176 | 2 | ||
| Heart kaggle | 0.7502 | 9 | 0.6910 | 11 | 0.7540 | 8 | 0.7945 | 6 | 0.7131 | 10 | 0.8189 | 4 | 0.7865 | 7 | 0.8259 | 2 | 0.8039 | 5 | 0.8193 | 3 | ||
| Caravan insurance | 0.8759 | 10 | 0.8973 | 7 | 0.6098 | 11 | 0.9065 | 5 | 0.9089 | 3 | 0.9063 | 6 | 0.9089 | 2 | 0.8968 | 8 | 0.8940 | 9 | 0.9089 | 3 | ||
| Lymphoma | 0.6768 | 7 | 0.0074 | 11 | 0.4587 | 10 | 0.5638 | 9 | 0.7269 | 5 | 0.7600 | 4 | 0.7144 | 6 | 0.9087 | 2 | 0.6631 | 8 | 0.9060 | 3 | ||
| Voice | 0.9578 | 6 | 0.9557 | 9 | 0.8945 | 11 | 0.9557 | 7 | 0.9430 | 10 | 0.9599 | 5 | 0.9641 | 3 | 0.9662 | 2 | 0.9557 | 8 | 0.9620 | 4 | ||
| Breast cancer wisconsin | 0.9193 | 11 | 0.9665 | 6 | 0.9296 | 10 | 0.9665 | 7 | 0.9646 | 8 | 0.9612 | 9 | 0.9700 | 4 | 0.9753 | 3 | 0.9753 | 2 | 0.9683 | 5 | ||
| Digits | 0.8523 | 9 | 0.9674 | 7 | 0.7840 | 10 | 0.2633 | 11 | 0.9766 | 3 | 0.9760 | 4 | 0.9644 | 8 | 0.9710 | 6 | 0.9788 | 2 | 0.9755 | 5 | ||
| Wine | 0.9214 | 10 | 0.9717 | 5 | 0.9719 | 4 | 0.9046 | 11 | 0.9605 | 8 | 0.9719 | 3 | 0.9494 | 9 | 0.9832 | 2 | 0.9605 | 7 | 0.9661 | 6 | ||
| Pima diabetes | 0.6904 | 11 | 0.7471 | 6 | 0.7514 | 5 | 0.7393 | 8 | 0.7320 | 9 | 0.7571 | 2 | 0.7275 | 10 | 0.7530 | 4 | 0.7560 | 3 | 0.7436 | 7 | ||
Bold values indicate the top score and rank for the respective dataset. Rows are sorted by the GSNAc model's success; columns (classifiers) are sorted by their respective cumulative successes.
The classifiers used in the experiments.
| Classifier name | Classifier type | Parameters |
|---|---|---|
| AdaBoost | Ensemble-tree based | base_estimator = None, n_estimators = 50, learning_rate = 1.0, algorithm = 'SAMME.R', random_state = None |
| Artificial neural networks (ANN) | Function based | max_iter = 2000, hidden_layer_sizes = (100,), activation = 'relu', solver = 'adam', alpha = 0.0001, batch_size = 'auto', learning_rate = 'constant', learning_rate_init = 0.001, power_t = 0.5, shuffle = True, random_state = None, tol = 0.0001, verbose = False, warm_start = False, momentum = 0.9, nesterovs_momentum = True, early_stopping = False, validation_fraction = 0.1, beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-08, n_iter_no_change = 10, max_fun = 15000 |
| Decision tree | Hierarchical | criterion = 'gini', splitter = 'best', max_depth = None, min_samples_split = 2, min_samples_leaf = 1, min_weight_fraction_leaf = 0.0, max_features = None, random_state = None, max_leaf_nodes = None, min_impurity_decrease = 0.0, class_weight = None, ccp_alpha = 0.0 |
| Naive Bayes Gaussian | Statistical | priors = None, var_smoothing = 1e-09 |
| Gaussian process RBF kernel | Statistical | kernel = RBF, optimizer = 'fmin_l_bfgs_b', n_restarts_optimizer = 0, max_iter_predict = 100, warm_start = False, copy_X_train = True, random_state = None, multi_class = 'one_vs_rest', n_jobs = None |
| k nearest neighbours (kNN) | Statistical | n_neighbors = 5, weights = 'uniform', algorithm = 'auto', leaf_size = 30, p = 2, metric = 'minkowski', metric_params = None, n_jobs = -1 |
| Random forest | Ensemble-tree based | n_estimators = 100, criterion = 'gini', max_depth = None, min_samples_split = 2, min_samples_leaf = 1, min_weight_fraction_leaf = 0.0, max_features = 'auto', max_leaf_nodes = None, min_impurity_decrease = 0.0, bootstrap = True, oob_score = False, n_jobs = None, random_state = None, verbose = 0, warm_start = False, class_weight = None, ccp_alpha = 0.0, max_samples = None |
| SVM Linear Kernel | Function based | C = 1.0, kernel = 'linear', degree = 3, gamma = 'scale', coef0 = 0.0, shrinking = True, probability = False, tol = 0.001, cache_size = 200, class_weight = None, verbose = False, max_iter = -1, decision_function_shape = 'ovr', break_ties = False, random_state = None |
| SVM RBF kernel | Function based | C = 1.0, kernel = 'rbf', degree = 3, gamma = 'scale', coef0 = 0.0, shrinking = True, probability = False, tol = 0.001, cache_size = 200, class_weight = None, verbose = False, max_iter = -1, decision_function_shape = 'ovr', break_ties = False, random_state = None |
| XGBoost | Ensemble-tree based | default parameter set |
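The parameters in the table above are scikit-learn's default signatures, so the comparison setup can be approximated as below. This is a sketch, not the paper's exact experiment: only four of the listed classifiers are shown, the Iris dataset and fixed random seeds are assumptions, and the fold count follows the datasets table.

```python
# Sketch of a cross-validated comparison using scikit-learn defaults,
# mirroring the parameter table above. Dataset choice (Iris), the subset
# of classifiers, and random_state=0 are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

X, y = load_iris(return_X_y=True)
classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, clf in classifiers.items():
    # 5-fold CV with the weighted F1 score reported in the results table.
    s = cross_val_score(clf, X, y, cv=5, scoring="f1_weighted")
    scores[name] = s.mean()
    print(f"{name}: {s.mean():.4f}")
```

Each classifier's mean weighted F1 over the folds corresponds to one cell of the results table above.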
Figure 7Explainable AI approach versus today's classifier models in a nutshell.
Figure 8Sample visualization of the prediction step. Test node ADAR from the C. elegans dataset has been classified as CL-E (orange).
Figure 9Overall classification results on the C. elegans dataset. Blue indicates that the specific sample has been predicted correctly by the respective classifier (such as AdaBoost); red indicates an incorrect prediction. Black boxes mark cases where the same sample was predicted correctly (blue) by the XGBoost and AdaBoost classifiers but incorrectly (red) by GSNAc; orange boxes mark the opposite cases (i.e., incorrect by XGBoost and AdaBoost but correct by GSNAc).