Chen Shao, Xusheng Du, Jiong Yu, Jiaying Chen.
Abstract
Outlier detection is an important research direction in data mining. To address the unstable detection results and low efficiency caused by the random feature partitioning of the data set in the Isolation Forest algorithm, an algorithm called CIIF (Cluster-based Improved Isolation Forest), which combines clustering with Isolation Forest, is proposed. CIIF first clusters the data set with k-means, selects a specific cluster from the clustering result to construct a selection matrix, and implements the algorithm's selection mechanism through this matrix; it then builds multiple isolation trees. Finally, outlier scores are computed from the average search length of each sample across the isolation trees, and the Top-n objects with the highest outlier scores are reported as outliers. Comparative experiments with six algorithms on eleven real data sets show that CIIF performs better; compared with the Isolation Forest algorithm, the average AUC (Area Under the ROC Curve) of the proposed CIIF algorithm is improved by 7%.
Keywords: Isolation Forest; clustering; k-means; selection matrix
Year: 2022 PMID: 35626495 PMCID: PMC9141139 DOI: 10.3390/e24050611
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.738
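The scoring that CIIF builds on is the standard Isolation Forest anomaly score, s(x) = 2^(−E[h(x)]/c(ψ)), where h(x) is the search depth of x in an isolation tree and c(ψ) is the average path length of an unsuccessful search in a binary search tree over ψ samples. Below is a minimal pure-Python sketch of that base scoring only; the paper's k-means step, selection matrix, and Top-n ranking are not reproduced, and all function names are illustrative, not the authors' implementation.

```python
import math
import random

EULER = 0.5772156649  # Euler-Mascheroni constant

def c(n):
    """Average path length of an unsuccessful BST search over n samples."""
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + EULER) - 2.0 * (n - 1) / n

def build_tree(X, height, max_height, rng):
    """Grow an isolation tree by recursive random axis-aligned splits."""
    n = len(X)
    if height >= max_height or n <= 1:
        return ('leaf', n)
    q = rng.randrange(len(X[0]))          # random split attribute
    lo = min(x[q] for x in X)
    hi = max(x[q] for x in X)
    if lo == hi:
        return ('leaf', n)
    p = rng.uniform(lo, hi)               # random split value
    left = [x for x in X if x[q] < p]
    right = [x for x in X if x[q] >= p]
    return ('node', q, p,
            build_tree(left, height + 1, max_height, rng),
            build_tree(right, height + 1, max_height, rng))

def path_length(x, tree, height=0):
    """Search depth of x, with the standard leaf-size adjustment c(n)."""
    if tree[0] == 'leaf':
        return height + c(tree[1])
    _, q, p, left, right = tree
    return path_length(x, left if x[q] < p else right, height + 1)

def iforest_scores(X, n_trees=100, sample_size=256, seed=0):
    """Anomaly score s(x) = 2**(-E[h(x)] / c(psi)); higher = more anomalous."""
    rng = random.Random(seed)
    psi = min(sample_size, len(X))
    max_h = math.ceil(math.log2(psi))
    trees = [build_tree(rng.sample(X, psi), 0, max_h, rng)
             for _ in range(n_trees)]
    return [2.0 ** (-sum(path_length(x, t) for t in trees)
                    / (n_trees * c(psi))) for x in X]
```

Scoring a Gaussian cluster plus one far-away point ranks the far point highest, since random splits isolate it in few steps and a short average path length yields a score close to 1.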
Figure 1. The principle of IF: (a) normal data; (b) outlier. The blue asterisks indicate normal data, and the red asterisk indicates an outlier.
Figure 2. The principle of CIIF: (a) normal data; (b) outlier. The blue asterisks indicate normal data, the red asterisk indicates an outlier, and the data in the selected cluster are indicated by purple dots.
Testing datasets.
| Dataset | Data Volume | Dimension | Number of Outliers | Outlier Ratio % |
|---|---|---|---|---|
| breastw | 683 | 9 | 239 | 34.9927 |
| annthyroid | 7200 | 6 | 534 | 7.4167 |
| arrhythmia | 452 | 274 | 66 | 14.6018 |
| pima | 768 | 9 | 268 | 34.8958 |
| speech | 3686 | 400 | 61 | 1.6549 |
| thyroid | 3772 | 6 | 93 | 2.4655 |
| vertebral | 240 | 6 | 30 | 12.5 |
| wine | 129 | 13 | 10 | 7.7519 |
| ionosphere | 351 | 33 | 126 | 35.8974 |
| shuttle | 49,097 | 9 | 3511 | 7.1511 |
| cardio | 1822 | 21 | 176 | 9.6122 |
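The last column of the table is simply the outlier count divided by the data volume, expressed as a percentage. A one-line sketch (the helper name is illustrative):

```python
def outlier_ratio(n_outliers, n_total):
    """Outlier ratio in percent, as in the last column of the table."""
    return round(n_outliers / n_total * 100, 4)

print(outlier_ratio(239, 683))   # breastw row
```

For example, breastw has 239 outliers among 683 samples, giving 34.9927%.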
Confusion matrix of classification results.
| Actual \ Forecast | Positive | Negative |
|---|---|---|
| Positive | TP | FN |
| Negative | FP | TN |
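From the confusion matrix, TPR = TP/(TP+FN) and FPR = FP/(FP+TN) trace the ROC curve as the decision threshold varies, and the AUC equals the probability that a randomly chosen positive (outlier) is scored above a randomly chosen negative. A minimal sketch of that rank-based computation (function name illustrative, not from the paper):

```python
def roc_auc(scores, labels):
    """AUC as the Mann-Whitney rank statistic: P(score of a random
    positive > score of a random negative), ties counted as 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranking of outliers above inliers gives AUC 1.0, and a ranking no better than chance gives 0.5, which is why values near 0.5 in the results table indicate an uninformative detector.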
AUC of several algorithms on 11 real-world datasets. The highlighted data indicate the best and second-best values on each data set.
| Data | CIIF | IF | LOF | KNN | COF | FastABOD | LDOF |
|---|---|---|---|---|---|---|---|
| breastw | — | 0.9876 | 0.2421 | — | 0.1273 | 0.6220 | 0.6394 |
| annthyroid | — | — | 0.6958 | 0.6938 | 0.6523 | 0.2153 | 0.7377 |
| arrhythmia | — | — | 0.5092 | 0.5092 | 0.7229 | 0.2562 | 0.5092 |
| pima | — | 0.8064 | 0.4491 | — | 0.4859 | 0.2580 | 0.5221 |
| speech | — | 0.4539 | 0.5467 | 0.4821 | 0.5747 | 0.2662 | — |
| thyroid | — | — | 0.6836 | 0.9481 | 0.6121 | 0.1642 | 0.7098 |
| vertebral | — | 0.3659 | 0.4846 | 0.3238 | 0.4805 | — | 0.5281 |
| wine | — | 0.7829 | 0.4008 | — | 0.2319 | 0.5647 | 0.4496 |
| ionosphere | 0.8624 | 0.8527 | 0.8643 | — | 0.8529 | 0.1738 | — |
| shuttle | — | — | 0.5184 | 0.6339 | 0.5534 | 0.4172 | 0.5208 |
| cardio | — | 0.9042 | 0.6128 | — | 0.5796 | 0.4759 | 0.5798 |

— denotes a highlighted (best or second-best) value that was not preserved in the extracted text.
Figure 3. ROC curves of several algorithms on the eleven datasets: (a) breastw; (b) annthyroid; (c) arrhythmia; (d) pima; (e) speech; (f) thyroid; (g) vertebral; (h) wine; (i) ionosphere; (j) shuttle; (k) cardio.
Figure 4. Computational times (s).
Figure 5. Computational time on shuttle (s).
Figure 6. Impact of I on AUC.
Figure 7. Impact of the subsampling size X on AUC.