Zeynel Cebeci, Cagatay Cebeci, Yalcin Tahtali, Lutfi Bayyurt.
Abstract
Outliers are data points that deviate significantly from the other data points in a data set because of different mechanisms or unusual processes. Outlier detection is an intensively studied research topic, used to identify novelties, frauds, anomalies, deviations, or exceptions, as well as for data cleansing in data science. In this study, we propose two novel outlier detection approaches that use typicality degrees, which are the partitioning result of unsupervised possibilistic clustering algorithms. The proposed approaches are based on finding the atypical data points whose typicality falls below a predefined threshold value, a possibilistic level for evaluating a point as an outlier. The experiments on synthetic and real data sets showed that the proposed approaches can be used successfully to detect outliers without considering the structure and distribution of the features in multidimensional data sets.
©2022 Cebeci et al.
Keywords: Anomaly detection; Data analysis; Fuzzy and possibilistic clustering; Outlier detection; Unsupervised learning
Year: 2022 PMID: 36262121 PMCID: PMC9575855 DOI: 10.7717/peerj-cs.1060
Source DB: PubMed Journal: PeerJ Comput Sci ISSN: 2376-5992
Figure 1. 2D scatter plot for SDS1.
Outliers detected in SDS1.
| Number of clusters | Approach 1 | Approach 2 |
|---|---|---|
| 2 | 39 121 122 123 124 125 126 127 128 129 130 | 121 122 123 124 125 127 129 130 |
| 3 | 1 6 121 122 123 124 125 126 127 128 129 130 | 121 122 123 125 127 128 129 130 |
| 4 | 121 122 123 124 125 126 127 128 129 130 | 121 122 123 125 126 127 128 129 130 |
| 5 | 121 122 123 124 125 126 127 128 (129 130) | 121 123 125 126 128 |
| 6 | 104 121 122 123 124 125 126 127 128 | 121 123 125 126 128 |
Values of the validity indices by different number of clusters for SDS1.
| Number of clusters | | | | |
|---|---|---|---|---|
| 2 | 0.1167624 | 8.043394 | 0.09970858 | 0.8583530 |
| 3 | 0.1385422 | 8.044566 | 1.87827975 | 0.8379177 |
| 4 | 0.1853725 | 8.732699 | 0.01513389 | 0.8392733 |
| 5 | 0.1059272 | 6.024347 | 0.95398914 | 0.8742973 |
| 6 | 1.4147905 | 10.132927 | 0.06037724 | 0.7632591 |
Figure 2. 2D scatter plot (p1 vs p2) for SDS2.
Figure 3. Outliers detected from the results of possibilistic partitioning for four and five clusters on SDS1.
Values of the validity indices by different number of clusters for SDS2.
| Number of clusters | | | | |
|---|---|---|---|---|
| 2 | 0.08256631 | 4.213788 | 257.2904 | 0.9480826 |
| 3 | 0.85765530 | 43.048094 | 30203.7867 | 0.6795931 |
| 4 | 0.42083833 | 22.494335 | 42228.4537 | 0.7041015 |
| 5 | 3.36639018 | 177.540583 | 70887.5026 | 0.4275838 |
| 6 | 2.65760351 | 146.300871 | 45277.8600 | 0.4881886 |
Figure 4. Clusters and outliers detected from the result of possibilistic partitioning for two clusters on SDS2.
Outliers detected in the synthetic data sets by some methods in R environment.
| Data set | | | | |
|---|---|---|---|---|
| SDS1 | 20 104 121 122 123 124 125 126 127 128 | 86 to 130 | 121 122 123 124 125 126 127 128 129 130 | – |
| SDS2 | 2 14 33 35 40 41 42 43 44 45 | 9 14 40 41 43 44 45 | no outliers detected | 14 35 41 42 43 44 45 |
Real data sets used for evaluation of the proposed approaches.
| Data set | Instances | Features | Outliers (%) |
|---|---|---|---|
| b-cancer | 367 | 30 | 10 (2.70) |
| letter | 1600 | 32 | 100 (6.25) |
| pen-global | 809 | 16 | 90 (11.10) |
| satellite | 5100 | 36 | 75 (1.47) |
| wine | 129 | 13 | 10 (7.70) |
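
The parenthesised values in the last column read as the share of outliers in percent. As a quick arithmetic check, the sketch below recomputes that share from the counts, assuming the second column is the number of instances and the last column is the number of outliers followed by its percentage. The names `datasets` and `outlier_percentage` are hypothetical and only serve this illustration.

```python
# (instances, outliers, percentage as listed) copied from the table above.
# Column meanings are an assumption; the header is not preserved in the source.
datasets = {
    "b-cancer": (367, 10, 2.70),
    "letter": (1600, 100, 6.25),
    "pen-global": (809, 90, 11.10),
    "satellite": (5100, 75, 1.47),
    "wine": (129, 10, 7.70),
}


def outlier_percentage(n_outliers, n_instances):
    """Return the share of outliers in a data set, in percent."""
    return 100.0 * n_outliers / n_instances


for name, (n_instances, n_outliers, listed) in datasets.items():
    recomputed = outlier_percentage(n_outliers, n_instances)
    print(f"{name:<10} recomputed = {recomputed:5.2f}%  listed = {listed:5.2f}%")
```

Under these assumptions the recomputed shares agree with the listed ones to within about 0.1 percentage points.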
Number of outliers detected on the real data sets.
| Data set | Number of clusters | Approach 1 | | | Approach 2 | | |
|---|---|---|---|---|---|---|---|
| b-cancer | 2 | 10 | 16 | 21 | 9 | 10 | 16 |
| letter | 2 | 40 | 190 | 491 | 18 | 73 | 242 |
| pen-global | 3 | 100 | 180 | 239 | 83 | 140 | 206 |
| satellite | 2 | 78 | 153 | 372 | 61 | 95 | 207 |
| wine | 2 | 2 | 7 | 16 | 2 | 3 | 8 |
Pseudocode of the two proposed outlier detection approaches.

    Input: T, alpha, apr
      // T, typicality degrees matrix of dimension n x c, built by an
      // unsupervised possibilistic clustering algorithm
      // alpha, threshold possibility value for outlier testing
      // apr, number of the approach to be used in outlier detection
    Output: Outliers
      // Outliers, vector of length n to store the outlier flags
    n <- count of rows of matrix T
    c <- count of columns of matrix T
    // If alpha is undefined, use 0.05 as the default value
    if alpha is null then alpha <- 0.05
    Outliers <- {0} // Assign 0 to all elements of Outliers
    for i <- 1 to n do
      if apr = 1 then
        sumT <- 0
        for k <- 1 to c do
          sumT <- sumT + T[i,k]
        end
        avgT <- sumT / c
        if avgT <= alpha then
          Outliers[i] <- 1
        end
      else
        if apr = 2 then
          isOutlier <- True
          for k <- 1 to c do
            if T[i,k] >= alpha then
              isOutlier <- False
            end
          end
          if isOutlier = True then
            Outliers[i] <- 1
          end
        end
      end
    end
    return Outliers
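
The flagging logic of the pseudocode can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `detect_outliers` and the demo matrix `T_demo` are hypothetical, the outer loop is assumed to run over rows (data points) and the inner loop over columns (clusters), and the flags are assumed to be indexed by row.

```python
def detect_outliers(T, alpha=0.05, apr=1):
    """Flag outliers from an n x c matrix of typicality degrees.

    T     : list of n rows, each holding c typicality degrees
    alpha : threshold possibility value (0.05 when None, as in the pseudocode)
    apr   : 1 -> flag a point when its mean typicality is <= alpha
            2 -> flag a point when every typicality is < alpha
    Returns a list of n flags (1 = outlier, 0 = not an outlier).
    """
    if alpha is None:
        alpha = 0.05
    flags = [0] * len(T)
    for i, row in enumerate(T):
        if apr == 1:
            if sum(row) / len(row) <= alpha:
                flags[i] = 1
        elif apr == 2:
            if all(t < alpha for t in row):
                flags[i] = 1
        # Any other value of apr leaves all flags at 0, as in the pseudocode.
    return flags


# Hypothetical typicality degrees for 4 points and 2 clusters (not from the paper).
T_demo = [
    [0.90, 0.02],  # typical of cluster 1
    [0.03, 0.85],  # typical of cluster 2
    [0.01, 0.02],  # atypical of both clusters
    [0.08, 0.01],  # mean is 0.045, but one degree is not below alpha
]

out1 = detect_outliers(T_demo, alpha=0.05, apr=1)
out2 = detect_outliers(T_demo, alpha=0.05, apr=2)
print(out1)  # [0, 0, 1, 1]
print(out2)  # [0, 0, 1, 0]
```

A row whose typicality degrees are all below alpha necessarily has a mean below alpha, so every point flagged by the second approach is also flagged by the first. The fourth demo row shows that the converse need not hold.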