Soroosh Shalileh, Boris Mirkin.
Abstract
This paper proposes a meaningful and effective extension of the celebrated K-means algorithm to detect communities in feature-rich networks, based on our non-summability assumption. We apply least-squares approximation to the given matrices of inter-node links and feature values, which leads to a straightforward extension of the conventional K-means clustering method as an alternating minimization strategy for the criterion. The method works in a two-fold space that embraces both the network nodes and the features. The metric used is a weighted sum of the squared Euclidean distances in the feature and network spaces. To tackle the so-called curse of dimensionality, we extend this to a version that uses the cosine distances between entities and centers. One more version of our method is based on the Manhattan distance metric. We conduct computational experiments to test our method and compare its performance with that of popular competing algorithms on synthetic and real-world datasets. The cosine-based version of the extended K-means typically wins on the high-dimensional real-world datasets. In contrast, the Manhattan-based version wins on most synthetic datasets.
Keywords: K-means clustering; cluster analysis; community detection; data recovery; feature-rich networks; node-attributed networks; nonsummability assumption
Year: 2022 PMID: 35626512 PMCID: PMC9142054 DOI: 10.3390/e24050626
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.738
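The alternating minimization described in the abstract can be illustrated with a minimal Python sketch of the Euclidean variant. It is an illustration under stated assumptions, not the authors' implementation: the function name `two_space_kmeans`, the weights `rho` and `xi`, and the `init_idx` seeding argument are hypothetical, and each cluster is assumed to keep one center in the feature space and one in the network space (the mean of its members' rows of the link matrix).

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_rows(rows):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def two_space_kmeans(Y, P, k, rho=1.0, xi=1.0, n_iter=100, seed=0, init_idx=None):
    """K-means over features Y and link-matrix rows P (hypothetical sketch).

    Y[i] is the feature vector of node i, P[i] its row of link values.
    rho and xi are assumed weights of the feature and network distances.
    """
    n = len(Y)
    if init_idx is None:
        init_idx = random.Random(seed).sample(range(n), k)
    feat_centers = [list(Y[i]) for i in init_idx]
    net_centers = [list(P[i]) for i in init_idx]
    labels = [-1] * n
    for _ in range(n_iter):
        # Assignment step: nearest center under the weighted two-space distance.
        new_labels = []
        for i in range(n):
            costs = [rho * sq_dist(Y[i], feat_centers[c])
                     + xi * sq_dist(P[i], net_centers[c]) for c in range(k)]
            new_labels.append(costs.index(min(costs)))
        if new_labels == labels:
            break  # no node changed its cluster
        labels = new_labels
        # Update step: both centers become within-cluster means.
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if members:  # an empty cluster keeps its previous centers
                feat_centers[c] = mean_rows([Y[i] for i in members])
                net_centers[c] = mean_rows([P[i] for i in members])
    return labels

# Toy data: two triangles of nodes with well-separated one-dimensional features.
Y_toy = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
P_toy = [[1 if (i // 3 == j // 3 and i != j) else 0 for j in range(6)]
         for i in range(6)]
toy_labels = two_space_kmeans(Y_toy, P_toy, k=2, init_idx=[0, 3])
```

Replacing `sq_dist` with a cosine or Manhattan distance would give rough analogues of the other two versions mentioned in the abstract; how centers are updated in those versions is not stated on this page, so that part is left out.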
Real-world datasets.
| Name | Vertices | Links | Attributes | Number of Communities | Ground Truth | Ref. |
|---|---|---|---|---|---|---|
| COSN | 46 | 552 | 16 | 2 | Region | [ |
| Lawyers | 71 | 339 | 18 | 6 | Derived out-of-office and status features | [ |
| World Trade | 80 | 1000 | 16 | 5 | Structural world system in 1980 features | [ |
| Malaria HVR6 | 307 | 6526 | 6 | 2 | Cys Labels | [ |
| Parliament | 451 | 11,646 | 108 | 7 | Political parties | [ |
| Cora | 2708 | 5276 | 1433 | 7 | Computer Science research area | [ |
| SinaNet | 3490 | 30,282 | 10 | 10 | Users of same forum | [ |
| Amazon Photo | 7650 | 71,831 | 745 | 8 | Product categories | [ |
KEFRiN’s performance on small-size networks with quantitative features only: the average and standard deviation of the ARI index over 10 different data sets. Both cases, with and without noise, are considered. The best results are bold-faced.
| Setting | KEFRiNe (no noise) | KEFRiNc (no noise) | KEFRiNm (no noise) | KEFRiNe (with noise) | KEFRiNc (with noise) | KEFRiNm (with noise) |
|---|---|---|---|---|---|---|
| 0.9, 0.3, 0.9 | 0.818(0.163) | 0.920(0.132) | | 0.847(0.185) | | 0.926(0.115) |
| 0.9, 0.3, 0.7 | 0.823(0.119) | 0.904(0.117) | | | 0.837(0.138) | 0.862(0.143) |
| 0.9, 0.6, 0.9 | 0.792(0.179) | 0.737(0.124) | | 0.765(0.092) | 0.738(0.174) | |
| 0.9, 0.6, 0.7 | 0.796(0.180) | | 0.802(0.184) | 0.800(0.180) | 0.765(0.162) | |
| 0.7, 0.3, 0.9 | 0.849(0.128) | | 0.880(0.152) | 0.786(0.178) | | 0.845(0.133) |
| 0.7, 0.3, 0.7 | 0.809(0.098) | 0.803(0.132) | | 0.831(0.142) | 0.760(0.200) | |
| 0.7, 0.6, 0.9 | 0.499(0.184) | | 0.303(0.086) | | 0.544(0.164) | 0.316(0.150) |
| 0.7, 0.6, 0.7 | 0.595(0.189) | | 0.306(0.114) | 0.462(0.129) | | 0.279(0.141) |
| Average | 0.748 | 0.830 | 0.831 | 0.740 | 0.758 | 0.718 |
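The ARI (adjusted Rand index) reported in these tables compares a found partition with the ground-truth one while correcting for chance agreement. A minimal sketch of the standard pair-counting formula follows; the function name is illustrative and this is not the authors' evaluation code. Values near 1 indicate near-perfect recovery, values near 0 indicate chance-level agreement, and negative values are possible.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(truth, pred):
    """Adjusted Rand index of two labelings of the same items."""
    n = len(truth)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: partitions trivially agree
    # Pairs placed together in both partitions (contingency-table cells).
    together_both = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    # Pairs placed together within each partition separately.
    together_truth = sum(comb(c, 2) for c in Counter(truth).values())
    together_pred = sum(comb(c, 2) for c in Counter(pred).values())
    expected = together_truth * together_pred / total_pairs
    max_index = (together_truth + together_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (together_both - expected) / (max_index - expected)

same_up_to_relabeling = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index ignores cluster label names, so a partition that matches the ground truth up to relabeling still scores 1.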
KEFRiN’s performance on small-size and medium-size synthetic networks with categorical features: the average and standard deviation of the ARI index over 10 different data sets. The best results are highlighted in bold-face.
| Setting | KEFRiNe (small) | KEFRiNc (small) | KEFRiNm (small) | KEFRiNe (medium) | KEFRiNc (medium) | KEFRiNm (medium) |
|---|---|---|---|---|---|---|
| 0.9, 0.3, 0.9 | 0.855(0.145) | 0.922(0.119) | | 0.508(0.205) | 0.724(0.097) | |
| 0.9, 0.3, 0.7 | 0.795(0.149) | 0.819(0.142) | | | 0.742(0.182) | 0.762(0.184) |
| 0.9, 0.6, 0.9 | | 0.726(0.097) | 0.893(0.147) | 0.279(0.204) | | 0.894(0.074) |
| 0.9, 0.6, 0.7 | 0.588(0.173) | 0.711(0.145) | | 0.766(0.180) | 0.733(0.083) | |
| 0.7, 0.3, 0.9 | 0.827(0.141) | 0.877(0.130) | | 0.364(0.247) | 0.641(0.111) | |
| 0.7, 0.3, 0.7 | 0.794(0.144) | 0.795(0.117) | | | 0.797(0.088) | 0.759(0.092) |
| 0.7, 0.6, 0.9 | 0.399(0.094) | 0.819(0.142) | | 0.426(0.246) | 0.591(0.094) | |
| 0.7, 0.6, 0.7 | 0.074(0.047) | | 0.392(0.121) | 0.671(0.196) | | 0.695(0.074) |
| Average | 0.640 | 0.812 | 0.828 | 0.578 | 0.710 | 0.810 |
The average and standard deviation of the ARI index over 10 different data sets for KEFRiN results on small-size networks combining quantitative and categorical features, with and without noise. The best results are highlighted in bold-face.
| Setting | KEFRiNe (no noise) | KEFRiNc (no noise) | KEFRiNm (no noise) | KEFRiNe (with noise) | KEFRiNc (with noise) | KEFRiNm (with noise) |
|---|---|---|---|---|---|---|
| 0.9, 0.3, 0.9 | 0.823(0.125) | 0.752(0.096) | | | 0.810(0.153) | 0.859(0.146) |
| 0.9, 0.3, 0.7 | 0.840(0.133) | 0.769(0.101) | | 0.864(0.137) | 0.858(0.143) | |
| 0.9, 0.6, 0.9 | 0.756(0.171) | 0.809(0.138) | | 0.733(0.184) | 0.717(0.130) | |
| 0.9, 0.6, 0.7 | | 0.716(0.122) | 0.754(0.193) | 0.708(0.223) | 0.549(0.186) | |
| 0.7, 0.3, 0.9 | 0.872(0.129) | 0.750(0.078) | | 0.713(0.185) | | 0.851(0.152) |
| 0.7, 0.3, 0.7 | 0.782(0.155) | 0.681(0.078) | | 0.840(0.130) | 0.647(0.143) | |
| 0.7, 0.6, 0.9 | 0.583(0.143) | | 0.389(0.122) | | 0.520(0.118) | 0.244(0.130) |
| 0.7, 0.6, 0.7 | 0.473(0.095) | | 0.207(0.098) | 0.370(0.123) | | 0.183(0.059) |
| Average | 0.745 | 0.715 | 0.716 | 0.708 | 0.675 | 0.710 |
The average and standard deviation of the ARI index over 10 different data sets for KEFRiN results on medium-size networks combining quantitative and categorical features, with and without noise. The best results are highlighted in bold-face.
| Setting | KEFRiNe (no noise) | KEFRiNc (no noise) | KEFRiNm (no noise) | KEFRiNe (with noise) | KEFRiNc (with noise) | KEFRiNm (with noise) |
|---|---|---|---|---|---|---|
| 0.9, 0.3, 0.9 | 0.570(0.121) | | 0.697(0.122) | 0.541(0.122) | | 0.777(0.116) |
| 0.9, 0.3, 0.7 | 0.540(0.158) | | 0.686(0.124) | | 0.733(0.103) | 0.768(0.131) |
| 0.9, 0.6, 0.9 | 0.641(0.068) | | 0.601(0.075) | | 0.645(0.061) | 0.641(0.069) |
| 0.9, 0.6, 0.7 | 0.672(0.082) | | 0.573(0.061) | | 0.617(0.046) | 0.624(0.074) |
| 0.7, 0.3, 0.9 | 0.614(0.085) | | 0.578(0.134) | 0.556(0.113) | | 0.551(0.104) |
| 0.7, 0.3, 0.7 | 0.543(0.081) | | 0.574(0.113) | 0.753(0.084) | | 0.658(0.108) |
| 0.7, 0.6, 0.9 | 0.385(0.120) | | 0.180(0.139) | | 0.593(0.135) | 0.077(0.105) |
| 0.7, 0.6, 0.7 | 0.255(0.050) | | 0.102(0.079) | | 0.483(0.021) | 0.035(0.010) |
| Average | 0.528 | 0.758 | 0.499 | 0.642 | 0.664 | 0.516 |
Comparison of CESNA, SIAN, DMoN, SEANAC and KEFRiN algorithms on small-size synthetic networks with categorical features: The average and standard deviation of ARI index over 10 different data sets. The best results are shown in bold-face and the second-best ones are underlined.
| Dataset | CESNA | SIAN | DMoN | SEANAC | KEFRiNe | KEFRiNc | KEFRiNm |
|---|---|---|---|---|---|---|---|
| 0.9, 0.3, 0.9 | | 0.554(0.285) | 0.709(0.101) | | 0.886(0.116) | 0.922(0.119) | 0.895(0.173) |
| 0.9, 0.3, 0.7 | | 0.479(0.289) | 0.380(0.107) | | 0.835(0.138) | 0.819(0.142) | 0.891(0.135) |
| 0.9, 0.6, 0.9 | 0.934(0.075) | 0.320(0.255) | 0.412(0.109) | | | 0.726(0.097) | 0.868(0.202) |
| 0.9, 0.6, 0.7 | | 0.110(0.138) | 0.213(0.051) | 0.750(0.117) | 0.694(0.096) | 0.711(0.145) | |
| 0.7, 0.3, 0.9 | | 0.553(0.157) | 0.566(0.105) | | 0.788(0.117) | 0.877(0.130) | 0.937(0.124) |
| 0.7, 0.3, 0.7 | | 0.508(0.211) | 0.292(0.077) | | 0.836(0.115) | 0.795(0.117) | 0.824(0.191) |
| 0.7, 0.6, 0.9 | 0.506(0.101) | 0.047(0.087) | 0.345(0.064) | | 0.762(0.169) | | 0.379(0.174) |
| 0.7, 0.6, 0.7 | 0.202(0.081) | 0.030(0.040) | 0.115(0.058) | | | 0.540(0.107) | 0.184(0.098) |
Comparison of CESNA, SIAN, DMoN, SEANAC, and KEFRiN algorithms on medium-size synthetic networks with categorical features: the average and standard deviation of the ARI index over 10 different data sets. The best results are shown in bold-face and the second-best ones are underlined.
| Dataset | CESNA | SIAN | DMoN | SEANAC | KEFRiNe | KEFRiNc | KEFRiNm |
|---|---|---|---|---|---|---|---|
| 0.9, 0.3, 0.9 | | 0.000(0.000) | 0.512(0.137) | | 0.508(0.205) | 0.724(0.097) | 0.863(0.089) |
| 0.9, 0.3, 0.7 | | 0.000(0.000) | 0.272(0.073) | | 0.777(0.129) | 0.742(0.182) | 0.762(0.184) |
| 0.9, 0.6, 0.9 | 0.632(0.058) | 0.000(0.000) | 0.370(0.063) | | 0.279(0.204) | 0.652(0.110) | |
| 0.9, 0.6, 0.7 | 0.474(0.089) | 0.000(0.000) | 0.168(0.030) | | 0.766(0.180) | 0.733(0.083) | |
| 0.7, 0.3, 0.9 | 0.764(0.068) | 0.026(0.077) | 0.446(0.099) | | 0.364(0.247) | 0.641(0.111) | |
| 0.7, 0.3, 0.7 | 0.715(0.128) | 0.000(0.000) | 0.228(0.077) | | | 0.797(0.088) | 0.759(0.092) |
| 0.7, 0.6, 0.9 | 0.060(0.024) | 0.000(0.000) | 0.332(0.051) | | 0.426(0.246) | 0.591(0.094) | |
| 0.7, 0.6, 0.7 | 0.016(0.008) | 0.000(0.000) | 0.133(0.016) | | 0.671(0.196) | | 0.695(0.074) |
The selected standardization options for the least-squares community detection methods on the real-world datasets. Symbols R, Z, S, M, N stand for Range standardization, Z-scoring, Scale shift, Modularity and No Pre-processing, respectively.
| Dataset | SEANAC (Y) | SEANAC (P) | KEFRiNe (Y) | KEFRiNe (P) | KEFRiNc (Y) | KEFRiNc (P) | KEFRiNm (Y) | KEFRiNm (P) |
|---|---|---|---|---|---|---|---|---|
| Malaria HVR6 | Z | U | R | R | N | N | Z | M |
| Lawyers | R | S | Z | N | Z | N | Z | M |
| World Trade | R | R | N | N | Z | M | R | M |
| Parliament | Z | M | N | N | Z | N | R | M |
| COSN | Z | N | Z | N | Z | N | R | M |
| Cora | Z | M | N | N | N | N | Z | M |
| SinaNet | Z | M | Z | Z | Z | N | Z | M |
| Amazon Photo | N/A | N/A | Z | S | N | N | Z | M |
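The pre-processing options named in the table can be sketched as follows. The definitions used are common ones and are assumptions here, since this page does not spell them out: range standardization and z-scoring center a feature column on its mean and divide by its range or by its (population) standard deviation, and the modularity transformation subtracts from each link value the product of its row and column totals divided by the grand total. Scale shift is omitted, and all function names are illustrative.

```python
from statistics import mean, pstdev

def range_standardize(column):
    """Option R (assumed definition): (value - mean) / (max - min)."""
    m = mean(column)
    span = max(column) - min(column)
    return [(v - m) / span if span else 0.0 for v in column]

def z_score(column):
    """Option Z (assumed definition): (value - mean) / population std."""
    m = mean(column)
    s = pstdev(column)
    return [(v - m) / s if s else 0.0 for v in column]

def modularity_shift(P):
    """Option M (assumed definition): p_ij - (row_i total * col_j total) / grand total."""
    row_totals = [sum(r) for r in P]
    col_totals = [sum(c) for c in zip(*P)]
    grand_total = sum(row_totals)
    if grand_total == 0:
        return [list(r) for r in P]  # nothing to subtract in an empty network
    return [[P[i][j] - row_totals[i] * col_totals[j] / grand_total
             for j in range(len(col_totals))]
            for i in range(len(row_totals))]

ranged = range_standardize([0, 5, 10])
zscored = z_score([1, 2, 3])
shifted = modularity_shift([[0, 1], [1, 0]])
```

In this sketch the R and Z options would be applied column by column to the feature matrix, and the M option to the whole link matrix.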
Comparison of CESNA, SIAN, DMoN, SEANAC, KEFRiNe, KEFRiNc and KEFRiNm algorithms on real-world data sets; average values of ARI are presented over 10 random initializations. The best results are highlighted in bold-face; the second-best ones are underlined.
| Dataset | CESNA | SIAN | DMoN | SEANAC | KEFRiNe | KEFRiNc | KEFRiNm |
|---|---|---|---|---|---|---|---|
| HVR6 | 0.20(0.00) | 0.39(0.29) | | 0.49(0.11) | 0.34(0.02) | | −0.056(0.004) |
| Lawyers | 0.28(0.00) | | | | 0.43(0.13) | 0.44(0.14) | 0.415(0.085) |
| World Trade | 0.13(0.00) | 0.10(0.01) | 0.13(0.02) | | 0.27(0.17) | | 0.048(0.013) |
| Parliament | 0.25(0.00) | | | 0.28(0.01) | 0.15(0.09) | 0.41(0.05) | −0.035(0.001) |
| COSN | 0.44(0.00) | 0.75(0.00) | | 0.72(0.02) | 0.65(0.18) | | 0.493(0.056) |
| Cora | 0.14(0.00) | 0.17(0.03) | | 0.00(0.00) | 0.00(0.00) | | −0.000(0.000) |
| SinaNet | 0.09(0.00) | 0.17(0.02) | 0.28(0.01) | 0.21(0.03) | | | 0.001(0.000) |
| Amazon Photo | 0.19(0.000) | N/A | | N/A | 0.06(0.01) | | 0.030(0.001) |
The execution time of the methods under consideration on medium-size synthetic networks with categorical features at the nodes. The average over 10 different data sets at the same setting is reported in seconds. The fastest algorithm is shown in bold-face.
| Setting | CESNA | SIAN | DMoN | SEANAC | KEFRiNe | KEFRiNc | KEFRiNm |
|---|---|---|---|---|---|---|---|
| 0.9, 0.3, 0.9 | 38.265 | 856.785 | 124.698 | 492.006 | 2.434 | 2.389 | |
| 0.7, 0.6, 0.7 | 83.961 | 2674.541 | 207.541 | 476.251 | 2.859 | 3.131 | |