Leila M Naeni, Hugh Craig, Regina Berretta, Pablo Moscato.
Abstract
In this study we propose a novel, unsupervised clustering methodology for analyzing large datasets. This new, efficient methodology converts the general clustering problem into a community detection problem on graphs by using the Jensen-Shannon distance, a dissimilarity measure originating in Information Theory. Moreover, we use graph-theoretic concepts for the generation and analysis of proximity graphs. Our methodology is based on a newly proposed memetic algorithm (iMA-Net) for discovering clusters of data elements by maximizing the modularity function in proximity graphs of literary works. To test the effectiveness of this general methodology, we apply it to a text corpus dataset, which contains frequencies of approximately 55,114 unique words across all 168 plays written in the Shakespearean era (16th and 17th centuries), to analyze and detect clusters of similar plays. Experimental results and comparison with state-of-the-art clustering methods demonstrate that our new method identifies high-quality clusters which reflect the commonalities in the literary style of the plays.
Year: 2016 PMID: 27571416 PMCID: PMC5003342 DOI: 10.1371/journal.pone.0157988
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
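The abstract names the Jensen-Shannon distance as the dissimilarity measure between plays. A minimal sketch of that measure on two word-frequency vectors follows, assuming the common definition (square root of the Jensen-Shannon divergence with base-2 logarithms, so values lie in [0, 1]). The function name and the absence of any smoothing are assumptions for illustration; the record does not state the exact variant used.

```python
import math


def jensen_shannon_distance(freq_a, freq_b):
    """Jensen-Shannon distance between two non-negative frequency vectors.

    Both vectors are normalised to probability distributions first.
    Assumed definition: sqrt of the JS divergence, using log base 2.
    """
    total_a, total_b = sum(freq_a), sum(freq_b)
    p = [x / total_a for x in freq_a]
    q = [x / total_b for x in freq_b]
    # Mixture distribution M = (P + Q) / 2
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(u, v):
        # Kullback-Leibler divergence; terms with u_i = 0 contribute nothing.
        return sum(ui * math.log2(ui / vi) for ui, vi in zip(u, v) if ui > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(divergence, 0.0))
```

Two plays with proportional word counts get distance 0, and two plays sharing no words get distance 1, which makes the measure convenient for ranking nearest neighbours.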
Authors and their contributions in the 168 Shakespearean era plays dataset.
| Author | # Plays | Author | # Plays | Author | # Plays |
|---|---|---|---|---|---|
| Shakespeare | 28 | Shirley | 3 | Greville | 1 |
| Middleton | 18 | Webster | 3 | Davenant | 1 |
| Jonson | 17 | Rowley | 2 | Brome | 1 |
| Fletcher | 15 | Massinger | 2 | Day | 1 |
| Chapman | 13 | Haughton | 2 | Porter | 1 |
| Lyly | 8 | Kyd | 2 | Chettle | 1 |
| Ford | 7 | Marston | 2 | Edwards | 1 |
| Peele | 5 | Wilmot | 1 | Suckling | 1 |
| Dekker | 5 | Carey | 1 | Sidney | 1 |
| Marlowe | 5 | Daniel | 1 | Marmion | 1 |
| Heywood | 5 | Brandon | 1 | Beaumont | 1 |
| Greene | 4 | Lodge | 1 | Tourneur | 1 |
| Wilson | 3 | Goffe | 1 | Nashe | 1 |
Fig 1. Flowchart of stages in the clustering methodology proposed in this study.
Basic information on the kNN graphs constructed from the complete dataset.
All graphs have 168 nodes, one for each of the 168 plays in the dataset.
| kNN Graphs | # Edges | Average Degree |
|---|---|---|
| k = 1 | 145 | 1.7 |
| k = 2 | 284 | 3.4 |
| k = 3 | 441 | 5.3 |
| k = 4 | 551 | 6.6 |
| k = 5 | 687 | 8.2 |
| k = 6 | 822 | 9.8 |
| k = 7 | 957 | 11.4 |
| k = 8 | 1,091 | 13.0 |
| k = 9 | 1,225 | 14.6 |
| k = 10 | 1,362 | 16.2 |
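The edge counts in this table are all below k × 168, which is what one would expect if two plays are linked whenever either one lists the other among its k nearest neighbours and each pair is stored once. The sketch below builds such an undirected kNN graph from a distance matrix and computes the average degree as 2E/N. The function names and the symmetrisation rule are assumptions for illustration; the record does not spell out the exact construction.

```python
def knn_graph(dist, k):
    """Undirected kNN graph as a set of (i, j) pairs with i < j.

    dist is a square matrix (list of lists) of pairwise distances.
    Assumed rule: i and j are linked if j is among the k nearest
    neighbours of i, or i is among the k nearest neighbours of j.
    """
    n = len(dist)
    edges = set()
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))  # mutual pairs collapse to one edge
    return edges


def average_degree(num_edges, num_nodes):
    """Average degree of an undirected graph: every edge has two endpoints."""
    return 2 * num_edges / num_nodes


# Toy example: four points on a line at positions 0, 1, 3, 4.
positions = [0, 1, 3, 4]
toy_dist = [[abs(a - b) for b in positions] for a in positions]
toy_edges = knn_graph(toy_dist, k=1)
```

Applied to the table, `average_degree(145, 168)` is about 1.73 and `average_degree(1362, 168)` is about 16.21, in line with the listed 1.7 and 16.2.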
Fig 2. An illustration of the string-coding representation for a clustering solution.
Left: a simple toy graph with two clusters shown in different colors. Right: an integer list encoding the solution, where the red and blue clusters are labeled 1 and 2, respectively.
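The string coding in this caption can be read as a list whose i-th entry is the cluster label of node i. A small sketch of decoding such a list into clusters follows. The six-node label list is hypothetical, since the figure's actual toy graph is not reproduced here, and `decode` is an illustrative name.

```python
from collections import defaultdict


def decode(labels):
    """Turn an integer label list into {cluster label: [node indices]}."""
    clusters = defaultdict(list)
    for node, label in enumerate(labels):
        clusters[label].append(node)
    return dict(clusters)


# Hypothetical encoding: nodes 0-2 in cluster 1, nodes 3-5 in cluster 2.
solution = [1, 1, 1, 2, 2, 2]
clusters = decode(solution)
```

One appeal of this representation for an evolutionary search is that moving a node to another cluster changes a single list entry.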
Basic information of the real-world networks.
| # | Network | # Nodes | # Edges | Average Degree |
|---|---|---|---|---|
| 1 | karate | 34 | 78 | 4.59 |
| 2 | dolphin | 62 | 159 | 5.13 |
| 3 | polbooks | 105 | 441 | 8.40 |
| 4 | football | 115 | 613 | 10.66 |
| 5 | jazz | 198 | 2742 | 27.70 |
Experimental results on five real-world benchmark networks.
The maximum, average and standard deviation of modularity values (Q_max, Q_avg, Q_std) obtained by LPAm, Meme-Net, Moga-Net, MODPSO, MLCD, MA-Net and iMA-Net. Each network occupies three rows, giving Q_max, Q_avg and Q_std in that order.
| Criterion | LPAm | Meme-Net | Moga-Net | MODPSO | MLCD | MA-Net | iMA-Net | |
|---|---|---|---|---|---|---|---|---|
| karate | 0.4052 | 0.4020 | 0.4159 | |||||
| 0.3564 | 0.4020 | 0.3945 | 0.4182 | 0.4195 | ||||
| 0.0285 | 0 | 0.0089 | 0.0079 | 0 | 0.0022 | 0 | ||
| dolphin | 0.5071 | 0.5185 | 0.5034 | 0.5264 | ||||
| 0.4938 | 0.5096 | 0.4584 | 0.5255 | 0.5247 | 0.5252 | |||
| 0.0114 | 0.0061 | 0.0163 | 0.0070 | 0 | 0.0032 | 0.0026 | ||
| polbooks | 0.5145 | 0.5232 | 0.4993 | 0.5264 | ||||
| 0.4976 | 0.5218 | 0.4618 | 0.5263 | 0.5255 | 0.5270 | |||
| 0.0158 | 0.0031 | 0.0129 | 0.0007 | 0 | 0.0029 | 0.0004 | ||
| football | 0.6032 | 0.6044 | 0.4325 | |||||
| 0.5777 | 0.6023 | 0.3906 | 0.6038 | 0.5984 | 0.6042 | |||
| 0.0199 | 0.0015 | 0.0179 | 0.0011 | 0.0000 | 0.0051 | 0.0006 | ||
| jazz | 0.4448 | 0.4376 | 0.2929 | 0.4421 | ||||
| 0.4360 | 0.4330 | 0.2952 | 0.4419 | 0. | 0.4448 | 0.4450 | ||
| 0.0098 | 0.0011 | 0.0084 | 0.0001 | 0.0000 | 0.0002 | 0.0001 |
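The modularity values compared in this table can be computed for any clustering of a graph. The sketch below assumes the standard Newman-Girvan modularity for an unweighted, undirected graph, Q = Σ_c [ l_c / m − (d_c / 2m)² ], where m is the number of edges, l_c the number of edges inside cluster c, and d_c the total degree of its nodes. Whether the benchmarked methods use exactly this unweighted form is an assumption, and the toy graph is illustrative.

```python
def modularity(edges, labels):
    """Newman-Girvan modularity of a clustering.

    edges  : list of (u, v) pairs of an undirected, unweighted graph
    labels : labels[u] is the cluster of node u
    """
    m = len(edges)
    internal = {}  # l_c: edges with both endpoints in cluster c
    degree = {}    # d_c: summed degree of the nodes in cluster c
    for u, v in edges:
        cu, cv = labels[u], labels[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree.items())


# Toy graph: two triangles joined by a single bridge edge (2, 3).
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q_split = modularity(toy_edges, [1, 1, 1, 2, 2, 2])  # one cluster per triangle
q_merged = modularity(toy_edges, [1, 1, 1, 1, 1, 1])  # everything in one cluster
```

Splitting the toy graph into its two triangles gives Q = 5/14, while a single all-inclusive cluster gives Q = 0, which is why a modularity-maximising search prefers the split.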
The best solution found by iMA-Net in the ten kNN graphs derived from the dataset of 168 Shakespearean era plays.
| kNN graph | Q | # Clusters | NMI | ARI | NMI×ARI |
|---|---|---|---|---|---|
| k = 1 | 0.898 | 24 | 0.686 | 0.317 | 0.217 |
| k = 2 | 0.751 | 18 | | | |
| k = 3 | 0.713 | 10 | 0.690 | 0.457 | 0.315 |
| k = 4 | 0.670 | 9 | 0.673 | 0.441 | 0.296 |
| k = 5 | 0.623 | 9 | 0.663 | 0.431 | 0.286 |
| k = 6 | 0.579 | 9 | 0.670 | 0.429 | 0.288 |
| k = 7 | 0.542 | 9 | 0.672 | 0.434 | 0.292 |
| k = 8 | 0.509 | 8 | 0.639 | 0.396 | 0.253 |
| k = 9 | 0.482 | 8 | 0.639 | 0.401 | 0.256 |
| k = 10 | 0.455 | 8 | 0.623 | 0.392 | 0.244 |
Q is the modularity value used as a fitness value of the clustering. NMI, ARI and NMI×ARI are quality measures used to compare with the true solution of the dataset based on the authorship of plays.
Fig 3. Best clustering outcome of iMA-Net with the highest NMI and ARI.
Eighteen clusters, shown in different colours, are detected in the 2-nearest neighbor graph. Node size is proportional to the node's total degree. Nodes are labeled as play.author.
Configuration of reduced datasets.
| Reduced dataset | # Plays | # Authors |
|---|---|---|
| G1 | 149 | 20 |
| G2 | 139 | 15 |
| G3 | 130 | 12 |
| G4 | 126 | 11 |
| G5 | 106 | 7 |
Best solutions found by iMA-Net in 10 kNN graphs derived from each reduced dataset (G1-G5).
The highest values of NMI, ARI and NMI×ARI in each dataset are denoted in bold.
| Reduced dataset | kNN graph | Q | # Clusters | NMI | ARI | NMI×ARI |
|---|---|---|---|---|---|---|
| G1 | k = 1 | 0.907 | 23 | 0.696 | 0.371 | 0.258 |
| G1 | k = 2 | 0.753 | 16 | | 0.505 | 0.373 |
| G1 | k = 3 | 0.713 | 10 | 0.659 | 0.451 | 0.297 |
| G1 | k = 4 | 0.672 | 9 | 0.701 | 0.505 | 0.354 |
| G1 | k = 5 | 0.625 | 10 | 0.711 | | |
| G1 | k = 6 | 0.583 | 9 | 0.669 | 0.457 | 0.306 |
| G1 | k = 7 | 0.541 | 9 | 0.676 | 0.463 | 0.313 |
| G1 | k = 8 | 0.511 | 8 | 0.628 | 0.415 | 0.261 |
| G1 | k = 9 | 0.476 | 7 | 0.642 | 0.438 | 0.281 |
| G1 | k = 10 | 0.447 | 8 | 0.629 | 0.424 | 0.267 |
| G2 | k = 1 | 0.903 | 22 | 0.707 | 0.393 | 0.278 |
| G2 | k = 2 | 0.765 | 16 | | 0.499 | 0.370 |
| G2 | k = 3 | 0.722 | 11 | 0.717 | 0.522 | 0.374 |
| G2 | k = 4 | 0.680 | 9 | 0.702 | 0.520 | 0.365 |
| G2 | k = 5 | 0.630 | 10 | 0.717 | | |
| G2 | k = 6 | 0.588 | 9 | 0.683 | 0.495 | 0.338 |
| G2 | k = 7 | 0.552 | 9 | 0.672 | 0.492 | 0.331 |
| G2 | k = 8 | 0.506 | 8 | 0.649 | 0.454 | 0.295 |
| G2 | k = 9 | 0.476 | 8 | 0.683 | 0.501 | 0.342 |
| G2 | k = 10 | 0.453 | 8 | 0.672 | 0.464 | 0.312 |
| G3 | k = 1 | 0.892 | 18 | 0.660 | 0.377 | 0.249 |
| G3 | k = 2 | 0.767 | 15 | 0.773 | 0.542 | 0.419 |
| G3 | k = 3 | 0.729 | 10 | 0.711 | 0.552 | 0.392 |
| G3 | k = 4 | 0.682 | 11 | 0.740 | 0.560 | 0.414 |
| G3 | k = 5 | 0.636 | 10 | | | |
| G3 | k = 6 | 0.599 | 9 | 0.694 | 0.542 | 0.376 |
| G3 | k = 7 | 0.561 | 9 | 0.687 | 0.526 | 0.361 |
| G3 | k = 8 | 0.520 | 9 | 0.678 | 0.509 | 0.345 |
| G3 | k = 9 | 0.488 | 8 | 0.690 | 0.536 | 0.370 |
| G3 | k = 10 | 0.460 | 7 | 0.632 | 0.448 | 0.283 |
| G4 | k = 1 | 0.893 | 19 | 0.646 | 0.343 | 0.222 |
| G4 | k = 2 | 0.766 | 15 | | 0.554 | 0.431 |
| G4 | k = 3 | 0.762 | 16 | 0.772 | 0.552 | 0.426 |
| G4 | k = 4 | 0.681 | 11 | 0.731 | 0.554 | 0.405 |
| G4 | k = 5 | 0.639 | 10 | 0.740 | | |
| G4 | k = 6 | 0.604 | 9 | 0.708 | 0.595 | 0.421 |
| G4 | k = 7 | 0.563 | 9 | 0.689 | 0.536 | 0.369 |
| G4 | k = 8 | 0.523 | 9 | 0.680 | 0.522 | 0.355 |
| G4 | k = 9 | 0.490 | 7 | 0.631 | 0.431 | 0.272 |
| G4 | k = 10 | 0.464 | 8 | 0.663 | 0.498 | 0.330 |
| G5 | k = 1 | 0.895 | 17 | 0.733 | 0.451 | 0.331 |
| G5 | k = 2 | 0.781 | 11 | 0.826 | 0.618 | 0.511 |
| G5 | k = 3 | 0.751 | 9 | 0.758 | 0.587 | 0.445 |
| G5 | k = 4 | 0.701 | 8 | 0.843 | 0.730 | 0.615 |
| G5 | k = 5 | 0.656 | 7 | | 0.810 | |
| G5 | k = 6 | 0.627 | 7 | 0.847 | | 0.690 |
| G5 | k = 7 | 0.584 | 7 | 0.847 | | 0.690 |
| G5 | k = 8 | 0.544 | 7 | 0.837 | 0.787 | 0.659 |
| G5 | k = 9 | 0.514 | 6 | 0.799 | 0.690 | 0.551 |
| G5 | k = 10 | 0.482 | 6 | 0.808 | 0.706 | 0.570 |
Confusion matrix of the true authorship and the clustering solution obtained by the 5NN graph with NMI × ARI = 0.693.
The confusion matrix shows how 106 plays by 7 authors are distributed into 7 clusters. As expected, a good separation occurred in clusters 2, 4, 6 and 7, which are formed by plays of one specific author.
| Author | # Plays | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | Cluster 6 | Cluster 7 |
|---|---|---|---|---|---|---|---|---|
| Shakespeare | 28 | 28 | 0 | 0 | 0 | 0 | 0 | 0 |
| Middleton | 18 | 0 | 18 | 0 | 0 | 0 | 0 | 0 |
| Jonson | 17 | 0 | 0 | 15 | 0 | 2 | 0 | 0 |
| Fletcher | 15 | 1 | 0 | 0 | 14 | 0 | 0 | 0 |
| Chapman | 13 | 1 | 0 | 7 | 0 | 5 | 0 | 0 |
| Lyly | 8 | 1 | 0 | 0 | 0 | 0 | 7 | 0 |
| Ford | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 7 |
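The NMI and ARI values quoted for this solution can be recomputed directly from the confusion matrix. The sketch below uses the usual pair-counting formula for ARI and, as an assumption, normalises mutual information by the arithmetic mean of the two entropies, since the record does not state which NMI normalisation was used. The function and variable names are illustrative.

```python
import math

# Rows: the seven authors in table order; columns: clusters 1 to 7.
CONFUSION = [
    [28, 0, 0, 0, 0, 0, 0],
    [0, 18, 0, 0, 0, 0, 0],
    [0, 0, 15, 0, 2, 0, 0],
    [1, 0, 0, 14, 0, 0, 0],
    [1, 0, 7, 0, 5, 0, 0],
    [1, 0, 0, 0, 0, 7, 0],
    [0, 0, 0, 0, 0, 0, 7],
]


def comb2(x):
    """Number of unordered pairs among x items."""
    return x * (x - 1) / 2


def ari_from_table(table):
    """Adjusted Rand index from a contingency table."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    index = sum(comb2(v) for row in table for v in row)
    sum_rows = sum(comb2(a) for a in row_sums)
    sum_cols = sum(comb2(b) for b in col_sums)
    expected = sum_rows * sum_cols / comb2(n)
    maximum = (sum_rows + sum_cols) / 2
    return (index - expected) / (maximum - expected)


def nmi_from_table(table):
    """NMI with arithmetic-mean normalisation (an assumed choice)."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)

    def entropy(counts):
        return -sum(c / n * math.log(c / n) for c in counts if c > 0)

    mutual_info = sum(
        v / n * math.log(n * v / (a * b))
        for row, a in zip(table, row_sums)
        for v, b in zip(row, col_sums)
        if v > 0
    )
    return 2 * mutual_info / (entropy(row_sums) + entropy(col_sums))


ari = ari_from_table(CONFUSION)
nmi = nmi_from_table(CONFUSION)
product = nmi * ari
```

Under these assumptions the three quantities agree with the reported NMI × ARI = 0.693 to about three decimals, so the matrix and the headline score are mutually consistent.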
Fig 4. The best clustering outcome of iMA-Net with the highest NMI × ARI in G5.
Seven clusters are shown in different colours in the 5-nearest neighbour graph constructed from the similarity between 106 plays by 7 authors. Node sizes are proportional to their degree.
The best clustering solutions obtained by the benchmark methods (k-means++, MST-kNN, Complete-linkage, Average-linkage, Single-linkage and Ward’s method), together with the best (Best) and the worst (Worst) clustering results obtained by the proposed method, on the complete dataset and the five reduced datasets (G1-G5).
The Rank column is based on the value of NMI × ARI.
| Dataset | Method | # Clusters | NMI | ARI | NMI × ARI | Rank |
|---|---|---|---|---|---|---|
| Complete | Best | 18 | 0.742 | 0.525 | 0.390 | 2 |
| Complete | Worst | 24 | 0.686 | 0.317 | 0.217 | 3 |
| Complete | | 13 | 0.617 | 0.339 | 0.209 | 4 |
| Complete | | 2 | 0.292 | 0.026 | 0.008 | 8 |
| Complete | | 20 | 0.623 | 0.324 | 0.202 | 5 |
| Complete | | 20 | 0.440 | 0.052 | 0.023 | 6 |
| Complete | | 20 | 0.384 | 0.033 | 0.013 | 7 |
| Complete | | 20 | 0.772 | 0.568 | 0.438 | 1 |
| G1 | Best | 10 | 0.711 | 0.542 | 0.386 | 2 |
| G1 | Worst | 8 | 0.628 | 0.415 | 0.261 | 3 |
| G1 | | 18 | 0.638 | 0.354 | 0.226 | 4 |
| G1 | | 3 | 0.435 | 0.090 | 0.039 | 6 |
| G1 | | 20 | 0.591 | 0.181 | 0.107 | 5 |
| G1 | | 20 | 0.434 | 0.065 | 0.028 | 7 |
| G1 | | 19 | 0.388 | 0.052 | 0.020 | 8 |
| G1 | | 19 | 0.775 | 0.570 | 0.442 | 1 |
| G2 | Best | 10 | 0.717 | 0.564 | 0.404 | 1 |
| G2 | Worst | 8 | 0.649 | 0.454 | 0.295 | 3 |
| G2 | | 16 | 0.628 | 0.395 | 0.248 | 4 |
| G2 | | 3 | 0.469 | 0.132 | 0.062 | 6 |
| G2 | | 19 | 0.562 | 0.191 | 0.107 | 5 |
| G2 | | 20 | 0.467 | 0.098 | 0.046 | 7 |
| G2 | | 20 | 0.392 | 0.064 | 0.025 | 8 |
| G2 | | 18 | 0.765 | 0.507 | 0.387 | 2 |
| G3 | Best | 10 | 0.744 | 0.617 | 0.459 | 1 |
| G3 | Worst | 18 | 0.660 | 0.377 | 0.249 | 4 |
| G3 | | 12 | 0.626 | 0.448 | 0.280 | 3 |
| G3 | | 3 | 0.476 | 0.147 | 0.070 | 6 |
| G3 | | 19 | 0.574 | 0.224 | 0.129 | 5 |
| G3 | | 20 | 0.491 | 0.125 | 0.061 | 7 |
| G3 | | 20 | 0.412 | 0.081 | 0.033 | 8 |
| G3 | | 14 | 0.740 | 0.535 | 0.396 | 2 |
| G4 | Best | 10 | 0.740 | 0.628 | 0.465 | 2 |
| G4 | Worst | 7 | 0.631 | 0.431 | 0.222 | 4 |
| G4 | | 17 | 0.640 | 0.389 | 0.249 | 3 |
| G4 | | 3 | 0.486 | 0.162 | 0.079 | 6 |
| G4 | | 20 | 0.612 | 0.264 | 0.162 | 5 |
| G4 | | 18 | 0.400 | 0.080 | 0.032 | 8 |
| G4 | | 20 | 0.517 | 0.142 | 0.074 | 7 |
| G4 | | 13 | 0.752 | 0.619 | 0.465 | 1 |
| G5 | Best | 7 | 0.856 | 0.810 | 0.693 | 1 |
| G5 | Worst | 17 | 0.733 | 0.451 | 0.331 | 3 |
| G5 | | 10 | 0.600 | 0.422 | 0.253 | 5 |
| G5 | | 3 | 0.540 | 0.237 | 0.128 | 6 |
| G5 | | 15 | 0.665 | 0.465 | 0.310 | 4 |
| G5 | | 10 | 0.449 | 0.124 | 0.056 | 7 |
| G5 | | 16 | 0.362 | 0.075 | 0.027 | 8 |
| G5 | | 9 | 0.830 | 0.797 | 0.661 | 2 |