Jianjun Cheng, Mingwei Leng, Longjie Li, Hanhai Zhou, Xiaoyun Chen.
Abstract
Community structure detection is of great importance because it can help in discovering the relationship between the function and the topology structure of a network. Many community detection algorithms have been proposed, but how to incorporate prior knowledge into the detection process remains a challenging problem. In this paper, we propose a semi-supervised community detection algorithm that makes full use of must-link and cannot-link constraints to guide the process of community detection and thereby extracts high-quality community structures from networks. To acquire high-quality must-link and cannot-link constraints, we also propose a semi-supervised component generation algorithm based on active learning, which step by step selects the nodes with maximum utility for the proposed semi-supervised community detection algorithm, and then generates the must-link and cannot-link constraints by querying a noiseless oracle. Extensive experiments were carried out, and the results show that introducing active learning into community detection is successful: the proposed method extracts high-quality community structures from networks and significantly outperforms the comparison methods.
Year: 2014 PMID: 25329660 PMCID: PMC4201489 DOI: 10.1371/journal.pone.0110088
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Algorithm 1: Semi-supervised community detection algorithm based on must-link and cannot-link constraints.
| 1: Augment the |
| 2: Initialize set |
| 3: Take every node involved in each |
| 4: If some nodes contained in different communities |
| 5: For each community |
| 6: Among all communities and unclassified nodes, find the most similar pair |
| and then insert the nodes that have |
| 7: Repeat step 6, until all nodes in the network are processed, i.e., until |
| 8: |
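The step text of Algorithm 1 is truncated in this record, so the following is only a minimal sketch of how must-link and cannot-link constraints are commonly handled, not the authors' procedure. It assumes that must-link constraints are transitive (so constrained nodes can be grouped with a union-find structure) and that two groups may not be merged if a cannot-link pair spans them. The function names `must_link_components` and `violates_cannot_link` are hypothetical.

```python
from collections import defaultdict


def must_link_components(nodes, must_links):
    """Group nodes joined by must-link constraints (assumed transitive)."""
    parent = {v: v for v in nodes}

    def find(v):
        # Path-halving find: shortcut pointers while walking to the root.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in must_links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for v in nodes:
        groups[find(v)].add(v)
    # Keep only groups that actually carry a must-link constraint.
    return [g for g in groups.values() if len(g) > 1]


def violates_cannot_link(group_a, group_b, cannot_links):
    """Return True if merging the two groups would join a cannot-link pair."""
    return any(
        (a in group_a and b in group_b) or (a in group_b and b in group_a)
        for a, b in cannot_links
    )


# Toy data (hypothetical): six nodes, three must-links, one cannot-link.
nodes = range(6)
must_links = [(0, 1), (1, 2), (3, 4)]
cannot_links = [(2, 3)]

seeds = must_link_components(nodes, must_links)
blocked = violates_cannot_link(seeds[0], seeds[1], cannot_links)
```

Here the two seed groups are `{0, 1, 2}` and `{3, 4}`, and `blocked` is `True` because the cannot-link pair `(2, 3)` spans them, so a merge step would have to skip this pair of groups.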
Figure 1. A simple two-community network.
If the nodes are selected according to their degree values, only node will be selected, and community will be ignored. However, using the score value in conjunction with degree value of every node in the network as the condition, we will select node (or ) from the network at least, which means that the selected nodes can cover all of the ground truth communities. (The different node shapes and shades indicate different communities, the black lines are the edges within communities, and the light-gray connections represent the edges across different communities. This illustration style is also applied in the following figures.)
Algorithm 2: Active approach to generate the must-link and cannot-link constraints.
| 1: For each node |
| 2: Extract the nodes whose |
| 3: |
| 4: Select the node with the maximal degree in each cluster |
| if more than one node has the same maximal degree, the node with the maximal |
| 5: Initialize the sets of |
| 6: For any two representatives |
| if |
| if |
| 7: Check each cluster |
| • if |
| • if |
| then query the relationships between |
| • if the query result is “ |
| • if the result is “ |
| • for all |
| 8: Repeat step 7, until all nodes having the same maximal degrees in every cluster are processed, or until certain user-specified termination criteria are met |
| 9: If more |
| • From the remaining nodes in |
| • if |
| • if |
| • if |
| then query the relationships between |
| • if |
| • if |
| 10: Repeat step 9, until certain user-specified criteria are met |
| 11: |
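The core idea visible in steps 4 to 6 (pick a maximal-degree representative per cluster, then ask the oracle about pairs of representatives) can be sketched as follows. This is an illustration under stated assumptions, not the published algorithm: the tie-breaking criterion is lost in this record, so ties are broken by the smallest node id, and the noiseless oracle is modelled as a hypothetical function `oracle(u, v)` that returns `True` when both nodes belong to the same ground-truth community.

```python
from itertools import combinations


def pick_representatives(clusters, degree):
    """Pick the highest-degree node of each cluster.

    Ties are broken by the smallest node id (an assumption; the
    tie-breaking rule is truncated in the source).
    """
    return [min(cluster, key=lambda v: (-degree[v], v)) for cluster in clusters]


def query_constraints(representatives, oracle):
    """Ask the oracle about every pair of representatives."""
    must_link, cannot_link = set(), set()
    for u, v in combinations(representatives, 2):
        if oracle(u, v):
            must_link.add((u, v))    # same community
        else:
            cannot_link.add((u, v))  # different communities
    return must_link, cannot_link


# Toy data (hypothetical): ground truth, degrees, and three initial clusters.
truth = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "A"}
degree = {0: 3, 1: 2, 2: 2, 3: 2, 4: 1, 5: 1}
clusters = [[0, 1, 2], [3, 4], [5]]


def oracle(u, v):
    # Noiseless oracle: always answers according to the ground truth.
    return truth[u] == truth[v]


reps = pick_representatives(clusters, degree)
must_link, cannot_link = query_constraints(reps, oracle)
```

With this toy input the representatives are `[0, 3, 5]`, the oracle yields the must-link pair `(0, 5)`, and the cannot-link pairs are `(0, 3)` and `(3, 5)`. Each oracle call corresponds to one query, so the number of representatives bounds the labelling cost quadratically.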
Algorithm 3: .
| 1: For each node pair |
| 2: For each node |
| 3: Construct the initial clusters |
| 4: Select the node with the largest degree from the unprocessed nodes in |
| 5: Merge the cluster containing the |
| 6: Take each node in |
| 7: Repeat steps 4–6 until all of the nodes in |
| 8: |
Algorithm 4: Similarity computation algorithm based on random walk.
| 1: Initialize the similarity matrix: |
| 2: Take any node in the network as the |
| 3: Increase the similarity value between every pair of nodes recorded in set |
| 4: Take any other node as the |
| 5: |
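The surviving step text suggests a co-occurrence scheme: start a walk from a node, record the nodes it visits, and raise the similarity of every pair of recorded nodes. A minimal sketch of that idea follows. The walk length, the number of walks per start node, the unit increment, and the name `random_walk_similarity` are all assumptions made for illustration, since the corresponding details are missing from this record.

```python
import random
from itertools import combinations


def random_walk_similarity(adj, walk_length=4, walks_per_node=20, seed=0):
    """Count how often two nodes are visited by the same short random walk.

    adj maps each node to a list of its neighbours.
    """
    rng = random.Random(seed)
    nodes = sorted(adj)
    # Step 1: initialise the similarity matrix to zero.
    sim = {u: {v: 0 for v in nodes} for u in nodes}

    # Steps 2 and 4: use every node in turn as the start of the walk.
    for start in nodes:
        for _ in range(walks_per_node):
            visited = {start}
            current = start
            for _ in range(walk_length):
                if not adj[current]:
                    break  # isolated node: the walk cannot continue
                current = rng.choice(adj[current])
                visited.add(current)
            # Step 3: raise the similarity of every pair seen on this walk.
            for u, v in combinations(sorted(visited), 2):
                sim[u][v] += 1
                sim[v][u] += 1
    return sim


# Toy graph (hypothetical): two triangles joined by the bridge 2-3.
adj = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
sim = random_walk_similarity(adj)
```

Because short walks rarely cross the bridge, nodes inside the same triangle accumulate far more co-visits than nodes in different triangles, which is the property a similarity-driven merge step relies on.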
Statistical information of the networks.
| network | nodes | edges | communities |
| karate | 34 | 78 | 2 |
| dolphin | 62 | 159 | 4 |
| Risk map | 42 | 83 | 6 |
| collaboration | 118 | 197 | 6 |
Figure 2. Zachary's karate club network.
(a) The ground truth community structure; (b) The community structure extracted by the proposed algorithm; (c) The community structure extracted by FastQ; (d) The community structure aggregated from 30 community structures extracted by LPA; (e) The community structure detected by Infohiermap; (f) The community structure identified by PPC.
Comparisons of the 3 metrics: A rank (number in parentheses) is attached to the value of each metric for each network, and the value with the highest rank for each metric on each network is shown in bold.
| network | algorithm | | | | rank score | rank |
| karate | ground truth | 0.371 | 1.00 | 1.00 | | |
| | FastQ | 0.381 (4) | 0.735 (4) | 0.692 (4) | 4.00 | 5 |
| | LPA | 0.399 (3) (max: 0.416 average: 0.397) | 0.853 (2) | 0.826 (2) | 2.33 | |
| | Infohiermap | 0.402 (2) | 0.824 (3) | 0.699 (3) | 2.67 | 3 |
| | PPC | | 0.676 (5) | 0.687 (5) | 3.67 | 4 |
| | proposal | 0.371 (5) | | | 2.33 | |
| dolphin | ground truth | 0.519 | 1.00 | 1.00 | | |
| | FastQ | 0.491 (5) | 0.839 (4) | 0.733 (5) | 4.67 | 5 |
| | LPA | 0.503 (4) (max: 0.526 average: 0.506) | 0.823 (5) | 0.837 (3) | 4.00 | 4 |
| | Infohiermap | 0.525 (2) | 0.887 (2) | | 1.67 | 2 |
| | PPC | 0.519 (3) | 0.871 (3) | 0.812 (4) | 3.33 | 3 |
| | proposal | | | 0.85 (2) | 1.33 | |
| Risk map | ground truth | 0.621 | 1.00 | 1.00 | | |
| | FastQ | 0.625 (2) | 0.929 (2) | 0.894 (3) | 2.33 | 3 |
| | LPA | 0.624 (3) (max: 0.634 average: 0.619) | 0.81 (4) | 0.848 (4) | 3.67 | 4 |
| | Infohiermap | | 0.857 (3) | 0.945 (2) | 2.00 | |
| | PPC | 0.621 (4) | 0.81 (4) | 0.803 (5) | 4.33 | 5 |
| | proposal | 0.621 (4) | | | 2.00 | |
| collaboration | ground truth | 0.739 | 1.00 | 1.00 | | |
| | FastQ | 0.749 (2) | 0.831 (3) | 0.867 (3) | 2.67 | 3 |
| | LPA | 0.681 (5) (max: 0.726 average: 0.678) | 0.627 (5) | 0.799 (5) | 5.00 | 4 |
| | Infohiermap1 | 0.651 (6) | 0.636 (4) | 0.764 (6) | 5.33 | 6 |
| | Infohiermap2 | 0.704 (4) | 0.602 (6) | 0.805 (4) | 4.67 | 5 |
| | PPC | | 0.847 (2) | 0.876 (2) | 1.67 | |
| | proposal | 0.72 (3) | | | 1.67 | |
Infohiermap1 and Infohiermap2 represent the first-level and the second-level community structures extracted by the Infohiermap algorithm, respectively.
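The rank score column is not defined in this record, but the complete rows are consistent with it being the arithmetic mean of the three per-metric ranks, rounded to two decimals. The short check below reproduces three rows under that assumption; the helper name `rank_score` is hypothetical.

```python
def rank_score(ranks):
    """Mean of the per-metric ranks, rounded to two decimals (assumed definition)."""
    return round(sum(ranks) / len(ranks), 2)


# Per-metric ranks copied from rows of the comparison table.
rows = {
    "karate / LPA": (3, 2, 2),          # table value: 2.33
    "karate / Infohiermap": (2, 3, 3),  # table value: 2.67
    "dolphin / PPC": (3, 3, 4),         # table value: 3.33
}

for name, ranks in rows.items():
    print(f"{name}: {rank_score(ranks):.2f}")
```

A lower rank score therefore means an algorithm ranked closer to first across the three metrics on that network.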
Figure 3. Lusseau's bottlenose dolphin social network.
(a) The ground truth community structure; (b) The community structure extracted by the proposed algorithm; (c) The community structure identified by FastQ; (d) The community structure aggregated from 30 outputs of LPA; (e) The community structure detected by Infohiermap; (f) The community structure identified by PPC.
Figure 4. Risk map network.
(a) The ground truth community structure; (b) The community structure identified by the proposed algorithm; (c) The community structure extracted by FastQ; (d) The community structure aggregated from 30 outputs of LPA; (e) The community structure detected by Infohiermap; (f) The community structure extracted by PPC.
Figure 5. Collaboration network of scientists at the Santa Fe Institute.
(a) The ground truth community structure; (b) The community structure detected by the proposed algorithm; (c) The community structure obtained by FastQ; (d) The community structure aggregated from 30 results of LPA; (e) The first-level community structure extracted by Infohiermap; (f) The second-level community structure extracted by Infohiermap; (g) The community structure identified by PPC.
Figure 6. The evolution of the three metrics on the dolphin social network.
Figure 7. The evolution of the three metrics on the scientist collaboration network.