
FctClus: A Fast Clustering Algorithm for Heterogeneous Information Networks.

Jing Yang¹, Limin Chen², Jianpei Zhang¹.

Abstract

It is important to cluster heterogeneous information networks. A fast clustering algorithm based on an approximate commute time embedding for heterogeneous information networks with a star network schema is proposed in this paper by utilizing the sparsity of heterogeneous information networks. First, a heterogeneous information network is transformed into multiple compatible bipartite graphs from the compatible point of view. Second, the approximate commute time embedding of each bipartite graph is computed using random mapping and a linear time solver. All of the indicator subsets in each embedding simultaneously determine the target dataset. Finally, a general model is formulated by these indicator subsets, and a fast algorithm is derived by simultaneously clustering all of the indicator subsets using the sum of the weighted distances for all indicators for an identical target object. The proposed fast algorithm, FctClus, is shown to be efficient and generalizable and exhibits high clustering accuracy and fast computation speed based on a theoretic analysis and experimental verification.


Year: 2015; PMID: 26090857; PMCID: PMC4474961; DOI: 10.1371/journal.pone.0130086

Source DB: PubMed; Journal: PLoS One; ISSN: 1932-6203; Impact factor: 3.240

Introduction

Information networks are ubiquitous and include social information networks and DBLP bibliographic networks. Numerous studies have addressed homogeneous information networks, which consist of a single type of data object; however, little research has been performed on the clustering of heterogeneous information networks, which consist of multiple types of data objects. Clustering a heterogeneous network may lead to a better understanding of the hidden structures and deeper meanings of the network[1]. The star network schema is popular and important in the field of heterogeneous information networks. It includes one target type of data object and multiple attribute types of data objects, and each relation links a target data object to the attribute data objects connected to it. Algorithms based on compatible bipartite graphs can effectively consider multiple types of relational data. Various classical clustering algorithms, such as algorithms based on semi-definite programming[2,3], algorithms based on information theory[4] and spectral clustering algorithms for multi-type relational data[5], have been proposed for heterogeneous data from the compatible point of view. These algorithms are generalizable, but their computational complexity is too high for clustering heterogeneous information networks. Sun et al. present an algorithm, NetClus[6], and a PathSim-based clustering algorithm[7] for clustering heterogeneous information networks. NetClus is effective for DBLP bibliographic networks, but it is not a general model for clustering other heterogeneous information networks, and it is not sufficiently stable. The concept behind NetClus is also used for clustering service webs[8,9]. The PathSim-based clustering algorithm requires user guidance, and the clustering quality reflects the requirements of users rather than the requirements of the network.
ComClus[10] is a derivation algorithm of NetClus for use with hybrid networks that simultaneously include heterogeneous and homogeneous relations. NetClus and ComClus are not general and depend on the given application. Dynamic link inference in heterogeneous networks[11] requires more accurate initial clustering. A high clustering quality is necessary for network analysis, but low computation speed is intolerable because of the large network scales involved. The LDCC algorithm[12] improves accuracy by exploring both the heterogeneous and homogeneous data relations. The CESC algorithm[13] is very effective for clustering homogeneous data using an approximate commute time embedding. A heterogeneous information network with a star network schema can be transformed into multiple compatible bipartite graphs from the compatible point of view. When the relation between any two nodes of a bipartite graph is represented by the commute time, the relations of both heterogeneous and homogeneous data objects can be explored, and the clustering accuracy can also be improved. Heterogeneous information networks are large but very sparse; therefore, the approximate commute time embedding of each bipartite graph can be quickly computed using random mapping and a linear time solver[14]. All of the indicator subsets in each embedding indicate the target dataset, and subsequently, a general model for clustering heterogeneous information networks is formulated based on all indicator subsets. All weighted distances between the indicators and the cluster centers in the respective indicator subsets are computed. All indicator subsets can be simultaneously clustered according to the sum of the weighted distances for all indicators for an identical target object. Based on the above discussion, an effective clustering algorithm, FctClus, which is based on the approximate commute time embedding for heterogeneous information networks, is proposed in this paper.
Both the computation speed and the clustering accuracy of FctClus are high.

Methods

Commute Time Embedding of the Bipartite Graph

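Given two types of datasets, $X_0 = \{x_1^{(0)}, \ldots, x_{n_0}^{(0)}\}$ and $X_1 = \{x_1^{(1)}, \ldots, x_{n_1}^{(1)}\}$, the graph $G_b = \langle V, E \rangle$ is called a bipartite graph if $V(G_b) = X_0 \cup X_1$ and $E(G_b) = \{\langle x_i^{(0)}, x_j^{(1)} \rangle\}$, where $1 \le i \le n_0$, $1 \le j \le n_1$. $W \in \mathbb{R}^{n_0 \times n_1}$ is the relation matrix between $X_0$ and $X_1$, where the element $w_{ij}$ is the edge weight between $x_i^{(0)}$ and $x_j^{(1)}$. Then, the adjacency matrix of the bipartite graph $G_b$ can be denoted as

$$M = \begin{pmatrix} 0 & W \\ W^T & 0 \end{pmatrix}.$$

$D_1$ and $D_2$ are diagonal matrices, where the $i$-th diagonal element of $D_1$ is $\sum_{j} w_{ij}$ and the $j$-th diagonal element of $D_2$ is $\sum_{i} w_{ij}$. With $D = \begin{pmatrix} D_1 & 0 \\ 0 & D_2 \end{pmatrix}$, the Laplacian matrix of the bipartite graph $G_b$ is $L = D - M$. $L$ can be eigen-decomposed into $L = \Phi \Lambda \Phi^T$, where $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$ is a diagonal matrix composed of the eigenvalues of $L$ with $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$, $\Phi = (\phi_1, \phi_2, \ldots, \phi_n)$ is an eigenmatrix and $\phi_i$ is the eigenvector corresponding to the eigenvalue $\lambda_i$. Let $L^+$ be the pseudo-inverse matrix of $L$, so that $L^+ = \Phi \Lambda^+ \Phi^T = \sum_{\lambda_i \ne 0} \frac{1}{\lambda_i} \phi_i \phi_i^T$, where $\Lambda^+$ holds $1/\lambda_i$ for each nonzero eigenvalue and 0 otherwise. The bipartite graph is also an undirected weighted graph. According to the literature[15], the commute time $c_{ij}$ between nodes $i$ and $j$ of $G_b$ can be computed from the pseudo-inverse matrix $L^+$:

$$c_{ij} = g\,(l^+_{ii} + l^+_{jj} - 2 l^+_{ij}) = g\,(e_i - e_j)^T L^+ (e_i - e_j), \tag{1}$$

where $l^+_{ij}$ is the $(i, j)$ element of $L^+$, $g = \sum w_{ij}$, and $e_i$ is a unit column vector in which the $i$-th element is 1; that is, $e_i = [0, \ldots, 1, \ldots, 0]^T$. According to the literature[15,16], the commute time $c_{ij}$ between nodes $i$ and $j$ of $G_b$ is

$$c_{ij} = g\,(e_i - e_j)^T \Phi \Lambda^+ \Phi^T (e_i - e_j) = \left\| \sqrt{g}\, \Lambda^{+1/2} \Phi^T (e_i - e_j) \right\|^2.$$

Thus, the commute time $c_{ij}$ is the squared pairwise Euclidean distance between the row vectors in the space $\sqrt{g}\, \Phi \Lambda^{+1/2}$ or the column vectors in the space $\sqrt{g}\, \Lambda^{+1/2} \Phi^T$[13]; either is called the commute time embedding of the bipartite graph $G_b$. $c_{ij}$ reflects the average path length between two nodes rather than the shortest path between them. Using the commute time for clustering noisy data increases robustness and captures complex clusters. Therefore, clustering in the commute time embedding can also effectively capture complex clusters. $\sqrt{g}\, \Lambda^{+1/2} \Phi^T$ is used in this paper. If the normalized Laplacian matrix $L_n = D^{-1/2} L D^{-1/2}$ is used, the commute time embedding is $\sqrt{g}\, \Lambda_n^{+1/2} \Phi_n^T D^{-1/2}$[13].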
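As an illustration of Eq (1), the following sketch computes exact commute times for a very small bipartite graph. It is not the authors' MATLAB code. The function names are hypothetical, $g$ is assumed to be the graph volume (the sum of all node degrees), and the pseudo-inverse is obtained with the identity $L^+ = (L + J/n)^{-1} - J/n$ (with $J$ the all-ones matrix), which holds only for connected graphs. The dense inversion is cubic in $n$, so this is suitable only for toy inputs.

```python
def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def bipartite_laplacian(w):
    """Build L = D - M for the relation matrix w (n0 x n1); also return g (assumed: volume)."""
    n0, n1 = len(w), len(w[0])
    n = n0 + n1
    m = [[0.0] * n for _ in range(n)]
    for i in range(n0):
        for j in range(n1):
            m[i][n0 + j] = m[n0 + j][i] = float(w[i][j])
    lap = [[(sum(m[i]) if i == j else 0.0) - m[i][j] for j in range(n)]
           for i in range(n)]
    g = sum(sum(row) for row in m)  # sum of all node degrees
    return lap, g


def commute_times(w):
    """Return the n x n matrix of commute times c_ij = g (l+_ii + l+_jj - 2 l+_ij)."""
    lap, g = bipartite_laplacian(w)
    n = len(lap)
    shifted = [[lap[i][j] + 1.0 / n for j in range(n)] for i in range(n)]
    inv = invert(shifted)
    lp = [[inv[i][j] - 1.0 / n for j in range(n)] for i in range(n)]  # L^+
    return [[g * (lp[i][i] + lp[j][j] - 2.0 * lp[i][j]) for j in range(n)]
            for i in range(n)]
```

For two target objects that share a single attribute object (`w = [[1], [1]]`), the graph is a three-node path, and the commute time between the two target objects equals the sum of the two end-to-end hitting times of that path.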

Approximate Commute Time Embedding of the Bipartite Graph

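If $\sqrt{g}\, \Lambda^{+1/2} \Phi^T$ or $\sqrt{g}\, \Lambda_n^{+1/2} \Phi_n^T D^{-1/2}$ is computed directly, the eigen-decomposition of the Laplacian matrix $L$ or $L_n$ requires $O(n^3)$ time. $n = n_0 + n_1$ is the number of nodes and $s$ is the number of edges in the bipartite graph $G_b$. According to the literature[17], if the edges in $G_b$ are oriented and

$$B(e, v) = \begin{cases} 1 & \text{if } v \text{ is the head of } e, \\ -1 & \text{if } v \text{ is the tail of } e, \\ 0 & \text{otherwise}, \end{cases}$$

where $e$ is an edge and $v$ is a node of $G_b$, then $B \in \mathbb{R}^{s \times n}$ is a directed edge-node incidence matrix. Using $\hat{W} \in \mathbb{R}^{s \times s}$ as a diagonal matrix whose entries are the edge weights, $L = B^T \hat{W} B$. Furthermore, because $L^+ = L^+ L L^+$,

$$c_{ij} = g\,(e_i - e_j)^T L^+ B^T \hat{W} B L^+ (e_i - e_j) = \left\| \sqrt{g}\, \hat{W}^{1/2} B L^+ (e_i - e_j) \right\|^2;$$

thus, $\psi = \sqrt{g}\, \hat{W}^{1/2} B L^+ \in \mathbb{R}^{s \times n}$ is a commute time embedding of the bipartite graph $G_b$, where the square root of the commute time is the Euclidean distance between $i$ and $j$ in $\psi$ because

$$c_{ij} = \| \psi (e_i - e_j) \|^2. \tag{2}$$

According to the literature[18], given vectors $v_1, \ldots, v_n \in \mathbb{R}^d$ and $\varepsilon > 0$, let $Q \in \mathbb{R}^{k \times d}$ be a random matrix of row vectors whose entries $Q(i, j) = \pm 1/\sqrt{k}$ are equiprobable, where $k = O(\log n / \varepsilon^2)$. With probability at least $1 - 1/n$,

$$(1 - \varepsilon) \| v_i - v_j \|^2 \le \| Q v_i - Q v_j \|^2 \le (1 + \varepsilon) \| v_i - v_j \|^2 \tag{3}$$

for all pairs. Therefore, given the bipartite graph $G_b$ with $n$ nodes and $s$ edges, $\varepsilon > 0$, and a matrix $Y = \sqrt{g}\, Q \hat{W}^{1/2} B L^+ \in \mathbb{R}^{k \times n}$, with probability at least $1 - 1/n$:

$$(1 - \varepsilon)\, c_{ij} \le \| Y (e_i - e_j) \|^2 \le (1 + \varepsilon)\, c_{ij} \tag{4}$$

for any nodes $i, j \in G_b$, where $k = O(\log n / \varepsilon^2)$. The proof of Eq (4) comes directly from Eq (2) and Eq (3). Hence $c_{ij} \approx \| Y (e_i - e_j) \|^2$ with an error $\varepsilon$ based on Eq (4). If $Y = \sqrt{g}\, Q \hat{W}^{1/2} B L^+$ is computed directly, $L^+$ must first be computed, but the computational complexity of directly computing $L^+$ is excessive. However, using the method in the literature[19,20] to compute $Y$, the complexity is decreased. Let $\theta = \sqrt{g}\, Q \hat{W}^{1/2} B$; then, $Y = \theta L^+$, which is equivalent to $Y L = \theta$. First, $\theta$ is computed, and then $Y L = \theta$ is solved. Each row of $Y$, $y_i$, is computed by solving the system $y_i L = \theta_i$, where $\theta_i$ is the $i$-th row of $\theta$. The linear time solver of Spielman and Teng[19,20] requires only $\tilde{O}(s)$ time to solve the system. Because $\| y_i - \tilde{y}_i \|_L \le \varepsilon \| y_i \|_L$[17], where $\tilde{y}_i$ is the solution of $y_i L = \theta_i$ obtained using the linear time solver, it follows that

$$(1 - \varepsilon)^2 c_{ij} \le \| \tilde{Y} (e_i - e_j) \|^2 \le (1 + \varepsilon)^2 c_{ij}$$

[17]. Therefore, $c_{ij} \approx \| \tilde{Y} (e_i - e_j) \|^2$ with an error bound of $\varepsilon^2$. The algorithm for the approximate commute time embedding of the bipartite graph is as follows.

Algorithm 1 ApCte (Approximate Commute Time Embedding of the Bipartite Graph)
1. Input the relation matrix $W$;
2. Compute the matrices $B$, $\hat{W}$ and $L$ using $W$;
3. Compute $\theta = \sqrt{g}\, Q \hat{W}^{1/2} B$;
4. Compute each row $\tilde{y}_i$ of $\tilde{Y}$ by solving the system $y_i L = \theta_i$, calling the Spielman-Teng solver $k$ times[14], $1 \le i \le k$;
5. Output the approximate commute time embedding $\tilde{Y}$.

All data objects of $X_0$ and $X_1$ are mapped into a common subspace $\tilde{Y}$, where the first $n_0$ column vectors of $\tilde{Y}$ indicate $X_0$ and the last $n_1$ column vectors of $\tilde{Y}$ indicate $X_1$. The dataset composed of the $n = n_0 + n_1$ column vectors of $\tilde{Y}$ is called an indicator dataset. The input matrix $W$ is a sparse matrix with $s$ nonzero elements. Therefore, the complexity of computing the matrices $B$, $\hat{W}$ and $L$ in step 2 is $O(2s) + O(s) + O(n)$. The sparse matrix $B$ has $2s$ nonzero elements, and the diagonal matrix $\hat{W}$ has $s$ nonzero elements. Computing $\theta$ takes $O(2sk + s)$ time in step 3. Because the linear time solver of Spielman and Teng[19,20] requires only $\tilde{O}(s)$ time to solve for each $y_i$ of the system $y_i L = \theta_i$, constructing $\tilde{Y}$ takes $\tilde{O}(sk)$ time in step 4. Therefore, the complexity of Algorithm 1, ApCte, is only $O(2s) + O(s) + O(n) + O(2sk + s) + \tilde{O}(sk) = O(n) + \tilde{O}(sk)$. In practice, $k = O(\log n / \varepsilon^2)$ is small and does not vary much between different datasets. The indicator dataset consists of low-dimensional homogeneous data; therefore, traditional algorithms can be used on the indicator dataset.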
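The steps of ApCte can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation. The function names are hypothetical, $g$ is taken to be the graph volume, the graph is assumed connected, and a dense pseudo-inverse stands in for the nearly-linear-time solver (so the sketch reproduces the embedding but not the running time of the real algorithm).

```python
import math
import random


def pinv_laplacian(lap):
    """Dense pseudo-inverse of a connected graph's Laplacian: (L + J/n)^-1 - J/n."""
    n = len(lap)
    aug = [[lap[i][j] + 1.0 / n for j in range(n)] + [float(i == j) for j in range(n)]
           for i in range(n)]
    for col in range(n):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [[aug[i][n + j] - 1.0 / n for j in range(n)] for i in range(n)]


def apcte_sketch(w, k, seed=0):
    """Return a k x n matrix Y with ||Y(e_i - e_j)||^2 approximating c_ij."""
    rng = random.Random(seed)
    n0, n1 = len(w), len(w[0])
    n = n0 + n1
    # One oriented edge per nonzero entry of W: (target node, attribute node, weight).
    edges = [(i, n0 + j, float(w[i][j]))
             for i in range(n0) for j in range(n1) if w[i][j]]
    lap = [[0.0] * n for _ in range(n)]
    for u, v, wt in edges:
        lap[u][u] += wt
        lap[v][v] += wt
        lap[u][v] -= wt
        lap[v][u] -= wt
    g = 2.0 * sum(wt for _, _, wt in edges)  # assumed: g is the sum of node degrees
    # theta = sqrt(g) * Q * W_hat^(1/2) * B, accumulated without storing Q or B.
    theta = [[0.0] * n for _ in range(k)]
    for row in theta:
        for u, v, wt in edges:
            q = rng.choice((-1.0, 1.0)) / math.sqrt(k)  # random +-1/sqrt(k) entry of Q
            coeff = math.sqrt(g * wt) * q
            row[u] += coeff  # B has +1 at one endpoint of the edge
            row[v] -= coeff  # and -1 at the other
    lp = pinv_laplacian(lap)  # stand-in for k calls to a nearly-linear-time solver
    return [[sum(row[m] * lp[m][j] for m in range(n)) for j in range(n)]
            for row in theta]


def approx_commute_time(y, i, j):
    """Squared Euclidean distance between columns i and j of the embedding."""
    return sum((row[i] - row[j]) ** 2 for row in y)
```

On the three-node path used as a toy example (`w = [[1], [1]]`), the exact commute time between the two target nodes is 8, and the approximation concentrates around that value as `k` grows.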

A General Model Formulation

Given a dataset $\chi = \{X_t\}_{t=0}^{T}$ with $T+1$ types, where $X_t$ is a dataset belonging to the $t$-th type, a weighted graph $G = \langle V, E, W \rangle$ on $\chi$ is called an information network if $V(G) = \chi$, $E(G)$ is a binary relation on $V$ and $W: E \to \mathbb{R}^+$. Such an information network is called a heterogeneous information network when $T \ge 1$ and a homogeneous information network when $T = 0$[6]. An information network $G = \langle V, E, W \rangle$ on $\chi$ is called a heterogeneous information network with a star network schema if $\forall e = \langle x_i, x_j \rangle \in E$, $x_i \in X_0$ and $x_j \in X_t$ $(t \ne 0)$. $X_0$ is the target dataset, and $X_t$ $(t \ne 0)$ is an attribute dataset.

To derive a general model for clustering the target dataset, a heterogeneous information network with a star network schema on the dataset $\chi$ with $T+1$ types is given, where $X_0$ is the target dataset and $\{X_t\}_{t=1}^{T}$ are the attribute datasets. $X_t = \{x_1^{(t)}, \ldots, x_{n_t}^{(t)}\}$, where $n_t$ is the object number of $X_t$. $W^{(0t)} \in \mathbb{R}^{n_0 \times n_t}$ denotes the relation matrix between the target dataset $X_0$ and the attribute dataset $X_t$, where the element $w_{ij}^{(0t)}$ denotes the relation between $x_i^{(0)}$ of $X_0$ and $x_j^{(t)}$ of $X_t$. If an edge between $x_i^{(0)}$ and $x_j^{(t)}$ exists, its edge weight is $w_{ij}^{(0t)}$. If no edge exists, $w_{ij}^{(0t)} = 0$. $T$ relation matrices $\{W^{(0t)}\}_{t=1}^{T}$ exist in the heterogeneous information network with a star network schema. The target dataset $X_0$ and the attribute dataset $X_t$ constitute a bipartite graph, $G^{(0t)}$, which corresponds to the relation matrix $W^{(0t)}$. The indicator dataset $Y^{(0t)}$, which is also the approximate commute time embedding of $G^{(0t)}$, can be quickly computed by ApCte, where the first $n_0$ data of $Y^{(0t)}$ indicate $X_0$ and the last $n_t$ data of $Y^{(0t)}$ indicate the attribute dataset $X_t$. $Y_0^{(t)}$ consists of the first $n_0$ data of $Y^{(0t)}$, and $Y^{(t)}$ consists of the last $n_t$ data of $Y^{(0t)}$. $Y_0^{(t)}$ and $Y^{(t)}$ are called the indicator subsets. $y_i^{(t)} \in Y_0^{(t)}$ indicates the $i$-th object of $X_0$ and is called an indicator, for $1 \le i \le n_0$. There exists a one-to-one correspondence between the indicators of $Y_0^{(t)}$ and the objects of $X_0$.

Because $T$ bipartite graphs correspond to $T$ indicator datasets, the target dataset $X_0$ is simultaneously indicated by the $T$ indicator subsets $\{Y_0^{(t)}\}_{t=1}^{T}$, and each object of $X_0$ is simultaneously indicated by $T$ indicators. $\beta^{(t)}$ is the weight of the relation matrix $W^{(0t)}$, where $\sum_{t=1}^{T} \beta^{(t)} = 1$, $\beta^{(t)} > 0$. The target dataset $X_0$ is partitioned into $K$ clusters. The indicators of $\{Y_0^{(t)}\}_{t=1}^{T}$ that indicate the identical object of $X_0$ belong to $T$ clusters. The $T$ clusters are in $T$ different indicator subsets and are denoted using the same label. Let

$$F = \sum_{t=1}^{T} \beta^{(t)} \sum_{j=1}^{K} \sum_{i=1}^{n_0} \gamma_{ij} \left\| y_i^{(t)} - c_j^{(t)} \right\|^2, \tag{5}$$

where $c_j^{(t)}$ is the $j$-th cluster center of the indicator subset $Y_0^{(t)}$. There exists a one-to-one correspondence between the indicator function $\gamma_{ij}$ and the objects of $X_0$. If all indicators $y_i^{(t)}$ $(1 \le t \le T)$ that indicate the $i$-th object of $X_0$ belong to the $j$-th cluster, $\gamma_{ij} = 1$; otherwise, $\gamma_{ij} = 0$. If the objective function $F$ in Eq (5) is minimized, the clusters of $X_0$ are optimal from the compatible point of view because each indicator subset reflects the relation between the target dataset and one attribute dataset. Determining the global minimum of Eq (5) is NP-hard.

Derivation of Fast Algorithm for Clustering Heterogeneous Information Networks

The following steps allow for the local minimum of F in Eq (5) to be quickly achieved by simultaneously clustering all of the indicator subsets.

Setting the Cluster Label

When the cluster labels of each indicator subset are given, the modeling process can be simplified. Suppose that the labels of the $K$ clusters of each $Y_0^{(t)}$ are set. Let $q_1, q_2 \in X_0$, $y_{q_1}^{(t)} \in Y_0^{(t)}$ and $y_{q_2}^{(t)} \in Y_0^{(t)}$, where $\{y_{q_1}^{(t)}\}_{t=1}^{T}$ indicate $q_1$ and $\{y_{q_2}^{(t)}\}_{t=1}^{T}$ indicate $q_2$. The clusters to which the indicators of an identical target object belong have the same label. If one indicator of $\{y_{q_1}^{(t)}\}_{t=1}^{T}$ belongs to the $j$-th cluster, all of the other indicators of $\{y_{q_1}^{(t)}\}_{t=1}^{T}$ also belong to the $j$-th cluster in their respective indicator subsets. If $y_{q_1}^{(t)}$ belongs to the $j$-th cluster, then the indicators $\{y_{q_2}^{(t)}\}_{t=1}^{T}$ either all belong to the $j$-th cluster in their respective indicator subsets or none of them do. Each cluster of $Y_0^{(t)}$ has an initial center. $K$ random objects are selected from the target dataset $X_0$. The indicators of these $K$ objects are taken as the initial cluster centers for each $Y_0^{(t)}$, and the clusters whose centers indicate an identical target object are given the same label. Then, the indicators of any other target object either all belong to the $j$-th cluster in each $Y_0^{(t)}$ or none of them do, where $1 \le j \le K$. In this way, the labels of the $K$ clusters of $Y_0^{(t)}$ are set.

The sum of the Weighted Distances

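An object of $X_0$ is indicated by $T$ indicators. All of the $T$ distances between the indicator and the center in each $Y_0^{(t)}$ affect the object allocation. The target object allocation is determined by the sum of the weighted distances for the $T$ indicators. Let $q \in X_0$ and $y_q^{(t)} \in Y_0^{(t)}$, where $\{y_q^{(t)}\}_{t=1}^{T}$ indicates $q$. The weighted distance between $y_q^{(t)}$ and the $j$-th cluster center $c_j^{(t)}$ in $Y_0^{(t)}$ is $\beta^{(t)} \| y_q^{(t)} - c_j^{(t)} \|^2$. The sum of the weighted distances is $\sum_{t=1}^{T} \beta^{(t)} \| y_q^{(t)} - c_j^{(t)} \|^2$, which determines the cluster to which the object $q$ belongs:

$$\gamma_{qj} = \begin{cases} 1 & \text{if } j = \arg\min_{1 \le l \le K} \sum_{t=1}^{T} \beta^{(t)} \left\| y_q^{(t)} - c_l^{(t)} \right\|^2, \\ 0 & \text{otherwise}, \end{cases} \tag{6}$$

where $j$ is the cluster label.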
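The allocation rule can be sketched for a single target object as follows. This is a hedged illustration: `assign_cluster` is a hypothetical helper, and squared Euclidean distance is assumed as the distance measure, which is consistent with the cluster centers being updated as means.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_cluster(indicators, centers, betas):
    """Pick the label j minimizing the sum over t of beta[t] * dist(indicator[t], center[t][j]).

    indicators[t] is the indicator of one target object in indicator subset t,
    centers[t][j] is the j-th cluster center of subset t, and betas[t] is the
    weight of the t-th relation matrix.
    """
    num_clusters = len(centers[0])
    totals = [
        sum(beta * sqdist(ind, ctr[j])
            for beta, ind, ctr in zip(betas, indicators, centers))
        for j in range(num_clusters)
    ]
    return min(range(num_clusters), key=totals.__getitem__)
```

With two indicator subsets that disagree about an object, the weights decide the outcome: the same object can move from one label to another when the weight shifts toward the second relation.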

The Local Minimum of F

$F$ in Eq (5) can also be expressed as

$$F = \sum_{t=1}^{T} F^{(t)}, \qquad F^{(t)} = \beta^{(t)} \sum_{j=1}^{K} \sum_{i=1}^{n_0} \gamma_{ij} \left\| y_i^{(t)} - c_j^{(t)} \right\|^2. \tag{7}$$

Eq (7) is simply another representation of Eq (5). Given the initial centers and the cluster labels in the $T$ indicator subsets $\{Y_0^{(t)}\}_{t=1}^{T}$, the indicator subsets are first partitioned by computing Eq (6), and $F = F_0$ is set in Eq (7). The cluster centers of $Y_0^{(2)}, \ldots, Y_0^{(T)}$ remain the same, and $\gamma_{ij}$ is unchanged. The new center of each cluster in $Y_0^{(1)}$ is computed as the mean of all data of that cluster. The new centers of $Y_0^{(1)}$ replace the old centers, and subsequently, Eq (7) is used to set $F = F_1$. Then $F_1 \le F_0$ is proved as follows. Because only the new centers of $Y_0^{(1)}$ replace the old centers, $\gamma_{ij}$ remains unchanged. Because the mean of a cluster minimizes the sum of squared distances to the members of that cluster, $F^{(1)}$ does not increase. Because the cluster centers of $Y_0^{(2)}, \ldots, Y_0^{(T)}$ also remain unchanged, $\sum_{t=2}^{T} F^{(t)}$ is constant. Subsequently, $F_1 \le F_0$. Thus, the new centers of $Y_0^{(1)}$ replace the old centers, while the centers of $Y_0^{(2)}, \ldots, Y_0^{(T)}$ remain unchanged. Re-clustering using Eq (6), where the corresponding value is $F = F_2$ in Eq (7), gives $F_2 \le F_1$, because each object is reassigned to the label with the smallest sum of weighted distances. After this partition, the new cluster centers of $Y_0^{(2)}$ are computed and replace the old centers. Then, the same procedure is repeated for each $Y_0^{(t)}$. The value of $F$ never increases in this process. The above procedures are repeated until $F$ in Eq (7) converges; then, a local minimum of $F$ in Eq (7) is obtained. The algorithm based on the approximate commute time embedding for heterogeneous information networks is shown below.

Algorithm 2 FctClus (Fast Clustering Algorithm based on the Approximate Commute Time Embedding for Heterogeneous Information Networks)
1. Input the relation matrices $\{W^{(0t)}\}_{t=1}^{T}$, the weights $\{\beta^{(t)}\}_{t=1}^{T}$ and the cluster number $K$;
2. for $t = 1$ to $T$ do
3. Compute the indicator dataset $Y^{(0t)}$ of the bipartite graph corresponding to $W^{(0t)}$ using Algorithm 1;
4. Constitute the indicator subset $Y_0^{(t)}$ that indicates $X_0$;
5. end for
6. Initialize the $K$ initial cluster centers of $\{Y_0^{(t)}\}_{t=1}^{T}$ and set the cluster labels;
7. loop
8. for $t = 1$ to $T$ do
9. Partition $\{Y_0^{(t)}\}_{t=1}^{T}$ into $K$ clusters by computing Eq (6);
10. Re-compute the new cluster centers of $Y_0^{(t)}$;
11. Replace the old centers of $Y_0^{(t)}$ with the new centers;
12. end for
13. end loop
14. Output the clusters of $X_0$.

The computational complexity of steps 2~5 in Algorithm 2 is $\sum_{t=1}^{T} \left[ O(n_t) + \tilde{O}(k s_t) \right]$, where $T$ is the number of relation matrices in the heterogeneous information network and $k$ is the data dimension of $Y_0^{(t)}$. Here $n_t$ and $s_t$ are the node number and edge number of the $t$-th bipartite graph, respectively. Step 6 requires only $O(K)$ time, which is constant. The object number of $X_0$ is equal to the indicator number of each indicator subset; thus, the computational complexity of steps 7~13 is $O(uTKk n_0)$, where $K$ is the number of clusters of each $Y_0^{(t)}$, $n_0$ is the data number of each $Y_0^{(t)}$, and $u$ is the number of iterations needed for $F$ in Eq (7) to converge. Therefore, the computational complexity of Algorithm 2, FctClus, is $\sum_{t=1}^{T} \left[ O(n_t) + \tilde{O}(k s_t) \right] + O(uTKk n_0)$, where $k$ and $u$ are small and $T$ and $K$ are constant.
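The clustering loop of FctClus (the initialization and the alternating partition and center update) can be sketched as follows. It is a minimal sketch under stated assumptions rather than the authors' MATLAB code: the indicator subsets are assumed to be given already (for example, from an ApCte-style embedding), squared Euclidean distance is assumed, the initial target objects are passed in explicitly instead of being drawn at random, and all names are hypothetical.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fctclus_loop(subsets, betas, init_objects, max_iter=40):
    """Cluster target objects from T indicator subsets.

    subsets[t][i] is the indicator of target object i in subset t,
    betas[t] is the weight of subset t, and init_objects lists the K target
    objects whose indicators seed the cluster centers in every subset.
    """
    T, n0, K = len(subsets), len(subsets[0]), len(init_objects)
    # Seeding every subset from the same K objects gives cluster j the same
    # label in all T subsets.
    centers = [[list(subsets[t][i]) for i in init_objects] for t in range(T)]

    def partition():
        # Each object gets the label with the smallest sum of weighted distances.
        return [min(range(K), key=lambda j: sum(
                    betas[t] * sqdist(subsets[t][i], centers[t][j])
                    for t in range(T)))
                for i in range(n0)]

    def objective(labels):
        return sum(betas[t] * sqdist(subsets[t][i], centers[t][labels[i]])
                   for t in range(T) for i in range(n0))

    labels = partition()
    previous = float("inf")
    for _ in range(max_iter):
        for t in range(T):
            labels = partition()
            for j in range(K):
                members = [subsets[t][i] for i in range(n0) if labels[i] == j]
                if members:  # an empty cluster keeps its old center
                    centers[t][j] = [sum(col) / len(members)
                                     for col in zip(*members)]
        current = objective(labels)
        if previous - current < 1e-12:  # objective has stopped decreasing
            break
        previous = current
    return labels, centers
```

A small usage example with two one-dimensional indicator subsets and four target objects:

```python
subsets = [[[0.0], [0.1], [10.0], [10.1]],
           [[0.0], [0.2], [5.0], [5.2]]]
labels, centers = fctclus_loop(subsets, [0.5, 0.5], init_objects=[0, 2])
```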

Experiments

The Experimental Dataset

The experimental datasets are composed of real data selected from the DBLP data. The DBLP is a typical heterogeneous information network in the computer science domain and contains 4 types of objects: papers, authors, terms and venues. Two heterogeneous datasets of different scales, called S_small and S_large, are used in the experiments. S_small is the small test dataset and is called the "four-area dataset", as in the literature[6]. S_small, extracted from the DBLP dataset downloaded in 2011, contains four areas related to data mining: databases, data mining, information retrieval and machine learning. Five representative conferences for each area are chosen, and all papers and terms that appear in the titles are included. S_small is provided in S1 File. S_large is the large test dataset and is extracted from the Chinese DBLP dataset, a shared resource released by the Institute of Automation, Chinese Academy of Sciences. S_large includes 34 computer science journals, 16,567 papers, 47,701 authors and 52,262 terms (keywords). S_large is provided in S2 File. When papers are analyzed, papers are the target dataset, and the other object types are the attribute datasets. There is no direct link between papers because the DBLP provides very limited citation information. When authors are analyzed, authors are the target dataset, while papers and venues are the attribute datasets. However, there is a direct link between authors because of the co-author relation; therefore, authors are another attribute dataset related to the target dataset. The experiments are performed in the MATLAB 7.0 programming environment. The MATLAB source code for our algorithm is provided in S3 File and is available online at https://github.com/lsy917/chenlimin; it includes a main program and three function programs. FctClus.m is the main program, which outputs the clusters of the target dataset, and ApCte.m, Prematrix.m and Net_Branches.m are function programs.
The Koutis CMG solver[14] is used in all experiments as the nearly linear time solver to create the embedding. The solver, which handles symmetric diagonally dominant matrices, is available online at http://www.cs.cmu.edu/~jkoutis/cmg.html.

The Relational Matrix

When papers are the target dataset, authors, venues and terms are the attribute datasets. $X_0$ denotes papers, and $X_1$, $X_2$ and $X_3$ denote authors, venues and terms, respectively. $W^{(0t)}$ is the relation matrix between $X_0$ and $X_t$, $1 \le t \le 3$; each element of $W^{(0t)}$ is the weight of the edge between a paper and the corresponding attribute object. When authors are the target dataset, papers and venues are the attribute datasets. Authors are also an attribute dataset because of the co-author relation existing between authors. $X_0$ denotes authors, and $X_1$ and $X_2$ denote papers and venues, respectively. $W^{(0t)}$ is the relation matrix between $X_0$ and $X_t$, $0 \le t \le 2$; each element of $W^{(0t)}$ is the weight of the edge between an author and the corresponding object. All the algorithms use the same relation matrices for all experiments.

Parameter Analysis

Analysis of Parameter k

The equation $\mathrm{accuracy} = \frac{1}{n} \sum_{i=1}^{n} \delta(\mathrm{label}(i), c_i)$[13] is used to compute the clustering accuracy in the experiments, where $n$ is the object number of the dataset, $\mathrm{label}(i)$ is the cluster label, and $c_i$ is the predicted label of an object $i$. $\delta(\cdot)$ is an indicator function: $\delta(x, y) = 1$ if $x = y$, and $\delta(x, y) = 0$ otherwise. $k$ is small in practice, and minimal differences exist among the various datasets[13]. The literature[13] has shown that the accuracy curve is flat for clustering different homogeneous datasets when $k \ge 50$. Using the small dataset S_small, the clustering accuracy as a function of $k$ in a heterogeneous information network is studied by running an experiment with different values of $k$. In the FctClus algorithm, the weights of $W^{(0t)}$ are taken as $\beta^{(1)} = 0.3$, $\beta^{(2)} = 0.4$ and $\beta^{(3)} = 0.3$ for clustering papers, and as $\beta^{(1)} = 0.4$, $\beta^{(2)} = 0.2$ and $\beta^{(3)} = 0.4$ for clustering authors. The clustering accuracy is affected by $k$, as shown in Fig 1 and Fig 2.
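The accuracy measure amounts to the fraction of objects whose predicted label matches the reference label. The sketch below assumes the predicted cluster ids have already been mapped onto the reference label names, since the text does not describe how that alignment is performed; the function name is hypothetical.

```python
def clustering_accuracy(reference_labels, predicted_labels):
    """Fraction of objects whose predicted label equals the reference label."""
    if len(reference_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if not reference_labels:
        raise ValueError("at least one object is required")
    matches = sum(1 for ref, pred in zip(reference_labels, predicted_labels)
                  if ref == pred)
    return matches / len(reference_labels)
```

Multiplying the returned value by 100 gives a percentage of the kind reported in the accuracy comparison.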

Fig 1

The influence of k on clustering papers on S_small.

Fig 2

The influence of k on clustering authors on S_small.

The parameter k can be kept quite small because the accuracy curve flattens once k reaches a certain value. k = 60 is suitable for the dataset in the experiment. Because k is small, it does not considerably affect the computation speed of FctClus. It is advantageous that FctClus is not sensitive to k in terms of both accuracy and performance. The same relation-matrix weights and k = 60 are used in the other experiments.

Analysis of Iteration u

An experiment is conducted on the small dataset S_small to examine the influence of the iteration number u on the clustering result, where k = 60. The influence of u on clustering papers and authors is shown in Fig 3 and Fig 4. The algorithm converges by about u = 30; u = 40 is used in the other experiments.

Fig 3

The influence of u on clustering papers on S_small.

Fig 4

The influence of u on clustering authors on S_small.

Comparison of Clustering Accuracy and Computation Speed

The algorithms based on semi-definite programming[2,3] and the spectral clustering algorithms for multi-type relational data[5] are too complex for large-scale networks. The low-complexity algorithms CIT[4], NetClus[6] and ComClus[10] are therefore selected for comparison with the FctClus algorithm in terms of clustering accuracy and computation speed, using both datasets S_small and S_large. The initial cluster centers of FctClus and the initial cluster partitions of the other three algorithms are randomly selected 3 times. For each algorithm, the best clustering accuracy of the 3 runs is reported, and the running time of that run is taken as the measured computation time. The parameters in literature[6] are used for NetClus, and the parameters in literature[10] are used for ComClus. The comparison results are shown in Table 1 and Table 2.

Table 1

Comparison of clustering accuracy (%).

Target object and dataset	CIT	NetClus	ComClus	FctClus
Papers on S_small	73.91	71.54	72.83	78.87
Authors on S_small	74.41	69.13	74.91	81.33
Papers on S_large	70.84	71.28	72.93	76.36
Authors on S_large	71.02	68.29	73.01	77.94

Table 2

Comparison of computation time (s).

Target object and dataset	CIT	NetClus	ComClus	FctClus
Papers on S_small	78.5	37.3	40.3	37.1
Authors on S_small	79.8	36.9	39.8	38.3
Papers on S_large	1469.3	802.6	827.3	808.4
Authors on S_large	1484.7	743.7	781.4	774.9

The clustering accuracy of FctClus is the highest of all four algorithms. The clustering accuracy of CIT is lower than that of FctClus because the bipartite graphs of the heterogeneous information networks are sparse. The computational complexity of CIT is O(n 2), and CIT converges slowly when the heterogeneous information network is sparse. The clustering accuracy of NetClus is low because only heterogeneous relations are used. Homogeneous and heterogeneous relations are both used in ComClus; therefore, the accuracy of ComClus is higher than that of NetClus. FctClus is an algorithm based on commute time embedding. The data relations are explored using the commute time, and the direct relations of the target dataset are considered. FctClus is not affected by the sparsity of networks; thus, FctClus is highly accurate. FctClus is nearly as fast as NetClus (Table 2). The experiment demonstrates that FctClus is effective. FctClus is also more general and can be adapted for clustering any heterogeneous information network with a star network schema, whereas NetClus and ComClus can only be adapted for clustering bibliographic networks because they depend on a ranking function of a specific application field.

Comparison of Clustering Stability

To compare the stability of the FctClus, NetClus and CIT algorithms, the small dataset S_small is used for clustering papers in this experiment. ComClus is a derivation algorithm of NetClus and has the same properties as NetClus, so it is not considered here. The initial cluster centers of FctClus and the initial cluster partitions of NetClus and CIT are randomly selected 10 times, and each of the three algorithms is executed 10 times. The clustering accuracy of the three algorithms over the 10 runs is shown in Fig 5. Although the computation speeds of FctClus and NetClus are both high, Fig 5 shows that FctClus is more stable than NetClus and that the initial centers do not greatly impact the clustering result of FctClus. In contrast, NetClus is very unstable, and the initial clusters greatly impact its clustering accuracy and convergence speed. CIT is more stable than NetClus, but its clustering accuracy is low.

Fig 5

A stability comparison of the 3 algorithms over 10 runs.

Running Time Analysis of the FctClus Algorithm

The running time distributions of FctClus on the two datasets are shown in Table 3. The experimental data show that FctClus is effective. The serial computation of the three embeddings accounts for roughly half of the total running time (between 47% and 53% in Table 3). Computing the three embeddings in parallel would increase the computation speed, and clustering the indicator subsets in parallel may increase it further.

Table 3

Distribution of running time for FctClus.

Target object and dataset	Embedding time (s)	Clustering time (s)	Total time (s)
Papers on S_small	19.6	17.5	37.1
Authors on S_small	18.1	20.2	38.3
Papers on S_large	398.8	409.6	808.4
Authors on S_large	382.4	392.5	774.9
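The split between embedding and clustering time in Table 3 can be checked with a few lines of arithmetic. The sketch below copies the table values and computes the share of the total time spent on the embeddings; the variable and function names are illustrative only.

```python
# (embedding time, clustering time, total time) in seconds, copied from Table 3.
TABLE_3 = {
    "Papers on S_small": (19.6, 17.5, 37.1),
    "Authors on S_small": (18.1, 20.2, 38.3),
    "Papers on S_large": (398.8, 409.6, 808.4),
    "Authors on S_large": (382.4, 392.5, 774.9),
}


def embedding_share(embedding, clustering, total):
    """Percentage of the total running time spent computing the embeddings."""
    if abs(embedding + clustering - total) > 0.05:
        raise ValueError("embedding and clustering times do not add up to the total")
    return 100.0 * embedding / total


shares = {name: round(embedding_share(*times), 1) for name, times in TABLE_3.items()}
```

The resulting shares are 52.8%, 47.3%, 49.3% and 49.3% for the four rows, so the embedding step takes about half of the running time in every case.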

Conclusions

The relation between the original data described by the commute time guarantees the accuracy and performance of the FctClus algorithm. Because heterogeneous information networks are sparse, FctClus can use random mapping and a linear time solver[14] to compute the approximate commute time embedding, which guarantees the high computation speed. FctClus is effective and may be broadly implemented for large heterogeneous information networks, as demonstrated in theory and experimentally. The weight of the relation matrix impacts the target function, but the weight cannot be determined self-adaptively; this requires further research. The relations of data in the real world are typically high-order heterogeneous, so effective clustering algorithms for heterogeneous information networks with any schema will be studied in the future.

Supporting Information

S1 File. S_small dataset. (TXT)

S2 File. S_large dataset. (TXT)

S3 File. The MATLAB source code for the algorithm. (TXT)
