Shi Yu, Xinhai Liu, Léon-Charles Tranchevent, Wolfgang Glänzel, Johan A K Suykens, Bart De Moor, Yves Moreau.
Abstract
MOTIVATION: We propose a novel algorithm to combine multiple kernels and Laplacians for clustering analysis. The new algorithm is formulated on a Rayleigh quotient objective function and is solved as a bi-level alternating minimization procedure. Using the proposed algorithm, the coefficients of kernels and Laplacians can be optimized automatically.
Year: 2010 PMID: 20980271 PMCID: PMC3008636 DOI: 10.1093/bioinformatics/btq569
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Performance on disease dataset
| Algorithm | ARI | P-value (ARI) | NMI | P-value (NMI) |
|---|---|---|---|---|
| OKLC 1 | – | – | – | – |
| OKLC 2 | 0.5369 ± 0.0493 | 2.97E-04 | 0.7106 ± 0.0283 | 9.85E-05 |
| OKLC 3 | 0.5469 ± 0.0485 | 1.10E-03 | 0.7268 ± 0.0360 | 2.61E-02 |
| CSPA | 0.4367 ± 0.0266 | 5.66E-11 | 0.6362 ± 0.0222 | 4.23E-12 |
| HGPA | 0.5040 ± 0.0363 | 8.47E-07 | 0.6872 ± 0.0307 | 7.42E-07 |
| MCLA | 0.4731 ± 0.0320 | 2.26E-10 | 0.6519 ± 0.0210 | 5.26E-14 |
| QMI | 0.4656 ± 0.0425 | 7.70E-11 | 0.6607 ± 0.0255 | 8.49E-11 |
| EACAL | 0.4817 ± 0.0263 | 2.50E-09 | 0.6686 ± 0.0144 | 5.54E-12 |
| AdacVote | 0.1394 ± 0.0649 | 1.47E-16 | 0.4093 ± 0.0740 | 6.98E-14 |
All compared methods combine nine kernels and nine Laplacians. The mean values and SDs are computed over 20 random repetitions. The best performance is shown in bold. The P-values are computed against the best-performing method using a paired t-test.
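For reference, ARI and NMI both compare a clustering with reference labels through their contingency counts. The sketch below is a minimal plain-Python illustration, assuming the Hubert-Arabie form of ARI and a geometric-mean normalization for NMI (the text here does not state which NMI normalization was used); the function names are illustrative.

```python
from collections import Counter
from math import comb, log, sqrt


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings (Hubert-Arabie form)."""
    n = len(labels_true)
    pair_counts = Counter(zip(labels_true, labels_pred))  # contingency table cells
    row_counts = Counter(labels_true)
    col_counts = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in pair_counts.values())
    sum_rows = sum(comb(c, 2) for c in row_counts.values())
    sum_cols = sum(comb(c, 2) for c in col_counts.values())

    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:  # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def normalized_mutual_information(labels_true, labels_pred):
    """NMI with geometric-mean normalization (an assumption of this sketch)."""
    n = len(labels_true)
    pair_counts = Counter(zip(labels_true, labels_pred))
    row_counts = Counter(labels_true)
    col_counts = Counter(labels_pred)

    mutual_info = sum(
        (c / n) * log(c * n / (row_counts[a] * col_counts[b]))
        for (a, b), c in pair_counts.items()
    )
    h_true = -sum((c / n) * log(c / n) for c in row_counts.values())
    h_pred = -sum((c / n) * log(c / n) for c in col_counts.values())
    if h_true == 0.0 or h_pred == 0.0:  # at least one labeling has a single cluster
        return 1.0 if h_true == h_pred else 0.0
    return mutual_info / sqrt(h_true * h_pred)
```

Both scores depend only on how items are grouped, so renaming the cluster labels leaves them unchanged.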
Performance on journal dataset
| Algorithm | ARI | P-value (ARI) | NMI | P-value (NMI) |
|---|---|---|---|---|
| OKLC 1 | 0.7346 ± 0.0584 | 0.3585 | 0.7688 ± 0.0364 | 0.1472 |
| OKLC 2 | 0.7235 ± 0.0660 | 0.0944 | 0.7532 ± 0.0358 | 0.0794 |
| OKLC 3 | – | – | – | – |
| CSPA | 0.6703 ± 0.0485 | 8.84E-05 | 0.7173 ± 0.0291 | 1.25E-05 |
| HGPA | 0.6673 ± 0.0419 | 4.74E-06 | 0.7141 ± 0.0269 | 5.19E-06 |
| MCLA | 0.6571 ± 0.0746 | 6.55E-05 | 0.7128 ± 0.0463 | 2.31E-05 |
| QMI | 0.6592 ± 0.0593 | 5.32E-06 | 0.7250 ± 0.0326 | 1.30E-05 |
| EACAL | 0.5808 ± 0.0178 | 3.85E-11 | 0.7003 ± 0.0153 | 6.88E-09 |
| AdacVote | 0.5899 ± 0.0556 | 1.02E-07 | 0.6785 ± 0.0325 | 6.51E-09 |
All compared methods combine four kernels and four Laplacians. The mean values and SDs are computed over 20 random repetitions. The best performance is shown in bold. The P-values are computed against the best-performing method using a paired t-test.
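The P-values reported in these tables come from paired t-tests over the 20 repetitions. As a rough illustration, the sketch below computes only the paired t statistic and its degrees of freedom in plain Python; the function name is illustrative, and converting the statistic to a P-value needs the Student t distribution, for which a library routine such as SciPy's `scipy.stats.ttest_rel` could be used.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for two paired samples of equal length."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two pairs are required")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0.0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean(diffs) / (sd / sqrt(n))
    return t, n - 1
```

Here `scores_a` and `scores_b` would hold the per-repetition scores (for example, ARI) of two methods evaluated on the same repetitions.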
Fig. 1. Confusion matrices of disease data obtained by kernel KM on LDDB (A) and OKLC model 1 integration (B). The numbers of cluster labels are consistent with the numbers of diseases presented in Supplementary Material 3. In each row of the confusion matrix, the diagonal element represents the fraction of correctly clustered genes and the off-diagonal non-zero element represents the fraction of misclustered genes.
Fig. 2. Confusion matrices of journal data obtained by kernel KM on IDF (A) and OKLC model 1 integration (B). The numbers of cluster labels are consistent with the numbers of ESI journal categories presented in Supplementary Material 3. In each row, the diagonal element represents the fraction of correctly clustered journals and the off-diagonal non-zero element represents the fraction of misclustered journals.
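The row-wise fractions described in these captions can be sketched as a row-normalized confusion matrix. The example below is a minimal illustration that assumes the cluster labels have already been matched to the reference classes (the matching step is not shown); the function and argument names are illustrative.

```python
from collections import Counter


def row_normalized_confusion(labels_true, labels_pred, classes):
    """Rows are reference classes, columns are assigned clusters.

    Each non-empty row sums to 1: the diagonal entry is the fraction of
    correctly clustered items, off-diagonal entries are misclustered fractions.
    """
    counts = Counter(zip(labels_true, labels_pred))
    matrix = []
    for t in classes:
        row_total = sum(counts[(t, p)] for p in classes)
        matrix.append(
            [counts[(t, p)] / row_total if row_total else 0.0 for p in classes]
        )
    return matrix
```

For example, with reference labels `[0, 0, 0, 1]` and cluster labels `[0, 0, 1, 1]`, two thirds of class 0 fall on the diagonal and one third is misclustered into cluster 1.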
Average coefficient values of the kernels and Laplacians in the disease dataset, optimized by OKLC model 1
| Rank of θ | Source | θ value | Performance rank |
|---|---|---|---|
| 1 | LDDB kernel | 0.6113 | 1 |
| 2 | MESH kernel | 0.3742 | 6 |
| 3 | Uniprot kernel | 0.0095 | 5 |
| 4 | OMIM kernel | 0.0050 | 2 |
| 1 | LDDB Laplacian | 1 | 1 |
Sources assigned a coefficient of 0 are not shown. Performance is ranked by the average of the ARI and NMI values evaluated on each individual source (Supplementary Material 3).
Average coefficient values of the kernels and Laplacians in the journal dataset, optimized by OKLC model 3
| Rank of θ | Source | θ value | Performance rank |
|---|---|---|---|
| 1 | IDF kernel | 0.5389 | 1 |
| 2 | Binary kernel | 0.4520 | 2 |
| 3 | TF kernel | 0.2876 | 4 |
| 4 | TF-IDF kernel | 0.2376 | 3 |
| 1 | Bibliographic Laplacian | 0.7106 | 1 |
| 2 | Cocitation Laplacian | 0.5134 | 4 |
| 3 | Crosscitation Laplacian | 0.4450 | 2 |
| 4 | Binarycitation Laplacian | 0.1819 | 3 |
Average coefficient values of the kernels and Laplacians in the disease dataset, optimized by OKLC model 3
| Rank of θ | Source | θ value | Performance rank |
|---|---|---|---|
| 1 | LDDB kernel | 0.4578 | 1 |
| 2 | MESH kernel | 0.3495 | 6 |
| 3 | OMIM kernel | 0.3376 | 2 |
| 4 | SNOMED kernel | 0.3309 | 7 |
| 5 | MPO kernel | 0.3178 | 3 |
| 6 | GO kernel | 0.3175 | 8 |
| 7 | eVOC kernel | 0.3180 | 4 |
| 8 | Uniprot kernel | 0.3089 | 5 |
| 9 | KO kernel | 0.2143 | 9 |
| 1 | LDDB Laplacian | 0.6861 | 1 |
| 2 | MESH Laplacian | 0.2799 | 4 |
| 3 | OMIM Laplacian | 0.2680 | 2 |
| 4 | GO Laplacian | 0.2645 | 7 |
| 5 | eVOC Laplacian | 0.2615 | 6 |
| 6 | Uniprot Laplacian | 0.2572 | 8 |
| 7 | SNOMED Laplacian | 0.2559 | 5 |
| 8 | MPO Laplacian | 0.2476 | 3 |
| 9 | KO Laplacian | 0.2163 | 9 |
Fig. 3. Plot of the eigenvalues (A and B) of the optimal kernel-Laplacian combination obtained by all OKLC models. The parameter K is set equal to the number of reference labels.
Comparison of CPU time of all algorithms
| Algorithm | Disease data (s) | Journal data (s) |
|---|---|---|
| OKLC model 1 | 42.39 | 1011.4 |
| OKLC model 2 | 0.19 | 13.27 |
| OKLC model 3 | 37.74 | 577.51 |
| CSPA | 9.49 | 177.22 |
| HGPA | 10.13 | 182.51 |
| MCLA | 9.95 | 320.93 |
| QMI | 9.36 | 186.25 |
| EACAL | 9.74 | 205.59 |
| AdacVote | 9.22 | 172.12 |
The reported values are averaged over 20 repetitions. CPU time was measured in Matlab v7.6.0 on Windows XP2, running on a laptop computer with an Intel Core 2 Duo 2.26 GHz processor and 2 GB of memory.
Average coefficient values of the kernels and Laplacians in the journal dataset, optimized by OKLC model 1
| Rank of θ | Source | θ value | Performance rank |
|---|---|---|---|
| 1 | IDF kernel | 0.7574 | 1 |
| 2 | TF kernel | 0.2011 | 3 |
| 3 | Binary kernel | 0.0255 | 2 |
| 4 | TF-IDF kernel | 0.0025 | 4 |
| 1 | Bibliographic Laplacian | 1 | 1 |
Sources assigned a coefficient of 0 are not shown. Performance is ranked by the average of the ARI and NMI values evaluated on each individual source (Supplementary Material 5).