Renchu Guan, Chen Yang, Maurizio Marchese, Yanchun Liang, Xiaohu Shi.
Abstract
Rapid developments in the biomedical sciences have increased the demand for automatic clustering of biomedical publications. In contrast to current approaches to text clustering, which focus exclusively on the contents of abstracts, a novel method is proposed for clustering and analyzing complete biomedical article texts. To reduce dimensionality, the cosine coefficient is computed on a sub-space of only two vectors, instead of computing the Euclidean distance within the space of all vectors. A strategy and an algorithm for Semi-supervised Affinity Propagation (SSAP) are then introduced to improve analysis efficiency, using biomedical journal names as an evaluation background. Experimental results show that, by avoiding high-dimensional sparse matrix computations, SSAP outperforms conventional k-means methods and improves upon the standard Affinity Propagation algorithm. A directed relationship network and a distribution matrix constructed from the clustering results make overlaps in scope and interests among BioMed publications easy to identify, providing a valuable analytical tool for editors, authors and readers.
Year: 2014 PMID: 25250864 PMCID: PMC4177555 DOI: 10.1371/journal.pone.0108847
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
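The abstract describes computing the cosine coefficient on only the two vectors being compared, rather than a Euclidean distance in the space of all vectors. A minimal sketch of that idea follows, assuming plain whitespace tokenization and raw term counts; the helper names `term_vector` and `cosine_coefficient` and the three sample texts are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def term_vector(text):
    """Sparse bag-of-words vector: only terms that occur are stored."""
    return Counter(text.lower().split())


def cosine_coefficient(u, v):
    """Cosine of the angle between two sparse term vectors.

    Only the terms present in these two documents are touched, so no
    corpus-wide document-term matrix has to be built.
    """
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical example texts (assumption, for illustration only).
doc_a = term_vector("gene expression clustering of tumor samples")
doc_b = term_vector("clustering gene expression profiles")
doc_c = term_vector("malaria vector control in rural villages")

sim_ab = cosine_coefficient(doc_a, doc_b)  # three shared terms
sim_ac = cosine_coefficient(doc_a, doc_c)  # no shared terms
```

Pairwise similarities of this kind could serve as the input similarities for Affinity Propagation; any term weighting beyond raw counts is left out of this sketch.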
Top 10 topics in Biomed corpus.
| Serial Number | Name | Abbreviated Name | Number of Documents |
| 1 | BMC Bioinformatics | BMC Bioinformatics | 5022 |
| 2 | BMC Genomics | BMC Genomics | 4121 |
| 3 | BMC Public Health | BMC Public Health | 3758 |
| 4 | BMC Cancer | BMC Cancer | 3025 |
| 5 | Journal of Cardiovascular Magnetic Resonance | J Cardiovas Magn R | 2538 |
| 6 | Retrovirology | Retrovirology | 2478 |
| 7 | BMC Neuroscience | BMC Neuroscience | 2454 |
| 8 | BMC Evolutionary Biology | BMC Evo Biol | 1973 |
| 9 | Malaria Journal | Malaria J | 1968 |
| 10 | Journal of Medical Case Reports | J Med Case Rep | 1966 |
Figure 1. F-measure comparison.
K-means: k-means clustering; AP: Affinity Propagation clustering; SSAP: Semi-supervised Affinity Propagation.
Figure 2. Entropy comparison.
K-means: k-means clustering; AP: Affinity Propagation clustering; SSAP: Semi-supervised Affinity Propagation.
Figure 3. CPU execution time comparison.
K-means: k-means clustering; AP: Affinity Propagation clustering; SSAP: Semi-supervised Affinity Propagation.
The mean values over all experiments.
| Method | Mean F-measure | Mean Entropy | Mean CPU execution time (min) |
| SSAP | | | |
| AP | 0.650 | | 539.6 |
| k-means | 0.384 | 0.721 | 2866.8 |
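The F-measure and entropy reported here are external clustering measures that compare clusters against known labels (journal names in this study). The sketch below uses the definitions common in the text-clustering literature; whether the paper uses exactly these forms, and in particular the normalization of entropy by the logarithm of the number of classes, is an assumption. The function names and the two toy tables are hypothetical.

```python
import math


def f_measure(table):
    """Overall F-measure; table[i][j] = documents of class i in cluster j.

    Each class is matched to its best-scoring cluster, and the scores
    are averaged with weights proportional to class size.
    """
    n = sum(sum(row) for row in table)
    cluster_sizes = [sum(col) for col in zip(*table)]
    total = 0.0
    for row in table:
        class_size = sum(row)
        best = 0.0
        for n_ij, cluster_size in zip(row, cluster_sizes):
            if n_ij == 0:
                continue
            precision = n_ij / cluster_size
            recall = n_ij / class_size
            best = max(best, 2 * precision * recall / (precision + recall))
        total += class_size / n * best
    return total


def entropy(table):
    """Size-weighted mean cluster entropy, scaled to [0, 1] (assumption)."""
    n = sum(sum(row) for row in table)
    n_classes = len(table)
    if n_classes < 2:
        return 0.0
    total = 0.0
    for col in zip(*table):
        size = sum(col)
        if size == 0:
            continue
        h = -sum((c / size) * math.log(c / size) for c in col if c > 0)
        total += size / n * h
    return total / math.log(n_classes)


# Toy contingency tables (hypothetical, not data from the paper).
perfect = [[5, 0], [0, 5]]
mixed = [[4, 1], [1, 4]]
```

A higher F-measure and a lower entropy both indicate clusters that agree more closely with the journal labels.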
Figure 4. Directed relationship network based on SSAP clustering of BioMed journals.
Each node indicates a biomedical journal.
Parameter analysis of the directed relationship network for SSAP clustering of BioMed journals.
| Number of nodes: 10 | Number of inter edges: 27 |
| | Connected components: 1 |
| Network radius: 2 | Network diameter: 5 |
| Shortest paths: 73 | Network density: 0 |
| | Avg. number of neighbors: 4.4 |
Cluster distribution matrix.
| Cluster | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Amount |
| BMC Bioinformatics | | 1 | 3 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 40 |
| BMC Evo Biol | 9 | | 3 | 0 | 5 | 0 | 6 | 1 | 0 | 0 | 40 |
| BMC Genomics | 7 | 12 | | 0 | 9 | 0 | 1 | 0 | 4 | 0 | 40 |
| BMC Neuroscience | 3 | 0 | 1 | | | 0 | 5 | 0 | 5 | 0 | 40 |
| BMC Cancer | 0 | 0 | 0 | | | 6 | 0 | 5 | 0 | 0 | 40 |
| BMC Public Health | 0 | 0 | 0 | 0 | 2 | | | 1 | 0 | 0 | 40 |
| Malaria J | 0 | 1 | 2 | 1 | 2 | | | 0 | 1 | 0 | 40 |
| J Cardiovas Magn R | 0 | 0 | 0 | 0 | 0 | 1 | 1 | | 0 | 0 | 40 |
| Retrovirology | 0 | 1 | 0 | 1 | 2 | 6 | 0 | 0 | | 0 | 40 |
| J Med Case Rep | 0 | 0 | 0 | 0 | 16 | 1 | 0 | 6 | 3 | | 40 |
| Amount | 53 | 31 | 16 | 25 | 70 | 53 | 44 | 51 | 43 | 14 | 400 |
| Homologous texts | | | | | | | | | | | |
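Every journal row in the cluster distribution matrix totals 40, and the Amount row gives the size of each cluster, so a row with a single missing cell can be completed by subtraction and then checked against the column totals. The sketch below works through that arithmetic. The position of each gap (`None`) is an assumption inferred from where the rows are broken in this record, and the function names are hypothetical. Rows with two missing cells are left open because the totals alone do not determine them.

```python
ROW_TOTAL = 40
COLUMN_TOTALS = [53, 31, 16, 25, 70, 53, 44, 51, 43, 14]

# None marks a cell missing from this record (positions are assumed).
rows = {
    "BMC Bioinformatics": [None, 1, 3, 0, 2, 0, 0, 0, 0, 0],
    "BMC Evo Biol": [9, None, 3, 0, 5, 0, 6, 1, 0, 0],
    "BMC Genomics": [7, 12, None, 0, 9, 0, 1, 0, 4, 0],
    "BMC Neuroscience": [3, 0, 1, None, None, 0, 5, 0, 5, 0],
    "BMC Cancer": [0, 0, 0, None, None, 6, 0, 5, 0, 0],
    "BMC Public Health": [0, 0, 0, 0, 2, None, None, 1, 0, 0],
    "Malaria J": [0, 1, 2, 1, 2, None, None, 0, 1, 0],
    "J Cardiovas Magn R": [0, 0, 0, 0, 0, 1, 1, None, 0, 0],
    "Retrovirology": [0, 1, 0, 1, 2, 6, 0, 0, None, 0],
    "J Med Case Rep": [0, 0, 0, 0, 16, 1, 0, 6, 3, None],
}


def fill_single_gaps(rows, row_total):
    """Complete rows that have exactly one missing cell."""
    filled = {}
    for name, cells in rows.items():
        cells = list(cells)
        gaps = [i for i, c in enumerate(cells) if c is None]
        if len(gaps) == 1:
            known = sum(c for c in cells if c is not None)
            cells[gaps[0]] = row_total - known
        filled[name] = cells
    return filled


def check_columns(filled, column_totals):
    """Map cluster number -> True/False for columns with no gaps left."""
    result = {}
    for j, total in enumerate(column_totals):
        column = [cells[j] for cells in filled.values()]
        if None not in column:
            result[j + 1] = sum(column) == total
    return result


filled = fill_single_gaps(rows, ROW_TOTAL)
column_checks = check_columns(filled, COLUMN_TOTALS)
```

Under these assumptions the values recovered by subtraction are 34 (BMC Bioinformatics, cluster 1), 16 (BMC Evo Biol, cluster 2), 7 (BMC Genomics, cluster 3), 38 (J Cardiovas Magn R, cluster 8), 30 (Retrovirology, cluster 9) and 14 (J Med Case Rep, cluster 10), and each agrees with the corresponding column total.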