Dejian Yu, Wanru Wang, Shuai Zhang, Wenyu Zhang, Rongyu Liu.
Abstract
Detecting research topics within a specific research field has attracted attention from researchers in the bibliometrics community. Two problems affect the clustering of papers: the different distributions of citation links and the textual features involved both influence similarity computation. To address them, the authors propose a hybrid self-optimized clustering model that detects research topics by extending the hybrid clustering model used to identify "core documents". First, the Amsler network, which consists of bibliographic coupling and co-citation links, is created, and the citation-based similarity is calculated as the cosine similarity between papers. Second, cosine similarity is also used to compute the text-based similarity, which combines statistical and topological features of the text. Then, the cosine similarity of the linear combination of the citation- and text-based similarities is taken as the hybrid similarity. Finally, the Louvain method is applied to cluster the papers, and terms selected by term frequency are used to label the clusters. To test the proposed model, a dataset from the data envelopment analysis (DEA) field is used to compare and analyze clustering results. Against the benchmark built for this purpose, clustering methods with different citation links or textual features are compared according to evaluation measures. The results show that the proposed model obtains reasonable and effective clustering results; the research topics of the DEA field are also analyzed with the proposed model. Because the proposed model considers different features from previous hybrid clustering models, it may inspire further studies on topic identification by other researchers.
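The similarity steps described in the abstract can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the toy matrices, and the choice of weights `alpha` and `1 - alpha` for the two similarities are hypothetical, and the text features stand in for whatever statistical and topological term weights the model actually uses.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cosine_matrix(rows):
    """Pairwise cosine similarity between the rows of a matrix."""
    n = len(rows)
    return [[cosine(rows[i], rows[j]) for j in range(n)] for i in range(n)]


def hybrid_similarity(link_strength, term_features, alpha=0.5):
    """Cosine similarity of a linear combination of citation- and text-based similarity.

    link_strength: square matrix of link strengths between papers (assumed input).
    term_features: one row of term weights per paper (assumed input).
    alpha: weight of the citation-based part; 1 - alpha is an assumed text weight.
    """
    s_cit = cosine_matrix(link_strength)
    s_text = cosine_matrix(term_features)
    n = len(s_cit)
    combined = [
        [alpha * s_cit[i][j] + (1 - alpha) * s_text[i][j] for j in range(n)]
        for i in range(n)
    ]
    return cosine_matrix(combined)


# Toy data: papers 0-2 and papers 3-5 form two groups (hypothetical values).
links = [
    [0, 2, 2, 0, 0, 0],
    [2, 0, 2, 0, 0, 0],
    [2, 2, 0, 0, 0, 0],
    [0, 0, 0, 0, 2, 2],
    [0, 0, 0, 2, 0, 2],
    [0, 0, 0, 2, 2, 0],
]
terms = [
    [1, 1, 0, 0],
    [1, 2, 0, 0],
    [2, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 2, 1],
    [0, 1, 1, 2],
]
hybrid = hybrid_similarity(links, terms, alpha=0.5)
```

The resulting matrix could then be passed as edge weights to any Louvain implementation; that step is omitted here to keep the sketch free of third-party dependencies.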
Year: 2017 PMID: 29077747 PMCID: PMC5659815 DOI: 10.1371/journal.pone.0187164
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Table 1. Related work on clustering methods based on citation links and textual features.
| Methods | Similarity computation | References |
|---|---|---|
| | Co-citation network | Small [ |
| | Bibliographic coupling network | Kessler [ |
| | Amsler network | Amsler [ |
| | Co-word analysis | Callon et al. [ |
| | Text mining methods | Boyack et al. [ |
| | Combination of co-citation and co-word analysis | Braam et al. [ |
| | Combination of bibliographic coupling and TF-IDF | Glänzel and Thijs [ |
| | Combination of cross-citation network and TF-IDF | Liu et al. [ |
Table 2. Symbols used in this study and their corresponding explanations.
| Symbols | Explanations |
|---|---|
| | Weight of citation-based similarity |
| | Weight of text-based similarity |
| | Weight of statistical feature of terms |
| | Two papers in the example |
| | The matrix of the total links and their strength of all pairs of papers based on the Amsler network |
| | Matrix of the citation-based similarity between papers |
| | Frequency of term |
| | Citation-based similarity between |
| | Total number of papers |
| | Number of papers containing term |
| | Two nodes of the terms connected in paper |
| | Statistical feature of term |
| | Importance of terms |
| | Total number of performed walks |
| | Number of times that node |
| | Transition probability that node |
| | Diversity of node |
| | Outward accessibility of node |
| | Text-based similarity between papers |
| | Total number of terms appearing in papers |
| | Hybrid similarity between |
| | Weight of the edges between vertex |
| | Sum of the weights of the edges attached to vertexes |
| | The clusters to which vertexes |
| | Sum of the weights of the links from |
| | Sum of the weights of all the links in the network |
| Σin | Sum of the weights of the links inside a community |
| Σtot | Sum of the weights of the links incident to nodes in a community |
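The last rows of the symbol table are the quantities that enter the modularity optimized by the Louvain method. As a minimal sketch, assuming an undirected weighted graph stored as a dictionary of edges (the names `weights`, `community`, `modularity`, and `modularity_gain` are hypothetical), modularity and the gain from moving one vertex can be computed by brute force. Louvain implementations use an incremental formula built on Σin and Σtot instead of recomputing the full sum.

```python
def modularity(weights, community):
    """Modularity Q = (1/2m) * sum_ij [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j).

    weights: {(i, j): w} with each undirected edge listed once and i != j.
    community: {vertex: community label} covering every vertex.
    """
    degree = {v: 0.0 for v in community}  # k_i: sum of edge weights at vertex i
    for (i, j), w in weights.items():
        degree[i] += w
        degree[j] += w
    two_m = 2.0 * sum(weights.values())   # m: sum of all edge weights

    q = 0.0
    # A_ij term: every undirected edge occurs twice in the double sum.
    for (i, j), w in weights.items():
        if community[i] == community[j]:
            q += 2.0 * w
    # Null-model term k_i * k_j / (2m) over all ordered pairs in one community.
    for i in community:
        for j in community:
            if community[i] == community[j]:
                q -= degree[i] * degree[j] / two_m
    return q / two_m


def modularity_gain(weights, community, vertex, target):
    """Change in modularity when `vertex` is moved into community `target`."""
    moved = dict(community)
    moved[vertex] = target
    return modularity(weights, moved) - modularity(weights, community)


# Toy graph: two triangles joined by the single edge (2, 3), all weights 1.
edges = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0,
         (3, 4): 1.0, (3, 5): 1.0, (4, 5): 1.0,
         (2, 3): 1.0}
partition = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
q_triangles = modularity(edges, partition)
gain_bad_move = modularity_gain(edges, partition, 2, "B")
```

In this toy graph, moving vertex 2 out of its triangle lowers modularity, so a Louvain pass would leave it where it is.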
Fig 1. Flowchart of the proposed hybrid self-optimized clustering model.
Fig 2. In-links and out-links between two papers.
In the example, the first paper is cited by three papers and the second by four; the two citing papers they share are the common in-links between the two papers. Similarly, the first paper has four references and the second has three; the two references they share are the common out-links between the two papers.
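The counting in this example can be reproduced with a short sketch. It assumes citations are given as `(citing, cited)` pairs and that the Amsler link strength is the plain sum of common in-links (co-citation) and common out-links (bibliographic coupling); the paper labels and the function name `amsler_strength` are hypothetical, and the original model may weight or normalize the two counts differently.

```python
def amsler_strength(citations, a, b):
    """Count common in-links, common out-links, and their sum for papers a and b.

    citations: iterable of (citing_paper, cited_paper) pairs.
    Returns (co_citation_count, bibliographic_coupling_count, total).
    """
    cited_by = {}    # paper -> set of papers that cite it (its in-links)
    references = {}  # paper -> set of papers it cites (its out-links)
    for citing, cited in citations:
        cited_by.setdefault(cited, set()).add(citing)
        references.setdefault(citing, set()).add(cited)

    common_in = cited_by.get(a, set()) & cited_by.get(b, set())
    common_out = references.get(a, set()) & references.get(b, set())
    return len(common_in), len(common_out), len(common_in) + len(common_out)


# Mirrors the figure: "a" has 3 citing papers and 4 references,
# "b" has 4 citing papers and 3 references, with 2 shared in each direction.
example = [
    ("c1", "a"), ("c2", "a"), ("c3", "a"),
    ("c2", "b"), ("c3", "b"), ("c4", "b"), ("c5", "b"),
    ("a", "r1"), ("a", "r2"), ("a", "r3"), ("a", "r4"),
    ("b", "r3"), ("b", "r4"), ("b", "r5"),
]
co_citation, coupling, total = amsler_strength(example, "a", "b")
```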
Fig 3. Basic statistics of DEA-related papers per year from 1980 to 2017.
(A) Number of papers. (B) Number of citations and references of the papers.
Fig 4. Values of F1 measure and RI with different values of parameter α.
Table 3. Different clustering methods with different parameter settings.
| Groups | Methods | Parameter sets | Citation networks | Textual features |
|---|---|---|---|---|
| | Amsler+TF-IDF+OA | | Amsler | Statistical and topological features |
| | Amsler | | Amsler | — |
| | TF-IDF+OA | | — | Statistical and topological features |
| | Amsler+TF-IDF+OA | | Amsler | Statistical and topological features |
| | BC+TF-IDF+OA | | Bibliographic coupling | Statistical and topological features |
| | CoC+TF-IDF+OA | | Co-citation | Statistical and topological features |
| | Amsler+TF-IDF+OA | | Amsler | Statistical and topological features |
| | Amsler+TF-IDF | | Amsler | Statistical feature |
| | Amsler+OA | | Amsler | Topological feature |
| | Amsler+TF-IDF+OA | | Amsler | Statistical and topological features |
| | BC+TF-IDF | | Bibliographic coupling | Statistical feature |
| | CoC+TF-IDF | | Co-citation | Statistical feature |
Fig 5. Values of evaluation measures of different methods.
The evaluation measures are precision, recall, F1 measure, and RI; the methods compared are the models listed in Table 3.
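One common way to compute these four measures for a clustering against a benchmark is pair counting, sketched below. This assumes that RI denotes the Rand index and that precision and recall are defined over pairs of papers; the record does not confirm either point, and the function name and toy labels are hypothetical.

```python
from itertools import combinations


def pair_counting_scores(truth, predicted):
    """Precision, recall, F1, and Rand index over all pairs of items.

    truth, predicted: equal-length sequences of cluster labels per paper.
    A pair is a true positive if both labelings put it in the same cluster.
    """
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = predicted[i] == predicted[j]
        if same_truth and same_pred:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_truth:
            fn += 1
        else:
            tn += 1

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    total = tp + fp + fn + tn
    rand_index = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "rand_index": rand_index}


# Toy benchmark of six papers in two topics, with one paper misplaced.
scores = pair_counting_scores([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1])
```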
Fig 6. Information on each cluster based on the proposed model.
For each of the seven main clusters, it includes the total number of papers, the top 30 high-frequency terms, and the trend in the number of papers by year.
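Labeling clusters by high-frequency terms can be illustrated with a small sketch. It assumes each paper comes with a list of extracted terms and that frequency is counted over all term occurrences in a cluster; the function name, the toy terms, and the value of `top_k` are hypothetical (the figure reports the top 30 terms per cluster).

```python
from collections import Counter


def label_clusters(paper_terms, assignment, top_k=30):
    """Return the top_k most frequent terms of each cluster.

    paper_terms: {paper id: list of terms extracted from that paper}.
    assignment: {paper id: cluster label}.
    """
    counts = {}
    for paper, terms in paper_terms.items():
        counts.setdefault(assignment[paper], Counter()).update(terms)
    return {
        cluster: [term for term, _ in counter.most_common(top_k)]
        for cluster, counter in counts.items()
    }


# Toy input with two clusters (hypothetical terms).
toy_terms = {
    "p1": ["dea", "efficiency", "bank"],
    "p2": ["dea", "efficiency"],
    "p3": ["dea"],
    "p4": ["malmquist", "productivity", "dea"],
    "p5": ["malmquist", "productivity"],
    "p6": ["malmquist"],
}
toy_assignment = {"p1": 0, "p2": 0, "p3": 0, "p4": 1, "p5": 1, "p6": 1}
labels = label_clusters(toy_terms, toy_assignment, top_k=2)
```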