Yuanchao Liu, Ming Liu, Xin Wang.
Abstract
The objective of text clustering is to divide document collections into clusters based on the similarity between documents. In this paper, an extension-based feature modeling approach towards semantically sensitive text clustering is proposed along with the corresponding feature space construction and similarity computation method. By combining the similarity in traditional feature space and that in extension space, the adverse effects of the complexity and diversity of natural language can be addressed and clustering semantic sensitivity can be improved correspondingly. The generated clusters can be organized using different granularities. The experimental evaluations on well-known clustering algorithms and datasets have verified the effectiveness of our approach.
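The idea of combining similarity in the traditional feature space with similarity in the extension space can be illustrated with a minimal sketch. The record does not give the paper's actual formula, so everything here is an assumption: the linear weighting, the use of cosine similarity in both spaces, the function names `cosine` and `blended_similarity`, the parameter `blend` (named after the "blend factor" in the Fig 5 caption), and the toy vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def blended_similarity(orig_a, orig_b, ext_a, ext_b, blend=0.5):
    """Hypothetical linear blend of two similarities (an assumption,
    not the paper's formula). blend=0 uses only the original features,
    blend=1 uses only the extension features."""
    if not 0.0 <= blend <= 1.0:
        raise ValueError("blend must lie in [0, 1]")
    sim_orig = cosine(orig_a, orig_b)
    sim_ext = cosine(ext_a, ext_b)
    return (1.0 - blend) * sim_orig + blend * sim_ext


# Toy documents that share no surface keywords...
doc_a = {"hockey": 1.0, "league": 1.0}
doc_b = {"skating": 1.0, "team": 1.0}
# ...but whose (made-up) extension features overlap on "sports".
ext_a = {"sports": 1.0, "team": 1.0}
ext_b = {"sports": 1.0, "skate": 1.0}

for blend in (0.0, 0.5, 1.0):
    score = blended_similarity(doc_a, doc_b, ext_a, ext_b, blend)
    print(f"blend={blend:.1f} similarity={score:.3f}")
```

In this toy case the two documents have zero similarity on their original keywords, and the score rises only as the blend factor gives weight to the extension space, which is the kind of effect the abstract attributes to semantic extension.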
Year: 2015 PMID: 25794172 PMCID: PMC4367988 DOI: 10.1371/journal.pone.0117390
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. The extension at the word level.
Fig 2. The LDA model.
Fig 3. Selection of keywords by using lexicon chain technology (CNumber is the index of different lexicon chains).
Fig 4. An example for demonstrating the effect of semantic extension on document level.
The two datasets.
| No. | # of keywords | k1 | k2 | # of documents | Source |
|---|---|---|---|---|---|
| 1 | 4296 | 10 | 50 | 1000 | |
| 2 | 63204 | 8 | 20 | 19997 | 20 Newsgroups |
Fig 5. Clustering performance for different granularity and different blend factor.
(A). Evaluation method: accuracy; Clustering method: k-means; Dataset: dataset 1; (B). Evaluation method: accuracy; Clustering method: SOM; Dataset: dataset 1; (C). Evaluation method: BF; Clustering method: k-means; Dataset: dataset 1; (D). Evaluation method: BF; Clustering method: SOM; Dataset: dataset 1; (E). Evaluation method: accuracy; Clustering method: k-means; Dataset: dataset 2; (F). Evaluation method: accuracy; Clustering method: SOM; Dataset: dataset 2; (G). Evaluation method: BF; Clustering method: k-means; Dataset: dataset 2; (H). Evaluation method: BF; Clustering method: SOM; Dataset: dataset 2.
Examples of retrieved concept relevance records (partial; some phrase records have been omitted).
| Words | Some retrieved concept relevance records |
|---|---|
| hockey | Rank, dribble, sports, blazer, jacket, playsuit, league, sports, team, team, exercise, prolusion, roller-skating, skating, warm-up, exercising, skate, infield, outfield, baseball |
| guns | lethality, overkill, backlash, rebound, recoil, arsenal, clip, bare-handed, unarmed, bare-handed, defenceless, unarmed, recoil, nuclearcapable, antisatellite, bulletproof, nitroglycerin, nitroglycerine, trinitroglycerin |
| religion | bodhi, cathedral, church, convent, saintdom, sainthood, confess, confession, baptise, baptism, reincarnation, transmigration, samsara, cassock, dalmatic, frock, kasaya, Buddhism, Lamaism |
| Sirius | constellation, skylab, observatory, planetarium, interplanetary, Altair, Andromeda, Aries, Cancer, Canopus, Cassiopeia, Draco, Galaxy, Gemini, Jupiter, Libra, Lyra, Mars, Mercury, Monoceros, Neptune |
| motorcycle | traffic, drive, fecundity, fertility, elevation, intercontinental, motel, magnitude, automobile, boxcar, bulldozer, cable car, caboose, car, caravan, carriage, chariot, dozer, engine, jeep, limousine |
| medicine | covalence, covalency, valence, valency, immunity, kelp, tremella, dosage, dose, crust, anaesthesia, anaesthetize, anesthesia, anesthetize, aftereffect, banxia, broomrape, calamus, cardmom, cardoon, costusroot |
Examples of the first 6 topics discovered by gibbsLDA.
| Topics | Some words and their scores |
|---|---|
| Topic 0 | score 0.040230, league 0.028905, lead 0.024801, quarter 0.014953, valley 0.013476, rebound 0.012819, orange 0.012491, hill 0.010357, la 0.010357, game 0.009701, grove 0.009372, christian 0.009044, goal 0.009044, garden 0.008059, host 0.007895, visit 0.007239, kennedi 0.007239, victory 0.006910, led 0.006582, ad 0.006582 |
| Topic 1 | women 0.035207, research 0.021928, birth 0.017980, control 0.012956, develop 0.007572, method 0.006137, grant 0.006137, universe 0.005778, delivery 0.005778, technology 0.005419, reason 0.005419, contracept 0.005419, panel 0.005060, materia 0.005060, differ 0.004701, promote 0.004701, superconduct 0.004701, institut 0.004701, program 0.004701, basic 0.004343 |
| Topic 2 | play 0.018470, helen 0.010573, fugard 0.010573, martin 0.009696, south 0.009257, artist 0.008818, miss 0.008379, road 0.005747, benson 0.005747, coast 0.005747, central 0.005308, taper 0.005308, playwright 0.004870, April 0.004431, jan 0.004431, mecca 0.004431, stage 0.004431, house 0.004431, prize 0.003992, scr 0.003554 |
| Topic 3 | music 0.025085, bylin 0.020851, program 0.013795, perform 0.013442, review 0.012736, dance 0.012384, symphony 0.010972, opera 0.009208, sound 0.008855, type 0.008150, local 0.007444, concert 0.006739, sud 0.006739, classic 0.006386, festiv 0.005327, passion 0.005327, headlin 0.005327, orchestra 0.004975, predict 0.004622, director 0.004622 |
| Topic 4 | station 0.037361, radio 0.021366, channel 0.016035, contest 0.013985, listen 0.012344, offer 0.007013, servic 0.006603, local 0.006193, roll 0.006193, call 0.005782, fm 0.005372, pai 0.004962, goodbi 0.004552, suit 0.004142, include 0.004142, meanwhil 0.003732, clean 0.003322, car 0.003322, promotion 0.003322, award 0.002912 |
| Topic 5 | office 0.026614, plant 0.023117, city 0.020669, depart 0.018570, energy 0.016472, nuclear 0.013324, oil 0.010876, asbesto 0.010177, worker 0.009478, oper 0.008778, power 0.006680, six 0.005980, action 0.005980, cover 0.005631, defense 0.005631, reactor 0.005631, remove 0.005281, control 0.004581, date 0.004232, manage 0.004232 |
Fig 6. The impact of feature selection on clustering results.
(A). Evaluation method: accuracy; Clustering method: k-means; Dataset: dataset 1; (B). Evaluation method: accuracy; Clustering method: SOM; Dataset: dataset 1; (C). Evaluation method: BF; Clustering method: k-means; Dataset: dataset 1; (D). Evaluation method: BF; Clustering method: SOM; Dataset: dataset 1; (E). Evaluation method: accuracy; Clustering method: k-means; Dataset: dataset 2; (F). Evaluation method: accuracy; Clustering method: SOM; Dataset: dataset 2; (G). Evaluation method: BF; Clustering method: k-means; Dataset: dataset 2; (H). Evaluation method: BF; Clustering method: SOM; Dataset: dataset 2.
Comparison of time consumed with and without semantic extension on dataset 1 (unit: s; *-A: without extension; *-B: with extension).
| K | K-means-A | K-means-B | SOM-A | SOM-B |
|---|---|---|---|---|
| K = 10 | 22.06 | 36.49 | 29.63 | 38.26 |
| K = 50 | 39.82 | 52.83 | 43.29 | 56.31 |
Comparison of time consumed with and without semantic extension on dataset 2 (unit: s; *-A: without extension; *-B: with extension).
| K | K-means-A | K-means-B | SOM-A | SOM-B |
|---|---|---|---|---|
| K = 8 | 685.32 | 1335.02 | 776.36 | 1493.21 |
| K = 20 | 2219.84 | 3468.35 | 2637.98 | 4137.24 |
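To make the cost of semantic extension easier to read off the two timing tables, the following sketch computes the ratio of time with extension to time without extension for each row. The numbers are copied from the tables; the helper name `overhead_ratio`, the dictionary layout, and the reading of the times as seconds are assumptions made for illustration.

```python
# Timings copied from the two tables: (without extension, with extension).
timings = {
    ("dataset 1", "K = 10"): {"k-means": (22.06, 36.49), "SOM": (29.63, 38.26)},
    ("dataset 1", "K = 50"): {"k-means": (39.82, 52.83), "SOM": (43.29, 56.31)},
    ("dataset 2", "K = 8"): {"k-means": (685.32, 1335.02), "SOM": (776.36, 1493.21)},
    ("dataset 2", "K = 20"): {"k-means": (2219.84, 3468.35), "SOM": (2637.98, 4137.24)},
}


def overhead_ratio(without_ext, with_ext):
    """Hypothetical helper: how many times slower the run with extension is."""
    return with_ext / without_ext


ratios = {
    key: {alg: overhead_ratio(a, b) for alg, (a, b) in row.items()}
    for key, row in timings.items()
}

for (dataset, k), row in ratios.items():
    for alg, ratio in row.items():
        print(f"{dataset}, {k}, {alg}: {ratio:.2f}x")
```

On these figures the runs with extension take roughly 1.3 to 1.95 times as long as the runs without it, and the largest relative overhead appears on dataset 2 at K = 8.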