| Literature DB >> 27579031 |
Wen Zhang1, Fan Xiao1, Bin Li1, Siguang Zhang2.
Abstract
Recently, LSI (Latent Semantic Indexing) based on SVD (Singular Value Decomposition) has been proposed to overcome the problems of polysemy and homonymy in traditional lexical matching. However, although LSI has been validated as having good representative quality, it is often criticized for its low discriminative power in representing documents. In this paper, SVD on clusters is proposed to improve the discriminative power of LSI. The contribution of this paper is threefold. First, we survey existing linear algebra methods for LSI, including both SVD-based and non-SVD-based methods. Second, we propose SVD on clusters for LSI and explain theoretically that it involves two manipulations: dimension expansion of document vectors and dimension projection using SVD. Moreover, we develop updating processes to fold new documents and terms into a matrix decomposed by SVD on clusters. Third, two corpora, a Chinese corpus and an English corpus, are used to evaluate the performance of the proposed methods. Experiments demonstrate that, to some extent, SVD on clusters can improve the precision of interdocument similarity measurement in comparison with other SVD-based LSI methods.
Year: 2016 PMID: 27579031 PMCID: PMC4992544 DOI: 10.1155/2016/1096271
Source DB: PubMed Journal: Comput Intell Neurosci
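The abstract describes SVD on clusters as two manipulations: dimension expansion of document vectors and dimension projection using SVD. A minimal sketch of that general idea follows, assuming precomputed cluster labels; each cluster is decomposed separately and its reduced document vectors occupy their own coordinate block. This is an illustration, not the authors' exact algorithm, and the matrix, labels, and `rank` are hypothetical.

```python
import numpy as np

def truncated_svd_docs(A, rank):
    # Reduced document vectors from a rank-`rank` SVD of term-document matrix A;
    # each column of A (a document) becomes a row of length `rank`.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (np.diag(s[:rank]) @ Vt[:rank]).T

def svd_on_clusters(A, labels, rank):
    # Decompose each document cluster separately and place each cluster's
    # reduced vectors in its own coordinate block (dimension expansion),
    # so documents from different clusters share no latent dimensions.
    n_docs = A.shape[1]
    k = labels.max() + 1
    docs = np.zeros((n_docs, k * rank))
    for c in range(k):
        idx = np.where(labels == c)[0]  # assumes each cluster has >= `rank` documents
        docs[idx, c * rank:(c + 1) * rank] = truncated_svd_docs(A[:, idx], rank)
    return docs

# Toy example: 20 hypothetical terms, 10 documents, 2 precomputed clusters.
rng = np.random.default_rng(0)
A = rng.random((20, 10))
labels = np.array([0] * 5 + [1] * 5)
docs = svd_on_clusters(A, labels, rank=2)
print(docs.shape)         # (10, 4)
print(docs[0] @ docs[7])  # 0.0: cross-cluster documents are orthogonal here
```

The block structure is what can raise discriminative power: documents in different clusters are pushed apart (here, to exact orthogonality) while within-cluster similarity is still measured in a smoothed latent space.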
Existing linear algebra methods for LSI.
| Category | Abbreviation | Full name |
|---|---|---|
| SVD based decomposition for term-document matrix | IRR | Iterative Residual Rescaling |
| | SVR | Singular Value Rescaling |
| | ADE | Approximate Dimension Equalization |
| Non-SVD based decomposition for term-document matrix | SDD | Semidiscrete Decomposition |
| | LPI | Locality Preserving Indexing |
| | R-SVD | Riemannian-SVD |
F-measures of clustering results produced by k-Means and SOMs on Chinese and English documents.
| Corpus | k-Means clustering | SOMs clustering |
|---|---|---|
| Chinese | 0.7367 | 0.6046 |
| English | 0.7697 | 0.6534 |
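The table above reports clustering quality as an F-measure. A common definition (assumed here; the paper may use a variant) matches each true class with the predicted cluster that maximizes F1 and averages the best scores weighted by class size:

```python
def clustering_f_measure(true_labels, pred_labels):
    # For each true class, take the best F1 over all predicted clusters,
    # then combine the best scores weighted by class size.
    n = len(true_labels)
    total = 0.0
    for cls in set(true_labels):
        cls_idx = {i for i, t in enumerate(true_labels) if t == cls}
        best = 0.0
        for clu in set(pred_labels):
            clu_idx = {i for i, p in enumerate(pred_labels) if p == clu}
            overlap = len(cls_idx & clu_idx)
            if overlap:
                precision = overlap / len(clu_idx)
                recall = overlap / len(cls_idx)
                best = max(best, 2 * precision * recall / (precision + recall))
        total += len(cls_idx) / n * best
    return total

print(clustering_f_measure([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0 (perfect, labels permuted)
print(clustering_f_measure([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.5
```

Note that the measure is invariant to cluster label permutation, so a perfect clustering scores 1.0 even when cluster IDs differ from class IDs.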
Similarity measure on English documents for SVD on clusters and other SVD-based LSI methods. PR abbreviates “preservation rate”; the best performance in each row (measured by average precision) is marked in bold type.
| PR | SVD | SVDC (k-Means) | SVDC (SOMs) | SVR | ADE | IRR |
|---|---|---|---|---|---|---|
| 1.0 | | | | 0.4202 ± 0.0156 | 0.3720 ± 0.0253 | 0.3927 ± 0.0378 |
| 0.9 | 0.4382 ± 0.0324 | 0.4394 ± 0.0065 | | 0.4202 ± 0.0197 | 0.2890 ± 0.0271 | 0.3929 ± 0.0207 |
| 0.8 | 0.4398 ± 0.0185 | 0.4425 ± 0.0119 | | 0.4202 ± 0.0168 | 0.3293 ± 0.0093 | 0.3927 ± 0.0621 |
| 0.7 | 0.4420 ± 0.0056 | | 0.4385 ± 0.0287 | 0.4089 ± 0.0334 | 0.3167 ± 0.0173 | 0.3928 ± 0.0274 |
| 0.6 | 0.4447 ± 0.0579 | | 0.4462 ± 0.0438 | 0.4201 ± 0.0132 | 0.3264 ± 0.0216 | 0.3942 ± 0.0243 |
| 0.5 | 0.4475 ± 0.0431 | | 0.4487 ± 0.0367 | 0.4203 ± 0.0369 | 0.3338 ± 0.0295 | 0.3946 ± 0.0279 |
| 0.4 | 0.4499 ± 0.0089 | | 0.4498 ± 0.0194 | 0.4209 ± 0.0234 | 0.3377 ± 0.0145 | 0.3951 ± 0.0325 |
| 0.3 | 0.4516 ± 0.0375 | | 0.4396 ± 0.0309 | 0.4222 ± 0.0205 | 0.3409 ± 0.0247 | 0.3970 ± 0.0214 |
| 0.2 | 0.4538 ± 0.0654 | | 0.4372 ± 0.0243 | 0.4227 ± 0.0311 | 0.3761 ± 0.0307 | 0.3990 ± 0.0261 |
| 0.1 | 0.4553 ± 0.0247 | | 0.4298 ± 0.0275 | 0.4229 ± 0.0308 | 0.4022 ± 0.0170 | 0.3956 ± 0.0185 |
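The PR column controls how much of the SVD spectrum each method keeps. Under one plausible reading (an assumption here, since the paper defines PR in its full text), the reduced rank is the smallest k whose singular values preserve at least that fraction of the total spectrum mass:

```python
import numpy as np

def rank_for_preservation_rate(s, pr):
    # Smallest k such that the first k singular values hold at least
    # the fraction `pr` of the total singular-value mass.
    cumulative = np.cumsum(s) / np.sum(s)
    return int(np.searchsorted(cumulative, pr)) + 1

s = np.array([4.0, 2.0, 1.0, 1.0])         # hypothetical singular-value spectrum
print(rank_for_preservation_rate(s, 0.8))  # 3
print(rank_for_preservation_rate(s, 0.5))  # 1
```

Lower PR values discard more trailing singular values, which is consistent with the tables: the SVD-family methods change noticeably as PR decreases, while a method applied at PR = 1.0 keeps the full decomposition.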
Similarity measure on Chinese documents for SVD on clusters and other SVD-based LSI methods. PR abbreviates “preservation rate”; the best performance in each row (measured by average precision) is marked in bold type.
| PR | SVD | SVDC (k-Means) | SVDC (SOMs) | SVR | ADE | IRR |
|---|---|---|---|---|---|---|
| 1.0 | | | | 0.4272 ± 0.0200 | 0.3632 ± 0.0286 | 0.2730 ± 0.0168 |
| 0.9 | 0.4312 ± 0.0279 | | 0.4463 ± 0.0245 | 0.4272 ± 0.0186 | 0.3394 ± 0.0303 | 0.2735 ± 0.0238 |
| 0.8 | 0.4358 ± 0.0422 | | 0.4458 ± 0.0239 | 0.4273 ± 0.0209 | 0.3136 ± 0.0137 | 0.2735 ± 0.0109 |
| 0.7 | 0.4495 ± 0.0387 | | 0.4573 ± 0.0146 | 0.4273 ± 0.0128 | 0.3075 ± 0.0068 | 0.2732 ± 0.0127 |
| 0.6 | 0.4550 ± 0.0176 | | 0.4547 ± 0.0294 | 0.4273 ± 0.0305 | 0.3006 ± 0.0208 | 0.2730 ± 0.0134 |
| 0.5 | 0.4573 ± 0.0406 | | 0.4588 ± 0.0164 | 0.4273 ± 0.0379 | 0.2941 ± 0.0173 | 0.2729 ± 0.0141 |
| 0.4 | 0.4587 ± 0.0395 | 0.4624 ± 0.0098 | | 0.4275 ± 0.0294 | 0.2857 ± 0.0194 | 0.2726 ± 0.0290 |
| 0.3 | 0.4596 ± 0.0197 | | 0.4582 ± 0.0203 | 0.4285 ± 0.0305 | 0.2727 ± 0.0200 | 0.2666 ± 0.0242 |
| 0.2 | 0.4602 ± 0.0401 | | 0.4432 ± 0.0276 | 0.4305 ± 0.0190 | 0.2498 ± 0.0228 | 0.2672 ± 0.0166 |
| 0.1 | 0.4617 ± 0.0409 | | 0.4513 ± 0.0188 | 0.4343 ± 0.0193 | 0.3131 ± 0.0146 | 0.2557 ± 0.0188 |
Results of t-tests on the similarity-measure performance of SVD on clusters and other SVD-based LSI methods on the English corpus.
| Method | SVDC with SOMs clustering | SVD |
|---|---|---|
| SVDC with k-Means clustering | ≫ | ≫ |
| SVDC with SOMs clustering | | > |
Results of t-tests on the similarity-measure performance of SVD on clusters and other SVD-based LSI methods on the Chinese corpus.
| Method | SVDC with SOMs clustering | SVD |
|---|---|---|
| SVDC with k-Means clustering | > | > |
| SVDC with SOMs clustering | | ~ |
Figure 1. Similarity measure of SVDC with k-Means clustering and SVD for updating; the preservation rates of their approximation matrices are set to 0.8.
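Figure 1 concerns updating a decomposed matrix with new documents. As a reference point, the classical LSI fold-in (shown below; the paper develops its own updating process for SVD on clusters) projects a new document's term vector into the existing rank-k latent space without recomputing the SVD. The matrix and rank here are hypothetical.

```python
import numpy as np

def fold_in_document(U_k, s_k, d):
    # Coordinates of a new document vector d in the existing rank-k
    # latent space: d_hat = S_k^{-1} U_k^T d.
    return (U_k.T @ d) / s_k

rng = np.random.default_rng(1)
A = rng.random((30, 12))  # hypothetical term-document matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 4
coords = fold_in_document(U[:, :k], s[:k], A[:, 3])
# Folding in an existing column reproduces its known latent coordinates.
print(np.allclose(coords, Vt[:k, 3]))  # True
```

Fold-in is cheap (one matrix-vector product per new document) but does not update the latent space itself, which is why periodic recomputation, or an updating scheme like the one the paper proposes, is needed as a collection grows.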