Lana Yeganova, Sun Kim, Grigory Balasanov, W John Wilbur.
Abstract
BACKGROUND: Organizing a large document collection in a way that facilitates human comprehension has become crucial as the volume of available information grows. Two common approaches to providing a broad overview of the information space are document clustering and topic modeling. Clustering aims to group documents or terms into meaningful clusters. Topic modeling, on the other hand, focuses on finding coherent keywords that describe the topics appearing in a set of documents. There have also been efforts to cluster documents and find keywords simultaneously.
Keywords: First singular vector; Projection algorithm; Theme discovery
Year: 2018 PMID: 30012087 PMCID: PMC6048865 DOI: 10.1186/s12859-018-2240-0
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Top scoring Theme-generated terms for the largest 10 themes in the SNP dataset
| Theme size | Top 10 terms |
|---|---|
| 765 | breast / breast cancer / cancer / cancer risk / breast neoplasms, genetics / risk / breast cancer / breast neoplasms / women / controls |
| 438 | sle / lupus / lupus erythematosus, systemic / systemic lupus / lupus erythematosus / erythematosus / systemic / sle patients / patients / susceptibility |
| 437 | prostate / prostate cancer / cancer / prostatic neoplasms, genetics / prostatic neoplasms / risk / cancer risk / men / p / associated |
| 436 | ra / rheumatoid / rheumatoid arthritis / arthritis / arthritis, rheumatoid / arthritis, rheumatoid, genetics / ra patients / controls / susceptibility / association |
| 399 | cad / coronary / coronary artery / artery disease / artery / coronary artery disease, genetics / disease cad / coronary artery disease / risk |
| 351 | lung cancer / lung / cancer / lung neoplasms / lung neoplasms, genetics / risk / cancer risk / ci / smoking |
| 340 | meta analysis / meta / cancer / cancer risk / studies / analysis / polymorphism / model / association / control studies |
| 339 | ad / alzheimer’s / alzheimer disease, genetics / alzheimer disease / disease / onset / risk / late onset / aged / ad patients |
| 315 | amd / age related / macular / macular degeneration / degeneration / macular degeneration, genetics / cfh / age / complement factor / factor h |
| 294 | colorectal / colorectal cancer / crc / cancer / colorectal neoplasms, genetics / colorectal neoplasms / risk / ci / cancer risk / controls |
Comparative evaluation of Theme-generated terms with LDA using the UMass coherence metric on the SNP dataset
| # Cl | Method | Top 5 terms | Top 10 terms | Top 20 terms |
|---|---|---|---|---|
| 100 | LDA | -21.15 | -108.99 | -541.92 |
| 100 | | -15.64 | -81.73 | -378.95 |
| 100 | | -13.53 | -80.12 | -397.23 |
| 1000 | LDA | -33.66 | -181.19 | -942.08 |
| 1623 | | -19.43 | -98.31 | -461.09 |
| 1066 | | -17.25 | -94.82 | -462.35 |
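For readers unfamiliar with the UMass metric reported above, the following is a minimal sketch based on the commonly cited definition: for a ranked term list, sum log((D(w_m, w_l) + 1) / D(w_l)) over all pairs where w_l is ranked above w_m, with D counting documents. The function name, the token-list input format, and the choice to skip terms with zero document frequency are assumptions made for illustration; this is not the authors' evaluation code.

```python
import math


def umass_coherence(top_terms, docs):
    """UMass-style coherence of a ranked term list (higher is more coherent).

    top_terms: terms ordered from highest to lowest rank.
    docs: iterable of token lists (hypothetical input format).
    """
    doc_sets = [set(d) for d in docs]

    def doc_freq(*terms):
        # Number of documents containing all of the given terms.
        return sum(1 for d in doc_sets if all(t in d for t in terms))

    score = 0.0
    for m in range(1, len(top_terms)):
        for l in range(m):
            d_l = doc_freq(top_terms[l])
            if d_l == 0:
                continue  # assumption: ignore terms absent from the corpus
            d_ml = doc_freq(top_terms[m], top_terms[l])
            score += math.log((d_ml + 1) / d_l)
    return score


# Toy corpus, for illustration only.
toy_docs = [
    ["breast", "cancer", "risk"],
    ["breast", "cancer"],
    ["lung", "cancer"],
    ["risk"],
]
print(umass_coherence(["cancer", "breast"], toy_docs))
print(umass_coherence(["cancer", "lung"], toy_docs))
```

Because the score is a sum over term pairs, it tends to grow in magnitude as more terms are included, which matches the pattern across the Top 5, Top 10, and Top 20 columns.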
Comparative evaluation of Theme-generated terms with LDA using the NPMI coherence metric on the SNP dataset
| # Cl | Method | Top 5 terms | Top 10 terms | Top 20 terms |
|---|---|---|---|---|
| 100 | LDA | 2.52 | 8.47 | 29.33 |
| 100 | | 3.34 | 9.78 | 27.38 |
| 100 | | 3.83 | 10.38 | 28.02 |
| 1000 | LDA | 3.04 | 11.46 | 44.70 |
| 1623 | | 3.34 | 10.72 | 30.91 |
| 1066 | | 3.80 | 11.80 | 33.21 |
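The NPMI scores above can be illustrated with a short sketch. It uses the standard pairwise definition, NPMI(w_i, w_j) = log(p(w_i, w_j) / (p(w_i) p(w_j))) / (-log p(w_i, w_j)), with probabilities estimated from document counts. Several choices here are assumptions rather than details confirmed by the source: pair scores are summed (the table values exceed 1, which an average of NPMI values could not), pairs that never co-occur score -1, and a pair that occurs together in every document scores 0. The function name and input format are hypothetical.

```python
import math
from itertools import combinations


def npmi_coherence(top_terms, docs):
    """Sum of pairwise NPMI over a term list, using document-level counts."""
    doc_sets = [set(d) for d in docs]
    n_docs = len(doc_sets)

    def prob(*terms):
        # Fraction of documents containing all of the given terms.
        return sum(1 for d in doc_sets if all(t in d for t in terms)) / n_docs

    total = 0.0
    for w_i, w_j in combinations(top_terms, 2):
        p_ij = prob(w_i, w_j)
        if p_ij == 0.0:
            total += -1.0  # assumption: never co-occurring pairs score -1
        elif p_ij == 1.0:
            total += 0.0  # assumption: avoid 0/0 when both terms are everywhere
        else:
            pmi = math.log(p_ij / (prob(w_i) * prob(w_j)))
            total += pmi / -math.log(p_ij)
    return total


# Toy corpus, for illustration only.
toy_docs = [["a", "b"], ["a", "b"], ["c"], ["c"]]
print(npmi_coherence(["a", "b", "c"], toy_docs))
```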
Fig. 1 NPMI of top 5, 10, and 20 topic terms. The size of m is varied from 2 to 40, and for every value of m we compute the NPMI scores for the top 5, 10, and 20 terms. We observe that as the size of m increases, the coherence of the top terms also increases
Fig. 2 Frequency of top 5, 10, and 20 topic terms. The size of m is varied from 2 to 40, and for every value of m we compute the average frequency of the top 5, 10, and 20 subject terms. We observe that as the size of m increases, the frequency of the top terms also increases, suggesting that the algorithm converges to a more general theme
Comparative evaluation of Theme and LDA-Naïve clusters on the SNP dataset using precision (P), recall (R), and F-score (F) metrics
| # Cl | Method | P | R | F |
|---|---|---|---|---|
| 100 | LDA-Naïve-100 | 0.358 | 0.364 | 0.361 |
| 100 | | 0.688 | 0.302 | 0.419 |
| 587 | LDA-Naïve-1000 | 0.507 | 0.278 | 0.359 |
| 587 | | 0.639 | 0.226 | 0.334 |
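The F column is consistent, to within rounding, with the harmonic mean of P and R: for the first row, 2 × 0.358 × 0.364 / (0.358 + 0.364) ≈ 0.361. The sketch below shows that relationship together with a simple set-based precision and recall for one cluster against one reference set. How the reference sets were built and how per-cluster scores were aggregated are not stated in this excerpt, so `precision_recall_f` and its inputs are hypothetical.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f(cluster, reference):
    """Set-based P, R, F for one cluster against one reference set.

    cluster, reference: iterables of document ids (hypothetical inputs).
    """
    cluster, reference = set(cluster), set(reference)
    overlap = len(cluster & reference)
    precision = overlap / len(cluster) if cluster else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall, f_score(precision, recall)


# Check the first table row against the harmonic-mean formula.
print(round(f_score(0.358, 0.364), 3))
```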
Comparative evaluation of Theme-generated clusters with LDA-Naïve on the 20NG collection using accuracy (AC), NMI and F-score (F) metrics
| Method | AC | NMI | F |
|---|---|---|---|
| | 53.52 | 47.98 | 52.46 |
| LDA-Naïve | 50.24 | 51.50 | 50.46 |
Fig. 3 Similarity of document pairs within Themes and LDA-based clusters. The similarity between a pair of documents is computed as the dot product of the two document vectors. These values are averaged over all within-theme document pairs and then averaged over all themes of the same size. The same computation is applied to LDA-based clusters. Each point on the graph shows that average as a function of Theme or LDA-based cluster size
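The averaging described in the Fig. 3 caption can be sketched as follows: compute the dot product for every document pair within a cluster, average within the cluster, then average those values over all clusters of the same size. The function name, the dense-list vector representation, and the decision to skip clusters with fewer than two documents are assumptions for illustration; the term weighting behind the document vectors is not specified in this excerpt.

```python
from collections import defaultdict
from itertools import combinations


def mean_similarity_by_cluster_size(vectors, clusters):
    """Map cluster size -> mean within-cluster pairwise dot product.

    vectors: list of equal-length numeric lists, one per document.
    clusters: list of lists of document indices into `vectors`.
    """
    per_size = defaultdict(list)
    for members in clusters:
        if len(members) < 2:
            continue  # assumption: a single document has no pairs to average
        sims = [
            sum(a * b for a, b in zip(vectors[i], vectors[j]))
            for i, j in combinations(members, 2)
        ]
        per_size[len(members)].append(sum(sims) / len(sims))
    # Average the per-cluster means over clusters of the same size.
    return {size: sum(vals) / len(vals) for size, vals in per_size.items()}


# Toy vectors, for illustration only.
toy_vectors = [[1, 0], [1, 0], [0, 1], [1, 1]]
toy_clusters = [[0, 1], [0, 2], [0, 1, 3]]
print(mean_similarity_by_cluster_size(toy_vectors, toy_clusters))
```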
Comparison of most frequent LDA top five topic terms and top five Theme-generated terms
| LDA term | Freq. in topics / Freq. in themes | Freq. in SNP | Theme term | Freq. in themes / Freq. in topics | Freq. in SNP |
|---|---|---|---|---|---|
| polymorphisms | 46/0 | 32,071 | cancer | 94/14 | 8,175 |
| gene | 45/0 | 34,735 | risk | 47/24 | 20,363 |
| genetic | 42/3 | 29,383 | patients | 40/37 | 21,422 |
| associated | 37/0 | 31,365 | diabetes | 39/7 | 3,594 |
| patients | 37/40 | 21,422 | schizophrenia | 36/4 | 1,806 |
| study | 36/0 | 32,116 | dna | 36/21 | 11,098 |
| association | 30/11 | 30,831 | genome-wide | 32/5 | 8,100 |
| disease | 29/17 | 15,968 | traits | 31/6 | 4,063 |
| analysis | 27/1 | 23,797 | method | 28/6 | 6,551 |
| receptor | 25/10 | 7,511 | populations | 27/11 | 7,962 |
| two | 24/0 | 17,683 | power | 26/1 | 2,171 |
| risk | 24/47 | 20,363 | data | 23/19 | 15,234 |
| results | 22/0 | 31,862 | loci | 23/4 | 7,006 |
| p | 22/11 | 25,037 | genome | 23/5 | 5,790 |
| dna | 21/36 | 11,098 | snps | 22/18 | 23,870 |
| genes | 20/14 | 19,411 | repair | 21/5 | 1,388 |
| data | 19/23 | 15,234 | sequencing | 21/4 | 6,596 |
| snps | 18/22 | 23,870 | disorder | 21/6 | 3,517 |
| polymorphism | 17/4 | 23,162 | haplotype | 21/5 | 8,933 |
| cell | 16/11 | 5,832 | expression | 21/10 | 9,020 |
Column 1 lists the most frequent LDA terms, followed by number of LDA topics/themes that contain that term in Column 2, and frequency of the term in the SNP dataset in Column 3. Columns 4-6 present similar information for the most frequent Theme-generated terms
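To make the paired counts in Columns 2 and 5 concrete (for example, 46/0 for "polymorphisms"), the sketch below counts how many top-term lists contain each term, once for LDA topics and once for themes. The function names and the toy term lists are hypothetical and do not reproduce the paper's data.

```python
from collections import Counter


def containing_list_counts(term_lists):
    """Count, for each term, how many term lists contain it at least once."""
    counts = Counter()
    for terms in term_lists:
        counts.update(set(terms))  # set() so a term counts once per list
    return counts


def paired_frequency(term, topic_lists, theme_lists):
    """Return 'topics/themes' counts for a term, as formatted in Column 2."""
    in_topics = containing_list_counts(topic_lists)[term]
    in_themes = containing_list_counts(theme_lists)[term]
    return f"{in_topics}/{in_themes}"


# Toy top-term lists, for illustration only.
toy_topics = [["gene", "risk"], ["gene", "study"], ["risk", "patients"]]
toy_themes = [["cancer", "risk"], ["cancer", "patients"]]
print(paired_frequency("gene", toy_topics, toy_themes))
```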
Fig. 4 Frequency of Theme-generated terms vs. LDA terms. The frequency of Theme terms and LDA topic terms in the SNP literature. Theme-generated terms are presented in blue, and LDA topic terms are presented in orange