Rezarta Islamaj Doğan, Zhiyong Lu.
Abstract
MOTIVATION: Recognizing the words that are key to a document is important for ranking relevant scientific documents. Traditionally, important words in a document are either nominated subjectively by authors and indexers or selected objectively by statistical measures. As an alternative, we propose using the popularity of a document's words in user queries to identify click-words, a set of prominent words from the users' perspective. Although the two sets often overlap, click-words differ significantly from other document keywords.
Year: 2010 PMID: 20810602 PMCID: PMC2958742 DOI: 10.1093/bioinformatics/btq459
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. An example of click-words, top-scoring TF–IDF words, author keywords and MeSH indexing terms for a PubMed article. User click-words are listed by the frequency with which they appear in user queries for this article. TF–IDF words are listed in decreasing order of their TF–IDF weight. MeSH terms follow the order in which they are listed in PubMed. Author keywords are listed as they appear in the article.
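The two orderings described in the caption can be illustrated with a small sketch. It assumes that a click-word count is the number of queries leading to the article that contain the word, and that TF-IDF is term frequency times the log of inverse document frequency; the function names, the toy article and the document frequencies are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def click_word_counts(doc_words, queries):
    """Count, for each document word, the number of queries containing it."""
    vocab = set(doc_words)
    counts = Counter()
    for query in queries:
        # Count a word at most once per query.
        for word in set(query.lower().split()):
            if word in vocab:
                counts[word] += 1
    return counts


def tfidf_ranking(doc_words, doc_freq, n_docs):
    """Rank document words by decreasing TF-IDF weight (ties broken alphabetically)."""
    tf = Counter(doc_words)
    scores = {w: tf[w] * math.log(n_docs / doc_freq[w]) for w in tf}
    return sorted(scores, key=lambda w: (-scores[w], w))


# Hypothetical toy data, for illustration only.
doc_words = ["mirna", "expression", "in", "breast", "cancer", "cancer"]
queries = ["mirna cancer", "breast cancer mirna", "mirna"]
doc_freq = {"mirna": 10, "expression": 500, "in": 1000, "breast": 100, "cancer": 200}

print(click_word_counts(doc_words, queries).most_common())
print(tfidf_ranking(doc_words, doc_freq, n_docs=1000))
```

In this toy case the two lists agree on the top words but differ in the tail, which is the kind of partial overlap the figure illustrates.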
Fig. 2. The iterative feature selection model.
The break-even precision recall point for each individual feature type and the corresponding number of contributing features (non-zero Huber weights) for each feature type, when learning to differentiate articles' click-words from the top five TF–IDF-weighted words
| Model | Precision–recall break-even point | Number of features |
|---|---|---|
| Word | 0.748 | 36 489 |
| Word location | 0.663 | 15 |
| Neighbor words | 0.661 | 253 340 |
| TF–IDF rank | 0.613 | 5 |
| WFR | 0.609 | 46 |
| MetaMap semantic types | 0.594 | 134 |
| POS tag | 0.524 | 37 |
| Part of phrase | 0.510 | 2 |
| Abbreviation | 0.448 | 2 |
| Random selection | 0.429 | - |
| TF–IDF weight | 0.613 | - |
| ALL features | 0.781 | 290 069 |
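The break-even point reported in this table is the value at which precision equals recall. A minimal sketch of one common way to compute it follows, assuming a ranked list of candidate words with binary click-word labels: cutting the ranking at R, the number of positives, makes precision and recall coincide. The function name and the toy data are hypothetical, and the paper may compute the statistic differently (for example, by interpolating the precision-recall curve).

```python
def break_even_point(scores, labels):
    """Precision (= recall) in the top R items, where R is the number of positives."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    # Sort by decreasing score; the sort is stable, so ties keep input order.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = sum(label for _, label in ranked[:n_pos])
    return hits / n_pos


# Hypothetical scores for five candidate words, three of which are click-words.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 0, 1, 1, 0]
print(break_even_point(scores, labels))
```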
Fig. 3. Break-even point of precision and recall, averaged over the test sets of the 5-fold cross-validation, as the iterative feature selection method progresses.
Performance evaluation of the click-word model when compared with the TF–IDF weighting and random selection, for the top five weighted TF–IDF words for each article in the evaluation dataset
| Dataset | Classification model | Mean average precision | Break-even precision–recall | ROC | Precision@1 |
|---|---|---|---|---|---|
| Training dataset results of 5-fold cross-validation | Random selection | 0.612 | 0.428 | 0.498 | 0.428 |
| | TF–IDF weight | 0.757 | 0.611 | 0.691 | 0.671 |
| | Click-word model | 0.888 | 0.794 | 0.868 | 0.863 |
| Evaluation dataset results for top 5 TF–IDF words | Random selection | 0.596 | 0.405 | 0.495 | 0.405 |
| | TF–IDF weight | 0.737 | 0.581 | 0.681 | 0.631 |
| | Click-word model | 0.855 | 0.743 | 0.832 | 0.810 |
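Mean average precision and Precision@1 in the table above are per-article ranking measures averaged over articles. The sketch below uses their standard definitions, assuming each article is represented by its candidate words' binary click-word labels in ranked order; the function names and the two toy rankings are hypothetical.

```python
def average_precision(ranked_labels):
    """Mean of the precision values at each rank where a positive item occurs."""
    hits = 0
    total = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings):
    """Average of the per-article average precision."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


def precision_at_1(rankings):
    """Fraction of articles whose top-ranked word is a positive."""
    return sum(r[0] for r in rankings) / len(rankings)


# Two hypothetical articles, five candidate words each (1 = click-word).
rankings = [[1, 0, 1, 0, 0], [0, 1, 0, 0, 1]]
print(mean_average_precision(rankings))
print(precision_at_1(rankings))
```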
Total number of features: 2000, best break-even point avg: 0.794
| Feature | Number |
|---|---|
| Neighbors | 1322 |
| Word | 556 |
| Semantic types | 83 |
| Word location | 15 |
| POS tag | 12 |
| TF–IDF | 5 |
| WFR | 3 |
| Phrase | 2 |
| Abbreviation | 2 |
Top features of the best model
| Huber weight | Positive features | Huber weight | Negative features |
|---|---|---|---|
| 0.455 | LOC: title + first sentence | −0.409 | LOC: middle abstract only |
| 0.368 | WRD: mirna | −0.342 | LOC: middle + last sentence |
| 0.343 | WRD: cancer | −0.312 | LOC: first + middle sentence |
| 0.337 | SEM: disease or syndrome | −0.312 | LOC: first + middle + last sentence |
| 0.328 | WRD: il | −0.270 | POS: plural noun |
| 0.317 | LOC: title + first sentence + middle abstract | −0.263 | PHR: not in phrase |
| 0.293 | SEM: bacterium | −0.251 | SEM: functional concept |
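To make the weights in this table concrete, the sketch below scores a word by summing the weights of its active features. This assumes a linear model over binary features and ignores both the bias term and the remaining features of the full model, none of which are given here; only the weights themselves come from the table, and the two example words are hypothetical.

```python
# A subset of the weights listed in the table above.
TOP_WEIGHTS = {
    "LOC: title + first sentence": 0.455,
    "WRD: mirna": 0.368,
    "WRD: cancer": 0.343,
    "SEM: disease or syndrome": 0.337,
    "LOC: middle abstract only": -0.409,
    "POS: plural noun": -0.270,
    "PHR: not in phrase": -0.263,
}


def linear_score(active_features, weights, bias=0.0):
    """Sum the weights of the active binary features (assumed linear model)."""
    return bias + sum(weights.get(f, 0.0) for f in active_features)


# Hypothetical word "cancer" occurring in the title and the first sentence.
likely = ["WRD: cancer", "LOC: title + first sentence", "SEM: disease or syndrome"]
# Hypothetical plural noun occurring only mid-abstract, outside any phrase.
unlikely = ["LOC: middle abstract only", "POS: plural noun", "PHR: not in phrase"]

print(linear_score(likely, TOP_WEIGHTS))
print(linear_score(unlikely, TOP_WEIGHTS))
```

Under these assumptions the first word receives a clearly positive score and the second a clearly negative one, consistent with the table's split between location in the title or first sentence and location in the middle of the abstract.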
Performance evaluation results for the baseline random selection model, TF–IDF weighting of the words model and the click-word prediction model, for all the words in the title and abstract of the articles in the evaluation dataset
| Dataset | Classification model | Mean average precision | Break-even precision–recall | ROC | Precision@1 |
|---|---|---|---|---|---|
| Evaluation dataset results for all words | Random selection | 0.112 | 0.065 | 0.499 | 0.065 |
| | TF–IDF weight | 0.513 | 0.454 | 0.861 | 0.631 |
| | Click-word model | 0.627 | 0.547 | 0.904 | 0.806 |
| Statistical analysis | | 45.195 | 31.824 | 32.425 | 32.877 |
| | | 0.0002 | 0.0005 | 0.0005 | 0.0005 |
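The ROC column is presumably the area under the ROC curve, which would explain why random selection stays near 0.5 even though its precision drops sharply when all words are considered. Under that assumption, the area equals the probability that a randomly chosen positive word is scored above a randomly chosen negative one, as in this sketch; the function name and toy data are hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison; ties count as half."""
    pos = [s for s, label in zip(scores, labels) if label]
    neg = [s for s, label in zip(scores, labels) if not label]
    if not pos or not neg:
        raise ValueError("both positive and negative examples are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores for five words, three of which are click-words.
print(roc_auc([0.9, 0.8, 0.7, 0.6, 0.5], [1, 0, 1, 1, 0]))
```

The quadratic pairwise loop is chosen for clarity; a rank-based computation gives the same value more efficiently on large word lists.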