Minlie Huang, Aurélie Névéol, Zhiyong Lu.
Abstract
BACKGROUND: Due to the high cost of manual curation of key aspects from the scientific literature, automated methods for assisting this process are greatly desired. Here, we report a novel approach to facilitate MeSH indexing, a challenging task of assigning MeSH terms to MEDLINE citations for their archiving and retrieval.
Year: 2011 PMID: 21613640 PMCID: PMC3168302 DOI: 10.1136/amiajnl-2010-000055
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497
Figure 1. An overview of our approach. MH, main heading.
Figure 2. Sample MeSH terms assigned to a MEDLINE article. The terms inside the blue box are main headings, and those outside the blue box are subheadings.
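The abstract and Figure 1 describe a neighborhood-based approach: retrieve MEDLINE citations similar to the target article, pool the MeSH main headings assigned to those neighbors as candidates, and rank the candidates. The sketch below illustrates the candidate-pooling step only; the function name, input format, and similarity-weighted scoring are illustrative assumptions, not the authors' implementation (which uses a learning-to-rank model over several feature sets).

```python
from collections import Counter

def candidate_main_headings(neighbors):
    """Pool MeSH main headings from the k nearest neighbor citations.

    `neighbors` is a list of (similarity, [main headings]) pairs, e.g. from a
    PubMed related-articles style retrieval step. Candidates are scored by
    similarity-weighted neighborhood frequency (one simple baseline ranking).
    """
    scores = Counter()
    for similarity, headings in neighbors:
        for mh in headings:
            scores[mh] += similarity  # frequency weighted by neighbor similarity
    return scores.most_common()       # ranked (heading, score) list

# Hypothetical neighbors of a target citation:
neighbors = [
    (0.9, ["Humans", "Algorithms", "Medical Subject Headings"]),
    (0.7, ["Humans", "Abstracting and Indexing"]),
    (0.5, ["Algorithms", "MEDLINE"]),
]
ranked = candidate_main_headings(neighbors)  # "Humans" ranks first (0.9 + 0.7)
```

In the paper's full pipeline, this candidate list is then re-ranked by the learning-to-rank algorithm rather than used directly.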
Detailed information about the three datasets used in this study
| Dataset | Number of citations | Total number of main headings | Average number of main headings | When main headings were curated | Data availability |
| Small200 | 200 | 2736 | 13.7 | 2002–2009 | Freely available |
| NLM2007 | 200 | 2737 | 13.7 | 1997–2001 | – |
| L1000 | 1000 | 12145 | 12.1 | 1961–2009 | – |
Precision, recall, F-score, and MAP for different methods
| Method | Precision | Recall | F score | MAP |
| MTI system | 0.318 | 0.574 | 0.409 | 0.450 |
| Reflective random indexing | 0.372 | 0.575 | 0.451 | N/A |
| Neighborhood frequency | 0.369 | 0.674 | 0.476 | 0.598 |
| Neighborhood similarity | 0.376 | 0.677 | 0.483 | 0.604 |
| Learning-to-rank algorithm | 0.390 | 0.712 | 0.504 | 0.626 |
The comparison was performed on the NLM2007 dataset. Statistical significance tests were performed on mean average precision to compare the two baseline ranking strategies with our learning-to-rank algorithm.
p<0.001, indicating that the performance of both baseline strategies was significantly lower than the learning-to-rank algorithm.
We directly used the best results reported by Vasuki and Cohen (2009).9
MAP, mean average precision; MTI, Medical Text Indexer.
Precision, recall, F score, and MAP for different methods
| Method | Precision | Recall | F score | MAP |
| MTI system | 0.302 | 0.583 | 0.398 | 0.462 |
| Neighborhood frequency | 0.329 | 0.679 | 0.443 | 0.584 |
| Neighborhood similarity | 0.333 | 0.687 | 0.449 | 0.591 |
| Learning-to-rank algorithm | 0.347 | 0.714 | 0.467 | 0.615 |
The comparison was performed on the L1000 dataset. Statistical significance tests were performed on mean average precision to compare the two baseline ranking strategies with our learning-to-rank algorithm.
p<0.001, indicating that the performance of both baseline strategies was significantly lower than the learning-to-rank algorithm.
MAP, mean average precision; MTI, Medical Text Indexer.
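For reference, the evaluation measures in the tables above can be computed per citation as follows. This is a generic sketch of precision, recall, F-score, and average precision (MAP is the mean of average precision over all test citations), not the authors' evaluation code; the function names and inputs are illustrative.

```python
def precision_recall_f(recommended, gold):
    """Set-based precision, recall, and F-score for one citation's MeSH terms."""
    rec, gold = set(recommended), set(gold)
    tp = len(rec & gold)                       # correctly recommended headings
    p = tp / len(rec) if rec else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def average_precision(ranked, gold):
    """Average precision over one citation's ranked recommendation list."""
    gold = set(gold)
    hits, ap = 0, 0.0
    for i, term in enumerate(ranked, start=1):
        if term in gold:
            hits += 1
            ap += hits / i                     # precision at each relevant rank
    return ap / len(gold) if gold else 0.0

p, r, f = precision_recall_f(["Humans", "Algorithms"], ["Humans", "MEDLINE"])
# p = r = f = 0.5
ap = average_precision(["Humans", "Algorithms", "MEDLINE"], ["Humans", "MEDLINE"])
# ap = (1/1 + 2/3) / 2 = 5/6
```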
The upper-bound recall and average number of main heading candidates with different numbers of neighbor documents

| Dataset | Measure | 5 | 10 | 15 | 20 | 25 | 30 | 35 | 40 |
| NLM2007 | Upper-bound recall | 0.704 | 0.793 | 0.832 | 0.871 | 0.882 | 0.891 | 0.898 | – |
| | Number of main heading candidates | 38.8 | 64.1 | 83.6 | 119.7 | 136.4 | 151.7 | 166.4 | – |
| L1000 | Upper-bound recall | 0.702 | 0.786 | 0.825 | 0.870 | 0.882 | 0.891 | 0.899 | – |
| | Number of main heading candidates | 37.3 | 60.9 | 81.5 | 117.2 | 133.5 | 148.8 | 163.6 | – |
Both NLM2007 and L1000 datasets were used in the experiments.
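Upper-bound recall in the table above measures how many of the curated main headings appear anywhere in the pooled candidate set from the top-k neighbors, regardless of ranking; it is the ceiling any re-ranker over those candidates can reach. A minimal sketch of the computation, assuming a simple input format (the data structures are illustrative, not the authors'):

```python
def upper_bound_recall(citations, k):
    """Fraction of gold main headings covered by top-k neighbors' headings.

    `citations` is a list of (neighbors, gold) pairs, where `neighbors` is a
    ranked list of main-heading lists (one per neighbor document) and `gold`
    is the citation's curated main headings.
    """
    covered = total = 0
    for neighbors, gold in citations:
        candidates = set()
        for headings in neighbors[:k]:   # pool headings from the top-k neighbors
            candidates.update(headings)
        covered += len(candidates & set(gold))
        total += len(gold)
    return covered / total if total else 0.0

# One hypothetical citation: two neighbors, three curated headings.
cits = [([["A", "B"], ["C"]], ["A", "C", "D"])]
# With k=1 only "A" is covered (1/3); with k=2, "A" and "C" are (2/3).
```

As the table shows, coverage grows with k, but so does the candidate pool that the ranker must sort through, which motivates the trade-off examined in Figure 3.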
Figure 3. The ranking performance (y-axis) varies with different numbers of neighbor documents (x-axis). MAP, mean average precision.
Feature ablation study
| Feature set | Precision | Recall | F score | MAP |
| All features | 0.390 | 0.712 | 0.504 | 0.626 |
| − Neighborhood features | 0.315* | 0.575* | 0.407* | 0.435* |
| − Unigram/bigram features | 0.389 | 0.711 | 0.503 | 0.626 |
| − Translation probability features | 0.389 | 0.711 | 0.503 | 0.626 |
| − Query likelihood features | 0.385 | 0.704 | 0.498 | 0.626 |
| − Synonym features | 0.385 | 0.703 | 0.497 | 0.618 |
| Only neighborhood features | 0.370* | 0.677* | 0.478* | 0.602* |
In rows starting with a minus sign (−), we trained and tested the learning algorithm using all but the given set of features. Measures marked with asterisks are significantly worse than the corresponding measures using all features (p<0.001 with all three statistical significance tests).