Aditya K Sehgal, Padmini Srinivasan.
Abstract
BACKGROUND: Accuracy of document retrieval from MEDLINE for gene queries is crucially important for many applications in bioinformatics. We explore five information retrieval-based methods to rank documents retrieved by PubMed gene queries for the human genome. The aim is to rank relevant documents higher in the retrieved list. We address the special challenges faced due to ambiguity in gene nomenclature: gene terms that refer to multiple genes, gene terms that are also English words, and gene terms that have other biological meanings.
Year: 2006 PMID: 16630348 PMCID: PMC1482725 DOI: 10.1186/1471-2105-7-220
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Performance of Ranking Strategies (MAP). The graph shows the mean AP scores (with 95% confidence interval) for the different strategies on the set of 4,647 genes for which a summary and product are available in LL.
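The metric behind this figure can be illustrated with a minimal sketch. It assumes the standard textbook definition of average precision (AP) and a normal-approximation 95% confidence interval; the function names, the toy document IDs, and the choice to divide by the total number of relevant documents are illustrative assumptions, not the authors' code.

```python
import math
from statistics import mean, stdev

def average_precision(ranked_ids, relevant_ids):
    """Standard AP: sum of precision@k at each rank k that holds a relevant
    document, divided by the total number of relevant documents (assumption:
    relevant documents that were never retrieved contribute zero)."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant)

def mean_with_ci95(scores):
    """Mean score and the half-width of a normal-approximation 95% CI.
    Requires at least two scores."""
    half_width = 1.96 * stdev(scores) / math.sqrt(len(scores))
    return mean(scores), half_width

# Toy query: relevant documents sit at ranks 1 and 3.
toy_ap = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
```

Averaging `average_precision` over all gene queries gives MAP; a better ranking strategy moves relevant documents toward the top of `ranked_ids` and so raises each query's AP.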
Figure 2. Performance of Ranking Strategies (NTop5P). The graph shows the mean NTop5P scores (with 95% confidence interval) for the different strategies on the set of 4,647 genes for which a summary and product are available in LL.
Figure 3. Difference in Average Precision: Genes Binned by B1 AP. The genes are distributed into 10 bins defined by B1 AP score. Each bin has 450 genes except for the rightmost bin, which has 617 genes. Average B1 scores for the genes in the bins are shown in square brackets along the X axis. The Y axis depicts the mean difference in AP between a given strategy and B1. For example, for the bin closest to the origin, which has an average B1 score of 1.0, B2 degrades performance, lowering AP by 0.06 on average. Bars below the X axis indicate negative effects of ranking and bars above indicate positive effects. The height of each bar indicates the extent of the improvement or drop in performance.
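The binning described in this caption can be sketched as follows. This is a hypothetical illustration: the function name is invented, and the rule that the last bin absorbs the remainder is an assumption chosen to mirror the larger rightmost bin, not the authors' exact procedure.

```python
from statistics import mean

def binned_mean_difference(b1_ap, strategy_ap, n_bins=10):
    """Sort genes by B1 AP (highest first, so the first bin sits closest to
    the origin), split them into n_bins groups, and return one tuple per bin:
    (mean B1 AP, mean of strategy AP minus B1 AP).
    Assumes there are at least n_bins genes."""
    pairs = sorted(zip(b1_ap, strategy_ap), key=lambda p: p[0], reverse=True)
    size = len(pairs) // n_bins
    results = []
    for i in range(n_bins):
        start = i * size
        # Assumption: the last bin takes whatever genes remain.
        end = (i + 1) * size if i < n_bins - 1 else len(pairs)
        chunk = pairs[start:end]
        results.append((mean(b for b, _ in chunk),
                        mean(s - b for b, s in chunk)))
    return results

# Toy data: five genes, two bins.
bins = binned_mean_difference([1.0, 1.0, 0.5, 0.5, 0.0],
                              [0.9, 1.0, 0.6, 0.5, 0.2], n_bins=2)
```

A negative second value in a tuple corresponds to a bar below the X axis (the strategy hurts relative to B1), and a positive value to a bar above it.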
Figure 4. Difference in NTop5 Precision: Genes Binned by B1 NTop5P. The genes are binned by B1 NTop5P score. Each bin has 450 genes except for the rightmost bin, which has 617 genes. The average B1 NTop5P score for each bin is shown in square brackets along the X axis. The Y axis indicates the mean difference in NTop5P between a given strategy and B1. Bars below the X axis indicate a drop in performance whereas bars above the X axis indicate an improvement in performance. The extent of the improvement or drop is indicated by the height of the bars.
Figure 5. Performance with Ambiguous Genes (MAP). The graph shows the MAP score (with 95% confidence interval) of each strategy on genes having duplicate records in LL (DG), general English meanings (ENG) and other biological meanings (BIO).
Figure 6. Performance with Ambiguous Genes (NTop5P). The graph shows the mean NTop5P score (with 95% confidence interval) of each strategy on genes having duplicate records in LL (DG), general English meanings (ENG) and other biological meanings (BIO).
Figure 7. Difference in Average Precision (AP): Genes Binned by Ambiguityscore. The figure depicts the relationship between Ambiguityscore (> 1) and ranking strategy performance in terms of AP. Genes are binned along the X axis by their Ambiguityscore. All bins except the rightmost one have 220 genes. The rightmost bin has 297 genes. Numbers in parentheses below the X axis show the average Ambiguityscore for each bin. Mean B1 AP scores for each bin are shown in square brackets in the graph. The Y axis depicts the difference in performance between each strategy and B1. Bars above the X axis denote an improvement whereas bars below the X axis denote a drop in performance.
Figure 8. Difference in Average Precision (AP): Genes Binned by Number of Retrieved Documents. The figure shows the relationship between retrieved set size and ranking strategy performance in terms of AP. Genes are binned into equal-sized groups based on the number of retrieved documents. Each bin, except for the last one, has 450 genes. The last bin consists of 617 genes. Average retrieved set size for each bin is shown in parentheses and average B1 AP for each bin is shown in square brackets.
Figure 9. Performance of Generic Ranking Strategy on Training Set (1,000 genes) and Test Set (3,195 genes). The figure shows MAP scores (with 95% confidence interval) for our generic ranking strategy, B1 and B2 on training and test sets. M is the number of top-ranked terms selected. Since M = 5 is our best generic ranking strategy, we show only the performance of this strategy.
Correlation Coefficients. The table shows the strength of the correlations among the different kinds of ambiguities and the number of retrieved documents and their correlation with the B1 AP Score.
| 0.371 | | | |
| 0.175 | 0.456 | | |
| 0.336 | 0.697 | 0.436 | |
| -0.182 | -0.363 | -0.149 | -0.508 |
Regression Results. The results of the regression to predict the B1 AP score using the size of the retrieved set as the predictive variable.
| Variable | Coefficient | Std. error | t | p-value |
| Constant ( | 0.783 | 0.009 | 83.23 | 0 |
| N( | -0.071 | 0.002 | -40.14 | 9.314E-303 |
R-square: 0.258, Adjusted R-square: 0.257
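Read as a simple linear model, the regression above can be applied as in the minimal sketch below. It assumes the first numeric column holds the fitted coefficients (intercept 0.783, slope -0.071). The predictor `x` stands for the retrieved-set-size variable in whatever scale the authors used; that scale is not recoverable from the truncated row label, so the function name and the argument are hypothetical.

```python
INTERCEPT = 0.783  # "Constant" row, first numeric column (assumed coefficient)
SLOPE = -0.071     # "N" row, first numeric column (assumed coefficient)

def predict_b1_ap(x):
    """Predicted B1 AP for predictor value x under the fitted line.
    x must be on the same (unrecovered) scale the authors used for N."""
    return INTERCEPT + SLOPE * x

# Each unit increase in x lowers the predicted B1 AP by 0.071.
drop_per_unit = predict_b1_ap(1) - predict_b1_ap(0)
```

With an R-square of about 0.26, the line explains roughly a quarter of the variance in B1 AP, so individual predictions should be treated as rough.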
Figure 10. Performance of Combined Strategy over 9,390 genes (MAP). This figure shows the MAP scores (with 95% confidence intervals) of our combined strategy (B2+S+SP), B1 and B2 on the full set of 9,390 genes.
Figure 11. Performance of Combined Strategy over 9,390 genes (NTop5P). This figure shows the mean NTop5P scores (with 95% confidence intervals) of our combined strategy (B2+S+SP), B1 and B2 on the full set of 9,390 genes.
Figure 12. Performance of Ranking Strategies (MAP) on Different Gold Standard Sets. This figure shows the MAP scores (with 95% confidence interval) for each strategy on 4641 genes for which a summary and product are available in two versions of LL. The left half of the graph shows the performance using relevance judgments from the 2003 LL file, whereas the right half shows the performance using relevance judgments from the 2005 LL file.
Figure 13. Performance of Ranking Strategies (NTop5P) on Different Gold Standard Sets. This figure shows the mean NTop5P scores (with 95% confidence interval) for each strategy on 4641 genes for which a summary and product are available in two versions of LL. The left half of the graph shows the performance using relevance judgments from the 2003 LL file, whereas the right half shows the performance using relevance judgments from the 2005 LL file.
Distribution of Retrieved and Relevant Documents (9,390 genes). A topic is defined as a gene query. Thus, in the table, 5270 gene queries retrieve 0–100 documents and 7101 gene queries have 1–5 relevant documents identified in LocusLink.
| Retrieved documents | Topics | Relevant documents | Topics |
| 0–100 | 5270 (56%) | 1–5 | 7101 (76%) |
| 101–500 | 1944 (21%) | 6–10 | 1344 (14%) |
| 501–1000 | 633 (7%) | 11–15 | 430 (5%) |
| 1001–2500 | 676 (7%) | 16–20 | 204 (2%) |
| 2501–5000 | 323 (3%) | 21–25 | 100 (1%) |
| 5001–10000 | 230 (2%) | 26–30 | 58 (<1%) |
| 10001–25000 | 154 (2%) | 31–35 | 38 (<1%) |
| 25001–50000 | 71 (1%) | 36–40 | 29 (<1%) |
| > 50,000 | 89 (<1%) | > 40 | 86 (<1%) |