Alexander A Morgan, Zhiyong Lu, Xinglong Wang, Aaron M Cohen, Juliane Fluck, Patrick Ruch, Anna Divoli, Katrin Fundel, Robert Leaman, Jörg Hakenberg, Chengjie Sun, Heng-hui Liu, Rafael Torres, Michael Krauthammer, William W Lau, Hongfang Liu, Chun-Nan Hsu, Martijn Schuemie, K Bretonnel Cohen, Lynette Hirschman.
Abstract
BACKGROUND: The goal of the gene normalization task is to link genes or gene products mentioned in the literature to biological databases. This is a key step in an accurate search of the biological literature. It is a challenging task, even for the human expert; genes are often described rather than referred to by gene symbol and, confusingly, one gene name may refer to different genes (often from different organisms). For BioCreative II, the task was to list the Entrez Gene identifiers for human genes or gene products mentioned in PubMed/MEDLINE abstracts. We selected abstracts associated with articles previously curated for human genes. We provided 281 expert-annotated abstracts containing 684 gene identifiers for training, and a blind test set of 262 documents containing 785 identifiers, with a gold standard created by expert annotators. Inter-annotator agreement was measured at over 90%.
Year: 2008 PMID: 18834494 PMCID: PMC2559987 DOI: 10.1186/gb-2008-9-s2-s3
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Figure 1. Sample PubMed/MEDLINE abstract and extracted Entrez Gene identifiers.
Recall, precision, F-measure and rank for best gene normalization run per team
| Team/run | Recall | Precision | F-measure micro-average | Rank micro | F-measure macro-average | Rank macro | Significance range |
| T042_1 | 0.833 | 0.789 | 0.810 | 1 | 0.811 | 1 | 3-20 |
| T034_1 | 0.815 | 0.792 | 0.804 | 2 | 0.782 | 2 | 8-20 |
| T013_1 | 0.768 | 0.833 | 0.799 | 3 | 0.779 | 3 | 8-20 |
| T004_1 | 0.734 | 0.841 | 0.784 | 4 | | 4 | 8-20 |
| T109_1 | 0.824 | 0.743 | 0.781 | 5 | 0.775 | 5 | 8-20 |
| T104_1 | 0.743 | 0.807 | 0.774 | 6 | 0.773 | 6 | 9-20 |
| T101_2 | 0.743 | 0.801 | 0.771 | 7 | 0.755 | 7 | 10-20 |
| T107_1 | 0.740 | 0.784 | 0.761 | 8 | | | 12-20 |
| T113_2 | 0.761 | 0.752 | 0.756 | 9 | | | 11-20 |
| T108_3 | 0.749 | 0.726 | 0.737 | 10 | 0.724 | 10 | 13-20 |
| T007_2 | 0.703 | 0.746 | 0.724 | 11 | | | 16-20 |
| T017_1 | 0.708 | 0.720 | 0.714 | 12 | | | 15-20 |
| T110_1 | 0.629 | 0.783 | 0.698 | 13 | | | 16-20 |
| T111_3 | 0.664 | 0.717 | 0.689 | 14 | | | 17-20 |
| T030_1 | 0.661 | 0.716 | 0.687 | 15 | | | 17-20 |
| T006_2 | 0.606 | 0.767 | 0.677 | 16 | | | 19-20 |
| T036_1 | 0.713 | 0.520 | 0.602 | 17 | 0.595 | 17 | 19-20 |
| T014_1 | 0.485 | 0.762 | 0.593 | 18 | 0.584 | 18 | 20 |
| T102_3 | 0.790 | 0.425 | 0.552 | 19 | 0.559 | 19 | 20 |
| T058_2 | 0.415 | 0.375 | 0.394 | 20 | 0.398 | 20 | |
Columns include recall, precision, F-measure and rank for micro-averaged scores, F-measure and rank for macro-averaged scores, and significance based on macro-averaged score distributions. Bold * indicates that rank for micro-average and macro-average are different; ‡ indicates that a different run was used as the high-scoring run in macro-averaged versus micro-averaged results.
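The difference between the micro-averaged and macro-averaged columns can be sketched in a few lines. This is a minimal illustration, assuming that micro-averaging pools true positives, false positives and false negatives over all abstracts before computing F-measure, and that macro-averaging computes F-measure per abstract and then takes the mean. The function names and the toy identifiers are hypothetical and are not taken from the task's scoring script.

```python
from typing import Dict, Set, Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) from raw counts, using 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


def micro_macro_f(
    gold: Dict[str, Set[str]], pred: Dict[str, Set[str]]
) -> Tuple[float, float]:
    """Compute (micro F, macro F) over the abstracts listed in `gold`.

    Assumption: macro-averaging is taken over abstracts, not over genes.
    """
    total_tp = total_fp = total_fn = 0
    per_doc_f = []
    for doc_id, gold_ids in gold.items():
        pred_ids = pred.get(doc_id, set())
        tp = len(gold_ids & pred_ids)
        fp = len(pred_ids - gold_ids)
        fn = len(gold_ids - pred_ids)
        total_tp += tp
        total_fp += fp
        total_fn += fn
        per_doc_f.append(prf(tp, fp, fn)[2])
    micro_f = prf(total_tp, total_fp, total_fn)[2]
    macro_f = sum(per_doc_f) / len(per_doc_f) if per_doc_f else 0.0
    return micro_f, macro_f


# Toy data: one abstract answered perfectly, one answered entirely wrong.
gold = {"doc1": {"1", "2"}, "doc2": {"3"}}
pred = {"doc1": {"1", "2"}, "doc2": {"4"}}
micro_f, macro_f = micro_macro_f(gold, pred)
```

In this toy case the two averages diverge (micro F is 2/3, macro F is 1/2) because the abstract with more identifiers carries more weight in the pooled counts. The same effect could explain why a team's micro and macro ranks differ.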
Figure 2. Distribution of F-measure rank computed by resampling test data over 10,000 runs. The dark line in each box plot is the median rank based on the resampling results, the boxes are quartiles around the median, the whiskers correspond roughly to the 95% confidence interval for the average rankings, and the circles are outliers.
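A rank distribution of this kind could be produced along the following lines. This is a sketch under stated assumptions, not the organizers' procedure: it resamples test abstracts with replacement (a bootstrap), scores every team on each resample with a pooled (micro-style) F-measure, and records the resulting rank. The choice of F-measure variant, the tie-breaking rule, and all names here are hypothetical.

```python
import random
from typing import Dict, List, Set

Run = Dict[str, Set[str]]  # abstract id -> predicted gene identifiers


def pooled_f(gold: Run, pred: Run, doc_ids: List[str]) -> float:
    """F-measure from counts pooled over the (possibly repeated) abstracts."""
    tp = fp = fn = 0
    for doc_id in doc_ids:
        gold_ids = gold[doc_id]
        pred_ids = pred.get(doc_id, set())
        tp += len(gold_ids & pred_ids)
        fp += len(pred_ids - gold_ids)
        fn += len(gold_ids - pred_ids)
    # F = 2TP / (2TP + FP + FN); defined as 0.0 when there are no true positives.
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def bootstrap_ranks(
    gold: Run, runs: Dict[str, Run], n_resamples: int = 10_000, seed: int = 0
) -> Dict[str, List[int]]:
    """Return, for each team, the list of ranks it obtained across resamples."""
    rng = random.Random(seed)
    docs = sorted(gold)
    ranks: Dict[str, List[int]] = {team: [] for team in runs}
    for _ in range(n_resamples):
        # Draw as many abstracts as the test set holds, with replacement.
        sample = [rng.choice(docs) for _ in docs]
        scores = {team: pooled_f(gold, pred, sample) for team, pred in runs.items()}
        # Ties keep the input order of `runs` (assumption: no special tie rule).
        ordered = sorted(runs, key=lambda team: scores[team], reverse=True)
        for rank, team in enumerate(ordered, start=1):
            ranks[team].append(rank)
    return ranks


# Toy data: team A is always right, team B is always wrong.
gold = {"d1": {"1"}, "d2": {"2"}, "d3": {"3"}}
runs = {
    "A": {"d1": {"1"}, "d2": {"2"}, "d3": {"3"}},
    "B": {"d1": {"9"}, "d2": {"9"}, "d3": {"9"}},
}
ranks = bootstrap_ranks(gold, runs, n_resamples=200, seed=1)
```

The per-team rank lists can then be summarized as medians and quartiles, which is the information a box plot displays.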
Figure 3. Precision versus recall scatter plot with F-measure isobars for GN micro-averaged results for human, mouse, fly and yeast data. GN, gene normalization.
Statistics comparing BioCreative II gene normalization task (human) with BioCreative I tasks (mouse, fly, and yeast).
| Organism | Number of unique IDs | Average synonym length in words | Average synonyms per identifier | Average identifiers per synonym (ambiguity) | BioCreative maximum recall @ precision | BioCreative maximum F-measure |
| Human | 32,975 | 2.17 | 5.55 | 1.12 | 0.88 @ 0.50 | 0.81 |
| Mouse | 52,494 | 2.77 | 2.48 | 1.02 | 0.90 @ 0.43 | 0.79 |
| Yeast | 7,928 | 1.00 | 1.86 | 1.01 | 0.96 @ 0.65 | 0.92 |
| Fly | 27,749 | 1.47 | 2.94 | 1.09 | 0.84 @ 0.73 | 0.82 |
Statistics on synonyms are based on lexical resources provided by the task organizers.
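The synonym statistics in this table could be derived from a lexicon that maps each gene identifier to its synonyms. The sketch below is illustrative only: the lexicon format, the function name, and the decision to average synonym length over (identifier, synonym) pairs are assumptions, and the organizers' exact counting rules (for example case folding) are not stated here.

```python
from collections import defaultdict
from typing import Dict, Set


def lexicon_stats(lexicon: Dict[str, Set[str]]) -> Dict[str, float]:
    """Summarize a lexicon given as {gene identifier: set of synonyms}."""
    if not lexicon:
        raise ValueError("lexicon must contain at least one identifier")
    ids_per_synonym = defaultdict(set)
    total_synonyms = 0
    total_words = 0
    for gene_id, synonyms in lexicon.items():
        for synonym in synonyms:
            ids_per_synonym[synonym].add(gene_id)
            total_synonyms += 1
            total_words += len(synonym.split())  # whitespace-delimited words
    if total_synonyms == 0:
        raise ValueError("lexicon must contain at least one synonym")
    return {
        "unique_ids": len(lexicon),
        "avg_synonym_length_words": total_words / total_synonyms,
        "avg_synonyms_per_id": total_synonyms / len(lexicon),
        # Ambiguity: how many identifiers share a given synonym string, on average.
        "avg_ids_per_synonym": sum(len(ids) for ids in ids_per_synonym.values())
        / len(ids_per_synonym),
    }


# Toy lexicon with made-up identifiers: the string "p53" is shared by two entries.
toy_lexicon = {"id_a": {"p53", "tumor protein p53"}, "id_b": {"p53"}}
stats = lexicon_stats(toy_lexicon)
```

An ambiguity value above 1.0 means that some synonym strings map to more than one identifier, so a system cannot resolve them by dictionary lookup alone.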
Number of false positives and true positives at different levels of consensus from best micro-averaged runs of the 20 teams
| Votes | Count FP | Count TP | Precision | Recall | F-measure |
| 20 | 1 | 86 | 0.989 | 0.110 | 0.197 |
| 19 | 3 | 204 | 0.986 | 0.260 | 0.411 |
| 18 | 7 | 288 | 0.976 | 0.367 | 0.533 |
| 17 | 8 | 359 | 0.978 | 0.457 | 0.623 |
| 16 | 11 | 421 | 0.975 | 0.536 | 0.692 |
| 15 | 13 | 470 | 0.973 | 0.599 | 0.741 |
| 14 | 15 | 513 | 0.972 | 0.654 | 0.781 |
| 13 | 19 | 555 | 0.967 | 0.707 | 0.817 |
| 12 | 30 | 572 | 0.950 | 0.729 | 0.825 |
| 11 | 42 | 599 | 0.934 | 0.763 | 0.840 |
| 10 | 51 | 623 | 0.924 | 0.794 | 0.854 |
| 9 | 77 | 644 | 0.893 | 0.820 | 0.855 |
| 8 | 103 | 667 | 0.866 | 0.850 | 0.858 |
| 7 | 130 | 685 | 0.840 | 0.873 | 0.856 |
| 6 | 160 | 704 | 0.815 | 0.897 | 0.854 |
| 5 | 221 | 714 | 0.764 | 0.910 | 0.830 |
| 4 | 304 | 721 | 0.703 | 0.918 | 0.797 |
| 3 | 435 | 743 | 0.631 | 0.946 | 0.757 |
| 2 | 713 | 751 | 0.513 | 0.957 | 0.668 |
| 1 | 2522 | 763 | 0.232 | 0.972 | 0.375 |
| Total | | 785 | | | |
The table shows cumulative number of false positives and true positives (columns 2 and 3) obtained for a given level of consensus (column 1) from the top micro-averaged run of each team. Recall, precision, and F-measure were calculated using the consensus level as the minimum number of votes needed to include an identifier as an 'answer'. The total under True Positive Count indicates that there were 22 true positives that no system identified; see additional data file 3 for a listing of these.
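The consensus calculation described above can be sketched as follows, assuming each team's run is available as a set of (abstract, identifier) pairs. The function name and data layout are hypothetical. Only the thresholding logic follows the description: an identifier counts as an answer when at least k teams proposed it.

```python
from collections import Counter
from typing import List, Set, Tuple

Pair = Tuple[str, str]  # (abstract id, gene identifier)


def consensus_table(
    gold: Set[Pair], runs: List[Set[Pair]]
) -> List[Tuple[int, int, int, float, float, float]]:
    """Return rows (votes, FP, TP, precision, recall, F-measure), highest threshold first."""
    votes: Counter = Counter()
    for run in runs:
        votes.update(run)  # each run is a set, so a team votes at most once per pair
    rows = []
    for k in range(len(runs), 0, -1):
        # Cumulative: every pair with at least k votes is accepted as an answer.
        answers = {pair for pair, count in votes.items() if count >= k}
        tp = len(answers & gold)
        fp = len(answers - gold)
        precision = tp / (tp + fp) if answers else 0.0
        recall = tp / len(gold) if gold else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        rows.append((k, fp, tp, precision, recall, f_measure))
    return rows


# Toy data with two teams and three gold pairs.
gold = {("d1", "1"), ("d1", "2"), ("d2", "3")}
runs = [{("d1", "1"), ("d2", "3")}, {("d1", "1"), ("d2", "9")}]
rows = consensus_table(gold, runs)
```

As a check against the table, the row for 8 votes has 103 false positives and 667 true positives out of 785 gold identifiers, giving precision 667/770 ≈ 0.866 and recall 667/785 ≈ 0.850, which matches the reported values.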
Results of 10-fold cross-validation on classifiers trained on the pooled submissions
| Classifier | Recall | Precision | F-measure |
| Best macro-averaged submission from each group | | | |
| Naïve Bayes | 0.92 (0.046) | 0.75 (0.036) | 0.83 (0.028) |
| Support vector machines | 0.88 (0.049) | 0.96 (0.026) | 0.92 (0.031) |
| Best macro-averaged submission from top 5 ranked groups | | | |
| Naïve Bayes | 0.91 (0.062) | 0.75 (0.062) | 0.82 (0.060) |
| Support vector machines | 0.82 (0.054) | 0.91 (0.041) | 0.86 (0.040) |