Alexander Yeh, Alexander Morgan, Marc Colosimo, Lynette Hirschman.
Abstract
BACKGROUND: The biological research literature is a major repository of knowledge. As the amount of literature increases, it will get harder to find the information of interest on a particular topic. There has been an increasing amount of work on text mining this literature, but comparing this work is hard because of a lack of standards for making comparisons. To address this, we worked with colleagues at the Protein Design Group, CNB-CSIC, Madrid to develop BioCreAtIvE (Critical Assessment for Information Extraction in Biology), an open common evaluation of systems on a number of biological text mining tasks. We report here on task 1A, which deals with finding mentions of genes and related entities in text. "Finding mentions" is a basic task, which can be used as a building block for other text mining tasks. The task makes use of data and evaluation software provided by the (US) National Center for Biotechnology Information (NCBI).
Year: 2005 PMID: 15960832 PMCID: PMC1869012 DOI: 10.1186/1471-2105-6-S1-S2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Data set size
| Data Set | Sentences | Gene Mentions |
| training | 7500 | 9000 |
| (development) test | 2500 | 3000 |
| (final) test | 5000 | 6000 |
F-score, recall and precision quartiles for the 40 official submissions
| Quartile | Balanced F-score (open) | Balanced F-score (closed) | Recall (open) | Recall (closed) | Precision (open) | Precision (closed) |
| High | 83% | 83% | 84% | 85% | 86% | 86% |
| Quartile 1 | 81% | 80% | 81% | 79% | 83% | 81% |
| Median (Q2) | 78% | 74% | 74% | 72% | 80% | 72% |
| Quartile 3 | 67% | 59% | 70% | 62% | 72% | 53% |
| Low | 25% | 16% | 42% | 36% | 17% | 11% |
F-score, recall and precision for the 40+4 submissions
| Team | Open/closed | precision | recall | balanced-f | unofficial? |
| A | closed | 0.792 | 0.854 | 0.822 | |
| A | open | 0.828 | 0.835 | 0.832 | |
| A | open | 0.831 | 0.805 | 0.818 | |
| A | open | 0.841 | 0.814 | 0.827 | |
| B | closed | 0.800 | 0.805 | 0.802 | |
| B | closed | 0.805 | 0.808 | 0.806 | |
| B | closed | 0.820 | 0.832 | 0.826 | |
| B | open | 0.751 | 0.813 | 0.781 | |
| C | closed | 0.819 | 0.761 | 0.789 | |
| C | open | 0.845 | 0.784 | 0.813 | |
| C | closed | 0.830 | 0.773 | 0.801 | |
| C | open | 0.864 | 0.787 | 0.824 | |
| D | closed | 0.804 | 0.801 | 0.803 | |
| D | open | 0.803 | 0.814 | 0.809 | |
| D | closed | 0.803 | 0.805 | 0.804 | unofficial |
| D | open | 0.801 | 0.818 | 0.809 | |
| E | open | 0.825 | 0.742 | 0.781 | |
| E | open | 0.823 | 0.743 | 0.781 | |
| E | open | 0.823 | 0.741 | 0.780 | |
| F | closed | 0.855 | 0.689 | 0.763 | |
| F | closed | 0.843 | 0.718 | 0.775 | |
| G | closed | 0.712 | 0.781 | 0.745 | |
| G | open | 0.738 | 0.799 | 0.767 | |
| G | closed | 0.707 | 0.785 | 0.744 | |
| H | open | 0.800 | 0.685 | 0.738 | |
| H | open | 0.637 | 0.697 | 0.666 | |
| H | open | 0.632 | 0.705 | 0.667 | |
| I | closed | 0.698 | 0.719 | 0.708 | |
| I | closed | 0.719 | 0.706 | 0.712 | |
| I | closed | 0.763 | 0.617 | 0.683 | |
| I | open | 0.722 | 0.727 | 0.724 | |
| J | open | 0.558 | 0.681 | 0.613 | |
| K | open | 0.555 | 0.683 | 0.612 | unofficial |
| L | closed | 0.501 | 0.719 | 0.591 | |
| L | closed | 0.529 | 0.707 | 0.605 | |
| L | closed | 0.578 | 0.592 | 0.585 | |
| M | open | 0.784 | 0.418 | 0.545 | |
| N | closed | 0.323 | 0.568 | 0.412 | |
| N | closed | 0.315 | 0.567 | 0.405 | |
| N | closed | 0.311 | 0.579 | 0.404 | |
| O | closed | 0.151 | 0.332 | 0.208 | unofficial |
| O | closed | 0.107 | 0.356 | 0.164 | |
| O | open | 0.384 | 0.432 | 0.407 | unofficial |
| O | open | 0.169 | 0.457 | 0.247 | |
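The balanced-f column above is consistent with the standard definition of the balanced F-score as the harmonic mean of precision and recall. The following is a minimal sketch under that assumption; the function name `balanced_f` and the zero-denominator convention are illustrative choices, not taken from the task's evaluation software.

```python
def balanced_f(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (balanced F-score)."""
    if precision + recall == 0:
        # Assumed convention for the degenerate case; not from the source.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rows copied from the table above:
# (team, open/closed, precision, recall, reported balanced-f)
rows = [
    ("A", "closed", 0.792, 0.854, 0.822),
    ("M", "open", 0.784, 0.418, 0.545),
]

for team, condition, p, r, reported in rows:
    # Print the recomputed score next to the reported one for comparison.
    print(team, condition, round(balanced_f(p, r), 3), reported)
```

Recomputing from the rounded precision and recall shown in the table can differ from the reported value in the third decimal place for some rows, presumably because the reported scores were derived from unrounded counts.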
Figure 1. Balanced F-scores of the 40+4 submissions.
Figure 2. Precision versus recall of the 40+4 submissions.
Figure 3. Sample phrase with problematic tokenization (red vertical bars give tokenization boundaries).
Figure 4. Percent of names of a given length for BioCreAtIvE task 1A gene names and MUC-6 organization names.