Marc E Colosimo, Alexander A Morgan, Alexander S Yeh, Jeffrey B Colombe, Lynette Hirschman.
Abstract
BACKGROUND: We prepared and evaluated training and test materials for an assessment of text mining methods in molecular biology. The goal of the assessment was to evaluate the ability of automated systems to generate a list of unique gene identifiers from PubMed abstracts for the three model organisms Fly, Mouse, and Yeast. This paper describes the preparation and evaluation of answer keys for training and testing. These consisted of lists of normalized gene names found in the abstracts, generated by adapting the gene list for the full journal articles found in the model organism databases. For the training dataset, the gene list was pruned automatically to remove gene names not found in the abstract; for the testing dataset, it was further refined by manual annotation by annotators provided with guidelines. A critical step in interpreting the results of an assessment is to evaluate the quality of the data preparation. We did this by careful assessment of interannotator agreement and the use of answer pooling of participant results to improve the quality of the final testing dataset.
Year: 2005 PMID: 15960824 PMCID: PMC1869005 DOI: 10.1186/1471-2105-6-S1-S12
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Statistics for Lexical Resources
| Lexical Stats | # Entries | # Synonyms | Ratio | Maximum # of Synonyms per Gene | # with One Definition | Predicted # of genes |
| Fly | 35,971 | 99,501 | 2.766 | 96 | 10,863 | ~14,000 |
| Mouse | 52,595 | 109,516 | 2.082 | 19 | 39,135 | ~25,000 |
| Yeast | 7,929 | 14,756 | 1.861 | 10 | 2,955 | ~6,000 |
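The Ratio column is consistent with the number of synonyms divided by the number of entries. The short check below is a sketch under that assumption; the names `LEXICON` and `synonym_ratio` are illustrative and do not come from the paper.

```python
# (entries, synonyms) per organism, copied from the table above.
LEXICON = {
    "Fly": (35971, 99501),
    "Mouse": (52595, 109516),
    "Yeast": (7929, 14756),
}


def synonym_ratio(entries, synonyms):
    """Average number of synonyms per lexicon entry (assumed meaning of 'Ratio')."""
    return synonyms / entries


# Round to three decimals, the precision used in the table.
ratios = {org: round(synonym_ratio(e, s), 3) for org, (e, s) in LEXICON.items()}
print(ratios)
```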
Figure 1. Sample Abstract and "Noisy" Gene List. Underlined and bold words in the PubMed abstract are the genes that were found in the text. The answer key consists of four columns. Column 1 is the file name; column 2 has the model organism unique database identifiers. Column 3 shows whether the gene was found automatically in the abstract (Y), not found and pruned (N), or added by hand (X). Column 4 shows the final set of genes in the answer key. This answer key shows that two genes were given by the database curators (FBgn0000592 and FBgn0002722); the first one was found in the abstract, the second one was not. The third gene (FBgn0026412) was found by our annotators based on the guidelines.
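The status codes in column 3 can be read as a filter that produces column 4. The sketch below illustrates this under stated assumptions: rows are modeled as (file name, identifier, status) tuples, the file name `fly_example` is invented, and the on-disk layout of the real answer key is not specified here. The three identifiers and their statuses follow the caption.

```python
from collections import defaultdict

# Hypothetical rows: (file name, gene identifier, status).
# Identifiers and statuses follow the Figure 1 caption; the file name is invented.
ROWS = [
    ("fly_example", "FBgn0000592", "Y"),  # on the database list, found in the abstract
    ("fly_example", "FBgn0002722", "N"),  # on the database list, not found, pruned
    ("fly_example", "FBgn0026412", "X"),  # added by hand by the annotators
]

KEEP = {"Y", "X"}  # statuses that stay in the final answer key


def final_gene_lists(rows):
    """Group the kept identifiers by file name, preserving row order."""
    final = defaultdict(list)
    for file_name, gene_id, status in rows:
        if status in KEEP:
            final[file_name].append(gene_id)
    return dict(final)


print(final_gene_lists(ROWS))
```

Keeping `Y` and `X` while dropping `N` mirrors the caption: pruned genes stay visible in the noisy list but never reach the final column.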
Composition of Gene Lists
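| Organism | Number of Genes on Database List | Manually Found in Abstracts: Genes on DB List | Manually Found in Abstracts: % on DB List | Manually Found in Abstracts: Genes Added to List | Manually Found in Abstracts: % Total Manual Genes | Manually Found in Abstracts: Total Genes |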
| Fly | 1571 | 399 | 25 | 32 | 7 | 431 |
| Mouse | 795 | 290 | 36 | 205 | 41 | 495 |
| Yeast | 737 | 540 | 73 | 75 | 12 | 615 |
The percent on the database (DB) list is the percentage of genes on the database list that we found in the abstracts (for Fly, 399 of 1,571, or 25%). The percent total manual genes is the share of the total genes found that we added by hand (for Fly, 32 of 431, or 7%).
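The two percentage columns can be reproduced from the counts in the table. This is a sketch assuming that the total is the sum of the genes found on the database list and the genes added by hand, and that the percentages are rounded to whole numbers; the function name `composition` is illustrative.

```python
# (genes on database list, found on DB list, added by hand), from the table above.
GENE_LISTS = {
    "Fly": (1571, 399, 32),
    "Mouse": (795, 290, 205),
    "Yeast": (737, 540, 75),
}


def composition(db_list, found_on_db, added):
    """Return (total genes, % on DB list, % total manual genes)."""
    total = found_on_db + added
    pct_on_db = 100 * found_on_db / db_list  # share of the DB list seen in abstracts
    pct_added = 100 * added / total          # share of the total that was hand-added
    return total, round(pct_on_db), round(pct_added)


for organism, counts in GENE_LISTS.items():
    print(organism, composition(*counts))
```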
Interannotator Agreement Experiment. A gene was marked "disagree" if one out of the three annotators disagreed.
| Organism | Genes Annotated | Disagree | % Disagree |
| Fly | 129 | 17 | 13% |
| Mouse | 89 | 28 | 31% |
| Yeast | 64 | 6 | 9% |
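The disagreement rule can be sketched as a unanimity check. The judgments below are hypothetical, and encoding each annotator's decision as a boolean is an assumption made only for illustration; the function names are not from the paper.

```python
def is_disagreement(judgments):
    """True when the annotators did not all give the same judgment."""
    return len(set(judgments)) > 1


def percent_disagree(annotations):
    """Return (number of disagreements, percent of annotated genes)."""
    disagreements = sum(is_disagreement(j) for j in annotations.values())
    return disagreements, 100 * disagreements / len(annotations)


# Hypothetical judgments from three annotators (True = gene belongs in the key).
EXAMPLE = {
    "gene_a": (True, True, True),
    "gene_b": (True, False, True),
    "gene_c": (False, False, False),
    "gene_d": (True, True, False),
}

print(percent_disagree(EXAMPLE))
```

Applied to the counts in the table, the same percentage gives 17 of 129 (13%) for Fly, 28 of 89 (31%) for Mouse, and 6 of 64 (9%) for Yeast.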
Conflicting Genes from Answer Pooling
| Organism | False Positives: # Candidates | False Positives: # Correctly Annotated | False Negatives: # Candidates | False Negatives: # Correctly Annotated |
| Fly | 5 | 3 | 15 | 12 |
| Mouse | 70 | 18 | 43 | 34 |
| Yeast | 19 | 13 | 14 | 11 |
Changes in the Gold Standards
Scores of the original gold standards:
| Metric | Fly | Mouse | Yeast |
| F-measure | 0.993 | 0.920 | 0.987 |
| Precision | 0.991 | 0.966 | 0.989 |
| Recall | 0.995 | 0.879 | 0.985 |
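The F-measure values in this table are consistent with the balanced F-measure, the harmonic mean of precision and recall. The check below assumes that definition; the names `f_measure` and `SCORES` are illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per organism, copied from the table above.
SCORES = {
    "Fly": (0.991, 0.995),
    "Mouse": (0.966, 0.879),
    "Yeast": (0.989, 0.985),
}

f_scores = {org: round(f_measure(p, r), 3) for org, (p, r) in SCORES.items()}
print(f_scores)
```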
Figure 2. Changes in Participants' Mouse F-measures. Graph showing the difference between each participant's original F-measure and final F-measure.
Final Comparison of Found Genes
| Organism | # Genes on Database List | # Found from List | % Overlap with Database | Total # Additional Genes Added | % Added Genes of Total | Total Genes |
| Fly | 1571 | 399 | 25.4 | 34 | 7.9 | 429 |
| Mouse | 795 | 290 | 37.1 | 271 | 49.8 | 544 |
| Yeast | 737 | 540 | 73.3 | 84 | 13.7 | 613 |