Helen L Johnson, William A Baumgartner, Martin Krallinger, K Bretonnel Cohen, Lawrence Hunter.
Abstract
BACKGROUND: Most biomedical corpora have not been used outside of the lab that created them, even though the availability of the gold-standard evaluation data they provide is one of the rate-limiting factors for the progress of biomedical text mining. Data suggest that one major factor affecting the use of a corpus outside of its home laboratory is the format in which it is distributed. This paper tests the hypothesis that corpus refactoring - changing the format of a corpus without altering its semantics - is a feasible goal, namely that it can be accomplished with a semi-automatable process and in a time-efficient way. We used simple text processing methods and limited human validation to convert the Protein Design Group (PDG) corpus into two new formats: WordFreak and embedded XML. We tracked the total time expended and the success rates of the automated steps.
Year: 2007 PMID: 17854502 PMCID: PMC2072937 DOI: 10.1186/1747-5333-2-4
Source DB: PubMed Journal: J Biomed Discov Collab ISSN: 1747-5333
Figure 1. Text block from original PDG corpus. This block of text from the original PDG corpus shows the idiosyncratic format of the protein interaction annotations. "MED" is a deprecated MEDLINE ID. The words that follow "actions" are keywords denoting an interaction type between proteins. The words that follow "Proteins" are the interactors. The text that follows has been altered from the original MEDLINE publication.
Figure 2. Refactored corpus: WordFreak format. Example of the text block from Figure 1 in the refactored WordFreak format. The original sentence reads: "Here we show that E2F binds to two sequence elements within the P2 promoter of the human MYC gene which are within a region that is critical for promoter activity."
Figure 3. Refactored corpus: embedded XML format. Example of the text block from Figure 1 in the refactored embedded XML format.
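To make the idea of an embedded XML format concrete, the following is a minimal sketch of wrapping annotated character spans in inline tags. It is not the authors' code: the `protein` tag name, the helper names `span_of` and `embed_annotations`, and the shortened example sentence are assumptions chosen for illustration, and the actual tag set of the refactored corpus may differ.

```python
from xml.sax.saxutils import escape


def span_of(sentence, mention, tag):
    """Locate the first exact occurrence of `mention` and return (start, end, tag)."""
    start = sentence.index(mention)  # raises ValueError if the mention is absent
    return (start, start + len(mention), tag)


def embed_annotations(sentence, spans):
    """Wrap each (start, end, tag) character span of `sentence` in an XML element.

    Spans must not overlap; text outside and inside the tags is XML-escaped.
    """
    pieces = []
    cursor = 0
    for start, end, tag in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        pieces.append(escape(sentence[cursor:start]))
        pieces.append(f"<{tag}>{escape(sentence[start:end])}</{tag}>")
        cursor = end
    pieces.append(escape(sentence[cursor:]))
    return "".join(pieces)


# Shortened, illustrative version of the sentence quoted in the Figure 2 caption.
sentence = "E2F binds to two sequence elements within the P2 promoter of the human MYC gene."
spans = [span_of(sentence, "E2F", "protein"), span_of(sentence, "MYC", "protein")]
embedded = embed_annotations(sentence, spans)
print(embedded)
```

Keeping the annotations as offsets until the last step means the same spans could be written out in a standoff format instead, which is the sense in which refactoring changes the format without altering the semantics.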
Programming and curation times for each step. Programming times were estimates. Curation times were measured.
| Refactor Step | Program | Curation | Total Project |
| ID mapping | 18 h | 10 m | |
| Finding original sentences | 28 h | 4 h | |
| Protein and interaction mapping | 32 h | 16 h 15 m | |
| Final formatting | 24 h | 0 h | |
| Total time for programming and curation | 102 h | 20 h 25 m | 122 h 25 m |
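The totals in the table above can be checked with a few lines of arithmetic. This is only a sketch: the helper names `to_minutes` and `fmt` are introduced here for illustration, and the per-step values are copied from the table.

```python
import re


def to_minutes(text):
    """Parse strings such as '16 h 15 m', '18 h' or '10 m' into minutes."""
    hours = re.search(r"(\d+)\s*h", text)
    minutes = re.search(r"(\d+)\s*m", text)
    total = int(hours.group(1)) * 60 if hours else 0
    total += int(minutes.group(1)) if minutes else 0
    return total


def fmt(total_minutes):
    """Format a minute count in the table's 'H h M m' style."""
    h, m = divmod(total_minutes, 60)
    return f"{h} h {m} m" if m else f"{h} h"


# Per-step times from the table, in row order.
programming = ["18 h", "28 h", "32 h", "24 h"]
curation = ["10 m", "4 h", "16 h 15 m", "0 h"]

prog_total = sum(to_minutes(t) for t in programming)
cur_total = sum(to_minutes(t) for t in curation)

print(fmt(prog_total))              # 102 h
print(fmt(cur_total))               # 20 h 25 m
print(fmt(prog_total + cur_total))  # 122 h 25 m
```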
Performance of the automatic sentence extraction step.
| Overall performance | Percent | Count |
| Correct extraction | 66% | 187/283 |
| Incorrect extraction | 33% | 96/283 |
| Total | 100% | 283/283 |
| Type of error | Percent | Count |
| Too little extracted | 48% | 46/96 |
| Title text not extracted | 39% | 37/96 |
| Too much extracted, expanded text selection | 9% | 9/96 |
| Too much extracted | 4% | 4/96 |
| Total | 100% | 96/96 |
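The percentages in this table follow directly from the counts, and a short sketch makes them easy to recompute. The variable names are illustrative. Note that 96/283 is about 33.9%, so the 33% shown for incorrect extractions appears to be truncated rather than rounded.

```python
def pct(count, total):
    """Return count/total as a percentage."""
    return 100 * count / total


total_blocks = 283
overall = {"Correct extraction": 187, "Incorrect extraction": 96}

# Breakdown of the 96 incorrect extractions by error type.
errors = {
    "Too little extracted": 46,
    "Title text not extracted": 37,
    "Too much extracted, expanded text selection": 9,
    "Too much extracted": 4,
}

for label, count in overall.items():
    print(f"{label}: {pct(count, total_blocks):.1f}%")

error_total = sum(errors.values())
for label, count in errors.items():
    print(f"{label}: {pct(count, error_total):.1f}%")
```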
Results on the automatic entity mapping step.
| Outcome | Percentage | Count |
| Text blocks requiring no manual correction | 57.6% | 163/283 |
| Text blocks requiring at least one boundary correction | 22.3% | 63/283 |
| Text blocks with at least one unmappable entity | 20.1% | 57/283 |
| Total | 100% | 283/283 |
Results on named entity mapping: time and required corrections
| Curation Step | Number | Time |
| a) Manually examine output for validity | n/a | 5 h 15 m |
| b) Fix protein mentions requiring boundary correction | 131 | 1 h 5 m |
| c) Add protein annotations that were unmappable | 42 | 55 m |
| d) Remove proteins that were in error in metadata | 23 | |
| Total repair time (b + c + d) | | 2 h |
Roadmap for refactoring corpora. The list of corpora came from [32] and [33], where there are links to the corpora. Column headings indicate the steps that corpora may need to undergo to be refactored; those corpora that would require that step are noted with a dot. The heading "get original" means the original text needs to be retrieved. "Detect spans" means the corpus is a metadata corpus so spans of entities need to be detected. "Alt. search" means techniques other than exact-match searching must be used.
| Corpus | get original | detect spans | alt. search |
| Arabidopsis Thaliana Circadian Rhythms [34] | • | ||
| Bio1 [35] | • | ||
| BioCreative 2004 Task 1A [28] | • | • | |
| BioCreative 2004 Task 1B [36] | • | • | |
| BioCreative 2004 Task 2 [37] | • | • | |
| BioCreative 2006 Task GM [38] | |||
| BioCreative 2006 Task GN [39] | |||
| BioCreative 2006 Task IPS/IMS [40] | • | • | |
| BioCreative 2006 Task ISS [40] | • | ||
| BioInfer [41] | |||
| BioText: Recognizing Abbreviation Definitions [42] | |||
| BioText: Protein-Protein Interaction Data [43] | • | • | |
| BioText: Relations between Disease/Treatment Entities [44] | • | ||
| Brown-Genia Treebank [45] | • | ||
| DepGenia [46] | • | ||
| DIPPPI [47] | • | • | |
| EDGAR [48] | • | • | |
| GENIA [49, 50] | • | ||
| FetchProt [51] | |||
| Human Gene ID-Serve | • | ||
| IEPA [52] | • | • | |
| ImmunoTome | • | ||
| iProLink [53] | |||
| Medstract [54, 55] | |||
| MedTag [7] | |||
| OHSUMED [56, 57] | • | • | • |
| PASBio [58] | • | ||
| PASTA [59] | |||
| PathBinder [60] | |||
| PennBioIE [12] | |||
| PICorpus | |||
| ProSpecTome [61] | • | • | |
| PDG [9] | • | • | • |
| Texas [62] | • | • | |
| TREC Genomics 2004 Categorization Task [63] | • | • | |
| TREC Genomics 2005 Categorization Task [64] | • | • | |
| TREC Genomics 2006 IR Task [65] | • | • | |
| TREC Genomics 2007 IR Task [65] | • | • | |
| Wisconsin [66] | • | • | • |
| WSD [67] | |||
| Yapex [68, 69] | • |
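The "detect spans" and "alt. search" steps in the roadmap can be illustrated with a small sketch: try an exact-match search for an entity mention first, then fall back to a looser search. This is an assumption-laden example, not the procedure used in the paper. The function name `find_span`, the fallback rule (ignore case, treat whitespace and hyphens as interchangeable), and the sample text are all hypothetical.

```python
import re


def find_span(text, mention):
    """Return (start, end) of `mention` in `text`, or None if it cannot be mapped.

    Step 1: exact-match search.
    Step 2 (hypothetical fallback): case-insensitive search in which any run
    of whitespace or hyphens between tokens is optional and interchangeable.
    """
    start = text.find(mention)
    if start != -1:
        return (start, start + len(mention))

    tokens = [re.escape(tok) for tok in re.split(r"[\s\-]+", mention) if tok]
    if not tokens:
        return None
    pattern = r"[\s\-]*".join(tokens)
    match = re.search(pattern, text, flags=re.IGNORECASE)
    return match.span() if match else None


# Illustrative text and mentions, not taken from any of the corpora listed.
text = "Here we show that E2F binds to the c-Myc promoter."
for mention in ["E2F", "C MYC", "p53"]:
    span = find_span(text, mention)
    found = text[span[0]:span[1]] if span else None
    print(mention, span, found)
```

Mentions for which both searches return `None` correspond to the unmappable entities that the tables above show being handled by manual curation.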