Chih-Hsuan Wei, Hung-Yu Kao, Zhiyong Lu.
Abstract
As suggested in recent studies, species recognition and disambiguation is one of the most critical and challenging steps in many downstream text-mining applications, such as gene normalization and protein-protein interaction extraction. We report SR4GN, an open-source tool for species recognition and disambiguation in biomedical text. In addition to the species detection function found in existing tools, SR4GN is optimized for the Gene Normalization task: it is designed to link detected species with the corresponding gene mentions in a document. SR4GN achieves 85.42% accuracy and compares favorably with other state-of-the-art techniques in benchmark experiments. Finally, SR4GN is implemented as a standalone software tool, making it convenient and robust for use in many text-mining applications. SR4GN can be downloaded at: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/downloads/SR4GN.
Year: 2012 PMID: 22679507 PMCID: PMC3367953 DOI: 10.1371/journal.pone.0038460
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. An overview of the SR4GN workflow.
Sid, start, end and tax_id in the shaded boxes refer, respectively, to the identifier of an individual sentence, the beginning and end of the text span of a gene or species mention, and the NCBI Taxonomy ID.
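The four fields named in the caption can be pictured as one record per mention. The following is a minimal sketch under stated assumptions, not SR4GN's actual data model: the `Mention` class, the `kind` field and the `group_by_sentence` helper are hypothetical names introduced only for illustration, and the offsets are made-up values. The Taxonomy ID 9606 is the NCBI identifier for Homo sapiens.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Mention:
    """One gene or species mention (hypothetical record layout)."""
    sid: int                      # sentence identifier
    start: int                    # beginning of the text span
    end: int                      # end of the text span
    kind: str                     # "gene" or "species" (assumed field)
    tax_id: Optional[int] = None  # NCBI Taxonomy ID, if known


def group_by_sentence(mentions: List[Mention]) -> Dict[int, List[Mention]]:
    """Collect mentions that share a sentence identifier."""
    grouped: Dict[int, List[Mention]] = defaultdict(list)
    for mention in mentions:
        grouped[mention.sid].append(mention)
    return dict(grouped)


# Illustrative values only; they are not taken from any corpus.
mentions = [
    Mention(sid=0, start=0, end=5, kind="species", tax_id=9606),
    Mention(sid=0, start=6, end=10, kind="gene"),
    Mention(sid=1, start=3, end=8, kind="gene"),
]
by_sentence = group_by_sentence(mentions)
```

Grouping by `sid` is only one convenient way to see which species and gene mentions share a sentence; the rules SR4GN applies to assign a species to each gene are those listed in the error breakdown table.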
Evaluation on species detection using the Linnaeus corpus from.
| System | Precision | Recall | F-measure |
| SR4GN | 86% | 85% | 86% |
| Linnaeus | 98% | 94% | 96% |
| OrganismTagger | 96% | 63% | 76% |
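As a quick check on the table above, the F-measure can be recomputed from the reported precision and recall. This sketch assumes the balanced F-measure (the harmonic mean of precision and recall), which the source does not state explicitly; the function name `f_measure` is illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the table (already rounded).
reported = {
    "SR4GN": (0.86, 0.85),
    "Linnaeus": (0.98, 0.94),
    "OrganismTagger": (0.96, 0.63),
}

scores = {name: f_measure(p, r) for name, (p, r) in reported.items()}
for name, score in scores.items():
    print(f"{name}: {score:.1%}")
```

The recomputed values for Linnaeus and OrganismTagger round to the reported 96% and 76%. For SR4GN the recomputed value is about 85.5%, marginally below the reported 86%, presumably because the published precision and recall are themselves rounded.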
Evaluation on species assignment using the DECA corpus from Wang et al. (2010).
| Method | Accuracy |
| Kao and Wei, 2011 | 81.08% |
| +R1 | 81.61% |
| +R1+R2 | 82.10% |
| +R1+R2+R3 | 84.20% |
|
|
|
| Wang et al., 2010 | 83.80% |
| Mu et al., 2010 | 85.13% |
As in Wang et al. [9] and Mu et al. [10], hand-tagged gene mentions are used.
Evaluation using the test data from the BioCreative III GN task.
| Species Module | TAP-5 | TAP-10 | TAP-20 | F-measure |
| Kao and Wei, 2010 | 0.3254 | 0.3538 | 0.3535 | 0.4553 |
|
|
|
|
|
|
| Linnaeus | 0.3042 | 0.3283 | 0.3283 | 0.4476 |
| OrganismTagger | 0.2915 | 0.3011 | 0.3011 | 0.4456 |
Both the traditional F-measure and the BC III TAP-k measure [22] are reported. The same software, AIIA-GMT, was used to tag gene mentions here. The last two rows show decreased GN results when SR4GN is replaced with Linnaeus or OrganismTagger for species recognition while all other GN modules (e.g. gene recognition) are kept intact.
Comparison of benchmarking time on species detection by Linnaeus, OrganismTagger and SR4GN.
| System | Loading dictionary | 10 abstracts | 100 abstracts | Output format |
| Linnaeus | 41s | 1.95s | 2.15s | Tab delimited text |
| OrganismTagger | 34s | 37s | 5m21s | XML |
| SR4GN | 0 | 15s | 2m44s | XML |
SR4GN does not preload the species dictionary into memory and therefore required the least RAM in the tests shown above: Linnaeus (1.2GB), OrganismTagger (1.6GB), and SR4GN (150MB).
According to its online documentation, when running 5 parallel threads with 10GB RAM, OrganismTagger needs only 14 seconds to process 100 documents.
Breakdown of errors by different rules.
| Species assignment rules | Applicable genes (%) | # of errors | Accuracy |
| Prefix | 147 (2.46%) | 0 | 100% |
| Co-occurring word | 1951 (32.66%) | 382 | 80.42% |
| Focus species | 2881 (48.23%) | 332 | 88.48% |
| SRI coefficient method | 995 (16.66%) | 157 | 84.22% |
| Total | 5974 (100%) | 871 | 85.42% |
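The totals in the breakdown follow directly from the per-rule counts. The short sketch below recomputes them, assuming accuracy is defined as one minus errors divided by applicable genes; the variable and function names are illustrative.

```python
# (applicable genes, errors) per species assignment rule, from the table.
rules = {
    "Prefix": (147, 0),
    "Co-occurring word": (1951, 382),
    "Focus species": (2881, 332),
    "SRI coefficient method": (995, 157),
}


def accuracy(applicable: int, errors: int) -> float:
    """Fraction of applicable genes assigned the correct species."""
    return 1 - errors / applicable


total_genes = sum(applicable for applicable, _ in rules.values())
total_errors = sum(errors for _, errors in rules.values())
overall = accuracy(total_genes, total_errors)

for name, (applicable, errors) in rules.items():
    share = applicable / total_genes
    print(f"{name}: {share:.2%} of genes, accuracy {accuracy(applicable, errors):.2%}")
print(f"Total: {total_genes} genes, {total_errors} errors, accuracy {overall:.2%}")
```

The overall figure is a weighted average of the per-rule accuracies, so the two most frequently applied rules (focus species and co-occurring word) together account for most of the 871 errors.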