Martin Gerner, Goran Nenadic, Casey M Bergman.
Abstract
BACKGROUND: The task of recognizing and identifying species names in biomedical literature has recently been regarded as critical for a number of applications in text and data mining, including gene name recognition, species-specific document retrieval, and semantic enrichment of biomedical articles.
Year: 2010 PMID: 20149233 PMCID: PMC2836304 DOI: 10.1186/1471-2105-11-85
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Overview of the LINNAEUS species name identification system. (A) Schematic diagram of the species name dictionary and automaton construction. (B) Schematic of species names tagging and post-processing.
Species name tag sets for different evaluation corpora and LINNAEUS output
| Tag set | Document set | Documents | Species | Tags |
|---|---|---|---|---|
| NCBI taxonomy | MEDLINE | 5,237 | 6,871 | 8,701 |
| NCBI taxonomy | PMC OA abs | 10 | 21 | 21 |
| NCBI taxonomy | PMC OA | 12 | 26 | 26 |
| MeSH | MEDLINE | 6,817,973 | 824 | 7,388,958 |
| MeSH | PMC OA abs | 44,552 | 518 | 51,592 |
| MeSH | PMC OA | 88,826 | 527 | 57,874 |
| Entrez gene | MEDLINE | 440,084 | 3,125 | 486,791 |
| Entrez gene | PMC OA abs | 8,371 | 406 | 9,307 |
| Entrez gene | PMC OA | 9,327 | 428 | 10,294 |
| EMBL | MEDLINE | 174,074 | 149,598 | 396,853 |
| EMBL | PMC OA abs | 5,157 | 7,582 | 12,775 |
| EMBL | PMC OA | 7,374 | 7,867 | 15,136 |
| PMC linkouts | MEDLINE | 35,534 | 29,351 | 248,222 |
| PMC linkouts | PMC OA abs | 41,054 | 41,070 | 286,998 |
| PMC linkouts | PMC OA | 42,910 | 32,187 | 289,411 |
| Whatizit-Organisms | MEDLINE | 71,856 | 23,598 | 3,328,853 |
| Whatizit-Organisms | PMC OA abs | 82,410 (64,228) | 25,375 | 3,791,412 |
| Whatizit-Organisms | PMC OA | 94,289 | 26,557 | 4,075,644 |
| Manual | MEDLINE | 75 | 176 | 3,205 |
| Manual | PMC OA abs | 89 (76) | 215 | 3,878 |
| Manual | PMC OA | 100 | 233 | 4,259 |
| LINNAEUS output | MEDLINE | 9,919,312 | 57,802 | 30,786,517 |
| LINNAEUS output | PMC OA abs | 88,962 (65,739) | 5,114 | 303,146 |
| LINNAEUS output | PMC OA | 105,106 | 18,943 | 4,189,681 |
Numbers in parentheses show the portion of abstracts that can be extracted from the document XML files, enabling mention-level accuracy comparisons (see Methods for details).
Composition of species tags in the manually annotated corpus and false negative predictions by LINNAEUS relative to the manually annotated corpus on the same document set
| Category | Number of tags in corpus | Number of false negatives |
|---|---|---|
| Misspelled | 46 | 11 |
| Incorrect case | 130 | 128 |
| OCR/technical errors | 18 | 16 |
| Enumeration | 2 | 1 |
| Incorrectly used name | 79 | 66 |
| Modifier | 1,217 | 125 |
| Normal mention | 2,788 | 12 |
A detailed description of the different tag categories is provided in the Methods.
Levels of ambiguity in LINNAEUS species tags on different document sets.
| Document set | None (strict) | None (approx.) | Earlier (strict) | Earlier (approx.) | Whole (strict) | Whole (approx.) |
|---|---|---|---|---|---|---|
| MEDLINE | 0.111 | 0.053 | 0.059 | 0.030 | 0.054 | 0.028 |
| PMC OA abs | 0.110 | 0.061 | 0.054 | 0.031 | 0.049 | 0.028 |
| PMC OA | 0.143 | 0.075 | 0.029 | 0.015 | 0.027 | 0.013 |
"None" refers to the baseline case where no disambiguation is performed, "earlier" refers to disambiguation of an ambiguous mention by searching for its explicit species mentions earlier in the document and "whole" refers to disambiguation by searching for its explicit mentions across the whole document. In the "approximate" mode, a heuristic is employed to further disambiguate ambiguous mentions based on the probability of correct species usage.
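The three disambiguation modes described above can be illustrated with a minimal sketch. It is not the LINNAEUS implementation: the `Mention` class, the `disambiguate` function, and the species identifiers are hypothetical, and the sketch assumes that an "explicit" mention is one with exactly one candidate species and that an ambiguous mention is narrowed to those of its candidates that are mentioned explicitly. The probability-based "approximate" heuristic is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """A tagged span; all names here are illustrative, not from LINNAEUS."""
    position: int          # character offset of the mention in the document
    candidates: frozenset  # candidate species identifiers (hypothetical)


def disambiguate(mentions, mode="none"):
    """Return one candidate set per mention, in document order.

    mode="none":    ambiguous mentions are left untouched (baseline).
    mode="earlier": narrow using explicit mentions seen earlier in the text.
    mode="whole":   narrow using explicit mentions anywhere in the text.
    """
    if mode not in ("none", "earlier", "whole"):
        raise ValueError(f"unknown mode: {mode}")

    ordered = sorted(mentions, key=lambda m: m.position)
    # Species named unambiguously anywhere in the document.
    explicit_all = set().union(
        *(m.candidates for m in ordered if len(m.candidates) == 1)
    )

    explicit_so_far = set()
    resolved = []
    for m in ordered:
        if len(m.candidates) == 1:
            # Explicit mention: keep it and remember the species.
            explicit_so_far |= m.candidates
            resolved.append(m.candidates)
            continue
        if mode == "none":
            context = set()
        elif mode == "earlier":
            context = explicit_so_far
        else:
            context = explicit_all
        narrowed = m.candidates & context
        # Fall back to the original candidates if nothing explicit matches.
        resolved.append(frozenset(narrowed) if narrowed else m.candidates)
    return resolved


doc = [
    Mention(0, frozenset({"species_a", "species_b"})),   # ambiguous
    Mention(10, frozenset({"species_a"})),               # explicit
    Mention(20, frozenset({"species_a", "species_b"})),  # ambiguous
]
for mode in ("none", "earlier", "whole"):
    print(mode, [sorted(c) for c in disambiguate(doc, mode)])
```

In this toy document the first ambiguous mention is only resolved in "whole" mode, because the explicit mention that settles it appears later in the text.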
Performance evaluation of LINNAEUS species tagging on different evaluation sets
| Set | Level | Main set | TP | FP | FN | Recall | Prec. |
|---|---|---|---|---|---|---|---|
| NCBI taxonomy | Doc. | MEDLINE | 6,888 | 10,032 | (1,807) | 0.7922 | (0.4071) |
| NCBI taxonomy | Doc. | PMC OA abs | 15 | 20 | (6) | 0.7143 | (0.4286) |
| NCBI taxonomy | Doc. | PMC OA full (abs) | 16 | 166 | (3) | 0.8421 | (0.0791) |
| NCBI taxonomy | Doc. | PMC OA full (all) | 22 | 196 | (4) | 0.8462 | (0.1010) |
| MeSH | Doc. | MEDLINE | 5,073,147 | 4,577,293 | 2,315,811 | 0.6866 | 0.5257 |
| MeSH | Doc. | PMC OA abs | 36,641 | 49,151 | (14,797) | 0.7123 | (0.4271) |
| MeSH | Doc. | PMC OA full (abs) | 46,484 | 291,872 | (2,219) | 0.9544 | (0.1374) |
| MeSH | Doc. | PMC OA full (all) | 54,814 | 346,071 | (2,880) | 0.9201 | (0.1367) |
| Entrez gene | Doc. | MEDLINE | 346,989 | 171,001 | (139,702) | 0.7130 | (0.6699) |
| Entrez gene | Doc. | PMC OA abs | 6,946 | 4,110 | (2,357) | 0.7466 | (0.6283) |
| Entrez gene | Doc. | PMC OA full (abs) | 8,184 | 38,275 | (470) | 0.9457 | (0.1762) |
| Entrez gene | Doc. | PMC OA full (all) | 9,662 | 42,209 | (628) | 0.9390 | (0.1863) |
| EMBL | Doc. | MEDLINE | 158,462 | 183,950 | (235,745) | 0.4020 | (0.4627) |
| EMBL | Doc. | PMC OA abs | 4,807 | 4,360 | (7,902) | 0.3782 | (0.5244) |
| EMBL | Doc. | PMC OA full (abs) | 6,601 | 34,447 | (3,859) | 0.6311 | (0.1608) |
| EMBL | Doc. | PMC OA full (all) | 9,433 | 40,212 | (5,613) | 0.6269 | (0.1900) |
| PMC linkouts | Doc. | MEDLINE | (27,259) | (23,377) | (122,596) | (0.1819) | (0.5383) |
| PMC linkouts | Doc. | PMC OA abs | (30,315) | (27,192) | (141,735) | (0.1762) | (0.5272) |
| PMC linkouts | Doc. | PMC OA full (abs) | 110,288 | 156,012 | 61,656 | 0.6414 | 0.4141 |
| PMC linkouts | Doc. | PMC OA full (all) | 112,069 | 163,052 | 61,671 | 0.6450 | 0.4073 |
| Whatizit-Organisms | Doc. | PMC OA abs | 64,686 | 29,222 | 12,930 | 0.8334 | 0.6888 |
| Whatizit-Organisms | Doc. | PMC OA full (abs) | 308,410 | 67,171 | 100,079 | 0.7550 | 0.8211 |
| Whatizit-Organisms | Doc. | PMC OA full (all) | 344,445 | 73,489 | 109,668 | 0.7585 | 0.8242 |
| Whatizit-Organisms | Mention | PMC OA abs | 139,077 | 147,426 | 39,351 | 0.7794 | 0.4854 |
| Whatizit-Organisms | Mention | PMC OA full (xml) | 1,164,799 | 1,596,615 | 527,284 | 0.6883 | 0.4218 |
| Whatizit-Organisms | Mention | PMC OA full (all) | 1,304,620 | 2,398,321 | 1,133,018 | 0.5352 | 0.3523 |
| Manual | Doc. | PMC OA abs | 101 | 0 | 3 | 0.9712 | 1.0 |
| Manual | Doc. | PMC OA full (abs) | 421 | 46 | 9 | 0.9791 | 0.9015 |
| Manual | Doc. | PMC OA full (all) | 462 | 49 | 9 | 0.9809 | 0.9041 |
| Manual | Mention | PMC OA abs | 326 | 3 | 19 | 0.9449 | 0.9909 |
| Manual | Mention | PMC OA full (xml) | 3,190 | 92 | 222 | 0.9350 | 0.9720 |
| Manual | Mention | PMC OA full (all) | 3,973 | 120 | 241 | 0.9428 | 0.9707 |
Values in parentheses are for comparisons between document sets of different type (for example, evaluation tag sets based on full text compared against species tags generated on abstracts) or when the evaluation set is likely to exclude a large number of species mentions. PMC OA full (all) shows accuracy for all full-text documents. PMC OA full (abs) shows accuracy for all full-text documents with an abstract that can be extracted, allowing comparison of document-level accuracy between full-text and abstract. PMC OA full (xml) shows accuracy for all full-text documents with XML abstract, allowing comparison of mention-level accuracy between full-text and abstracts.
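The Recall and Prec. columns follow from the TP, FP and FN counts by the standard definitions, recall = TP / (TP + FN) and precision = TP / (TP + FP). The short sketch below recomputes one row as a check; the function names are illustrative and not taken from the source.

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of evaluation-set tags that the tagger also found."""
    if tp + fn == 0:
        raise ValueError("recall is undefined when TP + FN == 0")
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Fraction of the tagger's tags that appear in the evaluation set."""
    if tp + fp == 0:
        raise ValueError("precision is undefined when TP + FP == 0")
    return tp / (tp + fp)


# Manual corpus, mention level, PMC OA abs row: TP = 326, FP = 3, FN = 19.
tp, fp, fn = 326, 3, 19
print(round(recall(tp, fn), 4), round(precision(tp, fp), 4))  # 0.9449 0.9909
```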
Top ten most commonly mentioned species in MEDLINE.
| Species | Mentions | Ratio of all mentions | Ratio of all documents |
|---|---|---|---|
| Human | 4,801,489 | 0.4743 | 0.4840 |
| Rat | 831,552 | 0.0821 | 0.0838 |
| Mouse | 655,695 | 0.0647 | 0.0661 |
| Cow | 186,091 | 0.0183 | 0.0187 |
| Rabbit | 162,487 | 0.0160 | 0.0163 |
| Escherichia coli | 144,077 | 0.0142 | 0.0145 |
| HIV | 117,441 | 0.0116 | 0.0118 |
| Dog | 112,366 | 0.0111 | 0.0113 |
| Baker's yeast | 112,254 | 0.0110 | 0.0113 |
| Chicken | 75,440 | 0.0074 | 0.0076 |
Mentions are counted at the document level in MEDLINE; ratios are relative to the total number of document-level mentions (n = 10,122,214) and the total number of documents (n = 9,919,312).
Figure 2. Number of articles per year in MEDLINE mentioning human, rat, mouse, cow, rabbit and HIV since 1975. Note that the rapid rise in mentions of the term HIV occurs just after its discovery in 1983 [59].