Evangelos Pafilis, Sune P Frankild, Lucia Fanini, Sarah Faulwetter, Christina Pavloudi, Aikaterini Vasileiadou, Christos Arvanitidis, Lars Juhl Jensen.
Abstract
The exponential growth of the biomedical literature is making the need for efficient, accurate text-mining tools increasingly clear. The identification of named biological entities in text is a central and difficult task. We have developed an efficient algorithm and implementation of a dictionary-based approach to named entity recognition, which we here use to identify names of species and other taxa in text. The tool, SPECIES, is more than an order of magnitude faster than existing tools and as accurate. Precision and recall were assessed both on an existing gold-standard corpus and on a new corpus of 800 abstracts, which were manually annotated after the development of the tool. The corpus comprises abstracts from journals selected to represent many taxonomic groups, which gives insights into which types of organism names are hard to detect and which are easy. Finally, we have tagged organism names in the entire Medline database and developed a web resource, ORGANISMS, that makes the results accessible to the broad community of biologists. The SPECIES software is open source and can be downloaded from http://species.jensenlab.org along with dictionary files and the manually annotated gold-standard corpus. The ORGANISMS web resource can be found at http://organisms.jensenlab.org.
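To make the idea of dictionary-based named entity recognition concrete, the following is a minimal sketch of longest-match dictionary tagging. It is not the SPECIES implementation: the toy dictionary, the `MAX_NAME_TOKENS` cap and the `tag` function are assumptions made for illustration only, and a real tagger would also need to handle spelling variants, abbreviations and ambiguous names.

```python
import re

# Toy dictionary (assumption): lowercased name -> NCBI Taxonomy ID.
# A real dictionary would contain far more names and synonyms.
TOY_DICTIONARY = {
    "homo sapiens": 9606,
    "human": 9606,
    "escherichia coli": 562,
    "e. coli": 562,
}

MAX_NAME_TOKENS = 3  # hypothetical upper bound on name length, in tokens
TOKEN = re.compile(r"\w+")


def tag(text, dictionary=TOY_DICTIONARY):
    """Return (start, end, matched_text, taxon_id) for each dictionary hit."""
    spans = [(m.start(), m.end()) for m in TOKEN.finditer(text)]
    matches = []
    i = 0
    while i < len(spans):
        step = 1
        # Try the longest candidate first so that "Homo sapiens" is
        # preferred over any shorter name starting at the same token.
        for n in range(min(MAX_NAME_TOKENS, len(spans) - i), 0, -1):
            start, end = spans[i][0], spans[i + n - 1][1]
            candidate = " ".join(text[start:end].lower().split())
            if candidate in dictionary:
                matches.append((start, end, text[start:end], dictionary[candidate]))
                step = n  # skip the tokens consumed by this match
                break
        i += step
    return matches


if __name__ == "__main__":
    sentence = "Human (Homo sapiens) cells were infected with E. coli."
    for match in tag(sentence):
        print(match)
```

Each candidate is checked with a single hash lookup, which is one reason dictionary-based approaches can be made fast; how SPECIES achieves its reported speed is described in the paper itself, not by this sketch.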
Year: 2013 PMID: 23823062 PMCID: PMC3688812 DOI: 10.1371/journal.pone.0065390
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Speed and memory efficiency of the LINNAEUS and SPECIES taggers.
The major advantage of the SPECIES tagger over existing methods is its efficiency. Compared to the methodologically similar LINNAEUS tagger, it starts up and loads its dictionary 55× faster (6 seconds vs. 6 minutes 35 seconds), tags Medline abstracts 15× faster (0.26 vs. 4.05 seconds per 1000 documents), and uses 5× less memory in the process (0.5 GB vs. 3.0 GB).
Size and species diversity of the corpora.
| Corpus | Category | Documents | Unique species | Unique names | Mentions |
| S800 | Protistology | 100 | 196 | 284 | 497 |
| S800 | Entomology | 100 | 138 | 293 | 614 |
| S800 | Virology | 100 | 117 | 342 | 946 |
| S800 | Bacteriology | 100 | 87 | 179 | 416 |
| S800 | Zoology | 100 | 85 | 160 | 299 |
| S800 | Mycology | 100 | 80 | 178 | 538 |
| S800 | Botany | 100 | 68 | 131 | 308 |
| S800 | Medicine | 100 | 23 | 30 | 90 |
| S800 | Total | 800 | 718 | 1503 | 3708 |
| L100E | | 100 | 218 | 375 | 2988 |
The number of documents, unique annotated species (taxonomic IDs), unique species names, and document-level species mentions for the S800 and L100E corpora, using the latest version of the NCBI taxonomy.
Summary benchmark of LINNAEUS and SPECIES.
| Corpus | Level | Software | Precision | Recall | F1 |
| S800 | Document | LINNAEUS | 86.4% | 89.3% | 87.9% |
| S800 | Document | SPECIES | 85.9% | 89.8% | 87.8% |
| S800 | Mention | LINNAEUS | 84.3% | 75.4% | 79.6% |
| S800 | Mention | SPECIES | 83.9% | 72.6% | 77.8% |
| L100E | Document | LINNAEUS | 89.2% | 91.4% | 90.3% |
| L100E | Document | SPECIES | 89.9% | 94.3% | 92.0% |
| L100E | Mention | LINNAEUS | 88.7% | 81.8% | 85.1% |
| L100E | Mention | SPECIES | 91.5% | 90.8% | 91.1% |
We compared the LINNAEUS and SPECIES taggers by calculating their precision and recall on two different corpora (L100E and S800) at the document level and at the mention level.
Unsurprisingly, SPECIES performs better than LINNAEUS on the L100E corpus, which we used during the development of SPECIES. On the S800 corpus, which did not exist when either tagger was developed, we obtain very similar performance numbers for the two taggers.
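As an illustration of how the precision, recall and F1 columns relate to each other, the sketch below computes the three measures from sets of predicted and gold-standard annotations. The data and function names are hypothetical; at the document level an item is assumed to be a (document, taxon ID) pair, while at the mention level it would also carry character offsets. The exact matching criteria used in the benchmark are not restated here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(predicted, gold):
    """Score a set of predicted annotations against a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


if __name__ == "__main__":
    # Hypothetical document-level annotations: (document id, taxon id).
    gold = {("doc1", 9606), ("doc1", 562), ("doc2", 9606), ("doc3", 10090)}
    predicted = {("doc1", 9606), ("doc2", 9606), ("doc2", 562), ("doc3", 10090)}
    print(precision_recall_f1(predicted, gold))

    # Consistency check against one table row (L100E, mention level, SPECIES):
    # precision 91.5% and recall 90.8% give an F1 of about 91.1%.
    print(round(f1_score(0.915, 0.908), 3))
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two values, which is why a tagger with high precision but clearly lower recall, as in the S800 mention-level rows, ends up with an F1 closer to its recall.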
Figure 2. Precision and recall for separate S800 categories.
Because the S800 corpus consists of seven different taxonomic categories (the eighth category is not taxonomic), it can provide insights into which types of species are hard to identify in text and which are easy. Plotting the precision and recall on each of the seven categories separately for both the LINNAEUS and the SPECIES tagger shows little difference between the taggers, but big differences between categories. It is clear that both methods are considerably worse at tagging names of viruses than at tagging cellular organisms, and that bacterial and fungal species—for which Linnaean nomenclature is primarily used—are the easiest to identify in text.
Figure 3. The ORGANISMS web resource.
The ORGANISMS web resource (http://organisms.jensenlab.org) aims to make the results of mining the biomedical literature for taxonomic names easily accessible to biologists. It currently covers 164,084 different taxa that can be queried by name. The screenshot shows an example of what is retrieved when searching for Metatheria; because the system is aware of synonyms as well as taxonomy, it correctly retrieved and tagged an abstract about the tammar wallaby.
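The Metatheria example can be sketched as a lookup that first resolves a query name (or synonym) to a taxon and then returns every abstract tagged with that taxon or any of its descendants. This is only an illustration under stated assumptions: the simplified lineage, the synonym table, the abstract identifiers and all function names below are hypothetical and do not describe how ORGANISMS is implemented.

```python
# Simplified toy lineage (assumption): child taxon -> parent taxon.
PARENT = {
    "Macropus eugenii": "Macropodidae",
    "Macropodidae": "Diprotodontia",
    "Diprotodontia": "Metatheria",
    "Metatheria": "Mammalia",
    "Homo sapiens": "Mammalia",
}

# Hypothetical synonym table: alternative name -> canonical taxon name.
SYNONYMS = {
    "tammar wallaby": "Macropus eugenii",
    "marsupials": "Metatheria",
    "human": "Homo sapiens",
}

# Hypothetical tagging results: abstract id -> taxa found in the text.
TAGGED = {
    "abstract-1": {"Macropus eugenii"},
    "abstract-2": {"Homo sapiens"},
}

# Case-insensitive index over canonical names and synonyms.
NAME_INDEX = {name.lower(): name for name in set(PARENT) | set(PARENT.values())}
NAME_INDEX.update({syn.lower(): taxon for syn, taxon in SYNONYMS.items()})


def lineage(taxon):
    """Return the taxon followed by all of its ancestors."""
    chain = [taxon]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def search(query):
    """Return ids of abstracts tagged with the queried taxon or a descendant."""
    target = NAME_INDEX.get(query.strip().lower())
    if target is None:
        return []
    return sorted(
        doc for doc, taxa in TAGGED.items()
        if any(target in lineage(taxon) for taxon in taxa)
    )


if __name__ == "__main__":
    print(search("Metatheria"))  # abstract about the tammar wallaby
    print(search("Mammalia"))
```

Walking up the lineage at query time keeps the sketch short; a production system would more plausibly precompute the ancestor set for each tagged taxon so that queries become simple index lookups.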