Dmitry Y Mozzherin, Alexander A Myltsev, David J Patterson.
Abstract
BACKGROUND: Scientific names in biology act as universal links. They allow us to cross-reference information about organisms globally. However, variations in the spelling of scientific names greatly diminish their ability to interconnect data. Such variations may include abbreviations, annotations, and misspellings. Authorship is part of a scientific name and may also differ significantly. To match all possible variations of a name we need to divide them into their elements and classify each element according to its role. We refer to this as 'parsing' the name. Parsing categorizes a name's elements into those that are stable and those that are prone to change. Names are matched first by combining them according to their stable elements. Matches are then refined by examining their varying elements. This two-stage process dramatically improves the number and quality of matches. It is especially useful for automatic data exchange within the context of "Big Data" in biology.
Keywords: Biodiversity; Biodiversity informatics; Names-based cyberinfrastructure; Parser; Parsing Expression Grammar; Scala; Scientific name; Semantic parser
Year: 2017 PMID: 28549446 PMCID: PMC5446698 DOI: 10.1186/s12859-017-1663-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Some legitimate versions of the scientific name for the ‘Northern Bulrush’ or ‘Singlespike Sedge’. The genus (Carex), species (scirpoidea), and subspecies (convoluta) may be annotated (var., subsp., and ssp.) or include or omit the name of the original authority for the infraspecies (Kükenthal), or for the species (Michaux), or for the current infraspecific combination (Dunlop). The name of the authority is sometimes abbreviated, sometimes differently spelled, and may be with or without initials and dates. This list is not complete. Image courtesy of [42]
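The two-stage matching described in the abstract can be illustrated with variants like those in Fig. 1. The sketch below is a deliberately naive, whitespace-based illustration in Python and is not the Parsing Expression Grammar used by gnparser. The function names (`parse`, `group_by_stable`, `compatible`) and the sample name-strings are assumptions made for this example. It treats the genus and lowercase epithets as stable elements, treats rank annotations and authorship as varying elements, groups name-strings by their stable elements, and then refines within a group by comparing authorship.

```python
import re
from collections import defaultdict

# Rank annotations mentioned in Fig. 1; they vary between sources.
RANK_MARKERS = {"var.", "subsp.", "ssp."}


def parse(name_string):
    """Split a name-string into (stable, varying) elements.

    Stable: genus plus lowercase epithets.
    Varying: rank annotations, authorship, and anything else.
    """
    tokens = name_string.split()
    stable, varying = [tokens[0]], []
    for token in tokens[1:]:
        if token not in RANK_MARKERS and token.isalpha() and token.islower():
            stable.append(token)
        else:
            varying.append(token)
    return " ".join(stable), " ".join(varying)


def author_words(varying):
    """Capitalised words in the varying part, used as a crude authorship key."""
    return {w for w in re.findall(r"[^\W\d_]+", varying) if w[0].isupper()}


def group_by_stable(name_strings):
    """Stage 1: bucket name-strings that share the same stable elements."""
    groups = defaultdict(list)
    for name in name_strings:
        stable, _ = parse(name)
        groups[stable].append(name)
    return dict(groups)


def compatible(a, b):
    """Stage 2: accept a pair unless both carry authorship with no word in common."""
    authors_a = author_words(parse(a)[1])
    authors_b = author_words(parse(b)[1])
    return not authors_a or not authors_b or bool(authors_a & authors_b)


# Illustrative variants modelled on Fig. 1 (not taken verbatim from the figure).
names = [
    "Carex scirpoidea var. convoluta Kük.",
    "Carex scirpoidea subsp. convoluta (Kük.) Dunlop",
    "Carex scirpoidea ssp. convoluta",
    "Carex scirpoidea Michx.",
]

groups = group_by_stable(names)
for stable, members in groups.items():
    print(stable, "<-", members)
```

The first three strings fall into one group despite their different rank annotations and authorship, while the species-level string stays separate. A sketch this simple misclassifies lowercase author particles (such as "ex" or "de") as epithets and cannot reconcile an abbreviated author with the full spelling, which is the kind of case a grammar-based parser is meant to handle.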
Fig. 2 Web Graphical User Interface [43]. In this example a user entered a name-string of a hybrid name consisting of 21 elements. The “Results and discussion” section contains the detailed parsed output in compact JSON format.
Precision/Recall for parsers applied to 1000 name-strings
| Metric | gnparser | gbif-parser | Biodiversity |
|---|---|---|---|
| True positives | 978 | 955 | 971 |
| True negatives | 13 | 12 | 13 |
| False positives | 9 | 32 | 16 |
| False negatives | 0 | 1 | 0 |
| Precision | 0.989 | 0.968 | 0.984 |
| Recall | 1.0 | 0.999 | 1.0 |
| F1 | 0.994 | 0.983 | 0.992 |
| Accuracy | 0.989 | 0.967 | 0.984 |
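The derived rows of this table follow from the four count rows by the standard definitions. The sketch below assumes the counts are, in order, true positives, true negatives, false positives, and false negatives. That reading is inferred from the arithmetic, and the function name `confusion_metrics` is introduced here for illustration. It uses the gbif-parser column as a worked example.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Return precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# gbif-parser column: 955, 12, 32 and 1 out of 1000 name-strings.
gbif = confusion_metrics(tp=955, tn=12, fp=32, fn=1)
print({name: round(value, 3) for name, value in gbif.items()})
# → {'precision': 0.968, 'recall': 0.999, 'f1': 0.983, 'accuracy': 0.967}
```

These values agree with the precision, recall, F1, and accuracy reported for gbif-parser.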
Accuracy of non-parseable names detection out of 100,000 name-strings
| Metric | gnparser | gbif-parser | Biodiversity |
|---|---|---|---|
| Flagged as non-parseable | 1131 | 1082 | 1161 |
| Correctly flagged | 1129 | 940 | 1152 |
| Incorrectly flagged | 2 | 142 | 9 |
| Accuracy | 0.998 | 0.869 | 0.992 |
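The final row of this table is consistent with the ratio of the second row to the first, assuming the first row counts name-strings flagged as non-parseable and the second counts those flagged correctly. This interpretation is inferred from the numbers. In every column the second and third rows sum to the first, and for gnparser, gbif-parser, and Biodiversity respectively:

```latex
\[
\frac{1129}{1131} \approx 0.998, \qquad
\frac{940}{1082} \approx 0.869, \qquad
\frac{1152}{1161} \approx 0.992
\]
```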
Fig. 3 Names parsed per second by GN, GBIF and Biodiversity parsers (running on 1–12 parallel threads)
Summary comparison of Scientific Name Parsers
| | gnparser | gbif-parser | Biodiversity |
|---|---|---|---|
| | 98.9% | 96.7% | 98.4% |
| | Yes | No | Yes |
| | Yes | No | Yes |
| | 8178 | 6389 | 1111 |
| | Complete | Partial | Complete |
| | Yes | Yes | Yes |
| | Yes | Yes | No |
| | Yes | No | Yes |
| | Yes | No | Yes |
| | Yes | Yes | Yes |
| | Yes | Yes | Yes |