Matteo Conti, Pier Luigi Nimis, Stefano Martellos.
Abstract
Scientific names are not part of everyday language in any modern country, and typing them as strings into a query system readily leads to typographical errors. While globally unique identifiers unambiguously address a taxon name, they can hardly be used for querying a database manually. Thus, matching algorithms are used to handle misspelled names in the query systems of several data repositories worldwide. To improve the user experience of FlorItaly, the Portal to the Flora of Italy, a near-match algorithm that resolves misspelled scientific names has been integrated into its query systems. In addition, a novel tool capable of rapidly aligning any list of names to the nomenclatural backbone provided by the national checklists has been developed in FlorItaly. This manuscript describes the potential of these new tools.
Keywords: checklist; database; nomenclature; synonym; taxon
Year: 2021 PMID: 34068389 PMCID: PMC8153551 DOI: 10.3390/plants10050974
Source DB: PubMed Journal: Plants (Basel) ISSN: 2223-7747
Figure 1. Results of a query in the advanced query interface of FlorItaly. The input name is incorrectly spelled (“officialis” instead of “officinalis”).
Figure 2. Advanced query result from FlorItaly. Since no exact matches were found for the misspelled input string “officialis”, the near-match algorithm returned the most similar string in the database (“officinalis”) and performed the query using it. Users are shown a warning at the top of the results page.
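The fallback behaviour described in this caption can be sketched as follows. This is only an illustration under stated assumptions: the source does not say which similarity measure FlorItaly uses, so Python's standard `difflib` stands in for it, and the epithet list, the `resolve_epithet` function, and the `cutoff` value are all hypothetical.

```python
from difflib import get_close_matches

# Hypothetical stand-in for the epithets stored in the database.
KNOWN_EPITHETS = ["officinalis", "vulgaris", "alpina", "sylvestris"]


def resolve_epithet(query, known=KNOWN_EPITHETS, cutoff=0.8):
    """Return (name_to_query, warning); warning is None for an exact match.

    difflib's similarity ratio is an assumption, not FlorItaly's own measure.
    """
    if query in known:
        return query, None
    # Fall back to the single most similar string above the cutoff.
    candidates = get_close_matches(query, known, n=1, cutoff=cutoff)
    if not candidates:
        return None, f'No match found for "{query}".'
    best = candidates[0]
    return best, f'No exact match for "{query}"; results shown for "{best}".'


name, warning = resolve_epithet("officialis")
print(name)     # the string actually used for the query
print(warning)  # the message shown to the user
```

Returning the warning alongside the resolved name mirrors the interface, where the substituted string is used for the query and the user is told about the substitution.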
Figure 3. Result page after the input of a CSV file containing a list of names. The interface reports the number of names in the list and the numbers of no matches, unambiguous matches, and ambiguous matches. Each part of the list can be hidden. For ambiguous matches, users can select one of the matches proposed by the algorithm (each listed with its matching score). When a string is too long to be shown in full, holding the pointer over the truncated string for a few seconds displays the full string in a text pop-up.
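The three-way split shown on the result page can be illustrated with a short sketch. The assumptions are loud ones: the source does not define how a match is judged ambiguous, so here a name is treated as ambiguous when more than one backbone name scores at or above a hypothetical `cutoff`; the score is `difflib`'s similarity ratio rather than FlorItaly's own matching score, and the small backbone list is illustrative only.

```python
from difflib import SequenceMatcher


def score(a, b):
    """Case-insensitive similarity in [0, 1] (difflib ratio; an assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify_names(names, backbone, cutoff=0.85):
    """Sort input names into unambiguous, ambiguous, and no-match groups."""
    result = {"unambiguous": {}, "ambiguous": {}, "no_match": []}
    for name in names:
        if name in backbone:  # exact match needs no scoring
            result["unambiguous"][name] = name
            continue
        scored = sorted(((score(name, b), b) for b in backbone), reverse=True)
        candidates = [(b, round(s, 2)) for s, b in scored if s >= cutoff]
        if not candidates:
            result["no_match"].append(name)
        elif len(candidates) == 1:
            result["unambiguous"][name] = candidates[0][0]
        else:
            # Several plausible targets: keep them, best first, with scores,
            # so that a user can pick one.
            result["ambiguous"][name] = candidates
    return result


# Illustrative backbone, not the national checklist.
backbone = ["Salvia officinalis", "Salvia pratensis", "Carex flava", "Carex flacca"]
report = classify_names(["Salvia officialis", "Carex flaca", "Quercus robur"], backbone)
for group, content in report.items():
    print(group, content)
```

Keeping the scored candidates for ambiguous names, instead of silently taking the best one, matches the interface's design of leaving the final choice to the user.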
Results of the list matching algorithm on three different datasets (see text). For each dataset, the table reports the total number of names, the positive matches (divided into unambiguous and ambiguous, the latter requiring user input), and the no matches.
| Dataset | Total Names | Unambiguous Matches | Ambiguous Matches | No Matches |
|---|---|---|---|---|
| A | 890 | 609 (68.43%) | 271 (30.45%) | 10 (1.12%) |
| B | 2981 | 2649 (88.86%) | 296 (9.93%) | 36 (1.21%) |
| C | 304 | 293 (96.38%) | 10 (3.29%) | 1 (0.33%) |
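The percentages in the table follow directly from the counts, each category being divided by the total number of names in its dataset. A minimal check of that arithmetic, using only the counts reported above (the variable and function names are illustrative):

```python
# Counts copied from the table: (unambiguous, ambiguous, no match).
counts = {
    "A": (609, 271, 10),
    "B": (2649, 296, 36),
    "C": (293, 10, 1),
}


def percentages(row):
    """Return each count as a percentage of the row total, to two decimals."""
    total = sum(row)
    return tuple(round(100 * n / total, 2) for n in row)


for dataset, row in counts.items():
    print(dataset, sum(row), percentages(row))
```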
The source code developed during this study is available on GitHub by querying for “FlorItaly name match”.