Hong Cui, Dongfang Xu, Steven S Chong, Martin Ramirez, Thomas Rodenhausen, James A Macklin, Bertram Ludäscher, Robert A Morris, Eduardo M Soto, Nicolás Mongiardino Koch.
Abstract
BACKGROUND: Taxonomic descriptions are traditionally composed in natural language and published in a format that cannot be directly used by computers. The Exploring Taxon Concepts (ETC) project has been developing a set of web-based software tools that convert morphological descriptions published in telegraphic style to character data that can be reused and repurposed. This paper introduces the first semi-automated pipeline, to our knowledge, that converts morphological descriptions into taxon-character matrices to support systematics and evolutionary biology research. We then demonstrate and evaluate the use of the ETC Input Creation - Text Capture - Matrix Generation pipeline to generate body part measurement matrices from a set of 188 spider morphological descriptions and report the findings.
Keywords: ETC; Evaluation; Explorer of Taxon Concepts; Information extraction; Natural language processing; Phenotypic characters; Phenotypic traits; Spiders; Taxonomic morphological descriptions; Text mining
Year: 2016 PMID: 27855645 PMCID: PMC5114841 DOI: 10.1186/s12859-016-1352-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1. ETC site homepage with expanded menu
Fig. 2. ETC logic flow for generating a matrix
Fig. 3. Batch file creation function
Fig. 4. An example input file for the text capture tool
Fig. 5. The define task step in the text capture tool
Fig. 6. ETC task manager. Five tasks are shown with their names (Name), start times (Created), sharing settings (Access), task types (Task Type), and progress or current step (Status). The green spinning wheel indicates that the task is currently running at a specific step
Fig. 7. The review step in the text capture tool
Fig. 8. Importing term categorizations in the text capture review step
Fig. 9. Example output file of the text capture tool
Fig. 10. The define task step in the matrix generation tool
Fig. 11. The preview and selection view at the matrix review step
Fig. 12. The spreadsheet view at the matrix review step. Two highlighted characters are candidates for a merge
Fig. 13. Character-related functions for the first column of the spreadsheet
Fig. 14. Example output (CSV) file of the matrix generation tool
Summary of the matrices generated from the original and the normalized inputs
| | Original input matrix, before edits | Original input matrix, after edits | Normalized input matrix, before edits | Normalized input matrix, after edits |
|---|---|---|---|---|
| Rows/Exemplars | 188 | 188 | 188 | 188 |
| Columns/Characters | 41 | 18 | 43 | 18 |
| Non-empty cells | 2942 | 2864 | 2914 | 2913 |
| Fullness of matrix (%) | 38.17% populated | 85.00% populated | 36.05% populated | 86.08% populated |
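The fullness figures can be checked from the other rows of the table. The sketch below assumes fullness is the number of non-empty cells divided by rows times columns; the function name `fullness` is illustrative and not part of ETC. Under that assumption three of the four reported values are reproduced, while the original input after edits computes to about 84.63% rather than the reported 85.00%, so that cell may rest on a count not shown here.

```python
def fullness(non_empty: int, rows: int, cols: int) -> float:
    """Percentage of cells in a rows x cols matrix that hold a value."""
    return 100.0 * non_empty / (rows * cols)


# (non-empty cells, rows, columns) as listed in the summary table
matrices = {
    "original, before edits": (2942, 188, 41),
    "original, after edits": (2864, 188, 18),
    "normalized, before edits": (2914, 188, 43),
    "normalized, after edits": (2913, 188, 18),
}

for name, (cells, rows, cols) in matrices.items():
    print(f"{name}: {fullness(cells, rows, cols):.2f}%")
```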
Accuracy, edit effort, and precision/recall/F1-based similarity scores of the matrices generated from the original and normalized inputs
| | Original input matrix | Normalized input matrix |
|---|---|---|
| Pre-edit accuracy (%) | 1.46% = 43/2942 | 98.83% = 2880/2914 |
| Number of edits (splits) | 46 (8) | 28 (3) |
| Post-edit precision (%) | 99.79% | 99.91% |
| Post-edit recall (%) | 98.92% | 99.65% |
| Post-edit F1 score (%) | 99.35% | 99.78% |
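The accuracy and F1 rows follow from the counts and from the precision and recall values in the same table. A minimal sketch, assuming pre-edit accuracy is correct cells over non-empty cells and F1 is the usual harmonic mean of precision and recall (the helper names are illustrative):

```python
def accuracy(correct: int, total: int) -> float:
    """Fraction of generated cells that are correct."""
    return correct / total


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Pre-edit accuracy, using the counts given in the table
print(f"{100 * accuracy(43, 2942):.2f}%")    # 1.46%
print(f"{100 * accuracy(2880, 2914):.2f}%")  # 98.83%

# Post-edit F1 from the reported precision and recall
print(f"{100 * f1_score(0.9979, 0.9892):.2f}%")  # 99.35%
print(f"{100 * f1_score(0.9991, 0.9965):.2f}%")  # 99.78%
```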
Summary of edit efforts made to the original and normalized input matrices
| | Original input matrix | Normalized input matrix |
|---|---|---|
| Splits | 8 | 3 |
| Deletions | 9 | 3 |
| Renames | 15 | 0 |
| Merges | 22 | 25 |
| Unedited | 3 | 18 |
| Values affected by edits | 3593 | 162 |
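The edit totals reported elsewhere in this record (46 and 28) match the sum of deletions, renames, and merges, with splits listed separately. The check below makes that arithmetic explicit. Treating splits as excluded from the total is an inference from the numbers, not a rule stated in this record, and the dictionary layout is illustrative.

```python
# Counts copied from the edit-effort summary table
edits = {
    "original": {"splits": 8, "deletions": 9, "renames": 15, "merges": 22},
    "normalized": {"splits": 3, "deletions": 3, "renames": 0, "merges": 25},
}


def counted_edits(row: dict) -> int:
    """Deletions + renames + merges; splits are assumed to be reported separately."""
    return row["deletions"] + row["renames"] + row["merges"]


for name, row in edits.items():
    print(f"{name}: {counted_edits(row)} edits ({row['splits']} splits)")
```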
Edit operations performed in the matrix generated from the original input
| Edit type | Characters affected | Operations | Edit effort |
|---|---|---|---|
| Delete | quantity of leg [15], character of carapace [4], length of carapace [147], width of carapace [76], length of abdomen [133], length of sternum [168], quantity of iii tibia/metatarsu [1], quantity of iii [8], quantity of leg-3 [1] | delete column | 9 |
| Rename | quantity of whole-organism [162] | Rename as "length of whole-organism (new)" | 1 |
| | quantity of carapace (split, length) [139] | Rename as "length of carapace (new)" | 1 |
| | quantity of carapace (split, width) [129] | Rename as "width of carapace (new)" | 1 |
| | quantity of abdomen (split, length) [133] | Rename as "length of abdomen (new)" | 1 |
| | quantity of sternum (split, length) [165] | Rename as "length of sternum (new)" | 1 |
| | quantity of spiracle-epigastrium [138] | Rename as "distance of spiracle-epigastrium (new)" | 1 |
| | quantity of spiracle-spinneret [155] | Rename as "distance of spiracle-spinneret (new)" | 1 |
| | quantity of i tibia [189] | Rename as "length of leg i tibia (new)" | 1 |
| | quantity of i metatarsus [189] | Rename as "length of leg i metatarsus (new)" | 1 |
| | quantity of ii tibia [188] | Rename as "length of leg ii tibia (new)" | 1 |
| | quantity of ii metatarsus [188] | Rename as "length of leg ii metatarsus (new)" | 1 |
| | quantity of iii tibia [185] | Rename as "length of leg iii tibia (new)" | 1 |
| | quantity of iii metatarsus [185] | Rename as "length of leg iii metatarsus (new)" | 1 |
| | quantity of iv tibia [186] | Rename as "length of leg iv tibia (new)" | 1 |
| | quantity of iv metatarsus [186] | Rename as "length of leg iv metatarsus (new)" | 1 |
| Merge | length of whole-organism (new) [162], quantity of body [1] | Merge into "length of whole-organism (new)" | 1 |
| | length of carapace (new) [129], quantity of prosoma (split, length) [37], quantity of thoracic-groove [6], quantity of cephalic-area [1], quantity of front [2], quantity of ocular-area (split, length) [2] | Merge into "length of carapace (new)" | 5 |
| | width of carapace (new) [139], quantity of prosoma (split, width) [37], quantity of ocular-area (split, width) [1] | Merge into "width of carapace (new)" | 2 |
| | length of palpal-tarsus [5]*, quantity of palpal-tarsus [57] | Merge into "length of palpal-tarsus" | 1 |
| | length of abdomen (new) [133], quantity of opisthosomum (split, length) [44] | Merge into "length of abdomen (new)" | 1 |
| | width of abdomen [2]*, quantity of opisthosomum (split, width) [7], quantity of abdomen (split, width) [128] | Merge into "width of abdomen" | 2 |
| | quantity of sternum (split, width) [160], width of sternum [1]* | Merge into "width of sternum" | 1 |
| | distance of spiracle-spinneret (new) [155], quantity of spiracle [1], quantity of spiracle spinneret [2] | Merge into "distance of spinneret-spiracle (new)" | 2 |
| | quantity of epigastric-furrow [1], distance of spiracle-epigastrium (new) [138], quantity of epigastrium-epigastrium [1], quantity of epigastrium-spiracle [20] | Merge into "distance of epigastrium-spiracle (new)" | 3 |
| | length of leg ii tibia (new) [188], quantity of ii (split, tibia) [5] | Merge into "length of leg ii tibia (new)" | 1 |
| | length of leg ii metatarsus (new) [188], quantity of ii (split, metatarsus) [3] | Merge into "length of leg ii metatarsus (new)" | 1 |
| | length of leg iv tibia (new) [186], quantity of iv (split, tibia) [4] | Merge into "length of leg iv tibia (new)" | 1 |
| | length of leg iv metatarsus (new) [186], quantity of iv (split, metatarsus) [3] | Merge into "length of leg iv metatarsus (new)" | 1 |
| Total edits | | | 46 |
The numbers in “[]” indicate the number of values affected by an edit operation. Characters indicated with an “*” were retained without edits
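A merge of the kind listed in the table folds several character columns into one. The sketch below illustrates the idea on a matrix held as nested dictionaries. It is not ETC's implementation: the function `merge_characters`, the rule of taking the first non-empty value, and the measurement values are all assumptions made for illustration. Only the two character names come from the table.

```python
from typing import Dict, List, Optional

# taxon name -> character name -> value (None marks an empty cell)
Matrix = Dict[str, Dict[str, Optional[str]]]


def merge_characters(matrix: Matrix, sources: List[str], target: str) -> Matrix:
    """Return a copy of the matrix with the source columns folded into target.

    Assumed rule: for each taxon, the first non-empty value found among the
    sources (in the order given) becomes the value of the target character.
    """
    merged: Matrix = {}
    for taxon, row in matrix.items():
        # Keep every column that is not being merged away
        new_row = {c: v for c, v in row.items() if c not in sources}
        new_row[target] = next(
            (row[c] for c in sources if row.get(c) not in (None, "")), None
        )
        merged[taxon] = new_row
    return merged


# Placeholder values, invented for this example
matrix = {
    "taxon A": {"length of abdomen (new)": "2.1",
                "quantity of opisthosomum (split, length)": None},
    "taxon B": {"length of abdomen (new)": None,
                "quantity of opisthosomum (split, length)": "1.8"},
}

result = merge_characters(
    matrix,
    sources=["length of abdomen (new)", "quantity of opisthosomum (split, length)"],
    target="length of abdomen (new)",
)
print(result)
```

A real tool would also need a policy for taxa that carry different values in two source columns; this sketch silently keeps the first one.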
Edit operations performed in the matrix generated from the normalized input
| Edit type | Characters affected | Operation | Edit effort |
|---|---|---|---|
| Merge | 1. length of whole-organism [161], length of body$1 [1] | Merge into | 1 |
| | 2. length of carapace [147], size of carapace [4], length of prosoma$2 [37], length of ocular-area [1], length of thoracic-groove [2], length of cephalic-area [1] | Merge into | 5 |
| | 3. width of carapace [152], width of prosoma$3 [37], width of ocular-area [1], width of thoracic-groove [2], width of cephalic-area [1] | Merge into | 4 |
| | 4. length of abdomen [138], length of opisthosomum$4 [44] | Merge into | 1 |
| | 5. width of abdomen [128], width of opisthosomum$5 [6] | Merge into | 1 |
| | 6. location of spiracle [1], size of spiracle spinneret [2], distance of spinneret-spiracle [155] | Merge into | 2 |
| | 7. distance of epigastric-furrow [1], distance of epigastrium-epigastrium$6 [1], distance of epigastrium-spiracle [158] | Merge into | 2 |
| | 8. length of leg-2 tibia [189], length of leg-2 [1], size_or_shape of leg-2 (split, tibia) [2] | Merge into | 2 |
| | 9. length of leg-2 metatarsus [189], size_or_shape of leg-2 (split, metatarsus) [2] | Merge into | 1 |
| | 10. length of leg-iii tibia [186], length of leg-iii [1], size_or_shape of leg-iii (split, tibia) [1] | Merge into | 2 |
| | 11. length of leg-iii metatarsus [186], size_or_shape of leg-iii (split, metatarsus) [1] | Merge into | 1 |
| | 12. length of leg-4 tibia [187], length of leg-4 [1], size_or_shape of leg-4 (split) [2] | Merge into | 2 |
| | 13. length of leg-4 metatarsus [187], size_or_shape of leg-4 (split) [2] | Merge into | 1 |
| Delete | 14. length of leg [3] | delete | 1 |
| | 15. size of abdomen [3] (values are non-numerical, e.g. tiny) | delete | 1 |
| | 16. length of leg-iii tibia/metatarsu [1] | delete | 1 |
| Total edits | | | 28 |
The numbers in “[]” indicate the number of values affected by an edit operation. The 18 characters in the gold standard were all included in the machine-generated matrix. The characters superscripted with “$N” are considered equivalent to a corresponding character in the gold standard, either by their semantic equivalence (i.e., $1) or by the experts’ decisions (i.e., $2–$6)
Exemplar-based precision, recall, and F1 scores of the matrix generated from the original input
| | Mean | SD | Min | Max | Number |
|---|---|---|---|---|---|
| Precision | 0.9981 | 0.01063 | 0.9333 | 1 | 188 |
| Recall | 0.9805 | 0.05153 | 0.7222 | 1 | 188 |
| F1 score | 0.9886 | 0.03055 | 0.8387 | 1 | 188 |
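Exemplar-based scores are computed one row (one described specimen or taxon) at a time and then summarized across the 188 rows. The sketch below shows one way to score a single row against a gold standard. The exact-match rule and the function `row_scores` are assumptions, since this record does not spell out the matching criteria. The placeholder row has 18 gold-standard values, of which 13 were extracted and none are wrong; it yields recall 0.7222 and F1 0.8387, which is consistent with the minimum values in the table, although those minima need not come from a single exemplar.

```python
from typing import Dict, Optional, Tuple

Row = Dict[str, Optional[str]]  # character name -> value (None = empty cell)


def row_scores(generated: Row, gold: Row) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one exemplar.

    Assumption: a generated cell is correct only if the gold standard holds
    exactly the same value for the same character.
    """
    gen = {(c, v) for c, v in generated.items() if v is not None}
    ref = {(c, v) for c, v in gold.items() if v is not None}
    correct = len(gen & ref)
    precision = correct / len(gen) if gen else 0.0
    recall = correct / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Placeholder exemplar: 18 gold values, 13 of them extracted, none wrong
gold = {f"character {i}": "1.0" for i in range(18)}
generated = {f"character {i}": "1.0" for i in range(13)}

p, r, f1 = row_scores(generated, gold)
print(f"{p:.4f} {r:.4f} {f1:.4f}")  # 1.0000 0.7222 0.8387
```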
Character-based precision, recall, and F1 scores of the matrix generated from the original input
| Character | Precision | Recall | F1 score | Character | Precision | Recall | F1 score |
|---|---|---|---|---|---|---|---|
| Length of whole-organism | 0.9947 | 0.9894 | 0.992 | distance of spinneret-spiracle | 1 | 1 | 1 |
| Length of carapace | 0.9883 | 0.8989 | 0.9415 | length of leg-i tibia | 1 | 0.9947 | 0.9973 |
| Width of carapace | 1 | 0.9149 | 0.9556 | length of leg-i metatarsus | 1 | 0.9947 | 0.9973 |
| Length of palpal-tarsus | 1 | 1 | 1 | length of leg-ii tibia | 1 | 1 | 1 |
| Length of sternum | 1 | 0.9681 | 0.9838 | length of leg-ii metatarsus | 1 | 1 | 1 |
| Width of sternum | 1 | 0.9787 | 0.9892 | length of leg-iii tibia | 1 | 0.9787 | 0.9892 |
| Length of abdomen | 0.9836 | 0.9574 | 0.9704 | length of leg-iii metatarsus | 1 | 0.9840 | 0.9920 |
| Width of abdomen | 0.9947 | 0.9894 | 0.992 | length of leg-iv tibia | 1 | 0.9947 | 0.9973 |
| Distance of epigastrium-spiracle | 1 | 1 | 1 | length of leg-iv metatarsus | 1 | 1 | 1 |
Exemplar-based precision, recall, and F1 scores of the matrix generated from the normalized input
| | Mean | SD | Min | Max | Number |
|---|---|---|---|---|---|
| Precision | 0.9991 | 0.00698 | 0.9444 | 1 | 188 |
| Recall | 0.9965 | 0.01872 | 0.8333 | 1 | 188 |
| F1 score | 0.9977 | 0.01158 | 0.9091 | 1 | 188 |
Character-based precision, recall, and F1 scores of the matrix generated from the normalized input
| Character | Precision | Recall | F1 score | Character | Precision | Recall | F1 score |
|---|---|---|---|---|---|---|---|
| Length of whole-organism | 1 | 0.9947 | 0.9973 | distance of spinneret-spiracle | 1 | 1 | 1 |
| Length of carapace | 1 | 0.9947 | 0.9973 | length of leg-i tibia | 1 | 0.9947 | 0.9973 |
| Width of carapace | 0.9947 | 0.9894 | 0.992 | length of leg-i metatarsus | 1 | 0.9947 | 0.9973 |
| Length of palpal-tarsus | 1 | 1 | 1 | length of leg-ii tibia | 1 | 1 | 1 |
| Length of sternum | 1 | 1 | 1 | length of leg-ii metatarsus | 1 | 1 | 1 |
| Width of sternum | 1 | 0.9947 | 0.9973 | length of leg-iii tibia | 1 | 0.9947 | 0.9973 |
| Length of abdomen | 0.9894 | 0.9894 | 0.9894 | length of leg-iii metatarsus | 1 | 0.9947 | 0.9973 |
| Width of abdomen | 1 | 1 | 1 | length of leg-iv tibia | 1 | 0.9947 | 0.9973 |
| Distance of epigastrium-spiracle | 1 | 1 | 1 | length of leg-iv metatarsus | 1 | 1 | 1 |