Bjørn Tore Kopperud1, Scott Lidgard2, Lee Hsiang Liow1,3.
Abstract
Documented occurrences of fossil taxa are the empirical foundation for understanding large-scale biodiversity changes and evolutionary dynamics in deep time. The fossil record contains vast numbers of understudied taxa. Yet the compilation of huge volumes of data remains a labour-intensive impediment to a more complete understanding of Earth's biodiversity history. Even so, many occurrence records of species and genera in these taxa can be uncovered in the palaeontological literature. Here, we extract observations of fossils and their inferred ages from unstructured text in books and scientific articles using machine-learning approaches. We use Bryozoa, a group of marine invertebrates with a rich fossil record, as a case study. Building on recent advances in computational linguistics, we develop a pipeline to recognize taxonomic names and geologic time intervals in published literature and use supervised learning to machine-read whether the species in question occurred in a given age interval. Intermediate machine error rates appear comparable to human error rates in a simple trial, and resulting genus richness curves capture the main features of published fossil diversity studies of bryozoans. We believe our automated pipeline, which greatly reduced the time required to compile our dataset, can help others compile similar data for other taxa.
Keywords: cheilostome bryozoans; fossil occurrences; information extraction; literature compilation; natural language processing; palaeobiodiversity
Year: 2019 PMID: 31014224 PMCID: PMC6501925 DOI: 10.1098/rspb.2019.0022
Source DB: PubMed Journal: Proc Biol Sci ISSN: 0962-8452 Impact factor: 5.349
Figure 1. (a) The general workflow for automatic information extraction of fossil occurrence data. (b) The machine-learning classifier: a bidirectional long short-term memory (LSTM) recurrent neural network, with the first example candidate as input. The numbers given are illustrative. Dashed arrows indicate dependency grammar links. See electronic supplementary material for details on the classifier. The figure style is inspired by Miwa & Bansal [23]. (Online version in colour.)
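To make the workflow in panel (a) concrete, the step between entity recognition and relation classification can be sketched as candidate generation: every taxon mention in a sentence is paired with every interval mention, and each pair is later passed to the classifier. This is a minimal sketch, not the authors' implementation. The name lists, the regular expressions, and the function `candidate_pairs` are all hypothetical, and dictionary lookup is swapped in for the name recognition the pipeline actually uses.

```python
import re
from itertools import product

# Hypothetical, tiny gazetteers used only for illustration.
KNOWN_GENERA = ("Wilbertopora", "Microporella")
INTERVALS = ("Albian", "Cenomanian", "Miocene", "Pliocene")

# A genus name, optionally followed by one lowercase word taken as the epithet.
# Caveat: this crude pattern would also swallow an ordinary word such as "is".
TAXON_RE = re.compile(r"\b(?:%s)(?: [a-z]+)?\b" % "|".join(KNOWN_GENERA))
INTERVAL_RE = re.compile(r"\b(?:%s)\b" % "|".join(INTERVALS))


def candidate_pairs(sentence):
    """Pair every taxon mention with every interval mention in one sentence."""
    taxa = TAXON_RE.findall(sentence)
    intervals = INTERVAL_RE.findall(sentence)
    return list(product(taxa, intervals))


sentence = "Wilbertopora mutabilis is known from the Albian, but not from the Miocene."
pairs = candidate_pairs(sentence)
# pairs == [("Wilbertopora mutabilis", "Albian"),
#           ("Wilbertopora mutabilis", "Miocene")]
```

Both pairs are only candidates. Deciding that the first expresses an occurrence and the second does not is the job of the relation classifier in panel (b).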
Figure 2. Receiver operating characteristic curve. The rates for the relation classifier are evaluated on the test set. Ninety-nine iterations are plotted in grey, and the one in black is chosen at random. b = 0.50 is the standard decision boundary, and b = 0.95 represents a false positive rate of 5%. The area under the black curve is 0.90. The dashed line represents the expected rates given a random classifier.
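A curve of this kind can be traced by sweeping the decision boundary over the classifier's scores and recording the false and true positive rates at each step. The following is a minimal sketch in plain Python, assuming binary labels (1 for a true occurrence relation) and one score per candidate. The function names `roc_points`, `auc`, and `rates_at` and the toy data are illustrative and do not come from the paper.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points from sweeping the boundary from high to low."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Candidates with tied scores cross the boundary together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def rates_at(labels, scores, b):
    """(FPR, TPR) when a candidate is accepted if its score is at least b."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= b and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= b and y == 0)
    pos = sum(labels)
    return fp / (len(labels) - pos), tp / pos


# Toy data, not taken from the paper.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
area = auc(roc_points(labels, scores))          # 8/9 for this toy data
fpr, tpr = rates_at(labels, scores, b=0.5)      # one false positive in three
```

A random classifier would give points along the diagonal and an area of about 0.5, which is what the dashed line in the figure represents.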
Figure 3. Range-through genus richness for cheilostomes. The curve from Taylor & Waeschenbach ([29], fig. 12) was obtained using a plot digitizer [54]. Our richness counts are supplemented with extant observations from WoRMS [12]. We used bins that are comparable with the bins used by Taylor & Waeschenbach [29]. The false positive rate evaluated on the test set is 27%. The geologic ranges for all genera are detailed in the electronic supplementary material, table S2. (Online version in colour.)
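Range-through richness counts a genus in every time bin between its first and last recorded occurrence, whether or not it was observed in each intervening bin. The sketch below shows one way to compute it, assuming occurrences are already assigned to integer bin indices ordered from oldest to youngest. The function name, the `extant` parameter, and the toy data are hypothetical. Extending extant genera to the youngest bin is an assumption about how present-day records would supplement the counts, not a description of the authors' procedure.

```python
def range_through_richness(occurrences, n_bins, extant=()):
    """Count genera per bin, filling gaps between first and last occurrence.

    occurrences: iterable of (genus, bin_index), bins ordered oldest to youngest.
    extant: genera known to be alive today; their ranges are extended to the
            youngest bin (an assumption made for this sketch).
    """
    first, last = {}, {}
    for genus, b in occurrences:
        first[genus] = min(b, first.get(genus, b))
        last[genus] = max(b, last.get(genus, b))
    for genus in extant:
        if genus in last:  # genera with no fossil occurrence are ignored
            last[genus] = n_bins - 1
    richness = [0] * n_bins
    for genus in first:
        for b in range(first[genus], last[genus] + 1):
            richness[b] += 1
    return richness


# Toy data: genus A spans bins 0-3, B spans 1-2, C occurs only in bin 2.
occ = [("A", 0), ("A", 3), ("B", 1), ("B", 2), ("C", 2)]
fossil_only = range_through_richness(occ, n_bins=4)                # [1, 2, 3, 1]
with_extant = range_through_richness(occ, n_bins=4, extant={"C"})  # [1, 2, 3, 2]
```

Genus A is counted in bins 1 and 2 despite having no occurrence there, which is the defining property of the range-through method.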