Gabriel Muñoz, W Daniel Kissling, E Emiel van Loon.
Abstract
BACKGROUND: A considerable portion of primary biodiversity data is digitally locked inside published literature, which is often stored as PDF files. Large-scale approaches to biodiversity science could benefit from retrieving this information and making it digitally accessible and machine-readable. Nonetheless, the amount and diversity of digitally published literature pose many challenges for knowledge discovery and retrieval. Text mining has been extensively used for data discovery tasks in large quantities of documents. However, text mining approaches for knowledge discovery and retrieval have been limited in biodiversity science compared to other disciplines. NEW INFORMATION: Here, we present a novel, open source text mining tool, the Biodiversity Observations Miner (BOM). This web application, written in R, allows the semi-automated discovery of punctual biodiversity observations (e.g. biotic interactions, functional or behavioural traits and natural history descriptions) associated with the scientific names present inside a corpus of scientific literature. Furthermore, BOM enables users to rapidly screen large quantities of literature based on word co-occurrences that match custom biodiversity dictionaries. This tool aims to increase the digital mobilisation of primary biodiversity data and is freely accessible via GitHub or through a web server.
Keywords: R; biodiversity data; biodiversity knowledge; biotic interactions; data mobilisation; scientific names; text mining
Year: 2019 PMID: 30692868 PMCID: PMC6344444 DOI: 10.3897/BDJ.7.e28737
Source DB: PubMed Journal: Biodivers Data J ISSN: 1314-2828
Figure 1. Sections of the Biodiversity Observations Miner (BOM) user interface: The figure illustrates the different parts that compose the user interface of the BOM web application. The interface is composed of three main components: a header (white bar on top), a sidebar menu (dark blue, on the left side) and the main page (cyan, in the centre). The header includes the application name (1), a button to collapse the sidebar menu (2) and a notification menu (3). The sidebar menu (4) contains the individual tabs to navigate across the functionalities of BOM. The main page (5) allows users to set parameters and obtain the results of the mining steps. In the main page, the headers of the settings boxes are colour-coded yellow, whereas the result boxes (i.e. text snippets) have red headers.
Figure 2. Example of a moving window of n = 6 of a skip-n-gram model over a piece of text from O'Farrill et al. (2013). The text has been cleaned of common stop words (e.g. "the", "all", "however"). Inside the moving window, a central word is fixed (randomly) and all possible word pairs are considered as word vectors. After this step is completed, the moving window advances one word and repeats the process. Frequencies of co-occurrences within the pool of word vectors are then used to rank word pairs.
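The moving-window idea in this caption can be illustrated with a minimal Python sketch. It is not BOM's implementation (BOM is written in R): the function name `window_pair_counts`, the tiny stop-word list and the regular-expression tokeniser are assumptions made for illustration, and the random choice of a central word is omitted, so every pair inside each window is counted.

```python
import re
from collections import Counter
from itertools import combinations

# Illustrative stop-word list (assumption); a real run would use a fuller list.
STOP_WORDS = {"the", "all", "however", "a", "an", "of", "and"}


def window_pair_counts(text, n=6, stop_words=STOP_WORDS):
    """Count co-occurring word pairs inside a moving window of n words."""
    # Lower-case, keep alphabetic tokens only, and drop stop words.
    tokens = [w for w in re.findall(r"[a-z]+", text.lower())
              if w not in stop_words]
    counts = Counter()
    # Slide the window one word at a time over the cleaned token list.
    for start in range(max(len(tokens) - n + 1, 1)):
        window = tokens[start:start + n]
        # Every unordered pair of distinct words in the window is one co-occurrence.
        for a, b in combinations(window, 2):
            if a != b:
                counts[tuple(sorted((a, b)))] += 1
    return counts


if __name__ == "__main__":
    text = "The monkeys swallow seeds and monkeys disperse seeds"
    # Rank word pairs by how often they co-occur within a window.
    for pair, freq in window_pair_counts(text, n=3).most_common():
        print(pair, freq)
```

Because overlapping windows revisit the same neighbouring words, pairs that sit close together in the text accumulate higher counts, which is what makes the frequencies usable for ranking.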
Figure 3. Example of one text snippet resulting from running Biodiversity Observations Miner with O'Farrill et al. (2013) as input. This text snippet (i.e. biodiversity observation) contains data about a frugivory interaction between plants and animals. Here, the biodiversity data come from the description of the monkeys as frugivores of fruits. The terms "swallow" and "dispersal" were part of the frugivory biodiversity dictionary included in BOM. Red boxes highlight the taxonomic entities recognised using the Global Names Architecture API implemented with the taxize (Chamberlain and Szöcs 2013) R package. The green boxes show the matches of frugivory dictionary terms within the text snippet.
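The dictionary matching described in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper `dictionary_matches` is hypothetical, the dictionary holds just the two terms named in the caption, and prefix matching (so that "swallowed" would also match "swallow") is a design choice of this sketch rather than documented BOM behaviour. Scientific-name recognition through taxize is not reproduced here.

```python
import re

# Only the two terms named in the caption; BOM's actual dictionary is larger.
FRUGIVORY_TERMS = {"swallow", "dispersal"}


def dictionary_matches(snippet, terms=FRUGIVORY_TERMS):
    """Return the sorted, de-duplicated words in a snippet that match a dictionary term."""
    words = re.findall(r"[a-z]+", snippet.lower())
    # Prefix matching is an assumption of this sketch, so inflected forms also count.
    return sorted({w for w in words if any(w.startswith(t) for t in terms)})


if __name__ == "__main__":
    snippet = "Monkeys swallow the seeds, aiding seed dispersal."
    print(dictionary_matches(snippet))
```

A snippet with at least one dictionary match and at least one recognised scientific name would then be a candidate biodiversity observation for manual review.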