Maxwell J Farrell, Liam Brierley, Anna Willoughby, Andrew Yates, Nicole Mideo.
Abstract
Ecology and evolutionary biology, like other scientific fields, are experiencing an exponential growth of academic manuscripts. As domain knowledge accumulates, scientists will need new computational approaches for identifying relevant literature to read and include in formal literature reviews and meta-analyses. Importantly, these approaches can also facilitate automated, large-scale data synthesis tasks and build structured databases from the information in the texts of primary journal articles, books, grey literature, and websites. The increasing availability of digital text, computational resources, and machine-learning-based language models has led to a revolution in text analysis and natural language processing (NLP) in recent years. NLP has been widely adopted across the biomedical sciences but is rarely used in ecology and evolutionary biology. Applying computational tools from text mining and NLP will increase the efficiency of data synthesis, improve the reproducibility of literature reviews, formalize analyses of research biases and knowledge gaps, and promote data-driven discovery of patterns across ecology and evolutionary biology. Here we present recent use cases from ecology and evolution, and discuss future applications, limitations, and ethical issues.
Keywords: computational linguistics; database construction; information extraction; literature synthesis; natural language processing
Year: 2022 PMID: 35582795 PMCID: PMC9114983 DOI: 10.1098/rspb.2021.2721
Source DB: PubMed Journal: Proc Biol Sci ISSN: 0962-8452 Impact factor: 5.530
Figure 1. Publication trends indicating an earlier adoption, and greater (a) absolute number and (b) proportion of papers involving text mining in biomedical publications compared to ecology and evolutionary biology. Data were from two Web of Science (WOS) searches: one with ‘*medic*’ and the other with ‘ecology’ OR ‘evolutionary biology’ OR ‘biodiversity’ in the Topic field, plus ‘text mining’ OR ‘Natural Language Processing’ OR ‘NLP’ in All Fields for each search. A total of 5262 biomedical papers and 120 ecology/evolutionary biology papers mentioning text mining or NLP were identified out of a total of 2 355 632 biomedical and 354 798 ecology/evolution papers. Searches were conducted on 10 September 2021 via the University of Toronto subscription. Note that WOS search results vary owing to institutional subscriptions [10]. Search results were subset to the years 1990–2020 inclusive. Data and R code to reproduce the figure, and .bib files with citation information for the returned articles, can be accessed at https://github.com/maxfarrell/textmining_trends. (Online version in colour.)
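As a rough check on the gap described in the Figure 1 caption, the overall proportions can be computed from the four totals it reports. This is only a sketch: it assumes all four counts refer to the same 1990–2020 search window, it collapses the yearly trends shown in the figure into single numbers, and the dictionary keys and the `proportion` helper are names chosen here for illustration.

```python
# Totals as reported in the Figure 1 caption (Web of Science searches).
# Assumption: all four counts cover the same 1990-2020 window.
counts = {
    "biomedical": {"text_mining": 5262, "total": 2_355_632},
    "ecology_evolution": {"text_mining": 120, "total": 354_798},
}


def proportion(field):
    """Share of papers in a field that mention text mining or NLP."""
    c = counts[field]
    return c["text_mining"] / c["total"]


for field in counts:
    # Print each share as a percentage with four decimal places.
    print(f"{field}: {proportion(field):.4%}")

# How many times larger the biomedical share is than the ecology/evolution share.
ratio = proportion("biomedical") / proportion("ecology_evolution")
print(f"biomedical / ecology_evolution: {ratio:.1f}x")
```

Under these assumptions, the share of papers mentioning text mining or NLP is several times higher in the biomedical search than in the ecology and evolution search, in line with panel (b).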
Figure 2. Potential applications of natural language models in ecology and evolution. The simplest application is training and applying a document classifier to predict relevant documents (top row). Given a training set of relevant and non-relevant documents (which may come from existing databases, a manually curated training set, or documents tagged by a set of rules), the relevance of new documents may be predicted and prioritized for manual screening and curation, or downstream information extraction. Manual screening may be used to validate predictive models or to re-train and fine-tune the original classifier. Once a set of relevant documents is identified, the subjects of the documents can be explored through named entity recognition (NER; middle row). Named entities can be identified by comparing text strings to a dictionary. If a complete set of entities is not known or available, a machine learning-based NER tool can be used to predict entities and identify never-before-seen terms. Given a training set, NER can be used to identify terms in a text (for example, species, genes, proteins, locations, morphological structures) and tag their locations in the text. Once components of a document are tagged (parts of speech, named entities, numbers), relationships among them can be identified to create structured datasets for analysis (bottom row). Relationships may be inferred through term co-occurrence frequencies, sentence structures (dependency parsing), or through machine learning-based models that predict the nature of the relationship. Relational data can take a variety of forms including species interactions, biological measurements and their associated units, or networks of different relationship types (ontologies). Figure created with BioRender.com. (Online version in colour.)
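Two of the simpler steps in the Figure 2 caption, dictionary-based NER and relationship inference from term co-occurrence, can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' pipeline: the dictionary entries, the example sentences, the naive sentence splitter, and the function names are all assumptions made for the sketch.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical dictionary mapping lower-cased terms to entity types.
DICTIONARY = {
    "ixodes scapularis": "species",
    "borrelia burgdorferi": "species",
    "peromyscus leucopus": "species",
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tag_entities(sentence, dictionary=DICTIONARY):
    """Return sorted (start, end, term, type) tuples for dictionary matches."""
    lowered = sentence.lower()
    hits = []
    for term, entity_type in dictionary.items():
        # Whole-word, case-insensitive match of each dictionary term.
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", lowered):
            hits.append((m.start(), m.end(), term, entity_type))
    return sorted(hits)


def cooccurrences(text, dictionary=DICTIONARY):
    """Count how often each pair of terms appears in the same sentence."""
    counts = Counter()
    for sentence in split_sentences(text):
        terms = sorted({hit[2] for hit in tag_entities(sentence, dictionary)})
        for pair in combinations(terms, 2):
            counts[pair] += 1
    return counts


# Made-up example text, for illustration only.
EXAMPLE = (
    "Ixodes scapularis feeds on Peromyscus leucopus. "
    "Borrelia burgdorferi was detected in Ixodes scapularis. "
    "Peromyscus leucopus is abundant in forest fragments."
)
pairs = cooccurrences(EXAMPLE)
for pair, n in pairs.items():
    print(pair, n)
```

Co-occurrence in a sentence only suggests that two entities may be related; it says nothing about the nature of the relationship, which is where the dependency parsing and machine learning-based models mentioned in the caption come in.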
Table of common relationship types in ecology and evolution, with example texts. Species names are in italics, entities are underlined, and relations are in bold.
| example of relation | example text |
|---|---|
| measurements and units | ‘The |
| model-specific parameters | ‘ |
| species interactions | ‘ |
| protein–protein interactions | ‘ |
| habitat associations | ‘ |
| species occurrences | ‘ |
| Linnaean taxonomy/common names/synonyms | ‘ |
| anthropogenic impacts | ‘ |