
ENVIRONMENTS and EOL: identification of Environment Ontology terms in text and the annotation of the Encyclopedia of Life.

Evangelos Pafilis1, Sune P Frankild1, Julia Schnetzer2, Lucia Fanini1, Sarah Faulwetter1, Christina Pavloudi1, Katerina Vasileiadou1, Patrick Leary1, Jennifer Hammock1, Katja Schulz1, Cynthia Sims Parr1, Christos Arvanitidis1, Lars Juhl Jensen1.   

Abstract

SUMMARY: The association of organisms with their environments is a key issue in exploring biodiversity patterns. This knowledge has traditionally been scattered, but textual descriptions of taxa and their habitats are now being consolidated in centralized resources. However, structured annotations are needed to facilitate large-scale analyses. Therefore, we developed ENVIRONMENTS, a fast dictionary-based tagger capable of identifying Environment Ontology (ENVO) terms in text. We evaluate the accuracy of the tagger on a new manually curated corpus of 600 Encyclopedia of Life (EOL) species pages. We use the tagger to associate taxa with environments by tagging EOL text content monthly, and integrate the results into the EOL to disseminate them to a broad audience of users.
AVAILABILITY AND IMPLEMENTATION: The software and the corpus are available under the open-source BSD and the CC-BY-NC-SA 3.0 licenses, respectively, at http://environments.hcmr.gr.
© The Author 2015. Published by Oxford University Press.

Journal: Bioinformatics        Year: 2015        ISSN: 1367-4803        DOI: 10.1093/bioinformatics/btv045        PMID: 25619994        PMCID: PMC4443677


1 Introduction

The Encyclopedia of Life (EOL; http://eol.org/) is a web resource offering biodiversity knowledge summaries of the world’s species to a vast audience (Parr et al., 2014). It currently aggregates content from more than 250 providers, including textual descriptions of the biology, such as the habitat, of more than 900 000 taxa. The Environment Ontology (ENVO) project aims to provide a controlled, structured vocabulary to support annotation of organisms with environmental descriptors (Buttigieg et al., 2013). The ontology comprises ∼1600 terms and is part of recommended (meta-)genomic metadata standards (Yilmaz et al., 2011). Having the environmental information contained in EOL annotated in the form of ENVO terms, rather than as free text, would enhance search capabilities and enable users to easily compile summary statistics on, for example, the ecological distribution of any taxon. However, manually annotating all EOL entries with ENVO terms would be very time-consuming. An attractive alternative is to use text mining to automatically tag ENVO terms. So far, the few efforts to perform named entity recognition of environments have focused on tagging bacteria biotopes (Bossy ) or made use of generic tools not optimized for the task (Thessen and Parr, 2014). Here, we present ENVIRONMENTS, a tagger capable of identifying ENVO terms with sufficient accuracy and speed to be useful for annotating large text corpora. To benchmark the method, we developed a new gold standard corpus of manually annotated EOL taxon pages. Finally, we have extended the EOL web resource with ENVO terms for each taxon, which are automatically mined from the taxon's textual descriptions.

2 ENVO term tagger

ENVIRONMENTS identifies ENVO terms in text using the same fast dictionary-based tagging engine as in Pafilis et al. (2013). The command-line tool requires only a single parameter, namely the path to a folder with the text files to be processed (see supplementary information). We constructed a dictionary based on ENVO by extracting all names and synonyms from the OBO file (see supplementary information), excluding broad synonyms, obsolete terms, terms describing foods rather than environments and terms representing organisms and tissues, which are better captured by other ontologies. Each name connected to multiple terms was assigned to the term that best captures its meaning, by ranking the terms based on whether the name was the primary name, an exact synonym, a narrow synonym or a related synonym of the term. Because ENVO usually lists only the singular noun forms, we automatically generated plural and adjective forms. A small fraction of the names would result in many false positives due to homonymy. We created a block list of such names by inspecting text for all names that appeared more than 2000 times in Medline and EOL. To find important synonyms missing in ENVO, we tagged the habitat and ecology sections of 1 342 968 EOL pages that are not part of our evaluation set (see supplementary information), and inspected all words occurring more than 100 times in untagged text segments. Based on this analysis and on false negatives found in the development part of the curated corpus, we added 142 synonyms to the dictionary.
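The dictionary construction described above can be illustrated with a minimal Python sketch. It is not the ENVIRONMENTS implementation: the toy OBO text, the `TOY:` identifiers, the function names and the naive pluralisation rule are all assumptions made for illustration, and the food, organism and tissue filters as well as adjective generation are omitted. The sketch only shows how obsolete terms and broad synonyms can be skipped, how a name shared by several terms can be resolved by ranking primary names above exact, narrow and related synonyms, and how a block list can be applied before matching.

```python
import re

# Lower rank wins when one name is attached to several terms.
SCOPE_RANK = {"NAME": 0, "EXACT": 1, "NARROW": 2, "RELATED": 3}


def parse_obo(text):
    """Yield (term_id, name, scope) from OBO text.

    Obsolete terms and BROAD synonyms are skipped.
    """
    for stanza in text.split("[Term]")[1:]:
        stanza = stanza.split("\n[")[0]  # cut at any following non-Term stanza
        if re.search(r"^is_obsolete:\s*true", stanza, re.M):
            continue
        id_match = re.search(r"^id:\s*(\S+)", stanza, re.M)
        if not id_match:
            continue
        term_id = id_match.group(1)
        name_match = re.search(r"^name:\s*(.+)$", stanza, re.M)
        if name_match:
            yield term_id, name_match.group(1).strip().lower(), "NAME"
        for syn, scope in re.findall(r'^synonym:\s*"([^"]+)"\s+(\w+)', stanza, re.M):
            if scope in SCOPE_RANK:  # BROAD is absent from SCOPE_RANK
                yield term_id, syn.strip().lower(), scope


def plural(name):
    """Naive English plural of a name (illustrative assumption only)."""
    if re.search(r"(s|x|z|ch|sh)$", name):
        return name + "es"
    if re.search(r"[^aeiou]y$", name):
        return name[:-1] + "ies"
    return name + "s"


def build_dictionary(obo_text, blocklist=frozenset()):
    """Map each surface form to the single best-ranked term identifier."""
    best = {}  # surface form -> (rank, term_id)
    for term_id, name, scope in parse_obo(obo_text):
        rank = SCOPE_RANK[scope]
        for form in (name, plural(name)):
            if form in blocklist:
                continue
            if form not in best or rank < best[form][0]:
                best[form] = (rank, term_id)
    return {form: term_id for form, (_, term_id) in best.items()}


def tag(text, dictionary):
    """Return (start, end, term_id) for every dictionary name found in text."""
    if not dictionary:
        return []
    names = sorted(dictionary, key=len, reverse=True)  # prefer longer names
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b", re.I)
    return [(m.start(), m.end(), dictionary[m.group(1).lower()])
            for m in pattern.finditer(text)]


# Toy ontology with hypothetical identifiers (not real ENVO entries).
TOY_OBO = """format-version: 1.2

[Term]
id: TOY:0000001
name: forest
synonym: "woodland" RELATED []

[Term]
id: TOY:0000002
name: woodland
synonym: "wood" BROAD []

[Term]
id: TOY:0000003
name: old marsh
is_obsolete: true
"""

toy_dictionary = build_dictionary(TOY_OBO)
toy_mentions = tag("Found in forests and open woodland.", toy_dictionary)
```

In this toy example, "woodland" is both a related synonym of one term and the primary name of another, so the ranking assigns it to the latter. A regular-expression alternation is used here only for brevity; it would not scale to the corpus sizes reported in this work.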

3 Manually curated corpus

To construct an evaluation corpus that covers diverse environment types, we retrieved 313 269 EOL species pages representing the clades Actinopterygii, Annelida, Arthropoda, Aves, Chlorophyta, Mammalia, Mollusca and Streptophyta. From these, we extracted the sections about behavior, biology, dispersal, distribution, ecology, habitat, legislation, migration, reproduction and trophic strategy. After discarding very short (<100 words) and very long (>1000 words) pages, we randomly selected 75 species from each of the eight clades. The resulting 600 species pages were randomly distributed among six annotators; 20% of the pages were given to two annotators to allow for assessment of inter-annotator agreement (IAA). Each person independently annotated ENVO terms in the text, with no knowledge of which pages had been assigned to a second annotator. Based on the shared pages, we find that the median pairwise Cohen's kappa is 0.65, implying that the overall IAA is acceptable despite the difficulty of the annotation task (a Cohen's kappa value of 0 indicates agreement no better than chance, while a value of 1 indicates total agreement). The corpus was partitioned into two sets of equal size that were used for development and final evaluation, respectively (see supplementary information).
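As an illustration of the agreement statistic, the following Python sketch computes Cohen's kappa for two annotators and the median over annotator pairs. The unit of comparison is an assumption: the sketch treats each annotator's output as one label per item (for example, one label per token), which is not necessarily how agreement was computed for this corpus. The function names and the toy labels are hypothetical.

```python
from collections import Counter
from statistics import median


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's own label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


def median_pairwise_kappa(pairs):
    """Median kappa over (labels_a, labels_b) pairs, one pair per annotator pair."""
    return median(cohens_kappa(a, b) for a, b in pairs)


# Toy labels: 1 = item belongs to an environment mention, 0 = it does not.
annotator_1 = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 0, 1, 0]
toy_kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the two toy annotators agree on 8 of 10 items, while chance agreement is 0.52, so kappa is (0.80 - 0.52) / (1 - 0.52), approximately 0.58.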

4 Performance evaluation

Since the ENVIRONMENTS tagger recognizes names within text and links them to ENVO terms, we benchmarked both aspects of its performance. To quantify the extent to which the tagger recognizes the same text fragments as the annotators, we calculated precision and recall at the mention level, considering both exact and partially overlapping matches as true positives. On the evaluation part of the corpus, this resulted in 87.8% precision and 77.0% recall, corresponding to an F1 score of 82.0%. For the matches that were considered true positives in the recognition task, we further evaluated whether the tagger linked them to the same ENVO terms as the annotators did. In 87.1% of cases, the tagger and the annotator agreed on at least one ENVO term (see supplementary information).
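The mention-level scoring can be sketched in Python as follows. This is a hedged illustration rather than the evaluation script used here: mentions are assumed to be (start, end) character spans with an exclusive end, a predicted span counts as a true positive for precision if it overlaps any gold span, and a gold span counts as found for recall if it overlaps any predicted span. The function names and the toy spans are hypothetical.

```python
def overlaps(span_a, span_b):
    """True if two (start, end) spans share at least one character."""
    return span_a[0] < span_b[1] and span_b[0] < span_a[1]


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(predicted, gold):
    """Mention-level precision, recall and F1 with overlap matching."""
    matched_predictions = sum(any(overlaps(p, g) for g in gold) for p in predicted)
    matched_gold = sum(any(overlaps(g, p) for p in predicted) for g in gold)
    precision = matched_predictions / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Toy spans: one exact match, one partial overlap, one miss on each side.
gold_spans = [(0, 6), (10, 18), (30, 35)]
predicted_spans = [(0, 6), (12, 18), (40, 44)]
toy_precision, toy_recall, toy_f1 = evaluate(predicted_spans, gold_spans)

# Consistency check on the figures reported above (87.8% precision, 77.0% recall).
reported_f1 = f1_score(0.878, 0.770)
```

With the reported precision and recall, the harmonic mean is 2 × 0.878 × 0.770 / (0.878 + 0.770) ≈ 0.820, which agrees with the stated F1 score of 82.0%.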

5 Annotation of EOL

To realize the full potential of any text-mining system, it is important that it is adopted by the broad community and that its results are disseminated to the intended end users. To this end, we have integrated the ENVIRONMENTS tagger with the EOL web resource to provide users with ENVO terms for each taxon. Each month, we rerun the tagger on all English text in environment-related sections of EOL (see supplementary information). As of October 2014, this gave rise to 1 077 522 annotations of ENVO terms for 234 582 EOL taxa. We make these environment annotations available to end users in three different ways. First, we show them within the EOL taxon web pages, which provide links to the relevant paragraphs in the textual descriptions for each ENVO term (Fig. 1). Second, they can be queried through the web interface or the application programming interface of the new EOL/Traitbank semantic web data repository of organismal traits (http://eol.org/traitbank). Third, the full annotation dataset can be downloaded in tab-delimited format from http://download.jensenlab.org/EOL/.
Fig. 1.

Top: The “Overview” tab of the EOL taxon pages shows a subset of the ENVO terms obtained through text mining; an extended list of such terms is available in the “Data” tab. Parts of the page have been resized to improve readability. Bottom: The latter list provides links to the EOL text sections where each term was found (highlighted in bold).

6 Future work

The ENVIRONMENTS tagger is applicable to large text sources other than EOL. For example, it can be applied to text fields, such as isolation source, in GenBank (Hirschman et al., 2008). Combined with the SPECIES tagger (Pafilis et al., 2013), it can also be used to extract species–environment pairs from the scientific literature, such as legacy biodiversity literature (Gwinn and Rinaldo, 2009).

Funding

The Encyclopedia Of Life Rubenstein Fellows Program [CRDF EOL-33066-13/E33066], the LifeWatchGreece Research Infrastructure [384676-94/GSRT/ NSRF(C&E)] and the Novo Nordisk Foundation Center for Protein Research [NNF14CC0001].

Conflict of Interest: none declared.
References

1.  Habitat-Lite: a GSC case study based on free text terms for environmental metadata.

Authors:  Lynette Hirschman; Cheryl Clark; K Bretonnel Cohen; Scott Mardis; Joanne Luciano; Renzo Kottmann; James Cole; Victor Markowitz; Nikos Kyrpides; Norman Morrison; Lynn M Schriml; Dawn Field
Journal:  OMICS       Date:  2008-06

2.  Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications.

Authors:  Pelin Yilmaz; Renzo Kottmann; Dawn Field; Rob Knight; James R Cole; Linda Amaral-Zettler; Jack A Gilbert; Ilene Karsch-Mizrachi; Anjanette Johnston; Guy Cochrane; Robert Vaughan; Christopher Hunter; Joonhong Park; Norman Morrison; Philippe Rocca-Serra; Peter Sterk; Manimozhiyan Arumugam; Mark Bailey; Laura Baumgartner; Bruce W Birren; Martin J Blaser; Vivien Bonazzi; Tim Booth; Peer Bork; Frederic D Bushman; Pier Luigi Buttigieg; Patrick S G Chain; Emily Charlson; Elizabeth K Costello; Heather Huot-Creasy; Peter Dawyndt; Todd DeSantis; Noah Fierer; Jed A Fuhrman; Rachel E Gallery; Dirk Gevers; Richard A Gibbs; Inigo San Gil; Antonio Gonzalez; Jeffrey I Gordon; Robert Guralnick; Wolfgang Hankeln; Sarah Highlander; Philip Hugenholtz; Janet Jansson; Andrew L Kau; Scott T Kelley; Jerry Kennedy; Dan Knights; Omry Koren; Justin Kuczynski; Nikos Kyrpides; Robert Larsen; Christian L Lauber; Teresa Legg; Ruth E Ley; Catherine A Lozupone; Wolfgang Ludwig; Donna Lyons; Eamonn Maguire; Barbara A Methé; Folker Meyer; Brian Muegge; Sara Nakielny; Karen E Nelson; Diana Nemergut; Josh D Neufeld; Lindsay K Newbold; Anna E Oliver; Norman R Pace; Giriprakash Palanisamy; Jörg Peplies; Joseph Petrosino; Lita Proctor; Elmar Pruesse; Christian Quast; Jeroen Raes; Sujeevan Ratnasingham; Jacques Ravel; David A Relman; Susanna Assunta-Sansone; Patrick D Schloss; Lynn Schriml; Rohini Sinha; Michelle I Smith; Erica Sodergren; Aymé Spo; Jesse Stombaugh; James M Tiedje; Doyle V Ward; George M Weinstock; Doug Wendel; Owen White; Andrew Whiteley; Andreas Wilke; Jennifer R Wortman; Tanya Yatsunenko; Frank Oliver Glöckner
Journal:  Nat Biotechnol       Date:  2011-05

3.  The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text.

Authors:  Evangelos Pafilis; Sune P Frankild; Lucia Fanini; Sarah Faulwetter; Christina Pavloudi; Aikaterini Vasileiadou; Christos Arvanitidis; Lars Juhl Jensen
Journal:  PLoS One       Date:  2013-06-18

4.  The environment ontology: contextualising biological and biomedical entities.

Authors:  Pier Luigi Buttigieg; Norman Morrison; Barry Smith; Christopher J Mungall; Suzanna E Lewis
Journal:  J Biomed Semantics       Date:  2013-12-11

5.  Knowledge extraction and semantic annotation of text from the encyclopedia of life.

Authors:  Anne E Thessen; Cynthia Sims Parr
Journal:  PLoS One       Date:  2014-03-03

6.  The Encyclopedia of Life v2: Providing Global Access to Knowledge About Life on Earth.

Authors:  Cynthia S Parr; Nathan Wilson; Patrick Leary; Katja S Schulz; Kristen Lans; Lisa Walley; Jennifer A Hammock; Anthony Goddard; Jeremy Rice; Marie Studer; Jeffrey T G Holmes; Robert J Corrigan
Journal:  Biodivers Data J       Date:  2014-04-29