RegEl corpus: identifying DNA regulatory elements in the scientific literature.

Samuele Garda, Freyda Lenihan-Geels, Sebastian Proft, Stefanie Hochmuth, Markus Schülke, Dominik Seelow, Ulf Leser.

Abstract

High-throughput technologies have led to the generation of a wealth of data on regulatory DNA elements in the human genome. However, results from disease-driven studies are primarily shared in textual form as scientific articles. Information extraction (IE) algorithms allow this information to be (semi-)automatically accessed. Their development, however, depends on the availability of annotated corpora. Therefore, we introduce RegEl (Regulatory Elements), the first freely available corpus annotated with regulatory DNA elements, comprising 305 PubMed abstracts for a total of 2690 sentences. We focus on enhancers, promoters and transcription factor binding sites. Three annotators worked in two stages, achieving an overall inter-annotator agreement of 0.73 F1, and 0.46 F1 for regulatory elements. Depending on the entity type, IE baselines reach F1-scores of 0.48-0.91 for entity detection and 0.71-0.88 for entity normalization. Next, we apply our entity detection models to the entire PubMed collection and extract co-occurrences of genes or diseases with regulatory elements. This generates large collections of regulatory elements associated with 137 870 unique genes and 7420 diseases, which we make openly available. Database URL: https://zenodo.org/record/6418451#.YqcLHvexVqg.
© The Author(s) 2022. Published by Oxford University Press.
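
The abstract reports inter-annotator agreement as an F1 score. As a rough illustration of how such a figure can be computed, the sketch below treats one annotator as the reference and the other as the prediction, and counts only exact matches of span and entity type. The matching criterion, the tuple layout, the function name `pairwise_f1` and the toy annotations are all assumptions made for this example; the abstract does not specify how the authors matched annotations.

```python
from typing import Set, Tuple

# Hypothetical annotation layout: (document id, start offset, end offset, entity type).
Annotation = Tuple[str, int, int, str]


def pairwise_f1(reference: Set[Annotation], other: Set[Annotation]) -> float:
    """F1 between two annotators under exact span-and-type matching.

    `reference` plays the role of the gold standard; because F1 is symmetric
    in precision and recall, swapping the two arguments gives the same score.
    """
    if not reference and not other:
        return 1.0  # both annotators agree that there is nothing to annotate
    matches = len(reference & other)
    if matches == 0:
        return 0.0
    precision = matches / len(other)
    recall = matches / len(reference)
    return 2 * precision * recall / (precision + recall)


# Toy data (invented for illustration): the annotators disagree on the end
# offset of the promoter mention, so only two of three annotations match.
annotator_1 = {
    ("doc1", 0, 8, "enhancer"),
    ("doc1", 20, 28, "promoter"),
    ("doc2", 5, 9, "gene"),
}
annotator_2 = {
    ("doc1", 0, 8, "enhancer"),
    ("doc1", 20, 30, "promoter"),
    ("doc2", 5, 9, "gene"),
}

score = pairwise_f1(annotator_1, annotator_2)
print(f"pairwise F1: {score:.2f}")
```

With three annotators, one plausible choice is to average this score over all annotator pairs; a more lenient variant would accept overlapping rather than identical spans.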

Year: 2022   PMID: 35758881   PMCID: PMC9235371   DOI: 10.1093/database/baac043

Source DB: PubMed   Journal: Database (Oxford)   ISSN: 1758-0463   Impact factor: 4.462
