ResidueFinder: extracting individual residue mentions from protein literature.

Ton E Becker, Eric Jakobsson.

Abstract

BACKGROUND: The revolution in molecular biology has shown how protein function and structure are based on specific sequences of amino acids. Thus, an important feature in many papers is the mention of the significance of individual amino acids in the context of the entire sequence of the protein. MutationFinder is a widely used program for finding mentions of specific mutations in texts. We report on augmenting the positive attributes of MutationFinder with a more inclusive regular expression list to create ResidueFinder, which finds mentions of native amino acids as well as mutations. We also consider parameter options for both ResidueFinder and MutationFinder to explore trade-offs between precision, recall, and computational efficiency. We test our methods and software in full text as well as abstracts.
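To illustrate the kind of pattern matching described above, here is a minimal sketch of a single regular expression that separates native residue mentions (e.g. "Ser195") from point mutations (e.g. "Ser195Ala"). It is an assumption made for illustration only: the pattern, the names `AA3`, `MENTION`, and `find_mentions`, and the example sentence are hypothetical and are not taken from the ResidueFinder or MutationFinder expression lists, which cover many more formats.

```python
import re

# Three-letter codes for the 20 standard amino acids.
AA3 = ("Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|"
       "Phe|Pro|Ser|Thr|Trp|Tyr|Val")

# Hypothetical pattern: residue code, optional space or hyphen, position,
# and optionally a second residue code directly after the position.
MENTION = re.compile(
    rf"\b(?P<res>{AA3})[ -]?(?P<pos>[1-9]\d*)(?P<mut>{AA3})?\b"
)

def find_mentions(text):
    """Return (kind, residue, position, mutant) tuples found in text."""
    mentions = []
    for m in MENTION.finditer(text):
        # A trailing residue code marks a substitution; otherwise the
        # match is treated as a mention of a native residue.
        kind = "mutation" if m.group("mut") else "residue"
        mentions.append(
            (kind, m.group("res"), int(m.group("pos")), m.group("mut"))
        )
    return mentions

example = ("The catalytic triad Asp102, His-57 and Ser 195 is disrupted "
           "by the Ser195Ala substitution.")
for mention in find_mentions(example):
    print(mention)
```

This sketch only handles three-letter codes in a few spacings. One-letter forms such as "S195A" and written-out forms such as "serine 195" would each need additional patterns.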
RESULTS: We find that residues are mentioned in a much wider variety of formats in the full text of papers than in abstracts alone. Failure to take these multiple formats into account results in many false negatives. Since MutationFinder, like several other programs, was primarily tested on abstracts, we found it necessary to build an expanded regular expression list to achieve acceptable recall in full-text searches. We also discovered a number of artifacts arising from PDF-to-text conversion, and we added elements to the regular expression library to address them. Taking these factors into account resulted in high recall on randomly selected primary research articles. We also developed a streamlined regular expression (called "cut") that enables a several-hundredfold speedup in both MutationFinder and ResidueFinder with only a modest compromise in recall. All regular expressions were tested using expanded F-measure statistics, i.e., we compute Fβ for various values of β, where the larger the value of β, the more recall is weighted, and the smaller the value of β, the more precision is weighted.
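As a sketch of the Fβ statistic mentioned above, the standard definition is Fβ = (1 + β²)·P·R / (β²·P + R), where P is precision and R is recall. The code below computes the same quantity from raw counts. The function name `f_beta` and the example counts are hypothetical, chosen only to show how β shifts the balance, and are not results from the paper.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta score from true positive, false positive, false negative counts.

    beta > 1 weights recall more heavily; beta < 1 weights precision more.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    b2 = beta * beta
    # Equivalent to (1 + b2) * P * R / (b2 * P + R), written in counts
    # so that the zero-match case needs only one guard.
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0

# Hypothetical counts: precision 0.80, recall about 0.67.
for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: F={f_beta(tp=80, fp=20, fn=40, beta=beta):.3f}")
```

Because precision exceeds recall in this made-up example, the score falls as β grows, which is the behavior the abstract describes: larger β rewards recall, smaller β rewards precision.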
CONCLUSIONS: ResidueFinder is a simple, effective, and efficient program for finding individual residue mentions in primary literature, starting from text files. It is implemented in Python and available on SourceForge.net. The most computationally efficient versions of ResidueFinder could enable the creation and maintenance of a database of residue mentions encompassing all articles in PubMed.
© 2021. The Author(s).

Keywords:  Amino Acid Residue; Bioinformatics; Mutation; MutationFinder; Natural Language Processing; Point Mutation; Text Mining

Year:  2021        PMID: 34289903      PMCID: PMC8293528          DOI: 10.1186/s13326-021-00243-3

Source DB:  PubMed          Journal:  J Biomed Semantics


