Finding relevant references to genes and proteins in Medline using a Bayesian approach.

Julie E Leonard, Jeffrey B Colombe, Joshua L Levy.

Abstract

MOTIVATION: Mining the biomedical literature for references to genes and proteins always involves a tradeoff between high precision with false negatives, and high recall with false positives. Having a reliable method for assessing the relevance of literature mining results is crucial to finding ways to balance precision and recall, and for subsequently building automated systems to analyze these results. We hypothesize that abstracts and titles that discuss the same gene or protein use similar words. To validate this hypothesis, we built a dictionary- and rule-based system to mine Medline for references to genes and proteins, and used a Bayesian metric for scoring the relevance of each reference assignment.
RESULTS: We analyzed the entire set of Medline records from 1966 to late 2001, and scored each gene and protein reference using a Bayesian estimated probability (EP) based on word frequency in a training set of 137837 known assignments from 30594 articles to 36197 gene and protein symbols. Two test sets of 148 and 150 randomly chosen assignments, respectively, were hand-validated and categorized as either good or bad. The distributions of EP values, when plotted on a log-scale histogram, are shown to differ markedly between good and bad assignments. Using EP values, recall was 100% at 61% precision (EP = 2 × 10^-5), 63% at 88% precision (EP = 0.008), and 10% at 100% precision (EP = 0.1). These results show that Medline entries discussing the same gene or protein have similar word usage, and that our method of assessing this similarity using EP values is valid. The method enables an EP cutoff value to be determined that accurately and reproducibly balances precision and recall, allowing automated analysis of literature mining results.
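
The cutoff tradeoff reported above can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that an assignment is accepted when its EP value is at or above the cutoff, and that each test assignment has been hand-labelled as good or bad. The function name `precision_recall_at_cutoff` and the toy data are hypothetical and do not reproduce the paper's figures.

```python
from typing import List, Tuple


def precision_recall_at_cutoff(
    scored: List[Tuple[float, bool]], cutoff: float
) -> Tuple[float, float]:
    """Return (precision, recall) for a hand-validated test set.

    scored: (ep_value, is_good) pairs; is_good is the manual label.
    Assumption: an assignment is accepted when ep_value >= cutoff.
    """
    accepted = [is_good for ep, is_good in scored if ep >= cutoff]
    total_good = sum(1 for _, is_good in scored if is_good)
    true_pos = sum(accepted)  # True counts as 1
    # Convention chosen here: precision is 1.0 when nothing is accepted.
    precision = true_pos / len(accepted) if accepted else 1.0
    recall = true_pos / total_good if total_good else 0.0
    return precision, recall


# Toy, made-up scores for illustration only (not data from the paper).
TOY_ASSIGNMENTS = [
    (0.2, True),
    (0.05, True),
    (0.009, True),
    (0.009, False),
    (0.0005, True),
    (0.00003, False),
    (0.00001, False),
]

for cutoff in (2e-5, 0.008, 0.1):
    p, r = precision_recall_at_cutoff(TOY_ASSIGNMENTS, cutoff)
    print(f"cutoff={cutoff:g} precision={p:.2f} recall={r:.2f}")
```

Raising the cutoff discards low-scoring assignments first, so precision rises while recall falls. This is the same pattern as the three operating points quoted in the abstract.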

Year:  2002        PMID: 12424124     DOI: 10.1093/bioinformatics/18.11.1515

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


  4 in total

1.  Deafness mutation mining using regular expression based pattern matching.

Authors:  Christopher M Frenz
Journal:  BMC Med Inform Decis Mak       Date:  2007-10-25       Impact factor: 2.796

2.  Semi-automated literature mining to identify putative biomarkers of disease from multiple biofluids.

Authors:  Rick Jordan; Shyam Visweswaran; Vanathi Gopalakrishnan
Journal:  J Clin Bioinforma       Date:  2014-10-23

3.  OSIRISv1.2: a named entity recognition system for sequence variants of genes in biomedical literature.

Authors:  Laura I Furlong; Holger Dach; Martin Hofmann-Apitius; Ferran Sanz
Journal:  BMC Bioinformatics       Date:  2008-02-05       Impact factor: 3.169

4.  TXTGate: profiling gene groups with text-based information.

Authors:  Patrick Glenisson; Bert Coessens; Steven Van Vooren; Janick Mathys; Yves Moreau; Bart De Moor
Journal:  Genome Biol       Date:  2004-05-28       Impact factor: 13.583
