Associating genes with gene ontology codes using a maximum entropy analysis of biomedical literature.

Soumya Raychaudhuri, Jeffrey T Chang, Patrick D Sutphin, Russ B Altman.

Abstract

Functional characterizations of thousands of gene products from many species are described in the published literature. These discussions are extremely valuable for characterizing the functions not only of these gene products, but also of their homologs in other organisms. The Gene Ontology (GO) is an effort to create a controlled terminology for labeling gene functions in a more precise, reliable, computer-readable manner. Currently, the best annotations of gene function with the GO are performed by highly trained biologists who read the literature and select appropriate codes. In this study, we explored the possibility that statistical natural language processing techniques can be used to assign GO codes. We compared three document classification methods (maximum entropy modeling, naïve Bayes classification, and nearest-neighbor classification) on the problem of associating a set of GO codes (for biological process) with literature abstracts and thus with the genes associated with the abstracts. We showed that maximum entropy modeling outperforms the other methods and achieves an accuracy of 72% when ascertaining the function discussed within an abstract. The maximum entropy method provides confidence measures that correlate well with performance. We conclude that statistical methods may be used to assign GO codes and may be useful for the difficult task of reassignment as terminology standards evolve over time.
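
To make the classification setup concrete, the following is a minimal sketch of a maximum entropy (multinomial logistic) text classifier that assigns one label to an abstract and reports the winning class probability as a confidence score. It is not the authors' implementation: the abstract does not specify their features or training procedure, so the bag-of-words features, the plain stochastic gradient training, the function names (`train_maxent`, `predict_proba`, `classify`), and the four toy documents with their two labels are all assumptions made purely for illustration.

```python
import math
from collections import Counter

# Toy training data: invented word lists, NOT taken from the paper.
DOCS = [
    "kinase phosphorylation signal cascade",
    "receptor signal transduction kinase",
    "ribosome translation protein synthesis",
    "trna ribosome translation elongation",
]
# Illustrative labels standing in for GO biological process codes.
LABELS = [
    "signal transduction",
    "signal transduction",
    "translation",
    "translation",
]


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption, kept deliberately simple)."""
    return text.lower().split()


def predict_proba(weights, feats):
    """Softmax over per-class linear scores of the word counts."""
    scores = {
        c: sum(wv.get(w, 0.0) * n for w, n in feats.items())
        for c, wv in weights.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}


def train_maxent(docs, labels, epochs=200, lr=0.5):
    """Fit per-class word weights by stochastic gradient ascent on the log-likelihood."""
    classes = sorted(set(labels))
    vocab = sorted({w for d in docs for w in tokenize(d)})
    weights = {c: {w: 0.0 for w in vocab} for c in classes}
    counts = [Counter(tokenize(d)) for d in docs]
    for _ in range(epochs):
        for feats, gold in zip(counts, labels):
            probs = predict_proba(weights, feats)
            for c in classes:
                # Gradient term: observed indicator minus predicted probability.
                err = (1.0 if c == gold else 0.0) - probs[c]
                for w, n in feats.items():
                    weights[c][w] += lr * err * n
    return weights


def classify(weights, text):
    """Return the most probable label and its probability (the confidence)."""
    probs = predict_proba(weights, Counter(tokenize(text)))
    best = max(probs, key=probs.get)
    return best, probs[best]


weights = train_maxent(DOCS, LABELS)
label, confidence = classify(weights, "kinase signal")
print(label, round(confidence, 3))
```

The probability returned by `classify` plays the role of the confidence measure mentioned in the abstract; in a realistic setting it would need to be checked against held-out data before being trusted.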

Year:  2002        PMID: 11779846      PMCID: PMC155261          DOI: 10.1101/gr.199701

Source DB:  PubMed          Journal:  Genome Res        ISSN: 1088-9051            Impact factor:   9.043


