Literature-based concept profiles for gene annotation: the issue of weighting.

Rob Jelier, Martijn J Schuemie, Peter-Jan Roes, Erik M van Mulligen, Jan A Kors.

Abstract

BACKGROUND: Text mining has been used to link biomedical concepts, such as genes or biological processes, to each other for annotation purposes or for the generation of new hypotheses. To relate two concepts to each other, several authors have used the vector space model, because vectors can be compared efficiently and transparently. In this model, a concept is characterized by a list of associated concepts, together with weights that indicate the strength of each association. The associated concepts in the vectors and their weights are derived from a set of documents linked to the concept of interest. An important issue with this approach is how to determine the weights of the associated concepts. Various weighting schemes have been proposed, but no comparative studies of the different approaches are available. Here we compare several weighting approaches in a large-scale classification experiment.
METHODS: Three different techniques were evaluated: (1) weighting based on averaging, an empirical approach; (2) the log likelihood ratio, a test-based measure; (3) the uncertainty coefficient, an information-theory-based measure. The weighting schemes were applied in a system that annotates genes with Gene Ontology codes. As the gold standard for our study we used the annotations provided by the Gene Ontology Annotation project. Classification performance was evaluated by means of the receiver operating characteristic (ROC) curve, using the area under the curve (AUC) as the measure of performance.
RESULTS AND DISCUSSION: All methods performed well, with median AUC scores greater than 0.84, and scored considerably higher than a binary approach without any weighting. Performance was excellent especially for the more specific Gene Ontology codes. The differences between the methods were small when considering the whole experiment. However, the number of documents linked to a concept proved to be an important variable. When larger amounts of text were available for generating the concepts' vectors, the performance of the methods diverged considerably, with the uncertainty coefficient then outperforming the two other methods.
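
As an illustration of the two statistical weighting schemes named in METHODS, the sketch below computes a log likelihood ratio (G-squared) and an uncertainty coefficient from a 2x2 document-count table. It is a minimal sketch, not the authors' implementation: the abstract does not give the formulas, so the table layout (a, b, c, d), the function names, the choice to normalize by the entropy of the associated concept's occurrence, and the example counts are all assumptions made for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def log_likelihood_ratio(a, b, c, d):
    """G-squared statistic for a 2x2 table of document counts.

    Assumed layout (hypothetical, not taken from the paper):
      a = documents linked to the concept of interest that mention the associated concept
      b = documents linked to the concept of interest that do not mention it
      c = other documents that mention the associated concept
      d = other documents that do not mention it
    """
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    rows = [a + b, c + d]
    cols = [a + c, b + d]
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:
                expected = rows[i] * cols[j] / n
                total += o * math.log(o / expected)
    return 2.0 * total


def uncertainty_coefficient(a, b, c, d):
    """U(X|Y) = (H(X) - H(X|Y)) / H(X).

    X = whether a document mentions the associated concept,
    Y = whether a document is linked to the concept of interest.
    The direction of normalization is an assumption.
    """
    n = a + b + c + d
    h_x = entropy([(a + c) / n, (b + d) / n])
    if h_x == 0:
        return 0.0
    h_x_given_y = 0.0
    for hit, miss in ((a, b), (c, d)):
        row = hit + miss
        if row > 0:
            h_x_given_y += (row / n) * entropy([hit / row, miss / row])
    return (h_x - h_x_given_y) / h_x


# Hypothetical counts: 40 of 100 concept-linked documents mention the
# associated concept, against 60 of 9900 remaining documents.
llr_weight = log_likelihood_ratio(40, 60, 60, 9840)
uc_weight = uncertainty_coefficient(40, 60, 60, 9840)
```

Under these definitions the two measures are closely related: G-squared equals 2N times the mutual information of the table, while the uncertainty coefficient divides that mutual information by an entropy term. This bounds the coefficient between 0 and 1 regardless of corpus size.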

Year:  2007        PMID: 17827057     DOI: 10.1016/j.ijmedinf.2007.07.004

Source DB:  PubMed          Journal:  Int J Med Inform        ISSN: 1386-5056            Impact factor:   4.046


  19 in total

1.  Finding potentially new multimorbidity patterns of psychiatric and somatic diseases: exploring the use of literature-based discovery in primary care research.

Authors:  Rein Vos; Sil Aarts; Erik van Mulligen; Job Metsemakers; Martin P van Boxtel; Frans Verhey; Marjan van den Akker
Journal:  J Am Med Inform Assoc       Date:  2013-06-17       Impact factor: 4.497

2.  Popular computational methods to assess multiprotein complexes derived from label-free affinity purification and mass spectrometry (AP-MS) experiments. (Review)

Authors:  Irina M Armean; Kathryn S Lilley; Matthew W B Trotter
Journal:  Mol Cell Proteomics       Date:  2012-10-15       Impact factor: 5.911

3.  Huntington Disease Gene Expression Signatures in Blood Compared to Brain of YAC128 Mice as Candidates for Monitoring of Pathology.

Authors:  Elsa C Kuijper; Lodewijk J A Toonen; Maurice Overzier; Roula Tsonaka; Kristina Hettne; Marco Roos; Willeke M C van Roon-Mom; Eleni Mina
Journal:  Mol Neurobiol       Date:  2022-01-29       Impact factor: 5.590

4.  Beegle: from literature mining to disease-gene discovery.

Authors:  Sarah ElShal; Léon-Charles Tranchevent; Alejandro Sifrim; Amin Ardeshirdavani; Jesse Davis; Yves Moreau
Journal:  Nucleic Acids Res       Date:  2015-09-17       Impact factor: 16.971

5.  Evaluation of genome-wide association study results through development of ontology fingerprints.

Authors:  Lam C Tsoi; Michael Boehnke; Richard L Klein; W Jim Zheng
Journal:  Bioinformatics       Date:  2009-04-05       Impact factor: 6.937

6.  Proteomic analysis of the dysferlin protein complex unveils its importance for sarcolemmal maintenance and integrity.

Authors:  Antoine de Morrée; Paul J Hensbergen; Herman H H B M van Haagen; Irina Dragan; André M Deelder; Peter A C 't Hoen; Rune R Frants; Silvère M van der Maarel
Journal:  PLoS One       Date:  2010-11-05       Impact factor: 3.240

7.  The Text-mining based PubChem Bioassay neighboring analysis.

Authors:  Lianyi Han; Tugba O Suzek; Yanli Wang; Steve H Bryant
Journal:  BMC Bioinformatics       Date:  2010-11-08       Impact factor: 3.169

8.  The autoimmune tautology: an in silico approach.

Authors:  Ricardo A Cifuentes; Daniel Restrepo-Montoya; Juan-Manuel Anaya
Journal:  Autoimmune Dis       Date:  2012-03-05

9.  Multi-label literature classification based on the Gene Ontology graph.

Authors:  Bo Jin; Brian Muller; Chengxiang Zhai; Xinghua Lu
Journal:  BMC Bioinformatics       Date:  2008-12-08       Impact factor: 3.169

10.  Novel protein-protein interactions inferred from literature context.

Authors:  Herman H H B M van Haagen; Peter A C 't Hoen; Alessandro Botelho Bovo; Antoine de Morrée; Erik M van Mulligen; Christine Chichester; Jan A Kors; Johan T den Dunnen; Gert-Jan B van Ommen; Silvère M van der Maarel; Vinícius Medina Kern; Barend Mons; Martijn J Schuemie
Journal:  PLoS One       Date:  2009-11-18       Impact factor: 3.240
