The Ineffectiveness of Within-Document Term Frequency in Text Classification.

W John Wilbur, Won Kim.

Abstract

For the purposes of classification it is common to represent a document as a bag of words. Such a representation consists of the individual terms making up the document together with the number of times each term appears in the document. All classification methods make use of the terms. It is also common to make use of the local term frequencies, at the price of some added complication in the model. Examples are the naïve Bayes multinomial model (MM), the Dirichlet compound multinomial model (DCM) and the Exponential-Family Approximation of the DCM (EDCM), as well as support vector machines (SVM). Although it is usually claimed that incorporating local word frequency in a document improves text classification performance, we here test whether such claims hold. In this paper we show experimentally that simplified forms of the MM, EDCM, and SVM models which ignore the frequency of each word in a document perform at about the same level as MM, DCM, EDCM and SVM models which incorporate local term frequency. We also present a new form of the naïve Bayes multivariate Bernoulli model (MBM) which is able to make use of local term frequency, and show again that it offers no significant advantage over the plain MBM. We conclude that word burstiness is so strong that additional occurrences of a word add essentially no useful information to a classifier.
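The contrast the abstract draws, between a bag of words that keeps local term frequencies and one that records only term presence, can be illustrated with a small sketch. This is not the authors' implementation: the toy corpus, the class labels, and the helper names (`bag_of_words`, `binarize`, `train_multinomial_nb`, `predict`) are assumptions made for illustration, and the classifier is a plain multinomial naïve Bayes with add-one smoothing rather than any of the exact models evaluated in the paper.

```python
from collections import Counter
import math


def bag_of_words(text):
    """Map a document to {term: local term frequency}."""
    return Counter(text.lower().split())


def binarize(counts):
    """Drop local term frequency: every term that occurs gets count 1."""
    return Counter({term: 1 for term in counts})


def train_multinomial_nb(docs, labels):
    """Fit a multinomial naive Bayes model with add-one smoothing.

    docs is a list of Counters; labels is a parallel list of class names.
    """
    vocab = {t for d in docs for t in d}
    class_docs = Counter(labels)
    term_totals = {c: Counter() for c in class_docs}
    for d, c in zip(docs, labels):
        term_totals[c].update(d)
    model = {}
    for c in class_docs:
        denom = sum(term_totals[c].values()) + len(vocab)
        log_prior = math.log(class_docs[c] / len(docs))
        log_like = {t: math.log((term_totals[c][t] + 1) / denom) for t in vocab}
        model[c] = (log_prior, log_like)
    return model


def predict(model, doc):
    """Return the class with the highest log posterior; unseen terms are skipped."""
    def score(c):
        log_prior, log_like = model[c]
        return log_prior + sum(n * log_like[t] for t, n in doc.items() if t in log_like)
    return max(model, key=score)


# Hypothetical toy corpus (not data from the paper).
train_texts = [
    "gene gene protein binding",
    "protein protein gene expression",
    "market market stock price",
    "stock price price trading",
]
labels = ["bio", "bio", "fin", "fin"]

counted = [bag_of_words(t) for t in train_texts]   # keeps term frequency
binary = [binarize(d) for d in counted]            # presence only

model_tf = train_multinomial_nb(counted, labels)
model_bin = train_multinomial_nb(binary, labels)

query = bag_of_words("gene gene gene expression price")
print(predict(model_tf, query))             # frequency-aware prediction
print(predict(model_bin, binarize(query)))  # presence-only prediction
```

On this toy example both variants assign the query to the same class, which mirrors the kind of comparison the abstract describes. It demonstrates only the mechanics of the two representations and says nothing about the paper's experimental results.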

Year: 2009; PMID: 19802376; PMCID: PMC2744136; DOI: 10.1007/s10791-008-9069-5

Source DB: PubMed; Journal: Inf Retr Boston; ISSN: 1386-4564; Impact factor: 2.293


