Comparing Medline citations using modified N-grams.

Rao Muhammad Adeel Nawab, Mark Stevenson, Paul Clough.

Abstract

OBJECTIVE: We aim to identify duplicate pairs of Medline citations, particularly when the documents are not identical but contain similar information.
MATERIALS AND METHODS: Duplicate pairs of citations are identified by comparing word n-grams in pairs of documents. N-grams are modified using two approaches which take account of the fact that the document may have been altered. These are: (1) deletion, an item in the n-gram is removed; and (2) substitution, an item in the n-gram is substituted with a similar term obtained from the Unified Medical Language System Metathesaurus. N-grams are also weighted using a score derived from a language model. Evaluation is carried out using a set of 520 Medline citation pairs, including a set of 260 manually verified duplicate pairs obtained from the Deja Vu database.
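The two n-gram modifications described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the containment-style overlap score, the function names, and the small `synonyms` dictionary (a stand-in for a lookup in the Unified Medical Language System Metathesaurus) are all assumptions made for illustration, and the language-model weighting is omitted.

```python
from typing import Dict, List, Set, Tuple

Ngram = Tuple[str, ...]


def word_ngrams(tokens: List[str], n: int) -> Set[Ngram]:
    """Return the set of word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deletion_variants(ngram: Ngram) -> Set[Ngram]:
    """Deletion: remove one item at a time, giving (n-1)-grams."""
    if len(ngram) < 2:
        return set()
    return {ngram[:i] + ngram[i + 1:] for i in range(len(ngram))}


def substitution_variants(ngram: Ngram, synonyms: Dict[str, List[str]]) -> Set[Ngram]:
    """Substitution: swap one item for a similar term.

    `synonyms` is a hypothetical toy mapping standing in for a thesaurus lookup.
    """
    out: Set[Ngram] = set()
    for i, word in enumerate(ngram):
        for alt in synonyms.get(word, ()):
            out.add(ngram[:i] + (alt,) + ngram[i + 1:])
    return out


def variants(ngram: Ngram, synonyms: Dict[str, List[str]],
             use_deletion: bool, use_substitution: bool) -> Set[Ngram]:
    """The n-gram itself plus whichever modified forms are enabled."""
    out = {ngram}
    if use_deletion:
        out |= deletion_variants(ngram)
    if use_substitution:
        out |= substitution_variants(ngram, synonyms)
    return out


def containment(suspicious: List[str], source: List[str], n: int,
                synonyms: Dict[str, List[str]],
                use_deletion: bool = False,
                use_substitution: bool = False) -> float:
    """Fraction of n-grams in `suspicious` that match `source`.

    Assumption: two n-grams match when their variant sets intersect.
    """
    sus = word_ngrams(suspicious, n)
    if not sus:
        return 0.0
    src_variants: Set[Ngram] = set().union(
        *(variants(g, synonyms, use_deletion, use_substitution)
          for g in word_ngrams(source, n)))
    matched = sum(
        1 for g in sus
        if variants(g, synonyms, use_deletion, use_substitution) & src_variants)
    return matched / len(sus)


a = "the patient received aspirin daily".split()
b = "the patient received acetylsalicylic daily".split()
c = "the patient received daily".split()
toy_synonyms = {"aspirin": ["acetylsalicylic"]}  # hypothetical example entry

plain = containment(a, b, 3, {})
with_substitution = containment(a, b, 3, toy_synonyms, use_substitution=True)
with_deletion = containment(c, a, 3, {}, use_deletion=True)
```

In this toy example the plain trigram overlap between `a` and `b` is low because a single word differs, while enabling substitution lets every trigram match. Likewise, enabling deletion lets the shortened text `c` match `a` fully.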
RESULTS: The approach accurately detects duplicate Medline document pairs with an F1 measure score of 0.99. Allowing for word deletions and substitution improves performance. The best results are obtained by combining scores for n-grams of length 1-5 words.
DISCUSSION: Results show that the detection of duplicate Medline citations can be improved by modifying n-grams and that high performance can also be obtained using only unigrams (F1=0.959), particularly when allowing for substitutions of alternative phrases.
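For readers unfamiliar with the F1 measure reported above, the following sketch shows how it is computed from gold-standard and predicted duplicate labels, as the harmonic mean of precision and recall. The function name and the toy labels are illustrative assumptions and are not data from the study.

```python
from typing import Sequence


def f1_score(gold: Sequence[bool], predicted: Sequence[bool]) -> float:
    """F1 = 2 * P * R / (P + R), where True marks a duplicate pair."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for four citation pairs (hypothetical, for illustration only).
gold = [True, True, False, False]
predicted = [True, False, False, True]
score = f1_score(gold, predicted)
```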

Keywords:  Natural Language Processing; PubMed

Year:  2013        PMID: 23715801      PMCID: PMC3912705          DOI: 10.1136/amiajnl-2012-001552

Source DB:  PubMed          Journal:  J Am Med Inform Assoc        ISSN: 1067-5027            Impact factor:   4.497


  8 in total

1.  Duplicate publication in the field of otolaryngology-head and neck surgery.

Authors:  Byron J Bailey
Journal:  Otolaryngol Head Neck Surg       Date:  2002-03       Impact factor: 3.497

2.  One in 13 'original' articles in the Journal of Bone and Joint Surgery are duplicate or fragmented publications.

Authors:  S E Gwilym; M C Swan; H Giele
Journal:  J Bone Joint Surg Br       Date:  2004-07

3.  An overview of MetaMap: historical perspective and recent advances.

Authors:  Alan R Aronson; François-Michel Lang
Journal:  J Am Med Inform Assoc       Date:  2010 May-Jun       Impact factor: 4.497

4.  Duplicate publications: redundancy in plastic surgery literature.

Authors:  Piyush Durani
Journal:  J Plast Reconstr Aesthet Surg       Date:  2006-03-23       Impact factor: 2.740

5.  Déjà vu--a study of duplicate citations in Medline.

Authors:  Mounir Errami; Justin M Hicks; Wayne Fisher; David Trusty; Jonathan D Wren; Tara C Long; Harold R Garner
Journal:  Bioinformatics       Date:  2007-12-01       Impact factor: 6.937

6.  Text similarity: an alternative way to search MEDLINE.

Authors:  James Lewis; Stephan Ossowski; Justin Hicks; Mounir Errami; Harold R Garner
Journal:  Bioinformatics       Date:  2006-08-22       Impact factor: 6.937

7.  Identifying duplicate content using statistically improbable phrases.

Authors:  Mounir Errami; Zhaohui Sun; Angela C George; Tara C Long; Michael A Skinner; Jonathan D Wren; Harold R Garner
Journal:  Bioinformatics       Date:  2010-05-13       Impact factor: 6.937

8.  eTBLAST: a web server to identify expert reviewers, appropriate journals and similar publications.

Authors:  Mounir Errami; Jonathan D Wren; Justin M Hicks; Harold R Garner
Journal:  Nucleic Acids Res       Date:  2007-04-22       Impact factor: 16.971

  1 in total

Review 1.  No wisdom in the crowd: genome annotation in the era of big data - current status and future prospects.

Authors:  Antoine Danchin; Christos Ouzounis; Taku Tokuyasu; Jean-Daniel Zucker
Journal:  Microb Biotechnol       Date:  2018-05-28       Impact factor: 5.813

