
Déjà vu: a database of highly similar citations in the scientific literature.

Mounir Errami, Zhaohui Sun, Tara C Long, Angela C George, Harold R Garner.

Abstract

In the scientific research community, plagiarism and covert multiple publications of the same data are considered unacceptable because they undermine public confidence in scientific integrity. Yet little has been done to help authors and editors identify highly similar citations, which sometimes may represent cases of unethical duplication. For this reason, we have created Déjà vu, a publicly available database of highly similar Medline citations identified by the text similarity search engine eTBLAST. Following manual verification, highly similar citation pairs are classified into various categories ranging from duplicates with different authors to sanctioned duplicates. Déjà vu records also contain user-provided commentary and supporting information to substantiate each document's categorization. Déjà vu and eTBLAST are available to authors, editors, reviewers, ethicists and sociologists to study, intercept, annotate and deter questionable publication practices. These tools are part of a sustained effort to enhance the quality of Medline as 'the' biomedical corpus. The Déjà vu database is freely accessible at http://spore.swmed.edu/dejavu. The tool eTBLAST is also freely available at http://etblast.org.

Year:  2008        PMID: 18757888      PMCID: PMC2686470          DOI: 10.1093/nar/gkn546

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Authorship of scientific papers is one of the most valuable currencies for scientists and engineers, and is an asset not only for climbing the corporate or academic ladder (1) but also, most importantly, for securing funding for academic laboratories. The fierce competition in most scientific disciplines and the increasing necessity to publish may lead authors to engage in questionable behavior such as publishing a single piece of work more than once, emulating the style of another person's work, or copying its content. Duplicate publication may be useful to provide wider access to the scientific community or to report important updates to surveys or clinical trials, but publications that simply reproduce a previous work with virtually identical results and conclusions often lack the novelty to justify additional publication. The latter types of duplicate publication are considered unethical because they undermine public confidence in scientific integrity. Others have previously described additional duplicate publication behaviors referred to as ‘salami slicing’ (dissecting a scientific work into multiple least publishable units) and ‘meat extenders’ (building on a previous publication with new data that would not be publishable alone) (2–4). Most previous studies of duplicate publication have been limited to a particular scientific field where duplication was painstakingly identified manually, underscoring the need for an automated method to detect putative duplications (5–16). We have established a method to identify highly similar citations in Medline, the comprehensive literature database of life sciences and biomedical information, using the text similarity search engine eTBLAST (17,18). We were able to statistically calibrate eTBLAST to identify citations that have unusually high similarity, which were then saved in Déjà vu pending manual inspection (19,20).

CONTENT AND METHODS

Identification of highly similar citations

Technical details describing the detection of highly similar citations and its application to the entire Medline database have been reported previously (19,20). Briefly, the method that has contributed the preponderance of entries in Déjà vu involves ‘eTBLASTing’ each Medline citation against its most related article (a feature available from Medline). Upon comparison, citation pairs whose similarity exceeds predetermined thresholds are flagged as highly similar pairs and stored in Déjà vu, awaiting manual verification by human curators.
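
The flag-by-threshold step described above can be illustrated with a minimal sketch. It does not reproduce eTBLAST: a simple Jaccard word-overlap score is swapped in as a stand-in similarity measure, and the threshold value, function names and example records are all hypothetical, chosen only to show how pairs exceeding a cutoff might be queued for curation.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard_similarity(text_a, text_b):
    """Stand-in score in [0, 1]: shared words / all words. NOT the eTBLAST score."""
    a, b = tokenize(text_a), tokenize(text_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical cutoff; the real thresholds were statistically calibrated (19,20).
SIMILARITY_THRESHOLD = 0.5

def flag_highly_similar(pairs, threshold=SIMILARITY_THRESHOLD):
    """Return (id_a, id_b, score) for every pair whose score exceeds the threshold.

    `pairs` is an iterable of ((id_a, abstract_a), (id_b, abstract_b)), i.e. a
    citation together with its most related article.
    """
    flagged = []
    for (id_a, abstract_a), (id_b, abstract_b) in pairs:
        score = jaccard_similarity(abstract_a, abstract_b)
        if score > threshold:
            flagged.append((id_a, id_b, score))
    return flagged

# Made-up records for illustration only.
example_pairs = [
    (("A1", "Aspirin reduces fever in adult patients"),
     ("A2", "Aspirin reduces fever in adult patients with influenza")),
    (("B1", "Aspirin reduces fever in adult patients"),
     ("B2", "Gene expression profiling of yeast under heat stress")),
]
flagged = flag_highly_similar(example_pairs)
```

In this toy input only the first pair is flagged; the second shares no words and stays below the cutoff. Flagged pairs would then await manual verification rather than being treated as confirmed duplicates.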

Manual classification of highly similar citations

Déjà vu was designed and developed to allow for collaborative work among multiple curators. It was also necessary to define a broad, flexible and extensible classification scheme to accommodate a wide range of highly similar documents dealing with all areas of biomedical research, reflecting different publication behaviors, styles and agreements. Upon manual verification, highly similar citation pairs were classified into one or more of the categories listed and defined in Table 1. In particular, we sought to distinguish between appropriate and inappropriate duplication, a process which is admittedly subjective. A pair of duplicates with different authors may indicate potential plagiarism, while two publications with shared authors may indicate multiple publication of the same study. Updates to clinical trials or survey type research are instances where complete duplication is not necessarily inappropriate. Similarly, studies with different outcomes using similar phraseology may bring valuable new information. Errata, which may or may not be tagged as such in Medline, are most similar to the initial record, often involving only a typographical correction. All of these determinations are difficult or impossible to accomplish computationally, and thus are best made by human curators.

Table 1. Déjà vu content by category and category definitions

Duplication type | Count | Description
DISTINCT | 1379 | There are a number of reasons for different citations to have a high similarity, including citations that describe related, but very distinct publications. A pair of citations identified by computer similarity which, after inspection, is for example clearly a continuation of a study that has evolved, and whose text represents new information, is categorized as a distinct and unique work.
DUPLICATE | 2443 | A pair of citations that was identical or nearly identical. The citations report on a study with the same or very similar results and conclusions.
ERRATUM | 188 | Only a fraction of the MEDLINE records that are apparently corrections to previous entries are marked as errata. If a title/abstract pair is either labeled as errata or if it is clear that a correction has been made (author list, spelling, small changes to abstract or title wording, etc.), then the errata classification is used.
SANCTIONED | 1619 | There are a number of reasons for different citations to have a high level of similarity, some of which play a special, very important, and very legitimate role in the reporting of science. Examples include periodic reviews, periodic guidelines, specialized databases and specialized federal register citations. Citation pairs of this type, identified through computer text similarity, have been manually classified to the category sanctioned.
NO ABSTRACT | 16 | When highly similar titles are flagged as potential duplicates but the non-identity MEDLINE record does not contain an abstract, we designate that pair as ‘NO ABSTRACT’ to indicate that its status cannot be determined.
UNVERIFIED | 69 115 | Déjà vu is a database of duplicate publications, as identified using a number of different techniques, with the principal one being text similarity comparisons. Those putative duplicates identified by any of these techniques, prior to human verification and assignment to another category, are initially loaded into these categories. Since our software also inspects the author lists, they are loaded into unverified categories that have either overlapping authors (SA) or not (DA).
TOTAL | 74 760 |

Up to date statistics and definitions are available at http://spore.swmed.edu/dejavu/help and http://spore.swmed.edu/dejavu/statistics/.
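
As a rough illustration of the classification scheme in Table 1, the sketch below models the categories as an enumeration and shows how a newly detected pair might be placed in an unverified category depending on whether the author lists overlap (SA) or not (DA). The enum labels, the name normalization and the function names are assumptions made for this example; the source does not describe how the Déjà vu software compares author lists.

```python
from enum import Enum

class Category(Enum):
    """Categories from Table 1; the two UNVERIFIED labels are hypothetical spellings."""
    DISTINCT = "DISTINCT"
    DUPLICATE = "DUPLICATE"
    ERRATUM = "ERRATUM"
    SANCTIONED = "SANCTIONED"
    NO_ABSTRACT = "NO ABSTRACT"
    UNVERIFIED_SA = "UNVERIFIED (SA)"  # overlapping authors
    UNVERIFIED_DA = "UNVERIFIED (DA)"  # no overlapping authors

def normalize_author(name):
    """Crude normalization (an assumption): lowercase, drop periods, collapse spaces."""
    return " ".join(name.lower().replace(".", " ").split())

def initial_category(authors_a, authors_b):
    """Assign the pre-curation category from author overlap alone."""
    shared = ({normalize_author(a) for a in authors_a}
              & {normalize_author(b) for b in authors_b})
    return Category.UNVERIFIED_SA if shared else Category.UNVERIFIED_DA

same_authors = initial_category(["H R Garner", "M Errami"], ["H. R. Garner"])
different_authors = initial_category(["H R Garner"], ["W J Broad"])
```

Every other category in the enumeration is assigned only after a human curator has inspected the pair, which is why the function never returns one of them.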


Déjà vu in numbers

All data collected have been consolidated into a web-accessible database, available at http://spore.swmed.edu/dejavu. As of 22 July 2008, Déjà vu contains a total of 74 760 records, of which 5645 have been manually inspected (Table 1). Déjà vu has received over 40 000 visits since 1 January 2008 and currently receives an average of about 2000 visits per month.
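
The figures quoted here can be checked against the per-category counts in Table 1. The short sketch below simply redoes that arithmetic; the variable names are illustrative only.

```python
# Per-category counts as reported in Table 1 (22 July 2008).
counts = {
    "DISTINCT": 1379,
    "DUPLICATE": 2443,
    "ERRATUM": 188,
    "SANCTIONED": 1619,
    "NO ABSTRACT": 16,
    "UNVERIFIED": 69115,
}

total = sum(counts.values())              # all records in the database
inspected = total - counts["UNVERIFIED"]  # records assigned a category by a curator
fraction_inspected = inspected / total    # share of the database already curated
```

The categories sum to 74 760 records, and the non-unverified categories sum to 5645, which corresponds to roughly 7.6% of the database having been manually inspected at that date.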

QUERIES AND INTERFACE

The Déjà vu interface was built with Python (http://python.org) and the Django web framework (http://djangoproject.com). Data are stored in a backend MySQL database (http://mysql.com). Déjà vu was designed to allow real-time collaborative annotation by multiple curators, who need not be programmers to add comments and updates or create new records. On the Déjà vu website users can: (i) browse Déjà vu entries with no specific search method (each entry links to the scientific citation, along with the full text when freely available); (ii) perform generic searches within the Déjà vu content by authors, address, title word, abstract word, year and comment word; (iii) perform detailed searches by specifying search criteria specific to PMID, journal names, title words, abstract, address and year; (iv) filter and view Déjà vu results in a particular category or identified by particular authors (same or different), language, availability of full text, discovery method, etc.; (v) send comments or reports to contest a record or submit a potential duplication to be reviewed by human curators; and (vi) access statistics using different filters including category, language, country, journals, etc. For each duplicate record, a viewing window presents citations side-by-side with similarities or differences highlighted (Figure 1), providing a user-friendly interface to search, browse and facilitate rapid and rigorous interpretation of the results. Déjà vu data are also available for data mining in two formats: comma-separated values and a MySQL script to recreate the MySQL database.
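
For readers who want to mine the comma-separated export, a minimal sketch using only the Python standard library might look as follows. The column names (`pmid_a`, `pmid_b`, `category`) and the sample rows are assumptions invented for illustration; the actual layout of the Déjà vu export is not described here and should be checked against the downloaded file.

```python
import csv
import io
from collections import Counter

# Made-up sample standing in for the real export; the column names are hypothetical.
SAMPLE_CSV = """pmid_a,pmid_b,category
1001,1002,DUPLICATE
1003,1004,SANCTIONED
1005,1006,DUPLICATE
"""

def count_by_category(csv_file):
    """Tally records per category from an open CSV file object."""
    reader = csv.DictReader(csv_file)
    return Counter(row["category"] for row in reader)

category_counts = count_by_category(io.StringIO(SAMPLE_CSV))
```

With a real export, the `io.StringIO` wrapper would be replaced by `open(path, newline="")`, and the column name adjusted to whatever header the file actually uses.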
Figure 1.

The Déjà vu citation presentation output. (A) Browsing interface for database content. (B) Query box to search duplicate records by author names, title, abstract, year of publication and comment words. (C) List of records in Déjà vu including PMIDs, author names, publication date and links to Medline citations and free full text when available. (D) Category filters to browse records in a particular category. (E) Side-by-side view of a duplicate record highlighting overlapping keywords in blue. (F) Miscellaneous information for each article involved.
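
The side-by-side view in panel (E) highlights overlapping keywords. How the website selects those keywords is not described, so the following is only a sketch of one plausible approach: mark every word that occurs in both texts, ignoring very short words. The `[[...]]` markers stand in for the blue highlighting, and the `min_length` parameter and function name are hypothetical.

```python
import re

def highlight_shared_words(text_a, text_b, min_length=4):
    """Wrap words that occur in both texts in [[...]] markers (case-insensitive)."""
    words_a = {w.lower() for w in re.findall(r"\w+", text_a)}
    words_b = {w.lower() for w in re.findall(r"\w+", text_b)}
    # Keep only shared words long enough to be informative (an assumption).
    shared = {w for w in words_a & words_b if len(w) >= min_length}

    def mark(text):
        def replace(match):
            word = match.group(0)
            return f"[[{word}]]" if word.lower() in shared else word
        return re.sub(r"\w+", replace, text)

    return mark(text_a), mark(text_b)

left, right = highlight_shared_words(
    "Aspirin lowers fever in adults",
    "Fever in children treated with aspirin",
)
```

Here `left` becomes `[[Aspirin]] lowers [[fever]] in adults` and `right` becomes `[[Fever]] in children treated with [[aspirin]]`; the shared word "in" is left unmarked because it falls below the length cutoff.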


CONCLUSION AND FUTURE DIRECTIONS

The Déjà vu database is the first of its kind to publicly present cases of highly similar citations in Medline. In addition to presenting the list of highly similar citations, a goal of Déjà vu is to help scientists study in depth the behaviors of authors and the characteristics underlying multiple publications and related ethics issues surrounding the process of scientific publication. A friendly interface provides users with various browsing options along with a graphical representation of the overlapping information between citations. Ultimately, Déjà vu may act as a deterrent to the unethical practice of duplication. Further work, currently in progress, that will substantially improve Déjà vu includes: (i) A streamlined process to update Déjà vu on a daily basis. (ii) A more collaborative approach for recruitment and qualification of topical experts as volunteer curators for specific publication areas. (iii) New methods to better address the question most often asked by authors introduced to Déjà vu, ‘Am I in it, or has my work been duplicated?’ Authors can now check if their work has been duplicated by submitting their abstracts one by one directly to eTBLAST, which then flags highly similar citations for the authors to pursue. Utilities are being developed to allow authors to scan their entire bibliography at once (retrieved using Medline Entrez keyword queries) to obtain a list of highly similar citations for each citation entered. Authors will also be able to automatically submit suspicious highly similar citations found by this process directly to Déjà vu curators. (iv) Coverage of additional sources. Currently, the duplications in Déjà vu are all obtained from Medline citations. Other literature databases will be added as they are scanned by eTBLAST, including the Institute of Physics, NASA and NIH CRISP.

FUNDING

P.O’B. Montgomery Distinguished Chair (to H.G.); the Hudson Foundation (to H.G.); National Institutes of Health/National Library of Medicine grant (R01 LM009758-01 to H.R.G.). Funding for open access charge: P.O’B. Montgomery Distinguished Chair.

REFERENCES

19 in total

1.  [Duplicate publication of articles in the Dutch Journal of Medicine in 1996].

Authors:  D G Bloemenkamp; H C Walvoort; W Hart; A J Overbeke
Journal:  Ned Tijdschr Geneeskd       Date:  1999-10-23

2.  Duplicate publication in the field of otolaryngology-head and neck surgery.

Authors:  Byron J Bailey
Journal:  Otolaryngol Head Neck Surg       Date:  2002-03       Impact factor: 3.497

3.  Redundant surgical publications: tip of the iceberg?

Authors:  M Schein; R Paladugu
Journal:  Surgery       Date:  2001-06       Impact factor: 3.982

4.  Redundant publications in scientific ophthalmologic journals: the tip of the iceberg?

Authors:  Stefania M Mojon-Azzi; Xiaoyi Jiang; Ulrich Wagner; Daniel S Mojon
Journal:  Ophthalmology       Date:  2004-05       Impact factor: 12.079

5.  Different patterns of duplicate publication: an analysis of articles used in systematic reviews.

Authors:  Erik von Elm; Greta Poglia; Bernhard Walder; Martin R Tramèr
Journal:  JAMA       Date:  2004-02-25       Impact factor: 56.272

6.  Multiple publication of reports of drug trials.

Authors:  P C Gøtzsche
Journal:  Eur J Clin Pharmacol       Date:  1989       Impact factor: 2.953

7.  Duplicate publications in the otolaryngology literature.

Authors:  Eben L Rosenthal; Jimmy Lee Masdon; Christy Buckman; Mary Hawn
Journal:  Laryngoscope       Date:  2003-05       Impact factor: 3.325

8.  Irresponsible authorship and wasteful publication.

Authors:  E J Huth
Journal:  Ann Intern Med       Date:  1986-02       Impact factor: 25.391

9.  [Duplicate publication of original manuscripts in and from the Nederlands Tijdschrift voor Geneeskunde].

Authors:  H Barnard; A J Overbeke
Journal:  Ned Tijdschr Geneeskd       Date:  1993-03-20

10.  The publishing game: getting more for less.

Authors:  W J Broad
Journal:  Science       Date:  1981-03-13       Impact factor: 47.728

