Zhaohui Sun, Mounir Errami, Tara Long, Chris Renard, Nishant Choradia, Harold Garner.
Abstract
BACKGROUND: Computational methods have been used to find duplicate biomedical publications in MEDLINE. Full text articles are becoming increasingly available, yet the similarities among them have not been systematically studied. Here, we quantitatively investigated the full text similarity of biomedical publications in PubMed Central. METHODOLOGY/PRINCIPAL
Year: 2010 PMID: 20856807 PMCID: PMC2939881 DOI: 10.1371/journal.pone.0012704
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Linear regression of abstract similarity vs. full text similarity.
The linear regression of full text similarity ratio versus abstract similarity ratio was performed on the citation pairs whose abstract and full text similarity ratios were both higher than 0.4. The figure indicates a modest correlation between abstract similarity and full text similarity for citation pairs in this similarity ratio range.
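The filtering and fitting step described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `fit_line` and `regress_above` and the sample pairs are assumptions, the sample values are made up, and ordinary least squares is assumed as the regression method.

```python
# Illustrative sketch only: names and sample data are hypothetical.
# Each pair is (abstract_similarity_ratio, full_text_similarity_ratio).

def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def regress_above(pairs, cutoff=0.4):
    """Keep pairs whose two ratios both exceed the cutoff, then fit a line.

    x is the abstract similarity ratio, y the full text similarity ratio.
    """
    kept = [(a, f) for a, f in pairs if a > cutoff and f > cutoff]
    return fit_line(kept), len(kept)

# Made-up values; the last pair is dropped because its abstract ratio is below 0.4.
sample_pairs = [(0.5, 0.5), (0.7, 0.6), (0.9, 0.7), (0.2, 0.9)]
(slope, intercept), n_kept = regress_above(sample_pairs)
print(f"kept {n_kept} pairs, slope={slope:.3f}, intercept={intercept:.3f}")
```

The 0.4 cutoff is the one stated in the caption; whether the comparison is strict or inclusive at exactly 0.4 is an assumption here.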
Figure 2. Distribution of full text similarity ratio for citation pairs with and without similar abstracts.
A similarity ratio threshold of 0.5 was used to classify the abstracts as either similar or dissimilar. The figure shows that high abstract similarity is a predictor of higher full text similarity.
Figure 3. Distribution of abstract similarity ratio for citation pairs with and without full text similarity.
A similarity ratio threshold of 0.5 was used to classify the full text as either similar or dissimilar. Mirroring the trend shown in Figure 2, citation pairs with significant full text similarity have a correspondingly high probability of very high abstract similarity.
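The threshold-based grouping behind Figures 2 and 3 can be sketched as follows. This is a hedged illustration under stated assumptions: the helper `group_by_threshold`, the inclusive `>=` comparison, and the sample pairs are all hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: helper name, comparison, and data are assumptions.
THRESHOLD = 0.5  # similarity ratio threshold stated in the figure captions

def group_by_threshold(pairs, key_index, threshold=THRESHOLD):
    """Split pairs into (similar, dissimilar) groups by one ratio.

    Each pair is (abstract_ratio, full_text_ratio). The ratio at key_index
    decides the group; the other ratio is the value collected, so its
    distribution can be compared between the two groups.
    """
    other = 1 - key_index
    similar, dissimilar = [], []
    for pair in pairs:
        target = similar if pair[key_index] >= threshold else dissimilar
        target.append(pair[other])
    return similar, dissimilar

# Made-up (abstract_ratio, full_text_ratio) values for demonstration.
pairs = [(0.92, 0.81), (0.75, 0.40), (0.30, 0.12), (0.10, 0.55), (0.05, 0.03)]

# Figure 2 style: full text ratios, grouped by abstract similarity.
full_text_by_abstract = group_by_threshold(pairs, key_index=0)
# Figure 3 style: abstract ratios, grouped by full text similarity.
abstract_by_full_text = group_by_threshold(pairs, key_index=1)
print(full_text_by_abstract)
print(abstract_by_full_text)
```

Histogramming each returned group separately would give the paired distributions that the two captions describe.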
Text similarity within different sections of articles.
| | Introduction | Methods | Results |
|---|---|---|---|
| Number of documents | 61149 | 50360 | 135062 |
| Frequency of similar pairs (SA) | 222 (0.0036) | 605 (0.012) | 220 (0.0016) |
| Frequency of similar pairs (DA) | 96 (0.0016) | 330 (0.0066) | 213 (0.0016) |
| Frequency of similar pairs (total) | 318 (0.0052) | 935 (0.019) | 433 (0.0032) |
| Odds of similar pair having shared authors | 2.31 | 1.83 | 1.03 |
Duplication of methods or introduction sections is more likely committed by the same authors than duplication of results sections.
Abbreviations: SA, sharing at least one author; DA, no shared authors.
Frequencies of similar pairs are expressed as number of similar pairs (relative frequency of similar pairs).
Odds are calculated as frequency of similar pairs (SA)/frequency of similar pairs (DA).
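The arithmetic in the table can be checked with a short sketch. The counts are copied from the table above; the dictionary layout and the function name `summarize` are illustrative assumptions. Relative frequency is taken to be the number of similar pairs divided by the number of documents, which is consistent with the tabulated values.

```python
# Counts copied from the table; the data layout and names are illustrative.
sections = {
    "Introduction": {"documents": 61149, "sa_pairs": 222, "da_pairs": 96},
    "Methods": {"documents": 50360, "sa_pairs": 605, "da_pairs": 330},
    "Results": {"documents": 135062, "sa_pairs": 220, "da_pairs": 213},
}

def summarize(row):
    """Return relative frequencies and the SA/DA odds for one section."""
    n = row["documents"]
    sa_freq = row["sa_pairs"] / n  # pairs sharing at least one author
    da_freq = row["da_pairs"] / n  # pairs with no shared authors
    return {
        "sa_freq": sa_freq,
        "da_freq": da_freq,
        "total_freq": (row["sa_pairs"] + row["da_pairs"]) / n,
        "odds_shared_authors": sa_freq / da_freq,
    }

for name, row in sections.items():
    s = summarize(row)
    print(
        f"{name}: SA={s['sa_freq']:.4f} DA={s['da_freq']:.4f} "
        f"total={s['total_freq']:.4f} odds={s['odds_shared_authors']:.2f}"
    )
```

Because both frequencies in a section share the same denominator, the odds reduce to the ratio of the raw pair counts (for example, 222/96 for introductions).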
Figure 4. Frequency of similar pairs within different sections in PMC citations and duplicate citations.
Whereas similarity in methods sections is generally more common than in other sections, similarity among results sections is the best indicator of a duplicate publication.