John Rathbone, Matt Carter, Tammy Hoffmann, Paul Glasziou.
Abstract
BACKGROUND: A major problem arising from searching across bibliographic databases is the retrieval of duplicate citations. Removing such duplicates is an essential task to ensure systematic reviewers do not waste time screening the same citation multiple times. Although reference management software uses algorithms to remove duplicate records, this is only partially successful, and the remaining duplicates must be removed manually. This time-consuming task leads to wasted resources. We sought to evaluate the effectiveness of a newly developed deduplication program against EndNote.
Year: 2015 PMID: 25588387 PMCID: PMC4320616 DOI: 10.1186/2046-4053-4-6
Source DB: PubMed Journal: Syst Rev ISSN: 2046-4053
SRA-DM algorithm changes
| Iterations | Changes to algorithms |
|---|---|
| First iteration | Matching criteria were based on simple field comparison (ignoring punctuation), with checks against the year field. Because the year is restricted to the digits 0–9, it is less prone to errors than other fields and was therefore the most reliable field to check against. |
| Second iteration | Short-format page numbers were converted to full format (e.g. 221–6 became 221–226), and the algorithm was further modified to increase sensitivity by matching on authors OR title. |
| Third iteration | Matching required author AND title, and the non-reference fields used as a check were extended from ‘year’ alone to year OR volume OR edition. |
| Fourth iteration | The fourth algorithm extended the matching criteria of the third algorithm with an improved name-matching system. This system used fuzzy logic to accommodate author name variations, i.e. initialisation, punctuation and rearranged author listings. For example, the following names are all syntactically equivalent and will match as identical authors: |
| | 1. William Shakespeare |
| | 2. W. Shakespeare |
| | 3. W Shakespeare |
| | 4. William John Shakespeare |
| | 5. William J. Shakespeare |
| | 6. W. J. Shakespeare |
| | 7. W J Shakespeare |
| | 8. Shakespeare, William |
| | 9. Shakespeare, W |
| | 10. Shakespeare, W, A |
| | 11. Shakespeare, W, A, B, C |
| | 12. William Shakespeare 1st |
| | 13. William Shakespeare 2nd |
| | 14. William Shakespeare IV |
| | 15. William Adam Bob Charles Shakespeare XVI |
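
The rules in the table above can be illustrated with a minimal Python sketch. This is not the SRA-DM source code: the function names (`expand_pages`, `author_key`, `is_duplicate`), the record layout (a dict with `authors`, `title`, `year`, `volume`, `edition`) and the suffix list are assumptions made for illustration. Simple rule-based normalisation stands in for the fuzzy logic the table describes. Each author is reduced to a surname plus first initial, which is enough to make the fifteen example names compare as equal.

```python
import re

# Assumed suffix pattern: ordinals (1st, 2nd) and Roman numerals of two or more letters.
_SUFFIX = re.compile(r"^(\d+(st|nd|rd|th)|[ivxlcdm]{2,})$", re.IGNORECASE)


def expand_pages(pages: str) -> str:
    """Expand a short-format page range such as '221-6' to '221-226'."""
    m = re.fullmatch(r"\s*(\d+)\s*[-–]\s*(\d+)\s*", pages)
    if not m:
        return pages.strip()
    start, end = m.group(1), m.group(2)
    if len(end) < len(start):
        # Borrow the missing leading digits from the start page.
        end = start[: len(start) - len(end)] + end
    return f"{start}-{end}"


def author_key(name: str) -> tuple:
    """Reduce an author name to (surname, first initial), both lowercased."""
    name = name.replace(".", " ").strip()
    if "," in name:
        # 'Surname, Given' form: everything after the first comma is given names.
        surname, _, rest = name.partition(",")
        given = [g for g in re.split(r"[,\s]+", rest.strip()) if g]
    else:
        tokens = name.split()
        # Drop trailing generational suffixes, but always keep at least two tokens.
        while len(tokens) > 2 and _SUFFIX.match(tokens[-1]):
            tokens.pop()
        surname, given = tokens[-1], tokens[:-1]
    initial = given[0][0].lower() if given else ""
    return surname.strip().lower(), initial


def _norm(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())


def is_duplicate(a: dict, b: dict) -> bool:
    """Author AND title must match, plus at least one of year, volume or edition."""
    # Comparing sets of keys ignores the order in which authors are listed.
    authors_match = {author_key(x) for x in a["authors"]} == {
        author_key(x) for x in b["authors"]
    }
    title_match = _norm(a["title"]) == _norm(b["title"])
    check_match = any(
        a.get(f) and a.get(f) == b.get(f) for f in ("year", "volume", "edition")
    )
    return authors_match and title_match and check_match
```

A key of surname plus first initial is deliberately permissive, so distinct authors who share both would collide. That is why the sketch also requires the title and one further field to agree before declaring a duplicate. The suffix rule is equally crude and would misread short surnames that happen to look like Roman numerals.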
Databases searched for retrieval of citations for validation testing
| Datasets | Databases searched |
|---|---|
| | 1. Cochrane Controlled Trials Register (CCTR); 2. Cochrane Database of Systematic Reviews (CDSR); 3. EMBASE; 4. MEDLINE; 5. National Research Register (NRR); 6. Database of Assessments of Reviews of Effectiveness (DARE); 7. NHS Health Technology Assessment (HTA); 8. PreMEDLINE; 9. Science Citation Index; 10. Social Sciences Citation Index |
| | 1. MEDLINE; 2. EMBASE; 3. MEDLINE In-Process; 4. Biological Abstracts; 5. NHS Health Technology Assessment (HTA); 6. Cochrane Controlled Trials Register (CCTR); 7. Cochrane Database of Systematic Reviews (CDSR); 8. CINAHL; 9. Science Citation Index; 10. Social Sciences Citation Index |
| | 1. MEDLINE; 2. EMBASE; 3. CENTRAL; 4. CINAHL; 5. PsycInfo |
Sensitivity† and specificity‡ of SRA-DM prototype algorithms and EndNote auto-deduplication (in a dataset of 1,988 citations, including 799 duplicates)
| Respiratory study | First iteration SRA-DM | Second iteration SRA-DM | Third iteration SRA-DM | Fourth iteration SRA-DM | EndNote |
|---|---|---|---|---|---|
| True positive (correctly identified duplicates) | 600 | 765 | 543 | 674 | 410 |
| False negative (duplicates missed) | 199 | 34 | 256 | 125 | 391 |
| Sensitivity (%) | | | | | |
| True negative (correctly identified unique records) | 1,188 | 1,186 | 1,189 | 1,189 | 1,185 |
| False positive (incorrectly identified duplicates) | 1 | 3 | 0 | 0 | 2 |
| Specificity (%) | | | | | |
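
The sensitivity and specificity rows can be recomputed from the counts in the table. The sketch below assumes the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP), since the footnotes marked † and ‡ are not reproduced here. The function names and the `counts` dictionary are illustrative only.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Percentage of true duplicates that were detected."""
    return 100.0 * tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Percentage of unique records that were correctly left alone."""
    return 100.0 * tn / (tn + fp)


# Counts copied from the respiratory study table above.
counts = {
    "First iteration SRA-DM": {"tp": 600, "fn": 199, "tn": 1188, "fp": 1},
    "Second iteration SRA-DM": {"tp": 765, "fn": 34, "tn": 1186, "fp": 3},
    "Third iteration SRA-DM": {"tp": 543, "fn": 256, "tn": 1189, "fp": 0},
    "Fourth iteration SRA-DM": {"tp": 674, "fn": 125, "tn": 1189, "fp": 0},
    "EndNote": {"tp": 410, "fn": 391, "tn": 1185, "fp": 2},
}

for name, c in counts.items():
    sens = sensitivity(c["tp"], c["fn"])
    spec = specificity(c["tn"], c["fp"])
    print(f"{name}: sensitivity {sens:.1f}%, specificity {spec:.2f}%")
```

The same two functions apply unchanged to the validation datasets, whose counts use the same four categories.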
Sensitivity† and specificity‡ of SRA-DM and EndNote auto-deduplication (validation testing)
| | Cytology screening: SRA-DM | Cytology screening: EndNote | Stroke: SRA-DM | Stroke: EndNote | Haematology: SRA-DM | Haematology: EndNote |
|---|---|---|---|---|---|---|
| True positive (correctly identified duplicates) | 1,265 | 885 | 426 | 372 | 208 | 159 |
| False negative (duplicates missed) | 139 | 518 | 81 | 134 | 38 | 87 |
| Sensitivity (%) | | | | | | |
| True negative (correctly identified unique records) | 452 | 452 | 785 | 784 | 1,169 | 1,165 |
| False positive (incorrectly identified duplicates) | 0 | 1 | 0 | 2 | 0 | 4 |
| Specificity (%) | | | | | | |