Benchmarks for measurement of duplicate detection methods in nucleotide databases.

Qingyu Chen, Justin Zobel, Karin Verspoor.

Abstract

Duplication of information in databases is a major data quality challenge. The presence of duplicates, implying either redundancy or inconsistency, can have a range of impacts on the quality of analyses that use the data. To provide a sound basis for research on this issue in databases of nucleotide sequences, we have developed new, large-scale validated collections of duplicates, which can be used to test the effectiveness of duplicate detection methods. Previous collections were either designed primarily to test efficiency, or contained only a limited number of duplicates of limited kinds. To date, duplicate detection methods have been evaluated on separate, inconsistent benchmarks, leading to results that cannot be compared and, due to limitations of the benchmarks, of questionable generality. In this study, we present three nucleotide sequence database benchmarks, based on information drawn from a range of resources, including information derived from mapping to two data sections within the UniProt Knowledgebase (UniProtKB), UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. Each benchmark has distinct characteristics. We quantify these characteristics and argue for their complementary value in evaluation. The benchmarks collectively contain a vast number of validated biological duplicates; the largest has nearly half a billion duplicate pairs (although this is probably only a tiny fraction of the total that is present). They are also the first benchmarks targeting the primary nucleotide databases. The records include the 21 most heavily studied organisms in molecular biology research. Our quantitative analysis shows that duplicates in the different benchmarks, and in different organisms, have different characteristics. It is thus unreliable to evaluate duplicate detection methods against any single benchmark. 
For example, the benchmark derived from UniProtKB/Swiss-Prot mappings identifies more diverse types of duplicates, showing the importance of expert curation, but is limited to coding sequences. Overall, these benchmarks form a resource that we believe will be of great value for development and evaluation of the duplicate detection or record linkage methods that are required to help maintain these essential resources. Database URL: https://bitbucket.org/biodbqual/benchmarks.
© The Author(s) 2017. Published by Oxford University Press.

Year:  2017        PMID: 28334741     DOI: 10.1093/database/baw164

Source DB:  PubMed          Journal:  Database (Oxford)        ISSN: 1758-0463            Impact factor:   3.451


