Mohamed Reda Bouadjenek, Karin Verspoor, Justin Zobel.
Abstract
Bioinformatics sequence databases such as GenBank or UniProt contain hundreds of millions of records of genomic data. These records are derived from direct submissions from individual laboratories, as well as from bulk submissions from large-scale sequencing centres; their diversity and scale mean that they suffer from a range of data quality issues, including errors, discrepancies, redundancies, ambiguities, incompleteness and inconsistencies with the published literature. In this work, we investigate and analyze the data quality of sequence databases from the perspective of a curator, who must detect anomalous and suspicious records. Specifically, we emphasize the detection of records that are inconsistent with the literature. Focusing on GenBank, we propose a set of 24 quality indicators, which are based on treating a record as a query into the published literature and then applying query quality predictors. We then carry out an analysis showing that the proposed quality indicators and the quality of the records have a mutual relationship, in which one depends on the other. We propose to represent record-literature consistency as a vector of these quality indicators. By reducing the dimensionality of this representation for visualization purposes using principal component analysis, we show that records which have been reported as inconsistent with the literature fall roughly in the same area, and therefore share similar characteristics. By manually analyzing records not previously known to be erroneous that fall in the same area as records known to be inconsistent, we show that one record out of four is inconsistent with respect to the literature. This high density of inconsistent records opens the way towards the development of automatic methods for the detection of faulty records. We conclude that literature inconsistency is a meaningful strategy for identifying suspicious records.
Database URL: https://github.com/rbouadjenek/DQBioinformatics
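The idea of treating a record as a query and summarizing it as a vector of indicators can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `tokenize` and `quality_indicators` are hypothetical, the tokenizer is a plain whitespace split, and the four indicators shown (overlap, query scope, summed TF, summed TF-IDF) use simplified textbook-style formulas that may differ from the definitions behind the 24 indicators in the paper.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an illustrative simplification)."""
    return text.lower().split()


def quality_indicators(record_text, corpus):
    """Treat record_text as a query against corpus (a list of article texts)
    and return a few simplified query quality predictors."""
    docs = [Counter(tokenize(doc)) for doc in corpus]
    n_docs = len(docs)
    terms = set(tokenize(record_text))
    if not terms or n_docs == 0:
        return {"overlap": 0.0, "query_scope": 0.0, "sum_tf": 0, "sum_tfidf": 0.0}

    # Document frequency of every query term.
    df = {t: sum(1 for d in docs if t in d) for t in terms}

    # Share of query terms that occur anywhere in the literature.
    overlap = sum(1 for t in terms if df[t] > 0) / len(terms)

    # Query scope: -log of the fraction of documents matching any query term.
    matched = sum(1 for d in docs if any(t in d for t in terms))
    query_scope = -math.log(matched / n_docs) if matched else float("inf")

    # Summed term frequency and summed TF-IDF over the whole collection.
    sum_tf = sum(d[t] for d in docs for t in terms)
    sum_tfidf = sum(
        d[t] * math.log(n_docs / df[t])
        for d in docs
        for t in terms
        if df[t] > 0
    )
    return {
        "overlap": overlap,
        "query_scope": query_scope,
        "sum_tf": sum_tf,
        "sum_tfidf": sum_tfidf,
    }
```

Calling `quality_indicators(record_text, corpus)` for each record yields one row of a feature matrix, which is the kind of vector representation the abstract describes.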
Year: 2017 PMID: 28365737 PMCID: PMC5467556 DOI: 10.1093/database/bax021
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Box-plot of the distribution of record ages.
Figure 4. Curation task. (a) Zones curated. (b) Records curated. (c) Results of the curation task.
Dataset statistics
| # Articles | Citations of records: Avg | Citations of records: Median | Citations of records: Max | Citations of records: Max entity |
|---|---|---|---|---|
| 1 135 611 | 0.5114 | 1 | 8062 | PMC2848993 |
Figure 2. Grouped and split violin-plots of the features for both alive and dead records. (a) Overlap distribution. (b) Retrieval Performance Score distribution. (c) Query scope distribution. (d) Sum TF distribution. (e) Sum TFIDF distribution. (f) LMDirichlet Score distribution. (g) SCQ Score distribution. (h) Clarity Score distribution.
Figure 3. Dataset plotted in 2D, using PCA for dimensionality reduction with 46.41% of variance retained in the first two components. (a) Records of the group Rm4, which have been removed for inconsistency reasons. (b) Records of groups Rm1, Rm2, Rm3 and Rm5, which have been removed for other reasons.
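The projection described in the Figure 3 caption, together with the share of variance retained in the first two components, can be sketched as below. This is only an illustration: the source does not say how the PCA was computed, so the example swaps in power iteration with deflation on the covariance matrix to stay dependency-free, and the helper names (`center`, `covariance`, `top_eigen`, `pca_2d`) are hypothetical. A numerical library would normally be used in practice.

```python
import math
import random


def center(rows):
    """Subtract the column means from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [
        [sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def top_eigen(matrix, iters=500, seed=0):
    """Dominant eigenpair of a symmetric PSD matrix by power iteration."""
    rng = random.Random(seed)
    d = len(matrix)
    v = [rng.random() + 0.1 for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # zero matrix: keep the current unit vector
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue for the unit vector v.
    value = sum(
        v[i] * sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    return value, v


def pca_2d(rows):
    """Return 2D coordinates and the fraction of variance retained."""
    x = center(rows)
    cov = covariance(x)
    d = len(cov)
    total_var = sum(cov[i][i] for i in range(d))
    values, components = [], []
    for _ in range(2):
        val, vec = top_eigen(cov)
        values.append(val)
        components.append(vec)
        # Deflation: remove the component just found from the matrix.
        cov = [
            [cov[i][j] - val * vec[i] * vec[j] for j in range(d)]
            for i in range(d)
        ]
    coords = [[sum(r[j] * c[j] for j in range(d)) for c in components] for r in x]
    retained = sum(values) / total_var if total_var else 0.0
    return coords, retained
```

Passing one row of quality indicators per record to `pca_2d` returns the plotting coordinates and a retained-variance ratio comparable in meaning to the 46.41% quoted in the caption.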
Details of the results of the curation task
| Records | Alive, consistent: Total | Alive, consistent: Rate (%) | Alive, inconsistent: Total | Alive, inconsistent: Rate (%) | Dead, consistent: Total | Dead, consistent: Rate (%) | Dead, inconsistent: Total | Dead, inconsistent: Rate (%) | Total |
|---|---|---|---|---|---|---|---|---|---|
| Zone A | 98 | 98 | 2 | 2 | n/a | n/a | n/a | n/a | 100 |
| Zone B | 62 | 74.69 | 21 | 25.30 | 9 | 52.94 | 8 | 47.57 | 100 |
| Zone C | 100 | 100 | 0 | 0 | n/a | n/a | n/a | n/a | 100 |
| Random | 93 | 93 | 7 | 7 | n/a | n/a | n/a | n/a | 100 |
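The rates in this table appear to be shares within each status group (alive or dead) rather than shares of all 100 curated records. A small sketch of that arithmetic, assuming the denominator is the consistent plus inconsistent count of the group, is given below; the function name `rates` is hypothetical. Under this assumption the Zone B alive counts (62 and 21) give roughly 74.7% and 25.3%, which matches the "one record out of four" figure in the abstract, while the Zone B dead counts (9 and 8) give about 52.94% and 47.06%, so the printed 47.57 may rest on a different denominator or rounding.

```python
def rates(consistent, inconsistent):
    """Return (consistent %, inconsistent %) within one status group,
    or None when the group is empty."""
    total = consistent + inconsistent
    if total == 0:
        return None
    return (
        round(100 * consistent / total, 2),
        round(100 * inconsistent / total, 2),
    )


# Zone B counts as read from the table above.
zone_b_alive = rates(62, 21)
zone_b_dead = rates(9, 8)
```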