George Papadatos, Anna Gaulton, Anne Hersey, John P Overington.
Abstract
The emergence of a number of publicly available bioactivity databases, such as ChEMBL, PubChem BioAssay and BindingDB, has raised awareness about the topics of data curation, quality and integrity. Here we provide an overview and discussion of the current and future approaches to activity, assay and target data curation of the ChEMBL database. This curation process involves several manual and automated steps and aims to: (1) maximise data accessibility and comparability; (2) improve data integrity and flag outliers, ambiguities and potential errors; and (3) add further curated annotations and mappings, thus increasing the usefulness and accuracy of the ChEMBL data for all users, and for modellers in particular. Issues related to activity, assay and target data curation and integrity, along with their potential impact on users of the data, are discussed, alongside robust selection and filter strategies to avoid or minimise these issues, depending on the desired application.
Keywords: Data curation; Data quality; Public bioactivity databases
Year: 2015 PMID: 26201396 PMCID: PMC4607714 DOI: 10.1007/s10822-015-9860-5
Source DB: PubMed Journal: J Comput Aided Mol Des ISSN: 0920-654X Impact factor: 3.686
Fig. 3 A subset of the target information section of the ChEMBL 20 database schema
Sources of errors and ambiguities associated with bioactivity databases
| Error source | Examples | References |
|---|---|---|
| Experimental | Compound purity and stability. Errors in compound vendor catalogues. Errors in cell-line identity | [ |
| Data extraction | Missing stereochemistry or functional group. Incorrect or incomplete target assignment | [ |
| Author of publication | Insufficient assay description. Citation of previously reported activity values. Wrong activity type and units. Incorrect data processing | [ |
| Database user | Merging activities from different assays. Dealing with censored data points, tautomers, prodrugs, salts and duplicates | [ |
Fig. 1 The current in-house compound, activity, assay and target curation workflow in ChEMBL production. The steps involved in the activity, assay and target curation branches are discussed in the following sections, along with suggestions on how users and modellers can use them to improve data integrity and to minimise or avoid ambiguity
Fig. 2 The experimental data section of the ChEMBL 20 database schema, showing the columns of the ACTIVITIES and ASSAYS tables
Number of distinct published activity units (a) and activity types (b) mapped to standard, normalised units and types, respectively, after the standardisation step
| Number of distinct published activity units | STANDARD_Unit |
|---|---|
| 133 | nM |
| 83 | ng × h × mL⁻¹ |
| 56 | μg × mL⁻¹ |
| 36 | μM × h |
| 28 | mL × min⁻¹ × kg⁻¹ |
| 20 | mL × min⁻¹ × g⁻¹ |
| 17 | mg × kg⁻¹ |
| 16 | μmol × g⁻¹ |
| 15 | h |
| 10 | L × kg⁻¹ |
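The mapping of many published unit spellings onto one standard unit can be pictured as a lookup of conversion factors. The sketch below is illustrative only: the unit strings, the `TO_NM` table and the `to_nanomolar` helper are hypothetical and are not taken from the ChEMBL curation code. It covers only molar concentrations.

```python
# Hypothetical lookup: published molar concentration unit -> factor to nM.
# The spellings listed here are examples, not the actual list used by ChEMBL.
TO_NM = {
    "M": 1e9,
    "mM": 1e6,
    "uM": 1e3,
    "μM": 1e3,
    "nM": 1.0,
    "pM": 1e-3,
}


def to_nanomolar(value, unit):
    """Convert a published value to nM.

    Returns a (value, "nM") tuple, or None when the unit is not in the
    lookup, so that unrecognised units can be flagged rather than guessed.
    """
    factor = TO_NM.get(unit.strip())
    if factor is None:
        return None
    return value * factor, "nM"
```

Returning `None` for an unknown unit, instead of raising or passing the value through, keeps unmapped records visible for later review.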
The activity records curation workflow, with the count and percentage of affected records in ChEMBL 20

| Order | Step | Data validity comment | Number (%) of affected records |
|---|---|---|---|
| 1 | Flag missing activities | ‘Potential missing data’ | 12,263 (0.09 %) |
| 2 | Flag non-standard units for activity type | ‘Non standard unit for type’ | 81,060 (0.6 %) |
| 3 | Convert log activity values | N/A | 2.6 × 10⁶ (20.3 %) |
| 4 | Flag out of range values | ‘Outside typical range’ | 187,108 (1.7 %) |
| 5 | Flag potential duplicate values | N/A | 64,860 (0.48 %) |
| 6 | Flag potential transcription errors | ‘Potential transcription error’ | 382 (0.003 %) |
| 7 | Calculate standard negative log values | N/A | 2.8 × 10⁶ (20.7 %) |
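Steps 3 and 7 rest on the relationship between a molar concentration and its negative base-10 logarithm: pX = −log10(X in M) = 9 − log10(X in nM). The sketch below shows that conversion in both directions, plus a simple filter that drops flagged records. The function names are hypothetical, and the record keys `data_validity_comment` and `potential_duplicate` are assumptions made for illustration; check them against the schema of the release you use.

```python
import math


def nm_to_neg_log(value_nm):
    """Negative log10 of a molar concentration given in nM."""
    if value_nm <= 0:
        raise ValueError("concentration must be positive")
    return 9.0 - math.log10(value_nm)


def neg_log_to_nm(p_value):
    """Inverse conversion: negative log10 (molar) back to nM."""
    return 10.0 ** (9.0 - p_value)


def keep_record(record):
    """Keep a record only if it has no validity comment and no duplicate flag.

    `record` is a plain dict; the two keys are assumed field names.
    """
    has_comment = bool(record.get("data_validity_comment"))
    is_duplicate = bool(record.get("potential_duplicate"))
    return not has_comment and not is_duplicate
```

For example, 1000 nM (1 μM) corresponds to a negative log value of 6. Whether to drop flagged records outright or inspect them individually depends on the application; this filter is the strictest option.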