Xin Fu, Anna Wojak, Daniel Neagu, Mick Ridley, Kim Travis.
Abstract
BACKGROUND: Due to recent advances in data storage and sharing for further data processing in predictive toxicology, there is an increasing need for flexible data representations, secure and consistent data curation, and automated data quality checking. Toxicity prediction involves multidisciplinary data. There are hundreds of collections of chemical, biological and toxicological data that are widely dispersed, mostly across the open literature, professional research bodies and commercial companies. To better manage and make full use of such a large amount of toxicity data, there is a trend towards developing data governance functionality in predictive toxicology, formalising a set of processes that guarantee high data quality and better data management. In this paper, data quality is meant mainly in a data storage sense (e.g. accuracy, completeness and integrity) and not in a toxicological sense (e.g. the quality of experimental results).
Year: 2011 PMID: 21752279 PMCID: PMC3584675 DOI: 10.1186/1758-2946-3-24
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Figure 1. Decision domains for data governance.
Figure 2. Data domains for predictive toxicology.
Discussion of public data sources in terms of data governance
| Database | Data Accuracy | Data Completeness | Data Integrity | Metadata | Data Availability | Data Authorisation |
|---|---|---|---|---|---|---|
| ChemSpider | manual and automated data curation, crowd-sourcing community supported | claimed to be the richest source of structure-based chemistry | an aggregator of nearly 400 different data sources | includes data source metadata, provenance metadata (e.g. owner and time-stamp of data creation, curation and update) and user metadata | can be accessed by web GUI and web services via PC and mobile devices; query results can be downloaded as a set | publicly available, no multi-level user access supported |
| CEBS | manual data curation, collaboration between data depositors and internal curation staff | contains 132 chemicals and their response in 34 detailed studies | permits users to integrate various data types and studies; the database schema is well designed and supported by controlled vocabularies | includes domain-specific metadata (e.g. owner, study details) and provenance metadata (e.g. time-stamp of the study start, curation updates) | allows users to retrieve and combine customised information and export it to various formats for downloading; able to support up to 100 concurrent users | publicly available, but also provides a private data access mode to protect sensitive user data |
| CTD | manual data curation, supported by the scientific community | includes 1.4 million chemical-gene-disease data connections and has been widely recognised | employs community-accepted vocabularies and ontologies to capture data and is integrated with external resources | includes domain-specific and provenance metadata; supporting literature sources are recorded | access to the entire database (downloadable as a dump file) and to individual data sources; query results can be customised and exported to different formats | publicly available, no multi-level user access provided |
| DSSTox | manual and automated data curation, quality assurance log files recorded; but no curation for external data | contains over 8000 chemicals and has been incorporated into several external sources | integrates molecular structures and toxicity data into standardised DSSTox SDF | includes domain-specific and provenance metadata | data sources and associated documents can be downloaded individually, and the included data is searchable via many options | publicly available, no user registration is required |
| ToxCast | manual data curation, includes internal and external review | covers various chemical classes and diverse mechanisms of action; 320 chemicals have been collected in Phase I and 1000 more are currently being screened | well integrated into many other EPA databases | only domain-specific metadata is included, no record of provenance metadata (e.g. curator and time-stamp) | data sources are available for download individually on the ToxCast website; included data can be browsed and queried via the ToxCast DB web GUI | publicly available, no user registration is required |
| ACToR | does not provide any data curation itself; data quality depends entirely on the original data sources | contains over 500,000 chemicals and associated toxicity data from nearly 500 sources | is itself a chemical toxicity data aggregator and employs a clear and flexible database schema for data integration | data source metadata and domain-specific information are well recorded, no record of provenance metadata | open-source implementation and the entire database can be downloaded | publicly available, no user registration is required |
| OpenTox | allows for automated, manual and global data quality validations | contains different categories of public data sources that support predictive toxicology | employs ontologies to support efficient integration of data coming from different sources into a unifying structure | data source metadata and domain-specific information are well recorded, no record of provenance metadata | provides APIs and RESTful web services for included data, algorithms, models, ontologies and reports | publicly available, but employs OpenSSO for an initial implementation of multi-level user access |
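
The comparison above can also be read as a small dataset and queried by governance criterion. The following is a minimal sketch in Python, not part of the original paper. The class `GovernanceProfile`, its field names and the `select` helper are hypothetical, and the boolean values are one reading of the table: CEBS's private data access mode is treated as a form of access control, and "no user registration is required" is treated as no multi-level access.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GovernanceProfile:
    """Hypothetical record summarising three columns of the table above."""
    name: str
    curation: frozenset          # subset of {"manual", "automated"}; empty if none
    provenance_metadata: bool    # provenance metadata reported as recorded
    multi_level_access: bool     # assumption: private mode or OpenSSO counts


# Values are an interpretation of the table rows, not data from the databases.
PROFILES = [
    GovernanceProfile("ChemSpider", frozenset({"manual", "automated"}), True, False),
    GovernanceProfile("CEBS", frozenset({"manual"}), True, True),
    GovernanceProfile("CTD", frozenset({"manual"}), True, False),
    GovernanceProfile("DSSTox", frozenset({"manual", "automated"}), True, False),
    GovernanceProfile("ToxCast", frozenset({"manual"}), False, False),
    GovernanceProfile("ACToR", frozenset(), False, False),
    GovernanceProfile("OpenTox", frozenset({"manual", "automated"}), False, True),
]


def select(profiles, *, provenance=False, access_control=False):
    """Return names of sources meeting every requirement that is switched on."""
    return [
        p.name
        for p in profiles
        if (p.provenance_metadata or not provenance)
        and (p.multi_level_access or not access_control)
    ]


if __name__ == "__main__":
    print(select(PROFILES, provenance=True))
    print(select(PROFILES, provenance=True, access_control=True))
    # Sources that perform no curation of their own.
    print([p.name for p in PROFILES if not p.curation])
```

Encoding each criterion as an explicit field keeps the interpretation visible: changing how a table cell is read (for example, whether a private access mode counts as multi-level access) is a one-value edit rather than a change to the query logic.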