Thomas C Wiegers, Allan Peter Davis, K Bretonnel Cohen, Lynette Hirschman, Carolyn J Mattingly.
Abstract
BACKGROUND: The Comparative Toxicogenomics Database (CTD) is a publicly available resource that promotes understanding of the etiology of environmental diseases. It provides manually curated chemical-gene/protein interactions and chemical- and gene-disease relationships from the peer-reviewed, published literature. The goals of the research reported here were to establish a baseline analysis of current CTD curation, develop a text-mining prototype from readily available open source components, and evaluate its potential value in augmenting curation efficiency and increasing data coverage.
Year: 2009 PMID: 19814812 PMCID: PMC2768719 DOI: 10.1186/1471-2105-10-326
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. CTD curated data relationships. Biocurators capture three types of data relationships from the literature using controlled vocabularies: chemical-gene interactions, chemical-disease relationships, and gene/protein-disease relationships. These three relationships generate a chemical-gene/protein-disease triad that enables users to infer novel connections between all three actors.
Figure 2. Documentation of curated data. a) Currently curated data are captured using controlled vocabularies in Excel spreadsheets that include: curator ID, date of curation, PubMed identification number, interaction (designated using a CTD coding schema), species in which the interaction was observed, interacting chemical, interacting gene/protein, associated diseases (not shown), and author contact information for follow-up purposes (not shown). b) Codes used to capture interactions are translated into readable sentences for the public web application.
CTD manual curation metrics
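| Metric | Biocurator 1 | Biocurator 2 | Biocurator 3 | Average |
|---|---|---|---|---|
| Total articles | 112 | 112 | 112 | 112 |
| Curated articles (%) | 57 (51) | 74 (66) | 69 (62) | 67 (60) |
| Rejected articles (%) | 55 (49) | 38 (34) | 43 (38) | 45 (40) |
| Total time | 1331 | 893 | 2263 | 1496 |
| Time spent on curated articles (%) | 1198 (90) | 822 (92) | 2133 (94) | 1384 (93) |
| Time spent on rejected articles (%) | 133 (10) | 71 (8) | 130 (6) | 111 (7) |
| Curation rate (SD) | 21.0 (31.1) | 11.1 (13.1) | 30.9 (52.9) | 20.7 |
| Rejection rate (SD) | 2.4 (3.4) | 1.9 (3.1) | 3.0 (4.4) | 2.5 |
| Total data extracted | 828 | 2330 | 3039 | 2066 |
| Data extracted per curated article (SD) | 14.5 (34.4) | 31.5 (143.7) | 44.0 (209.8) | 30.8 |
| Data extraction rate (SD) | 0.5 (0.3) | 1.4 (1.7) | 0.6 (0.6) | 0.8 |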
All times and rates were recorded or calculated in minutes
Curation rate = Time spent per curated article. SD = standard deviation.
Rejection rate = Time spent per rejected article.
Total data extracted = total number of chemical-gene, chemical-disease, and gene-disease interactions.
Data extraction rate = macro-average of individual rates of the number of interactions for each curatable article.
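The difference between a macro-averaged rate, as defined above, and a rate pooled over all articles can be sketched in a few lines. This is only an illustration: the function names are hypothetical, the per-article rate is assumed to be interactions per minute, and the sample records are invented rather than taken from CTD.

```python
from statistics import mean

def macro_average_rate(articles):
    """Mean of the per-article rates (interactions per minute).

    `articles` is a list of (interactions, minutes) pairs, one per
    curated article. Every article counts equally, however long it took.
    """
    return mean(interactions / minutes for interactions, minutes in articles)

def pooled_rate(articles):
    """Total interactions divided by total minutes (shown for contrast)."""
    total_interactions = sum(interactions for interactions, _ in articles)
    total_minutes = sum(minutes for _, minutes in articles)
    return total_interactions / total_minutes

# Invented sample data: (interactions extracted, minutes spent).
sample = [(2, 10), (30, 20), (4, 20)]
print(macro_average_rate(sample))  # roughly 0.63
print(pooled_rate(sample))         # 0.72
```

A single data-rich article pulls the pooled rate up more than the macro-average, which is one reason the two figures can differ for the same curator.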
Degree of consensus to curate
| 52 | 57 | 65 | 52 | 58 | |
| 34 | 38 | 34 | 38 | 37 | |
| 86 | 95 | 99 | 90 | 95 | |
| 26 | 17 | 13 | 22 | 17 | |
| 112 | 112 | 112 | 112 | 112 | |
| 0.77 | 0.85 | 0.88 | 0.80 | 0.85 |
Numbers represent each of the three CTD biocurators.
Precision, recall, and f-measure of CTD manual curation
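| Metric | Biocurator 1 | Biocurator 2 | Biocurator 3 | Average |
|---|---|---|---|---|
| Articles analyzed | 53 | 68 | 65 | 62 |
| Precision | 0.90 | 0.97 | 0.86 | 0.91 |
| Recall | 0.62 | 0.79 | 0.71 | 0.71 |
| F1-measure | 0.70 | 0.85 | 0.75 | 0.77 |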
Only curated chemical-gene interactions were analyzed (disease interactions were not considered); consequently, these numbers differ slightly from those reported in Table 2.
Precision = Correct interactions for individual biocurator/Total interactions for individual biocurator.
Recall = Correct interactions for individual biocurator/Total correct interactions from all biocurators.
F1-measure is the harmonic mean of precision and recall = (2 × Precision × Recall)/(Precision + Recall). Precision, Recall, and F1-measure were macro-averaged over the set of articles curated by that curator.
Calculated using data from three biocurators.
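The definitions above can be sketched as follows. This is a minimal illustration and not the authors' evaluation code: interactions are modeled as plain strings in Python sets, the set of correct interactions per article is assumed to be given, and all names and sample values are hypothetical.

```python
from statistics import mean

def article_scores(curated, correct):
    """Precision, recall, and F1 for one biocurator on one article.

    curated: set of interactions this biocurator recorded.
    correct: set of all correct interactions for the article,
             pooled from all biocurators (assumed to be known).
    """
    hits = len(curated & correct)
    precision = hits / len(curated) if curated else 0.0
    recall = hits / len(correct) if correct else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def macro_average(per_article_scores):
    """Average each measure separately over the curator's articles."""
    precisions, recalls, f1s = zip(*per_article_scores)
    return mean(precisions), mean(recalls), mean(f1s)

# Invented example: two articles curated by one biocurator.
scores = [
    article_scores({"a", "b"}, {"a", "b", "c", "d"}),   # P=1.0, R=0.5
    article_scores({"x", "y", "z", "w"}, {"x", "y"}),   # P=0.5, R=1.0
]
print(macro_average(scores))
```

In this toy case the macro-averaged precision and recall are both 0.75 while the macro-averaged F1 is about 0.67, which shows that a macro-averaged F1 need not equal the harmonic mean of the averaged precision and recall.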
Figure 3. Rules-based ranking of articles enhances yield of curated data. When articles are ranked using the rules-based application rather than PubMed ordering (control case), the top 10% of articles would yield more curated data: specifically, 426 more chemical-gene interactions, comprising 82 additional genes, 81 additional chemicals, and 5 more diseases.
Figure 4. Text mining improves the ranking of journal articles for curation. A test set of 354 articles slated for curation was first ranked by two methods: (a) by each article's PubMed identification number in descending order (which typically reflects publication date, from newest to oldest) and (b) by the rank order determined by our rule-based text-mining application. A biocurator then reviewed the articles and determined that 167 contained relevant data (curated, black bars) while 187 did not (rejected, white bars). For presentation, the 354 articles are grouped into progressive quartiles (1st, 2nd, 3rd, and 4th), each containing 89 papers. The percentages of the total curated papers (167) and rejected papers (187) that fall in each quartile are shown. Compared with the less informed ordering by PubMed identification number (a), the text-mining tool (b) placed more of the relevant papers in the first and second quartiles and more of the less relevant papers in the third and fourth quartiles.
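The quartile comparison in Figure 4 can be reproduced in outline with a short sketch. The function name, the way quartile boundaries are drawn, and the eight-article sample are assumptions made for illustration; none of the values come from the 354-article test set.

```python
def quartile_shares(ranked_labels):
    """Percent of all curated articles that falls in each rank quartile.

    ranked_labels: list of booleans in rank order (best rank first),
    True for an article the biocurator curated, False for a rejected one.
    Quartile boundaries use integer division, so quartile sizes may
    differ by one when the list length is not a multiple of four.
    """
    n = len(ranked_labels)
    total_curated = sum(ranked_labels)
    if total_curated == 0:
        raise ValueError("no curated articles in the list")
    bounds = [i * n // 4 for i in range(5)]
    return [
        100 * sum(ranked_labels[bounds[i]:bounds[i + 1]]) / total_curated
        for i in range(4)
    ]

# Invented example: the same eight articles under two orderings.
by_pmid = [False, True, False, True, True, False, False, True]
by_text_mining = [True, True, True, False, True, False, False, False]
print(quartile_shares(by_pmid))         # [25.0, 25.0, 25.0, 25.0]
print(quartile_shares(by_text_mining))  # [50.0, 25.0, 25.0, 0.0]
```

A ranking that helps curation shifts the curated share toward the first quartiles, as in the second ordering.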
Figure 5. Future CTD manual curation workflow. Articles will continue to be identified for curation using PubMed and chemical terms of interest. Articles will be text mined using chemical (OSCAR 3), gene (ABNER), and disease (MetaMap) identifiers as described. Actors identified by text mining will be matched against vocabularies in CTD, and journal articles without matches will be removed. Remaining journal articles will be ranked and loaded into the CTD curation database. Biocurators will curate or reject journal articles using an online application tool that is integrated with the CTD curation and production databases. Curated data will be approved and loaded into the CTD production database.
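The match-filter-rank portion of this workflow can be sketched under strong simplifying assumptions. The code below does not call OSCAR 3, ABNER, or MetaMap; it assumes their output is already available as a set of terms per article. The scoring rule (number of vocabulary matches) is a hypothetical stand-in, since the actual ranking rules are not given here, and all identifiers and terms are invented.

```python
from dataclasses import dataclass

@dataclass
class Article:
    pmid: str          # placeholder identifier, not a real PMID
    mined_terms: set   # terms assumed to come from a text-mining step

def filter_and_rank(articles, vocabulary):
    """Drop articles with no vocabulary match, then rank the rest.

    Ranking here is by number of matched terms (descending), with the
    identifier as a tie-breaker. This rule is an illustrative assumption.
    """
    scored = []
    for article in articles:
        matches = article.mined_terms & vocabulary
        if matches:  # articles without matches are removed
            scored.append((len(matches), article.pmid))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [pmid for _, pmid in scored]

# Invented vocabulary and articles.
vocabulary = {"arsenic", "TP53", "asthma"}
articles = [
    Article("C", {"asthma", "weather"}),
    Article("B", {"unrelated"}),
    Article("A", {"arsenic", "TP53"}),
]
print(filter_and_rank(articles, vocabulary))  # ['A', 'C']
```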