Àlex Bravo, Tong Shu Li, Andrew I Su, Benjamin M Good, Laura I Furlong.
Abstract
Drug toxicity is a major concern for both regulatory agencies and the pharmaceutical industry. In this context, text-mining methods for the identification of drug side effects from free text are key for the development of up-to-date knowledge sources on drug adverse reactions. We present a new system for the identification of drug side effects from the literature that combines three approaches: machine learning, rule-based and knowledge-based. The system was developed to address Task 3.B of the BioCreative V challenge (BC5), dealing with Chemical-induced Disease (CID) relations. The first two approaches identify relations at the sentence level, while the knowledge-based approach is applied at both the sentence and abstract levels. The machine learning method is based on the BeFree system, using two corpora as training data: the annotated data provided by the CID task organizers and a new CID corpus developed by crowdsourcing. Different combinations of results from the three strategies were selected for each run of the challenge. In the final evaluation setting, the system achieved the highest Recall of the challenge (63%). By performing an error analysis, we identified the main causes of misclassification and areas for improvement of our system, and highlighted the need for consistent gold standard data sets to advance the state of the art in text mining of drug side effects.
Database URL: https://zenodo.org/record/29887?ln=en#.VsL3yDLWR_V
Year: 2016 PMID: 27307137 PMCID: PMC4908671 DOI: 10.1093/database/baw094
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
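The abstract notes that "different combinations of results from the three strategies were selected for each run." As an illustrative sketch only (not the authors' actual code), relation sets predicted per abstract can be combined by union to favour recall or by intersection to favour precision; the identifiers below are made-up MeSH-style IDs:

```python
# Hypothetical sketch: combining (chemical, disease) relation sets
# from three extraction strategies. All IDs and data are illustrative.
befree = {("D008094", "D007674"), ("D014635", "D056486")}
patterns = {("D008094", "D007674")}
knowledge = {("D008094", "D007674"), ("D003000", "D003920")}

# Union keeps any strategy's prediction: higher recall, lower precision.
high_recall = befree | patterns | knowledge

# Intersection keeps only unanimous predictions: higher precision.
high_precision = befree & patterns & knowledge

print(len(high_recall), len(high_precision))
```

Intermediate trade-offs (e.g. majority vote, or union of two strategies) would give the kind of per-run variation the tables below report.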
Figure 1. The workflow diagram of the developed system for CID-RE. Note that the CID relations found at both the abstract and sentence levels are processed by the BeFree, CID-patterns and EK-based approaches.
Performance of the different methods evaluated on the BC5D (the first 50 abstracts)
| Exp. | Method | Train data | Level | Recall (%) | Precision (%) | F-score (%) |
|---|---|---|---|---|---|---|
| 1 | Co-occurrence | NA | Both | 81.71 | 16.46 | 27.40 |
| 2 | EK | NA | Abst. | 60.98 | 42.37 | 50.00 |
| 3 | CID-patterns | NA | Sent. | 12.20 | 71.43 | 20.83 |
| 4 | CID-pat. + EK | NA | Both | 63.41 | 42.62 | 50.98 |
| 5 | BeFree system | BC5T | Sent. | 47.56 | 38.61 | 42.62 |
| 6 | BeFree system | crowdCID | Sent. | 54.88 | 33.33 | 41.47 |
| 7 | BeFree system | BC5T + crowdCID | Sent. | 53.66 | 39.20 | 45.62 |
| 8 | Run no. 1 | BC5T | Both | 57.31 | 63.51 | 60.25 |
| 9 | Run no. 1 | crowdCID | Both | 58.53 | 66.66 | 62.33 |
| 10 | Run no. 1 | BC5T + crowdCID | Both | 57.32 | 66.20 | 61.44 |
| 11 | Run no. 1 | BC5T + crowdCID | Both | 46.74 | 56.04 | 50.97 |
| 12 | Run no. 1 | BC5T+D + crowdCID | Both | 38.27 | 48.57 | 42.81 |
| 13 | Run no. 2 | BC5T + crowdCID | Both | 79.27 | 43.91 | 56.52 |
| 14 | Run no. 3 | BC5T + crowdCID | Both | 79.27 | 44.83 | 57.27 |
In this case, the full set (500 abstracts) was used.
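The score columns in the tables are consistent with the standard F-measure, i.e. the harmonic mean of recall and precision: for experiment 1 on BC5D, F(81.71, 16.46) ≈ 27.40, matching the third value. A quick check:

```python
def f_score(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision (both in percent)."""
    return 2 * recall * precision / (recall + precision)

# Rows 1 and 2 of the BC5D table reproduce within rounding.
print(round(f_score(81.71, 16.46), 2))  # 27.4
print(round(f_score(60.98, 42.37), 2))  # 50.0
```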
Performance of the different methods evaluated on the BC5E (with NER results as provided by organizers)
| Exp. | Method | Train data | Level | Recall (%) | Precision (%) | F-score (%) |
|---|---|---|---|---|---|---|
| 1 | Co-occurrence | NA | Both | 72.05 | 16.38 | 26.69 |
| 2 | EK | NA | Abst. | 56.10 | 39.19 | 46.14 |
| 3 | CID-patterns | NA | Sent. | 15.85 | 73.16 | 26.06 |
| 4 | CID-pat. + EK | NA | Both | 14.54 | 79.49 | 24.58 |
| 5 | BeFree system | BC5T+D | Sent. | 43.34 | 42.82 | 43.08 |
| 6 | BeFree system | crowdCID | Sent. | 45.59 | 36.05 | 40.27 |
| 7 | BeFree system | BC5T+D + crowdCID | Sent. | 44.00 | 41.18 | 42.54 |
| 12 | Run no. 1 | BC5T+D + crowdCID | Both | 40.80 | 49.38 | 44.68 |
| 13 | Run no. 2 | BC5T+D + crowdCID | Both | 63.04 | 38.23 | 47.59 |
| 14 | Run no. 3 | BC5T+D + crowdCID | Both | 62.48 | 38.12 | 47.35 |
Figure 2. Number of Gold Standard CID associations stated in a single sentence (sentence level) or spanning several sentences (abstract level).
Figure 3. Number of CID relations at the abstract level identified by the EK approach in relation to the Gold Standard. In each column, the lighter colors represent the fraction of False Negatives (FN).
Figure 4. Number of CID relations at the sentence level identified by our system in relation to the Gold Standard. In each column, the lighter colors represent the fraction of FN.
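Figures 2–4 distinguish relations stated within one sentence from those spanning several. A sentence-level co-occurrence baseline (the idea behind experiment 1 in the tables; the code below is an illustrative sketch under that assumption, not the authors' implementation) simply pairs every chemical mention with every disease mention appearing in the same sentence:

```python
from itertools import product

def cooccurrence_relations(sentences):
    """Pair every chemical with every disease mentioned in the same
    sentence. Each sentence is (chemical_ids, disease_ids); the IDs
    below are illustrative MeSH-style identifiers, not paper data."""
    relations = set()
    for chemicals, diseases in sentences:
        relations.update(product(chemicals, diseases))
    return relations

sents = [
    (["D008094"], ["D007674", "D056486"]),  # 1 chemical, 2 diseases
    (["D014635"], []),                       # no disease mention
]
print(sorted(cooccurrence_relations(sents)))
```

Such a baseline misses exactly the abstract-level relations counted in Figure 2, which is why the knowledge-based approach is also applied at the abstract level.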
Description and examples of each type of error identified
| Description and examples |
|---|
| The NER (DNorm or tmChem) did not detect a mention in the text. |
| The NER did not correctly normalize a mention in the text. |
| The BeFree system incorrectly identified a CID relation. |
| A single span refers to more than one concept. |
| The CID relation is stated only at the abstract level, so BeFree cannot identify it. |
| This type of error is perhaps the most controversial: possible inconsistencies between the GS and our criteria for CID relations. |
Entities detected by the NER are shown in bold; entities that were not detected are underlined.
Figure 5. Percentages of the origin of the FP and FN reported by the BeFree system.