Ayush Singhal, Michael Simmons, Zhiyong Lu.
Abstract
The practice of precision medicine will ultimately require databases of genes and mutations for healthcare providers to reference in order to understand the clinical implications of each patient's genetic makeup. Although the highest-quality databases require manual curation, text mining tools can facilitate the curation process, increasing accuracy, coverage, and productivity. However, to date there are no available text mining tools that offer high-accuracy performance for extracting disease-gene-variant triplets from biomedical literature. In this paper, we propose a high-performance machine learning approach to automate the extraction of these triplets. Our approach is unique because we identify the genes and protein products associated with each mutation not just from the local text content, but from a global context as well (from the Internet and from all literature in PubMed). Our approach also incorporates protein sequence validation and disease association using a novel text-mining-based machine learning approach. We extract disease-gene-variant triplets from all abstracts in PubMed related to a set of ten important diseases (breast cancer, prostate cancer, pancreatic cancer, lung cancer, acute myeloid leukemia, Alzheimer's disease, hemochromatosis, age-related macular degeneration (AMD), diabetes mellitus, and cystic fibrosis). We then evaluate our approach in two ways: (1) a direct comparison with the state of the art using benchmark datasets; (2) a validation study comparing the results of our approach with entries in a popular human-curated database (UniProt) for each of the previously mentioned diseases. In the benchmark comparison, our full approach achieves a 28% improvement in F1-measure (from 0.62 to 0.79) over the state-of-the-art results. For the validation study with UniProt Knowledgebase (KB), we present a thorough analysis of the results and errors.
Across all diseases, our approach returned 272 triplets (disease-gene-variant) that overlapped with entries in UniProt and 5,384 triplets without overlap in UniProt. Analysis of the overlapping triplets and of a stratified sample of the non-overlapping triplets revealed accuracies of 93% and 80% for the respective categories (cumulative accuracy, 77%). We conclude that our process represents an important and broadly applicable improvement to the state of the art for curation of disease-gene-variant relationships.
Year: 2016 PMID: 27902695 PMCID: PMC5130168 DOI: 10.1371/journal.pcbi.1005017
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1. An example showing the complexity of mining triplet information from a PubMed abstract.
Fig 2. Overview of the proposed approach.
Fig 3. A schematic of identifying a ranked list of genes using global knowledge for a given mutation.
Comparison of the proposed approach with the EMU approach on benchmark datasets.
Values in parentheses are (true positives, false positives) for precision and (true positives, false negatives) for recall.
| Corpus / Metric | EMU: Without sequence filter | Our baseline approach: Co-occurrence only | Our approach: Without sequence filter | EMU: With sequence filter | Our full approach: With sequence filter |
|---|---|---|---|---|---|
| Precision | 0.39 (151, 237) | 0.37 (154, 263) | 0.75 (132, 42) | 0.59 (127, 89) | |
| Recall | 0.80 (151, 37) | | 0.70 (132, 56) | 0.66 (127, 61) | 0.77 (144, 44) |
| F-measure | 0.52 | 0.51 | 0.724 | 0.62 | |
| Precision | 0.34 (242, 470) | 0.33 (252, 504) | 0.738 (206, 73) | 0.61 (193, 121) | |
| Recall | 0.85 (242, 42) | | 0.725 (206, 78) | 0.68 (193, 91) | 0.73 (207, 77) |
| F-measure | 0.49 | 0.49 | 0.73 | 0.64 | |
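The parenthesised counts are enough to recompute each reported score. The following is a minimal sketch, not the authors' evaluation code: the function name `precision_recall_f1` is hypothetical, and the example uses the first precision and recall entries in the table, 0.39 (151, 237) and 0.80 (151, 37), which share the same true-positive count.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) computed from raw counts.

    tp: true positives, fp: false positives, fn: false negatives.
    Hypothetical helper for illustration; not taken from the paper.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts read from the table: precision cell (TP=151, FP=237),
# recall cell (TP=151, FN=37).
p, r, f = precision_recall_f1(tp=151, fp=237, fn=37)
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")  # prints "P=0.39 R=0.80 F1=0.52"
```

The recomputed values agree with the reported 0.39, 0.80, and 0.52 for that entry, which suggests the F-measure rows are the balanced F1 of the precision and recall rows above them.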
Fig 4. Comparing PubMed text-mined results with UniProtKB curated set.
Fig 5. Three-tier analysis of text-mined vs. curated gene-disease-variant triplets.
Precision computation based on human annotation of random samples from uncurated mutations.
| Frequency Group | High (138) | Medium (343) | Low (4903) | Total (5384) |
|---|---|---|---|---|
| Correct in sample | 47 | 89 | 195 | 331 |
| Sample size | 58 | 112 | 260 | 430 |
| Precision | 0.81 | 0.80 | 0.75 | 0.77 |
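The arithmetic behind this table can be sketched as follows. This assumes the first numeric row is the number of sampled triplets judged correct and the second is the sample size per frequency group; that reading is an inference from the ratios, not something the record states. The variable names are hypothetical.

```python
# (correct, sampled) per frequency group, read from the table above.
# Assumption: row 1 = triplets judged correct, row 2 = triplets sampled.
samples = {
    "High": (47, 58),
    "Medium": (89, 112),
    "Low": (195, 260),
}

# Per-group precision is simply correct / sampled.
for group, (correct, sampled) in samples.items():
    print(f"{group}: {correct / sampled:.3f}")

# Pooled precision over all annotated samples.
total_correct = sum(correct for correct, _ in samples.values())
total_sampled = sum(sampled for _, sampled in samples.values())
pooled_precision = total_correct / total_sampled
print(f"Total: {total_correct}/{total_sampled} = {pooled_precision:.2f}")
```

Under this reading the column totals come out to 331 and 430, and the pooled precision matches the reported 0.77. The per-group ratios agree with the reported values to within about 0.01.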
Analysis of missed mutations.
| Disease | <Protein-Mutation-PMID> analyzed | Had disease mention | Had protein mention | Had mutation mention | Had triplet mention | # of PMIDs | Comments |
|---|---|---|---|---|---|---|---|
| | 677 | 622 | 50 | 3 | 0 | 45 | PMID: 16959974 had 587/677 mutations |
| | 19 | 5 | 4 | 7 | 0 | 15 | |
| | 87 | 51 | 69 | 10 | 0 | 38 | |
| | 123 | 43 | 112 | 9 | | 44 | PMID: 17349580: classification error. PMID: 18227510: no lung cancer MeSH terms |
| | 19 | 17 | 17 | 0 | 0 | 7 | |
| | 49 | 29 | 45 | 6 | | 26 | PMID: 16247455: missed due to classification error |
| | 15 | 8 | 12 | 0 | 0 | 10 | |
| | 1 | 0 | 1 | 1 | 0 | 1 | |
| | 19 | 8 | 14 | 3 | 0 | 19 | |
| | 5 | 1 | 4 | 0 | 0 | 5 | |