A S M Ashique Mahmood1, Tsung-Jung Wu2, Raja Mazumder2,3, K Vijay-Shanker1.
Abstract
The number of published articles describing associations between mutations and diseases is increasing at a fast pace. There is a pressing need to gather such mutation-disease associations into public knowledge bases, but manual curation slows down the growth of such databases. We have addressed this problem by developing a text-mining system (DiMeX) to extract mutation-disease associations from publication abstracts. DiMeX consists of a series of natural language processing modules that preprocess input text and apply syntactic and semantic patterns to extract mutation-disease associations. DiMeX achieves high precision and recall, with F-scores of 0.88, 0.91 and 0.89 when evaluated on three different datasets for mutation-disease associations. DiMeX includes a separate component that extracts mutation mentions in text and associates them with genes. This component has also been evaluated on different datasets and shown to achieve state-of-the-art performance. The results indicate that our system outperforms the existing mutation-disease association tools, addressing the low precision problems suffered by most approaches. DiMeX was applied to a large set of abstracts from Medline to extract mutation-disease associations, as well as other relevant information including patient/cohort size and population data. The results are stored in a database that can be queried and downloaded at http://biotm.cis.udel.edu/dimex/. We conclude that this high-throughput text-mining approach has the potential to significantly assist researchers and curators in enriching mutation databases.
Year: 2016 PMID: 27073839 PMCID: PMC4830514 DOI: 10.1371/journal.pone.0152725
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Schematic diagram of DiMeX.
Example sentences for different sections from PMID:10810408.
| Rhetorical zone/section | Example sentence |
|---|---|
| Title | Missense alterations of BRCA1 gene detected in diverse cancer patients. |
| Introduction/Background | BRCA1 gene mutations may also be related with other types of cancers such as prostate cancer and colorectal cancer. |
| Methods/Aims | We used PCR-NIRCA and PCR-SSCP methods for screening the BRCA1 mutation hot regions, exons 2, 5, 11, 16 and 20. |
| Results | We have identified a rare sequence variant, A3537G (Ser 1140Gly) in a B cell lymphoma patient and two polymorphisms, A1186G (Gln356Arg) in a brain cancer patient and A3667G (Lys1183Arg) in a germline tumor patient. |
| Conclusion | In conclusion, 3 missense alterations of BRCA1 gene have been identified in cancers other than breast cancer. |
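The Results sentence in the table above contains both nucleotide-level and protein-level point mutation mentions. As a rough illustration of what detecting such mentions involves, here is a minimal regular-expression sketch. The patterns and the `find_point_mutations` helper are hypothetical and much simpler than what a real system needs; this record does not describe DiMeX's actual patterns, so nothing here should be read as its implementation.

```python
import re

# Hypothetical, simplified patterns for illustration only;
# they are NOT the patterns used by DiMeX.
AMINO_ACIDS = (
    "Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
    "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val"
)

# Nucleotide substitution such as A3537G: base, position, base.
DNA_SUBSTITUTION = re.compile(r"\b([ACGT])(\d+)([ACGT])\b")

# Protein substitution such as Gln356Arg; the optional whitespace
# tolerates spacing variants such as "Ser 1140Gly".
PROTEIN_SUBSTITUTION = re.compile(
    rf"\b({AMINO_ACIDS})\s?(\d+)({AMINO_ACIDS})\b"
)


def find_point_mutations(sentence):
    """Return (dna_mentions, protein_mentions) as lists of matched strings."""
    dna = [m.group(0) for m in DNA_SUBSTITUTION.finditer(sentence)]
    protein = [m.group(0) for m in PROTEIN_SUBSTITUTION.finditer(sentence)]
    return dna, protein


# Results sentence from PMID:10810408, as quoted in the table above.
RESULTS_SENTENCE = (
    "We have identified a rare sequence variant, A3537G (Ser 1140Gly) in a "
    "B cell lymphoma patient and two polymorphisms, A1186G (Gln356Arg) in a "
    "brain cancer patient and A3667G (Lys1183Arg) in a germline tumor patient."
)

dna_mentions, protein_mentions = find_point_mutations(RESULTS_SENTENCE)
print(dna_mentions)      # ['A3537G', 'A1186G', 'A3667G']
print(protein_mentions)  # ['Ser 1140Gly', 'Gln356Arg', 'Lys1183Arg']
```

Patterns this narrow miss many real-world forms (deletions, insertions, one-letter amino acid codes, rs identifiers), and they do nothing to associate a mention with a gene or disease.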
Summary information of the datasets used for evaluation purposes.
| Name of dataset | Used for tasks of | Used for evaluation of | Size |
|---|---|---|---|
| BiomutaC | <M,G,D> & <M,G> | DiMeX | 62 abstracts (119 <M,G,D> triplets) |
| PCa_filtered_UD | <M,G,D> & <M,G> | DiMeX & EMU | 97 abstracts (170 <M,G,D> triplets) |
| BCa_filtered_UD | <M,G,D> & <M,G> | DiMeX & EMU | 132 abstracts (216 <M,G,D> triplets) |
| MF | M | DiMeX | 508 abstracts (910 point mutations) |
| Variome | M | DiMeX | 10 full text articles (118 mutations) |
| tmVar | M | DiMeX | 166 abstracts (464 mutations) |
DiMeX’s performance of mutation-disease (<M,G,D>) and mutation-gene (<M,G>) association extraction.
| Dataset | <M,G,D> P | <M,G,D> R | <M,G,D> F | <M,G> P | <M,G> R | <M,G> F |
|---|---|---|---|---|---|---|
| BiomutaC | 0.87 | 0.89 | 0.88 | 0.90 | 0.95 | 0.93 |
DiMeX’s performance of mutation-disease association (<M,G,D>) extraction, compared with EMU.
| Dataset | DiMeX P | DiMeX R | DiMeX F | EMU P | EMU R | EMU F |
|---|---|---|---|---|---|---|
| PCa_filtered_UD | 0.95 | 0.88 | 0.91 | 0.76 | 0.75 | 0.75 |
| BCa_filtered_UD | 0.93 | 0.85 | 0.89 | 0.64 | 0.71 | 0.67 |
DiMeX’s performance of mutation-gene association (<M,G>) extraction, compared with EMU.
| Dataset | DiMeX P | DiMeX R | DiMeX F | EMU P | EMU R | EMU F |
|---|---|---|---|---|---|---|
| PCa_filtered_UD | 0.96 | 0.92 | 0.94 | 0.77 | 0.75 | 0.76 |
| BCa_filtered_UD | 0.94 | 0.94 | 0.94 | 0.65 | 0.72 | 0.68 |
Evaluation of mutation detection systems on various datasets.
| Tool | Performance measure | MF corpus (MF mutations normalized) | Variome corpus | tmVar corpus |
|---|---|---|---|---|
| MF | P | 0.98 (0.98) | 0.94 | - |
| MF | R | 0.82 (0.81) | 0.16 | - |
| MF | F | 0.89 (0.89) | 0.24 | - |
| EMU | P | - (0.99) | 0.97 | - |
| EMU | R | - (0.81) | 0.76 | - |
| EMU | F | - (0.89) | 0.85 | - |
| tmVar | P | 0.99 (0.98) | 0.97 | 0.91 |
| tmVar | R | 0.90 (0.84) | 0.91 | 0.91 |
| tmVar | F | 0.94 (0.90) | 0.94 | 0.91 |
| SETH | P | 0.98 (0.97) | 0.99 | 0.94 |
| SETH | R | 0.82 (0.81) | 0.76 | 0.81 |
| SETH | F | 0.89 (0.88) | 0.86 | 0.87 |
| DiMeX | P | 0.99 (0.98) | 0.96 | 0.94 |
| DiMeX | R | 0.89 (0.89) | 0.92 | 0.89 |
| DiMeX | F | 0.94 (0.93) | 0.94 | 0.91 |
The precision (P), recall (R) and F-measure (F) values for tools other than DiMeX are taken from the comparisons performed in [44], the published results in [20] and [28], and the SETH tool website. A dash (‘-’) indicates that data are unavailable. The tools are MutationFinder (MF), Extractor of Mutations (EMU), tmVar, SNP Extraction Tool for Human Variations (SETH) and DiMeX. For the MF corpus, the results in parentheses represent evaluation on normalized mutations, where multiple occurrences of the same mutation are normalized to one entry.
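For readers who want to relate the three measures, the sketch below computes an F-measure from precision and recall. It assumes the balanced F-measure (F1, the harmonic mean of P and R); this record does not state the weighting used, so that is an assumption, and `f_measure` is an illustrative helper name.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumes equal weighting of precision and recall.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# DiMeX on the tmVar corpus, using P and R from the table above.
print(round(f_measure(0.94, 0.89), 2))  # 0.91
```

Because the tabulated P and R are already rounded to two decimals, recomputing F this way can differ slightly from a reported value.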
Fig 2. Querying the database with “pancreatic cancer”.
(A) Screenshot from the DiMeX website showing a portion of the triplet-view results for the query. (B) Options to select PMID-view for abstracts that are review and/or meta-analysis studies or associated with one or more populations. (C) An example showing that the results can be instantly sorted by clicking on the column name.
Fig 3. Querying the database with “pancreatic cancer”.
This screenshot from the DiMeX website shows a portion of the PMID-view results for abstracts that are associated with one or more populations.
Characteristics of the extracted results.
The query used to select the PMIDs is “(cancer[tiab]) AND (mutation[tiab] OR variant[tiab] OR polymorphism[tiab])”, with abstracts selected from 2009 to 2011.
| Characteristic | Count |
|---|---|
| Abstracts | 9727 |
| Abstracts with at least one triplet | 2511 |
| Total <M,G,D> triplets | 7175 |
| Unique <M,G,D> triplets | 6410 |
| Unique mutations | 3204 |
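A PMID selection like the one described above could be reproduced programmatically. The following is a minimal sketch that only builds a request URL for the NCBI E-utilities `esearch` endpoint; it makes no network call. This record does not say how the authors ran their query, so the use of E-utilities, the choice of publication date (`pdat`) as the date type, the `retmax` value, and the helper name `build_esearch_url` are all assumptions.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, min_year, max_year, retmax=100):
    """Build an esearch URL restricted to a publication-year range.

    Filtering on publication date ("pdat") is an assumption; the source
    only says abstracts were selected from 2009 to 2011.
    """
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",
        "mindate": str(min_year),
        "maxdate": str(max_year),
        "retmax": str(retmax),
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Query string from the note above, written with balanced parentheses.
QUERY = (
    "(cancer[tiab]) AND "
    "(mutation[tiab] OR variant[tiab] OR polymorphism[tiab])"
)

url = build_esearch_url(QUERY, 2009, 2011)
print(url)
```

Retrieving all matching PMIDs would also require paging through results, and the counts returned today need not match the 9727 abstracts reported here, since PubMed indexing changes over time.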