Tomoko Ohta, Sampo Pyysalo, Makoto Miwa, Jun'ichi Tsujii.
Abstract
BACKGROUND: We consider the task of automatically extracting DNA methylation events from the biomedical domain literature. DNA methylation is a key mechanism of epigenetic control of gene expression and is implicated in many cancers, but there has been little study of automatic information extraction for DNA methylation.
Year: 2011 PMID: 22166595 PMCID: PMC3239302 DOI: 10.1186/2041-1480-2-S5-S2
Source DB: PubMed Journal: J Biomed Semantics
Figure 1. DNA methylation in PubMed. Citations tagged with the MeSH term DNA Methylation compared to all citations in PubMed by publication year. Note different scales.
Examples of PubMeth evidence sentence annotation
| a) | MS-PCR revealed the [ |
| b) | 30% (27 of 91) of [ |
| c) | [ |
| d) | The promoter region of the [ |
Annotated spans delimited by brackets and statements expressing methylation underlined, gene mentions shown in italics, and cancer mentions in bold.
Figure 2. Event representation. BioNLP Shared Task representation for annotation of phosphorylation events (above) and representation applied for DNA methylation (below).
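The event structure described in this caption, together with the "Theme only" and "Theme and Site" rows of the corpus statistics, can be sketched as a small data structure. This is a minimal illustration rather than the corpus file format: the class names, the `find_span` helper, the presence of a trigger span, and the example sentence (including the placeholder gene name GENE1) are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    kind: str   # "GGP", "Site", or the event type for a trigger (assumed labels)
    start: int  # character offset where the span begins
    end: int    # character offset one past the last character
    text: str


@dataclass(frozen=True)
class Event:
    kind: str                    # "DNA methylation" or "DNA demethylation"
    trigger: Span                # word expressing the event (assumed to be annotated)
    theme: Span                  # the gene/gene product (GGP) that is methylated
    site: Optional[Span] = None  # optional Site argument, e.g. a promoter region


def find_span(sentence: str, kind: str, text: str) -> Span:
    """Hypothetical helper: locate the first occurrence of text in sentence."""
    start = sentence.index(text)
    return Span(kind, start, start + len(text), text)


# Invented example sentence; GENE1 is a placeholder, not a corpus entry.
sentence = "The promoter region of GENE1 was methylated."
event = Event(
    kind="DNA methylation",
    trigger=find_span(sentence, "DNA methylation", "methylated"),
    theme=find_span(sentence, "GGP", "GENE1"),
    site=find_span(sentence, "Site", "promoter region"),
)
```

An event with `site=None` corresponds to a "Theme only" event; one with a Site span corresponds to "Theme and Site".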
Figure 3. Gene/protein mentions in DNA methylation abstracts. Number of abstracts with given number of automatically tagged gene/protein mentions.
Corpus statistics
| | PubMeth | PubMed | Total |
|---|---|---|---|
| Abstracts | 100 | 100 | 200 |
| Sentences | 1118 | 1009 | 2127 |
| Entities | | | |
| GGP | 1695 | 1195 | 2890 |
| Site | 240 | 234 | 474 |
| Total | 1935 | 1429 | 3364 |
| Events | | | |
| Theme only | 660 | 214 | 874 |
| Theme and Site | 323 | 297 | 620 |
| DNA methylation | 977 | 485 | 1462 |
| DNA demethylation | 6 | 26 | 32 |
| Total | 983 | 511 | 1494 |
Overall extraction performance
| Event type | precision | recall | F-score |
|---|---|---|---|
| DNA methylation | 77.6% | 77.2% | 77.4% |
| DNA demethylation | 100.0% | 11.1% | 20.0% |
| Total | 77.7% | 76.0% | 76.8% |
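The F-score column is consistent with the balanced F-score, i.e. the harmonic mean of precision and recall. The sketch below recomputes it from the precision and recall values in the table; the function name `f_score` is chosen for this illustration, and the assumption that the reported metric is the balanced F-score is inferred from the numbers rather than stated in this record.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in the same units)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall percentages copied from the table above.
rows = {
    "DNA methylation": (77.6, 77.2),
    "DNA demethylation": (100.0, 11.1),
    "Total": (77.7, 76.0),
}

for name, (p, r) in rows.items():
    print(f"{name}: {f_score(p, r):.1f}%")
```

The DNA demethylation row shows how strongly the harmonic mean penalizes imbalance: perfect precision combined with 11.1% recall still yields an F-score of only about 20%.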
Extraction performance by subcorpus
| Training set \ Test set | PubMed | PubMeth | Both |
|---|---|---|---|
| PubMed | 64.9% | 71.2% | 71.6% |
| PubMeth | 62.9% | 80.0% | 74.0% |
| Both | 66.2% | 82.5% | 76.8% |
F-score performance shown.
Figure 4. Learning curves. Learning curves for the two subcorpora and their combination. Both subcorpora used for training, development sets for testing. Average and error bars calculated by 10 repetitions of random subsampling of training data.
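The subsampling procedure described in this caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `learning_curve` and its `evaluate` callback are hypothetical names, `evaluate` stands in for training a system on the sampled documents and scoring it on the development set, and the error bar is taken to be the sample standard deviation, which the caption does not specify.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple


def learning_curve(
    train_docs: Sequence,
    fractions: Sequence[float],
    evaluate: Callable[[List], float],
    repetitions: int = 10,
    seed: int = 0,
) -> List[Tuple[float, float, float]]:
    """Return (fraction, mean score, score spread) for each training fraction."""
    rng = random.Random(seed)
    docs = list(train_docs)
    curve = []
    for fraction in fractions:
        size = max(1, round(fraction * len(docs)))
        # Draw a fresh random subset of the training data for each repetition.
        scores = [evaluate(rng.sample(docs, size)) for _ in range(repetitions)]
        spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
        curve.append((fraction, statistics.mean(scores), spread))
    return curve


# Toy stand-in for "train on subset, score on development set":
# the score here is just the share of the 100 documents that was sampled.
toy_curve = learning_curve(
    list(range(100)), [0.1, 0.5, 1.0], lambda subset: len(subset) / 100
)
```

In a real experiment `evaluate` would return the development-set F-score, and the per-fraction mean and spread would be plotted as the curve and its error bars.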