Manabu Torii, Kavishwar Wagholikar, Hongfang Liu.
Abstract
BACKGROUND: Identifying phrases that refer to particular concept types is a critical step in extracting information from documents. Provided with annotated documents as training data, supervised machine learning can automate this process. When building a machine learning model for this task, the model may be built to detect all types simultaneously (all-types-at-once) or it may be built for one or a few selected types at a time (one-type- or a-few-types-at-a-time). It is of interest to investigate which strategy yields better detection performance.
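The difference between the two strategies can be pictured as a difference in how the training labels are prepared. The following is a minimal sketch only, assuming token-level BIO tags (an assumption; the abstract does not state the tagging scheme). The function name `restrict_labels` and the sample tags are hypothetical and used purely for illustration.

```python
from typing import Iterable, List, Set


def restrict_labels(tags: Iterable[str], keep_types: Set[str]) -> List[str]:
    """Keep BIO tags whose concept type is in keep_types; map all others to 'O'."""
    restricted = []
    for tag in tags:
        if tag == "O":
            restricted.append("O")
            continue
        # A tag such as "B-Problem" splits into the prefix "B" and the type "Problem".
        _prefix, _, concept_type = tag.partition("-")
        restricted.append(tag if concept_type in keep_types else "O")
    return restricted


# Hypothetical token-level annotation, for illustration only.
tags = ["B-Problem", "I-Problem", "O", "B-Test", "O", "B-Treatment"]

# All-types-at-once: every annotated type stays in the training labels.
all_at_once = restrict_labels(tags, {"Problem", "Test", "Treatment"})

# One-type-at-a-time: only the target type is kept; other types become "O".
problem_only = restrict_labels(tags, {"Problem"})

# A-few-types-at-a-time: the target type plus a selected subset of the others.
problem_and_test = restrict_labels(tags, {"Problem", "Test"})

print(all_at_once)
print(problem_only)
print(problem_and_test)
```

Under this reading, a model trained on `problem_only` never sees the boundaries of Test or Treatment phrases, whereas a model trained on `all_at_once` does.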
Year: 2014 PMID: 24438362 PMCID: PMC3908466 DOI: 10.1186/2041-1480-5-3
Source DB: PubMed Journal: J Biomed Semantics
Descriptive statistics of the corpora
| | i2b2/VA | | JNLPBA | |
| Documents | | 349 | | 2,000 |
| Tokens | | 260,570 | | 492,301 |
| Concept phrases | Problem | 11,968 | Protein | 30,269 |
| | Test | 7,369 | DNA | 9,530 |
| | Treatment | 8,500 | Cell Type | 6,710 |
| | | | Cell Line | 3,830 |
| | | | RNA | 951 |
Comparison of detection performance
| Corpus | Type | Strategy | TP | FP | FN | Precision | Recall | F-score |
| i2b2/VA | Problem | All-at-once | 964 | 267 | 231 | 0.783 | 0.806 | 0.794 |
| | | One-at-a-time | 932 | 244 | 264 | 0.792 | 0.779 | 0.785 |
| | Test | All-at-once | 582 | 114 | 153 | 0.835 | 0.791 | 0.813 |
| | | One-at-a-time | 551 | 112 | 185 | 0.831 | 0.748 | 0.787 |
| | Treatment | All-at-once | 653 | 139 | 196 | 0.823 | 0.769 | 0.795 |
| | | One-at-a-time | 625 | 138 | 223 | 0.818 | 0.737 | 0.775 |
| JNLPBA | Protein | All-at-once | 2,373 | 840 | 653 | 0.739 | 0.784 | 0.761 |
| | | One-at-a-time | 2,251 | 752 | 775 | 0.749 | 0.744 | 0.747 |
| | DNA | All-at-once | 581 | 270 | 371 | 0.683 | 0.610 | 0.644 |
| | | One-at-a-time | 527 | 339 | 425 | 0.609 | 0.553 | 0.580 |
| | Cell Type | All-at-once | 496 | 167 | 174 | 0.748 | 0.740 | 0.744 |
| | | One-at-a-time | 455 | 168 | 215 | 0.730 | 0.678 | 0.703 |
| | Cell Line | All-at-once | 233 | 102 | 149 | 0.695 | 0.610 | 0.649 |
| | | One-at-a-time | 212 | 180 | 170 | 0.543 | 0.554 | 0.548 |
| | RNA | All-at-once | 36 | 24 | 59 | 0.594 | 0.383 | 0.462 |
| | | One-at-a-time | 33 | 18 | 62 | 0.640 | 0.345 | 0.447 |
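The three ratio columns in the comparison table can be approximately reproduced from the three count columns. The sketch below reads the counts as true positives, false positives, and false negatives, in that order (an assumption, though the reported precision and recall values are consistent with it). The function name `precision_recall_f` is hypothetical.

```python
from typing import Tuple


def precision_recall_f(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, balanced F-score) from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Counts from the "Problem" / all-at-once row of the i2b2/VA results.
p, r, f = precision_recall_f(964, 267, 231)
print(f"precision={p:.3f} recall={r:.3f} f={f:.3f}")
```

Values recomputed this way may differ from the table in the third decimal place. One possible explanation, offered only as an assumption, is that the reported figures were averaged across cross-validation folds rather than derived from the counts shown.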
Figure 1. Detection performance for the 2010 i2b2/VA challenge corpus. The horizontal axis shows incremental sets of types, including the selected target type (e.g., “Problem” in the top figure), and the rightmost set corresponds to the all-at-once setting. The reported F-scores are for the selected target type.
Figure 2. Detection performance for the JNLPBA corpus. The horizontal axis shows incremental sets of types, including the selected target type, and the rightmost set corresponds to the all-at-once setting. The reported F-scores are for the selected target type.
Additional errors introduced in one-type-at-a-time on the i2b2/VA corpus
| Problem | 42 | 199 | 244 |
| Test | 50 | 92 | 299 |
| Treatment | 47 | 266 | 113 |
Time to train and apply HMM models on the i2b2/VA and JNLPBA corpora
| Corpus | Concept types | Train | Apply |
| i2b2 | Problem, Test, Treatment | 619 | 42 |
| | Problem, Treatment | 763 | 41 |
| | Problem, Test | 879 | 42 |
| | Problem | 1,117 | 43 |
| JNLPBA | Protein, DNA, Cell Type, Cell Line, RNA | 3,010 | 88 |
| | Protein, DNA, Cell Type, Cell Line | 3,812 | 92 |
| | Protein, DNA, Cell Type | 4,292 | 98 |
| | Protein, RNA | 4,694 | 100 |
| | Protein | 4,763 | 98 |
1 The experiments were conducted on a server with six-core AMD Opteron 2.8 GHz processors running CentOS 2.6. The reported times are the average of ten runs in ten-fold cross-validation.