Suvir Jain, Kashyap R Tumkur, Tsung-Ting Kuo, Shitij Bhargava, Gordon Lin, Chun-Nan Hsu.
Abstract
BACKGROUND: Many publicly available biomedical databases derive their data by curation from the literature. Curated data can be useful as training examples for information extraction, but they usually lack the exact mentions and their locations in the text that supervised machine learning requires. This paper describes a general approach to information extraction that uses curated data as training examples. The idea is to formulate the problem as cost-sensitive learning from noisy labels, where the cost is estimated by a committee of weak classifiers that consider both the curated data and the text.
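The formulation in the abstract can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration, not the paper's method: the three threshold rules acting as the committee, the use of the committee's agreement rate as the per-example cost, and the weighted perceptron standing in for the learner. The abstract does not specify any of these.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
WeakClassifier = Callable[[Vector], int]


def committee_weight(x: Vector, noisy_label: int,
                     committee: List[WeakClassifier]) -> float:
    """Fraction of committee members that agree with the noisy label.

    Used here as the cost (weight) of the example: labels the committee
    disputes contribute less to training. Hypothetical choice of cost.
    """
    agree = sum(1 for clf in committee if clf(x) == noisy_label)
    return agree / len(committee)


def train_weighted_perceptron(xs: List[Vector], ys: List[int],
                              costs: List[float], epochs: int = 20,
                              lr: float = 0.1) -> Tuple[List[float], float]:
    """Perceptron whose update step is scaled by the per-example cost."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y, c in zip(xs, ys, costs):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            step = lr * c * (y - pred)  # zero when correct or when cost is 0
            w = [wi + step * xi for wi, xi in zip(w, x)]
            b += step
    return w, b


def predict(w: List[float], b: float, x: Vector) -> int:
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Hypothetical committee of three weak rules over two features.
committee: List[WeakClassifier] = [
    lambda x: int(x[0] > 0.5),
    lambda x: int(x[1] > 0.5),
    lambda x: int(x[0] + x[1] > 1.0),
]

# Toy data; the last label is deliberately wrong (label noise).
xs = [(0.9, 0.8), (0.8, 0.9), (0.1, 0.2), (0.2, 0.1), (0.9, 0.9)]
ys = [1, 1, 0, 0, 0]

costs = [committee_weight(x, y, committee) for x, y in zip(xs, ys)]
w, b = train_weighted_perceptron(xs, ys, costs)
```

In this toy setup the committee unanimously disagrees with the mislabeled example, so its cost is zero and it never moves the decision boundary.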
Year: 2016 PMID: 26817711 PMCID: PMC4847485 DOI: 10.1186/s12859-015-0844-1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 An exemplar entry in the Catalog of GWAS and its curated data. Example of an entry in the Catalog of GWAS (upper panel) and after matching the curated data to the text of the source paper [54] (lower panel)
Fig. 2 Example sentences describing sample ethnicity groups and stages. Example of sentences in free text from which the system extracts study targets
Fig. 3 System architecture. Our proposed system architecture summarizing the steps (a-e) in the machine learning training process
Precision-at-2 (P@2) for identifying the target disease/trait mention of a GWAS study
| Method | Arithmetic | Harmonic |
|---|---|---|
| Cost-insensitive | 68.65 % | 79.62 % |
| Cost-sensitive | 78.05 % | 87.46 % |
| BIOADI+Cost-sensitive | 75.57 % | 87.29 % |
| CRF+Cost-sensitive | 65.79 % | 75.24 % |
Performance of stage-ethnicity extraction (micro average)
| Method | Precision | Recall | F1 Score |
|---|---|---|---|
| Baseline | 0.4898 | 1.0000 | 0.6576 |
| Cost-insensitive | 0.6965 | 0.7077 | 0.7020 |
| Cost-sensitive | 0.7471 | 0.7711 | 0.7589 |
Performance of stage-ethnicity extraction (macro average)
| Method | Precision | Recall | F1 Score |
|---|---|---|---|
| Baseline | 0.5972 | 1.0000 | 0.7478 |
| Cost-insensitive | 0.7408 | 0.7943 | 0.7666 |
| Cost-sensitive | 0.7893 | 0.8757 | 0.8302 |
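The difference between the micro and macro averages reported in the two tables above can be sketched as follows. This is a generic illustration with made-up counts, not the paper's evaluation code. The excerpt does not say which unit the macro average runs over, so the grouping used here (hypothetical `frequent_group` and `rare_group`) is an assumption.

```python
from typing import Dict, Tuple

# (true positives, false positives, false negatives) for one group
Counts = Tuple[int, int, int]
Scores = Tuple[float, float, float]  # (precision, recall, F1)


def prf(tp: int, fp: int, fn: int) -> Scores:
    """Precision, recall and F1 from raw counts, with 0.0 on empty denominators."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def micro_average(groups: Dict[str, Counts]) -> Scores:
    """Pool the counts of all groups first, then score once."""
    tp = sum(c[0] for c in groups.values())
    fp = sum(c[1] for c in groups.values())
    fn = sum(c[2] for c in groups.values())
    return prf(tp, fp, fn)


def macro_average(groups: Dict[str, Counts]) -> Scores:
    """Score each group separately, then take the unweighted mean."""
    scores = [prf(*c) for c in groups.values()]
    n = len(scores)
    return (sum(s[0] for s in scores) / n,
            sum(s[1] for s in scores) / n,
            sum(s[2] for s in scores) / n)


# Made-up counts: one large, easy group and one small, hard group.
counts = {"frequent_group": (90, 10, 10), "rare_group": (1, 4, 4)}
micro = micro_average(counts)
macro = macro_average(counts)
```

With these counts the micro average is dominated by the large group, while the macro average gives both groups equal weight, which is why the two tables can report different values for the same method.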