Kyubum Lee, Sunwon Lee, Sungjoon Park, Sunkyu Kim, Suhkyung Kim, Kwanghun Choi, Aik Choon Tan, Jaewoo Kang.
Abstract
Comprehensive knowledge of genomic variants in a biological context is key for precision medicine. As next-generation sequencing technologies improve, the amount of literature containing genomic variant data, such as new functions or related phenotypes, rapidly increases. Because numerous articles are published every day, it is almost impossible to manually curate all the variant information from the literature. Many researchers focus on creating an improved automated biomedical natural language processing (BioNLP) method that extracts useful variants and their functional information from the literature. However, there is no gold-standard data set that contains texts annotated with variants and their related functions. To overcome these limitations, we introduce a Biomedical entity Relation ONcology COrpus (BRONCO) that contains more than 400 variants and their relations with genes, diseases, drugs and cell lines in the context of cancer and anti-tumor drug screening research. The variants and their relations were manually extracted from 108 full-text articles. BRONCO can be utilized to evaluate and train new methods used for extracting biomedical entity relations from full-text publications, and thus be a valuable resource to the biomedical text mining research community. Using BRONCO, we quantitatively and qualitatively evaluated the performance of three state-of-the-art BioNLP methods. We also identified their shortcomings, and suggested remedies for each method. We implemented post-processing modules for the three BioNLP methods, which improved their performance.

Database URL: http://infos.korea.ac.kr/bronco
Year: 2016 PMID: 27074804 PMCID: PMC4830473 DOI: 10.1093/database/baw043
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Manual curation of BRONCO. (a) Workflow of manual curation. (b) Example of manual curation.
Figure 2. Workflow for assessing the performance of MF, EMU and tmVar in this study.
Characteristics of the variant-centric corpora
| Corpus | MF | EMU | EMU (New) | tmVar | Variome | BRONCO |
|---|---|---|---|---|---|---|
| Type | Title and abstract | Title and abstract | Title and abstract | Title and abstract | Full-text | Full-text |
| Contained Information | Variants | Variants-genes-diseases (BC, PC) | Variants-genes-diseases (BC, PC) | Variants | Annotated Texts | Var-gene-disease-drug-cell-line |
| Total number of docs | 508 | 109 | 109 | 500 | 10 | 108 |
| Word count | 107 812 | 25 495 | 25 495 | 119 649 | 42 921 | 505 311 |
| File size (kbytes) | 714 | 169 | 169 | 806 | 272 | 3405 |
| Normalized mutation | 482 | 179 | 287 | 1057 | 52 | 403 |
| Extracted mutation | Unknown | Unknown | Unknown | 1410 | 118 | 2311 |
| Publication period | Nov. 1969–Feb. 2006 | 1994–May. 2008 | 1994–May. 2008 | 2002–Apr. 2012 | Oct. 2005–Jan. 2011 | Sep. 2009–Apr. 2014 |
| Mutation Types | Sub | Sub | Sub, Del, Ins, SNP, FS | Sub, Del, Ins, Dup, InDel, SNP, FS | Sub, Del, Ins, FS | Sub, Del, Ins, InDel, SNP, FS |
| Unique Var | 451 | 172 | 237 | 871 | 50 | 275 |
| Var with genes | | 179 (100%) | 179 (75.53%) | | (Unknown) | 403 (100%) |
| Var with diseases | | 170 | 170 | | (Unknown) | 338 (83.87%) |
| Var with drugs | | | | | (Unknown) | 202 (50.12%) |
| Var with cell lines | | | | | (Unknown) | 182 (45.16%) |
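
The percentages in the relation rows appear to be the share of normalized variants that carry each relation. The sketch below recomputes them under that assumption; the function name `coverage` is illustrative and not part of the original work.

```python
def coverage(with_relation: int, total: int) -> float:
    """Percentage of normalized variants that have a given relation."""
    return round(100.0 * with_relation / total, 2)

# Counts copied from the table; the denominator is assumed to be the
# "Normalized mutation" row of the same corpus.
bronco_total = 403
bronco_relations = {"genes": 403, "diseases": 338, "drugs": 202, "cell lines": 182}

for relation, count in bronco_relations.items():
    print(f"BRONCO variants with {relation}: {coverage(count, bronco_total)}%")

# EMU (New): 179 variants with genes out of 237 unique variants.
print(f"EMU (New) variants with genes: {coverage(179, 237)}%")
```

Under this reading, the recomputed values agree with the figures reported in the table (83.87%, 50.12%, 45.16% and 75.53%).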
Precision, recall and F1-score of the methods tested on four corpora.
| Method | Measure | MF Orig. | MF P.P. | MF Imprv. (%) | EMU Orig. | EMU P.P. | EMU Imprv. (%) | tmVar Orig. | tmVar P.P. | tmVar Imprv. (%) | BRONCO Orig. | BRONCO P.P. | BRONCO Imprv. (%) | Average Orig. | Average P.P. | Average Imprv. (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MutationFinder | Precision | 0.985 | 0.985 | 0.00 | 0.995 | 0.995 | 0.00 | 0.985 | 0.985 | 0.00 | 0.934 | 2.08 | 0.974 | 0.41 | ||
| Recall | 0.805 | 0.801 | −0.50 | 0.747 | 0.743 | −0.54 | 0.294 | 0.292 | −0.68 | 0.876 | 0.876 | 0.00 | 0.681 | 0.678 | −0.44 | |
| F1-score | 0.886 | 0.883 | −0.34 | 0.854 | 0.851 | −0.35 | 0.453 | 0.451 | −0.44 | 0.904 | 1.01 | 0.772 | 0.772 | 0.00 | ||
| EMU | Precision | 0.977 | 0.982 | 0.51 | 0.956 | 0.973 | 1.78 | 0.845 | 0.932 | 0.773 | 0.847 | 9.57 | 0.888 | 0.934 | 5.18 | |
| Recall | 0.801 | 0.797 | −0.50 | 0.959 | 0.955 | −0.42 | 0.699 | 0.711 | 1.72 | 0.903 | 0.906 | 0.841 | 0.842 | |||
| F1-score | 0.880 | 0.880 | 0.00 | 0.957 | 0.964 | 0.73 | 0.765 | 0.807 | 0.833 | 0.876 | 5.16 | 0.859 | 0.882 | 2.68 | ||
| tmVar | Precision | 0.985 | 0.988 | 0.30 | 0.988 | 0.988 | 0.00 | 0.955 | 0.972 | 1.78 | 0.767 | 0.881 | 14.86 | 0.924 | 0.957 | 3.57 |
| Recall | 0.842 | 0.838 | −0.48 | 0.952 | 0.944 | −0.84 | 0.937 | 0.930 | −0.75 | 0.938 | 0.933 | −0.53 | 0.917 | 0.911 | −0.65 | |
| F1-score | −0.11 | 0.970 | 0.966 | −0.41 | 0.53 | 0.844 | 0.906 | 7.35 | 1.64 | |||||||
| Simple Merging | Precision | 0.956 | 0.962 | 0.950 | 0.967 | 0.852 | 0.930 | 9.15 | 0.684 | 0.795 | 0.861 | 0.914 | ||||
| Recall | −0.59 | −0.70 | −0.63 | −0.21 | −0.43 | |||||||||||
| F1-score | 0.900 | 0.901 | 0.897 | 0.936 | 4.35 | 0.792 | 0.862 | 0.891 | 0.919 | |||||||
| Majority voting | Precision | 0.00 | 0.00 | 0.30 | 0.867 | 8.54 | 0.962 | 1.98 | ||||||||
| Recall | 0.809 | 0.803 | −0.74 | 0.926 | 0.918 | −0.86 | 0.694 | 0.708 | 0.906 | 0.906 | 0.00 | 0.834 | 0.834 | 0.00 | ||
| F1-score | 0.892 | 0.889 | −0.34 | 0.960 | 0.956 | −0.42 | 0.816 | 0.826 | 1.23 | 0.886 | 4.18 | 0.889 | 0.898 | 1.01 | ||
The highest scores are highlighted in bold in each corpus. Orig., original results; P.P., post-processing results; Imprv., improvement of the post-processing results over original results.
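
The table lists two ensemble rows, "Simple Merging" and "Majority voting", without defining them here. The following is a minimal sketch that assumes simple merging is the union of all tools' predictions and majority voting keeps a mention only when more than half of the tools report it. The function names and the toy mentions are hypothetical and are not taken from BRONCO or from any of the three tools.

```python
from collections import Counter

def simple_merge(*predictions):
    """Assumed definition: union of the mentions reported by any tool."""
    merged = set()
    for pred in predictions:
        merged |= set(pred)
    return merged

def majority_vote(*predictions):
    """Assumed definition: mentions reported by more than half of the tools."""
    counts = Counter(m for pred in predictions for m in set(pred))
    threshold = len(predictions) / 2
    return {m for m, c in counts.items() if c > threshold}

def precision_recall_f1(predicted, gold):
    """Standard set-based precision, recall and F1-score."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy data (hypothetical): gold-standard variants and three tools' outputs.
gold = {"V600E", "T790M", "L858R", "G12D"}
tool_a = {"V600E", "T790M"}
tool_b = {"V600E", "L858R", "H1975"}  # "H1975" stands in for a false positive
tool_c = {"V600E", "T790M", "L858R", "G12D", "A549"}  # "A549" likewise

merged = simple_merge(tool_a, tool_b, tool_c)
voted = majority_vote(tool_a, tool_b, tool_c)

print("simple merging :", precision_recall_f1(merged, gold))
print("majority voting:", precision_recall_f1(voted, gold))
```

In this toy case merging reaches full recall at lower precision, while voting does the reverse. That trade-off is what these definitions would be expected to produce in general, but the example says nothing about the actual scores in the table.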
Figure 3. The post-processing module’s performance on the BRONCO corpus.
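
The "Imprv. (%)" columns are consistent with a relative change of the post-processed score over the original score. The sketch below checks this reading against the tmVar results on BRONCO reported in the table; the helper names are illustrative, and the relative-change formula is an assumption inferred from the numbers rather than a definition stated in the text.

```python
def relative_improvement(orig: float, post: float) -> float:
    """Relative change of the post-processed score over the original, in percent."""
    return (post - orig) / orig * 100.0

def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# tmVar on BRONCO, values copied from the table.
p_orig, p_post = 0.767, 0.881
r_orig, r_post = 0.938, 0.933

print(round(relative_improvement(p_orig, p_post), 2))  # 14.86, as in the table
print(round(f1_score(p_orig, r_orig), 3))              # 0.844, as in the table
print(round(f1_score(p_post, r_post), 3))              # 0.906, as in the table
```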
Running time of the three methods
| Method | MF (508 abstracts) | EMU (109 abstracts) | tmVar (500 abstracts) | BRONCO (108 full-texts) |
|---|---|---|---|---|
| MutationFinder | 28 s | 7 s | 31 s | 139 s |
| EMU | 75 s | 17 s | 77 s | 201 s |
| tmVar | 61 s | 23 s | 70 s | 952 s |
Figure 4. Examples of true and false positives identified by the three methods.
Results of EMU’s gene mapping module on the BRONCO data set
| | All variants | Only protein variants | NER was successful |
|---|---|---|---|
| Number of variants | 333 | 308 | 232 |
| Gene numbers before filtering | 4467 | 4416 | 3450 |
| Gene numbers after filtering | 681 | 647 | 524 |
| Precision | 0.283 | 0.287 | 0.443 |
| Recall | 0.580 | 0.604 | 0.832 |