Antonio Jimeno Yepes, Karin Verspoor.
Abstract
As the cost of genomic sequencing continues to fall, the amount of data being collected and studied for the purpose of understanding the genetic basis of disease is increasing dramatically. Much of the source information relevant to such efforts is available only from unstructured sources such as the scientific literature, and significant resources are expended in manually curating and structuring the information in the literature. As such, there have been a number of systems developed to target automatic extraction of mutations and other genetic variation from the literature using text mining tools. We have performed a broad survey of the existing publicly available tools for extraction of genetic variants from the scientific literature. We consider not just one tool but a number of different tools, individually and in combination, and apply the tools in two scenarios. First, they are compared in an intrinsic evaluation context, where the tools are tested for their ability to identify specific mentions of genetic variants in a corpus of manually annotated papers, the Variome corpus. Second, they are compared in an extrinsic evaluation context based on our previous study of text mining support for curation of the COSMIC and InSiGHT databases. Our results demonstrate that no single tool covers the full range of genetic variants mentioned in the literature. Rather, several tools have complementary coverage and can be used together effectively. In the intrinsic evaluation on the Variome corpus, the combined performance is above 0.95 in F-measure, while in the extrinsic evaluation the combined recall performance is above 0.71 for COSMIC and above 0.62 for InSiGHT, a substantial improvement over the performance of any individual tool. Based on the analysis of these results, we suggest several directions for the improvement of text mining tools for genetic variant extraction from the literature.
Year: 2014 PMID: 25285203 PMCID: PMC4176422 DOI: 10.12688/f1000research.3-18.v2
Source DB: PubMed Journal: F1000Res ISSN: 2046-1402
Mutation extraction systems evaluation on existing corpora.
The precision, recall and F1 values are taken from previously published studies or, as for SETH, from the tool website. A dash (-) indicates that no results are available. The tools are MutationFinder (MF), OpenMutationMiner (OMM), Extractor of Mutations (EMU), tmVar and SNP Extraction Tool for Human Variations (SETH). The corpora used for evaluation are the MutationFinder corpora (MF), the OpenMutationMiner corpora (OMM), the EMU Prostate Cancer set (PCa), the tmVar corpora, the OSIRIS corpora and the corpora made available by Thomas et al. [24] (Thomas).
| Tool | Performance | MF | OMM | PCa | tmVar | Osiris | Thomas |
|---|---|---|---|---|---|---|---|
| MF | P | 0.98 | - | - | - | - | - |
| OMM | P | - | 0.99 | - | - | - | - |
| EMU | P | 0.99 | - | 0.92 | - | - | - |
| tmVar | P | 0.99 | - | - | 0.91 | - | - |
| SETH | P | 0.97 | - | - | 0.94 | 0.98 | 0.95 |
Count of supplementary file types.
Each row denotes one file type. For COSMIC, the number of files and the number of articles (denoted by PMIDs) are shown for each type. For the InSiGHT database, there is only one supplementary file, in MS Word format, linked to one article.
| Set | COSMIC files | COSMIC PMIDs | InSiGHT |
|---|---|---|---|
| MS Word | 176 | 87 | 1 |
| Total | 505 | 138 | 1 |
Variant extraction results on the Variome corpus.
Results for exact and partial matching are presented. Each row shows the performance of each method in terms of true positives (TP), false negatives (FN), false positives (FP), Precision, Recall and F1 measure (F1). The tools are Extractor of Mutations (EMU), OpenMutationMiner (OMM), MutationFinder (MF), tmVar and SNP Extraction Tool for Human Variations (SETH), together with their combination (selecting either the longest span, Combined_longest, or the shortest span, Combined_shortest).
| Exact | TP | FN | FP | Precision | Recall | F1 | Char Overlap (%) |
|---|---|---|---|---|---|---|---|
| EMU | 66 | 52 | 25 | 0.7253 | 0.5593 | 0.6316 | 100.00 |

| Partial | TP | FN | FP | Precision | Recall | F1 | Char Overlap (%) |
|---|---|---|---|---|---|---|---|
| EMU | 90 | 28 | 3 | 0.9677 | 0.7627 | 0.8531 | 82.08 |
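
The Precision, Recall and F1 columns follow from the TP, FN and FP counts by the standard definitions. The sketch below recomputes the EMU rows as a sanity check; the function name `prf` is purely illustrative and not part of any of the evaluated tools.

```python
def prf(tp: int, fn: int, fp: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# EMU counts taken from the exact and partial matching rows.
exact = prf(tp=66, fn=52, fp=25)
partial = prf(tp=90, fn=28, fp=3)

print([round(x, 4) for x in exact])    # [0.7253, 0.5593, 0.6316]
print([round(x, 4) for x in partial])  # [0.9677, 0.7627, 0.8531]
```

Both rows reproduce the reported values to four decimal places.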
DNA variant extraction recall result on the Variome corpus.
Results for exact and partial matching are presented. Each row shows the performance of each method in terms of true positives (TP), false negatives (FN) and Recall. The tools are Extractor of Mutations (EMU), OpenMutationMiner (OMM), MutationFinder (MF), tmVar and SNP Extraction Tool for Human Variations (SETH).
| Exact | TP | FN | Recall | Char Overlap (%) |
|---|---|---|---|---|
| EMU | 20 | 32 | 0.3846 | 100.00 |

| Partial | TP | FN | Recall | Char Overlap (%) |
|---|---|---|---|---|
| EMU | 33 | 17 | 0.6600 | 68.60 |
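
The difference between exact and partial matching can be illustrated on character-offset spans. The following is a minimal sketch under stated assumptions, not the evaluation code behind these tables: spans are `(start, end)` offsets with an exclusive end, a partial match is taken to be any overlap of at least one character, and the character-overlap percentage is assumed to be the share of gold characters covered by the prediction. All function names and the toy spans are hypothetical.

```python
Span = tuple[int, int]  # (start, end) character offsets, end exclusive


def overlap_chars(a: Span, b: Span) -> int:
    """Number of characters shared by two spans (0 if disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def recall_counts(gold: list[Span], pred: list[Span], exact: bool):
    """Return (TP, FN, recall) computed over the gold spans."""
    if exact:
        tp = sum(1 for g in gold if g in pred)
    else:
        tp = sum(1 for g in gold if any(overlap_chars(g, p) for p in pred))
    fn = len(gold) - tp
    recall = tp / len(gold) if gold else 0.0
    return tp, fn, recall


def char_overlap_pct(gold_span: Span, pred_span: Span) -> float:
    """Assumed definition: percentage of gold characters covered."""
    length = gold_span[1] - gold_span[0]
    return 100.0 * overlap_chars(gold_span, pred_span) / length


# Toy data: one exact hit, one partial hit, one miss.
gold = [(0, 8), (20, 27), (40, 46)]
pred = [(0, 8), (22, 27), (60, 66)]

print(recall_counts(gold, pred, exact=True))   # TP=1, FN=2
print(recall_counts(gold, pred, exact=False))  # TP=2, FN=1
```

Under this reading, an exact match always has 100% character overlap, which is consistent with the Exact rows, while partial matches lower the average.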
Protein extraction recall result on the Variome corpus.
Results for exact and partial matching are presented. Each row shows the performance of each method in terms of true positives (TP), false negatives (FN) and Recall. The tools are Extractor of Mutations (EMU), OpenMutationMiner (OMM), MutationFinder (MF), tmVar and SNP Extraction Tool for Human Variations (SETH).
| Exact | TP | FN | Recall | Char Overlap (%) |
|---|---|---|---|---|
| EMU | 46 | 20 | 0.697 | 100.00 |

| Partial | TP | FN | Recall | Char Overlap (%) |
|---|---|---|---|---|
| EMU | 55 | 11 | 0.8333 | 95.12 |
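
The Variome evaluation also reports two merged runs, Combined_longest and Combined_shortest, which differ in which span is kept when several tools annotate overlapping text. The source names the two strategies but does not describe the merging procedure, so the greedy selection below is only one plausible sketch; `combine` and the toy spans are hypothetical.

```python
Span = tuple[int, int]  # (start, end) character offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share at least one character."""
    return min(a[1], b[1]) > max(a[0], b[0])


def combine(spans: list[Span], prefer: str = "longest") -> list[Span]:
    """Greedily keep non-overlapping spans, preferring long or short ones."""
    ordered = sorted(
        set(spans),
        key=lambda s: s[1] - s[0],
        reverse=(prefer == "longest"),
    )
    kept: list[Span] = []
    for span in ordered:
        if not any(overlaps(span, k) for k in kept):
            kept.append(span)
    return sorted(kept)


# Toy output of three tools on the same sentence.
tool_a = [(0, 8)]
tool_b = [(2, 8), (20, 30)]
tool_c = [(20, 27)]
merged = tool_a + tool_b + tool_c

print(combine(merged, prefer="longest"))   # [(0, 8), (20, 30)]
print(combine(merged, prefer="shortest"))  # [(2, 8), (20, 27)]
```

Preferring the longest span tends to favour partial-match recall, while the shortest span can be closer to tightly annotated gold mentions; which one scores better depends on the annotation guidelines of the corpus.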
DNA variant extraction error types and their frequency are presented for each system.
Only the tools that perform DNA variant extraction are considered. The tools are Extractor of Mutations (EMU), tmVar and SNP Extraction Tool for Human Variations (SETH).
| FP | Variant type | Count |
|---|---|---|
| EMU | DELETION | 1 |

| FN | Variant type | Count |
|---|---|---|
| EMU | SUBSTITUTION | 12 |
Protein variant extraction error types and their frequency are presented for each system.
The tools are Extractor of Mutations (EMU), OpenMutationMiner (OMM), MutationFinder (MF), tmVar and SNP Extraction Tool for Human Variations (SETH).
| FP | Variant type | Count |
|---|---|---|
| EMU | SUBSTITUTION | 1 |

| FN | Variant type | Count |
|---|---|---|
| EMU | SUBSTITUTION | 11 |
COSMIC variant extraction coverage result.
The table shows the number of variants in the reference set (Total), the number of matched variants by the mutation extraction tool (Matched), the proportion of matched variants (Recall), the number of variants matched when the gene is not considered (M NG) and the proportion of matched variants when the gene is not considered (Rec NG). The data sets considered are MEDLINE abstracts (medline), Open Access PMC articles (pmc.ft), PDF articles when no Open Access PMC articles are available (pdf), PDF representation for all the articles (pdf.all), tables available from the Open Access PMC Articles’ XML (table), supplementary material (sup) and the combination from all the sources (all). The tools are Extractor of Mutations (EMU), OpenMutationMiner (OMM), MutationFinder (MF), tmVar and SNP Extraction Tool for Human Variations (SETH). The row with tool value as All indicates the result when the variants extracted by all the tools are merged.
| Data set | Tool | Total | Matched | Recall | M NG | Rec NG |
|---|---|---|---|---|---|---|
| medline | EMU | 33814 | 146 | 0.0043 | 157 | 0.0046 |
| medline | All | 33814 | 156 | 0.0046 | 169 | 0.0050 |
| pmc.ft | EMU | 33814 | 726 | 0.0215 | 758 | 0.0224 |
| pmc.ft | All | 33814 | 814 | 0.0241 | 853 | 0.0252 |
| pdf | EMU | 33814 | 34 | 0.0010 | 47 | 0.0014 |
| pdf | All | 33814 | 34 | 0.0010 | 47 | 0.0014 |
| pdf.all | EMU | 33814 | 1094 | 0.0324 | 1114 | 0.0329 |
| pdf.all | All | 33814 | 1304 | 0.0386 | 1327 | 0.0392 |
| table | EMU | 33814 | 580 | 0.0172 | 681 | 0.0201 |
| table | All | 33814 | 694 | 0.0205 | 831 | 0.0246 |
| sup | EMU | 33814 | 19177 | 0.5671 | 19217 | 0.5683 |
| sup | All | 33814 | 22756 | 0.6730 | 22829 | 0.6751 |
| all | EMU | 33814 | 20203 | 0.5975 | 20284 | 0.5999 |
| all | All | 33814 | 23859 | 0.7056 | 23969 | 0.7088 |
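
The coverage figures above are recall values of the form Matched / Total, computed once with the gene required to match and once with the gene ignored (M NG, Rec NG); the All rows merge the output of every tool. The sketch below illustrates that bookkeeping on toy data. It assumes variants are compared as `(article, gene, variant)` triples, which the source does not spell out, and the article identifiers, `coverage` function and tool sets are invented for illustration.

```python
Variant = tuple[str, str, str]  # (article id, gene symbol, variant string)


def coverage(reference: set[Variant], extracted: set[Variant],
             use_gene: bool = True) -> tuple[int, float]:
    """Return (matched, recall) of the reference set against tool output."""
    def key(v: Variant):
        # Drop the gene field when gene matching is relaxed.
        return v if use_gene else (v[0], v[2])

    found = {key(v) for v in extracted}
    matched = sum(1 for v in reference if key(v) in found)
    return matched, matched / len(reference)


# Toy reference set and tool outputs (article ids are placeholders).
reference = {
    ("A1", "KRAS", "p.G12D"),
    ("A1", "TP53", "p.R175H"),
    ("A2", "BRAF", "p.V600E"),
    ("A3", "EGFR", "p.L858R"),
}
tool_a = {("A1", "KRAS", "p.G12D"), ("A2", "", "p.V600E")}  # gene not found
tool_b = {("A1", "TP53", "p.R175H")}
merged = tool_a | tool_b  # corresponds to the "All" rows

print(coverage(reference, tool_a))                  # (1, 0.25)
print(coverage(reference, tool_a, use_gene=False))  # (2, 0.5)
print(coverage(reference, merged))                  # (2, 0.5)
print(coverage(reference, merged, use_gene=False))  # (3, 0.75)

# Check against the reported "all / All" row.
print(round(23859 / 33814, 4))  # 0.7056
```

Relaxing the gene requirement can only add matches, which is why M NG is never smaller than Matched in the table.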
InSiGHT variant extraction coverage result.
The table shows the number of variants in the reference set (Total), the number of matched variants by the mutation extraction tool (Matched) and the proportion of matched variants (Recall). When relaxing the gene matching (M NG), the results do not change, thus this data is not shown. The data sets considered are MEDLINE abstracts (medline), Open Access PMC articles (pmc.ft), PDF articles when no Open Access PMC articles are available (pdf), PDF representation for all the articles (pdf.all), tables available from the Open Access PMC Articles’ XML (table), supplementary material (sup) and the combination from all the sources (all). The tools are Extractor of Mutations (EMU), OpenMutationMiner (OMM), MutationFinder (MF), tmVar and SNP Extraction Tool for Human Variations (SETH). The row with tool value as All indicates the result when the variants extracted by all the tools are merged.
| Data set | Tool | Total | Matched | Recall |
|---|---|---|---|---|
| medline | EMU | 252 | 1 | 0.0040 |
| medline | All | 252 | 5 | 0.0198 |
| pmc.ft | EMU | 252 | 23 | 0.0913 |
| pmc.ft | All | 252 | 26 | 0.1032 |
| pdf | EMU | 252 | 7 | 0.0278 |
| pdf | All | 252 | 8 | 0.0317 |
| pdf.all | EMU | 252 | 41 | 0.1627 |
| pdf.all | All | 252 | 85 | 0.3373 |
| table | EMU | 252 | 39 | 0.1548 |
| table | All | 252 | 74 | 0.2937 |
| sup | EMU | 252 | 88 | 0.3492 |
| sup | All | 252 | 92 | 0.3651 |
| all | EMU | 252 | 127 | 0.5040 |
| all | All | 252 | 157 | 0.6230 |
COSMIC variant extraction coverage result when only articles with variants in the reference set as well as at least one result identified by the information extraction tool are considered.
The table shows the number of common articles (PMIDs), the number of variants in the reference set (Total), the number of matched variants by the mutation extraction tool (Matched), the proportion of matched variants (Recall), the number of variants matched when the gene is not considered (M NG) and the proportion of matched variants when the gene is not considered (Rec NG). The data sets considered are MEDLINE abstracts (medline), Open Access PMC articles (pmc.ft), PDF articles when no Open Access PMC articles are available (pdf), PDF representation for all the articles (pdf.all), tables available from the Open Access PMC Articles’ XML (table), supplementary material (sup) and the combination from all the sources (all). The tools are Extractor of Mutations (EMU), OpenMutationMiner (OMM), MutationFinder (MF), tmVar and SNP Extraction Tool for Human Variations (SETH). The row with tool value as All indicates the result when the variants extracted by all the tools are merged.
| Data set | Tool | PMIDs | Total | Matched | Recall | M NG | Rec NG |
|---|---|---|---|---|---|---|---|
| medline | EMU | 128 | 627 | 146 | 0.2329 | 157 | 0.2504 |
| medline | All | 137 | 676 | 156 | 0.2308 | 169 | 0.2500 |
| pmc.ft | EMU | 339 | 31742 | 726 | 0.0229 | 758 | 0.0239 |
| pmc.ft | All | 351 | 31848 | 814 | 0.0256 | 853 | 0.0268 |
| pdf | EMU | 61 | 474 | 34 | 0.0717 | 47 | 0.0992 |
| pdf | All | 64 | 505 | 34 | 0.0673 | 47 | 0.0931 |
| pdf.all | EMU | 439 | 33134 | 1094 | 0.0330 | 1114 | 0.0336 |
| pdf.all | All | 446 | 33295 | 1304 | 0.0392 | 1327 | 0.0399 |
| table | EMU | 197 | 1946 | 580 | 0.2980 | 681 | 0.3499 |
| table | All | 211 | 2019 | 694 | 0.3437 | 831 | 0.4116 |
| sup | EMU | 77 | 27888 | 19177 | 0.6876 | 19217 | 0.6891 |
| sup | All | 106 | 31564 | 22756 | 0.7209 | 22829 | 0.7233 |
| all | EMU | 450 | 33409 | 20203 | 0.6047 | 20284 | 0.6071 |
| all | All | 458 | 33676 | 23859 | 0.7085 | 23969 | 0.7118 |
InSiGHT variant extraction coverage result when only articles with variants in the reference set as well as at least one result identified by the information extraction tool are considered.
The table shows the number of common articles (PMIDs), the number of variants in the reference set (Total), the number of matched variants by the mutation extraction tool (Matched) and the proportion of matched variants (Recall). When relaxing the gene matching (M NG), the results do not change, thus this data is not shown. The data sets considered are MEDLINE abstracts (medline), Open Access PMC articles (pmc.ft), PDF articles when no Open Access PMC articles are available (pdf), PDF representation for all the articles (pdf.all), tables available from the Open Access PMC Articles’ XML (table), supplementary material (sup) and the combination from all the sources (all). The tools are Extractor of Mutations (EMU), OpenMutationMiner (OMM), MutationFinder (MF), tmVar and SNP Extraction Tool for Human Variations (SETH). The row with tool value as All indicates the result when the variants extracted by all the tools are merged.
| Data set | Tool | PMIDs | Total | Matched | Recall |
|---|---|---|---|---|---|
| medline | EMU | 2 | 2 | 1 | 0.5000 |
| medline | All | 3 | 17 | 5 | 0.2941 |
| pmc.ft | EMU | 8 | 179 | 23 | 0.1285 |
| pmc.ft | All | 9 | 234 | 26 | 0.1111 |
| pdf | EMU | 3 | 18 | 7 | 0.3889 |
| pdf | All | 3 | 18 | 8 | 0.4444 |
| pdf.all | EMU | 11 | 251 | 41 | 0.1633 |
| pdf.all | All | 11 | 251 | 85 | 0.3386 |
| table | EMU | 4 | 197 | 39 | 0.1980 |
| table | All | 4 | 197 | 74 | 0.3756 |
| sup | EMU | 1 | 103 | 88 | 0.8544 |
| sup | All | 1 | 103 | 92 | 0.8932 |
| all | EMU | 12 | 252 | 127 | 0.5040 |
| all | All | 12 | 252 | 157 | 0.6230 |
Overlap of mutations extracted by each tool on the COSMIC database from all article data sources.
The tools are Extractor of Mutations (EMU), OpenMutationMiner (OMM), MutationFinder (MF), tmVar and SNP Extraction Tool for Human Variations (SETH).
| COSMIC | EMU | OMM | MF | SETH | tmVar |
|---|---|---|---|---|---|
| EMU | - | 0.7939 | 0.5114 | 0.7546 | 0.6962 |
Overlap of mutations extracted by each tool on the InSiGHT database from all article data sources.
The tools are Extractor of Mutations (EMU), OpenMutationMiner (OMM), MutationFinder (MF), tmVar and SNP Extraction Tool for Human Variations (SETH).
| InSiGHT | EMU | OMM | MF | SETH | tmVar |
|---|---|---|---|---|---|
| EMU | - | 0.1463 | 0.1367 | 0.3300 | 0.7249 |
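
The two overlap tables compare the sets of mutations extracted by pairs of tools. The captions do not state how the overlap ratio is normalised, so the sketch below uses one plausible, explicitly assumed reading: the fraction of the row tool's mutations that the column tool also extracted. The function name and the toy sets are hypothetical.

```python
from itertools import permutations


def directional_overlap(row: set[str], col: set[str]) -> float:
    """Assumed definition: share of `row` mutations also found in `col`."""
    return len(row & col) / len(row) if row else 0.0


# Toy sets of extracted mutations per tool.
tools = {
    "EMU": {"p.G12D", "p.V600E", "c.35G>A", "p.R175H"},
    "SETH": {"p.G12D", "c.35G>A"},
    "MF": {"p.G12D", "p.V600E", "p.L858R"},
}

matrix = {
    (a, b): directional_overlap(tools[a], tools[b])
    for a, b in permutations(tools, 2)
}

for (a, b), value in sorted(matrix.items()):
    print(f"{a:>5} vs {b:<5} {value:.4f}")
```

Under this definition the matrix is not symmetric: a tool with few, widely shared extractions scores high against a broader tool, but not the other way round.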
COSMIC and InSiGHT variant extraction coverage combining OMM and SETH.
The data sets considered are MEDLINE abstracts (medline), Open Access PMC articles (pmc.ft), PDF articles when no Open Access PMC articles are available (pdf), tables available from the Open Access PMC Articles’ XML (table), supplementary material (sup) and the combination from all the sources (all).
| Database | Data set | PMIDs | Total | Matched | Recall |
|---|---|---|---|---|---|
| COSMIC | medline | 129 | 33814 | 145 | 0.0043 |
| COSMIC | all | 400 | 33814 | 23544 | 0.6963 |
| InSiGHT | medline | 3 | 252 | 5 | 0.0198 |
| InSiGHT | all | 9 | 252 | 61 | 0.2421 |
COSMIC and InSiGHT variant extraction coverage combining OMM, SETH and EMU.
The data sets considered are MEDLINE abstracts (medline), Open Access PMC articles (pmc.ft), PDF articles when no Open Access PMC articles are available (pdf), tables available from the Open Access PMC Articles’ XML (table), supplementary material (sup) and the combination from all the sources (all).
| Database | Data set | PMIDs | Total | Matched | Recall |
|---|---|---|---|---|---|
| COSMIC | medline | 149 | 33814 | 151 | 0.0045 |
| COSMIC | all | 511 | 33814 | 23826 | 0.7046 |
| InSiGHT | medline | 3 | 252 | 5 | 0.0198 |
| InSiGHT | all | 12 | 252 | 152 | 0.6032 |
COSMIC database variants, taking into account the DNA variant and the impact on the gene product.
Variants have been annotated using SETH, thus the variant types delivered by SETH are considered.
| Frequency | DNA variant type | Protein variant type |
|---|---|---|
| 30080 | SUBSTITUTION | SUBSTITUTION |
InSiGHT database variants, taking into account the DNA variant and the impact on the gene product.
Variants have been annotated using SETH, thus the variant types delivered by SETH are considered.
| Frequency | DNA variant type | Protein variant type |
|---|---|---|
| 134 | SUBSTITUTION | SUBSTITUTION |
COSMIC missing variants by the combination of variant extraction tools grouped by type, taking into account the DNA variant and the impact on the gene product.
Missing variants have been annotated using SETH, thus the variant types delivered by SETH are considered.
| Frequency | DNA variant type | Protein variant type |
|---|---|---|
| 7163 | SUBSTITUTION | SUBSTITUTION |
InSiGHT missing variants by the combination of variant extraction tools grouped by type, taking into account the DNA variant and the impact on the gene product.
Missing variants have been annotated using SETH, thus the variant types delivered by SETH are considered.
| Frequency | DNA variant type | Protein variant type |
|---|---|---|
| 18 | DELETION | FRAMESHIFT |
DNA variants annotated by each mutation extraction tool, using all the sources from the articles, grouped by type.
The type of each variant has been annotated using SETH, thus the variant types delivered by SETH are considered. The tools are Extractor of Mutations (EMU), OpenMutationMiner (OMM), MutationFinder (MF), tmVar and SNP Extraction Tool for Human Variations (SETH). A large number of DNA mutations are annotated by EMU. 1887448 of the COSMIC variants are substitutions derived from the supplementary material of a single article, PMID:22622578. Most of these substitutions are provided in terms of chromosome location rather than relative to the gene.
| Tool | Type | Frequency (COSMIC) | Frequency (InSiGHT) |
|---|---|---|---|
| EMU | All | 1914828 | 182 |
| OMM | - | - | - |
| MF | - | - | - |
| SETH | All | 9029 | 58 |
| tmVar | All | 1864 | 135 |
Protein variants annotated by each mutation extraction tool, using all the sources from the articles, grouped by type.
The type of each variant has been annotated using SETH, thus the variant types delivered by SETH are considered. The tools are Extractor of Mutations (EMU), OpenMutationMiner (OMM), MutationFinder (MF), tmVar and SNP Extraction Tool for Human Variations (SETH).
| Tool | Type | Frequency (COSMIC) | Frequency (InSiGHT) |
|---|---|---|---|
| EMU | All | 35871 | 102 |
| OMM | SUBSTITUTION | 29157 | 164 |
| MF | SUBSTITUTION | 4836 | 117 |
| SETH | All | 28862 | 42 |
| tmVar | All | 16461 | 54 |