Antonio Jimeno Yepes, Karin Verspoor.
Abstract
A major focus of modern biological research is the understanding of how genomic variation relates to disease. Although there are significant ongoing efforts to capture this understanding in curated resources, much of the information remains locked in unstructured sources, in particular, the scientific literature. Thus, there have been several text mining systems developed to target extraction of mutations and other genetic variation from the literature. We have performed the first study of the use of text mining for the recovery of genetic variants curated directly from the literature. We consider two curated databases, COSMIC (Catalogue Of Somatic Mutations In Cancer) and InSiGHT (International Society for Gastro-intestinal Hereditary Tumours), that contain explicit links to the source literature for each included mutation. Our analysis shows that the recall of the mutations catalogued in the databases using a text mining tool is very low, despite the well-established good performance of the tool and even when the full text of the associated article is available for processing. We demonstrate that this discrepancy can be explained by considering the supplementary material linked to the published articles, not previously considered by text mining tools. Although it is anecdotally known that supplementary material contains 'all of the information', and some researchers have speculated about the role of supplementary material (Schenck et al. Extraction of genetic mutations associated with cancer from public literature. J Health Med Inform 2012;S2:2.), our analysis substantiates the significant extent to which this material is critical. Our results highlight the need for literature mining tools to consider not only the narrative content of a publication but also the full set of material related to a publication.
Year: 2014 PMID: 24520105 PMCID: PMC3920087 DOI: 10.1093/database/bau003
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
COSMIC and InSiGHT data set statistics. Each row reflects figures for cited articles (PMIDs) in the reference database
| Set | PMIDs | Mut Art | Mut Cnt | Avg Mut | SD |
|---|---|---|---|---|---|
| COSMIC (reference) | 9950 | 7898 | 198 864 | 25.18 | 521.18 |
| InSiGHT (reference) | 809 | 809 | 7022 | 8.68 | 18.55 |
Mut Art = number of articles associated with at least one mutation; Mut Cnt = number of mutations associated with those articles; Avg Mut = average number of mutations per article; SD = standard deviation of Mut Cnt.
Counts of mutations identified by EMU, by corpus and by mutation type
| Set | COSMIC Abstracts | COSMIC Full text | InSiGHT Abstracts | InSiGHT Full text |
|---|---|---|---|---|
| Papers with mutation mentions | 2486 | 2395 | 235 | 165 |
| DNA | 139 | 623 | 165 | 97 |
| Genome | 21 | 10 | 3 | 5 |
| Protein | 3266 | 18 015 | 283 | 1071 |
| Protein; DNA | 786 | 3229 | 137 | 269 |
| Protein; DNA; RNA | 118 | 517 | 32 | 132 |
| RSID | 55 | 275 | 14 | 92 |
| All | 4267 | 22 575 | 602 | 1593 |
| Average | 1.76 | 9.43 | 2.67 | 9.65 |
| SD | 1.44 | 16.94 | 2.96 | 16.87 |
| Mutations with no gene | 542 | 48 | 110 | 1 |
| Papers w/HGVS mutation + gene | 2373 | 2251 | 195 | 150 |
| HGVS mutation + gene count | 8960 | 57 369 | 1649 | 12 908 |
| Average | 3.78 | 71.66 | 8.45 | 86.05 |
| SD | 4.61 | 148.36 | 17.11 | 225.52 |
The table shows the statistics after normalizing the mutations to HGVS and assigning one related gene per mutation.
Recall of COSMIC and InSiGHT curated mutations, evaluated over the full reference database (Recall) and over the articles common to each subcorpus and the reference database (Cmn art; Recall common), with and without relaxing the gene match (NG = no gene; Recall NG, Recall CmnNG)
| Set | Cmn art | Match mutation | Recall | Recall NG | Mutations common | Recall common | Recall CmnNG |
|---|---|---|---|---|---|---|---|
| COSMIC Abs | 2200 | 1884 | 0.0095 | 0.0122 | 12 940 | 0.1456 | 0.1875 |
| COSMIC FT | 2071 | 3656 | 0.0184 | 0.0215 | 104 756 | 0.0349 | 0.0408 |
| COSMIC Abs + FT | 3738 | 4754 | 0.0239 | 0.0289 | 114 279 | 0.0416 | 0.0503 |
| InSiGHT Abs | 195 | 230 | 0.0328 | 0.0450 | 1233 | 0.1865 | 0.2562 |
| InSiGHT FT | 150 | 404 | 0.0575 | 0.0612 | 1626 | 0.2484 | 0.2644 |
| InSiGHT Abs + FT | 295 | 588 | 0.0837 | 0.0961 | 2657 | 0.2213 | 0.2540 |
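The four recall variants in this table can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the `Variant` type and the `recall` function are hypothetical names, and it assumes that the "common" articles are those PMIDs for which at least one mutation was extracted and that the NG variants compare only the PMID and mutation.

```python
from typing import Iterable, NamedTuple


class Variant(NamedTuple):
    """One curated or extracted mutation (hypothetical record layout)."""
    pmid: str
    gene: str
    mutation: str  # assumed to be already normalized to HGVS


def recall(reference: Iterable[Variant],
           extracted: Iterable[Variant],
           *,
           match_gene: bool = True,
           common_only: bool = False) -> float:
    """Fraction of reference variants that were also extracted.

    match_gene=False relaxes the match to (PMID, mutation), as in the NG columns.
    common_only=True restricts the reference to PMIDs that also appear in the
    extracted set (an assumption about how "common articles" are defined).
    """
    def key(v: Variant):
        return (v.pmid, v.gene, v.mutation) if match_gene else (v.pmid, v.mutation)

    extracted = list(extracted)
    found = {key(v) for v in extracted}
    ref = set(reference)
    if common_only:
        pmids = {v.pmid for v in extracted}
        ref = {v for v in ref if v.pmid in pmids}
    if not ref:
        return 0.0
    matched = sum(1 for v in ref if key(v) in found)
    return matched / len(ref)


# Toy data, not taken from COSMIC or InSiGHT.
reference = [
    Variant("1", "KRAS", "p.G12D"),
    Variant("1", "TP53", "p.R175H"),
    Variant("2", "BRAF", "p.V600E"),
    Variant("3", "MLH1", "c.1A>G"),
]
extracted = [
    Variant("1", "KRAS", "p.G12D"),
    Variant("1", "", "p.R175H"),  # mutation found, but no gene assigned
]
```

Under these assumptions the denominator is the same for the strict and the NG variant, which agrees with the table: for COSMIC Abs, 1884 / 198 864 ≈ 0.0095 and 1884 / 12 940 ≈ 0.1456.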
COSMIC and InSiGHT results on common articles
| Set | Art set | Match mutation | Recall common | Recall commonNG |
|---|---|---|---|---|
| COSMIC Abs | 822 | 806 | 0.2272 | 0.2740 |
| COSMIC FT | 822 | 1310 | 0.3692 | 0.4247 |
| InSiGHT Abs | 50 | 50 | 0.2475 | 0.3713 |
| InSiGHT FT | 50 | 90 | 0.4455 | 0.4950 |
MeSH headings denoting high-throughput papers
| MeSH heading | MeSH tree code |
|---|---|
| Computational biology | H01.158.273.180 |
| Genetic techniques | E05.393 |
| Genome | G05.360.340 |
| Molecular sequence data | L01.453.245.667 |
| Proteome | D12.776.817 |
| Proteomics | H01.181.122.738 |
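One way to apply these headings is to flag an article as high-throughput when any of its MeSH tree numbers equals one of the listed codes or falls beneath it in the tree. The sketch below assumes that descendants count as matches, which the table itself does not state; `HT_TREE_CODES` and `is_high_throughput` are illustrative names.

```python
from typing import Iterable

# Tree codes copied from the table above.
HT_TREE_CODES = (
    "H01.158.273.180",  # Computational biology
    "E05.393",          # Genetic techniques
    "G05.360.340",      # Genome
    "L01.453.245.667",  # Molecular sequence data
    "D12.776.817",      # Proteome
    "H01.181.122.738",  # Proteomics
)


def is_high_throughput(article_tree_codes: Iterable[str]) -> bool:
    """Return True if any tree number is a listed code or one of its descendants.

    The trailing "." in the prefix check keeps "E05.3930" from matching "E05.393".
    """
    return any(
        code == ht or code.startswith(ht + ".")
        for code in article_tree_codes
        for ht in HT_TREE_CODES
    )
```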
COSMIC high-throughput (HT)/non-high throughput (NHT) subset statistics
| Group | PMIDs | Count | Average mutation | SD | Mutation recall |
|---|---|---|---|---|---|
| COSMIC | 7898 | 198 864 | 25.18 | 521.27 | 100.00% |
| COSMIC-HT | 6266 | 187 367 | 29.90 | 584.82 | 94.22% |
| COSMIC-NHT | 1632 | 11 497 | 7.04 | 38.05 | 5.78% |
COSMIC High-Throughput (HT)/Non-High Throughput (NHT) subsets, mutation extraction results
| Set | Cmn art | Match mutation | Recall | Recall NG | Recall common | Recall CmnNG |
|---|---|---|---|---|---|---|
| HT abstract | 1650 | 1357 | 0.0072 | 0.0096 | 0.1209 | 0.1608 |
| HT full text | 1545 | 2719 | 0.0145 | 0.0172 | 0.0270 | 0.0319 |
| HT Abs + FT | 2608 | 3501 | 0.0187 | 0.0231 | 0.0320 | 0.0395 |
| NHT abstract | 550 | 530 | 0.0461 | 0.0543 | 0.3055 | 0.3597 |
| NHT full text | 526 | 937 | 0.0815 | 0.0915 | 0.2350 | 0.2639 |
| NHT Abs + FT | 841 | 1259 | 0.1090 | 0.1243 | 0.2538 | 0.2895 |
Descriptive statistics of mutations in COSMIC, grouped by the number of mutations per curated article
| Group | PMIDs | Count | Average mutation | SD | Mutation recall |
|---|---|---|---|---|---|
| COSMIC | 7898 | 198 864 | 25.18 | 521.27 | 100.00% |
| COSMIC ≤ 10 | 6549 | 20 491 | 3.13 | 2.50 | 10.30% |
| COSMIC ≤ 20 | 7339 | 31 814 | 4.33 | 4.30 | 16.00% |
| COSMIC ≤ 30 | 7589 | 38 015 | 5.01 | 5.61 | 19.12% |
| COSMIC > 30 | 309 | 160 849 | 520.55 | 2590.32 | 80.88% |
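The columns of this table can be derived from a map of PMID to curated mutation count. The sketch below is illustrative only: `group_stats` is a hypothetical helper, and it uses the population standard deviation, whereas the source does not say which estimator was used.

```python
from statistics import mean, pstdev
from typing import Dict, Optional


def group_stats(counts_per_article: Dict[str, int],
                max_per_article: Optional[int] = None) -> Dict[str, float]:
    """Summarize articles whose mutation count is at most max_per_article.

    "share" is the fraction of all curated mutations that the group covers,
    which corresponds to the "Mutation recall" column.
    """
    total = sum(counts_per_article.values())
    kept = [c for c in counts_per_article.values()
            if max_per_article is None or c <= max_per_article]
    return {
        "pmids": len(kept),
        "count": sum(kept),
        "average": mean(kept) if kept else 0.0,
        "sd": pstdev(kept) if kept else 0.0,  # assumption: population SD
        "share": sum(kept) / total if total else 0.0,
    }


# Toy data: two small articles and one very large one.
toy_counts = {"a": 2, "b": 4, "c": 94}
small = group_stats(toy_counts, max_per_article=10)
```

With the figures in the table, the share for COSMIC ≤ 10 is 20 491 / 198 864 ≈ 10.30%.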
COSMIC mutation extraction results at several frequency thresholds
| Set | Cmn art | Match mutation | Recall | Recall NG | Recall common | Recall CmnNG |
|---|---|---|---|---|---|---|
| C ≤ 10 Abstract | 2024 | 1700 | 0.0830 | 0.1054 | 0.3460 | 0.4394 |
| C ≤ 10 Full text | 1664 | 2172 | 0.1060 | 0.1230 | 0.4218 | 0.4894 |
| C ≤ 10 Abs + FT | 2941 | 3719 | 0.1551 | 0.1880 | 0.3838 | 0.4652 |
| C ≤ 20 Abstract | 2144 | 1832 | 0.0576 | 0.0743 | 0.2780 | 0.3586 |
| C ≤ 20 Full text | 1885 | 3449 | 0.0916 | 0.1084 | 0.3510 | 0.4153 |
| C ≤ 20 Abs + FT | 3233 | 3988 | 0.1253 | 0.1537 | 0.3205 | 0.3930 |
| C ≤ 30 Abstract | 2171 | 1858 | 0.0489 | 0.0631 | 0.2568 | 0.3317 |
| C ≤ 30 Full text | 1969 | 3299 | 0.0868 | 0.1014 | 0.3179 | 0.3712 |
| C ≤ 30 Abs + FT | 3330 | 4381 | 0.1152 | 0.1397 | 0.2955 | 0.3583 |
| C > 30 Abstract | 29 | 29 | 0.0002 | 0.0002 | 0.0051 | 0.0051 |
| C > 30 Full text | 102 | 357 | 0.0022 | 0.0026 | 0.0038 | 0.0045 |
| C > 30 Abs + FT | 119 | 373 | 0.0023 | 0.0027 | 0.0038 | 0.0044 |
Mutation extraction results from several mutation sources for the PMC articles
| Set | COSMIC matched | COSMIC recall | InSiGHT matched | InSiGHT recall |
|---|---|---|---|---|
| Abstracts | 140 | 0.0041 | 1 | 0.0040 |
| Full text | 694 | 0.0205 | 20 | 0.0794 |
| PDF (full text) | 23 | 0.0007 | 7 | 0.0278 |
| Tables | 466 | 0.0138 | 18 | 0.0714 |
| Full text + tables | 906 | 0.0268 | 37 | 0.1468 |
| Supplementary Material | 17 015 | 0.5059 | 88 | 0.3492 |
| All | 17 896 | 0.5292 | 115 | 0.4563 |
Results are compared with 252 mutations linked to PMC articles in InSiGHT and 33 814 mutations in COSMIC. PDF refers to a full-text publication only available as PDF.
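The "All" row is smaller than the sum of the per-source rows (for COSMIC, 140 + 694 + 23 + 466 + 17 015 = 18 338 versus 17 896), which is consistent with counting each matched mutation once across sources. A minimal sketch of that aggregation follows, assuming "All" is the set union of the per-source matches; `recall_by_source` and the `Triple` alias are hypothetical names.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]  # (PMID, gene, mutation)


def recall_by_source(reference: Set[Triple],
                     found_by_source: Dict[str, Set[Triple]]) -> Dict[str, float]:
    """Recall per source plus an "All" entry over the union of all sources."""
    if not reference:
        raise ValueError("reference set must not be empty")
    results: Dict[str, float] = {}
    union: Set[Triple] = set()
    for source, found in found_by_source.items():
        hits = found & reference  # only exact triple matches count
        union |= hits
        results[source] = len(hits) / len(reference)
    results["All"] = len(union) / len(reference)
    return results


# Toy data, not taken from the PMC set.
t1, t2, t3, t4 = (("1", "KRAS", "p.G12D"), ("1", "TP53", "p.R175H"),
                  ("2", "BRAF", "p.V600E"), ("3", "MLH1", "c.1A>G"))
toy = recall_by_source({t1, t2, t3, t4},
                       {"Full text": {t1, t2}, "Tables": {t2, t3}})
```

In the toy data the two sources share one match, so "All" is lower than the sum of the two per-source recalls, mirroring the pattern in the table.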
Figure 1. COSMIC data set recall results of applying EMU to different sources and their aggregation (All) on the PMC set. Matching of the triple PMID, gene and mutation is required to obtain a match.
Figure 2. InSiGHT data set recall results of applying EMU to different sources and their aggregation (All) on the PMC set. Matching of the triple PMID, gene and mutation is required to obtain a match.