Chih-Hsuan Wei, Alexis Allot, Kevin Riehle, Aleksandar Milosavljevic, Zhiyong Lu.
Abstract
MOTIVATION: Previous studies have shown that automated text-mining tools are becoming increasingly important for unlocking variant information in the scientific literature at large scale. Despite multiple past attempts, existing tools remain limited in both recognition scope and precision.
RESULT: We propose tmVar 3.0, an improved variant recognition and normalization system. Compared to its predecessors, tmVar 3.0 recognizes a wider spectrum of variant-related entities (e.g., allele and copy number variants) and groups together different variant mentions that belong to the same genomic sequence position in an article for improved accuracy. Moreover, tmVar 3.0 provides advanced variant normalization options such as allele-specific identifiers from the ClinGen Allele Registry. tmVar 3.0 exhibits state-of-the-art performance, with over 90% F-measure for variant recognition and normalization when evaluated on three independent benchmarking datasets. tmVar 3.0 and its annotations for the entire PubMed and PMC datasets are freely available for download.
AVAILABILITY: https://github.com/ncbi/tmVar3
Year: 2022 PMID: 35904569 PMCID: PMC9477515 DOI: 10.1093/bioinformatics/btac537
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931
The mutation types extracted by tmVar 3.0 and examples
| Type | Example | tmVar 3.0 | tmVar 2.0 | SETH |
|---|---|---|---|---|
| SNP | Rs763780 | ✓ | ✓ | ✓ |
| DNA mutation | c.1976A>T | ✓ | ✓ | ✓ |
| DNA allele | 1976A | ✓ |  |  |
| DNA change | A>T | ✓ | ✓ | ✓ |
| Protein mutation | p.Gln659Leu | ✓ | ✓ | ✓ |
| Protein allele | glutamine at codon 659 | ✓ |  |  |
| Protein change | methionine to threonine | ✓ | ✓ | ✓ |
| Other mutations | 306 base pair insertion | ✓ |  |  |
| Copy number variant | Chr15: 31 833 000–37 477 000 bp deletion | ✓ |  |  |
| RefSeq | NM_203475.1 | ✓ |  |  |
| Chromosome | 10q11.12 | ✓ |  |  |
| Genomic region | Chr10: 46 123 781–51 028 772 | ✓ | ✓ |
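To make the mention types above more concrete, here is a minimal Python sketch that matches a few of the most regular formats from the Example column (SNP identifiers, DNA and protein mutations in HGVS-like notation, and RefSeq accessions). The regular expressions and the `find_mentions` helper are illustrative assumptions, not the method tmVar 3.0 uses, and they cover far fewer surface forms than the tools compared in the table.

```python
import re

# Hypothetical, deliberately narrow patterns for illustration only.
# They are NOT taken from tmVar 3.0 and miss most natural-language variants
# (e.g. "glutamine at codon 659" or "306 base pair insertion").
PATTERNS = {
    "SNP": re.compile(r"\brs\d+\b", re.IGNORECASE),
    "DNA mutation": re.compile(r"\bc\.\d+[ACGT]>[ACGT]\b"),
    "Protein mutation": re.compile(r"\bp\.[A-Z][a-z]{2}\d+[A-Z][a-z]{2}\b"),
    "RefSeq": re.compile(r"\bN[MCPR]_\d+(?:\.\d+)?\b"),
}


def find_mentions(text):
    """Return (type, matched_text) pairs ordered by position in the text."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), label, match.group()))
    return [(label, matched) for _, label, matched in sorted(hits)]


if __name__ == "__main__":
    sentence = "Rs763780 (c.1976A>T, p.Gln659Leu) in NM_203475.1"
    for label, matched in find_mentions(sentence):
        print(f"{label}: {matched}")
```

Rigid patterns like these only capture well-formed notation, which is one reason free-text types such as alleles and copy number variants need a broader recognition approach.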
Performance comparison of tmVar 3.0 with tmVar 2.0 and SETH on three public benchmarking datasets, tmVar 3.0, OSIRIS (Bonis et al.) and Thomas (Thomas et al.), for the variant recognition (NER) and normalization tasks
| Corpus | Task | Method | Precision (%) | Recall (%) | F-score (%) |
|---|---|---|---|---|---|
| tmVar | NER | tmVar 3.0 | 94.01 | 88.86 | 91.36 |
|  |  | tmVar 2.0 | 98.22 | 80.64 | 88.57 |
|  |  | SETH | 97.92 | 68.77 | 80.79 |
|  | Normalization | tmVar 3.0 | 96.99 | 91.71 | 94.28 |
|  |  | tmVar 2.0 | 94.49 | 77.25 | 85.00 |
|  |  | SETH | 86.51 | 69.91 | 77.33 |
| OSIRIS | NER | tmVar 3.0 | 98.62 | 84.98 | 91.30 |
|  |  | tmVar 2.0 | 99.53 | 83.00 | 90.52 |
|  |  | SETH | 96.43 | 74.70 | 84.19 |
|  | Normalization | tmVar 3.0 | 97.72 | 84.58 | 90.68 |
|  |  | tmVar 2.0 | 97.20 | 80.62 | 88.14 |
|  |  | SETH | 94.21 | 69.38 | 79.91 |
| Thomas | NER | tmVar 3.0 | 92.26 | 91.30 | 91.78 |
|  |  | tmVar 2.0 | 82.46 | 97.04 | 89.16 |
|  |  | SETH | 84.43 | 69.39 | 76.18 |
|  | Normalization | tmVar 3.0 | 91.01 | 90.32 | 90.67 |
|  |  | tmVar 2.0 | 89.94 | 88.24 | 89.08 |
|  |  | SETH | 95.58 | 57.50 | 71.80 |
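The F-score column appears to be the balanced F-measure, the harmonic mean of precision and recall, F = 2PR / (P + R). The record does not state the formula, so this is an assumption; the sketch below recomputes it for the tmVar 3.0 rows as a sanity check. The `f_score` helper and the `TMVAR3_ROWS` list are names introduced here for illustration, and small differences in the last digit are expected because the published precision and recall are themselves rounded.

```python
def f_score(precision, recall):
    """Balanced F-measure (harmonic mean) of precision and recall, in percent."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (corpus, task, precision, recall, reported F-score) for the tmVar 3.0 rows,
# copied from the table above.
TMVAR3_ROWS = [
    ("tmVar", "NER", 94.01, 88.86, 91.36),
    ("tmVar", "Normalization", 96.99, 91.71, 94.28),
    ("OSIRIS", "NER", 98.62, 84.98, 91.30),
    ("OSIRIS", "Normalization", 97.72, 84.58, 90.68),
    ("Thomas", "NER", 92.26, 91.30, 91.78),
    ("Thomas", "Normalization", 91.01, 90.32, 90.67),
]

if __name__ == "__main__":
    for corpus, task, precision, recall, reported in TMVAR3_ROWS:
        computed = f_score(precision, recall)
        print(f"{corpus:7s} {task:14s} computed={computed:.2f} reported={reported:.2f}")
```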