tmVar 3.0: an improved variant concept recognition and normalization tool.

Chih-Hsuan Wei¹, Alexis Allot¹, Kevin Riehle², Aleksandar Milosavljevic², Zhiyong Lu¹.

Abstract

MOTIVATION: Previous studies have shown that automated text-mining tools are becoming increasingly important for successfully unlocking variant information in scientific literature at large scale. Despite multiple attempts in the past, existing tools are still of limited recognition scope and precision.

RESULT: We propose tmVar 3.0: an improved variant recognition and normalization system. Compared to its predecessors, tmVar 3.0 recognizes a wider spectrum of variant-related entities (e.g., allele and copy number variants), and groups together different variant mentions belonging to the same genomic sequence position in an article for improved accuracy. Moreover, tmVar 3.0 provides advanced variant normalization options such as allele-specific identifiers from the ClinGen Allele Registry. tmVar 3.0 exhibits state-of-the-art performance with over 90% in F-measure for variant recognition and normalization, when evaluated on three independent benchmarking datasets. tmVar 3.0 as well as annotations for the entire PubMed and PMC datasets are freely available for download.

AVAILABILITY: https://github.com/ncbi/tmVar3.

Year: 2022 PMID: 35904569 PMCID: PMC9477515 DOI: 10.1093/bioinformatics/btac537

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931

Introduction

Genomic variants are an essential part of precision medicine, which aims to provide personalized treatments based on an individual’s genetic profile. To better understand the mechanisms of genetic diseases, (semi-)automatically collecting and assimilating published knowledge about sequence variants in the scientific literature has become an increasingly important task. A recent study (Lee ) reviewed a number of existing software tools developed for this task (Caporaso ; Cejuela ; Cheng ; Thomas ; Wei ). Most of these tools use regular expressions based on the Human Genome Variation Society (HGVS) nomenclature and on variant forms that are frequent in text. We previously developed tmVar (Wei ), which uses a machine learning-based approach to recognize variant components (wild type, mutant and position). More recently, tmVar was extended to perform variant normalization by linking recognized variant mentions to standard concept identifiers (Wei ). Specifically, tmVar 2.0 normalizes variant mentions to dbSNP (https://www.ncbi.nlm.nih.gov/snp/) RS identifiers (RS IDs). Several downstream text mining applications have been built on the variant linking results of tmVar (Allot ; Nie ).

Despite these efforts, existing variant extraction tools are still limited in two respects: (i) recognizing a broader scope of variant types, such as incomplete variants (e.g. V600) and related concepts (e.g. genomic regions), which play important roles in connecting variants with disease and drug information in the same article; and (ii) linking mentions to specific alleles. Note that dbSNP RS IDs (e.g. rs113488022) only record the polymorphism at a specific position but do not differentiate the specific alleles (e.g. T>A versus T>C) registered in the ClinGen Allele Registry (CAR) (Pawliczek ). We herein propose a new comprehensive variant extraction system that addresses these challenges. The improved tmVar 3.0 system achieves consistently high precision and recall on several publicly available gold standard corpora and is freely available to the scientific community.

Methods

We first expanded the recognition scope of tmVar to cover more difficult cases that were rarely addressed by existing tools, such as incomplete variant mentions (e.g. Cys326; guanine to cytosine), copy number variants (e.g. chr19:54 666 173–54 677 766 bp del), reference sequences (RefSeq) (e.g. NM_203475.1), chromosomal locations (e.g. chromosome 5 q 33) and genomic regions (e.g. chr7:156 583 796–156 584 569), as shown in Table 1.

Table 1.

The mutation types extracted by tmVar 3.0 and examples

Type	Example	tmVar 3.0	tmVar 2.0	SETH
SNP	rs763780	✓	✓	✓
DNA mutation	c.1976A>T	✓	✓	✓
DNA allele	1976A	✓
DNA change	A>T	✓	✓	✓
Protein mutation	p.Gln659Leu	✓	✓	✓
Protein allele	glutamine at codon 659	✓
Protein change	methionine to threonine	✓	✓	✓
Other mutations	306 base pair insertion	✓
Copy number variant	Chr15: 31 833 000–37 477 000 bp deletion	✓
RefSeq	NM_203475.1	✓
Chromosome	10q11.12	✓
Genomic region	Chr10: 46 123 781–51 028 772	✓		✓
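
As a rough illustration of how a few of the mention types in Table 1 differ in surface form, the following sketch classifies a mention with hand-written regular expressions. This is not the tmVar 3.0 recognizer, which builds on a machine learning-based approach; the patterns, the `PATTERNS` list and the `classify_mention` function are assumptions made for this example and cover only the simplest forms.

```python
import re

# Illustrative patterns only (assumed for this sketch, not taken from tmVar).
# Each pattern must match the whole mention.
PATTERNS = [
    ("SNP", re.compile(r"^rs\d+$", re.IGNORECASE)),
    ("RefSeq", re.compile(r"^[NX][MRP]_\d+(\.\d+)?$")),
    ("DNA mutation", re.compile(r"^c\.\d+[ACGT]>[ACGT]$")),
    ("Protein mutation", re.compile(r"^p\.[A-Z][a-z]{2}\d+[A-Z][a-z]{2}$")),
    ("DNA allele", re.compile(r"^\d+[ACGT]$")),
    ("DNA change", re.compile(r"^[ACGT]>[ACGT]$")),
    ("Chromosome", re.compile(r"^(\d{1,2}|[XY])[pq]\d+(\.\d+)?$")),
]


def classify_mention(mention):
    """Return the first matching Table 1 type label, or None if nothing matches."""
    text = mention.strip()
    for label, pattern in PATTERNS:
        if pattern.match(text):
            return label
    return None


for example in ["rs763780", "c.1976A>T", "1976A", "p.Gln659Leu", "NM_203475.1", "10q11.12"]:
    print(example, "->", classify_mention(example))
```

Natural-language mentions such as "glutamine at codon 659" fall through to `None` here, which is the kind of case where fixed patterns run out and a learned recognizer is needed.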

Second, to better support variant-related text mining research (e.g. mining variant-disease associations), tmVar 3.0 groups variants at the same genomic sequence position even when they appear in different forms or types (e.g. DNA and protein variants/alleles). For instance, in PMID: 20577006, we group the variants P799L and P799, which both belong to rs121912637. In this article, P799, but not P799L, co-occurs with the disease metatropic dysplasia in the same sentence. Grouping the two variant mentions therefore makes it easier to link P799L to the correct disease.

Third, tmVar 3.0 provides alternative options to normalize a particular variant. In addition to providing RS IDs, which record all the possible allele changes at a specific genomic position, we offer three allele-specific options for improved precision in the normalization results: (i) the CAR Canonical Allele Identifier (CA ID) (e.g. CA16602736), (ii) the combination of an RS ID and the specific allele [e.g. rs113488022(T>A)] and (iii) the combination of the corresponding gene and the variant, described below. A CA ID is a fine-grained identifier that designates one specific allele at a genomic sequence position. To map variants to CA IDs, we expanded the mapping table used for variant normalization: in addition to the existing records (i.e. variant, corresponding gene and RS ID), we appended the CA IDs to the table. Given the variant in the raw text and the corresponding gene recognized by our gene tagger [e.g. GNormPlus (Wei )], the RS ID and CA ID can be looked up directly in the mapping table. Furthermore, we observed that more than half of the variant mentions cannot be linked to an existing record in the dbSNP or CAR databases. In such cases, tmVar 3.0 finds the corresponding gene of the variant in the text and, as the third option, normalizes the mention to the gene paired with the variant (e.g. BRAF: c.1799T>A). The percentages of normalized variants in the entire PubMed/PMC are 25.43% to CA IDs, 50.23% to RS IDs and 33.80% to corresponding genes.
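
The lookup-with-fallback logic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the tmVar 3.0 implementation: the `Record` type, the `MAPPING_TABLE` contents and the `normalize` function are hypothetical, and the single table row simply reuses the example identifiers quoted in the text without asserting that they refer to the same allele.

```python
from typing import NamedTuple, Optional


class Record(NamedTuple):
    rs_id: str
    allele: str
    ca_id: Optional[str]


# Toy mapping table keyed by (gene, variant). The row reuses identifiers that
# appear as examples in the text; pairing them is an assumption for illustration.
MAPPING_TABLE = {
    ("BRAF", "c.1799T>A"): Record("rs113488022", "T>A", "CA16602736"),
}


def normalize(gene, variant, table=MAPPING_TABLE):
    """Return the available normalization options for one variant mention.

    - Record found: RS ID, RS ID with allele and, if present, CA ID.
    - No record but gene known: fall back to "GENE: variant".
    - No gene: nothing can be assigned.
    """
    if gene is None:
        return {}
    record = table.get((gene, variant))
    if record is None:
        return {"gene_variant": f"{gene}: {variant}"}
    result = {
        "rs_id": record.rs_id,
        "rs_allele": f"{record.rs_id}({record.allele})",
    }
    if record.ca_id is not None:
        result["ca_id"] = record.ca_id
    return result


print(normalize("BRAF", "c.1799T>A"))
print(normalize("GENE1", "c.10A>G"))  # hypothetical gene and variant with no record
print(normalize(None, "V600"))        # no gene found in the article
```

Keying the table on the gene together with the variant string mirrors the description that both the raw-text variant and the tagged gene are needed for the lookup.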
Not all variants can be normalized or mapped to a corresponding gene, since gene information is lacking in some articles.

Finally, in tmVar 3.0, we improved our recognition algorithm on some previously difficult edge cases, such as variants described in natural language (e.g. ‘nine-nucleotide deletion starting at position 1952’) or written without a space between the gene and the variant (e.g. ‘BRAFV600E’).
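
To make the missing-space edge case concrete, here is a small sketch that splits a fused token such as ‘BRAFV600E’ into a gene symbol and a protein variant. It is an assumption-laden illustration rather than the approach tmVar 3.0 takes: the `FUSED` pattern, the `split_fused` function and the optional `known_genes` filter are hypothetical, and only single-letter amino acid substitutions are handled.

```python
import re

AMINO = "ACDEFGHIKLMNPQRSTVWY"  # one-letter amino acid codes

# Gene symbol (uppercase letters/digits, matched lazily) followed directly by
# a substitution of the form <residue><position><residue>.
FUSED = re.compile(
    rf"^(?P<gene>[A-Z][A-Z0-9]*?)(?P<variant>[{AMINO}]\d+[{AMINO}])$"
)


def split_fused(token, known_genes=None):
    """Split 'GENEvariant' into (gene, variant), or return None.

    If known_genes is given, the gene part must be in that set; this guards
    against accidental splits of tokens that are not gene symbols.
    """
    match = FUSED.match(token)
    if match is None:
        return None
    gene, variant = match.group("gene"), match.group("variant")
    if known_genes is not None and gene not in known_genes:
        return None
    return gene, variant


print(split_fused("BRAFV600E"))
print(split_fused("TP53R175H"))
print(split_fused("V600E"))
```

A plain variant such as ‘V600E’ is left alone because the pattern requires at least one gene character before the substitution.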

Results

tmVar 3.0 was evaluated on three separate benchmarking datasets [i.e. OSIRIS (Bonis ), Thomas (Thomas ) and our revised tmVar corpus (Wei )]. In the revised tmVar corpus, we annotated all of the relevant variants (e.g. alleles) and mapped every variant to either an RS ID or the corresponding gene. The evaluation results of tmVar 3.0 on variant recognition and normalization are shown in Table 2, together with those of the previous tmVar version (2.0) and SETH (Thomas ), a previous state-of-the-art method that produces normalized dbSNP RS IDs. tmVar 3.0 achieves an F-measure above 90% on every task and a higher F-measure than both SETH and tmVar 2.0 on all three public corpora. To facilitate the use of tmVar results at PubMed scale, we processed the entire PubMed and PMC open access content and incorporated the results into the NCBI web server PubTator (Wei ). The annotations are also freely available via FTP.

Table 2.

tmVar 3.0 performance comparison with tmVar 2.0 and SETH on three public benchmarking datasets [the revised tmVar corpus, OSIRIS (Bonis ) and Thomas (Thomas )] for variant recognition (NER) and normalization tasks

Corpus	Task	Method	Precision (%)	Recall (%)	F-score (%)
tmVar	NER	tmVar 3.0	94.01	88.86	91.36
		tmVar 2.0	98.22	80.64	88.57
		SETH	97.92	68.77	80.79
	Normalization	tmVar 3.0	96.99	91.71	94.28
		tmVar 2.0	94.49	77.25	85.00
		SETH	86.51	69.91	77.33
OSIRIS	NER	tmVar 3.0	98.62	84.98	91.30
		tmVar 2.0	99.53	83.00	90.52
		SETH	96.43	74.70	84.19
	Normalization	tmVar 3.0	97.72	84.58	90.68
		tmVar 2.0	97.20	80.62	88.14
		SETH	94.21	69.38	79.91
Thomas	NER	tmVar 3.0	92.26	91.30	91.78
		tmVar 2.0	82.46	97.04	89.16
		SETH	84.43	69.39	76.18
	Normalization	tmVar 3.0	91.01	90.32	90.67
		tmVar 2.0	89.94	88.24	89.08
		SETH	95.58	57.50	71.80
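
The F-score column in Table 2 can be checked against the precision and recall columns. The sketch below assumes the reported F-score is the balanced F1 measure, i.e. the harmonic mean of precision and recall; the text does not state the formula, but this assumption reproduces the tabulated values to within rounding. The function name `f_score` is chosen for this example.

```python
def f_score(precision, recall):
    """Balanced F-measure (harmonic mean) of precision and recall, same units as inputs."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# tmVar corpus, NER, tmVar 3.0 row of Table 2: P = 94.01, R = 88.86
print(round(f_score(94.01, 88.86), 2))  # 91.36
```

Because the harmonic mean is pulled toward the lower of the two inputs, a system with very high precision but modest recall, as in several SETH rows, ends up with a noticeably lower F-score.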

Conclusion

We introduce tmVar 3.0, an improved open-source software tool with a broader scope and better accuracy for variant concept recognition and normalization, compared to its predecessors. tmVar 3.0 can recognize most of the variants even when the variants are described with partial information (e.g. amino acid change without the sequence position) or with natural language. tmVar 3.0 groups different mentions of the same variant together based on the context for improved normalization performance. As a result, tmVar 3.0 achieves superior variant recognition and normalization. In the future, we would like to further enhance and expand tmVar by better linking variants with other closely related concepts such as drugs and diseases.

Funding

This work was supported by the National Institutes of Health Intramural Research Program, National Library of Medicine and in part by the NIH NHGRI Clinical Genome Resource (ClinGen) grant U24 HG009649.

Conflict of Interest: none declared.

References

1. Beyond accuracy: creating interoperable and scalable text-mining web services.

Authors: Chih-Hsuan Wei; Robert Leaman; Zhiyong Lu
Journal: Bioinformatics Date: 2016-02-16 Impact factor: 6.937

2. OSIRIS: a tool for retrieving literature about sequence variants.

Authors: Julio Bonis; Laura Inés Furlong; Ferran Sanz
Journal: Bioinformatics Date: 2006-07-31 Impact factor: 6.937

3. tmVar: a text mining approach for extracting sequence variants in biomedical literature.

Authors: Chih-Hsuan Wei; Bethany R Harris; Hung-Yu Kao; Zhiyong Lu
Journal: Bioinformatics Date: 2013-04-05 Impact factor: 6.937

4. SETH detects and normalizes genetic variants in text.

Authors: Philippe Thomas; Tim Rocktäschel; Jörg Hakenberg; Yvonne Lichtblau; Ulf Leser
Journal: Bioinformatics Date: 2016-06-02 Impact factor: 6.937

5. tmVar 2.0: integrating genomic variant information from literature with dbSNP and ClinVar for precision medicine.

Authors: Chih-Hsuan Wei; Lon Phan; Juliana Feltz; Rama Maiti; Tim Hefferon; Zhiyong Lu
Journal: Bioinformatics Date: 2018-01-01 Impact factor: 6.937

6. Recent advances of automated methods for searching and extracting genomic variant information from biomedical literature.

Authors: Kyubum Lee; Chih-Hsuan Wei; Zhiyong Lu
Journal: Brief Bioinform Date: 2021-05-20 Impact factor: 11.622

7. GNormPlus: An Integrative Approach for Tagging Genes, Gene Families, and Protein Domains.

Authors: Chih-Hsuan Wei; Hung-Yu Kao; Zhiyong Lu
Journal: Biomed Res Int Date: 2015-08-25 Impact factor: 3.411

8. nala: text mining natural language mutation mentions.

Authors: Juan Miguel Cejuela; Aleksandar Bojchevski; Carsten Uhlig; Rustem Bekmukhametov; Sanjeev Kumar Karn; Shpend Mahmuti; Ashish Baghudana; Ankit Dubey; Venkata P Satagopam; Burkhard Rost
Journal: Bioinformatics Date: 2017-06-15 Impact factor: 6.937

9. ClinGen Allele Registry links information about genetic variants.

Authors: Piotr Pawliczek; Ronak Y Patel; Lillian R Ashmore; Andrew R Jackson; Chris Bizon; Tristan Nelson; Bradford Powell; Robert R Freimuth; Natasha Strande; Neethu Shah; Sameer Paithankar; Matt W Wright; Selina Dwight; Jimmy Zhen; Melissa Landrum; Peter McGarvey; Larry Babb; Sharon E Plon; Aleksandar Milosavljevic
Journal: Hum Mutat Date: 2018-11 Impact factor: 4.878

10. LitVar: a semantic search engine for linking genomic variant data in PubMed and PMC.

Authors: Alexis Allot; Yifan Peng; Chih-Hsuan Wei; Kyubum Lee; Lon Phan; Zhiyong Lu
Journal: Nucleic Acids Res Date: 2018-07-02 Impact factor: 16.971