LitVar: a semantic search engine for linking genomic variant data in PubMed and PMC.

Alexis Allot¹, Yifan Peng¹, Chih-Hsuan Wei¹, Kyubum Lee¹, Lon Phan¹, Zhiyong Lu¹.

Abstract

The identification and interpretation of genomic variants play a key role in the diagnosis of genetic diseases and related research. These tasks increasingly rely on accessing relevant manually curated information from domain databases (e.g. SwissProt or ClinVar). However, due to the sheer volume of medical literature and the high cost of expert curation, curated variant information in existing databases is often incomplete and out of date. In addition, the same genetic variant can be mentioned in publications under various names (e.g. 'A146T' versus 'c.436G>A' versus 'rs121913527'). A search in PubMed using only one name usually cannot retrieve all relevant articles for the variant of interest. Hence, to help scientists, healthcare professionals, and database curators find the most up-to-date published variant research, we have developed LitVar for the search and retrieval of standardized variant information. In addition, LitVar uses advanced text mining techniques to compute and extract relationships between variants and other associated entities such as diseases and chemicals/drugs. LitVar is publicly available at https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/LitVar.

Year: 2018 PMID: 29762787 PMCID: PMC6030971 DOI: 10.1093/nar/gky355

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Most biomedical knowledge is available as unstructured information in scholarly publications (1). While multiple databases provide structured knowledge on variations (2–5), they heavily rely on manual curation, and thus need advanced variation-oriented text-mining tools to improve the annotation process (6). The task of linking omics data (fields of study in biology ending in -omics such as genomics and proteomics) with scientific literature is further complicated by multiple synonyms and abbreviations used by researchers to refer to one variant, gene, disease or chemical in publications (7). Hence, finding comprehensive and contextualized information about a specific genomic variation becomes an arduous task, as researchers and healthcare professionals rely on curated databases or keyword-based search engines (8,9) that are not suitable for the variety of formats and complexity in which a variation can be cited in literature. Consequently, a variant-oriented semantic search system, which improves the quality (sensitivity and specificity) of search results, is greatly needed. To date, a handful of automatic tools have attempted to address this issue. For instance, command-line automatic variation detection tools such as EMU (10), MutationFinder (11) or Nala (12) can recognize variation mentions in text and return the results in wNm format (e.g. ‘A146T’), while SETH (13) and tmVar (6) can further map the extracted mentions to the specific dbSNP identifiers (e.g. rs121913527). A number of web applications have also been developed to provide an improved search for variants, but they generally accept only specific variant identifiers (14), or are limited to variant information found in abstracts (15,16). GeneView (17) is a recent system which allows semantic search of variants (with or without gene information), but its search results do not include context information and are not always normalized to specific variants (Supplementary Table S1). 
Here, we present LitVar, a novel tool that combines robust and advanced text mining, with data integrations from PubMed, PubMed Central Open Access Subset (hereafter called ‘PMC-OA’), dbSNP (5), and ClinVar (4) for the accurate search of variants and related information from unstructured human-related biomedical literature. Compared to PubMed, LitVar offers multiple advantages in variant searching. First, LitVar uses tmVar (6,18), a high-performance variant name recognition tool, supporting both abstracts and full-text articles (Supplementary Table S2) to normalize different names of the same variant into a unique and standardized form. This enables all matching articles to be returned regardless of the specific queried variant name (e.g. identical results will be returned for ‘A146T’, ‘c.436G>A’ or rs121913527). Second, LitVar combines variant-related literature from PubMed abstracts (>27 million) and PMC-OA full-text articles (>1.8 million) and provides a unified access to both literature resources. This is particularly important as abstracts have much lower biomedical concept coverage compared to full-text articles (19,20). Third, LitVar employs a state-of-the-art entity recognition toolset (6,21–23) as its backend processing method, such that users can explore related chemical and disease information for variants of interest. In addition, users can filter results by publication type (e.g. ‘Review’ or ‘Letter’), publication year (e.g. ‘Last Year’ or ‘Last 2 years’), specific journals, and different elements of a publication (e.g. abstract or table content). Finally, LitVar allows users to download search results and subscribe to Really Simple Syndication (RSS) feeds of the latest literature updates. In addition to providing a user-friendly and interactive interface for human users, LitVar also supports a set of RESTful APIs for computational analysis and open programmatic access to its standardized and normalized variant data.

SYSTEM DESCRIPTION

Literature data process—entity recognition/normalization and relation extraction

LitVar employs several state-of-the-art text mining and information extraction components in its data processing, as shown in Figure 1. First, we processed the entire set of PubMed abstracts and PMC-OA full-text articles in the BioC XML format (24) to extract all variations and their associated entities (i.e. gene, disease, chemical, and species) using a suite of entity taggers: tmVar for variants, GNormPlus for genes (22), TaggerOne for chemicals and diseases (21) and SR4GN for species (23). Following the lead of GeneView (17), we report the performance of our taggers on previously benchmarked datasets in the supplementary document (Supplementary Table S3). We then normalized all detected entities to corresponding database identifiers (e.g. MeSH identifiers for chemicals and diseases). When possible, we mapped variants in different forms to dbSNP identifiers (RSIDs); otherwise, we normalized them into standard HGVS formats. After entity tagging, non-human papers were removed in order to be consistent with dbSNP, and a sentence splitter (25) was applied to segment the remaining articles into individual sentences. Finally, we extracted relations between entities based on sentence co-occurrence. LitVar data are updated monthly.
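
The relation extraction step described above can be illustrated with a minimal sketch. It assumes each sentence has already been reduced to a list of normalized entities; the `Entity` type, the `cooccurrence_counts` function and the toy identifiers (`V1`, `D1`, `C1`) are hypothetical names introduced for illustration only and are not taken from the LitVar code base.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable, NamedTuple


class Entity(NamedTuple):
    kind: str        # e.g. "variant", "gene", "disease", "chemical"
    identifier: str  # normalized identifier (toy values in this sketch)


def cooccurrence_counts(
    sentences: Iterable[list[Entity]],
) -> Counter[tuple[str, Entity]]:
    """For each variant, count the sentences it shares with every other entity."""
    counts: Counter[tuple[str, Entity]] = Counter()
    for entities in sentences:
        unique = set(entities)  # an entity counts once per sentence
        for variant in (e for e in unique if e.kind == "variant"):
            for other in unique - {variant}:
                counts[(variant.identifier, other)] += 1
    return counts


# Toy input: three tagged sentences (identifiers are placeholders).
example_sentences = [
    [Entity("variant", "V1"), Entity("disease", "D1")],
    [Entity("variant", "V1"), Entity("disease", "D1"), Entity("chemical", "C1")],
    [Entity("disease", "D1"), Entity("chemical", "C1")],  # no variant: ignored
]
example_counts = cooccurrence_counts(example_sentences)
```

Sorting such counts per variant would give a ranking of the entities that most often share a sentence with it. Note that plain co-occurrence also counts sentences stating that two entities are unrelated.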

Figure 1.

Pre-processing literature data for LitVar. Multiple scripts import publications, detect and normalize biological entities, retrieve relations and continuously update the database.

Query processing

LitVar analyses user queries through a three-step normalization process (Figure 2). First, we use regular expressions to replace three-letter amino-acid codes with single-letter codes when applicable. For example, ‘Ala146Thr’ is replaced by ‘A146T’. Second, we identify the main components of a variant mention (such as the sequence position and mutation type) and rewrite them as an HGVS (Human Genome Variation Society) expression. For example, ‘A146T’ is transformed into ‘p.A146T’. Finally, the HGVS expression is used to match LitVar entries with the same name and return results sorted by the number of associated publications. The best match (i.e. the variant with the most publications) is used for the default search results, while other matching variants are also returned to the user for further review.
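
The three steps above can be sketched as follows. This is a simplified illustration that handles only protein substitutions such as ‘Ala146Thr’; the function names, the regular expressions and the toy index (with made-up variant IDs and publication counts) are assumptions for this example, not LitVar's actual implementation.

```python
from __future__ import annotations

import re

# Standard three-letter to one-letter amino-acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
_THREE_LETTER = re.compile("|".join(THREE_TO_ONE))
_PROTEIN_SUB = re.compile(r"^(?:p\.)?([A-Z])(\d+)([A-Z])$")


def to_single_letter(query: str) -> str:
    """Step 1: replace three-letter amino-acid codes with one-letter codes."""
    return _THREE_LETTER.sub(lambda m: THREE_TO_ONE[m.group(0)], query)


def to_hgvs(query: str) -> str | None:
    """Step 2: rewrite a protein substitution as an HGVS-style expression."""
    match = _PROTEIN_SUB.match(to_single_letter(query.strip()))
    if match is None:
        return None  # not a protein substitution this sketch understands
    ref, position, alt = match.groups()
    return f"p.{ref}{position}{alt}"


def rank_matches(hgvs: str, index: dict[str, dict[str, int]]) -> list[str]:
    """Step 3: return matching variant IDs, most publications first."""
    candidates = index.get(hgvs, {})
    return sorted(candidates, key=lambda var_id: candidates[var_id], reverse=True)


# Toy index: HGVS name -> {variant ID: publication count}; values are made up.
toy_index = {"p.A146T": {"variant_b": 3, "variant_a": 12}}
ranked = rank_matches(to_hgvs("Ala146Thr") or "", toy_index)
best_match = ranked[0] if ranked else None
```

In this sketch the first element of `ranked` plays the role of the default best match, and the remaining elements correspond to the alternative variants offered for review.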

Figure 2.

LitVar normalizes user queries in real-time.

System implementation details

In LitVar, we aggregate text-mined entities and snippets from PubMed and PMC-OA and store them in a MongoDB database. Our Django web server then processes the requests of both the web application (based on AngularJS, one of the most popular web frameworks) and RESTful API clients. We have chosen both a JSON-like document-oriented database and JSON-oriented front-end to significantly reduce data transformations between storage and visualization of the content. The choice of client-side rendering also allows for better response to user interaction, thus improving the user experience. Currently, LitVar supports most popular web browsers, including the latest versions of Chrome, Safari, Firefox, IE11 and Edge.

RESULTS

As of March 2018, there are 1 968 872 unique variants in LitVar, of which 852 489 are linked to RSIDs while the rest are expressed in standard HGVS forms. Figure 3 shows that there are 309 048 RSID-PMID links in dbSNP, while LitVar detects 269 253 and 692 953 links by text-mining the entire PubMed and PMC-OA, respectively. On average, LitVar returns twice as many publications as dbSNP, because of its ability to include many synonymous names. For example, in the case of ‘rs121913527’ as a search query, no results are found in PubMed, 10 results are found in dbSNP, while 87 articles are found in LitVar. As can be seen in Figure 3, some RSID-PMID links exist only in dbSNP, as LitVar is limited to the OA subset of PMC and does not currently process supplementary materials.

Figure 3.

The distribution of variants in PMC-OA, PubMed and dbSNP. All numbers are RSID-PMID unique pairs. Data accessed on 5 April 2018.

FUNCTIONALITY AND USAGE

Website

LitVar can be accessed through an easy-to-use graphical web interface, as shown in Figure 4. After a user enters a query in the search bar (Figure 4a), LitVar normalizes the query to find the best matching variant in its database, along with alternative disambiguation results (Figure 4b). Next, LitVar returns a list of publications containing this variant, ordered by publication date. This process has two main features. First, for each result, LitVar returns one or more snippets highlighting the searched variant as well as other entities (e.g. diseases, chemicals, and other variants) which appear the most often in the same sentence as the queried variant (Figure 4d). They are detected in publications during pre-processing and linked to each sentence in our database. This is particularly useful to detect potential relations (e.g. potential implication of a variant in several diseases). In addition to this entity-level filtering, users can also filter publications based on ‘Top Journals’, ‘Publication Year’, ‘Publication Type’ and ‘Part of publication’ matching the query (Figure 4c). Second, in addition to displaying publications associated with the relevant variant (Figure 4e), LitVar also displays a ‘Knowledge Panel’ (Figure 4f) to help users with the most important information about the queried variant, such as its clinical significance (this information is integrated from ClinVar).

Figure 4.

LitVar user interface. Multiple clearly delimited zones allow users to perform search and visualize results. This includes the (a) search, (b) disambiguation, (c) filters, (d) entity facets, (e) list of matching publications, (f) knowledge panel, (g) highlight customization panel and (h) automatic notification by RSS feed and download button.

Programmatic access via RESTful API

In addition to the interactive user interface, LitVar allows users to perform computational analyses through two types of RESTful APIs. The disambiguation API allows programmatic access to the LitVar disambiguation engine, which analyses a free-text query and returns a list of matching variants via VarIDs (a unique variation ID was created for LitVar because some variations could not be linked to an existing RSID after standardization). The second search API allows one to retrieve a list of PMIDs linked to any given variant specified by its VarID.
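
The two-step workflow described above (disambiguate a free-text query into VarIDs, then retrieve PMIDs for a chosen VarID) can be sketched as below. The actual endpoint URLs and response formats are not given in this text, so the sketch deliberately avoids them: `LitVarBackend`, `VariantMatch`, `InMemoryBackend` and all identifiers and PMIDs are hypothetical placeholders, and a real client would implement the two methods with HTTP calls to the documented API.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class VariantMatch:
    var_id: str             # placeholder VarID; the real format may differ
    name: str               # normalized variant name
    publication_count: int  # number of linked publications


class LitVarBackend(Protocol):
    """Hypothetical interface mirroring the two API types."""

    def disambiguate(self, query: str) -> list[VariantMatch]: ...

    def search(self, var_id: str) -> list[int]: ...


def pmids_for_query(backend: LitVarBackend, query: str) -> list[int]:
    """Resolve a free-text query to its best-matching variant, then fetch PMIDs."""
    matches = backend.disambiguate(query)
    if not matches:
        return []
    best = max(matches, key=lambda m: m.publication_count)
    return backend.search(best.var_id)


class InMemoryBackend:
    """Stand-in backend holding toy data, used here instead of network calls."""

    def __init__(
        self,
        matches: dict[str, list[VariantMatch]],
        pmids: dict[str, list[int]],
    ) -> None:
        self._matches = matches
        self._pmids = pmids

    def disambiguate(self, query: str) -> list[VariantMatch]:
        return self._matches.get(query, [])

    def search(self, var_id: str) -> list[int]:
        return self._pmids.get(var_id, [])


# Toy data only: the VarIDs and PMIDs below are invented for the example.
demo_backend = InMemoryBackend(
    matches={
        "A146T": [
            VariantMatch("var-1", "p.A146T", 2),
            VariantMatch("var-2", "p.A146T", 1),
        ]
    },
    pmids={"var-1": [101, 102], "var-2": [103]},
)
demo_pmids = pmids_for_query(demo_backend, "A146T")
```

Keeping the backend behind a small interface lets the same lookup logic run against a stub in tests and against the live service in production.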

USE CASES

Below we demonstrate examples of how LitVar may be used under real-world circumstances.

Case 1: citation link from dbSNP

When browsing variant information in dbSNP, researchers can click the citation link to find and read relevant publications in PubMed. As mentioned earlier, the literature link in dbSNP often shows incomplete results. Hence, a second link to LitVar was added. For example, when searching for information about a specific variant on the dbSNP website, such as rs1042714, users can click on the LitVar link to review related publications when applicable. The newly added link not only allows users to view more publications (466) than the link to PubMed (134) but also displays the context in which the variant appears in the publications (Figure 5). Furthermore, we display a small icon if a result in LitVar is the same as in dbSNP.

Figure 5.

LitVar snippets. LitVar displays the queried variation in the context of the sentences in which it appears in the publication. The queried variation has a red background, while other bioconcepts (other variants, genes, diseases, chemicals) are represented by specific colors. To continue their investigation, the researcher can further restrict the search to articles published in the last two years (51 publications) or use the entity facets on the left sidebar to investigate the link between this variant and a disease such as ‘Hypertension’.

Case 2: variant-specific search

A mutation mention is highly ambiguous as it can refer to different variants located on different genes. Conversely, a single variant can be described in many ways in the literature. For example, c.37G>C, p.Gly13Arg, p.G13R and ‘glycine to arginine substitution at position 13’, located on gene NRAS, all refer to the same RSID: rs121434595. Hence, searching for variants in PubMed suffers from problems of both precision and recall. A search with the unique identifier (RSID) in dbSNP addresses this issue, but as shown in Figure 3, many RSID-PMID links are missing in dbSNP. For instance, rare variants in the complement factor H (CFH) gene are associated with age-related macular degeneration (AMD). We start by searching for ‘CFH R1210C’ on LitVar. The best hit, rs121913059, is a highly penetrant rare variant with 48 results in LitVar, compared to seven results in PubMed (with the same query) or five results in dbSNP (with RSID). Furthermore, in the LitVar search results page, the highlighted snippets make it easy to select an abstract worth further investigation, for example ‘THE PATHOPHYSIOLOGY OF GEOGRAPHIC ATROPHY SECONDARY TO AGE-RELATED MACULAR DEGENERATION AND THE COMPLEMENT PATHWAY AS A THERAPEUTIC TARGET’. This relevant article is only found in LitVar results (i.e. absent in PubMed or dbSNP search results).

CONCLUSIONS

In summary, LitVar improves access to variant-specific information in the biomedical literature. LitVar processes not only the entire set of PubMed abstracts, but also applicable PMC-OA full-text articles. Furthermore, it allows users to examine other related entities, such as diseases and chemicals. LitVar has several known limitations. First, as a variant search system, LitVar currently only supports searches by variant or by variant with a gene. Second, variants in the LitVar database are currently limited to those found in the title, abstract, and full text, excluding supplementary materials. Third, LitVar endeavours to recognize a wide variety of variant formats, but a query may still yield zero results in LitVar either because we could not map it to a proper record in our database (e.g. g.28612G>A) or because the variant has no associated publications in LitVar (e.g. rs115735611). LitVar is also bound by the accuracy of the current text mining algorithms, which are known to be imperfect in both entity recognition and relation extraction. For entity tagging, our tools are mostly trained on abstracts, and their results on full text may therefore be inferior due to its structure and complexity (26). For relation extraction, LitVar currently relies on sentence co-occurrence. While this is a robust method for building real-world biological databases and web servers such as STRING (27) and GeneView (17), its results may include false positives (e.g. when a sentence states that two entities are not related). A few recent studies have shown the potential of using machine learning for extracting associations between variants and specific diseases such as cancer (28,29), but further investigation is warranted for validating and generalizing such methods across diseases and other entities (e.g. chemicals), as well as for testing their performance with full-text articles. In both cases, a large-scale human-annotated corpus would be required.
In the future, we would like to extend the current scope of LitVar by supporting queries containing other types of key entities (such as genes and diseases) and provide keyword-based queries, while continuing to improve LitVar's performance in speed and accuracy. To improve the quality of our relations, we plan to filter out sentences expressing uncertain or negative findings. We also plan to add new filters (a) to display publications that are only found in LitVar (i.e. not existing in other curated databases), as they may be of high interest to some users (e.g. curators), or (b) to show results found in specific sections of an article (e.g. Results versus Discussion section).

DATA AVAILABILITY

LitVar is publicly available at https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/LitVar.

27 in total

1. Distribution of information in biomedical abstracts and full-text publications.

Authors: M J Schuemie; M Weeber; B J A Schijvenaars; E M van Mulligen; C C van der Eijk; R Jelier; B Mons; J A Kors
Journal: Bioinformatics Date: 2004-05-06 Impact factor: 6.937

2. tmVar: a text mining approach for extracting sequence variants in biomedical literature.

Authors: Chih-Hsuan Wei; Bethany R Harris; Hung-Yu Kao; Zhiyong Lu
Journal: Bioinformatics Date: 2013-04-05 Impact factor: 6.937

3. Literome: PubMed-scale genomic knowledge base in the cloud.

Authors: Hoifung Poon; Chris Quirk; Charlie DeZiel; David Heckerman
Journal: Bioinformatics Date: 2014-06-17 Impact factor: 6.937

4. SETH detects and normalizes genetic variants in text.

Authors: Philippe Thomas; Tim Rocktäschel; Jörg Hakenberg; Yvonne Lichtblau; Ulf Leser
Journal: Bioinformatics Date: 2016-06-02 Impact factor: 6.937

5. A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts.

Authors: David Westergaard; Hans-Henrik Stærfeldt; Christian Tønsberg; Lars Juhl Jensen; Søren Brunak
Journal: PLoS Comput Biol Date: 2018-02-15 Impact factor: 4.475

6. MutationFinder: a high-performance system for extracting point mutation mentions from text.

Authors: J Gregory Caporaso; William A Baumgartner; David A Randolph; K Bretonnel Cohen; Lawrence Hunter
Journal: Bioinformatics Date: 2007-05-11 Impact factor: 6.937

7. The gene normalization task in BioCreative III.

Authors: Zhiyong Lu; Hung-Yu Kao; Chih-Hsuan Wei; Minlie Huang; Jingchen Liu; Cheng-Ju Kuo; Chun-Nan Hsu; Richard Tzong-Han Tsai; Hong-Jie Dai; Naoaki Okazaki; Han-Cheol Cho; Martin Gerner; Illes Solt; Shashank Agarwal; Feifan Liu; Dina Vishnyakova; Patrick Ruch; Martin Romacker; Fabio Rinaldi; Sanmitra Bhattacharya; Padmini Srinivasan; Hongfang Liu; Manabu Torii; Sergio Matos; David Campos; Karin Verspoor; Kevin M Livingston; W John Wilbur
Journal: BMC Bioinformatics Date: 2011-10-03 Impact factor: 3.169

8. STRING v10: protein-protein interaction networks, integrated over the tree of life.

Authors: Damian Szklarczyk; Andrea Franceschini; Stefan Wyder; Kristoffer Forslund; Davide Heller; Jaime Huerta-Cepas; Milan Simonovic; Alexander Roth; Alberto Santos; Kalliopi P Tsafou; Michael Kuhn; Peer Bork; Lars J Jensen; Christian von Mering
Journal: Nucleic Acids Res Date: 2014-10-28 Impact factor: 16.971

9. nala: text mining natural language mutation mentions.

Authors: Juan Miguel Cejuela; Aleksandar Bojchevski; Carsten Uhlig; Rustem Bekmukhametov; Sanjeev Kumar Karn; Shpend Mahmuti; Ashish Baghudana; Ankit Dubey; Venkata P Satagopam; Burkhard Rost
Journal: Bioinformatics Date: 2017-06-15 Impact factor: 6.937

Review 10. Data integration in biological research: an overview.

Authors: Vasileios Lapatas; Michalis Stefanidakis; Rafael C Jimenez; Allegra Via; Maria Victoria Schneider
Journal: J Biol Res (Thessalon) Date: 2015-09-02 Impact factor: 1.889
