VarMap: a web tool for mapping genomic coordinates to protein sequence and structure and retrieving protein structural annotations.

James D Stephenson¹,², Roman A Laskowski¹, Andrew Nightingale¹, Matthew E Hurles², Janet M Thornton¹.

Abstract

MOTIVATION: Understanding the protein structural context and patterning on proteins of genomic variants can help to separate benign from pathogenic variants and reveal molecular consequences. However, mapping genomic coordinates to protein structures is non-trivial, complicated by alternative splicing and transcript evidence.
RESULTS: Here we present VarMap, a web tool for mapping a list of chromosome coordinates to canonical UniProt sequences and associated protein 3D structures, including validation checks, and annotating them with structural information.
AVAILABILITY AND IMPLEMENTATION: https://www.ebi.ac.uk/thornton-srv/databases/VarMap.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Substances: Proteins

Year: 2019 PMID: 31192369 PMCID: PMC6853667 DOI: 10.1093/bioinformatics/btz482

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

The consequence of variants affecting protein sequence depends on the structural context and chemical environment. Understanding these elements has the potential of both uncovering the biochemical consequences of the change, and of identifying ‘hot spots’ where several variants from different individuals occur within close spatial proximity in the same protein. However, to benefit from the added information 3D protein structures can provide, an accurate mapping between genomic coordinates and the corresponding protein sequence, and structure, is required. Inaccurate mapping may lead to misleading variant interpretation. Alternative splicing makes mapping genomic coordinates to protein sequence non-trivial. As Figure 1A shows, a single coding region can be alternatively spliced into several different transcripts; which of these is expressed may depend on tissue type or developmental stage. Each transcript can result in a different isoform of the same protein. Choosing the relevant transcript is thus a complex matter. In most cases, one of the transcripts is identified as the ‘RefSeq Select transcript’, chosen according to criteria described by NCBI (O'Leary ), and has a corresponding protein sequence. Proteins in UniProt also have a reference, or ‘canonical’, sequence (UniProt, 2019). However, as the translated select RefSeq and canonical UniProt sequences are independently derived, they often differ [in 18% of cases in the ClinVar database (Landrum ) (Fig. 1C)]—resulting in different numbering of the residues.
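The numbering problem can be illustrated with a minimal sketch. The two toy sequences below are invented for illustration (they are not real RefSeq or UniProt sequences), and `residue_at` is a hypothetical helper, not part of VarMap. The second sequence lacks a three-residue segment present in the first, so the same residue carries a different position number in each.

```python
def residue_at(sequence: str, position: int) -> str:
    """Return the amino acid at a 1-based protein position."""
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    return sequence[position - 1]


# Toy isoforms (invented): isoform_b lacks the segment "GHK" found in isoform_a.
isoform_a = "MKTAYGHKLVREQ"
isoform_b = "MKTAYLVREQ"

# The same position number points at different residues in the two isoforms.
print(residue_at(isoform_a, 9), residue_at(isoform_b, 9))

# The same arginine is numbered 11 in isoform_a but 8 in isoform_b.
print(residue_at(isoform_a, 11), residue_at(isoform_b, 8))
```

A variant reported at a given protein position is therefore only meaningful together with the sequence it was numbered against.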

Fig. 1.

Mapping from genomic coordinates to protein sequence and structure. (A) Example missense variant observed on chromosome 12, position 123456, DNA change G/C. Three different transcripts are possible via alternative splicing. Transcript 1 is the longest and is designated as the RefSeq select reference transcript. Three protein isoforms can be created by translating the transcripts. Isoform 3 is designated as the canonical protein isoform in UniProt. The original DNA variant can be mapped onto isoforms 1 and 3, but not to isoform 2 as exon 3 has been spliced out. Isoforms 1 and 2 do not have a corresponding protein 3D structure, whereas isoform 3 does. VarMap maps from the isoform position to the position in the representative structure. (B) Simplified schema for mapping from variant genomic coordinates to protein sequence and structure using VarMap. A more detailed version is available in the Supplementary Materials and on the VarMap website. (C) The percentages of ClinVar variants belonging to a gene whose translated Select RefSeq transcript is identical to the UniProt canonical isoform sequence (black) and those belonging to a gene for which the two differ (grey). ClinVar file used: clinvar_20190211.vcf. (D) The percentage of genomic coordinates in ClinVar which are SNPs. (E) A breakdown of the SNP variant types. (F) The percentage of coding SNPs which can be mapped directly to the exact human structure and those which can be mapped to homologous structures. (G) Of the variants which can be mapped to structure, the number which have direct contacts with DNA, metals, ligands and protein as derived from every closely related protein for each variant. The VarMap output from the ClinVar dataset used here is available on the VarMap website. A description of the methods used to generate these plots is available in the Supplementary Material.

2 Materials and methods

The user uploads a tab-separated file of genomic coordinates, identifiers (optionally), reference and variant alleles. For files of fewer than 20 coordinates, VarMap runs in real time. For larger files, it runs in batch mode on a processor farm, and a link to the results is e-mailed to the user. VarMap performs a number of checks on the input data, including a GRCh37/GRCh38 assembly check via the Ensembl REST API (Fig. 1B). A locally installed copy of VEP is called for each coordinate and returns a list of transcripts, which are then paired with their associated isoforms. Also returned for each transcript are ENST, ENSG, HGVS identifiers (den Dunnen ), amino acid change, protein position, PolyPhen/SIFT score and VEP consequence. The transcript RefSeqs are retrieved from Ensembl BioMart. The UniProt canonical isoform is identified from the SWISS-PROT database. The amino acid identity at the position returned by VEP for the canonical isoform is checked against the corresponding position in the SWISS-PROT sequence. The RefSeq Select accession for each gene is retrieved from HGNC. The allele frequency of each variant in the natural population is retrieved from gnomAD. The amino acid conservation is calculated using the ScoreCons algorithm, while known disease associations for the amino acid position are retrieved from UniProt and ClinVar. CATH and Pfam domain memberships are also returned.

The UniProt canonical isoform sequence is searched against all PDBe sequences using FASTA. The alignments provide the mapping of the variant amino acid to its equivalent position in each 3D structure. The PDB accession code, chain, position and amino acid identity of the closest structure (according to alignment E-value) are provided, together with its resolution and sequence alignment quality. From this, and all other structure matches, information is taken from PDBsum about the variant residue’s context: whether it is a catalytic residue, is involved in a disulphide bond, or makes contact with DNA, protein, ligands or metals. This information is provided in the downloadable tab-separated file only.

Output to screen includes the transcripts relating to the UniProt canonical isoform, protein position, colour-coded CADD score (Rentzsch ) and PDB structure. When a position cannot be mapped to the canonical isoform, clicking ‘more info’ displays a table of all transcripts with further information. All additional annotations are included in the downloadable file. A more detailed description of these methods can be found in the Supplementary Material and on the VarMap website: https://www.ebi.ac.uk/thornton-srv/databases/VarMap.
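The idea of carrying a residue position across a pairwise alignment can be sketched as follows. This is a minimal illustration and not VarMap's own code: the function name `map_position`, its parameters and the toy aligned strings are assumptions, and the alignment is taken to be already available as two gapped strings of equal length (as a FASTA-style search would report), with `-` marking gaps.

```python
from typing import Optional


def map_position(aligned_query: str, aligned_target: str, query_pos: int,
                 query_start: int = 1, target_start: int = 1) -> Optional[int]:
    """Map a 1-based query residue number onto the target via a gapped alignment.

    Returns None when the query residue is aligned to a gap or lies outside
    the aligned region. All names here are hypothetical.
    """
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned strings must have equal length")
    q = query_start - 1   # number of the last query residue consumed
    t = target_start - 1  # number of the last target residue consumed
    for q_char, t_char in zip(aligned_query, aligned_target):
        if q_char != "-":
            q += 1
        if t_char != "-":
            t += 1
        if q_char != "-" and q == query_pos:
            return t if t_char != "-" else None
    return None


# Toy alignment (invented): the structure sequence lacks two query residues.
query_aln = "MKTAYGHKLV"
target_aln = "MKT--GHRLV"

print(map_position(query_aln, target_aln, 8))                   # maps to target residue 6
print(map_position(query_aln, target_aln, 4))                   # None: aligned to a gap
print(map_position(query_aln, target_aln, 8, target_start=25))  # offset target numbering
```

The `query_start` and `target_start` offsets allow for local alignments that begin part-way through either sequence; a real structure may also use author-assigned residue numbering, which this sketch does not handle.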

3 VarMap web tool

We present here the web tool ‘VarMap’ that automates the mapping of a list of single nucleotide polymorphisms (SNPs) to their corresponding UniProt canonical isoform sequence positions [via VEP (McLaren ) and SWISS-PROT (Boutet )] and their position in the closest 3D structure in PDBe (wwPDB consortium, 2019). In addition to a screen output VarMap provides a downloadable tab-separated file containing additional annotations at the DNA sequence, protein sequence and protein structure levels extracted from various resources to help explain the role and interchangeability of each variant. When a position cannot be mapped to the canonical isoform, alternative information is provided for other transcripts. Figure 1D–G shows how VarMap annotations can be used to analyze large datasets using ClinVar as an example. Figure 1D shows the proportion of variants that are SNPs, and of these the proportions that are coding. Figure 1E shows the variant types and Figure 1F shows that using homologous structures increases the proportion of variants that can be mapped to structure from 18 to 58%. Figure 1G demonstrates the wealth of information that can be extracted by considering all closely related structures. Tools that map only onto a single structure—and, furthermore, those that only perform the mapping if the protein structure is human—may lose this interaction data.

4 Discussion

In principle, the information provided by VarMap could be obtained manually using the following existing tools and databases: Ensembl (Cunningham ), VEP (McLaren ), UniProt (UniProt, 2019), SWISS-PROT (Boutet ), BioMart (Kinsella ), HGNC (Braschi ), CATH (Dawson ), Pfam (El-Gebali ), M-CSA (Ribeiro ), FASTA (Pearson, 2014), PDBsum (Laskowski ), ScoreCons (Valdar, 2002), gnomAD (Lek ) and ClinVar (Landrum ). However, this process would be prohibitively time-consuming for large datasets. Tools exist that are similar to parts of VarMap, such as VAI (Hinrichs ), varQ (Radusky ), G23D (Solomon ), StructMAn (Gress ), mutfunc (Wagih ) and Decipher (Firth ), but they do not address transcript and isoform mapping to the same degree, or provide the same breadth of structural annotations. VarMap has several additional features compared to existing tools, which make it especially useful for the analysis of large datasets. Firstly, the batch upload facility allows thousands of variants to be annotated concurrently, and the preservation of input IDs means that input lines can be directly cross-referenced with output lines. Secondly, VarMap partially annotates all transcript-isoform pairs, which may be important if variants are in non-reference transcripts or non-canonical isoforms. It also highlights whether the UniProt canonical isoform relates to the RefSeq select transcript. Thirdly, VarMap returns information on three aspects of each variant. At the DNA level, consequences and pathogenicity scores are returned, as well as the allele frequency of natural/disease associated variants at that position. At the protein sequence level, conservation is calculated and reported together with membership of Pfam and CATH domains, whether the residue represents a known catalytic site, and known disease associations with the affected amino acids. At the protein structure level, the representative structure is returned with its position and resolution, as are intermolecular interactions, taken from homologous structures, between the variant amino acid and ligands, proteins, nucleic acids and metals.

5 Conclusion

VarMap provides a wide range of annotations for single variants or for genomic-coordinate variant datasets of any size. It is envisaged that it will be useful for clinical geneticists with patient variant data and for researchers who wish to consider the environmental context and spatial distribution of genetic variants on protein structures. The data-rich, tab-separated output file can be sorted and filtered using simple parsing commands or spreadsheets, which require no expert knowledge of bioinformatics or structural biology.
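As an illustration of the kind of simple parsing such a file allows, the sketch below filters a tab-separated table using only the Python standard library. The column names (`ID`, `PDB_CODE`, `RESOLUTION`), the example rows and the helper `filter_rows` are all invented for this example; the actual VarMap output columns should be taken from the downloaded file itself.

```python
import csv
import io


def filter_rows(tsv_text: str, column: str, max_value: float) -> list:
    """Keep rows whose numeric value in `column` is at most `max_value`.

    Rows with a missing or non-numeric entry in that column are skipped.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = []
    for row in reader:
        try:
            value = float(row.get(column, ""))
        except (TypeError, ValueError):
            continue  # e.g. "-" or an empty field
        if value <= max_value:
            kept.append(row)
    return kept


# Invented example table; real column names may differ.
example_tsv = (
    "ID\tCHROM\tPOS\tPDB_CODE\tRESOLUTION\n"
    "var1\t12\t123456\t1abc\t1.8\n"
    "var2\t7\t654321\t2xyz\t3.2\n"
    "var3\t1\t111111\t-\t-\n"
)

# Keep variants mapped to a structure with resolution of 2.5 or better.
for row in filter_rows(example_tsv, "RESOLUTION", 2.5):
    print(row["ID"], row["PDB_CODE"], row["RESOLUTION"])
```

For a real file, the text can be read with `open(path, newline="")` and passed to `csv.DictReader` directly instead of going through `io.StringIO`.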

Funding

This work was supported by an EMBL-EBI/Sanger postdoctoral (ESPOD) fellowship (to J.D.S.).

Conflict of Interest: none declared.

References

1. HGVS Recommendations for the Description of Sequence Variants: 2016 Update.

Authors: Johan T den Dunnen; Raymond Dalgleish; Donna R Maglott; Reece K Hart; Marc S Greenblatt; Jean McGowan-Jordan; Anne-Francoise Roux; Timothy Smith; Stylianos E Antonarakis; Peter E M Taschner
Journal: Hum Mutat Date: 2016-03-25 Impact factor: 4.878

2. DECIPHER: Database of Chromosomal Imbalance and Phenotype in Humans Using Ensembl Resources.

Authors: Helen V Firth; Shola M Richards; A Paul Bevan; Stephen Clayton; Manuel Corpas; Diana Rajan; Steven Van Vooren; Yves Moreau; Roger M Pettett; Nigel P Carter
Journal: Am J Hum Genet Date: 2009-04-02 Impact factor: 11.025

3. Analysis of protein-coding genetic variation in 60,706 humans.

Authors: Monkol Lek; Konrad J Karczewski; Eric V Minikel; Kaitlin E Samocha; Eric Banks; Timothy Fennell; Anne H O'Donnell-Luria; James S Ware; Andrew J Hill; Beryl B Cummings; Taru Tukiainen; Daniel P Birnbaum; Jack A Kosmicki; Laramie E Duncan; Karol Estrada; Fengmei Zhao; James Zou; Emma Pierce-Hoffman; Joanne Berghout; David N Cooper; Nicole Deflaux; Mark DePristo; Ron Do; Jason Flannick; Menachem Fromer; Laura Gauthier; Jackie Goldstein; Namrata Gupta; Daniel Howrigan; Adam Kiezun; Mitja I Kurki; Ami Levy Moonshine; Pradeep Natarajan; Lorena Orozco; Gina M Peloso; Ryan Poplin; Manuel A Rivas; Valentin Ruano-Rubio; Samuel A Rose; Douglas M Ruderfer; Khalid Shakir; Peter D Stenson; Christine Stevens; Brett P Thomas; Grace Tiao; Maria T Tusie-Luna; Ben Weisburd; Hong-Hee Won; Dongmei Yu; David M Altshuler; Diego Ardissino; Michael Boehnke; John Danesh; Stacey Donnelly; Roberto Elosua; Jose C Florez; Stacey B Gabriel; Gad Getz; Stephen J Glatt; Christina M Hultman; Sekar Kathiresan; Markku Laakso; Steven McCarroll; Mark I McCarthy; Dermot McGovern; Ruth McPherson; Benjamin M Neale; Aarno Palotie; Shaun M Purcell; Danish Saleheen; Jeremiah M Scharf; Pamela Sklar; Patrick F Sullivan; Jaakko Tuomilehto; Ming T Tsuang; Hugh C Watkins; James G Wilson; Mark J Daly; Daniel G MacArthur
Journal: Nature Date: 2016-08-18 Impact factor: 49.962

4. Ensembl 2019.

Authors: Fiona Cunningham; Premanand Achuthan; Wasiu Akanni; James Allen; M Ridwan Amode; Irina M Armean; Ruth Bennett; Jyothish Bhai; Konstantinos Billis; Sanjay Boddu; Carla Cummins; Claire Davidson; Kamalkumar Jayantilal Dodiya; Astrid Gall; Carlos García Girón; Laurent Gil; Tiago Grego; Leanne Haggerty; Erin Haskell; Thibaut Hourlier; Osagie G Izuogu; Sophie H Janacek; Thomas Juettemann; Mike Kay; Matthew R Laird; Ilias Lavidas; Zhicheng Liu; Jane E Loveland; José C Marugán; Thomas Maurel; Aoife C McMahon; Benjamin Moore; Joannella Morales; Jonathan M Mudge; Michael Nuhn; Denye Ogeh; Anne Parker; Andrew Parton; Mateus Patricio; Ahamed Imran Abdul Salam; Bianca M Schmitt; Helen Schuilenburg; Dan Sheppard; Helen Sparrow; Eloise Stapleton; Marek Szuba; Kieron Taylor; Glen Threadgold; Anja Thormann; Alessandro Vullo; Brandon Walts; Andrea Winterbottom; Amonida Zadissa; Marc Chakiachvili; Adam Frankish; Sarah E Hunt; Myrto Kostadima; Nick Langridge; Fergal J Martin; Matthieu Muffato; Emily Perry; Magali Ruffier; Daniel M Staines; Stephen J Trevanion; Bronwen L Aken; Andrew D Yates; Daniel R Zerbino; Paul Flicek
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971

5. The Ensembl Variant Effect Predictor.

Authors: William McLaren; Laurent Gil; Sarah E Hunt; Harpreet Singh Riat; Graham R S Ritchie; Anja Thormann; Paul Flicek; Fiona Cunningham
Journal: Genome Biol Date: 2016-06-06 Impact factor: 13.583

6. UCSC Data Integrator and Variant Annotation Integrator.

Authors: Angie S Hinrichs; Brian J Raney; Matthew L Speir; Brooke Rhead; Jonathan Casper; Donna Karolchik; Robert M Kuhn; Kate R Rosenbloom; Ann S Zweig; David Haussler; W James Kent
Journal: Bioinformatics Date: 2016-01-06 Impact factor: 6.937

7. G23D: Online tool for mapping and visualization of genomic variants on 3D protein structures.

Authors: Oz Solomon; Vered Kunik; Amos Simon; Nitzan Kol; Ortal Barel; Atar Lev; Ninette Amariglio; Raz Somech; Gidi Rechavi; Eran Eyal
Journal: BMC Genomics Date: 2016-08-26 Impact factor: 3.969

8. The Pfam protein families database in 2019.

Authors: Sara El-Gebali; Jaina Mistry; Alex Bateman; Sean R Eddy; Aurélien Luciani; Simon C Potter; Matloob Qureshi; Lorna J Richardson; Gustavo A Salazar; Alfredo Smart; Erik L L Sonnhammer; Layla Hirsh; Lisanna Paladin; Damiano Piovesan; Silvio C E Tosatto; Robert D Finn
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971

9. Protein Data Bank: the single global archive for 3D macromolecular structure data.

Authors: wwPDB consortium
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971

10. Genenames.org: the HGNC and VGNC resources in 2019.

Authors: Bryony Braschi; Paul Denny; Kristian Gray; Tamsin Jones; Ruth Seal; Susan Tweedie; Bethan Yates; Elspeth Bruford
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971
