MetaDome: Pathogenicity analysis of genetic variants through aggregation of homologous human protein domains.

Laurens Wiel, Coos Baakman, Daan Gilissen, Joris A Veltman, Gerrit Vriend, Christian Gilissen.

Abstract

The growing availability of human genetic variation has given rise to novel methods of measuring genetic tolerance that better interpret variants of unknown significance. We recently developed a concept based on protein domain homology in the human genome to improve variant interpretation. For this purpose, we mapped population variation from the Exome Aggregation Consortium (ExAC) and pathogenic mutations from the Human Gene Mutation Database (HGMD) onto Pfam protein domains. The aggregation of these variation data across homologous domains into meta-domains allowed us to generate amino acid resolution of genetic intolerance profiles for human protein domains. Here, we developed MetaDome, a fast and easy-to-use web server that visualizes meta-domain information and gene-wide profiles of genetic tolerance. We updated the underlying data of MetaDome to contain information from 56,319 human transcripts, 71,419 protein domains, 12,164,292 genetic variants from gnomAD, and 34,076 pathogenic mutations from ClinVar. MetaDome allows researchers to easily investigate their variants of interest for the presence or absence of variation at corresponding positions within homologous domains. We illustrate the added value of MetaDome by an example that highlights how it may help in the interpretation of variants of unknown significance. The MetaDome web server is freely accessible at https://stuart.radboudumc.nl/metadome.

Keywords: ClinVar; Pfam; genetic tolerance; genetic variation; gnomAD; meta-domains; pathogenicity; protein domain homology; web server

Substances: Proteins

Year: 2019 PMID: 31116477 PMCID: PMC6772141 DOI: 10.1002/humu.23798

Source DB: PubMed Journal: Hum Mutat ISSN: 1059-7794 Impact factor: 4.878

INTRODUCTION

The continuous accumulation of human genomic data has spurred the development of new methods to interpret genetic variants. There are many freely available web servers and services that facilitate the use of these data by non‐bioinformaticians. For example, the ESP Exome Variant Server (Fu et al., 2012; NHLBI GO Exome Sequencing Project (ESP), 2011) and the Genome Aggregation Database (gnomAD) browser (Karczewski et al., 2017; Lek et al., 2016) help locate variants that occur frequently in the general population. These services are used for the interpretation of unknown variants based on the assumption that variants occurring frequently in the general population are unlikely to be relevant for patients with Mendelian disorders (Amr et al., 2016). There are also methods that derive information from these large human genetic databases. One example is genetic intolerance, which is commonly used to interpret variants of unknown significance by assessing whether variants stand out because they occur in regions that are genetically invariable in the general population (Ge et al., 2016; Gussow, Petrovski, Wang, Allen, & Goldstein, 2016). Examples of such methods are RVIS (Petrovski, Wang, Heinzen, Allen, & Goldstein, 2013) and subRVIS (Gussow et al., 2016). The strongest evidence for the pathogenicity of a genomic variant comes from the presence of that variant in any of the clinically relevant genetic variant databases such as the Human Gene Mutation Database (HGMD; Stenson et al., 2017) or the public archive of clinically relevant variants (ClinVar; Landrum et al., 2016). The amount of validated pathogenic information in these databases is growing gradually. Another way to provide evidence for the pathogenicity of a genomic variant is to observe the effect of that variant in homologous proteins across different species. 
Mutations at corresponding locations in homologous proteins are found to result in similar effects on protein stability (Ashenberg, Gong, & Bloom, 2013) and can facilitate variant interpretation between disease genes and their paralogues (Lal et al., 2017). Finding homologous proteins is one of the key applications of BLAST (Altschul, Gish, Miller, Myers, & Lipman, 1990). Transferring information between homologous proteins is one of the oldest concepts in bioinformatics, and can be achieved by performing a multiple sequence alignment (MSA) and locating equivalent positions between the protein sequences. We have previously used this concept and showed that it also holds for homologous Pfam protein domain relationships within the human genome. We found that ~71–72% of all disease‐causing missense variants from HGMD and ClinVar occur in regions translating to a Pfam protein domain and observed that pathogenic missense variants at equivalent domain positions are often paired with the absence of population‐based variation and vice versa (Wiel, Venselaar, Veltman, Vriend, & Gilissen, 2017). By aggregating variant information over homologous protein domains, the resolution of genetic tolerance per position is increased to the number of aligned positions. Similarly, the annotation of pathogenic variants found at equivalent domain positions also assists the interpretation of variants of unknown significance. This use of variant information from homologous protein domains was dubbed “meta‐domains.” We realized that this type of information could be of great benefit to the genetics community and therefore developed “MetaDome.” MetaDome is a freely available web server that uses our concept of meta‐domains to make optimal use of the information in population‐based and pathogenic variation datasets without the need for a bioinformatics intermediary. MetaDome is easy to use and incorporates the latest population and clinical datasets, gnomAD and ClinVar.

METHODS

Software architecture of MetaDome

MetaDome is developed in Python v3.5.1 (Rossum & Drake, 2010) and makes use of the Flask framework v0.12.4 (Ronacher, 2010) for the web server part, which communicates between the front‐end, the back‐end, and the database. The software architecture (Figure S1) follows the Domain‐driven design paradigm (Evans, 2004). The entities in the domain part of this software architecture are rich data representations that are based on the internal database (Methods: Creating the mapping database) and annotations from external resources. These entities are stored after their first creation and are afterward used directly for data retrieval to make lookups in MetaDome as efficient as possible. The code is open source and can be found at our GitHub repository: https://github.com/cmbi/metadome. Detailed instructions on how to deploy the MetaDome web server can be found there too. To ensure MetaDome can be deployed to any environment and to provide a high degree of modularity, we have containerized the application via Docker v17.12.1 (Hykes, 2013). We use docker‐compose v1.17.1 to ensure that the different containerized aspects of the MetaDome server can work together. The following aspects are containerized for this purpose: (a) the Flask application, (b) a PostgreSQL v10 database wherein the mapping database is stored, (c) a Celery v4.2.0 task queue management system to handle the larger tasks of the MetaDome web‐based user requests, (d) a Redis v4.0.11 store for task results, and (e) RabbitMQ v3.7 to mediate as a task broker between client and workers. For a full overview of the docker‐compose architecture we refer to Figure S2. The visualization medium of the MetaDome web server is a fully interactive and responsive HTML web page. This page is generated by the Flask framework and the navigation esthetics are made using the CSS framework Bulma v0.7.1 (Thomas, Potiekhin, Lauhakari, Shah, & Berning, 2018). 
The visualizations of the various landscapes and the schematic protein are created with JavaScript, JQuery v3.3.1 and the D3 Framework v4.13.0 (Bostock, Ogievetsky, & Heer, 2011). Because D3 renders in the user's browser, the responsiveness of the MetaDome visualizations depends strongly on the user's CPU power.

Datasets of population and disease‐causing genetic variation

MetaDome makes use of single nucleotide variants (SNVs) from population and clinically relevant genetic variation databases. Population variation was obtained from the gnomAD r2.0.2 VCF file by selecting all synonymous, nonsense, and missense variants that meet the PASS filter criteria. Variants meeting the PASS criteria are considered to be true variants (Lek et al., 2016). The variants in the VCF file from ClinVar release 2018 05 03 with disease‐causing (Pathogenic) status are used as the disease‐causing SNVs in MetaDome.
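The selection of population variants described above can be illustrated with a minimal Python sketch. It is not the MetaDome implementation. It assumes that consequence terms are stored in a VEP-style CSQ entry of the INFO column with the term in the second pipe-separated field (this should be checked against the header of the VCF in use), and that nonsense variants carry the Sequence Ontology term stop_gained. The function names and the example lines are hypothetical.

```python
# Sequence Ontology terms assumed to correspond to the three variant classes.
WANTED = {"synonymous_variant", "missense_variant", "stop_gained"}


def consequences(info):
    """Collect consequence terms from the CSQ entry of a VCF INFO column."""
    terms = set()
    for field in info.split(";"):
        if field.startswith("CSQ="):
            for entry in field[4:].split(","):
                parts = entry.split("|")
                if len(parts) > 1:
                    # VEP joins multiple terms for one transcript with '&'.
                    terms.update(parts[1].split("&"))
    return terms


def keep_variant(vcf_line):
    """Return True for a PASS single nucleotide variant of a wanted class."""
    cols = vcf_line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = cols[:8]
    # Multi-allelic records (ALT such as "A,T") are skipped by the length check.
    is_snv = len(ref) == 1 and len(alt) == 1 and ref in "ACGT" and alt in "ACGT"
    return flt == "PASS" and is_snv and bool(consequences(info) & WANTED)


# Made-up records: only the first one is a PASS missense SNV.
example_lines = [
    "7\t100\t.\tG\tA\t50\tPASS\tAC=1;CSQ=A|missense_variant|MODERATE|GENE1",
    "7\t101\t.\tG\tA\t50\tRF\tAC=1;CSQ=A|missense_variant|MODERATE|GENE1",
    "7\t102\t.\tG\tGA\t50\tPASS\tAC=1;CSQ=GA|frameshift_variant|HIGH|GENE1",
    "7\t103\t.\tC\tT\t50\tPASS\tAC=2;CSQ=T|intron_variant|MODIFIER|GENE1",
]
kept = [line for line in example_lines if keep_variant(line)]
```

In this sketch a record is kept as soon as any annotated transcript has one of the wanted consequences; a per-transcript selection would need the transcript identifier from the same CSQ entry.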

Creating the mapping database

MetaDome stores a complete mapping between genomic positions, protein positions, and all domain annotations (Figure S3) in a PostgreSQL relational database (PostgreSQL Global Development Group, 1996). This mapping is auto‐generated and stored in the PostgreSQL database by the MetaDome web server upon the first run. The genomic positions consist of each chromosomal position in the protein‐coding transcripts of the GENCODE release 19 GRCh37.p13 Basic set (Harrow et al., 2012). The protein positions correspond to protein sequence positions in the UniProtKB/Swiss‐Prot Release 2016_09 databank entries for the human species (Boutet et al., 2016). These mappings are created with Protein–Protein BLAST v2.2.31+ (Camacho et al., 2009) for each protein‐coding translation in the GENCODE Basic set to human canonical and isoform Swiss‐Prot protein sequences. We exclude sequences that do not start with a start codon (i.e., ATG encoding for methionine) or do not end with a stop codon. We also check whether the complementary DNA (cDNA) sequence of each transcript matches the GENCODE translation via Biopython's translate function (Cock et al., 2009); transcripts for which the two are not identical are excluded too. The global information on the transcript (e.g., identifiers and sequence length) is registered in the database in the table “genes” and, for each Swiss‐Prot entry with an identical sequence match, the global information is stored in the table “proteins.” All tables are indexed by the fields that are used in the lookups. Next, for each identical match between translation and Swiss‐Prot sequence a ClustalW2 v2.1 (Larkin et al., 2007) alignment is made between these two sequences. Each nucleotide's genomic position is mapped to the protein position and stored in the “mappings” table. Each entry in “mappings” represents a single nucleotide of a codon and is linked to the corresponding entry in the “genes” and “proteins” tables (i.e., the corresponding GENCODE translation, transcript, and Swiss‐Prot sequence). 
Each Swiss‐Prot sequence in the database is annotated via InterProScan v5.20–59.0 (Finn et al., 2017) for Pfam‐A v30.0 protein domains (Finn et al., 2016) and the results are stored in the “interpro_domains” table. After the construction of the database is finished, all meta‐domain alignments can be constructed.
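The transcript checks and the per‐nucleotide mapping described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the MetaDome code: a plain‐Python standard codon table stands in for Biopython's translate function, the requirement that the coding sequence length is a multiple of three is an added assumption, and the function names and the exon coordinates in the usage note are hypothetical.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cds):
    """Translate a coding sequence; unknown codons become 'X'."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


def passes_checks(cds, annotated_protein):
    """Apply the start codon, stop codon, and translation identity checks."""
    cds = cds.upper()
    if len(cds) % 3 != 0 or not cds.startswith("ATG"):
        return False
    protein = translate(cds)
    if not protein.endswith("*"):
        return False
    # Compare without the terminal stop symbol.
    return protein[:-1] == annotated_protein


def map_positions(cds_exons, strand="+"):
    """Return (genomic_position, codon_base, residue_number) per coding nucleotide.

    cds_exons holds inclusive, 1-based genomic intervals of coding sequence.
    """
    positions = []
    for start, end in sorted(cds_exons):
        positions.extend(range(start, end + 1))
    if strand == "-":
        positions.reverse()
    return [(g, i % 3 + 1, i // 3 + 1) for i, g in enumerate(positions)]
```

For example, `passes_checks("ATGGCCTAA", "MA")` accepts a two‐residue toy transcript, and `map_positions([(100, 102), (200, 202)])` assigns the three nucleotides of each made‐up exon to residues 1 and 2, mirroring one row per nucleotide in the “mappings” table.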

Composing a meta‐domain

Meta‐domains consist of homologous Pfam protein domain instances that are annotated using InterProScan, and are only constructed for domains that have at least two homologs within the human genome. MSAs are made using a three‐step process: (a) retrieve all sequences for the domain instances, (b) retrieve the Pfam HMM corresponding to the Pfam identifier annotated by InterProScan, and (c) use HMMER 3.1b2 (Finn et al., 2015) to align the sequences from the first step. The resulting Stockholm format MSA files can be inspected with alignment visualization software like Jalview (Waterhouse, Procter, Martin, Clamp, & Barton, 2009). In a Stockholm formatted file, all columns that correspond to the domain consensus represent the same homologous positions. These Stockholm files are retrieved by the MetaDome web server when a user requests meta‐domain information for a position of interest. Upon retrieval of a Stockholm file, the mapping database is used to obtain the corresponding genomic positions for each residue. These genomic positions are subsequently used to retrieve the corresponding gnomAD or ClinVar variation.
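How consensus columns tie homologous residues together can be shown with a small Python sketch. It is a simplified illustration, not the MetaDome code. It assumes a single‐block Stockholm file in which consensus columns are marked with "x" on a "#=GC RF" line, and sequence names of the form identifier/start-end that give the protein position of the first aligned residue. The toy alignment, the function names, and the variant positions are hypothetical.

```python
EXAMPLE = """# STOCKHOLM 1.0
P1/10-13    GA.KL
P2/5-9      GAsKL
P3/20-22    G-.KL
#=GC RF     xx.xx
//
"""


def parse_stockholm(text):
    """Return (sequences by name, consensus mask) from Stockholm text."""
    seqs, rf = {}, ""
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "//":
            continue
        if line.startswith("#=GC RF"):
            rf += line.split()[2]
        elif not line.startswith("#"):
            name, aligned = line.split()
            seqs[name] = seqs.get(name, "") + aligned
    return seqs, rf


def consensus_columns(seqs, rf):
    """Map each consensus position to the (name, protein position, residue) aligned to it."""
    columns = {}
    for name, aligned in seqs.items():
        protein_pos = int(name.split("/")[1].split("-")[0]) - 1
        consensus_pos = 0
        for symbol, mark in zip(aligned, rf):
            is_residue = symbol.isalpha()
            if is_residue:
                protein_pos += 1
            if mark == "x":
                consensus_pos += 1
                if is_residue:  # gaps at consensus columns are skipped
                    columns.setdefault(consensus_pos, []).append(
                        (name, protein_pos, symbol.upper()))
    return columns


def count_variants(columns, variant_positions):
    """Count, per consensus position, the aligned residues that carry a variant."""
    return {
        cpos: sum((name, pos) in variant_positions for name, pos, _ in residues)
        for cpos, residues in columns.items()
    }


seqs, rf = parse_stockholm(EXAMPLE)
columns = consensus_columns(seqs, rf)
# Hypothetical variants; the one at P2 position 7 sits in an insertion column.
counts = count_variants(columns, {("P1/10-13", 12), ("P3/20-22", 21), ("P2/5-9", 7)})
```

Residues in insertion columns advance the protein position but are never assigned to a consensus position, so variants found there do not contribute to the aggregated counts in this sketch.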

Computing genetic tolerance and generating a tolerance landscape

The nonsynonymous over synonymous ratio, or $d_N/d_S$ score, is used to quantify genetic tolerance. This score is based on the observed (obs) missense and synonymous variation in gnomAD ($\mathrm{missense}_{obs}$ and $\mathrm{synonymous}_{obs}$). This score is corrected for the sequence composition by taking into account the background (bg) of possible missense and synonymous variants based on the codon table ($\mathrm{missense}_{bg}$ and $\mathrm{synonymous}_{bg}$):

$$d_N/d_S = \frac{\mathrm{missense}_{obs} / \mathrm{missense}_{bg}}{\mathrm{synonymous}_{obs} / \mathrm{synonymous}_{bg}}$$

The tolerance landscape computes this ratio as a sliding window of size 21 (i.e., 10 residues before and 10 after the residue of interest) over the entirety of the gene's protein, similar to the Missense Tolerance Ratio (MTR) presented by Traynelis et al. (2017). The edges (e.g., start and end) are therefore a bit noisy as they are not the result of averaging over a full‐length window.
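The score and its sliding window can be sketched in Python as follows. This is a minimal illustration, not the MetaDome implementation. It takes the score to be the observed missense count divided by the background missense count, over the same quotient for synonymous variants, and it assumes that the background is obtained by enumerating all nine single nucleotide changes of every codon in the window. Returning None when the ratio is undefined is also an assumption, since the text does not say how windows without synonymous variation are handled. All function names are hypothetical.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def background_counts(codon):
    """Count possible (missense, synonymous) single nucleotide changes of a codon."""
    missense = synonymous = 0
    original = CODON_TABLE[codon]
    for i, base in product(range(3), BASES):
        if base == codon[i]:
            continue
        mutated = CODON_TABLE[codon[:i] + base + codon[i + 1:]]
        if mutated == original:
            synonymous += 1
        elif mutated != "*":  # changes to a stop codon are nonsense, not missense
            missense += 1
    return missense, synonymous


def dn_ds(missense_obs, synonymous_obs, missense_bg, synonymous_bg):
    """Return the background-corrected ratio, or None when it is undefined."""
    if missense_bg == 0 or synonymous_bg == 0 or synonymous_obs == 0:
        return None
    return (missense_obs / missense_bg) / (synonymous_obs / synonymous_bg)


def tolerance_landscape(codons, missense_obs, synonymous_obs, flank=10):
    """Score every residue using a window of up to 2 * flank + 1 residues.

    missense_obs and synonymous_obs hold observed variant counts per residue.
    Windows are truncated at the start and end of the protein.
    """
    scores = []
    for centre in range(len(codons)):
        lo, hi = max(0, centre - flank), min(len(codons), centre + flank + 1)
        bg = [background_counts(c) for c in codons[lo:hi]]
        scores.append(dn_ds(sum(missense_obs[lo:hi]), sum(synonymous_obs[lo:hi]),
                            sum(m for m, _ in bg), sum(s for _, s in bg)))
    return scores
```

With the default `flank=10` the window spans 21 residues; near the termini it shrinks, which is why the first and last ten scores rest on fewer observations.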

RESULTS

Accessibility

The MetaDome web server is freely accessible at https://stuart.radboudumc.nl/metadome. MetaDome features a user‐friendly web interface and a fully interactive tour to get familiar with all parts of the analysis and visualizations. All source code and detailed configuration instructions are available in our GitHub repository: https://github.com/cmbi/metadome.

The underlying database: A mapping between genes and proteins

The MetaDome web server queries genomic datasets to annotate positions in a protein or a protein domain. Therefore, the server needs access to genomic positional information as well as protein sequence and protein domain information. The database maps GENCODE gene translations to entries in the UniProtKB/Swiss‐Prot databank in a per‐position manner, together with the corresponding protein domains and genomic variation. With respect to our criteria to map gene translations to proteins (Methods: Creating the mapping database), 42,116 of the 56,319 full‐length protein‐coding GENCODE Basic transcripts for 19,728 human genes are linked to 33,492 of the 42,130 Swiss‐Prot human canonical or isoform sequences. Of the total 591,556 canonical and isoform sequences present in Swiss‐Prot, 42,130 are human. The resulting mappings contain 32,595,355 unique genomic positions that are linked to 19,226,961 residues in Swiss‐Prot protein sequences. A total of 71,419 Pfam domains are linked to 30,406 of the Swiss‐Prot sequences in our database. These Pfam domain instances belong to 5,948 unique Pfam domain families, of which 3,334 have two or more homologs and are therefore suitable for meta‐domain construction. Thus, by incorporating every protein‐coding transcript, instead of only the longest ones, we increase the number of meta‐domains from the previous 2,750 (Wiel et al., 2017) to 3,334. These meta‐domains, on average, consist of 16 human protein domain homologs with a protein sequence length of 158 residues. Table 1 summarizes the counting statistics for sequences, domains, and so forth.

Table 1

Statistics on the number of entries present in GENCODE, Swiss‐Prot, and our mapping database

Database	What	Number of entries
GENCODE	Protein‐coding genes	20,345
MetaDome	Protein‐coding genes	19,728
GENCODE	Protein‐coding transcripts	57,005
MetaDome	Protein‐coding transcripts	56,319
Swiss‐Prot	Canonical and isoform protein sequences	591,556
Swiss‐Prot	Human canonical and isoform protein sequences	42,130
MetaDome	Gene translations identically mapped to a canonical or isoform protein sequence	42,116
MetaDome	Canonical and isoform protein sequences	33,492
MetaDome	Pfam protein domain regions	71,419
MetaDome	Unique Pfam protein domain families	5,948
MetaDome	Unique Pfam protein domain families with two or more within‐human occurrences	3,334
MetaDome	Chromosome to protein position mappings	70,261,143
MetaDome	Unique chromosome positions	32,595,355
MetaDome	Unique residues (as part of a protein)	19,226,961
MetaDome	Unique protein sequences with at least one Pfam domain annotated	30,406

How to use the MetaDome web server

At the welcome page users are offered the option to start an interactive tour or to start with the analysis. The navigation bar at the top is available throughout all web pages in MetaDome and allows for further navigation to the “About,” “Method,” and “Contact” pages (Figure S4). The user can fill in a gene symbol in the “gene of interest” field and is aided by auto‐completion to find the gene of interest more easily (Figure S5). Clicking the “Get transcripts” button fills the dropdown box with all GENCODE transcripts for that gene. Only the transcripts that are mapped to a Swiss‐Prot protein can be used in the analysis; the others are displayed in gray (Figure S6). Clicking the “Start Analysis” button starts an extensive query to the back‐end of the web server for the selected transcript. First, all the mappings are retrieved for the transcript of interest. Second, the entire transcript is annotated with ClinVar and gnomAD single nucleotide variants (SNVs) and Pfam domains. Third, if there are any Pfam domains suitable for meta‐domain relations then all mappings for those regions are gathered and annotated with ClinVar and gnomAD variation (Methods: Composing a meta‐domain). The web page provided to the user as a result of “Analyse Protein” can best be explained using an example. Therefore, we have generated this result for gene CDK13 and transcript “ENST00000181839.4” (Figure 1). The result page features four main components that we will describe from top to bottom. Located at the top is the graph control field. Directly below the graph control is the landscape view of the protein. Below the landscape view are a schematic and interactive representation of the protein and an additional representation of the protein that controls the zooming option. Lastly, at the bottom of the page there is the list of selected positions. All of these components are interactive and the various functionalities are described in Table 2.

Figure 1

Table 2

Descriptions of the various functionalities on the MetaDome result page

Component	Functionality
Gene and transcript input field (Figure 1.1)	Input of gene of interest
	Retrieving transcripts for gene of interest
	Selecting a transcript
	Starting the analysis for selected transcript
Graph control field (Figure 1.2)	Toggling between different landscape representations
	Reset the zoom on the landscape
	Reset the web‐page
	Toggle ClinVar variants to be displayed in the schematic protein
	Download the visual representation
Landscape view (Figure 1.3)	Displays the meta‐domain landscape
	Displays the tolerance landscape
Schematic protein (Figure 1.4)	Displays a schematic representation of the gene's protein with Pfam protein domains annotated
	Hovering over a position displays positional information
	Clicking on a position highlights the position and adds the position to the list of “Selected Positions”
	Controls the zooming of particular parts of the protein (Figure 1.5)
Selected positions (Figure 1.6)	Displays any positions selected in the schematic protein
	Displays per selected position: if that position is part of a Pfam protein domain, any known gnomAD or ClinVar variants present at this position, and any variants that are homologously related to this position
	Provides more detailed information as a pop‐up when clicking on one of the positions in this list.

MetaDome web server result for the gene CDK13. The result provided by the MetaDome web server for the analysis of gene CDK13 with transcript ENST00000181839.4, as provided in (1). In (2), there is additional information that the translation of this transcript corresponds to Swiss‐Prot protein Q14004. Here also various alternative visualizations can be selected. The visualization starts by default in the “meta‐domain landscape,” a mode selectable in the graph control in (2). The landscapes are visualized in (3), and in the meta‐domain landscape the domain regions are annotated with missense variation counts found in homologous domains as bar plots. The schematic protein representation, located at (4), is per‐position selectable, and the domains are presented as purple blocks. Selected positions are highlighted in green. The “Zoom‐in” section at (5) features a selectable grayed‐out copy of the schematic protein representation that can zoom in on any part of the protein. Any selected positions are in the list of selected positions in (6). Here more information can be obtained by clicking on one of these positions. The functionality of each component is described in detail in Table 2.

Another way to use population‐based variation in the context of the entire protein is via the tolerance landscape representation in MetaDome that can be selected in the graph control component (Figure 1.2). The tolerance landscape depicts a missense over synonymous ratio (also known as $K_a/K_s$, or $d_N/d_S$) over a sliding window of 21 residues over the entirety of the protein of interest (i.e., calculated for 10 residues left and right of each residue) based on the gnomAD dataset (Methods: Computing genetic tolerance and generating a tolerance landscape; Figure 2a). 
Previously, the $d_N/d_S$ metric has been used by others and us to measure genetic tolerance and predict disease genes (Ge, Kwok, & Shieh, 2015; Gilissen et al., 2014; Lelieveld et al., 2017), and it is suitable for measuring tolerance in regions within genes (Ge et al., 2016).

Figure 2

Examples of a MetaDome analysis for the gene CDK13. (a) The tolerance landscape depicts a missense over synonymous ratio calculated as a sliding window over the entirety of the protein (Methods: Computing genetic tolerance and generating a tolerance landscape). The missense and synonymous variation are annotated from the gnomAD dataset and the landscape provides some indication of regions that are intolerant to missense variation. In this CDK13 tolerance landscape the Pkinase Pfam protein domain (PF00069) in purple can be clearly seen as intolerant compared with other parts of this protein. The red bars in the schematic protein representation correspond to pathogenic ClinVar variants found in this gene and in homologous protein domains. All of these variants are contained in the intolerant region of the landscape. (b) A zoom‐in on the meta‐domain landscape for CDK13. The Pkinase Pfam protein domain (PF00069) is located between protein positions 707 and 998 and annotated as a purple box in the schematic protein representation. The meta‐domain landscape displays a deep annotation of the protein domain: the green (gnomAD) and red (ClinVar) bars correspond to the number of missense variants found at aligned homologous positions. Unaligned positions are annotated as black bars. All of this information is displayed upon hovering over these various elements. (c) The positional information provides a detailed overview of a position from the “Selected Positions” list, especially if that position is aligned to domain homologs. Here, for position p.Gly714, (1) shows the positional details for this specific protein position and (2) shows any known pathogenic information for this position; for this position there are two known pathogenic missense variants. In (3) meta‐domain information is displayed and we can observe that p.Gly714 is aligned to consensus position 10 in the Pkinase Pfam protein domain and related to 329 other codons. 
This consensus position has an alignment coverage of 93.5% for the meta‐domain MSA. There are also four pathogenic variants found in ClinVar on corresponding homologous positions, as can be seen in (4), and in (5) there is an overview of all corresponding variants found in gnomAD. MSA, multiple sequence alignment.

An example of using the MetaDome web server for variant interpretation

The MetaDome analysis for CDK13 (Figure 1) is performed for the longest protein‐coding transcript, with a protein sequence length of 1,512 amino acids. In the resulting schematic protein representation we can observe the Pkinase Pfam protein domain (PF00069) between positions 707 and 998 as the only protein domain in this gene (Figure 2b). The Pkinase domain is highly prevalent throughout the human genome with as many as 779 homologous occurrences in human proteins, of which 353 are unique genomic regions. It is the eighth most frequently occurring domain in our mapping database. The meta‐domain landscape is the default view mode and shows any missense variation found in homologous domain occurrences throughout the human genome. Population‐based (gnomAD) missense variation is displayed as green bars and pathogenic (ClinVar) missense variation as red bars, with the height of the bars depicting the number of variants found at each position (Figure 2b). Under “Display ClinVar variants” the user is offered two options: to highlight all pathogenic information known for the current protein and/or to highlight any ClinVar variants that are present at homologous positions (Figure 2a). All highlighted ClinVar variants are displayed in red. In total, six known disease‐causing SNVs are present in the CDK13 gene itself according to ClinVar, and these all fall within the Pkinase protein domain. All of these are missense variants. If we add variants found in homologous domains there are 64 positions with one or more reported pathogenic variants (Data S1). Four of these positions overlap with the positions on which ClinVar variants were found in the gene itself. On position p.883 (Figure S7) we can observe a peak of eight missense variants annotated from other protein domains. MetaDome helps to examine a position of interest in more detail. 
If we do this for protein position 714 (Figure 2c) in CDK13 we find that it corresponds to consensus position 10 in the Pkinase domain (PF00069). At this position in CDK13 there are two variants reported in ClinVar: p.Gly714Arg (ClinVar ID: 375738) submitted by Sifrim et al. (2016), and p.Gly714Asp (ClinVar ID: 449224) submitted by GeneDx. The first is reported as a de novo variant and is associated with Congenital Heart Defects, Dysmorphic Facial Features, and Intellectual Developmental Disorder. For the second there is no associated phenotype provided. As MetaDome annotates variants reported at homologous positions, we can find even more information for this particular position. At homologs aligned to this position we find a variant of identical change in PRKD1: p.Gly600Arg (ClinVar ID: 375740) reported as pathogenic and de novo in the same study (Sifrim et al., 2016). It is also associated with Congenital Heart Defects as well as with Ectodermal Dysplasia. There are three more reported pathogenic variants aligned to this position: MAK:p.Gly13Ser (ClinVar ID: 29783) associated with Retinitis Pigmentosa 62 (Özgül et al., 2011), PRKCG:p.Gly360Ser (ClinVar ID: 42129) associated with Spinocerebellar Ataxia Type 14 (Klebe et al., 2005), and CIT:p.Gly106Val (ClinVar ID: 254134) associated with Microcephaly 17, primary, autosomal recessive (Özgül et al., 2011). These homologously related pathogenic variants and the severity of the associated phenotypes contribute to the evidence that the residue at this position may be important. Further evidence can be found in the fact that this residue is extremely conserved in human homolog domains. There are 330 unique genomic regions encoding for a codon aligned to this position (Data S2). Only in the gene PIK3R4 (ENST00000356763.3) does this codon encode for a residue other than Glycine, namely a Threonine at position p.Thr35. 
In the same way that we explored pathogenic ClinVar variation we can also explore the variation reported in gnomAD. In CDK13 at protein position 714 there is no reported variant in gnomAD, but there are homologously related variants. There are 65 missense variants with an average allele frequency of 1.24E−05 and 76 synonymous variants with an average allele frequency of 8.71E−03, and there is no reported nonsense variation (Data S1). When we inspect the tolerance landscape for CDK13 (Figure 2a) we can see that all of the ClinVar variants (either annotated in CDK13 or related via homologs) fall within the Pkinase Pfam protein domain (PF00069). In addition, the protein domain can clearly be seen as more intolerant to missense variation than other parts of this protein, thereby supporting the likely pathogenic role of the ClinVar variants.

CONCLUSION

The MetaDome web server combines resources and information from different fields of expertise (e.g., genomics and proteomics) to increase the power in analyzing population and pathogenic variation by transposing this variation to homologous protein domains. Such a transfer of information is achieved by a per‐position mapping between the GENCODE and Swiss‐Prot databases. 79.4% of the human Swiss‐Prot protein sequences are an identical match to one or more of 42,116 GENCODE transcripts. This means that 25.7% of the GENCODE transcripts differ in messenger RNA (mRNA) but translate to the same Swiss‐Prot protein sequence. GENCODE previously reported that this is due to alternative splicing, of which a substantial proportion only affects untranslated regions (UTRs) and thus has no impact on the protein‐coding part of the gene (Harrow et al., 2006). MetaDome is especially informative if a variant of interest falls within a protein domain that has homologs. This is highly likely as 43.6% of the positions in the MetaDome mapping database are part of a homologous protein domain. Pathogenic missense variation is also highly likely to fall within a protein domain, as we previously observed for 71% of HGMD and 72% of ClinVar pathogenic missense variants (Wiel et al., 2017). By aggregating variation over protein domain homologs via MetaDome, the resolution of genetic tolerance at a single amino acid is increased. Furthermore, we can obtain variation that could disrupt the functionality of a protein domain, as annotated throughout the entire human genome, which may potentially be disease‐causing. It should be noted that by aggregating genetic variation in this way the specific context, such as haplotype information or interactions with other proteins, may be lost. Aggregation via meta‐domains only encapsulates general biological or molecular functions attributed to the domain. 
Nonetheless, we believe MetaDome can be used to better interpret variants of unknown significance through the use of meta‐domains and tolerance landscapes, as we have shown in our example. As more genetic data accumulate in the years to come, MetaDome will become more and more accurate in predictions of intolerance at the base‐pair level and the meta‐domain landscapes will become even more populated with variation found in homologous protein domains. We can imagine many other ways in which integrating this type of information could be helpful for variant interpretation. Future directions for the MetaDome web server could lead to machine learning empowered variant effect prediction, or visualization of the meta‐domain information in a protein 3D structure.

31 in total

1. UniProtKB/Swiss-Prot, the Manually Annotated Section of the UniProt KnowledgeBase: How to Use the Entry View.

Authors: Emmanuel Boutet; Damien Lieberherr; Michael Tognolli; Michel Schneider; Parit Bansal; Alan J Bridge; Sylvain Poux; Lydie Bougueleret; Ioannis Xenarios
Journal: Methods Mol Biol Date: 2016

2. Mutational effects on stability are largely conserved during protein evolution.

Authors: Orr Ashenberg; L Ian Gong; Jesse D Bloom
Journal: Proc Natl Acad Sci U S A Date: 2013-12-09 Impact factor: 11.205

3. New mutations in protein kinase Cgamma associated with spinocerebellar ataxia type 14.

Authors: Stephan Klebe; Alexandra Durr; Alexander Rentschler; Valerie Hahn-Barma; Michael Abele; Naima Bouslam; Ludger Schöls; Pierre Jedynak; Sylvie Forlani; Elodie Denis; Christel Dussert; Yves Agid; Peter Bauer; Christoph Globas; Ullrich Wüllner; Alexis Brice; Olaf Riess; Giovanni Stevanin
Journal: Ann Neurol Date: 2005-11 Impact factor: 10.422

4. Biopython: freely available Python tools for computational molecular biology and bioinformatics.

Authors: Peter J A Cock; Tiago Antao; Jeffrey T Chang; Brad A Chapman; Cymon J Cox; Andrew Dalke; Iddo Friedberg; Thomas Hamelryck; Frank Kauff; Bartek Wilczynski; Michiel J L de Hoon
Journal: Bioinformatics Date: 2009-03-20 Impact factor: 6.937

5. GENCODE: producing a reference annotation for ENCODE.

Authors: Jennifer Harrow; France Denoeud; Adam Frankish; Alexandre Reymond; Chao-Kung Chen; Jacqueline Chrast; Julien Lagarde; James G R Gilbert; Roy Storey; David Swarbreck; Colette Rossier; Catherine Ucla; Tim Hubbard; Stylianos E Antonarakis; Roderic Guigo
Journal: Genome Biol Date: 2006-08-07 Impact factor: 13.583

6. Genic intolerance to functional variation and the interpretation of personal genomes.

Authors: Slavé Petrovski; Quanli Wang; Erin L Heinzen; Andrew S Allen; David B Goldstein
Journal: PLoS Genet Date: 2013-08-22 Impact factor: 5.917

7. InterPro in 2017-beyond protein family and domain annotations.

Authors: Robert D Finn; Teresa K Attwood; Patricia C Babbitt; Alex Bateman; Peer Bork; Alan J Bridge; Hsin-Yu Chang; Zsuzsanna Dosztányi; Sara El-Gebali; Matthew Fraser; Julian Gough; David Haft; Gemma L Holliday; Hongzhan Huang; Xiaosong Huang; Ivica Letunic; Rodrigo Lopez; Shennan Lu; Aron Marchler-Bauer; Huaiyu Mi; Jaina Mistry; Darren A Natale; Marco Necci; Gift Nuka; Christine A Orengo; Youngmi Park; Sebastien Pesseat; Damiano Piovesan; Simon C Potter; Neil D Rawlings; Nicole Redaschi; Lorna Richardson; Catherine Rivoire; Amaia Sangrador-Vegas; Christian Sigrist; Ian Sillitoe; Ben Smithers; Silvano Squizzato; Granger Sutton; Narmada Thanki; Paul D Thomas; Silvio C E Tosatto; Cathy H Wu; Ioannis Xenarios; Lai-Su Yeh; Siew-Yit Young; Alex L Mitchell
Journal: Nucleic Acids Res Date: 2016-11-29 Impact factor: 16.971

8. The ExAC browser: displaying reference data information from over 60 000 exomes.

Authors: Konrad J Karczewski; Ben Weisburd; Brett Thomas; Matthew Solomonson; Douglas M Ruderfer; David Kavanagh; Tymor Hamamsy; Monkol Lek; Kaitlin E Samocha; Beryl B Cummings; Daniel Birnbaum; Mark J Daly; Daniel G MacArthur
Journal: Nucleic Acids Res Date: 2016-11-28 Impact factor: 16.971

9. Distinct genetic architectures for syndromic and nonsyndromic congenital heart defects identified by exome sequencing.

Authors: Alejandro Sifrim; Marc-Phillip Hitz; Anna Wilsdon; Jeroen Breckpot; Saeed H Al Turki; Bernard Thienpont; Jeremy McRae; Tomas W Fitzgerald; Tarjinder Singh; Ganesh Jawahar Swaminathan; Elena Prigmore; Diana Rajan; Hashim Abdul-Khaliq; Siddharth Banka; Ulrike M M Bauer; Jamie Bentham; Felix Berger; Shoumo Bhattacharya; Frances Bu'Lock; Natalie Canham; Irina-Gabriela Colgiu; Catherine Cosgrove; Helen Cox; Ingo Daehnert; Allan Daly; John Danesh; Alan Fryer; Marc Gewillig; Emma Hobson; Kirstin Hoff; Tessa Homfray; Anne-Karin Kahlert; Ami Ketley; Hans-Heiner Kramer; Katherine Lachlan; Anne Katrin Lampe; Jacoba J Louw; Ashok Kumar Manickara; Dorin Manase; Karen P McCarthy; Kay Metcalfe; Carmel Moore; Ruth Newbury-Ecob; Seham Osman Omer; Willem H Ouwehand; Soo-Mi Park; Michael J Parker; Thomas Pickardt; Martin O Pollard; Leema Robert; David J Roberts; Jennifer Sambrook; Kerry Setchfield; Brigitte Stiller; Chris Thornborough; Okan Toka; Hugh Watkins; Denise Williams; Michael Wright; Seema Mital; Piers E F Daubeney; Bernard Keavney; Judith Goodship; Riyadh Mahdi Abu-Sulaiman; Sabine Klaassen; Caroline F Wright; Helen V Firth; Jeffrey C Barrett; Koenraad Devriendt; David R FitzPatrick; J David Brook; Matthew E Hurles
Journal: Nat Genet Date: 2016-08-01 Impact factor: 38.330

10. ClinVar: public archive of interpretations of clinically relevant variants.

Authors: Melissa J Landrum; Jennifer M Lee; Mark Benson; Garth Brown; Chen Chao; Shanmuga Chitipiralla; Baoshan Gu; Jennifer Hart; Douglas Hoffman; Jeffrey Hoover; Wonhee Jang; Kenneth Katz; Michael Ovetsky; George Riley; Amanjeev Sethi; Ray Tully; Ricardo Villamarin-Salomon; Wendy Rubinstein; Donna R Maglott
Journal: Nucleic Acids Res Date: 2015-11-17 Impact factor: 16.971

50 in total

1. De Novo Variants in CNOT1, a Central Component of the CCR4-NOT Complex Involved in Gene Expression and RNA and Protein Stability, Cause Neurodevelopmental Delay.

Authors: Lisenka E L M Vissers; Sreehari Kalvakuri; Elke de Boer; Sinje Geuer; Machteld Oud; Inge van Outersterp; Michael Kwint; Melde Witmond; Simone Kersten; Daniel L Polla; Dilys Weijers; Amber Begtrup; Kirsty McWalter; Anna Ruiz; Elisabeth Gabau; Jenny E V Morton; Christopher Griffith; Karin Weiss; Candace Gamble; James Bartley; Hilary J Vernon; Kendra Brunet; Claudia Ruivenkamp; Sarina G Kant; Paul Kruszka; Austin Larson; Alexandra Afenjar; Thierry Billette de Villemeur; Kimberly Nugent; F Lucy Raymond; Hanka Venselaar; Florence Demurger; Claudia Soler-Alfonso; Dong Li; Elizabeth Bhoj; Ian Hayes; Nina Powell Hamilton; Ayesha Ahmad; Rachel Fisher; Myrthe van den Born; Marjolaine Willems; Arthur Sorlin; Julian Delanne; Sebastien Moutton; Philippe Christophe; Frederic Tran Mau-Them; Antonio Vitobello; Himanshu Goel; Lauren Massingham; Chanika Phornphutkul; Jennifer Schwab; Boris Keren; Perrine Charles; Maaike Vreeburg; Lenika De Simone; George Hoganson; Maria Iascone; Donatella Milani; Lucie Evenepoel; Nicole Revencu; D Isum Ward; Kaitlyn Burns; Ian Krantz; Sarah E Raible; Jill R Murrell; Kathleen Wood; Megan T Cho; Hans van Bokhoven; Maximilian Muenke; Tjitske Kleefstra; Rolf Bodmer; Arjan P M de Brouwer
Journal: Am J Hum Genet Date: 2020-06-17 Impact factor: 11.025

2. A novel deletion variant in CLN3 with highly variable expressivity is responsible for juvenile neuronal ceroid lipofuscinoses.

Authors: Naser Gilani; Ehsan Razmara; Mehmet Ozaslan; Ihsan Kareem Abdulzahra; Saeid Arzhang; Ali Reza Tavasoli; Masoud Garshasbi
Journal: Acta Neurol Belg Date: 2021-03-30 Impact factor: 2.396

3. De Novo Variants in SPOP Cause Two Clinically Distinct Neurodevelopmental Disorders.

Authors: Maria J Nabais Sá; Geniver El Tekle; Arjan P M de Brouwer; Sarah L Sawyer; Daniela Del Gaudio; Michael J Parker; Farah Kanani; Marie-José H van den Boogaard; Koen van Gassen; Margot I Van Allen; Klaas Wierenga; Gabriela Purcarin; Ellen Roy Elias; Amber Begtrup; Jennifer Keller-Ramey; Tiziano Bernasocchi; Laurens van de Wiel; Christian Gilissen; Hanka Venselaar; Rolph Pfundt; Lisenka E L M Vissers; Jean-Philippe P Theurillat; Bert B A de Vries
Journal: Am J Hum Genet Date: 2020-02-27 Impact factor: 11.025

4. De Novo KAT5 Variants Cause a Syndrome with Recognizable Facial Dysmorphisms, Cerebellar Atrophy, Sleep Disturbance, and Epilepsy.

Authors: Jonathan Humbert; Smrithi Salian; Periklis Makrythanasis; Gabrielle Lemire; Justine Rousseau; Sophie Ehresmann; Thomas Garcia; Rami Alasiri; Armand Bottani; Sylviane Hanquinet; Erin Beaver; Jennifer Heeley; Ann C M Smith; Seth I Berger; Stylianos E Antonarakis; Xiang-Jiao Yang; Jacques Côté; Philippe M Campeau
Journal: Am J Hum Genet Date: 2020-08-20 Impact factor: 11.025

5. Haploinsufficiency of the Notch Ligand DLL1 Causes Variable Neurodevelopmental Disorders.

Authors: Björn Fischer-Zirnsak; Lara Segebrecht; Max Schubach; Perrine Charles; Emily Alderman; Kathleen Brown; Maxime Cadieux-Dion; Tracy Cartwright; Yanmin Chen; Carrie Costin; Sarah Fehr; Keely M Fitzgerald; Emily Fleming; Kimberly Foss; Thoa Ha; Gabriele Hildebrand; Denise Horn; Shuxi Liu; Elysa J Marco; Marie McDonald; Kirsty McWalter; Simone Race; Eric T Rush; Yue Si; Carol Saunders; Anne Slavotinek; Sylvia Stockler-Ipsiroglu; Aida Telegrafi; Isabelle Thiffault; Erin Torti; Anne Chun-Hui Tsai; Xin Wang; Muhammad Zafar; Boris Keren; Uwe Kornak; Cornelius F Boerkoel; Ghayda Mirzaa; Nadja Ehmke
Journal: Am J Hum Genet Date: 2019-07-25 Impact factor: 11.025

6. Biallelic MADD variants cause a phenotypic spectrum ranging from developmental delay to a multisystem disorder.

Authors: Pauline E Schneeberger; Fanny Kortüm; Georg Christoph Korenke; Malik Alawi; René Santer; Mathias Woidy; Daniela Buhas; Stephanie Fox; Jane Juusola; Majid Alfadhel; Bryn D Webb; Emanuele G Coci; Rami Abou Jamra; Manuela Siekmeyer; Saskia Biskup; Corina Heller; Esther M Maier; Poupak Javaher-Haghighi; Maria F Bedeschi; Paola F Ajmone; Maria Iascone; Hilde Peeters; Katleen Ballon; Jaak Jaeken; Aroa Rodríguez Alonso; María Palomares-Bralo; Fernando Santos-Simarro; Marije E C Meuwissen; Diane Beysen; R Frank Kooy; Henry Houlden; David Murphy; Mohammad Doosti; Ehsan G Karimiani; Majid Mojarrad; Reza Maroofian; Lenka Noskova; Stanislav Kmoch; Tomas Honzik; Heidi Cope; Amarilis Sanchez-Valle; Bruce D Gelb; Ingo Kurth; Maja Hempel; Kerstin Kutsche
Journal: Brain Date: 2020-08-01 Impact factor: 13.501

7. Association of SLC32A1 Missense Variants With Genetic Epilepsy With Febrile Seizures Plus.

Authors: Sarah E Heron; Brigid M Regan; Rebekah V Harris; Alison E Gardner; Matthew J Coleman; Mark F Bennett; Bronwyn E Grinton; Katherine L Helbig; Michael R Sperling; Sheryl Haut; Eric B Geller; Peter Widdess-Walsh; James T Pelekanos; Melanie Bahlo; Slavé Petrovski; Erin L Heinzen; Michael S Hildebrand; Mark A Corbett; Ingrid E Scheffer; Jozef Gécz; Samuel F Berkovic
Journal: Neurology Date: 2021-03-23 Impact factor: 9.910

8. TRIM71 Deficiency Causes Germ Cell Loss During Mouse Embryogenesis and Is Associated With Human Male Infertility.

Authors: Lucia A Torres-Fernández; Jana Emich; Yasmine Port; Sibylle Mitschka; Marius Wöste; Simon Schneider; Daniela Fietz; Manon S Oud; Sara Di Persio; Nina Neuhaus; Sabine Kliesch; Michael Hölzel; Hubert Schorle; Corinna Friedrich; Frank Tüttelmann; Waldemar Kolanus
Journal: Front Cell Dev Biol Date: 2021-05-13

9. Haploinsufficiency of ARFGEF1 is associated with developmental delay, intellectual disability, and epilepsy with variable expressivity.

Authors: Quentin Thomas; Thierry Gautier; Dana Marafi; Thomas Besnard; Marjolaine Willems; Sébastien Moutton; Bertand Isidor; Benjamin Cogné; Solène Conrad; Romano Tenconi; Maria Iascone; Arthur Sorlin; Alice Masurel; Tabib Dabir; Adam Jackson; Siddharth Banka; Julian Delanne; James R Lupski; Nebal Waill Saadi; Fowzan S Alkuraya; Fatema Al Zahrani; Pankaj B Agrawal; Eleina England; Jill A Madden; Jennifer E Posey; Lydie Burglen; Diana Rodriguez; Martin Chevarin; Sylvie Nguyen; Frédéric Tran Mau-Them; Yannis Duffourd; Philippine Garret; Ange-Line Bruel; Patrick Callier; Nathalie Marle; Anne-Sophie Denomme-Pichon; Laurence Duplomb; Christophe Philippe; Christel Thauvin-Robinet; Jérôme Govin; Laurence Faivre; Antonio Vitobello
Journal: Genet Med Date: 2021-06-10 Impact factor: 8.822

10. Cadherin 2-Related Arrhythmogenic Cardiomyopathy: Prevalence and Clinical Features.

Authors: Alice Ghidoni; Perry M Elliott; Petros Syrris; Hugh Calkins; Cynthia A James; Daniel P Judge; Brittney Murray; Julien Barc; Vincent Probst; Jean Jacques Schott; Jiang-Ping Song; Richard N W Hauer; Edgar T Hoorntje; J Peter van Tintelen; Eric Schulze-Bahr; Robert M Hamilton; Kirti Mittal; Christopher Semsarian; Elijah R Behr; Michael J Ackerman; Cristina Basso; Gianfranco Parati; Davide Gentilini; Maria-Christina Kotta; Bongani M Mayosi; Peter J Schwartz; Lia Crotti
Journal: Circ Genom Precis Med Date: 2021-02-10