
CDG: An Online Server for Detecting Biologically Closest Disease-Causing Genes and its Application to Primary Immunodeficiency.

David Requena^1, Patrick Maffucci^1,2,3, Benedetta Bigio^1, Lei Shang^1, Avinash Abhyankar^4, Bertrand Boisson^1,5,6, Peter D Stenson^7, David N Cooper^7, Charlotte Cunningham-Rundles^2,3, Jean-Laurent Casanova^1,5,6,7,8,9, Laurent Abel^1,5,6, Yuval Itan^10,11.

Abstract

High-throughput genomic technologies yield about 20,000 variants in the protein-coding exome of each individual. A commonly used approach to select candidate disease-causing variants is to test whether the associated gene has been previously reported to be disease-causing. In the absence of known disease-causing genes, it can be challenging to associate candidate genes with specific genetic diseases. To facilitate the discovery of novel gene-disease associations, we determined the putative biologically closest known genes and their associated diseases for 13,005 human genes not currently reported to be disease-associated. We used these data to construct the closest disease-causing genes (CDG) server, which can be used to infer the closest genes with an associated disease for a user-defined list of genes or diseases. We demonstrate the utility of the CDG server in five immunodeficiency patient exomes across different diseases and modes of inheritance, where CDG dramatically reduced the number of candidate genes to be evaluated. This resource will be a considerable asset for ascertaining the potential relevance of genetic variants found in patient exomes to specific diseases of interest. The CDG database and online server are freely available to non-commercial users at: http://lab.rockefeller.edu/casanova/CDG.


Keywords: disease-causing gene; gene filtering; genomics; human gene connectome; next-generation sequencing

Year: 2018 PMID: 29997612 PMCID: PMC6030251 DOI: 10.3389/fimmu.2018.01340

Source DB: PubMed Journal: Front Immunol ISSN: 1664-3224 Impact factor: 7.561

Introduction

Genetic mutations have been found to underlie a large number of inherited human diseases. In the past decade, refinements in next-generation sequencing (NGS) techniques have made it possible to detect the full set of gene variants in patients. The average human genome contains about 20,000 coding variants and hundreds of thousands of non-coding variants (1). A common approach to identifying candidate variants for further investigation from NGS data involves screening for those in known disease-causing genes (2–4). However, variants in novel disease-associated genes should be assessed by computational prediction (5). Databases such as the Human Gene Mutation Database [HGMD (6)] and ClinVar (7, 8) provide manually curated information about mutations in known disease-causing genes, also known as the Clinome (9). Several methods, including the Search Tool for the Retrieval of Interacting Genes/Proteins [STRING (10)], Exomiser, which prioritizes genetic variants from a VCF file (11), the Probabilistic functional gene network of Homo sapiens [HumanNet (12)], and Functional Coupling [FunCoup (13, 14)], can be used to assess human genes directly connected to candidate genes. The human gene connectome [HGC (15)] extends these approaches by prioritizing candidate genes according to their computed biological distances from known disease-causing genes. We generated a complementary resource, the closest disease-causing genes (CDG) database and server, to identify novel gene-disease associations. CDG computes the biologically closest known disease-causing genes and corresponding diseases for 13,005 human candidate genes not currently observed to be disease-causing, allowing investigators to associate these candidate genes with known disease phenotypes. We demonstrate the efficiency of this method in five patients with various primary immunodeficiencies and modes of inheritance, in whom CDG substantially reduced the number of candidate genes (see Supplementary Material, Section 2 for details). CDG also identifies novel gene candidates for lists of diseases defined by an investigator. Thus, this resource provides a reference for the potential relevance of novel candidate genes to specific disease phenotypes, simplifying the analysis of NGS data.

Materials and Methods

CDG Generation

The Human Gene Mutation Database (HGMD) is a manually curated database of variants that may be associated with or predisposing to human genetic conditions (16, 17). From the HGMD March 2015 public full version (updated through December 2014), we selected 5,430 HGMD genes with mutations classified as high-quality disease-causing or disease-associated (mostly linked to monogenic diseases). We next identified 13,005 protein-coding genes present in the HGC that are not currently reported to be disease-causing in the HGMD database. Briefly, the HGC (15) is a network of all human genes (represented as nodes), where each edge represents the direct biological distance between two human genes. Direct biological distance is defined as the inverse of the confidence score for binding connectivity provided by STRING (10). The HGC biological distance between any two genes is defined as the weighted sum of direct distances along the shortest path connecting them (calculated using the Dijkstra algorithm) on the network containing most protein-coding human genes. For each of these 13,005 genes, we calculated its biologically closest disease-causing genes (CDGs) and associated diseases by first retrieving the corresponding connectome for the gene from the HGC database (15, 18). A gene-specific connectome contains, for any given human gene, the set of all other human genes ranked by their biological distance to that specific gene. Then, following the HGC criterion for biological relatedness, we selected only the HGMD known disease-causing genes in the connectome with p < 0.01. Additionally, we assigned the corresponding human phenotype ontology codes [HPO (19)] to each gene-phenotype association (Figure S1 in Supplementary Material). A summary of the CDGs, diseases, and routes associated with each of the 13,005 genes not currently known to be disease-causing is provided in Table S1 in Supplementary Material.
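The distance computation described above can be illustrated with a minimal sketch. It assumes a toy network with placeholder gene names and made-up confidence scores, takes the direct distance as 1/confidence, and uses a textbook Dijkstra search; the function names are hypothetical, and the sketch does not reproduce the HGC p-value cutoff or any other detail of the actual HGC implementation.

```python
import heapq

def direct_distance(confidence):
    """Direct biological distance, taken here as the inverse of a confidence score."""
    return 1.0 / confidence

def shortest_distances(edges, source):
    """Dijkstra search: summed direct distances from `source` to each reachable gene."""
    graph = {}
    for gene_a, gene_b, confidence in edges:
        weight = direct_distance(confidence)
        graph.setdefault(gene_a, []).append((gene_b, weight))
        graph.setdefault(gene_b, []).append((gene_a, weight))
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, gene = heapq.heappop(heap)
        if d > dist.get(gene, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for neighbor, weight in graph.get(gene, []):
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist

def closest_disease_genes(edges, query, disease_genes):
    """Rank known disease genes by their distance to the query gene."""
    dist = shortest_distances(edges, query)
    hits = [(g, d) for g, d in dist.items() if g in disease_genes and g != query]
    return sorted(hits, key=lambda item: item[1])

# Toy network: placeholder gene names and illustrative confidence scores.
edges = [
    ("GENE_A", "GENE_B", 0.5),    # direct distance 2.0
    ("GENE_B", "GENE_C", 0.25),   # direct distance 4.0
    ("GENE_A", "GENE_C", 0.125),  # direct distance 8.0
    ("GENE_C", "GENE_D", 1.0),    # direct distance 1.0
]
ranked = closest_disease_genes(edges, "GENE_A", {"GENE_C", "GENE_D"})
print(ranked)  # [('GENE_C', 6.0), ('GENE_D', 7.0)]
```

In this toy case the route from GENE_A to GENE_C through GENE_B (2.0 + 4.0) is shorter than the direct edge (8.0), which is the kind of indirect relationship that a shortest-path distance captures and a direct-neighbor lookup would miss.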

Validation

We validated CDG and compared its performance with that of FunCoup and HumanNet using a validation set of genes not used during the construction of the original CDG database. As the validation set, we used two external datasets: (1) a new HGMD dataset, containing 339 disease-causing genes added between January and September 2015 (i.e., not used to construct CDG); and (2) the pathogenic genes from ClinVar not present in HGMD, comprising 84 genes. We calculated the CDG for each of these genes as described above and compared the performance of CDG versus FunCoup and HumanNet in terms of the number of predicted genes and how many predicted diseases coincided with the reported disease. As FunCoup and HumanNet do not associate diseases, we retrieved the disease names related to each predicted gene from HGMD. To compare the predicted and expected disease names, we implemented the following phrase-comparison procedure in CDG: (1) the disease names were first compared for an exact match; (2) a “starts-with” comparison was then applied, which matches when one phrase starts exactly with the other, or vice versa; and (3) if no matches were found at this point, we used the Levenshtein distance algorithm (20). All comparisons between disease names for the validation dataset were verified manually.
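The three-step phrase comparison can be sketched as follows. This is an illustrative reading of the procedure, not the server's code: the case-insensitive normalization, the `max_distance` threshold of 3 edits, and the function names are all assumptions introduced for the example.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        cur = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]

def match_disease(query, candidates, max_distance=3):
    """Return (matched name, step used); max_distance is an assumed cutoff."""
    q = query.strip().lower()
    normalized = {c.strip().lower(): c for c in candidates}
    # Step 1: exact match.
    if q in normalized:
        return normalized[q], "exact"
    # Step 2: one phrase starts with the other, in either direction.
    for norm, original in normalized.items():
        if norm.startswith(q) or q.startswith(norm):
            return original, "starts-with"
    # Step 3: closest name by Levenshtein distance, if close enough.
    best = min(normalized, key=lambda n: levenshtein(q, n), default=None)
    if best is not None and levenshtein(q, best) <= max_distance:
        return normalized[best], "levenshtein"
    return None, "none"

names = ["Common variable immunodeficiency", "Herpes simplex encephalitis"]
print(match_disease("common variable immunodeficiency", names))
print(match_disease("Herpes simplex", names))
print(match_disease("Herpes simplex encefalitis", names))
```

Running the cheaper exact and prefix checks first means the edit-distance step only handles names that differ by spelling, which is also where manual verification is most useful.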

Data Storage and Web Access

To make CDG easily accessible, we created a webserver that allows users to query the CDG database using either genes or diseases as input. If the input gene is known to be disease-causing, the server provides the known associated diseases. If the gene is not known to be disease-causing, predicted data are displayed. The server also accepts disease names as input, returning the list of both known and predicted causative genes. The disease names in the CDG database are as reported in HGMD. If the user input is not an HGMD disease name, the procedure for comparing disease names described above is used to estimate the closest HGMD disease name. For the CDG server, MySQL was used to structure and store the multi-dimensional profile of the results of this study, and to process queries to allow efficient access. JSP and servlets were used to parse inputs and generate queries. The web interface is stored on a Rockefeller University Linux-based server on solid-state drives. The CDG resource is platform-independent and is freely available to all non-commercial users. The CDG database and server will be periodically updated with new public versions of HGMD, STRING, and HGC.
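The gene-query behavior described above (known diseases for known disease-causing genes, predictions otherwise) can be sketched with a small relational lookup. The example uses Python's built-in sqlite3 as a stand-in for MySQL; the table names, column names, gene names, diseases, and p-values are all hypothetical placeholders and do not reflect the actual CDG schema or data.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with a hypothetical two-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE known_gene_disease (gene TEXT, disease TEXT);
        CREATE TABLE predicted_cdg (gene TEXT, cdg TEXT, disease TEXT, p_value REAL);
    """)
    conn.executemany(
        "INSERT INTO known_gene_disease VALUES (?, ?)",
        [("GENE_K", "Disease X")],
    )
    conn.executemany(
        "INSERT INTO predicted_cdg VALUES (?, ?, ?, ?)",
        [("GENE_N", "GENE_K", "Disease X", 0.004),
         ("GENE_N", "GENE_J", "Disease Y", 0.001)],
    )
    return conn

def query_gene(conn, gene):
    """Return known diseases if the gene is known; otherwise predicted CDGs by p-value."""
    known = [row[0] for row in conn.execute(
        "SELECT disease FROM known_gene_disease WHERE gene = ? ORDER BY disease",
        (gene,),
    )]
    if known:
        return "known", known
    predicted = conn.execute(
        "SELECT cdg, disease, p_value FROM predicted_cdg WHERE gene = ? ORDER BY p_value",
        (gene,),
    ).fetchall()
    return "predicted", predicted

conn = build_demo_db()
print(query_gene(conn, "GENE_K"))  # known disease-causing gene
print(query_gene(conn, "GENE_N"))  # gene without a reported disease
```

Parameterized queries (the `?` placeholders) keep user-supplied gene names out of the SQL text, which matters for any publicly reachable query form.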

Results

CDG Validation and Comparative Analysis

We first explored the relationship between the 13,005 genes not currently described to cause clinical phenotypes and the HGMD known disease-causing genes. Each of the 13,005 genes was associated on average with 48 HGMD disease-causing genes and 7 diseases by HGC biological proximity (see Table S1 in Supplementary Material for the top-ranked associations). Notably, 92.9% of the associated disease-causing genes were within one or two degrees of separation from the corresponding query gene (Figure 1). By contrast, only 13.9% of all human gene pairs were within one or two degrees of separation (p < 10^−300, two-tailed equal variance t-test).

Figure 1

Predicted degrees of separation between (blue) the 13,005 genes from human gene mutation database (HGMD) not known to be disease-causing and their closest predicted HGMD disease-causing genes, and (orange) between all pairs of human genes.

The accuracy and utility of these associations were then assessed using new disease-causing genes not known during the construction of the CDG database. Using the first dataset (339 new genes from HGMD), we found that 287 had at least one predicted gene by CDG, compared to 133 using FunCoup and 116 using HumanNet. Of these genes, 134 of 287 were associated with the expected disease by CDG, compared to 46 of 133 by FunCoup and 47 of 116 by HumanNet (Figure 2A). We repeated the comparison using the second dataset (84 genes from ClinVar not present in HGMD) and observed that CDG similarly outperformed the other two methods, both in the number of genes with at least one predicted disease-causing gene and in correct association with the expected disease (Figure 2B).
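The counts reported for the first validation dataset can be turned into rates with simple arithmetic. The sketch below uses only the numbers given in the text (339 genes in total); the labels "coverage" and "hit rate" are descriptive names chosen here, not terms used by the authors.

```python
TOTAL_GENES = 339  # new HGMD genes in the first validation dataset

# method -> (genes with at least one predicted gene, genes matched to the expected disease)
counts = {
    "CDG": (287, 134),
    "FunCoup": (133, 46),
    "HumanNet": (116, 47),
}

def summarize(total, results):
    """Return coverage and hit rate (both in percent, one decimal) per method."""
    summary = {}
    for method, (predicted, correct) in results.items():
        coverage = round(100 * predicted / total, 1)   # share of genes with any prediction
        hit_rate = round(100 * correct / predicted, 1)  # share of those with the expected disease
        summary[method] = (coverage, hit_rate)
    return summary

rates = summarize(TOTAL_GENES, counts)
for method, (coverage, hit_rate) in rates.items():
    print(f"{method}: coverage {coverage}%, expected disease {hit_rate}%")
```

Expressed this way, CDG returns a prediction for about 84.7% of the new genes and recovers the expected disease for about 46.7% of those, against 39.2% and 34.6% for FunCoup and 34.2% and 40.5% for HumanNet.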

Figure 2

Comparative performance of CDG, FunCoup, and HumanNet using (A) 339 new genes in human gene mutation database (HGMD) and (B) using 84 genes in ClinVar that are not in HGMD. The numbers below each method show the number of genes with at least one predicted gene (left) and how many were associated with the expected disease (right). Black numbers show the gene distribution across the three servers and white numbers show how many were associated with the expected disease in each server.

To address the robustness of the predictions, we randomly sampled 1,000 sets of 287 genes from the 5,430 known disease-causing genes and estimated their CDGs and associated diseases. CDG identified the expected disease in 86.33% of cases by exact disease name match. Then, we examined the profiles of biological proximity for CDG predictions and known disease-causing genes. Assuming a Gaussian distribution, we performed 10,000 bootstrapping simulations for HGC p-values of CDG predictions between the observed 287 new HGMD genes with at least one CDG and the expected set of 13,005 genes not currently known to cause disease. The observed and expected CDG predictions yielded similar p-value profiles for biological relatedness between the observed and expected gene sets and their CDG (Figure 3). Therefore, CDG associations are expected to be more robust and relevant for the putative diseases associated with candidate genes than previous methods. Due to the lack of flat files from FunCoup and HumanNet, it was not possible to repeat this analysis with these methods. Thus, we expect that CDG predictions are of significant utility to researchers exploring genes without published phenotypes.

Figure 3

Bootstrapping simulations between a set of (1) expected: p-values between 13,005 genes not reported to cause disease and their predicted CDGs; (2) observed: p-values between new human gene mutation database genes (i.e., not used to generate the CDGs presented in this study) and their predicted CDGs. Test performed by random sampling using a Gaussian distribution.

Examples of CDG Usage

Finally, we demonstrated the utility of CDG on whole-exome sequencing (WES) data from five patients with various primary immunodeficiencies, modes of inheritance, and known mutated genes that were not in the HGMD public database during CDG generation (extended description and flowchart in Supplementary Material, Section 2). Phenotypes and associated genotypes in these examples include: (1) severe autoinflammation, a homozygous mutation in RNF31 (21); (2) epidermodysplasia verruciformis, a homozygous mutation in STK4 (MST1) (22); (3) herpes simplex encephalitis, a homozygous mutation in UNC93B1 (23); (4) common variable immunodeficiency, a heterozygous mutation in IKZF1 (24); and (5) natural killer cell deficiency, compound heterozygous mutations in GINS1 (25). The initial number of genes per patient ranged from 14,800 to 18,862. We then applied standard QC (DP > 4, MQ > 40, and QD > 2), minor allele frequency (<1%) (26), and gene-level filtering using GDI (27) and MSC (28), reducing the number of genes in each patient to between 18 and 322 candidate genes (numbers mostly dependent on mode of inheritance). Finally, applying the CDG server reduced the number of genes to investigate to between 1 and 11 per patient, a reduction in candidate genes of 92.1–96.6%, without losing any of the pathogenic genes.
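The variant-level thresholds quoted above (DP > 4, MQ > 40, QD > 2, minor allele frequency < 1%) can be illustrated with a minimal filter. The record layout, the `maf` and `gene` field names, and the toy variants are assumptions made for the example; the GDI and MSC gene-level filters and the CDG step itself are not reproduced here.

```python
def passes_filters(variant, min_dp=4, min_mq=40, min_qd=2, max_maf=0.01):
    """Apply the quality and frequency thresholds to one variant record."""
    return (variant["DP"] > min_dp
            and variant["MQ"] > min_mq
            and variant["QD"] > min_qd
            and variant["maf"] < max_maf)

def candidate_genes(variants):
    """Return the sorted set of genes that keep at least one passing variant."""
    return sorted({v["gene"] for v in variants if passes_filters(v)})

# Toy records with placeholder gene names and made-up annotation values.
variants = [
    {"gene": "GENE_A", "DP": 30, "MQ": 60, "QD": 12.0, "maf": 0.0001},  # passes
    {"gene": "GENE_B", "DP": 3,  "MQ": 60, "QD": 12.0, "maf": 0.0001},  # fails DP
    {"gene": "GENE_C", "DP": 30, "MQ": 35, "QD": 12.0, "maf": 0.0001},  # fails MQ
    {"gene": "GENE_D", "DP": 30, "MQ": 60, "QD": 1.5,  "maf": 0.0001},  # fails QD
    {"gene": "GENE_E", "DP": 30, "MQ": 60, "QD": 12.0, "maf": 0.05},    # too common
]
print(candidate_genes(variants))  # ['GENE_A']
```

A gene list produced this way is the kind of input that would then be submitted to the CDG server for the final reduction step.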

Conclusion

We provide the first resource that estimates the closest known disease-causing genes and their associated diseases for 13,005 human genes not currently known to be disease-causing. From the comparisons performed, we conclude that CDG predictions capture meaningful candidate disease-causing genes and diseases. We propose using CDG with lists of genes from NGS studies or similar sources to (1) explore the likelihood of candidate genes being associated with a disease of interest by investigating their CDGs and associated diseases; (2) rapidly identify known diseases associated with HGMD disease-causing genes; and (3) assign CDGs and associated diseases in variant annotation software. We also provide an option for users to perform CDG queries based on OMIM (29), although this resource contains fewer pathogenic mutations than HGMD. See Supplementary Material, Section 3 for further details regarding the webserver’s construction. Users can submit genes to the webserver (Figure 4) to obtain two outputs: (1) all CDGs and associated diseases, including their routes to the input genes (i.e., HGC-predicted genes on the shortest path) and (2) only the most significant CDG for each input gene (by p-value). If the input is a known disease-causing gene, the output will be all known associated diseases. Disease names can also be input to obtain known and predicted disease-causing genes for the phenotype concerned. When CDG does not provide useful results, we suggest rerunning it with different diseases that are phenotypically close to the disease under investigation. We expect that the use of CDG with gene-level filtering methods, such as the gene damage index (27), will facilitate the discovery of new disease-causing genes. The CDG server will be updated when new public versions of HGMD become available, and new features will be added, including filtering by degrees of separation and phenotype matching.
The CDG resource and database are available for download from the main CDG webpage: http://lab.rockefeller.edu/casanova/CDG.

Figure 4

Schematic of the closest disease-causing genes (CDG) server pipeline, where CDG can be estimated by queries of gene or disease lists provided by the user.

Author Contributions

YI initiated the study. PS, DC, and CC-R provided data and expertise. DR, PM, BB, PS, and YI analyzed the data. DR and LS generated the webserver. DR performed the comparison. DR, PM, AA, J-LC, LA, and YI wrote the manuscript. J-LC, LA, and YI supervised the study. All the authors revised and approved the final version of the manuscript.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

28 in total

1. Commentaries on "Informatics and medicine: from molecules to populations".

Authors: R B Altman; R Balling; J F Brinkley; E Coiera; F Consorti; M A Dhansay; A Geissbuhler; W Hersh; S Y Kwankam; N M Lorenzi; F Martin-Sanchez; G I Mihalas; Y Shahar; K Takabayashi; G Wiederhold
Journal: Methods Inf Med Date: 2008 Impact factor: 2.176

2. Next-generation diagnostics and disease-gene discovery with the Exomiser.

Authors: Damian Smedley; Julius O B Jacobsen; Marten Jäger; Sebastian Köhler; Manuel Holtgrewe; Max Schubach; Enrico Siragusa; Tomasz Zemojtel; Orion J Buske; Nicole L Washington; William P Bone; Melissa A Haendel; Peter N Robinson
Journal: Nat Protoc Date: 2015-11-12 Impact factor: 13.491

Review 3. Sequencing studies in human genetics: design and interpretation.

Authors: David B Goldstein; Andrew Allen; Jonathan Keebler; Elliott H Margulies; Steven Petrou; Slavé Petrovski; Shamil Sunyaev
Journal: Nat Rev Genet Date: 2013-06-11 Impact factor: 53.242

Review 4. Exome and genome sequencing for inborn errors of immunity.

Authors: Isabelle Meyts; Barbara Bosch; Alexandre Bolze; Bertrand Boisson; Yuval Itan; Aziz Belkadi; Vincent Pedergnana; Leen Moens; Capucine Picard; Aurélie Cobat; Xavier Bossuyt; Laurent Abel; Jean-Laurent Casanova
Journal: J Allergy Clin Immunol Date: 2016-10 Impact factor: 10.793

5. Comparative interactomics with Funcoup 2.0.

Authors: Andrey Alexeyenko; Thomas Schmitt; Andreas Tjärnberg; Dmitri Guala; Oliver Frings; Erik L L Sonnhammer
Journal: Nucleic Acids Res Date: 2011-11-21 Impact factor: 16.971

6. STRING v10: protein-protein interaction networks, integrated over the tree of life.

Authors: Damian Szklarczyk; Andrea Franceschini; Stefan Wyder; Kristoffer Forslund; Davide Heller; Jaime Huerta-Cepas; Milan Simonovic; Alexander Roth; Alberto Santos; Kalliopi P Tsafou; Michael Kuhn; Peer Bork; Lars J Jensen; Christian von Mering
Journal: Nucleic Acids Res Date: 2014-10-28 Impact factor: 16.971

7. Guidelines for genetic studies in single patients: lessons from primary immunodeficiencies.

Authors: Jean-Laurent Casanova; Mary Ellen Conley; Stephen J Seligman; Laurent Abel; Luigi D Notarangelo
Journal: J Exp Med Date: 2014-10-13 Impact factor: 14.307

8. A global reference for human genetic variation.

Authors: Adam Auton; Lisa D Brooks; Richard M Durbin; Erik P Garrison; Hyun Min Kang; Jan O Korbel; Jonathan L Marchini; Shane McCarthy; Gil A McVean; Gonçalo R Abecasis
Journal: Nature Date: 2015-10-01 Impact factor: 49.962

Review 9. The Human Phenotype Ontology in 2017.

Authors: Sebastian Köhler; Nicole A Vasilevsky; Mark Engelstad; Erin Foster; Julie McMurry; Ségolène Aymé; Gareth Baynam; Susan M Bello; Cornelius F Boerkoel; Kym M Boycott; Michael Brudno; Orion J Buske; Patrick F Chinnery; Valentina Cipriani; Laureen E Connell; Hugh J S Dawkins; Laura E DeMare; Andrew D Devereau; Bert B A de Vries; Helen V Firth; Kathleen Freson; Daniel Greene; Ada Hamosh; Ingo Helbig; Courtney Hum; Johanna A Jähn; Roger James; Roland Krause; Stanley J F Laulederkind; Hanns Lochmüller; Gholson J Lyon; Soichi Ogishima; Annie Olry; Willem H Ouwehand; Nikolas Pontikos; Ana Rath; Franz Schaefer; Richard H Scott; Michael Segal; Panagiotis I Sergouniotis; Richard Sever; Cynthia L Smith; Volker Straub; Rachel Thompson; Catherine Turner; Ernest Turro; Marijcke W M Veltman; Tom Vulliamy; Jing Yu; Julie von Ziegenweidt; Andreas Zankl; Stephan Züchner; Tomasz Zemojtel; Julius O B Jacobsen; Tudor Groza; Damian Smedley; Christopher J Mungall; Melissa Haendel; Peter N Robinson
Journal: Nucleic Acids Res Date: 2016-11-28 Impact factor: 16.971

10. ClinVar: public archive of relationships among sequence variation and human phenotype.

Authors: Melissa J Landrum; Jennifer M Lee; George R Riley; Wonhee Jang; Wendy S Rubinstein; Deanna M Church; Donna R Maglott
Journal: Nucleic Acids Res Date: 2013-11-14 Impact factor: 16.971
