
Gene4Denovo: an integrated database and analytic platform for de novo mutations in humans.

Guihu Zhao^1,2, Kuokuo Li³, Bin Li^1,2, Zheng Wang¹, Zhenghuan Fang³, Xiaomeng Wang³, Yi Zhang¹, Tengfei Luo³, Qiao Zhou¹, Lin Wang³, Yali Xie¹, Yijing Wang³, Qian Chen¹, Lu Xia³, Yu Tang¹, Beisha Tang^1,2, Kun Xia³, Jinchen Li^1,2,3.

Abstract

De novo mutations (DNMs) significantly contribute to sporadic diseases, particularly neuropsychiatric disorders. Whole-exome sequencing (WES) and whole-genome sequencing (WGS) provide effective methods for detecting DNMs and prioritizing candidate genes. However, it remains a challenge for scientists, clinicians, and biologists to conveniently access and analyse data regarding DNMs and candidate genes from scattered publications. To fill this unmet need, we integrated 580 799 DNMs, including 30 060 coding DNMs, detected by WES/WGS from 23 951 individuals across 24 phenotypes, and prioritized a list of candidate genes with different degrees of statistical evidence, including 346 genes with false discovery rates <0.05. We then developed a database called Gene4Denovo (http://www.genemed.tech/gene4denovo/), which allows these genetic data to be conveniently catalogued, searched, browsed, and analysed. In addition, Gene4Denovo integrates data from >60 genomic sources to provide comprehensive variant-level and gene-level annotation and information regarding the DNMs and candidate genes. Furthermore, Gene4Denovo enables end-users with limited bioinformatics skills to analyse their own genetic data, perform comprehensive annotation, and prioritize candidate genes using custom parameters. In conclusion, Gene4Denovo conveniently allows for the accelerated interpretation of DNM pathogenicity and the clinical implication of DNMs in humans.


Year: 2020; PMID: 31642496; PMCID: PMC7145562; DOI: 10.1093/nar/gkz923

Source DB: PubMed; Journal: Nucleic Acids Res; ISSN: 0305-1048; Impact factor: 16.971

INTRODUCTION

De novo mutations (DNMs) are variants observed in an individual that are not seen in either parent, and they have been reported to play prominent roles in several genetic diseases (1,2). Trio-based whole-exome sequencing (WES) and whole-genome sequencing (WGS) are the most useful methods for detecting DNMs and have been successfully applied in prioritizing candidate genes for autism spectrum disorder (ASD) (3), congenital heart disease (CHD) (4), undiagnosed developmental disorder (UDD) (5), epileptic encephalopathy (EE) (6), intellectual disability (ID) (7), schizophrenia (SCZ) (8) and others. Combined analyses demonstrate that DNMs in coding regions, including single nucleotide variants (SNVs) and insertions and deletions (indels), are associated with 13–60% of neurodevelopmental disorders (1). In addition, other research has shown that 42% of individuals with UDD carry pathogenic DNMs in coding regions, and it is estimated that 0.22–0.47% of births involve UDD influenced by DNMs (5). Furthermore, 13% of de novo missense variants and 43% of de novo loss-of-function (LoF) variants have been estimated to contribute to 12% and 9% of ASD diagnoses, respectively (3). In addition to neurodevelopmental disorders, DNMs also contribute to neurodegenerative disorders, such as early onset Alzheimer disease (EOAD) (9) and early onset Parkinson disease (EOPD) (10). Because of the high clinical and genetic heterogeneity of individual complex disorders, it is essential to integrate the DNM data distributed across different publications in order to prioritize novel candidate genes more effectively using a uniform strategy, as previously reported for autism and other neuropsychiatric disorders by us (11–13) and other groups (14–16). The denovo-db (17) aggregates a large number of DNMs identified from next-generation sequencing studies and facilitates the interpretation of DNMs in humans. 
However, the denovo-db only includes basic annotation information and does not provide a list of DNM-based candidate genes. Given that some diseases, such as different types of neurodevelopmental disorders, share significant aetiologies and phenotypes, some studies have integrated DNMs of different diseases in order to prioritize novel candidate genes (14–16). Consequently, the Developmental Brain Disorder Gene Database (14) and NPdenovo database (18) were developed to present the integrated genetic data. However, these two databases focus only on DNMs of limited types of diseases. With the increasing application of WES and WGS, greater numbers of DNMs will be detected in individuals with different phenotypes, adding to the challenge for scientists and clinicians to determine the pathogenicity of DNMs. For the advancement of precision medicine, great efforts will be needed to assess disease-causing variants and to identify candidate genes more precisely. In a previous study we demonstrated that integrating more genetic and clinical data sources can be beneficial for better interpretation of human variants and the prioritization of candidate genes (19). In the present study, we catalogued all published DNMs detected by WES/WGS, performed comprehensive variant-level and gene-level annotations, and prioritized statistically significant candidate genes. We then developed a user-friendly integrated database called Gene4Denovo which allows DNMs, candidate genes, and annotation information in humans to be conveniently searched, browsed, and analysed.

MATERIALS AND METHODS

DNM collection

We collected DNMs from original published WES/WGS studies with sample sizes >10 (3–11,20–69) (Supplementary Table S1, Figure S1). DNMs from the denovo-db (17) and NPdenovo (18) databases were also collected. The information collected for each DNM included chromosome, start position, end position, reference sequence, alternate sequence, individual identifier, phenotype, sequencing platform, publication information, and PubMed identifier. If an individual identifier was not available, ‘NA’ was used to fill this field. We used LiftOver to convert coordinates from other versions of the human reference genome (hg18 or hg38) to hg19. The complementary DNA (cDNA) positions of DNMs from some publications were translated into genomic DNA (gDNA) positions using the VarCards online function that we previously developed (http://varcards.biols.ac.cn/). Because some samples overlapped between studies, redundant samples were removed. If a study had integrated samples from other published studies, the DNMs and sample size recorded in Gene4Denovo were those of the non-redundant samples, and the integrated studies were not included separately. If the samples of an original study had not been integrated by any other study, the sample size recorded in Gene4Denovo was the same as in the original study. DNMs of individuals with the same phenotype from different publications were merged. In addition to DNMs from unaffected controls and patients with different disorders, we also collected DNMs from one study with mixed phenotypes (17,53) and, like denovo-db (17), recorded the phenotype of the individuals in this study as ‘Mix’.
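The record layout and redundancy removal described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the class name, field names, and the rule of keying on individual identifier plus variant are assumptions made for illustration, and the PubMed identifiers are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DNMRecord:
    # Hypothetical field names mirroring the attributes listed in the text.
    chrom: str
    start: int
    end: int
    ref: str
    alt: str
    individual_id: str  # 'NA' when the publication gives no identifier
    phenotype: str
    pmid: str


def drop_redundant(records):
    """Keep the first copy of each (individual, variant) pair.

    Records with individual_id == 'NA' cannot be matched across studies,
    so this sketch keeps all of them.
    """
    seen = set()
    unique = []
    for rec in records:
        if rec.individual_id == "NA":
            unique.append(rec)
            continue
        key = (rec.individual_id, rec.chrom, rec.start, rec.end, rec.ref, rec.alt)
        if key in seen:
            continue
        seen.add(key)
        unique.append(rec)
    return unique


# Made-up records: the second one re-reports the same sample and variant.
records = [
    DNMRecord("chr1", 1000, 1000, "A", "G", "S1", "ASD", "PMID_A"),
    DNMRecord("chr1", 1000, 1000, "A", "G", "S1", "ASD", "PMID_B"),
    DNMRecord("chr2", 500, 500, "C", "T", "NA", "ID", "PMID_C"),
]
unique = drop_redundant(records)
```

In practice, overlap between cohorts is usually resolved at the sample level from the cohort descriptions in each publication; the key-based filter here only covers the case where identifiers are shared.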

DNM annotation and candidate genes prioritization

We performed a comprehensive analysis of the collected DNMs (Supplementary Figure S1). ANNOVAR (70) was used to annotate the DNMs based on transcript definitions from RefSeq, UCSC Known Gene, and Ensembl Gene. Based on their functional effects, DNMs were classified into the following types: (i) LoF variants, including frameshift indels, splicing, stopgain, and stoploss variants, (ii) deleterious missense variants (Dmis), (iii) tolerant missense variants (Tmis), (iv) synonymous variants (Syn), (v) non-frameshift indel (NF) variants and (vi) noncoding variants. The pathogenicity of the missense variants was predicted using ReVe, which was recently developed by our group (71). Missense variants with a ReVe score higher than 0.7 were considered Dmis. The LoF and Dmis variants were referred to as putative functional (Pfun) variants. The transmitted and de novo association (TADA) model (72) was used to calculate the P-value and false discovery rate (FDR) for each gene with Pfun variants in each disorder (Supplementary Table S2). The TADA parameters for each disorder, including the background gene-level de novo mutation rate (GDNMR) of each gene, the fraction of risk genes among all human genes (π), the fold-enrichment (λ) and the relative risk (γ), were evaluated. The GDNMR was sourced from a previous study based on the trinucleotide model and several adjustment factors (73). The fraction of risk genes was evaluated by maximum likelihood estimation based on the number of Pfun DNMs and the number of genes with multiple Pfun DNMs, as described in previous studies (3,4,74). The fold-enrichment of LoF and Dmis variants was calculated by comparing the normalized number of LoF and Dmis variants in each case group with that in the controls. As in previous studies, we normalized the number of LoF and Dmis variants using the number of de novo synonymous mutations in each case group and in the controls. 
Finally, we calculated the relative risk of LoF and Dmis variants using the equation π(γ − 1) = λ − 1. For disorders with >500 samples, including ASD, CHD, UDD, ID, EE, SCZ and Tourette disorder (TD), the parameters of the TADA model were re-evaluated (74). For other diseases with inadequate sample sizes, we used parameters estimated from all the integrated DNMs. Two strategies were used to prioritize candidate genes. In the first strategy, we performed TADA to calculate the FDR of each gene for each disorder. In the second strategy, we combined the DNMs of each gene across all disorders and calculated the FDR. Genes were classified by their FDR in either of the two prioritization strategies using the following criteria: High confidence [0, 0.0001], Strong (0.0001, 0.001], Suggestive (0.001, 0.01], Positive (0.01, 0.05], Possible (0.05, 0.1] and Minor evidence (0.1, 0.2].
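As a worked illustration of the two parameter steps above, the Python sketch below computes a fold-enrichment normalized by synonymous counts and then solves π(γ − 1) = λ − 1 for γ. The function names are hypothetical and the counts are made up for the example; they are not values from this study, and the sketch does not reproduce the TADA model itself.

```python
def fold_enrichment(case_n, case_syn, control_n, control_syn):
    """Fold-enrichment (lambda) of a variant class in cases versus controls.

    Each count is first normalized by the number of de novo synonymous
    variants observed in the same group.
    """
    if case_syn <= 0 or control_syn <= 0 or control_n <= 0:
        raise ValueError("synonymous counts and control count must be positive")
    return (case_n / case_syn) / (control_n / control_syn)


def relative_risk(lam, pi):
    """Solve pi * (gamma - 1) = lambda - 1 for gamma."""
    if not 0 < pi <= 1:
        raise ValueError("pi must lie in (0, 1]")
    return 1.0 + (lam - 1.0) / pi


# Made-up counts: 300 LoF DNMs per 1000 synonymous DNMs in cases,
# 100 LoF DNMs per 1000 synonymous DNMs in controls.
lam_lof = fold_enrichment(case_n=300, case_syn=1000, control_n=100, control_syn=1000)
# Made-up fraction of risk genes (pi = 0.05).
gamma_lof = relative_risk(lam_lof, pi=0.05)
print(f"lambda = {lam_lof:.2f}, gamma = {gamma_lof:.2f}")
```

With these illustrative inputs, λ is about 3 and γ is about 41, which shows how a modest fold-enrichment translates into a large per-gene relative risk when only a small fraction of genes are risk genes.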

Variant-level data source

Initially, the allele frequencies of different populations were downloaded from various human genetic variation databases, including gnomAD (release 2.1.1), which contained variants of 125 748 exomes and 15 708 genomes from unrelated individuals sequenced as part of various disease-specific and population genetic studies including a total of 141 456 individuals (75); ExAC (release 1.0), which included 60 706 unrelated individuals sequenced as part of various disease-specific and population genetic studies (75,76); ESP6500 (release ESP6500SI-V2), which included 6503 exomes from European Americans and African Americans (77); 1000 Genomes Project (final phase of the project), which included genomic data for 2504 individuals from 26 different populations around the world (78); Kaviar genomic variant database (version 160 204-Public), which included integrated variants from 35 projects encompassing 13 200 genomes and 64 600 exomes (79); and Haplotype Reference Consortium (HRC), which included 64 976 haplotypes from 20 studies of predominantly European ancestry (80). The predictive scores and pathogenicity consequences of missense variants were assessed based on 24 in silico methods, including ReVe (71), SIFT (81,82), PolyPhen2 HVAR (83), PolyPhen2 HDIV (83), LRT (84), MutationTaster (85), MutationAssessor (86), FATHMM (87), PROVEAN (88), MetaSVM (89), MetaLR (89), VEST 3.0 (90), M-CAP (91), CADD (92), GERP++ (93), DANN (94), FATHMM MKL (95), Eigen (96), GenoCanyon (97), fitCons (98), PhyloP (99), PhastCons (100), SiPhy (101) and REVEL (102). 
In addition, we extracted variant and related diseases or phenotype information from public disease-specific databases, including Clinical Interpretation of genetic variants (InterVar) (103); ClinVar, a database of public reports on the relationships among human variations and phenotypes (104); COSMIC, a database of somatic mutation information and related details, which also contains information relating to human cancers (105); ICGC, which catalogues genomic abnormalities in tumours (106); and the single nucleotide polymorphism database dbSNP v150 (107). Finally, we acquired the protein domain for each DNM from InterPro (108) and the protein sequences across 21 species from the National Center for Biotechnology Information (NCBI) database HomoloGene (109).

Gene-level data source

A large number of meaningful annotations for each gene were collected from public databases. Basic information and functional information of genes were sourced from the following: UniProt (release 201902), which is a collection of functional information on proteins (110); NCBI Gene, which includes gene-specific connections in the nexus of sequence, expression, function and homology data (111); NCBI BioSystems (release 20170421), which categorizes the genes, proteins, and small molecules involved in biological systems (112); Gene Ontology (GO; V1.4), which is a source of information on the functions of genes (113); and InBio Map (release 20160912), which includes information on protein–protein interactions (114). The genic intolerance scores of each gene were collected from the residual variation intolerance score (RVIS), which is a gene-based score intended to help in the interpretation of human sequence data (115); the novel gene intolerance ranking system LoFtool (116); the heptanucleotide context intolerance score, which quantifies the difference between the expected and observed numbers of functional variants in a gene (117); the gene damage index (GDI), which is the accumulated mutational damage for each human gene in the general population (118); Episcore, which is a computational method to predict haploinsufficiency using epigenomic features and is complementary to mutation intolerance metrics (119); and the probability of loss-of-function intolerance (pLI) score, which indicates the probability that a gene is intolerant to a loss-of-function mutation (75). 
In addition, disease-related or phenotype-related information of genes was extracted from Online Mendelian Inheritance in Man (OMIM) (120); ClinVar (104); Human Phenotype Ontology (HPO), which is a standardized vocabulary of phenotypic abnormalities encountered in human disease (121); mammalian phenotype from mouse genome informatics (MGI) (122); and InterPro, which is a resource that provides functional analysis of protein sequences (108). Furthermore, we collected gene expression data from BrainSpan, which contains data regarding gene expression in specific brain regions and covers several developmental stages (123); the Genotype-Tissue Expression project (GTEx), which involves the relationship among genetic variation and gene expression in multiple human tissues (124); and the protein subcellular map from the Human Protein Atlas, which is a map of protein expression across 32 human tissues (125). Finally, the drug–gene interactions data and gene druggability were sourced from the latest Drug-gene Interaction Database (DGIdb, v3.0), which assembled 56 039 drug-gene interaction claims (126).

Database construction and interface

Gene4Denovo (http://www.genemed.tech/gene4denovo) was developed using JavaScript, PHP, and Perl on a Linux platform with an Nginx web server. A front-end/back-end separation model was used. The front end was based on Vue and used the Element UI toolkit, which supports all modern browsers across platforms, including Microsoft Edge, Safari, Firefox and Google Chrome. The back end was based on Laravel, a PHP web framework. The front-end/back-end separation model has a number of advantages, including simplicity of control, modularity and expandability. Gene4Denovo is compatible with all major browser environments and different operating systems, including Windows, Linux, and Mac. The data were stored in a MySQL database.

RESULTS AND WEB INTERFACE

DNMs and candidate genes

The Gene4Denovo database fully integrated 580 799 DNMs from 23 951 individuals across 24 phenotypes from 59 publications, including 553 404 DNMs detected by WGS and 27 395 DNMs detected by WES (Table 1). Most of the DNMs and samples were collected from nine phenotypes that included 6511 patients with ASD (n = 280 782), 4293 patients with UDD (n = 8361), 2645 patients with CHD (n = 2990), 933 patients with EE (n = 1213), 1331 patients with ID (n = 1493), 1094 patients with SCZ (n = 1064), 812 patients with TD (n = 805), 3391 unaffected controls (n = 174 836) and 1548 individuals with Mix phenotypes (n = 107 834). Using comprehensive annotation, we preferentially focused on 30 060 DNMs in coding regions and splicing sites (4582 LoF, 6651 Dmis, 11 781 Tmis, 6550 Syn and 496 NF). The DNMs included 8175 in ASD, 7696 in UDD, 2972 in CHD, 1478 in ID, 1165 in EE, 1052 in SCZ, 781 in TD, 470 in Congenital diaphragmatic hernia (CDH), 319 in Craniosynostosis (CRAN), 219 in Periventricular nodular heterotopia (PNH), 109 in Amyotrophic lateral sclerosis (ALS), 68 in Bipolar disorder (BP), 60 in EOPD, 59 in Cerebral palsy (CP), 40 in Neural tube defects (NTD), 19 in Early-onset high myopia (EOHM), 15 in EOAD, 13 in Smith-Magenis syndrome (SMS), 6 in Cantu syndrome (CS), 5 in Sporadic infantile spasm syndrome (SISS), 4 in Acromelic frontonasal dysostosis (AFD), 4 in Anophthalmia/Microphthalmia (AM), 3629 in Control and 1702 in Mix.
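The functional classes counted above follow the classification rules given in the Methods (LoF, Dmis, Tmis, Syn, NF, noncoding, with ReVe > 0.7 marking Dmis). The Python sketch below shows one way to express those rules against ANNOVAR-style `Func` and `ExonicFunc` values. The function names are hypothetical, and treating a missense variant without a ReVe score as Tmis is an assumption of this sketch rather than a rule stated in the paper.

```python
# ANNOVAR-style ExonicFunc values grouped by the classes used in the text.
LOF_EXONIC = {"frameshift insertion", "frameshift deletion", "stopgain", "stoploss"}
NF_EXONIC = {"nonframeshift insertion", "nonframeshift deletion"}


def classify_dnm(func, exonic_func=None, reve=None, reve_cutoff=0.7):
    """Assign a DNM to LoF, Dmis, Tmis, Syn, NF or noncoding."""
    if func == "splicing" or exonic_func in LOF_EXONIC:
        return "LoF"
    if exonic_func == "nonsynonymous SNV":
        # Assumption: a missing ReVe score is treated as tolerant.
        if reve is not None and reve > reve_cutoff:
            return "Dmis"
        return "Tmis"
    if exonic_func == "synonymous SNV":
        return "Syn"
    if exonic_func in NF_EXONIC:
        return "NF"
    return "noncoding"


def is_pfun(label):
    """Putative functional (Pfun) variants are LoF or Dmis."""
    return label in {"LoF", "Dmis"}


# Coding DNM counts reported in the text; they add up to 30 060.
coding_counts = {"LoF": 4582, "Dmis": 6651, "Tmis": 11781, "Syn": 6550, "NF": 496}
total_coding = sum(coding_counts.values())
```

For example, `classify_dnm("exonic", "nonsynonymous SNV", reve=0.85)` falls into the Dmis class and therefore counts as a Pfun variant.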

Table 1.

Summary of collected DNMs in Gene4Denovo database

Phenotypes	Abbreviation	Study	Trios	DNMs	Coding DNMs
Autism spectrum disorder	ASD	11	6511	280 782	8175
Undiagnosed developmental disorder	UDD	1	4293	8361	7696
Congenital heart disorder	CHD	1	2645	2990	2972
Intellectual disability	ID	7	1331	1493	1478
Epileptic encephalopathy	EE	7	933	1213	1165
Schizophrenia	SCZ	7	1094	1064	1052
Tourette disorder	TD	2	812	805	781
Congenital diaphragmatic hernia	CDH	1	362	470	470
Craniosynostosis	CRAN	1	291	322	319
Periventricular nodular heterotopia	PNH	1	202	219	219
Amyotrophic lateral sclerosis	ALS	3	173	111	109
Bipolar disorder	BP	1	79	71	68
Early onset Parkinson disease	EOPD	2	49	60	60
Cerebral palsy	CP	1	98	61	59
Neural tube defects	NTD	1	43	40	40
Early-onset high myopia	EOHM	1	18	20	19
Early onset Alzheimer disease	EOAD	1	12	15	15
Smith-Magenis syndrome	SMS	1	13	13	13
Cantu syndrome	CS	1	14	6	6
Sporadic infantile spasm syndrome	SISS	1	10	5	5
Acromelic frontonasal dysostosis	AFD	1	4	4	4
Anophthalmia/Microphthalmia	AM	1	25	4	4
Control	Control	9	3391	174 836	3629
Mix phenotype	Mix	1	1548	107 834	1702
Total		59	23 951	580 799	30 060

All DNMs reported in primary publications were integrated in the Gene4Denovo database. ANNOVAR was used to annotate these DNMs. Variants with functional effects of frameshift indel, stopgain, stoploss, missense, synonymous, non-frameshift indel and splicing site (≤2 bp) were defined as coding DNMs. DNMs in AFD with sample size <10 (n = 4) from the denovo-db database were also integrated in the present study.

Based on the TADA model parameters (Supplementary Table S2) and Pfun DNMs (the combination of LoF and Dmis) of each disorder, we prioritized 591 candidate genes with FDR values <0.2 from 18 disorders, including ASD (n = 140), UDD (n = 308), CHD (n = 60), ID (n = 121), EE (n = 80), SCZ (n = 10), TD (n = 11), CDH (n = 2), CRAN (n = 1), PNH (n = 1), ALS (n = 1), BP (n = 1), EOPD (n = 1), CP (n = 1), NTD (n = 1), SMS (n = 1), CS (n = 1), AFD (n = 1) (Table 2, Supplementary Table S3). Owing to the small sample sizes and high genetic heterogeneity, we did not prioritize any significant candidate genes in EOAD, SISS, AM or EOHM. More samples are needed for further study of these diseases. Since most of the disease samples we collected shared significant aetiology and clinical presentations, we also combined Pfun DNMs from all disorders, performed TADA analysis again, and prioritized 385 candidate genes with FDR <0.2, which included 301 genes that had also been prioritized by single-disorder analysis and 84 genes that were prioritized only by cross-disorder analysis. After removing redundancy, we ultimately identified 675 candidate genes (591 + 84) and ranked the genes into six tiers based on the strength of the statistical evidence of FDR (Table 2). 
The tiers included 132 high-confidence genes (FDR ≤ 0.0001, 19.56%), 36 strong genes (0.0001 < FDR ≤ 0.001, 5.33%), 62 suggestive genes (0.001 < FDR ≤ 0.01, 9.19%), 116 positive genes (0.01 < FDR ≤ 0.05, 17.19%), 99 possible genes (0.05 < FDR ≤ 0.1, 14.67%), and 230 minor-evidence genes (0.1 < FDR ≤ 0.2, 34.07%). We noted that 39.41% (266/675) of candidate genes carried Pfun DNMs in only one disorder and that 27.26% (184/675), 19.70% (133/675), 9.19% (62/675), 3.26% (22/675), 1.04% (7/675) and 0.15% (1/675) of candidate genes carried Pfun DNMs in two, three, four, five, six and seven disorders, respectively (Supplementary Table S3). For example, ARID1B, CACNA1E, DDX3X, POGZ, RYR2, SCN2A and SMAD6 carried Pfun DNMs in six disorders and KMT2C in seven disorders.
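The six-tier ranking can be written as a small lookup, sketched here in Python. The function name is hypothetical, and all upper bounds are treated as inclusive; Table 2 writes the last tier as 0.1 < FDR < 0.2, so the behaviour at exactly 0.2 is an assumption of this sketch.

```python
# (inclusive upper bound, tier name), in increasing order of FDR.
TIERS = [
    (0.0001, "High confidence"),
    (0.001, "Strong"),
    (0.01, "Suggestive"),
    (0.05, "Positive"),
    (0.1, "Possible"),
    (0.2, "Minor evidence"),
]


def fdr_tier(fdr):
    """Return the evidence tier for an FDR, or None if it is above 0.2."""
    if fdr < 0:
        raise ValueError("FDR cannot be negative")
    for upper, name in TIERS:
        if fdr <= upper:
            return name
    return None


# Tier sizes from the 'Total' row of Table 2 and their shares of 675 genes.
tier_counts = [132, 36, 62, 116, 99, 230]
tier_shares = [round(100 * n / sum(tier_counts), 2) for n in tier_counts]
```

Under these bounds, an FDR of 7.17 × 10⁻³ maps to the Suggestive tier and an FDR of 0.36 maps to no tier.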

Table 2.

Summary of prioritized candidate genes in Gene4Denovo database

Disease (trios)	FDR ≤ 0.0001	0.0001< FDR ≤ 0.001	0.001 < FDR ≤ 0.01	0.01 < FDR ≤ 0.05	0.05 < FDR ≤ 0.1	0.1 < FDR < 0.2
ASD (6511)	13	9	10	29	26	53
UDD (4293)	85	21	43	50	40	69
CHD (2645)	3	3	4	12	13	25
ID (1331)	26	13	18	16	14	34
EE (933)	14	3	14	12	8	29
SCZ (1094)	0	0	0	0	1	9
TD (812)	0	0	0	2	3	6
CDH (362)	0	1	0	0	1	0
CRAN (291)	0	1	0	0	0	0
PNH (202)	1	0	0	0	0	0
ALS (173)	0	0	0	0	0	1
BP (79)	0	0	0	0	0	1
EOPD (49)	0	0	0	0	0	1
CP (98)	0	0	0	1	0	0
NTD (43)	0	0	0	1	0	0
SMS (13)	1	0	0	0	0	0
CS (14)	1	0	0	0	0	0
AFD (4)	0	1	0	0	0	0
CD (19 012)	117	27	46	60	47	88
Total	132	36	62	116	99	230

ASD, autism spectrum disorder; UDD, undiagnosed developmental disorder; CHD, congenital heart disorder; ID, intellectual disability; EE, epileptic encephalopathy; SCZ, schizophrenia; TD, Tourette disorder; CDH, congenital diaphragmatic hernia; CRAN, craniosynostosis; PNH, periventricular nodular heterotopia; ALS, amyotrophic lateral sclerosis; BP, bipolar disorder; EOPD, early onset Parkinson disease; CP, cerebral palsy; NTD, neural tube defects; SMS, Smith-Magenis syndrome; CS, Cantu syndrome; AFD, acromelic frontonasal dysostosis; CD, all samples with different disorders combined. The numbers of genes with FDR < 0.2 in each disorder and in the cross-disorder analysis are shown in this table. We ranked all candidate genes into six tiers based on the strength of the false discovery rate (FDR). The total number of candidate genes was counted after removing redundancy.


Gene4Denovo search modules

To accelerate the interpretation of DNMs and candidate genes, we developed a database called Gene4Denovo, which features a user-friendly query interface and a set of custom functions and provides a comprehensive overview of DNMs, candidate genes, and their annotation information. The query interface contains panels for quick searches and for advanced searches (Figure 1). The quick search function, found on the home page, is the main tool for quickly accessing detailed information regarding DNMs. The quick search automatically recognizes a variety of key terms, such as a gene symbol, genomic region, cytoband, transcript accession, the nucleic acid change in a certain gene or transcript, the genomic coordinate of a variant, or a DNM identifier. Moreover, several examples of input query formats are available: clicking the ‘example’ link fills the input box with the corresponding example. The advanced search (http://www.genemed.tech/gene4denovo/search) supports batch searches and allows users to specify annotated datasets. The advanced search provides options that include primary information, predictive algorithms for nonsynonymous variation, allele frequency in different populations, and disease-related and phenotype-related information. The advanced search also has six types of input forms that are similar to those of the quick search (gene symbol, genomic region, cytoband, transcript accession, the nucleic acid change in a certain gene or transcript, and the genomic coordinate of a variant). To improve the user experience, the advanced search query form and the corresponding result sets are displayed on the same page. Of note, the search results can be freely exported and downloaded as Excel files.
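To make the idea of automatic key-term recognition concrete, the following Python sketch guesses the query type from its shape. It is not Gene4Denovo's parser: every pattern, label, and example string is a hypothetical assumption, the accepted formats on the website may differ (see the site's ‘example’ link), and DNM identifiers are not handled because their format is not described here.

```python
import re

# Hypothetical chromosome token: optional 'chr' prefix, 1-22, X or Y.
CHROM = r"(?:chr)?(?:[1-9]|1[0-9]|2[0-2]|X|Y)"

# Patterns are tried in order; the first match decides the query type.
QUERY_PATTERNS = [
    ("genomic_region", re.compile(rf"^{CHROM}:\d+-\d+$")),
    ("variant_coordinate", re.compile(rf"^{CHROM}:\d+:[ACGT-]+:[ACGT-]+$")),
    ("cytoband", re.compile(r"^(?:[1-9]|1[0-9]|2[0-2]|X|Y)[pq]\d+(?:\.\d+)?$")),
    ("transcript_accession", re.compile(r"^(?:NM_|NR_|ENST)\d+(?:\.\d+)?$")),
    ("nucleic_acid_change", re.compile(r"^[A-Za-z0-9_.-]+:c\.\S+$")),
    ("gene_symbol", re.compile(r"^[A-Za-z][A-Za-z0-9-]*$")),
]


def recognize_query(text):
    """Return a label for the kind of key term the query looks like."""
    query = text.strip()
    for label, pattern in QUERY_PATTERNS:
        if pattern.match(query):
            return label
    return "unknown"


# Illustrative inputs only; coordinates and changes are made up.
examples = ["KCNQ2", "chr1:1000-2000", "20q13.33", "NM_172107",
            "KCNQ2:c.100A>G", "chr1:1500:A:G"]
labels = [recognize_query(q) for q in examples]
```

Ordering the patterns from most to least specific keeps the permissive gene-symbol pattern from shadowing the structured formats.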

Figure 1.

Snapshot of variant-level implications in Gene4Denovo. Two approaches are available to access variant-level implications, the ‘Quick search’ and ‘Advanced search’. The results of a quick search for the KCNQ2 gene are shown as an example, including the functional effects at the transcript and protein levels, homology, predicted damaging severity of missense variants, allele frequencies in different populations, and information in disease-related databases.

Variant-level implications in Gene4Denovo

Both quick and advanced searches provide access to detailed DNM annotation data. Search results are returned as a page that contains two tables. The first table is a summary of DNMs for each gene in each disorder, while the second table displays all the detailed information regarding variant-level annotations (Figure 1). The summary table presents the numbers of LoF, Dmis, Pfun, Tmis, synonymous, non-frameshift and non-coding variants, together with the P-value and the FDR, for each gene in each disorder. The variants table presents detailed information for each DNM, including the following aspects: (i) the functional effect and reference information for each DNM; (ii) the predicted damaging scores and functional consequences of missense variants based on 24 in silico algorithms; (iii) the allele frequencies of different populations based on seven data sources; (iv) the disease-related information from seven popular related data sources and (v) the protein sequences across 21 species from HomoloGene, including a graphic presentation of the multiple sequence alignment between species. The variants table can be filtered by functional effects, adding flexibility to the output. In addition, users can specify any of the mentioned data sources to limit the contents presented to those of specific interest.

Gene-level implications in Gene4Denovo

On the variant-level implications page, users can click on the corresponding gene symbols in the summary table or variants table to access detailed information regarding the given genes. All genes containing DNMs were curated in Gene4Denovo, and the gene page currently includes the following six panels: (i) basic information, (ii) gene function, (iii) phenotype and disease, (iv) gene expression, (v) variants in different populations and (vi) drug–gene interaction (Figure 2, Supplementary Table S4). The ‘basic information’ panel displays the integrated basic information for the gene, including the official gene name, synonyms, genomic coordinate, gene type and functional annotations, the genic intolerance scores based on six studies (75,115–119), and a summary of the cellular function of the protein encoded by the gene, sourced from UniProt (110). The ‘gene function’ panel consists of five sub-panels, including (i) the molecular function retrieved from UniProtKB; (ii) gene ontology terms retrieved from the Gene Ontology Consortium (113); (iii) domain information retrieved from InterPro (108); (iv) protein–protein interactions retrieved from InBio Map (114) and (v) biological pathway information retrieved from BioSystems (112). The ‘phenotype and disease’ panel consists of four sub-panels, including (i) phenotype data retrieved from OMIM (120); (ii) clinical variation data retrieved from ClinVar (104); (iii) mammalian phenotype data retrieved from MGI (122) and (iv) human phenotype ontology retrieved from HPO (121). The ‘gene expression’ panel consists of four sub-panels, including (i) spatio-temporal expression profiles retrieved from BrainSpan (123); (ii) cell diversity and expression in the human cortex based on single-cell RNA-seq from the Allen Brain Atlas; (iii) gene expression data in 31 primary tissues and 53 secondary tissues retrieved from GTEx (124) and (iv) subcellular location retrieved from The Human Protein Atlas (125). 
The ‘variants in different populations’ panel provides the number of variants with different functional effects at different thresholds in different populations. The ‘drug–gene interaction’ panel provides data on drug–gene interactions and gene druggability, which are retrieved from DGIdb v3.0 (126).

Figure 2.

Snapshot of gene-level implications in Gene4Denovo. The typical gene-level implications of the KCNQ2 gene are illustrated as an example, including basic information, gene functions, associated phenotypes and diseases, gene expression, variants in different populations, and drug–gene interactions.

Customized analysis section in Gene4Denovo

Gene4Denovo provides an interface that allows users to freely analyse their own genetic data (http://genemed.tech/gene4denovo/analysis). As shown in Figure 3, the analysis process includes four simple steps: (i) inputting an email address, (ii) choosing the Trio or Non-trio option for the users’ genetic data, (iii) uploading genetic data files (VCF4 format) and (iv) inputting the basic information for each sample. If users select the Trio option, they must select the paternal sample ID, the maternal sample ID, the child's sample ID and the gender of the child. Gene4Denovo will automatically identify the DNMs, homozygous variants, compound heterozygous variants, and X-linked variants using default parameters. If the Non-trio option is chosen, the users must select the genotype of each sample, such as heterozygous, homozygous or wild type. Gene4Denovo will identify the user-defined co-segregated rare damaging variants using default parameters. Users are also able to specify the cut-off values for quality control, the data sources of annotation and the parameters used for identifying rare damaging variants. In the quality control panel, users are able to specify several parameters used to detect high-confidence genetic variants, including the minimum QUAL, sequencing depth, allele depth, and genotype quality. 
There are four annotation sub-panels: (i) ‘Basic information annotation’ to specify three basic annotation data sources (the cytoband database, gene-level databases, and Gene4Denovo, the last of which supplies the identifier, putative functional DNMs, P-value and FDR of candidate genes in each disorder); (ii) ‘Pathogenicity prediction of missense variants’ to specify the methods and cut-off values for predicting deleterious missense variants; (iii) ‘Allele frequency in variant population’ to specify the cut-off values of allele frequency for detecting rare variants according to different population databases and (iv) ‘Clinical related database’ to specify clinical-related databases, such as InterVar, ClinVar, COSMIC, ICGC and NCI-60. After completing the analysis, Gene4Denovo sends an email to the designated email address that includes a link for downloading the analysis results.
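The Trio workflow combines a quality-control filter with a check that the child carries an allele absent from both parents. The Python sketch below illustrates that logic for a single autosomal site. It is not Gene4Denovo's implementation: the class, function names, and default thresholds are hypothetical, the site's actual default parameters are not given here, and sex-chromosome handling is omitted.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """Genotype call for one sample at one site (hypothetical structure)."""
    gt: tuple   # allele indices, e.g. (0, 1); 0 is the reference allele
    depth: int  # sequencing depth
    gq: int     # genotype quality


def passes_qc(call, min_depth=10, min_gq=20):
    # Thresholds are placeholders, not the website's defaults.
    return call.depth >= min_depth and call.gq >= min_gq


def is_candidate_dnm(child, father, mother, **qc):
    """True if the child has a non-reference allele and both parents are
    homozygous reference, with all three calls passing QC."""
    if not all(passes_qc(c, **qc) for c in (child, father, mother)):
        return False
    parents_ref = all(allele == 0 for allele in father.gt + mother.gt)
    child_has_alt = any(allele != 0 for allele in child.gt)
    return parents_ref and child_has_alt


child = Call(gt=(0, 1), depth=30, gq=99)
father = Call(gt=(0, 0), depth=25, gq=99)
mother = Call(gt=(0, 0), depth=28, gq=99)
candidate = is_candidate_dnm(child, father, mother)
```

Requiring QC to pass in all three samples is deliberate: a low-depth parental call can hide an inherited allele and produce a false de novo candidate.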

Figure 3.

Snapshot of the analysis panel in Gene4Denovo. There are four steps in the analysis process: inputting an email address, choosing the Trio or Non-trio option, uploading the data files, and inputting the trio or genotype information. To increase flexibility, users can specify annotation datasets, such as functional effects, allele frequencies, and predicted damaging scores from any of the 24 in silico algorithms.
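
As a rough illustration of how the allele-frequency and pathogenicity-prediction options could combine into a rare damaging variant filter, consider the following Python sketch. Everything in it is assumed for illustration: the database keys (`db_a`, `db_b`), the method names (`method_1` to `method_3`), the toy variants and the cut-off values are hypothetical and do not come from Gene4Denovo.

```python
def is_rare(pop_freqs, max_af=0.001):
    """True if the allele frequency is at or below max_af in every population
    database. A missing frequency (None) is treated as 'not observed'."""
    return all(af is None or af <= max_af for af in pop_freqs.values())


def is_damaging(predictions, methods, min_votes=1):
    """True if at least min_votes of the selected methods call the variant
    deleterious. Methods without a prediction do not count as votes."""
    votes = sum(1 for m in methods if predictions.get(m) is True)
    return votes >= min_votes


def rare_damaging(variants, methods, max_af=0.001, min_votes=1):
    """Return the IDs of variants that are both rare and predicted damaging."""
    return [
        v["id"]
        for v in variants
        if is_rare(v["af"], max_af) and is_damaging(v["pred"], methods, min_votes)
    ]


# Toy data, invented for this example.
TOY_VARIANTS = [
    {"id": "var1", "af": {"db_a": 0.0002, "db_b": None},
     "pred": {"method_1": True, "method_2": True, "method_3": False}},
    {"id": "var2", "af": {"db_a": 0.02, "db_b": 0.015},
     "pred": {"method_1": True, "method_2": True, "method_3": True}},
    {"id": "var3", "af": {"db_a": None, "db_b": 0.0001},
     "pred": {"method_1": True, "method_2": False, "method_3": False}},
]

# Require both selected methods to agree.
print(rare_damaging(TOY_VARIANTS, ["method_1", "method_2"], min_votes=2))
```

Raising `min_votes` trades sensitivity for specificity, which is one plausible way to use a combination of prediction methods rather than a single one.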

Other sections in Gene4Denovo

Gene4Denovo also contains additional useful sections. These include (i) the browse section, which can be used to access gene-level summary information efficiently; (ii) the upload section, which provides a user-friendly web-based process for uploading and archiving users’ DNMs; (iii) the download section, which allows users to freely access all released datasets in Gene4Denovo without login requirements and to download the complete de novo data files via HTTP; (iv) the data source section, which shows brief information regarding the integrated databases; and (v) the tutorial section, which provides a further description of Gene4Denovo and details on how to get started.

DISCUSSION

A DNM-based strategy for genome and exome analyses provides unprecedented opportunities to advance our knowledge of the genetic pathogenic mechanisms underlying complex human disorders with high clinical and genetic heterogeneity (127–130). However, a major challenge is the scattered distribution of DNM data and annotated genetic and clinical data sources (14,17,18). To make the DNM and annotated data more accessible, we collected DNM data from various published WES/WGS studies, performed uniform comprehensive annotation, and prioritized a list of candidate genes. In addition, we developed a user-friendly, interactive, open-access web-based interface to browse, search, analyse, and download the integrated data. More than 60 popular genomic data sources were integrated into Gene4Denovo in order to provide users with comprehensive information regarding variants and genes.

Gene4Denovo highlights the importance of integrating DNMs from different studies in a uniform manner. First, we found that integrating DNMs from multiple publications for a single disorder improved the power of prioritizing candidate genes by increasing the sample size. Additionally, users will be able to analyse more DNMs by integrating their own in-house data with our database and then prioritize new candidate genes. Second, we found that integrating DNMs from different disorders resulted in an additional 84 candidate genes being prioritized, including eight genes that reached an FDR < 0.01. For example, KMT2C (FDR = 7.02 × 10^−3) was prioritized as a strong candidate gene by combining DNMs from seven disorders (ASD, CHD, UDD, ID, SCZ, ALS and BP). Third, the integrated DNMs of unaffected controls can be used as negative controls in the identification of pathogenic variants or candidate genes. For example, KDM5B carried 9, 3, 3, 1 and 1 Pfun DNMs in patients with ASD (FDR = 2.30 × 10^−8), CHD (FDR = 7.17 × 10^−3), UDD (FDR = 0.093), TD (FDR = 0.36) and CDH (FDR = 0.61), respectively.
However, we found that KDM5B also carried five Pfun DNMs in control cohorts, suggesting that DNMs of KDM5B may not be associated with these diseases. This is consistent with a recent study that found KDM5B to be a recessive gene associated with neurodevelopmental disorders (131).

In total, we prioritized 675 candidate genes, of which 60.59% (409/675) carried Pfun DNMs in two or more disorders. These data may be useful in identifying biomarkers that can be used in a translational setting for genetic counselling and clinical assessment. Some of the candidate genes have been well validated by functional experiments or clinical phenotypes, such as CHD8 (132), SCN2A (133,134), CACNA1E (135) and POGZ (136,137), while others need further functional validation.

Each individual carries ∼70 DNMs in their genome, but only a small number of DNMs contribute to human diseases, making it a challenge to interpret the pathogenicity of DNMs and genes (56). To address this need, we integrated >60 genomic sources into Gene4Denovo in order to provide comprehensive analyses of DNMs and candidate genes. First, it has been reported that approximately one third of the DNMs found in neurodevelopmental disorders are present in the general population and that this type of DNM might not contribute to the risk of developing a disorder (138). Therefore, it was important to integrate population-based background allele frequencies into Gene4Denovo so as to allow for a better understanding of pathogenic variants. Second, several computational methods have been used to predict deleterious variants in humans, but these methods provide inconsistent results (71). Therefore, Gene4Denovo offers prediction scores from 24 in silico algorithms and allows users to select one method or a combination of suitable methods.
Third, Gene4Denovo integrated variant-level information from popular genetic databases, such as ClinVar (104), OMIM (120) and HPO (139), which may help users to comprehensively evaluate the pathogenicity of genes and genetic variants. Fourth, Gene4Denovo integrated meaningful gene-level information, such as gene function and gene expression patterns, in order to provide users with comprehensive information regarding a given gene from a one-stop database.

Despite the advances made by other available databases related to DNMs, Gene4Denovo differs from them in ways that represent major advances. The mirDNMR database (140) focuses on gene-level background DNM rates predicted by four different methods instead of analysing the DNMs themselves. The EpiDenovo database (141) provides the associations between embryonic epigenomes and DNMs in developmental disorders. Compared with denovo-db (17), Gene4Denovo not only integrated more DNMs, but also provided more comprehensive annotation information collected from >60 genomic data sources. This extensive integration should further facilitate the interpretation of DNMs. In addition, Gene4Denovo prioritizes a list of candidate genes with different degrees of statistical evidence. This is important for biologists selecting genes for functional validation and for geneticists and clinicians providing genetic counselling. Furthermore, to help research communities take advantage of the integrated DNMs, candidate genes, and other genomic data sources, Gene4Denovo provides a user-friendly interface for detecting DNMs, homozygous variants, X-linked variants, and co-segregated variants, for performing customized comprehensive annotations, and for prioritizing pathogenic variants and risk genes.

The present study has some limitations that require further effort to resolve.
First, the candidate genes were prioritized using the TADA model, which is influenced by several factors, such as GDNMR, the tools used for predicting deleterious missense variants, and the enrichment of LoF and Dmis. We encourage users to download the integrated DNMs from the Gene4Denovo database and prioritize candidate genes using different parameters and new models. For example, Nguyen et al. developed a new method called extTADA and prioritized 288 candidate genes in neurodevelopmental disorders. Additional experimental validation and more detailed clinical phenotypes of patients are still needed. Second, although noncoding variants are catalogued in the Gene4Denovo database, the current study prioritized candidate genes based only on DNMs in coding regions. DNMs in noncoding regions, such as those in promoters (45), are also likely to contribute to the risk of developing disorders. In the next version of Gene4Denovo, we plan to integrate both coding and noncoding DNMs for prioritizing candidate genes. Third, in order to provide uniform genetic data, Gene4Denovo focused on DNMs detected by WES/WGS, which are the most widely used methods in medical genetics. This means that some DNMs from targeted sequencing studies and case reports were not included. Since it is still challenging to accurately detect de novo copy-number variations (CNVs) from NGS data, especially WES data, the current version of Gene4Denovo did not integrate de novo CNVs. However, we plan to add de novo CNVs in the next version of Gene4Denovo. In addition, we plan to continuously collect DNMs from the latest published WES/WGS studies and to update the Gene4Denovo database every six months. We also encourage and highly appreciate users uploading and archiving their own DNM data in Gene4Denovo through the uploading interface.

In conclusion, Gene4Denovo offers a large number of freely available DNMs with uniform curation and annotations across 24 phenotypes.
Gene4Denovo also provides a list of prioritized candidate genes and comprehensive genetic and clinical information for each DNM and gene. We hope the Gene4Denovo database will be of great convenience to geneticists, biologists, and clinicians and will accelerate the interpretation of DNM pathogenicity and its clinical implications.

140 in total

1. SIFT missense predictions for genomes.

Authors: Robert Vaser; Swarnaseetha Adusumalli; Sim Ngak Leng; Mile Sikic; Pauline C Ng
Journal: Nat Protoc Date: 2015-12-03 Impact factor: 13.491

2. Whole-exome sequencing points to considerable genetic heterogeneity of cerebral palsy.

Authors: G McMichael; M N Bainbridge; E Haan; M Corbett; A Gardner; S Thompson; B W M van Bon; C L van Eyk; J Broadbent; C Reynolds; M E O'Callaghan; L S Nguyen; D L Adelson; R Russo; S Jhangiani; H Doddapaneni; D M Muzny; R A Gibbs; J Gecz; A H MacLennan
Journal: Mol Psychiatry Date: 2015-02-10 Impact factor: 15.992

3. Mental health: On the spectrum.

Authors: David Adam
Journal: Nature Date: 2013-04-25 Impact factor: 49.962

4. Genome-wide de novo risk score implicates promoter variation in autism spectrum disorder.

Authors: Joon-Yong An; Kevin Lin; Lingxue Zhu; Donna M Werling; Shan Dong; Harrison Brand; Harold Z Wang; Xuefang Zhao; Grace B Schwartz; Ryan L Collins; Benjamin B Currall; Claudia Dastmalchi; Jeanselle Dea; Clif Duhn; Michael C Gilson; Lambertus Klei; Lindsay Liang; Eirene Markenscoff-Papadimitriou; Sirisha Pochareddy; Nadav Ahituv; Joseph D Buxbaum; Hilary Coon; Mark J Daly; Young Shin Kim; Gabor T Marth; Benjamin M Neale; Aaron R Quinlan; John L Rubenstein; Nenad Sestan; Matthew W State; A Jeremy Willsey; Michael E Talkowski; Bernie Devlin; Kathryn Roeder; Stephan J Sanders
Journal: Science Date: 2018-12-14 Impact factor: 47.728

5. Coding mutations in NUS1 contribute to Parkinson's disease.

Authors: Ji-Feng Guo; Lu Zhang; Kai Li; Jun-Pu Mei; Jin Xue; Jia Chen; Xia Tang; Lu Shen; Hong Jiang; Chao Chen; Hui Guo; Xue-Li Wu; Si-Long Sun; Qian Xu; Qi-Ying Sun; Piu Chan; Hui-Fang Shang; Tao Wang; Guo-Hua Zhao; Jing-Yu Liu; Xue-Feng Xie; Yi-Qi Jiang; Zhen-Hua Liu; Yu-Wen Zhao; Zuo-Bin Zhu; Jia-da Li; Zheng-Mao Hu; Xin-Xiang Yan; Xiao-Dong Fang; Guang-Hui Wang; Feng-Yu Zhang; Kun Xia; Chun-Yu Liu; Xiong-Wei Zhu; Zhen-Yu Yue; Shuai Cheng Li; Huai-Bin Cai; Zhuo-Hua Zhang; Ran-Hui Duan; Bei-Sha Tang
Journal: Proc Natl Acad Sci U S A Date: 2018-10-22 Impact factor: 11.205

6. De novo mutations in schizophrenia implicate chromatin remodeling and support a genetic overlap with autism and intellectual disability.

Authors: S E McCarthy; J Gillis; M Kramer; J Lihm; S Yoon; Y Berstein; M Mistry; P Pavlidis; R Solomon; E Ghiban; E Antoniou; E Kelleher; C O'Brien; G Donohoe; M Gill; D W Morris; W R McCombie; A Corvin
Journal: Mol Psychiatry Date: 2014-04-29 Impact factor: 15.992

7. Rates, distribution and implications of postzygotic mosaic mutations in autism spectrum disorder.

Authors: Elaine T Lim; Mohammed Uddin; Silvia De Rubeis; Yingleong Chan; Anne S Kamumbu; Xiaochang Zhang; Alissa M D'Gama; Sonia N Kim; Robert Sean Hill; Arthur P Goldberg; Christopher Poultney; Nancy J Minshew; Itaru Kushima; Branko Aleksic; Norio Ozaki; Mara Parellada; Celso Arango; Maria J Penzol; Angel Carracedo; Alexander Kolevzon; Christina M Hultman; Lauren A Weiss; Menachem Fromer; Andreas G Chiocchetti; Christine M Freitag; George M Church; Stephen W Scherer; Joseph D Buxbaum; Christopher A Walsh
Journal: Nat Neurosci Date: 2017-07-17 Impact factor: 24.884

8. Identifying Mendelian disease genes with the variant effect scoring tool.

Authors: Hannah Carter; Christopher Douville; Peter D Stenson; David N Cooper; Rachel Karchin
Journal: BMC Genomics Date: 2013-05-28 Impact factor: 3.969

9. De novo mutations in epileptic encephalopathies.

Authors: Andrew S Allen; Samuel F Berkovic; Patrick Cossette; Norman Delanty; Dennis Dlugos; Evan E Eichler; Michael P Epstein; Tracy Glauser; David B Goldstein; Yujun Han; Erin L Heinzen; Yuki Hitomi; Katherine B Howell; Michael R Johnson; Ruben Kuzniecky; Daniel H Lowenstein; Yi-Fan Lu; Maura R Z Madou; Anthony G Marson; Heather C Mefford; Sahar Esmaeeli Nieh; Terence J O'Brien; Ruth Ottman; Slavé Petrovski; Annapurna Poduri; Elizabeth K Ruzzo; Ingrid E Scheffer; Elliott H Sherr; Christopher J Yuskaitis; Bassel Abou-Khalil; Brian K Alldredge; Jocelyn F Bautista; Samuel F Berkovic; Alex Boro; Gregory D Cascino; Damian Consalvo; Patricia Crumrine; Orrin Devinsky; Dennis Dlugos; Michael P Epstein; Miguel Fiol; Nathan B Fountain; Jacqueline French; Daniel Friedman; Eric B Geller; Tracy Glauser; Simon Glynn; Sheryl R Haut; Jean Hayward; Sandra L Helmers; Sucheta Joshi; Andres Kanner; Heidi E Kirsch; Robert C Knowlton; Eric H Kossoff; Rachel Kuperman; Ruben Kuzniecky; Daniel H Lowenstein; Shannon M McGuire; Paul V Motika; Edward J Novotny; Ruth Ottman; Juliann M Paolicchi; Jack M Parent; Kristen Park; Annapurna Poduri; Ingrid E Scheffer; Renée A Shellhaas; Elliott H Sherr; Jerry J Shih; Rani Singh; Joseph Sirven; Michael C Smith; Joseph Sullivan; Liu Lin Thio; Anu Venkat; Eileen P G Vining; Gretchen K Von Allmen; Judith L Weisenberg; Peter Widdess-Walsh; Melodie R Winawer
Journal: Nature Date: 2013-08-11 Impact factor: 49.962

Review 10. Genomic insights into the overlap between psychiatric disorders: implications for research and clinical practice.

Authors: Joanne L Doherty; Michael J Owen
Journal: Genome Med Date: 2014-04-28 Impact factor: 11.117
