Dennis Lal, Patrick May, Eduardo Perez-Palma, Kaitlin E Samocha, Jack A Kosmicki, Elise B Robinson, Rikke S Møller, Roland Krause, Peter Nürnberg, Sarah Weckhuysen, Peter De Jonghe, Renzo Guerrini, Lisa M Niestroj, Juliana Du, Carla Marini, James S Ware, Mitja Kurki, Padhraig Gormley, Sha Tang, Sitao Wu, Saskia Biskup, Annapurna Poduri, Bernd A Neubauer, Bobby P C Koeleman, Katherine L Helbig, Yvonne G Weber, Ingo Helbig, Amit R Majithia, Aarno Palotie, Mark J Daly.
Abstract
BACKGROUND: Classifying the pathogenicity of missense variants is a major challenge in clinical practice when diagnosing rare and genetically heterogeneous neurodevelopmental disorders (NDDs). While orthologous gene conservation is commonly employed in variant annotation, approximately 80% of known disease-associated genes belong to gene families. The use of gene family information for disease gene discovery and variant interpretation has not yet been investigated on a genome-wide scale. We empirically evaluate whether paralog-conserved or non-conserved sites in human gene families are important in NDDs.
Keywords: Conservation; Gene family; Missense variants; Neurodevelopmental disorders; Paralogs
Year: 2020 PMID: 32183904 PMCID: PMC7079346 DOI: 10.1186/s13073-020-00725-6
Source DB: PubMed Journal: Genome Med ISSN: 1756-994X Impact factor: 11.117
Fig. 1 Vertical (ortholog) vs. horizontal (paralog) conservation. Top: protein sequence alignment of voltage-gated sodium channels. Top left: alignment of Homo sapiens (NP_001159435.1), Bos taurus (NP_001180147.1), and Mus musculus (NP_001300926.1) SCN1A protein sequences. High sequence similarity is depicted by violet amino acid coloring and yellow conservation bars below the alignment using JalView. Top right: protein alignment in JalView of all members of the human voltage-gated sodium channel gene family (SCN1A, SCN2A, SCN3A, SCN4A, SCN5A, SCN7A, SCN8A, SCN9A, SCN10A, SCN11A). This alignment of paralogs shows less conservation compared to the alignment of SCN1A to its vertical cross-species orthologs on the left. Bottom left: GERP score analysis over all genes within gene families (homolog conservation is measured by the percentage of all nucleotides per gene with GERP scores > 2). Bottom right: distribution percentage of nucleotides per gene within gene families having para_zscores > 0. Conservation between close homologs is generally much more uniform and homogeneous than conservation between paralogs.
Fig. 2 Assessment of paralog conservation. a Identification of missense variant gene family enrichment in NDD patients for paralog-conserved missense variants. NDD-associated missense variants are enriched in paralog-conserved sites. y-axis: missense variant enrichment analysis considering only paralog non-conserved sites across genes of each gene family (para_zscore ≤ 0, pmissense_not_conserved). x-axis: missense variant enrichment analysis considering only paralog-conserved sites (para_zscore > 0, pmissense_conserved). None of the gene families shows exome-wide significant enrichment for paralog non-conserved sites. Twenty-six gene families (depicted by circles) show exome-wide significant de novo missense variant burden at paralog-conserved sites. The significance threshold was calculated by Bonferroni correction for testing 5 × 2871 gene families (P = 3.48 × 10^−6) and is depicted by the blue dotted line. b Enrichment of missense variants in paralog-conserved sites in genes with significant DNM burden in this study. Distribution of NDD patient missense, nonsense, and synonymous para_zscores for all non-significantly enriched genes (top panel) and genes significantly enriched for DNM missense variants (bottom panel), depicted by density plots. DNM burden was calculated using the mutational framework described by Samocha et al. (for details, see the “Methods” section). Genes were categorized into two groups: those with a significant burden and those without. In disease-associated genes (those with DNM burden), missense variants were enriched at paralog-conserved sites relative to missense variants in non-significantly enriched genes (P value < 2.2E−16, top vs. bottom panel). Missense variants in genes without DNM burden were not enriched at paralog-conserved sites compared to synonymous variants (P value = 0.1157, top panel). In genes with DNM burden (bottom panel), missense variants were significantly enriched at paralog-conserved sites compared to synonymous variants (P value = 3.01 × 10^−4). The same test for nonsense variants vs. synonymous variants did not show significant differences in paralog conservation (P value = 0.3913). P values were calculated using a Wilcoxon test.
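The exome-wide threshold quoted in the caption can be reproduced with a short calculation. This is a minimal sketch that assumes a family-wise alpha of 0.05, which the caption does not state explicitly; the function and variable names are illustrative only.

```python
# Bonferroni threshold for the gene family burden tests.
# ASSUMPTION: family-wise alpha of 0.05 (not stated in the caption).
ALPHA = 0.05
N_TESTS_PER_FAMILY = 5   # the "5" in "5 x 2871 gene families"
N_GENE_FAMILIES = 2871


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Return the per-test significance threshold alpha / n_tests."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


threshold = bonferroni_threshold(ALPHA, N_TESTS_PER_FAMILY * N_GENE_FAMILIES)
print(f"{threshold:.2e}")  # 3.48e-06
```

Under that assumption the result agrees with the threshold of 3.48 × 10^−6 given in the caption.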
Forty-three significantly enriched gene families in the combined de novo paralog-conserved missense and PTV analysis for 10,068 NDD trios. Only enriched gene families significant after applying the Bonferroni significance threshold for testing 5 × 2871 gene families (3.48 × 10^−6) are included. Gene names highlighted in red are affected by DNMs, and the number of DNMs is indicated in parentheses. Genes in bold have not previously been reported as significantly enriched in exome-wide ASD, DD, or EPI studies.
| Gene families | DNMs expected | DNMs observed | P value | 5264 DD patients | 3982 ASD patients | 822 EPI patients | 2087 controls |
|---|---|---|---|---|---|---|---|
| | 1.40 | 12 | 3.33E−08 | 11 | 1 | 0 | 1 |
| | 0.90 | 10 | 4.31E−08 | 8 | 2 | 0 | 1 |
| | 1.18 | 11 | 5.29E−08 | 9 | 2 | 0 | 0 |
| | 0.75 | 9 | 1.04E−07 | 9 | 0 | 0 | 0 |
| | 1.04 | 10 | 1.61E−07 | 8 | 1 | 1 | 0 |
| | 0.82 | 9 | 2.21E−07 | 9 | 0 | 0 | 0 |
| | 0.28 | 6 | 5.43E−07 | 6 | 0 | 0 | 0 |
| | 0.53 | 7 | 1.48E−06 | 4 | 3 | 0 | 0 |
| | 0.35 | 6 | 2.01E−06 | 1 | 5 | 0 | 0 |
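To illustrate how an expected and an observed DNM count can be turned into an enrichment P value, the sketch below computes a one-sided Poisson upper tail. The choice of a Poisson test is an assumption made for illustration; this excerpt does not state which test produced the P values in the table, and the function name is hypothetical. The two input rows are taken from the table above.

```python
import math


def poisson_upper_tail(observed: int, expected: float) -> float:
    """Return P(X >= observed) for X ~ Poisson(expected), with expected > 0.

    The upper tail is summed directly (rather than computed as 1 - CDF)
    so that very small probabilities keep their precision.
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    if observed <= 0:
        return 1.0
    # Start from P(X = observed), computed in log space.
    log_term = (-expected + observed * math.log(expected)
                - math.lgamma(observed + 1))
    term = math.exp(log_term)
    total = 0.0
    k = observed
    for _ in range(1000):  # the terms shrink quickly once k > expected
        total += term
        k += 1
        term *= expected / k
        if term < total * 1e-17:
            break
    return total


# (DNMs expected, DNMs observed) for two rows of the table above.
# ASSUMPTION: a one-sided Poisson test; not confirmed by this excerpt.
rows = [(1.40, 12), (0.28, 6)]
p_values = [poisson_upper_tail(obs, exp) for exp, obs in rows]
for (exp, obs), p in zip(rows, p_values):
    print(f"expected={exp:.2f} observed={obs} p={p:.2e}")
```

Because the expected counts in the table are rounded to two decimals, values computed this way should only be compared with the tabulated P values by order of magnitude.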
Fig. 3 Established NDD disease genes are brain expressed and under evolutionary constraint. Every dot represents a gene of the 43 DNM-enriched gene families. The colors of the box and font represent the number of DNMs (N.DNM) identified in the gene in 10,068 NDD trios. y-axis: brain gene expression level in RPKM derived from the GTEx expression dataset; x-axis: gene constraint scores (left: pLI, indicating gene LoF intolerance; right: missense z-score, indicating gene missense intolerance). Disease-associated DNMs are likely to affect brain-expressed and evolutionarily constrained genes (defined as brain expression RPKM > 1, constraint score pLI ≥ 0.9 and missense z-score > 3.09; green boxes). In support of this hypothesis, we observe that all previously known and frequently mutated genes are brain expressed and under evolutionary constraint.
Fig. 4 Visualization of para_zscores for KCNQ2, STXBP1, CACNA1A, and GRIN2B. Protein sequence is plotted from left to right. Each bar and dot represents one amino acid. Amino acids affected by a missense mutation in the NDD cohort are colored blue, patient PTVs are depicted in pink, and synonymous variants in orange. Amino acid residues with no mutations are colored gray. y-axis: para_zscore. Positive values indicate paralog conservation, and the highest score indicates that these amino acids are identical over all gene family members. The red dotted lines indicate the mean paralog conservation of each protein sequence, and the bars below the mean indicate regions of low paralog conservation and thus higher sequence variability over all members of the gene family.
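The reading of these plots can be sketched in code: residues with a positive para_zscore count as paralog-conserved, and residues below the per-protein mean mark regions of low paralog conservation. The scores below are hypothetical values chosen for illustration, not data from the study, and the function name is likewise an assumption.

```python
from statistics import mean


def summarize_para_zscores(scores):
    """Return (mean score, conserved positions, below-mean positions).

    Positions are 1-based residue indices. "Conserved" uses the
    para_zscore > 0 cut-off; "below mean" mirrors the red dotted line.
    """
    mu = mean(scores)
    conserved = [i for i, s in enumerate(scores, start=1) if s > 0]
    below_mean = [i for i, s in enumerate(scores, start=1) if s < mu]
    return mu, conserved, below_mean


# HYPOTHETICAL per-residue para_zscores for a six-residue stretch.
para_zscores = [1.2, 0.8, -0.5, -1.1, 0.3, 1.2]
mean_score, conserved_positions, low_positions = summarize_para_zscores(para_zscores)
print(conserved_positions)  # [1, 2, 5, 6]
print(low_positions)        # [3, 4, 5]
```

Residue 5 in this toy example is paralog-conserved by the sign cut-off yet still falls below the protein mean, which shows that the two criteria need not agree.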