Eduardo Pérez-Palma, Patrick May, Sumaiya Iqbal, Lisa-Marie Niestroj, Juanjiangmeng Du, Henrike O Heyne, Jessica A Castrillon, Anne O'Donnell-Luria, Peter Nürnberg, Aarno Palotie, Mark Daly, Dennis Lal.
Abstract
Missense variant interpretation is challenging. Essential regions for protein function are conserved among gene-family members, and genetic variants within these regions are potentially more likely to confer risk to disease. Here, we generated 2871 gene-family protein sequence alignments involving 9990 genes and performed missense variant burden analyses to identify novel essential protein regions. We mapped 2,219,811 variants from the general population into these alignments and compared their distribution with 76,153 missense variants from patients. With this gene-family approach, we identified 465 regions enriched for patient variants spanning 41,463 amino acids in 1252 genes. As a comparison, by testing the same genes individually, we identified fewer patient variant enriched regions, involving only 2639 amino acids and 215 genes. Next, we selected de novo variants from 6753 patients with neurodevelopmental disorders and 1911 unaffected siblings and observed an 8.33-fold enrichment of patient variants in our identified regions (95% C.I. = 3.90-Inf, P-value = 2.72 × 10−11). By using the complete ClinVar variant set, we found that missense variants inside the identified regions are 106-fold more likely to be classified as pathogenic in comparison to benign classification (OR = 106.15, 95% C.I. = 70.66-Inf, P-value < 2.2 × 10−16). All pathogenic variant enriched regions (PERs) identified are available online through "PER viewer," a user-friendly online platform for interactive data mining, visualization, and download. In summary, our gene-family burden analysis approach identified novel PERs in protein sequences. This annotation can empower variant interpretation.
Year: 2019 PMID: 31871067 PMCID: PMC6961572 DOI: 10.1101/gr.252601.119
Source DB: PubMed Journal: Genome Res ISSN: 1088-9051 Impact factor: 9.043
Figure 1. Study workflow and the PER viewer. (A) Starting from protein alignments of paralogous genes (gene-family approach) or all genes (gene-wise approach), missense variants from gnomAD (population; green) and ClinVar/HGMD (patient; purple) were mapped independently to the corresponding amino acid positions. (B) The mapping follows a binary notation. Sites with at least one reported missense variant were assigned a “1” state; sites with no reported variant were assigned a “0” state. A sliding window (bin) of amino acids was moved over the alignment/sequence, and the sites counted in each bin were used to calculate the corresponding missense burden. (C) The ratio between the number of sites with missense variants inside and outside the bin defines the burden area (population burden = green; patient burden = purple). Statistical comparison between the population and patient variant burden across aligned sequences allowed the identification of significant pathogenic variant enriched regions (PERs; red area).
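The binary mapping and sliding-window counting described in panels B and C can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `binary_track` and `window_burden`, the 1-based position convention, the window size, and the toy positions are all hypothetical, since the caption does not specify them.

```python
def binary_track(length, variant_positions):
    """Return a 0/1 list over `length` sites (hypothetical helper).

    A site is 1 if at least one missense variant maps to it, else 0.
    Positions are assumed to be 1-based; repeated positions still give 1.
    """
    track = [0] * length
    for pos in variant_positions:
        if 1 <= pos <= length:
            track[pos - 1] = 1
    return track


def window_burden(track, window):
    """Slide a bin of `window` sites over the track (hypothetical helper).

    For each bin, return (start_position, sites_inside, sites_outside),
    where the counts are the numbers of sites in state 1 inside and
    outside the bin. The window size is an assumed parameter.
    """
    total = sum(track)
    results = []
    for start in range(len(track) - window + 1):
        inside = sum(track[start:start + window])
        results.append((start + 1, inside, total - inside))
    return results


# Toy data, invented for illustration only.
population = binary_track(12, [1, 2, 5, 11])
patient = binary_track(12, [6, 7, 8, 8, 9])

for (start, pop_in, pop_out), (_, pat_in, pat_out) in zip(
    window_burden(population, 4), window_burden(patient, 4)
):
    print(start, (pat_in, pat_out), (pop_in, pop_out))
```

The inside and outside counts for the patient and population tracks form a 2 × 2 table per bin, which is the kind of input a statistical comparison between the two burdens would need. Returning counts rather than the ratio itself avoids a division by zero when no variant sites fall outside a bin.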
Figure 2. PERs detected with the family-wise and gene-wise burden analyses. Summary statistics for the family-wise (orange) and gene-wise (green) approaches are shown for the number of PERs detected (A), the number of genes with PERs (B), and the number of amino acids involved in PERs (C). For B and C, the number of genes and amino acids associated with disease is shown in purple. (D) A Venn diagram shows how the genes with PERs are distributed between the two approaches. (E,F) Overall enrichment (log odds ratio) and significance (adjusted P-value) distributions of all PERs detected in each approach are shown in E and F, respectively.
Figure 3. PER viewer tool example: the voltage-gated sodium channel family. (A) Missense burden analysis of the voltage-gated sodium channel protein family (family ID: 2614.subset.3), composed of SCN1A, SCN2A, SCN3A, SCN4A, SCN5A, SCN7A, SCN8A, SCN9A, SCN10A, and SCN11A. Population and patient missense burden are shown in green and purple, respectively. Significant pathogenic enriched regions (PERs) identified are shown in the red negative area and are proportional to their adjusted P-values (gray horizontal lines). (B) Table view of pathogenic enriched region 5 (PER5; positions 941–949). Gene columns denote each gene's canonical sequence alongside the corresponding amino acid position. The column “Gene:Disease” displays analogous diseases observed in the patient data set. N/A sites show aligned amino acid positions with no disease reported.
Figure 4. Disease-causing variants are enriched in PERs. (A) Neurodevelopmental disorder DNVs inside PERs. Case and control comparison of DNVs inside PERs retrieved from Heyne et al. (2018) is shown for all genes (blue; OR = 8.33, 95% C.I. = 3.90-Inf, P-value = 2.72 × 10−11) and genes with high probability of being loss-of-function intolerant (light blue; OR = Inf, 95% C.I. = 7.48-Inf, P-value = 1.34 × 10−9). Fold enrichment observed in cases was calculated with a one-sided Fisher's exact test. The resulting odds ratios (OR) with 95% confidence intervals and corresponding P-values are shown on the horizontal axis. (B) ClinVar missense variants (from the October 2019 release) inside PERs with benign and unknown (VUS) clinical significance. The number of variants observed is shown considering all genes (blue) and pLI > 0.9 genes only (light blue). (C) Burden analysis performance over time. PER detection was performed with patient variants reported until 2017 and 2018 and compared with the current 2019 data set analysis. (Left) Overall number of PERs, amino acids, and genes detected as a function of the number of input patient variants. (Right) Rate of PERs, genes, and amino acids detected per patient variant contained in the 2017, 2018, and 2019 sources.
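To make the one-sided Fisher's exact test in panel A concrete, the following Python sketch computes the enrichment P-value for a 2 × 2 table directly from the hypergeometric distribution. It is an illustration under assumptions: the function name `fisher_greater`, the table layout, and the example counts are hypothetical and are not taken from the study. The odds ratio here is the simple cross-product ratio, which may differ from the estimate reported by the software the authors used.

```python
from math import comb, inf


def fisher_greater(a, b, c, d):
    """One-sided (enrichment) Fisher's exact test (hypothetical helper).

    Assumed table layout:
                     inside PER   outside PER
        cases             a            b
        controls          c            d

    Returns (odds_ratio, p_value), where p_value is the probability of
    seeing `a` or more case variants inside PERs with all margins fixed.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)
    # Sum hypergeometric probabilities for tables at least as extreme.
    p_value = sum(
        comb(row1, k) * comb(row2, col1 - k)
        for k in range(a, min(row1, col1) + 1)
    ) / denom
    # Cross-product odds ratio; infinite when a zero cell is in the denominator.
    odds_ratio = inf if b * c == 0 else (a * d) / (b * c)
    return odds_ratio, p_value


# Invented counts, for illustration only.
print(fisher_greater(3, 1, 1, 3))
print(fisher_greater(5, 0, 2, 3))
```

With this simple estimator, a zero cell in the denominator gives an infinite odds ratio, as in the second example call.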