Kohei Hamanaka1, Noriko Miyake2, Takeshi Mizuguchi2, Satoko Miyatake2,3, Yuri Uchiyama2,4, Naomi Tsuchida2,4, Futoshi Sekiguchi2, Satomi Mitsuhashi2, Yoshinori Tsurusaki5, Mitsuko Nakashima6, Hirotomo Saitsu6, Kohei Yamada2, Masamune Sakamoto2, Hiromi Fukuda2, Sachiko Ohori2, Ken Saida2, Toshiyuki Itai2, Yoshiteru Azuma2,7, Eriko Koshimizu2, Atsushi Fujita2, Biray Erturk8,9, Yoko Hiraki10, Gaik-Siew Ch'ng11, Mitsuhiro Kato12, Nobuhiko Okamoto13, Atsushi Takata14,15, Naomichi Matsumoto16.
Abstract
BACKGROUND: Previous large-scale studies of de novo variants identified a number of genes associated with neurodevelopmental disorders (NDDs); however, it was also predicted that many NDD-associated genes await discovery. Such genes can be discovered by integrating copy number variants (CNVs), which have not been fully considered in previous studies, and increasing the sample size.
Keywords: Autism spectrum disorder; Copy number variant; Copy number variation; De novo variant; Deep learning; Epileptic encephalopathy; Intellectual disability; Mutation rate; Neurodevelopmental disorder; Rare disease
Year: 2022 PMID: 35468861 PMCID: PMC9040275 DOI: 10.1186/s13073-022-01042-w
Source DB: PubMed Journal: Genome Med ISSN: 1756-994X Impact factor: 15.266
Fig. 1 Framework for estimating mutation rates of < 1 Mb LOF CNVs per gene. a A conceptual overview showing the method for calculating the mutation rates of < 1 Mb LOF dnCNVs per gene. b A scheme depicting the method for selecting training genes. We selected training genes (here, the gene in red) that are LOF-tolerant and flanked by upstream and downstream > 1 Mb regions without any LOF-intolerant genes
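The training-gene selection in panel b can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Gene` record, the function name, and both LOEUF cutoffs are hypothetical placeholders, since the caption does not state how LOF tolerance and intolerance were thresholded. Only the 1 Mb flank comes from the caption.

```python
from dataclasses import dataclass

FLANK = 1_000_000          # 1 Mb flank on each side, as stated in the caption
TOLERANT_LOEUF = 1.0       # hypothetical cutoff: LOEUF at or above this = LOF-tolerant
INTOLERANT_LOEUF = 0.35    # hypothetical cutoff: LOEUF below this = LOF-intolerant


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    start: int
    end: int
    loeuf: float  # lower LOEUF means stronger LOF intolerance


def select_training_genes(genes):
    """Return names of LOF-tolerant genes with no LOF-intolerant gene within 1 Mb."""
    intolerant = [g for g in genes if g.loeuf < INTOLERANT_LOEUF]
    selected = []
    for g in genes:
        if g.loeuf < TOLERANT_LOEUF:
            continue  # not LOF-tolerant under the assumed cutoff
        lo, hi = g.start - FLANK, g.end + FLANK
        # Reject the gene if any LOF-intolerant gene overlaps the flanked window.
        clash = any(
            x.chrom == g.chrom and x.start <= hi and x.end >= lo
            for x in intolerant
        )
        if not clash:
            selected.append(g.name)
    return selected
```

Under these assumptions, a tolerant gene is kept only when the window extending 1 Mb beyond each end of the gene contains no intolerant gene on the same chromosome.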
Fig. 6 Integration of the bioinformatic analysis results using deep learning. a Scheme for the NN model. White circle: neurons of layers; line: connections between neurons. b AUC of the full NN model, the eight predictors, and the three existing gene prioritization metrics for PC3 and NC3. The blue violin plot for the NN model (“NN”) represents the distribution based on 500 full NN models, with a red dot indicating the median. c Violin plots of the full NN model scores of various gene sets. PL: the 34 plausible candidate genes. P-values of one-tailed Wilcoxon rank-sum tests are shown above. d Posterior probabilities that the 34 plausible candidate genes are true NDD-associated genes. The probabilities are the median of probabilities computed by 100 full NN models. NN model scores are shown in parentheses. Genes are arranged in the order of NN model scores. Dotted line: 90%
Fig. 2 Contribution of dnCNVs to statistical significance of DNM enrichment analyses. a A plot of q-values of DNM enrichment analyses for each gene before (x-axis) and after (y-axis) combining dnCNV data. The gray diagonal line indicates the line of y = x. The small inset is a magnified image. The dotted lines in the small inset indicate thresholds for exome-wide statistical significance (q-value = 0.05). b Visualization of the LOF dnCNVs affecting GLTSCR1 in a YCU case. From top to bottom, the plots show the exon–intron structures of the canonical transcripts, LOEUF, CNVs called by the exome hidden Markov model (XHMM), and z scores of depth in the XHMM analysis. LOEUF of each gene is shown as a horizontal line corresponding to its genomic region. In the plot of z score for depth, the red line indicates the z score of the case with the LOF dnCNV, and the black lines indicate the z scores of 500 randomly selected control individuals. c IGV images of WGS data of a family with a UBR3 dnCNV (13302) and a family with a MARK2 dnCNV (12103). At the top, coverage and paired-end reads of all family members and exon–intron structures of genes are shown. At the bottom, magnified images of coverage and paired-end reads of the affected proband are shown. In the magnified images, discordant read pairs, whose read one and read two surround a dnCNV, are connected with a black line, and split reads, which span a breakpoint, are connected with a red line. p1, the affected proband; fa, the father; mo, the mother; s1, the healthy sibling
Fig. 3 Spatiotemporal expression patterns of the 328 known and 34 plausible candidate genes. a Enrichment analyses of genes specifically expressed in each brain region at each developmental stage in the 328 known (the six columns of large hexagons) and 34 plausible new genes (columns of small hexagons on the right of the columns of large hexagons). Sizes of the hexagons for the 328 genes correlate with their gene set sizes. The red colors correspond to q-values of Fisher’s exact tests adjusted by the BH method. The regions of the hexagons for the 328 genes closer to the center of each hexagon correspond to genes with smaller pSI scores, namely, increasing specificity (< 0.05, < 0.01, < 0.001, and < 0.0001, respectively), while the hexagons for the 34 genes correspond to genes with pSI scores < 0.05. b Enrichment analyses of genes of each co-expression module in the 328 known (the upper row) and 34 plausible candidate genes (the lower row). The circle colors correspond to q-values of hypergeometric tests adjusted by the BH method. The circle sizes indicate the ratio of each module proportion in the 328 or 34 genes relative to that in all genes
Fig. 4 GO terms enriched in the 328 known and 34 plausible candidate genes. a Clusters of GO terms enriched (q-value < 0.01) in the 328 known and 34 plausible candidate genes. Only clusters of ten or more nodes are shown. Each node represents a GO term. Nodes are connected by an edge when the Jaccard and overlap combined coefficient for their gene members is > 0.5. Node size represents the number of gene members. Nodes are colored red when the nodes are statistically significant in the 34 plausible candidate genes. Gray ovals represent manually annotated GO groups. b Histograms of numbers of GO terms enriched (q-value < 0.01) in 34 randomly selected genes. This simulation was repeated 1000 times. In each simulation, only the 1086 terms enriched in the 328 known genes (Additional file 2: Table S13) were analyzed. Red bars indicate the number of GO terms enriched in the 34 plausible candidate genes. Empirical p-values of the enrichment in the 34 genes are the proportion of simulations with a number of GO terms equal to or more than that of the red bars. BP, GO biological process terms; CC, GO cellular component terms; MF, GO molecular function terms
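The empirical p-value defined in panel b is a simple proportion, which the following Python sketch makes explicit. The function name and arguments are illustrative only; the enrichment analysis that produces the per-simulation counts is not reproduced here.

```python
def empirical_p_value(observed, simulated):
    """Proportion of simulations whose count is equal to or more than the observed count.

    observed:  number of enriched GO terms in the real gene set
    simulated: numbers of enriched GO terms, one per random gene set
    """
    if not simulated:
        raise ValueError("at least one simulated count is required")
    hits = sum(1 for s in simulated if s >= observed)
    return hits / len(simulated)
```

With 1000 simulations, as in the caption, the smallest nonzero value this proportion can take is 0.001, and a result of 0 means that no random gene set reached the observed count.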
Fig. 5 STRING clusters enriched in the 328 known and 34 plausible candidate genes. STRING clusters whose members are enriched (q-value < 0.01) in the proteins encoded by the 328 known and 34 candidate genes. Nodes are clustered according to the similarity of their members. Nodes are connected by an edge when the Jaccard and overlap combined coefficient for their members is > 0.375. Gray nodes: STRING clusters significantly enriched in the 328 known genes; red nodes: STRING clusters significantly enriched in the 34 candidate genes. Gray ovals: groups of nodes with similar annotations
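The edge rule used in the GO and STRING networks can be sketched in Python as follows. The captions do not state how the Jaccard and overlap coefficients are weighted, so the equal weighting (`k = 0.5`) below is an assumption, as are the function names.

```python
def jaccard(a, b):
    """Jaccard coefficient: shared members divided by all distinct members."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap(a, b):
    """Overlap coefficient: shared members divided by the smaller set's size."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def combined_coefficient(a, b, k=0.5):
    """Weighted mix of overlap and Jaccard; k = 0.5 is an assumed weighting."""
    return k * overlap(a, b) + (1 - k) * jaccard(a, b)


def has_edge(a, b, threshold):
    """Connect two nodes when the combined coefficient exceeds the threshold."""
    return combined_coefficient(a, b) > threshold
```

For example, two sets of four members that share two members have a Jaccard coefficient of 1/3 and an overlap coefficient of 1/2, giving a combined value of about 0.417 under equal weighting. That pair would be connected at the 0.375 threshold but not at the 0.5 threshold.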