August Y Huang, Xiaojing Xu, Adam Y Ye, Qixi Wu, Linlin Yan, Boxun Zhao, Xiaoxu Yang, Yao He, Sheng Wang, Zheng Zhang, Bowen Gu, Han-Qing Zhao, Meng Wang, Hua Gao, Ge Gao, Zhichao Zhang, Xiaoling Yang, Xiru Wu, Yuehua Zhang, Liping Wei.
Abstract
Postzygotic single-nucleotide mutations (pSNMs) have been studied in cancer and a few other human overgrowth disorders at whole-genome scale and found to play critical roles. However, in clinically unremarkable individuals, pSNMs have never been identified at whole-genome scale, largely due to technical difficulties and the lack of matched control tissue samples, and thus the genome-wide characteristics of pSNMs remain unknown. We developed a new Bayesian-based mosaic genotyper and a series of effective error filters, with which we identified 17 SNM sites from ~80× whole-genome sequencing of peripheral blood DNA from three clinically unremarkable adults. The pSNMs were thoroughly validated using pyrosequencing, Sanger sequencing of individual cloned fragments, and multiplex ligation-dependent probe amplification. The mutant allele fraction ranged from 5% to 31%. We found that C→T and C→A were the predominant types of postzygotic mutations, similar to the somatic mutation profile in tumor tissues. Simulation data showed that the overall mutation rate was an order of magnitude lower than that in cancer. We detected varied allele fractions of the pSNMs among multiple samples obtained from the same individuals, including blood, saliva, hair follicle, buccal mucosa, urine, and semen samples, indicating that pSNMs could affect multiple sources of somatic cells as well as germ cells. Two of the adults have children who were diagnosed with Dravet syndrome. We identified two non-synonymous pSNMs in SCN1A, a causal gene for Dravet syndrome, from these two unrelated adults and found that the mutant alleles were transmitted to their children, highlighting the clinical importance of detecting pSNMs in genetic counseling.
Year: 2014 PMID: 25312340 PMCID: PMC4220156 DOI: 10.1038/cr.2014.131
Source DB: PubMed Journal: Cell Res ISSN: 1001-0602 Impact factor: 25.617
Figure 1. A new computational pipeline for genome-wide identification of pSNMs without matched control tissue samples. (A) Overall framework of the pipeline, including read pre-processing, genotyping, and filtering. The processes of mosaic identification and filtering were implemented in our scripts. (B) The Bayesian-based genotyper demonstrated as a probabilistic graphical model. Four genotypes were defined: ref-hom for “homozygous for the reference allele”, het for “heterozygous”, alt-hom for “homozygous for the alternative allele”, and mosaic for “mosaic”. The posterior probabilities were inferred from prior and conditional probabilities that were calculated or simulated from known population genetics data and next-generation sequencing data (see Materials and Methods). (C) Simulated behavior of the Bayesian genotyper when the sequencing depth and base quality varied. The X axis denotes the alternative allele fraction. The Y axis denotes the posterior probability of the four genotypes. Columns 1 to 4 represent sequencing depths of 20, 40, 80, and 160, respectively, and rows 1 to 3 represent base qualities of 10, 20, and 30, respectively. The simulation showed that increasing sequencing depth improved the power to distinguish between mosaic and heterozygous sites, whereas increasing base quality helped to distinguish between mosaic and homozygous sites. (D) The power to distinguish mosaic sites from the simulated ∼20 000 homozygous and ∼20 000 heterozygous sites by sequentially applying the Bayesian genotyper and each of the ten error filters. This result demonstrates the high specificity of our pipeline in excluding germline sites and the relative contribution of the genotyper and filters.
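
The four-genotype posterior described in panel (B) can be illustrated with a minimal sketch. This is not the authors' model: it assumes a plain binomial read model, a single symmetric per-base error rate (`base_error`), a uniform distribution over the mosaic allele fraction, and flat placeholder priors, whereas the paper derives its priors and conditional probabilities from population genetics and sequencing data. All function and parameter names are hypothetical.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of k alt-supporting reads out of n when each read shows alt with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def genotype_posteriors(alt, depth, base_error=0.01, priors=None, grid=200):
    """Posterior over ref-hom, het, alt-hom and mosaic for one site (illustrative only)."""
    if priors is None:
        # Flat placeholder priors (assumption); the paper uses data-derived priors.
        priors = {"ref-hom": 0.25, "het": 0.25, "alt-hom": 0.25, "mosaic": 0.25}

    def observed(theta):
        # Chance that a read shows the alt allele when the true alt fraction is theta,
        # under a simplified symmetric error model (assumption).
        return theta * (1 - base_error) + (1 - theta) * base_error

    lik = {
        "ref-hom": binom_pmf(alt, depth, observed(0.0)),
        "het": binom_pmf(alt, depth, observed(0.5)),
        "alt-hom": binom_pmf(alt, depth, observed(1.0)),
    }
    # Mosaic: average the likelihood over a uniform grid of allele fractions in (0, 1).
    thetas = [(i + 0.5) / grid for i in range(grid)]
    lik["mosaic"] = sum(binom_pmf(alt, depth, observed(t)) for t in thetas) / grid

    joint = {g: priors[g] * lik[g] for g in lik}
    total = sum(joint.values())
    return {g: v / total for g, v in joint.items()}


# Example: 9 alt reads out of 127, as for one of the validated sites.
post = genotype_posteriors(alt=9, depth=127)
best = max(post, key=post.get)
```

With flat priors, a site at roughly 7% alt fraction and depth 127 is assigned to the mosaic class, while sites near 0%, 50%, or 100% fall into the three germline classes. Realistic priors would weight the mosaic class far lower, which is why the downstream error filters matter.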
Table 1. Error filters used in the computational pipeline
| Filter name | Definition |
|---|---|
| Repetitive regions | We rejected nucleotide positions (“sites”) located in annotated repetitive DNA elements and self-alignment regions with similarity score > 80. |
| Homopolymers | We rejected sites located in or near homopolymers which were defined as four or more continuous identical nucleotides, and their flanking regions which were defined as 2 bp from homopolymers shorter than 6 nt or 3 bp from longer homopolymers. |
| Base-calling error | We rejected sites for which the minor allele could be explained by random base-calling errors according to LoFreq[ |
| Extreme depth | We rejected sites with sequencing depth that was either too low (< 25) or too high (> 150), compared to the average sequencing depth of ∼80. |
| Misaligned reads | We rejected sites where > 50% of the reads supporting the major or minor alleles had high risk of being misaligned, defined as when the BWA and BLAT alignments were inconsistent or when the site fell within 15 bp of the start or end of the aligned read or within 5 bp from a gap in the alignment. |
| Strand bias | We rejected sites where the majority of reads supporting the alternative allele were found in only one strand direction. The Fisher's exact test was performed to compare the ratio of the reads supporting the reference and alternative alleles between two strand directions, and sites with a |
| Clustered sites | We rejected sites located in or within 20 kb from the genomic regions clustered with three or more sites with minor allele fractions between 10% and 35% and maximal distance between two adjacent sites < 10 kb. |
| Complete linkage | We rejected sites for which one allele showed complete coincidence with an adjacent polymorphic site within the same read-pair. The Fisher's exact test was performed by counting the number of read-pairs supporting the four types of allele combinations, and sites with a |
| Within-read position | We rejected sites where the majority of reads supporting the alternative alleles carried the site clustered at one end of the reads. The Wilcoxon rank-sum test was performed to compare the positions of the site along the reads between those supporting the reference and alternative alleles, and sites with a
| Observed in common | We rejected sites whose allele fractions showed large deviations from germline expectations in two or more individuals. |
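
Three of the filters above (extreme depth, homopolymers, strand bias) are simple enough to sketch in code. The following is an illustration under stated assumptions, not the authors' scripts: all function names are hypothetical, homopolymers of exactly 6 nt are treated as "longer" (the definition above leaves that case open), and the strand-bias significance threshold `alpha` must be supplied by the caller because the cutoff is not given in the table text.

```python
import re
from math import comb


def passes_depth(depth, low=25, high=150):
    """Extreme-depth filter: keep sites with depth between 25 and 150 inclusive."""
    return low <= depth <= high


def near_homopolymer(seq, pos):
    """True if 0-based index pos lies in or near a run of four or more identical bases.

    Flank is 2 bp for runs shorter than 6 nt and 3 bp otherwise (assumption for 6 nt).
    """
    for m in re.finditer(r"([ACGT])\1{3,}", seq.upper()):
        flank = 2 if (m.end() - m.start()) < 6 else 3
        if m.start() - flank <= pos < m.end() + flank:
            return True
    return False


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


def strand_biased(ref_fwd, alt_fwd, ref_rev, alt_rev, alpha):
    """Flag a site whose ref/alt read ratio differs between strands at level alpha."""
    return fisher_exact_two_sided(ref_fwd, alt_fwd, ref_rev, alt_rev) < alpha
```

A site would be kept only if `passes_depth` is true and neither `near_homopolymer` nor `strand_biased` flags it; the remaining filters need alignment-level data and are not sketched here.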
Figure 2. Specificity and precision of identifying pSNMs using our pipeline, VarScan 2, and MuTect. (A-B) False positive rate for true reference sites (A) and true non-reference sites (B). Error bars: 95% confidence intervals. (C) Proportion of true pSNMs among all identified sites. The postzygotic mutation rate was set to 4.4 × 10⁻⁷ per base, based on estimates in this study. The X axis denotes the minor allele fraction of the pSNM sites.
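
The quantity in panel (C) depends strongly on the false positive rate because true mosaic sites are so rare. A minimal sketch of that relationship follows; it pools reference and non-reference sites into a single false positive rate, which is a simplification of panels (A-B), and the sensitivity and false positive rate values in the example are hypothetical rather than taken from the paper.

```python
def expected_precision(mutation_rate, sensitivity, false_positive_rate):
    """Expected fraction of called sites that are true pSNMs.

    mutation_rate: per-base probability that a site is truly mosaic.
    sensitivity: probability that a true mosaic site is called.
    false_positive_rate: probability that a non-mosaic site is called (pooled, assumption).
    """
    true_calls = mutation_rate * sensitivity
    false_calls = (1.0 - mutation_rate) * false_positive_rate
    if true_calls + false_calls == 0:
        raise ValueError("no sites would be called with these parameters")
    return true_calls / (true_calls + false_calls)


# With the mutation rate used in the figure and hypothetical caller performance:
strict = expected_precision(4.4e-7, sensitivity=0.5, false_positive_rate=1e-8)
loose = expected_precision(4.4e-7, sensitivity=0.5, false_positive_rate=1e-6)
```

Under these illustrative numbers, a hundredfold increase in false positive rate moves the caller from mostly true calls to mostly false ones, which is why specificity dominates the comparison between tools.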
Figure 3. Identification and validation of pSNMs in the whole-genome sequences from peripheral blood samples of three individuals. (A-C) Pedigree structures of the three participating families. Red arrows point to the three individuals selected for whole-genome sequencing. (D-F) Alternative allele fractions and sequencing depth of the pSNMs identified in the individuals ACC1-II-1 (D), DS1-II-2 (E), and DS2-I-1 (F) using our pipeline. The candidate pSNM sites are shown in red along with the germline sites shown in gray. The sites with extreme depth or allele fraction are not shown. Different shades of red represent mosaic posterior probabilities. (G) Validation by pyrosequencing. The X axis shows the pSNMs identified and validated in the three individuals. pSNM site IDs correspond to Table 2. The Y axis shows the alternative allele fractions of the pSNM sites in the case, unrelated control, and parents, detected by pyrosequencing. The dashed line represents allele fraction of 0.05, which is the detection threshold of pyrosequencing. (H) Copy number abnormalities are ruled out for all but two of the pSNM sites using MLPA. Seventeen sites showed normal copy numbers with normalized signal ratios between 0.7 and 1.3. Two sites with extra DNA copies are marked by asterisks. Error bars represent the SD of three replications of MLPA. (I) Correlation of the minor allele fractions estimated by whole-genome sequencing and pyrosequencing of the validated sites. The sequencing depth is represented by the size of the dots.
Table 2. Validated pSNMs in the three individuals
| Individual | ID | Position | Ref base | Alt base | Whole-genome sequencing: ref reads, n (%) | Whole-genome sequencing: alt reads, n (%) | Pyrosequencing: ref % | Pyrosequencing: alt % | Sanger sequencing: ref clones, n | Sanger sequencing: alt clones, n |
|---|---|---|---|---|---|---|---|---|---|---|
| ACC1-II-1 | B1 | 15:80878576 | C | T | 118 (93%) | 9 (7%) | 91% | 9% | 41 | 4 |
| ACC1-II-1 | B2 | 4:165099831 | C | T | 73 (89%) | 9 (11%) | 90% | 10% | 105 | 5 |
| ACC1-II-1 | B3 | 6:85605629 | A | T | 86 (87%) | 13 (13%) | 90% | 10% | 75 | 11 |
| ACC1-II-1 | B5 | 6:154692299 | G | C | 104 (95%) | 6 (5%) | 87% | 13% | 91 | 8 |
| ACC1-II-1 | B13 | 18:70512197 | T | G | 67 (69%) | 30 (31%) | 64% | 36% | 29 | 15 |
| DS1-II-2 | F1 | 17:52287119 | G | T | 89 (88%) | 12 (12%) | 91% | 9% | 33 | 3 |
| DS1-II-2 | F2 | 19:55529705 | C | T | 93 (93%) | 7 (7%) | 93% | 7% | 167 | 2 |
| DS1-II-2 | F3 | 19:56343191 | T | C | 86 (92%) | 7 (8%) | 96% | 4% | 82 | 4 |
| DS1-II-2 | F6 | 14:71380614 | T | G | 73 (92%) | 6 (8%) | 92% | 8% | 113 | 3 |
| DS1-II-2 | F7 | 16:84073033 | A | C | 44 (83%) | 9 (17%) | 88% | 12% | 36 | 4 |
| DS1-II-2 | F9 | 13:79932615 | C | A | 51 (73%) | 19 (27%) | 77% | 23% | 25 | 9 |
| DS1-II-2 | F10 | 2:166848782 | G | C | 38 (76%) | 12 (24%) | 73% | 27% | 19 | 3 |
| DS2-I-1 | X1 | 1:72008400 | A | G | 93 (90%) | 10 (10%) | 77% | 23% | 20 | 3 |
| DS2-I-1 | X2 | 16:64312186 | C | T | 76 (92%) | 7 (8%) | 88% | 12% | 24 | 5 |
| DS2-I-1 | X3 | X:68611473 | C | A | 47 (90%) | 5 (10%) | 94% | 6% | 166 | 3 |
| DS2-I-1 | X6 | 2:166854673 | G | T | 58 (78%) | 16 (22%) | 78% | 22% | 30 | 8 |
| DS2-I-1 | X8 | 13:89662020 | A | C | 53 (77%) | 16 (23%) | 82% | 18% | 45 | 9 |
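
The percentages in the whole-genome sequencing columns follow directly from the read counts, and at depths of 50 to 130 reads the sampling uncertainty on each estimate is substantial. The sketch below recomputes the alt fraction for four rows of the table and attaches a Wilson score interval; the choice of the Wilson interval and all function names are assumptions made for illustration, not part of the paper's method.

```python
from math import sqrt


def alt_fraction(ref_reads, alt_reads):
    """Alternative allele fraction from whole-genome sequencing read counts."""
    return alt_reads / (ref_reads + alt_reads)


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# (site ID, ref reads, alt reads) copied from the table above.
rows = [("B1", 118, 9), ("B13", 67, 30), ("F10", 38, 12), ("X6", 58, 16)]

for site, ref, alt in rows:
    frac = alt_fraction(ref, alt)
    low, high = wilson_interval(alt, ref + alt)
    print(f"{site}: {frac:.1%} (95% CI {low:.1%} to {high:.1%})")
```

The rounded fractions reproduce the percentages reported in the table (7%, 31%, 24%, and 22% for these four sites), and the interval widths give a rough sense of how closely an orthogonal assay such as pyrosequencing can be expected to agree.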
Figure 4. Characteristics of the validated pSNMs. (A) The mutation spectrum of validated pSNMs. The C→T and C→A mutations accounted for half of the mutations identified at the mosaic sites. (B) Allele fractions of the pSNM sites in different samples within the same individuals. Three pSNMs were detected in both somatic and semen samples. (C) Similarity of the allele fractions of the pSNMs between different samples. The blood and saliva samples showed the highest similarity among the six sample types.
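
The spectrum in panel (A) can be re-derived from the reference and alternative bases of the 17 validated sites. The sketch below assumes the common convention of collapsing each substitution onto its pyrimidine (C or T) reference base, so that, for example, G→T is counted as C→A; whether the authors used exactly this convention is an assumption.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# (ref base, alt base) for the 17 validated sites, in the order listed in the table.
VALIDATED = [
    ("C", "T"), ("C", "T"), ("A", "T"), ("G", "C"), ("T", "G"),   # ACC1-II-1
    ("G", "T"), ("C", "T"), ("T", "C"), ("T", "G"), ("A", "C"),   # DS1-II-2
    ("C", "A"), ("G", "C"),                                        # DS1-II-2 (cont.)
    ("A", "G"), ("C", "T"), ("C", "A"), ("G", "T"), ("A", "C"),   # DS2-I-1
]


def collapse(ref, alt):
    """Express a substitution relative to the pyrimidine strand (assumed convention)."""
    if ref in "CT":
        return f"{ref}>{alt}"
    return f"{COMPLEMENT[ref]}>{COMPLEMENT[alt]}"


spectrum = Counter(collapse(ref, alt) for ref, alt in VALIDATED)
c_to_t_or_a = spectrum["C>T"] + spectrum["C>A"]
```

Under this convention, C→T and C→A each occur four times, so together they make up 8 of the 17 sites, consistent with the "half" stated in the caption.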
Figure 5. pSNMs detected in DS1-II-2 and DS2-I-1 in the gene SCN1A and transmitted to their respective child with Dravet syndrome as a heterozygous mutation. (A) The two non-synonymous mutations are highlighted by red arrows on the transmembrane structure of the sodium channel encoded by SCN1A. These mutations alter residues located at the ends of the loop structures in domains III and IV, adjacent to previously known pathogenic mutations in Dravet syndrome, which are shown here as small circles with different colors representing different mutation types. (B) The parent-to-offspring transmission model is illustrated for the c.5003C→G pSNM. In the mother, the mutant allele generated by postzygotic mutation is present in a proportion of the cell population and identified by our pipeline as a pSNM. The mosaicism apparently affected germ cells, and thus the offspring had a chance to inherit the mutation during gametogenesis and fertilization, leading to the heterozygous genotype.