Søren Besenbacher, Siyang Liu, José M G Izarzugaza, Jakob Grove, Kirstine Belling, Jette Bork-Jensen, Shujia Huang, Thomas D Als, Shengting Li, Rachita Yadav, Arcadio Rubio-García, Francesco Lescai, Ditte Demontis, Junhua Rao, Weijian Ye, Thomas Mailund, Rune M Friborg, Christian N S Pedersen, Ruiqi Xu, Jihua Sun, Hao Liu, Ou Wang, Xiaofang Cheng, David Flores, Emil Rydza, Kristoffer Rapacki, John Damm Sørensen, Piotr Chmura, David Westergaard, Piotr Dworzynski, Thorkild I A Sørensen, Ole Lund, Torben Hansen, Xun Xu, Ning Li, Lars Bolund, Oluf Pedersen, Hans Eiberg, Anders Krogh, Anders D Børglum, Søren Brunak, Karsten Kristiansen, Mikkel H Schierup, Jun Wang, Ramneek Gupta, Palle Villesen, Simon Rasmussen.
Abstract
Building a population-specific catalogue of single nucleotide variants (SNVs), indels and structural variants (SVs) with frequencies, termed a national pan-genome, is critical for further advancing clinical and public health genetics in large cohorts. Here we report a Danish pan-genome obtained from sequencing 10 trios to high depth (50×). We report 536k novel SNVs and 283k novel short indels from mapping approaches and develop a population-wide de novo assembly approach to identify 132k novel indels larger than 10 nucleotides with low false discovery rates. We identify a higher proportion of indels and SVs than previous efforts, showing the merits of high coverage and de novo assembly approaches. In addition, we use trio information to identify de novo mutations and use a probabilistic method to provide direct estimates of 1.27e-8 and 1.5e-9 per nucleotide per generation for SNVs and indels, respectively.
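To make the units of "per nucleotide per generation" concrete, the following is a minimal sketch of a naive count-based estimator. It is not the probabilistic method the abstract refers to; the function names and the `callable_sites` parameter (the number of haploid positions at which a de novo event could have been detected) are assumptions introduced for illustration.

```python
def naive_mutation_rate(n_de_novo, n_trios, callable_sites):
    """Naive per-nucleotide, per-generation rate from de novo counts.

    Assumption: every child contributes two haploid genomes of
    `callable_sites` positions each, hence the factor 2.
    """
    if n_trios <= 0 or callable_sites <= 0:
        raise ValueError("n_trios and callable_sites must be positive")
    return n_de_novo / (2 * n_trios * callable_sites)


def expected_de_novo_per_child(rate, callable_sites):
    """Expected number of de novo events in one child at a given rate."""
    return rate * 2 * callable_sites
```

Read the other way round, `expected_de_novo_per_child(1.27e-8, callable_sites)` gives the number of germline SNVs one would expect to observe per child for whatever callable genome size is assumed.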
Year: 2015 PMID: 25597990 PMCID: PMC4309431 DOI: 10.1038/ncomms6969
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Figure 1: Allele frequencies and loss of function (LOF) mutations.
(a) Derived allele frequencies of bi-allelic known (n=6.69M, blue) and novel (n=415k, orange) SNVs with genotype information in all 20 parents. Fixed variants are excluded. (b) Folded minor allele frequencies of known deletions (n=510k, solid blue), known insertions (n=383k, dashed turquoise), novel deletions (n=136k, solid red) and novel insertions (n=126k, dashed orange) with regard to the reference genome. Only bi-allelic and non-fixed sites with genotype information in the 20 parents are included (for derived allele frequencies, see Supplementary Fig. 1). (c) Size distribution of bi-allelic and non-fixed indels (n=1.19M) and of indels in coding regions only (n=1,392, inset). Legend as in b; in the inset, coding indels are purple and non-coding indels are green. (d) Estimated number of LOF variants for each parent (n=20). In total, 10.6% of the mutations were in olfactory genes and 4.1% in zinc finger proteins. Stop gains (magenta), splice donor (blue), splice acceptor (turquoise) and indel frameshift (orange).
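The difference between the derived allele frequency in panel a and the folded minor allele frequency in panel b can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and it assumes diploid autosomal sites, so that 20 parents correspond to 40 chromosomes.

```python
def is_segregating(alt_count, n_chromosomes):
    """True for non-fixed sites: the alternative allele is neither absent nor fixed."""
    return 0 < alt_count < n_chromosomes


def folded_maf(alt_count, n_chromosomes):
    """Folded (minor) allele frequency: frequency of the rarer allele,
    regardless of which allele is in the reference genome."""
    freq = alt_count / n_chromosomes
    return min(freq, 1 - freq)


def derived_allele_frequency(alt_count, n_chromosomes, ancestral_is_ref):
    """Derived allele frequency, which requires knowing the ancestral state.

    If the reference allele is ancestral, the alternative allele is derived;
    otherwise the reference allele is the derived one.
    """
    freq = alt_count / n_chromosomes
    return freq if ancestral_is_ref else 1 - freq
```

For example, a site with 30 alternative alleles among 40 chromosomes has a folded frequency of 0.25, whereas its derived allele frequency is 0.75 or 0.25 depending on which allele is ancestral.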
Sanger validation of SNVs, short indels and de novo variants.
| Variant set | FDR | Number assayed |
| --- | --- | --- |
| SNVs, overall | 0.02 | 46 |
| SNVs, novel | 0.05 | 21 |
| SNVs, loss of function | 0.00 | 25 |
| Indels, overall | 0.15 | 26 |
| Indels, repeat | 0.27 | 15 |
| Indels, non-repeat | 0.00 | 11 |
| Indels, novel | 0.18 | 11 |
| Indels, loss of function | 0.13 | 15 |
| De novo SNVs | 0.04 | 24 |
| De novo indels | 0.11 | 19 |
FDR, false discovery rate; GATK, Genome Analysis Toolkit; SNV, single nucleotide variation.
Sanger sequencing experiments were used to assay the FDR for 21 novel, 25 loss of function and 24 de novo SNVs. For indels a total of 11 novel, 15 loss of function and 19 de novo indels were assayed.
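The FDR values in the table are fractions of assayed variants that failed validation, and an overall value pools the strata rather than averaging their rates. A minimal sketch follows; the function names are hypothetical, and the failure counts used in the usage note are illustrative values chosen to be consistent with the rounded FDRs, not counts reported in the source.

```python
def false_discovery_rate(n_failed, n_assayed):
    """Fraction of assayed calls that failed validation; None if nothing was assayed."""
    if n_assayed == 0:
        return None  # corresponds to an "NA" entry
    return n_failed / n_assayed


def pooled_fdr(strata):
    """Pool (n_failed, n_assayed) pairs across strata into one FDR."""
    total_failed = sum(failed for failed, _ in strata)
    total_assayed = sum(assayed for _, assayed in strata)
    return false_discovery_rate(total_failed, total_assayed)
```

With one hypothetical failure among 21 novel SNVs and none among 25 loss-of-function SNVs, `pooled_fdr([(1, 21), (0, 25)])` is 1/46, which rounds to 0.02.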
Figure 2: De novo events in the trios.
(a) Allele balance of detected de novo SNVs (n=730). Variants with low allele balance (<0.3) are considered to be somatic mutations, while variants with high allele balance (>0.3) are considered to be germline mutations. (b) Mutational context of somatic (n=222) and germline (n=508) de novo SNVs, assuming that there are no strand differences (that is, G->T mutations are considered equal to C->A mutations). Both somatic and germline mutations follow the same pattern of increased frequency of transitions versus transversions and an extremely high transition rate in CpG sites. Orange: CpG mutations; turquoise: mutations at A or T sites; magenta: non-CpG mutations at C or G sites. Error bars represent s.e.m. (c) Germline SNVs increase significantly with paternal age. The blue line is a linear fit of the germline SNV mutation rate to the age of the father at the child’s birth, and the error bars represent s.e.m. (d) Allele balance of detected de novo indels (n=121). (e,f) The indel length distribution indicates that short deletions are more common than short insertions in both germline (n=70) (e) and somatic tissue (n=51) (f). (g) The germline indel rate shows no compelling correlation with paternal age. The blue line is a linear fit of the germline indel mutation rate to the age of the father at the child’s birth, and the error bars represent s.e.m.
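The allele-balance rule in panels a and d can be sketched as a simple threshold classifier. This is an illustration under stated assumptions, not the authors' pipeline: allele balance is taken here to be the fraction of reads supporting the de novo allele in the child, the function names are hypothetical, and a value of exactly 0.3 (which the caption leaves unspecified) is assigned to the germline class.

```python
def allele_balance(alt_reads, total_reads):
    """Fraction of reads at a site that support the alternative (de novo) allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def classify_de_novo(alt_reads, total_reads, threshold=0.3):
    """Label a de novo call as 'somatic' (low balance) or 'germline' (high balance).

    Assumption: a balance exactly equal to the threshold counts as germline.
    """
    balance = allele_balance(alt_reads, total_reads)
    return "somatic" if balance < threshold else "germline"
```

A heterozygous germline variant is expected to have a balance near 0.5, so a call supported by 24 of 50 reads is labelled germline, while one supported by 5 of 50 reads is labelled somatic.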
Figure 3: Structural variants and novel sequences identified in the de novo assemblies of 10 trios.
(a) Length of the variants present in the individual assemblies (n=30); the total length is given by the coloured numbers. The lower and upper hinges of the boxes correspond to the 25th and 75th percentiles, and the whiskers extend 1.5 × the inter-quartile range (IQR) from the hinges. See Supplementary Fig. 10 for definitions of different types of structural variants. (b) Same as a, but showing the count of variants; individual counts are shown as box plots and total counts by coloured numbers. (c) Length distribution and novelty of the variants (n=232k, 50% reciprocal overlap). The box plots indicate the number of variants per individual (n=30) in a given length range; see box plot definition in a. Red dashed line: Alu peak at 300–400 bp. Orange dashed line: LINE peak at 6–7 kbp. (d) Variant mechanism. The y-axis indicates the proportion of variants annotated with different mechanisms corresponding to the length range in c. NAHR, non-allelic homologous recombination (green); NHR, non-homologous rearrangement (yellow); TEI, transposable element insertion (blue); unknown (white); VNTR, variable number of tandem repeats (magenta).
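The "50% reciprocal overlap" criterion mentioned for panel c is commonly understood to mean that the shared region covers at least half of each of the two variants being compared. A minimal sketch under that reading follows; the function name and the half-open coordinate convention are assumptions made for illustration.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end, min_fraction=0.5):
    """True if intervals [a_start, a_end) and [b_start, b_end) overlap by at
    least `min_fraction` of the length of BOTH intervals."""
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return False  # disjoint, touching, or empty intervals
    a_length = a_end - a_start
    b_length = b_end - b_start
    return overlap >= min_fraction * a_length and overlap >= min_fraction * b_length
```

The requirement is symmetric by design: a short variant fully contained in a much longer one does not match, because the overlap covers only a small fraction of the longer variant.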
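Sanger validation of novel SVs called by SoapAsmVar.
| Stratum | FDR | Number assayed |
| --- | --- | --- |
| Novel SVs (all) | 0.07 | 68 |
| Length 50–100 bp | 0.10 | 39 |
| Length 101–300 bp | 0.00 | 9 |
| Length 301–500 bp | 0.10 | 10 |
| Length 501–1,000 bp | 0.00 | 10 |
| Type: DEL | 0.17 | 12 |
| Type: INS | 0.05 | 56 |
| Mechanism: unknown | 0.04 | 25 |
| Mechanism: NAHR | 0.22 | 9 |
| Mechanism: NHR | 0.06 | 33 |
| Mechanism: TEI | 0.00 | 1 |
| Mechanism: VNTR | NA | 0 |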
DEL, deletion; FDR, false discovery rate; GATK, Genome Analysis Toolkit; INS, insertion; NA, not applicable; NAHR, non-allelic homologous recombination; NHR, non-homologous rearrangement; SV, structural variant; TEI, transposable element insertion; VNTR, variable number of tandem repeats.
In total, 68 novel variants were subjected to validation, and the FDR was stratified by length, type of indel and formation mechanism.
Figure 4: Number of novel variants per sample.
Number of novel variants identified by adding further unrelated individuals (n=20). The visualized data are the average of 1,000 random samples of the individual order. Blue, SNVs; magenta, SVs >50 bp; orange, short indels.
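The averaging over random individual orders described in this caption can be sketched as a permutation-based accumulation curve. This is a minimal illustration, not the authors' code: the function name is hypothetical, and each individual is assumed to be represented as a set of hashable novel-variant identifiers.

```python
import random


def novel_variant_curve(variants_per_individual, n_orders=1000, seed=0):
    """Average number of previously unseen variants contributed by the
    1st, 2nd, ..., nth individual, over `n_orders` random orderings.

    `variants_per_individual` is a list of sets of variant identifiers.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(variants_per_individual)
    totals = [0.0] * n
    order = list(range(n))
    for _ in range(n_orders):
        rng.shuffle(order)
        seen = set()
        for position, individual in enumerate(order):
            new_variants = variants_per_individual[individual] - seen
            totals[position] += len(new_variants)
            seen |= new_variants
    return [total / n_orders for total in totals]
```

Averaging over many orderings removes the dependence on which individual happens to be added first, so the curve reflects only how quickly new individuals stop contributing unseen variants.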