Hákon Jónsson1, Patrick Sulem1, Birte Kehr1, Snaedis Kristmundsdottir1, Florian Zink1, Eirikur Hjartarson1, Marteinn T Hardarson1, Kristjan E Hjorleifsson1, Hannes P Eggertsson1, Sigurjon Axel Gudjonsson1, Lucas D Ward1, Gudny A Arnadottir1, Einar A Helgason1, Hannes Helgason1, Arnaldur Gylfason1, Adalbjorg Jonasdottir1, Aslaug Jonasdottir1, Thorunn Rafnar1, Soren Besenbacher2,3, Michael L Frigge1, Simon N Stacey1, Olafur Th Magnusson1, Unnur Thorsteinsdottir1,4, Gisli Masson1, Augustine Kong1,5, Bjarni V Halldorsson1,6, Agnar Helgason1,7, Daniel F Gudbjartsson1,5, Kari Stefansson1,4.
Abstract
Understanding of sequence diversity is the cornerstone of analysis of genetic disorders, population genetics, and evolutionary biology. Here, we present an update of our sequencing set to 15,220 Icelanders, whom we sequenced to an average genome-wide coverage of 34X. We identified 39,020,168 autosomal variants passing GATK filters: 31,079,378 SNPs and 7,940,790 indels. Calling de novo mutations (DNMs) is a formidable challenge given the high false positive rate in sequencing datasets relative to the mutation rate. Here we addressed this issue by using segregation of alleles in three-generation families. Using this transmission assay, we controlled the false positive rate and identified 108,778 high-quality DNMs. Furthermore, we used our extended family structure and read pair tracing of DNMs to a panel of phased SNPs to determine the parent of origin of 42,961 DNMs.
Year: 2017 PMID: 28933420 PMCID: PMC5607473 DOI: 10.1038/sdata.2017.115
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Summary of the genotyping chips used for the individuals in the LRP panel.
| The overlap column corresponds to the numbers of genotyping chips used for individuals that were present in the LRP set of Gudbjartsson | ||||
|---|---|---|---|---|
| OmniExpress | HumanOmniExpress | 725,095 | 49,482 | 35,847 |
| OmniExpress24 | HumanOmniExpress-24 | 714,758 | 33,894 | 184 |
| HumanHap 300 | Illumina HumanHap 317 K SNP Chip | 317,870 | 22,429 | 22,394 |
| HumanHapCNV 370 | Merge of HumanHap300 and HumanCNV chips | 371,900 | 14,138 | 14,085 |
| Human Omni1 | 1 M SNP Chip redesigned, 500 K diff versus normal 1 M | 1,137,466 | 10,859 | 10,809 |
| OmniExpPlus | DECODE | 706,534 | 9,725 | 9,708 |
| Omni2.5–8 | HumanOmni2.5–8 | 2,379,855 | 3,948 | 3,941 |
| OmniExpMulti | HumanOmniExp-12v1MultiUse | 730,525 | 2,808 | 2,799 |
| HumanOmni2.5 | HumanOmni2.5 | 2,443,177 | 2,352 | 2,340 |
| HumanHap 1 M | Illumina HumanHap 1 M SNP Chip | 1,136,004 | 1,267 | 1,261 |
| Omni5 | HumanOmni5-4v1 | 4,301,332 | 661 | 658 |
| HumanHap 610 | Illumina Human610-Quad v1 SNP Chip | 620,901 | 646 | 640 |
| Omni2.5Multi | HumanOmni2.5-4v1-Multi_D | 2,443,177 | 400 | 399 |
| Human660W | Human660W | 655,214 | 22 | 1 |
The pairwise co-occurrences of sample preparation and flow-cell type.
| This table is based on the subset of individuals sequenced and used for genotyping by sequencing in this study. Note that the same individual can appear multiple times if sequenced more than once. The sample preparation methods are described in the ‘Preparation of samples for whole genome sequencing’ section. This table is reproduced from Supplementary Table 18 from Jónsson | ||||||
|---|---|---|---|---|---|---|
| TruSeq DNA | 6,603 | 452 | 107 | 0 | 0 | 0 |
| TruSeq Nano | 0 | 5,310 | 1,054 | 124 | 0 | 0 |
| TruSeq PCR-Free | 0 | 0 | 6 | 0 | 21 | 1 |
| Unknown | 0 | 0 | 4 | 1 | 0 | 0 |
The co-occurrences of flowcell and machine type for the BM-BAM files.
| The rows and columns correspond to the machine and flowcell types, respectively. | ||||||
|---|---|---|---|---|---|---|
| GAIIx | 8,309 | 0 | 0 | 0 | 0 | 0 |
| HiSeq | 0 | 1,115 | 4,946 | 136 | 0 | 0 |
| HiSeq X | 0 | 0 | 0 | 0 | 6,606 | 6,705 |
The frequencies of flowcell and machine combinations for the AM-BAM files
| HiSeq X | 12,091 | HiSeq X | 6,439 |
| GAIIx-HiSeq | 1,228 | HiSeq X2 | 5,648 |
| HiSeq | 719 | GAIIx-HiSeq 2000 | 997 |
| GAIIx-HiSeq-HiSeq X | 473 | HiSeq 2000 | 599 |
| GAIIx-HiSeq X | 432 | GAIIx-HiSeq X2 | 432 |
| GAIIx | 253 | GAIIx-HiSeq-HiSeq X2 | 271 |
| HiSeq-HiSeq X | 24 | GAIIx | 253 |
| | | GAIIx-HiSeq 2000-HiSeq X2 | 202 |
| | | GAIIx-HiSeq | 139 |
| | | HiSeq 2000-HiSeq 2500 | 94 |
| | | GAIIx-HiSeq-HiSeq 2000 | 92 |
| | | HiSeq 2500 | 23 |
| | | HiSeq 2000-HiSeq X2 | 10 |
| | | HiSeq 2000-HiSeq X | 8 |
| | | HiSeq-HiSeq X2 | 6 |
| | | HiSeq X-HiSeq X2 | 4 |
| | | HiSeq | 2 |
| | | HiSeq-HiSeq 2000 | 1 |
Summary of the variants.
| The columns GATK-P, Phase-P (>0.8) and Imputation-P (>0.8) correspond to the number of variants passing the respective filters. Multiallelic variants were dichotomized. | |||||
|---|---|---|---|---|---|
| Indel | Biallelic | 5,962,773 | 3,510,962 | 4,009,176 | 3,235,133 |
| Indel | Non-biallelic | 7,085,563 | 4,429,828 | 5,462,957 | 3,619,729 |
| SNP | Biallelic | 69,663,011 | 30,518,223 | 35,631,803 | 29,319,382 |
| SNP | Non-biallelic | 3,406,010 | 561,155 | 1,679,332 | 953,751 |
| Total | Biallelic | 75,625,784 | 34,029,185 | 39,640,979 | 32,554,515 |
| Total | Non-biallelic | 10,491,573 | 4,990,983 | 7,142,289 | 4,573,480 |
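As a consistency check, the second numeric column of the table above appears to hold the GATK-P counts, because its entries add up to the SNP, indel, and total figures quoted in the abstract. The sketch below is a minimal illustration under that assumption (the column headers did not survive in this copy, and the names `gatk_pass` and `total_for` are hypothetical):

```python
# Assumed GATK-P counts, read from the second numeric column of the table.
# The column assignment is an assumption; the header row is missing here.
gatk_pass = {
    ("Indel", "Biallelic"): 3_510_962,
    ("Indel", "Non-biallelic"): 4_429_828,
    ("SNP", "Biallelic"): 30_518_223,
    ("SNP", "Non-biallelic"): 561_155,
}


def total_for(variant_type):
    """Sum the biallelic and non-biallelic counts for one variant type."""
    return sum(n for (vtype, _), n in gatk_pass.items() if vtype == variant_type)


indels = total_for("Indel")
snps = total_for("SNP")
total = indels + snps

print(indels, snps, total)  # 7940790 31079378 39020168
```

These sums match the 7,940,790 indels, 31,079,378 SNPs, and 39,020,168 autosomal variants reported in the abstract.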
Figure 1. Schematic overview of the DNM characterization.
Figure 2. The GAM model predicted response for all DNM candidates.
The red line corresponds to the 0.8 GAM response requirement for the high quality DNMs.
Summary of the phased DNMs.
| Three gen. imputation | 15,746 | 0.145 |
| Read pair tracing | 31,834 | 0.293 |
| Phased by both methods | 4,566 | 0.0420 |
| Phased by both methods, discordant (1.16%) | 53 | 0.000487 |
| Consensus approach | 42,961 | 0.395 |
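The two numeric columns above are unlabeled in this copy. The arithmetic below suggests, as an assumption rather than a statement from the authors, that the first is a DNM count and the second is that count divided by the 108,778 high-quality DNMs. It also shows that the consensus count is reproduced by taking the union of the two methods and dropping the discordant calls. Variable names are illustrative only.

```python
TOTAL_DNMS = 108_778   # high-quality DNMs reported in the abstract

three_gen = 15_746     # phased by three-generation imputation
read_pair = 31_834     # phased by read pair tracing
both = 4_566           # phased by both methods
discordant = 53        # phased by both methods, with conflicting results

# Union of the two methods (inclusion-exclusion), minus the discordant calls.
consensus = three_gen + read_pair - both - discordant

print(consensus)                              # 42961
print(round(three_gen / TOTAL_DNMS, 3))       # 0.145
print(round(read_pair / TOTAL_DNMS, 3))       # 0.293
print(round(consensus / TOTAL_DNMS, 3))       # 0.395
print(round(100 * discordant / both, 2))      # 1.16 (percent of doubly phased DNMs)
```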
Figure 3. The fraction of discordant DNMs between MZ twins.
Ninety-one monozygotic twin pairs were used for the discordance calculation. The discordance fraction was calculated as the fraction of the proband’s high-quality DNMs not found in the MZ twin.
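The discordance fraction defined above can be sketched in a few lines of Python. This is only an illustration of the stated definition, not the authors' pipeline: the function name, the variant representation as (chromosome, position, reference, alternate) tuples, and the example data are all hypothetical.

```python
def discordance_fraction(proband_dnms, twin_variants):
    """Fraction of the proband's high-quality DNMs not found in the MZ twin."""
    proband = set(proband_dnms)
    if not proband:
        raise ValueError("proband has no high-quality DNMs")
    missing = proband - set(twin_variants)
    return len(missing) / len(proband)


# Hypothetical example: four proband DNMs, three of them also seen in the twin.
proband_dnms = [
    ("chr1", 1_000, "A", "G"),
    ("chr2", 2_000, "C", "T"),
    ("chr3", 3_000, "G", "A"),
    ("chr4", 4_000, "T", "C"),
]
twin_variants = proband_dnms[:3]

fraction = discordance_fraction(proband_dnms, twin_variants)
print(fraction)  # 0.25
```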