Hannes P Eggertsson, Snaedis Kristmundsdottir, Doruk Beyter, Hakon Jonsson, Astros Skuladottir, Marteinn T Hardarson, Daniel F Gudbjartsson, Kari Stefansson, Bjarni V Halldorsson, Pall Melsted.
Abstract
Analysis of sequence diversity in the human genome is fundamental for genetic studies. Structural variants (SVs) are frequently omitted in sequence analysis studies, although each has a relatively large impact on the genome. Here, we present GraphTyper2, which uses pangenome graphs to genotype SVs and small variants using short reads. Comparison to the syndip benchmark dataset shows that our SV genotyping is sensitive, and variant segregation in families demonstrates the accuracy of our approach. We demonstrate that incorporating public assembly data into our pipeline greatly improves sensitivity, particularly for large insertions. We validate 6,812 SVs on average per genome using long-read data of 41 Icelanders. We show that GraphTyper2 can simultaneously genotype tens of thousands of whole genomes by characterizing 60 million small variants and half a million SVs in 49,962 Icelanders, including 80 thousand SVs with high confidence.
Year: 2019 PMID: 31776332 PMCID: PMC6881350 DOI: 10.1038/s41467-019-13341-9
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1 Overview of data structure and workflow. a Example structural variants and their encoding in an acyclic graph structure. b Workflow for constructing a GraphTyper graph with SNPs, indels and SVs. SVs are detected from each sample independently and then merged across all the samples, such that SV sites of the same type and similar position and size are reported only once. SNPs and indels that are given as input into the graph construction can be detected using GraphTyper or obtained from a database.
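The cross-sample merging step described in Fig. 1b can be illustrated with a minimal sketch. This is not GraphTyper2's actual merging procedure, which is not specified here. The `SV` record, the greedy strategy, and the thresholds `max_pos_diff` and `max_size_ratio` are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SV:
    chrom: str    # chromosome name, e.g. "chr20"
    sv_type: str  # e.g. "DEL" or "INS"
    pos: int      # start position
    size: int     # SV length in bp


def similar(a, b, max_pos_diff, max_size_ratio):
    """True if two SV sites share a type and have similar position and size."""
    if (a.chrom, a.sv_type) != (b.chrom, b.sv_type):
        return False
    if abs(a.pos - b.pos) > max_pos_diff:
        return False
    small, large = sorted((a.size, b.size))
    return large <= max_size_ratio * small


def merge_sv_sites(calls, max_pos_diff=50, max_size_ratio=1.5):
    """Greedy merge: keep one representative per group of similar SV sites.

    Both thresholds are hypothetical defaults, not values from the paper.
    """
    merged = []
    ordered = sorted(calls, key=lambda s: (s.chrom, s.sv_type, s.pos, s.size))
    for sv in ordered:
        if not any(similar(sv, rep, max_pos_diff, max_size_ratio) for rep in merged):
            merged.append(sv)
    return merged


# Calls pooled from several samples (made-up coordinates).
calls = [
    SV("chr20", "DEL", 1000, 300),
    SV("chr20", "DEL", 1010, 310),  # near-duplicate of the first call
    SV("chr20", "INS", 1005, 300),  # same locus, different type
    SV("chr20", "DEL", 5000, 300),
]
sites = merge_sv_sites(calls)
```

In this example the two nearby deletions collapse into one site, while the insertion and the distant deletion are kept as separate sites.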
Fig. 2 Comparisons to SVs in the syndip dataset. The breakpoint precision threshold is the maximum number of base pairs we allowed at both breakpoints for an SV to be considered recalled. a Deletion recall comparison between SV genotyping methods. b Deletion false discovery rate comparison. The Manta and Manta + GraphTyper lines are overlapping. c Insertion recall comparison. Delly and smoove were not evaluated since they are not designed to discover all types of insertions. The Manta and Manta + GraphTyper lines are overlapping. d Insertion false discovery rate comparison. e Deletion recall by deletion size with a breakpoint precision threshold of 50 bp. f Insertion recall by insertion size with a breakpoint precision threshold of 50 bp.
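One way to read the breakpoint precision threshold in Fig. 2 is sketched below. It assumes SVs of a single type on a single chromosome, each represented as a `(start, end)` pair, and counts a truth SV as recalled when some call lies within the threshold at both breakpoints. The function names and the exact matching rule are assumptions, not the evaluation code used for the figure.

```python
def matches(sv, candidates, threshold):
    """True if any candidate is within `threshold` bp of `sv` at both breakpoints."""
    start, end = sv
    return any(
        abs(start - c_start) <= threshold and abs(end - c_end) <= threshold
        for c_start, c_end in candidates
    )


def recall(truth, calls, threshold):
    """Fraction of truth SVs matched by at least one call."""
    if not truth:
        return 0.0
    return sum(matches(t, calls, threshold) for t in truth) / len(truth)


def false_discovery_rate(truth, calls, threshold):
    """Fraction of calls that match no truth SV."""
    if not calls:
        return 0.0
    return sum(not matches(c, truth, threshold) for c in calls) / len(calls)


# Made-up deletion coordinates as (start, end) pairs.
truth = [(100, 400), (1000, 1500)]
calls = [(110, 395), (1000, 1600), (9000, 9100)]

recall_50 = recall(truth, calls, threshold=50)
fdr_50 = false_discovery_rate(truth, calls, threshold=50)
recall_100 = recall(truth, calls, threshold=100)
```

Loosening the threshold from 50 bp to 100 bp lets the second call match the second truth deletion, which is why recall curves of this kind rise as the threshold grows.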
Fig. 3 High-confidence SV genotypes in four Icelandic families. a Family tree of the four families. Shown are genotypes of a 313 bp deletion starting at chr20:19,080,772 (GRCh38). b Frequency distribution of SVs called on chromosome 20. There are 112 bins, the number of chromosomes in the callset. c The allele transmission rate of an SV from parent to offspring. For germline variants, the distribution is expected to be symmetric around 50%.
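The transmission rate in Fig. 3c can be approximated with a short sketch. It assumes genotypes are coded as alternative-allele counts (0, 1 or 2) and considers only trios where exactly one parent is heterozygous and the other is homozygous for the reference allele, so that transmission is unambiguous. This filtering rule and the function name are assumptions and may differ from how the figure was computed.

```python
def transmission_rate(trios):
    """Estimate how often a heterozygous parent passes the SV allele on.

    `trios` is an iterable of (parent1, parent2, child) genotypes, each an
    alternative-allele count in {0, 1, 2}. Returns None when no trio is
    informative.
    """
    informative = 0
    transmitted = 0
    for parent1, parent2, child in trios:
        # Informative only if one parent is 0/1 and the other is 0/0.
        if sorted((parent1, parent2)) != [0, 1]:
            continue
        # A child genotype of 2 is Mendelian-inconsistent here; skip it.
        if child not in (0, 1):
            continue
        informative += 1
        transmitted += child
    return transmitted / informative if informative else None


# Made-up genotypes for one SV across several trios.
trios = [
    (0, 1, 1),  # transmitted
    (1, 0, 0),  # not transmitted
    (0, 1, 1),  # transmitted
    (1, 1, 1),  # both parents heterozygous: not informative
    (0, 0, 0),  # no carrier parent: not informative
    (0, 1, 2),  # Mendelian-inconsistent: skipped
]
rate = transmission_rate(trios)
```

For a true germline variant, rates computed this way over many trios should scatter around 0.5, which is the symmetry the caption refers to.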
Fig. 4 Overlap of previously published SV datasets and SVs we find in Iceland. a Fraction of SVs in an external SV dataset that are also found in Iceland. b Distribution of the number of insertions, deletions, and breakends of an external dataset that are found in Iceland. The maximum distance threshold used was 50 bp.