Sai Chen1, Peter Krusche2,3, Egor Dolzhenko1, Rachel M Sherman4, Roman Petrovski2, Felix Schlesinger1, Melanie Kirsche4, David R Bentley2, Michael C Schatz4,5, Fritz J Sedlazeck6, Michael A Eberle7.
Abstract
Accurate detection and genotyping of structural variations (SVs) from short-read data is a long-standing area of development in genomics research and clinical sequencing pipelines. We introduce Paragraph, an accurate genotyper that models SVs using sequence graphs and SV annotations. We demonstrate the accuracy of Paragraph on whole-genome sequence data from three samples using long-read SV calls as the truth set, and then apply Paragraph at scale to a cohort of 100 short-read sequenced samples of diverse ancestry. Our analysis shows that Paragraph has better accuracy than other existing genotypers and can be applied to population-scale studies.
Keywords: Population studies; Sequence graphs; Structural variation; Targeted variant calling
Year: 2019 PMID: 31856913 PMCID: PMC6921448 DOI: 10.1186/s13059-019-1909-7
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 Overview of the SV genotyping workflow implemented in Paragraph. The illustration shows the process to genotype a blockwise sequence swap. Starting from an entry in a VCF file that specifies the SV breakpoints and alternative allele sequences, Paragraph constructs a sequence graph containing all alleles as paths of the graph. Colored rectangles labeled FLANK, ALTERNATIVE, and REFERENCE are nodes with actual sequences, and solid arrows connecting these nodes are edges of the graph. All reads from the original, linear alignments that aligned near or across the breakpoints are then realigned to the constructed graph. Based on alignments of these reads, the SV is genotyped as described in the “Methods” section.
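The graph described in the caption can be sketched in a few lines. This is a minimal illustration, not Paragraph's actual data model or input format: the class `SequenceGraph`, the function `build_swap_graph`, and the node and path names are all hypothetical, and the sequences in the usage line are made up.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceGraph:
    """Toy sequence graph: nodes carry sequence, paths spell out alleles."""
    nodes: dict = field(default_factory=dict)  # node name -> sequence
    edges: set = field(default_factory=set)    # (from_node, to_node) pairs
    paths: dict = field(default_factory=dict)  # allele name -> list of node names

    def allele_sequence(self, allele):
        """Concatenate node sequences along the path of one allele."""
        return "".join(self.nodes[name] for name in self.paths[allele])


def build_swap_graph(left_flank, ref_seq, alt_seq, right_flank):
    """Build a graph for a blockwise sequence swap (hypothetical helper).

    The reference allele runs LEFT_FLANK -> REFERENCE -> RIGHT_FLANK and the
    alternative allele runs LEFT_FLANK -> ALTERNATIVE -> RIGHT_FLANK.
    """
    graph = SequenceGraph()
    graph.nodes = {
        "LEFT_FLANK": left_flank,
        "REFERENCE": ref_seq,
        "ALTERNATIVE": alt_seq,
        "RIGHT_FLANK": right_flank,
    }
    graph.paths = {
        "REF": ["LEFT_FLANK", "REFERENCE", "RIGHT_FLANK"],
        "ALT": ["LEFT_FLANK", "ALTERNATIVE", "RIGHT_FLANK"],
    }
    # Every pair of consecutive nodes on a path becomes a directed edge.
    for path in graph.paths.values():
        graph.edges.update(zip(path, path[1:]))
    return graph


# Made-up sequences, for illustration only.
example = build_swap_graph("ACGT", "TTT", "GGGGG", "CCAA")
```

Reads that support the edge into or out of `REFERENCE` are evidence for the reference allele, and reads that support the edges around `ALTERNATIVE` are evidence for the alternative allele; the genotyping model itself is described in the paper's Methods and is not reproduced here.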
Performance of different genotypers and de novo callers, measured against SVs of 50 bp or longer from our LRGT
| Metric | Paragraph | Delly Genotyper | SVTyper (100+ bp) | Manta | Delly | Lumpy (100+ bp) | Paragraph | Manta |
|---|---|---|---|---|---|---|---|---|
| SV type | Deletion | Deletion | Deletion | Deletion | Deletion | Deletion | Insertion | Insertion |
| #Tested TPs | 16,936 | 16,936 | 11,160 | 16,936 | 16,936 | 11,160 | 21,303 | 21,303 |
| Recall | 0.84 | 0.76 | 0.70 | 0.62 | 0.61 | 0.64 | 0.88 | 0.35 |
| #Tested FPs | 10,778 | 10,778 | 6,960 | – | – | – | 11,307 | – |
| Precision | 0.92 | 0.85 | 0.98 | – | – | – | 0.89 | – |
| F-score | 0.88 | 0.80 | 0.82 | – | – | – | 0.88 | – |
Genotyping/calling was evaluated on short-read data of the three samples sequenced with 150 bp paired-end reads on Illumina platforms. As SVTyper and Lumpy are limited to deletions longer than 100 bp, they have fewer tested SVs than other methods
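The bottom row of the table agrees with the harmonic mean of precision and recall (the F-score). The sketch below recomputes it from the rounded precision and recall values in the table; the function name `f_score` and the dictionary labels are illustrative, and only the methods with both metrics reported are included.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above.
reported = {
    "Paragraph (deletion)": (0.92, 0.84),
    "Delly Genotyper (deletion)": (0.85, 0.76),
    "SVTyper (deletion, 100+ bp)": (0.98, 0.70),
    "Paragraph (insertion)": (0.89, 0.88),
}

for method, (precision, recall) in reported.items():
    print(f"{method}: F-score = {f_score(precision, recall):.2f}")
```

Run as written, this gives 0.88, 0.80, 0.82, and 0.88, matching the tabulated values.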
Fig. 2 Estimated recall of different methods, partitioned by SV length. Recall was estimated on the three samples using LRGT as the truth set. A negative SV length indicates a deletion, and a positive SV length indicates an insertion. Colored lines in a show recall of different methods; solid gray bars in b represent the count of SVs in each size range in LRGT. The center of the plot is empty since SVs must be at least 50 bp in length.
Fig. 3 Demonstration of the impact on recall when tested SVs include errors in their breakpoints. Breakpoint deviations measure the differences in positions between matching deletions in the CLR calls and in LRGT. Paragraph recall was estimated using CLR calls as genotyping input and TPs in LRGT as the ground truth. Breakpoint deviations were binned at 1 bp for deviations less than 18 bp and at 2 bp for deviations larger or equal to 19 bp. Solid bars show the number of deletions in each size range (left axis). Points and the solid line show the recall for each individual size and the overall regression curve (right axis).
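The per-bin recall in this figure can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis code: the helper names `bin_start` and `recall_by_bin` are hypothetical, the input is assumed to be a list of `(breakpoint_deviation_bp, was_genotyped_correctly)` pairs, and the switch from 1 bp to 2 bp bins is assumed to happen at 18 bp, since the caption leaves that exact boundary ambiguous.

```python
from collections import defaultdict


def bin_start(deviation, fine_limit=18):
    """Return the lower edge of the bin that a deviation (in bp) falls into.

    Assumption: 1 bp bins below `fine_limit`, 2 bp bins from `fine_limit` up.
    """
    if deviation < fine_limit:
        return deviation
    return fine_limit + 2 * ((deviation - fine_limit) // 2)


def recall_by_bin(calls, fine_limit=18):
    """Compute recall per deviation bin.

    `calls` is an iterable of (deviation_bp, genotyped) pairs, where
    `genotyped` is True when the true deletion was recovered.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for deviation, genotyped in calls:
        start = bin_start(deviation, fine_limit)
        totals[start] += 1
        if genotyped:
            hits[start] += 1
    return {start: hits[start] / totals[start] for start in sorted(totals)}
```

Keeping the bin boundary as a parameter makes it easy to check how sensitive the curve is to that choice.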
Fig. 4 The impact of TRs on SV recall. a Estimated Paragraph recall from LRGT, partitioned by SV length and grouped by position relative to TRs. b LRGT SV count, partitioned by length and grouped by position relative to TRs.
Fig. 5 Population-scale genotyping and function annotation of LRGT SVs. a The AF distribution of LRGT SVs in the Polaris 100-individual population. b PCA biplot of individuals in the population, based on genotypes of HWE-passing SVs. c The AF distribution of HWE-passing SVs in different functional elements. SV count: 191 in UTRs, 554 in exons, 420 in pseudogenes, 9542 in introns, and 6603 in intergenic regions.
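The caption refers to allele frequency (AF) and to SVs that pass a Hardy-Weinberg equilibrium (HWE) filter without stating how the filter was applied. As one common approach, the sketch below computes AF from diploid genotype counts and a chi-square goodness-of-fit test against HWE expectations. The choice of test and the function names are assumptions for illustration, and no p-value cutoff is implied.

```python
import math


def allele_frequency(n_ref_ref, n_het, n_alt_alt):
    """Frequency of the SV (alternative) allele among 2N chromosomes."""
    n = n_ref_ref + n_het + n_alt_alt
    return (n_het + 2 * n_alt_alt) / (2 * n)


def hwe_chi_square(n_ref_ref, n_het, n_alt_alt):
    """Chi-square test (1 degree of freedom) of genotype counts against HWE.

    Returns (chi2, p_value). Monomorphic sites have no expected variation,
    so they are reported as chi2 = 0.0 and p = 1.0.
    """
    n = n_ref_ref + n_het + n_alt_alt
    alt_af = allele_frequency(n_ref_ref, n_het, n_alt_alt)
    ref_af = 1.0 - alt_af
    observed = (n_ref_ref, n_het, n_alt_alt)
    expected = (n * ref_af ** 2, 2 * n * ref_af * alt_af, n * alt_af ** 2)
    if min(expected) == 0:
        return 0.0, 1.0
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

For a cohort of 100 individuals, genotype counts of 25, 50, and 25 give an AF of 0.5 with no deviation from HWE, while counts of 50, 0, and 50 give the same AF with a strong deviation. For rare variants, an exact test is generally preferred over the chi-square approximation.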