Michael D Linderman, Crystal Paudyal, Musab Shakeel, William Kelley, Ali Bashir, Bruce D Gelb.
Abstract
BACKGROUND: Structural variants (SVs) play a causal role in numerous diseases but are difficult to detect and accurately genotype (determine zygosity) in whole-genome next-generation sequencing data. SV genotypers that assume that the aligned sequencing data uniformly reflect the underlying SV or use existing SV call sets as training data can only partially account for variant and sample-specific biases.
Keywords: next-generation sequencing; structural variants; whole-genome sequencing
Year: 2021 PMID: 34195837 PMCID: PMC8246072 DOI: 10.1093/gigascience/giab046
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Figure 1: NPSV dataflow and example SV evidence. (A) NPSV dataflow showing the matched training and prediction pipelines. For each putative SV and genotype, NPSV generates 1 or more simulated replicates. These simulated data, shown in the schematic as red, blue, and green clusters for homozygous reference (hom. ref.), heterozygous (het.), and homozygous alternate (hom. alt.) genotypes, respectively, are used to train sample- and variant-specific classifiers for predicting the SV genotype. (B) Synthetic training data (colored circles/bars) and actual data (black square/line) for a homozygous alternate 822-bp deletion in HG002. This SV is the deletion of 1 copy of a repeat. As a result of the repetitive genomic context, no fragments were uniquely realigned to the SV's alternate allele and no alternate spanning fragments were identified. The actual data are consistent with the simulated homozygous alternate data and not with a homozygous reference genotype, as might be expected from the absence of alternate allele realignments. This SV is successfully genotyped as homozygous alternate by NPSV when building a variant-specific classifier.
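The idea in Figure 1 of training on simulated replicates for each genotype and then classifying the actual data can be sketched as follows. This is a minimal illustration only: it substitutes a simple nearest-centroid rule for whatever classifier NPSV actually uses, and the two features (relative coverage and alternate-allele read fraction) and all numeric values are hypothetical.

```python
from statistics import mean

def train_centroids(simulated):
    """Average the simulated replicates of each genotype into one centroid.

    simulated: dict mapping genotype label -> list of feature vectors.
    """
    return {
        label: [mean(column) for column in zip(*replicates)]
        for label, replicates in simulated.items()
    }

def predict(centroids, features):
    """Return the genotype whose centroid is closest to the observed features."""
    def squared_distance(center):
        return sum((x - c) ** 2 for x, c in zip(features, center))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

# Hypothetical replicates: [relative coverage, alternate-allele read fraction].
# The alternate-allele fraction stays near zero for every genotype, loosely
# mimicking a repetitive context where no reads realign to the alternate allele.
simulated = {
    "hom. ref.": [[1.00, 0.00], [0.95, 0.02], [1.05, 0.01]],
    "het.": [[0.75, 0.00], [0.80, 0.03], [0.70, 0.02]],
    "hom. alt.": [[0.50, 0.00], [0.55, 0.01], [0.45, 0.02]],
}

centroids = train_centroids(simulated)
observed = [0.52, 0.00]  # hypothetical features from the real sample
print(predict(centroids, observed))
```

Because the training data are simulated for the specific variant, the classifier can separate genotypes even when one kind of evidence (here, alternate-allele reads) is uninformative.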
Figure 2: Genotyping accuracy for HG002 and NA12878 SVs. Top: Genotype concordance and non-reference concordance (presence or absence) for GIAB SVs (including “LongReadHomRef” SVs where “long reads supported homozygous reference for all individuals”) in high-confidence Tier 1 regions, and in the Tier 1 regions and lower-confidence Tier 2 SVs combined. Bottom: Concordance for NA12878 call sets. The NPSV accuracy is the mean of 10 runs. The best concordance is indicated with a black outline. The asterisk shows tools used in the construction of that call set.
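The two accuracy metrics named in this caption can be sketched as below. This is an illustrative reading of the definitions, assuming genotypes are encoded as alternate allele counts (0, 1, 2) and that every SV has both a truth and a predicted genotype; the function names and the toy data are hypothetical.

```python
def concordance(truth, predicted):
    """Fraction of SVs whose predicted genotype exactly matches the truth."""
    pairs = list(zip(truth, predicted))
    return sum(t == p for t, p in pairs) / len(pairs)

def nonref_concordance(truth, predicted):
    """Fraction of SVs where presence/absence of the variant matches.

    A heterozygous call for a homozygous alternate SV still counts as
    concordant here, because both genotypes are non-reference.
    """
    pairs = list(zip(truth, predicted))
    return sum((t > 0) == (p > 0) for t, p in pairs) / len(pairs)

# Toy example: alternate allele counts for five SVs.
truth = [0, 1, 2, 1, 2]
predicted = [0, 1, 1, 1, 0]

print(concordance(truth, predicted))
print(nonref_concordance(truth, predicted))
```

Non-reference concordance is always at least as high as genotype concordance, since any exact genotype match is also a presence/absence match.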
Genotyping accuracy with discovery SVs as the input to SV genotyping and GIAB SVs in Tier 1 regions as the truth set; concordance is calculated for the subset of SVs successfully identified by the discovery tool
| Caller | Type | Discovery recall, % | Caller genotyping: concordance, % | Caller genotyping: non-reference concordance, % | NPSV genotyper: concordance, % | NPSV genotyper: non-reference concordance, % |
|---|---|---|---|---|---|---|
| lumpy | DEL | 30.5 | 82.1 | 87.2 | 88.5 | 92.7 |
| manta | DEL | 67.9 | 90.1 | 91.8 | 92.2 | 93.6 |
| manta | INS | 25.2 | 87.3 | 93.5 | 89.1 | 93.7 |
Mendelian error rate and Mendelian error breakdown for GIAB autosomal SVs in Tier 1 regions
| Tool | DEL: MER, % (proportion) | DEL: Heterozygous | DEL: Homozygous | DEL: Other | INS: MER, % (proportion) | INS: Heterozygous | INS: Homozygous | INS: Other |
|---|---|---|---|---|---|---|---|---|
| npsv (single) | 3.60 (231/6,416) | 99 | 7 | 125 | 4.64 (291/6,269) | 58 | 6 | 227 |
| npsv (variant) | 3.09 (198/6,416) | 111 | 1 | 86 | 5.14 (322/6,269) | 74 | 4 | 244 |
| npsv (hybrid) | 3.21 (206/6,416) | 106 | 1 | 99 | 5.10 (320/6,269) | 65 | 6 | 249 |
| delly | 1.66 (92/5,535) | 50 | 2 | 40 | 1.92 (78/4,059) | 26 | 0 | 52 |
| genomestrip | 2.02 (128/6,337) | 81 | 3 | 44 | | | | |
| graphtyper | 5.53 (353/6,386) | 109 | 16 | 228 | 10.13 (608/6,004) | 86 | 27 | 495 |
| paragraph | 2.76 (175/6,351) | 85 | 2 | 88 | 5.42 (329/6,067) | 91 | 4 | 234 |
| sv2 | 8.80 (536/6,089) | 129 | 47 | 360 | | | | |
| svtyper | 2.28 (145/6,349) | 79 | 6 | 60 | | | | |
| svviz2 (mapq) | 2.46 (158/6,416) | 94 | 3 | 61 | 3.32 (208/6,269) | 75 | 3 | 130 |
NPSV default configuration uses hybrid mode for deletions and single mode for insertions. ME: Mendelian error; MER: Mendelian error rate.
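A Mendelian error rate of the kind reported in the table can be computed roughly as follows. This is a sketch under stated assumptions: genotypes are autosomal, encoded as alternate allele counts (0, 1, 2), and a trio counts as an error when the child's genotype cannot be formed from one allele transmitted by each parent. The function names are hypothetical, and the breakdown into the table's error categories is not reproduced here.

```python
# Alternate allele counts a parent with a given genotype can transmit.
TRANSMITTED = {0: {0}, 1: {0, 1}, 2: {1}}

def is_mendelian_consistent(father, mother, child):
    """True if the child's genotype can arise from one allele per parent."""
    return any(
        a + b == child
        for a in TRANSMITTED[father]
        for b in TRANSMITTED[mother]
    )

def mendelian_error_rate(trios):
    """Return (errors, total, percent) over (father, mother, child) genotypes."""
    errors = sum(not is_mendelian_consistent(*trio) for trio in trios)
    return errors, len(trios), 100 * errors / len(trios)

# Toy trios: the second and fourth violate Mendelian inheritance.
trios = [(0, 1, 1), (0, 0, 1), (1, 1, 2), (2, 2, 1)]
print(mendelian_error_rate(trios))

# The percentage in the table is errors divided by genotyped SVs,
# e.g. 231 of 6,416 for npsv (single) deletions.
print(round(100 * 231 / 6416, 2))
```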
Figure 3: Genotype concordance for GIAB SVs with offset breakpoints. Genotype concordance for GIAB variant-only SVs in Tier 1 regions grouped by the maximum offset between the GIAB breakpoints and the breakpoints for the corresponding SV called with PBSV in PacBio long-read sequencing data. The line shows the concordance when using the PBSV SVs as the input to NPSV running the default genotyping mode (“hybrid” for deletions, “single” for insertions). The background bar chart shows the underlying distribution of offsets. The same analysis for select comparison tools is included in Supplementary Fig. S5.
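The grouping described in this caption can be sketched as below, assuming each SV is represented by (start, end) breakpoints in both call sets and that the maximum offset is the larger of the start and end differences. The record layout, function names, and coordinates are hypothetical and only illustrate the bookkeeping.

```python
from collections import defaultdict

def max_offset(truth_bp, called_bp):
    """Larger of the absolute start and end breakpoint differences."""
    return max(abs(truth_bp[0] - called_bp[0]), abs(truth_bp[1] - called_bp[1]))

def concordance_by_offset(records):
    """Group SVs by maximum breakpoint offset and compute concordance per group.

    records: iterable of (truth_bp, called_bp, truth_gt, predicted_gt).
    """
    totals = defaultdict(lambda: [0, 0])  # offset -> [matches, count]
    for truth_bp, called_bp, truth_gt, predicted_gt in records:
        offset = max_offset(truth_bp, called_bp)
        totals[offset][0] += truth_gt == predicted_gt
        totals[offset][1] += 1
    return {off: m / n for off, (m, n) in sorted(totals.items())}

# Hypothetical records: two SVs with exact breakpoints, two offset by 2 bp.
records = [
    ((1000, 1822), (1000, 1822), 2, 2),
    ((5000, 5300), (5002, 5300), 1, 1),
    ((7000, 7400), (6999, 7402), 1, 0),
    ((9000, 9100), (9000, 9100), 0, 0),
]
print(concordance_by_offset(records))
```

The per-group counts correspond to the background bar chart, and the per-group concordance corresponds to the line.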