Junfu Guo, Chang Shi, Xi Chen, Ou Wang, Ping Liu, Huanming Yang, Xun Xu, Wenwei Zhang, Hongmei Zhu.
Abstract
Co-barcoded reads originating from long DNA fragments (mean length >30 kbp) retain both single-base accuracy and long-range genomic information. We propose a pipeline, stLFRsv, that detects structural variation from co-barcoded reads. stLFRsv identifies abnormally large gaps between co-barcoded reads to detect potential breakpoints and reconstruct complex structural variants (SVs). Haplotype phasing with co-barcoded reads increases the signal-to-noise ratio, and barcode sharing profiles are used to filter out false positives. We integrate the short-read SV caller smoove with stLFRsv to cover smaller variants. The integrated pipeline was evaluated on the well-characterized genome HG002/NA24385 and achieved 74.5% precision and 22.4% recall for deletions. stLFRsv also revealed some large variants that are absent from the benchmark set but were verified by long reads or assembly. For the HG001/NA12878 genome, stLFRsv also achieved the best performance for both resource usage and the detection of large variants. Our work indicates that co-barcoded read technology has the potential to improve genome completeness.
Keywords: human genome; breakpoints; co-barcoded reads; complex variants; structural variation
Year: 2021 PMID: 33815469 PMCID: PMC8012683 DOI: 10.3389/fgene.2021.636239
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
FIGURE 1. Long DNA fragments (colored lines) are reconstructed from read pairs (small solid blocks) that share the same barcode. When aligned to the reference genome, long DNA fragments that span a large structural variation are broken into sub-fragments by large gaps. The blue arrows indicate the directions of genome sequences (big hollow blocks). (A) Deletion. (B) Inversion. (C) Tandem duplication. (D) Insertion.
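The gap-based splitting described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the stLFRsv implementation: the function name `split_into_segments`, the input layout of `(barcode, chrom, pos)` tuples, and the `max_gap` threshold of 20,000 bp are all assumptions chosen for illustration.

```python
from collections import defaultdict


def split_into_segments(reads, max_gap=20_000):
    """Split the reads of each barcode into segments wherever two
    neighbouring reads on the same chromosome are more than max_gap apart.

    reads: iterable of (barcode, chrom, pos) tuples (hypothetical layout).
    Returns {barcode: [(chrom, start, end, n_reads), ...]}.
    """
    # Collect read positions per (barcode, chromosome).
    positions_by_key = defaultdict(list)
    for barcode, chrom, pos in reads:
        positions_by_key[(barcode, chrom)].append(pos)

    segments = defaultdict(list)
    for (barcode, chrom), positions in sorted(positions_by_key.items()):
        positions.sort()
        start = prev = positions[0]
        count = 1
        for pos in positions[1:]:
            if pos - prev > max_gap:
                # Gap too large: close the current segment, open a new one.
                segments[barcode].append((chrom, start, prev, count))
                start, count = pos, 0
            prev = pos
            count += 1
        segments[barcode].append((chrom, start, prev, count))
    return dict(segments)


example_reads = [
    ("bc1", "chr1", 1_000), ("bc1", "chr1", 3_000), ("bc1", "chr1", 5_000),
    ("bc1", "chr1", 60_000), ("bc1", "chr1", 62_000),
    ("bc2", "chr1", 2_000),
]
example_segments = split_into_segments(example_reads)
```

In this toy input, barcode `bc1` has a 55 kbp gap between positions 5,000 and 60,000, so it yields two segments. The facing ends of such segments are the candidate breakpoint positions; a real pipeline would also need to account for barcode conflicts, where unrelated fragments share one barcode.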
FIGURE 2. Workflow and algorithm. (A) Structural variation detection workflow. (B) Cluster segment ends by bins: left end cluster and right end cluster. (C) Pair up ends by shared barcodes. (D) Pare down candidates by removing those nearby. (E) Split into haplotypes by phasing barcodes on phasing blocks. (F) Use the barcode sharing heatmap pattern as a filter, and anchor the variation on the genome. The color depth of each point in the heatmap represents the number of barcodes shared between the corresponding X-axis and Y-axis positions.
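Steps (B) and (C) of the workflow can be sketched as follows. This is an illustrative approximation rather than the published algorithm: the names `segment_ends` and `pair_ends_by_shared_barcodes`, the 1,000 bp `bin_size`, and the `min_shared` cutoff of 2 are assumptions, and the pruning, phasing, and heatmap filtering of steps (D) to (F) are not covered.

```python
from collections import Counter
from itertools import combinations, product


def segment_ends(segment, bin_size):
    """Return the binned left ("L") and right ("R") end of one segment."""
    chrom, start, end = segment
    return [(chrom, start // bin_size, "L"), (chrom, end // bin_size, "R")]


def pair_ends_by_shared_barcodes(segments, bin_size=1_000, min_shared=2):
    """Count how many barcodes link each pair of binned segment ends.

    segments: {barcode: [(chrom, start, end), ...]} (hypothetical layout).
    Returns {(end_a, end_b): n_barcodes} for pairs supported by at least
    min_shared barcodes; each end is (chrom, bin_index, side).
    """
    support = Counter()
    for barcode, segs in segments.items():
        linked = set()
        # Only ends of *different* segments of one barcode are paired.
        for seg_a, seg_b in combinations(segs, 2):
            ends_a = segment_ends(seg_a, bin_size)
            ends_b = segment_ends(seg_b, bin_size)
            for end_a, end_b in product(ends_a, ends_b):
                linked.add(tuple(sorted((end_a, end_b))))
        # A barcode supports a given end pair at most once.
        support.update(linked)
    return {pair: n for pair, n in support.items() if n >= min_shared}


toy_segments = {
    "bc1": [("chr1", 10_000, 50_100), ("chr1", 80_300, 120_000)],
    "bc2": [("chr1", 22_000, 50_700), ("chr1", 80_100, 101_000)],
    "bc3": [("chr1", 35_000, 50_900), ("chr1", 80_900, 95_000)],
}
candidates = pair_ends_by_shared_barcodes(toy_segments)
```

In the toy data, three barcodes each break between bin 50 and bin 80 of chr1, so the right end at bin 50 and the left end at bin 80 form the only pair with enough shared barcodes, which is the pattern expected for a deletion of roughly 30 kbp.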
FIGURE 3. Large variations in HG002 that do not match the GIAB benchmark. (A) Heatmap for a deletion on Chr3. (B) Heatmap for an inversion on Chr12. (C) Heatmap for a deletion on Chr12. (D) Long read alignment supports the inversion in (B). (E) Long read alignment supports the deletion in (A). (F) Assembly alignment to reference by Blast for the deletion in (C). (G) Heatmap for a deletion on Chr19 and long read alignment. (H) Heatmap and structure for an inversion on Chr11 and assembly alignment.
Deletion evaluation on whole genome against GIAB HG002 benchmark.
| Deletion size (bp) | Metric | Sniffles | Long Ranger | NAIBR | stLFRsv | smoove | stLFRsv + smoove | GROC-SVs |
| | Reads | 100× long reads | 100× co-barcoded reads | 100× co-barcoded reads | 100× co-barcoded reads | 100× co-barcoded reads | 100× co-barcoded reads | 100× co-barcoded reads |
| | Mapping | Minimap2 | lariat | bwamem2 | bwamem2 | bwamem2 | bwamem2 | bwamem2 |
| 50–1 k | Benchmark | 4,719 | | | | | | |
| 50–1 k | Total call | 9,453 | 3,583 | 2 | 0 | 972 | 972 | 0 |
| 50–1 k | TP | 4,168 | 2,304 | 2 | 0 | 724 | 724 | 0 |
| 50–1 k | FP | 5,285 | 1,279 | 0 | 0 | 248 | 248 | 0 |
| 50–1 k | FN | 551 | 2,415 | 4,717 | 4,719 | 3,995 | 3,995 | 4,719 |
| 50–1 k | Precision | 44.09% | 64.30% | 100.00% | – | 74.49% | 74.49% | – |
| 50–1 k | Recall | 88.32% | 48.82% | 0.04% | – | 15.34% | 15.34% | – |
| 1 k–10 k | Benchmark | 577 | | | | | | |
| 1 k–10 k | Total call | 902 | 489 | 155 | 13 | 554 | 556 | 0 |
| 1 k–10 k | TP | 533 | 391 | 125 | 12 | 434 | 436 | 0 |
| 1 k–10 k | FP | 369 | 98 | 30 | 1 | 120 | 120 | 0 |
| 1 k–10 k | FN | 44 | 186 | 452 | 565 | 143 | 141 | 577 |
| 1 k–10 k | Precision | 59.09% | 79.96% | 80.65% | 92.31% | 78.34% | 78.42% | – |
| 1 k–10 k | Recall | 92.37% | 67.76% | 21.66% | 2.08% | 75.22% | 75.56% | – |
| 10 k–30 k | Benchmark | 31 | | | | | | |
| 10 k–30 k | Total call | 60 | 27 | 31 | 56 | 35 | 56 | 9 |
| 10 k–30 k | TP | 28 | 19 | 24 | 30 | 22 | 30 | 7 |
| 10 k–30 k | FP | 32 | 8 | 7 | 26 | 13 | 26 | 2 |
| 10 k–30 k | FN | 3 | 12 | 7 | 1 | 9 | 1 | 24 |
| 10 k–30 k | Precision | 46.67% | 70.37% | 77.42% | 53.57% | 62.86% | 53.57% | 77.78% |
| 10 k–30 k | Recall | 90.32% | 61.29% | 77.42% | 96.77% | 70.97% | 96.77% | 22.58% |
| >30 k | Benchmark | 9 | | | | | | |
| >30 k | Total call | 55 | 14 | 28 | 23 | 36 | 23 | 13 |
| >30 k | TP | 9 | 6 | 8 | 7 | 7 | 7 | 4 |
| >30 k | FP | 46 | 8 | 20 | 16 | 29 | 16 | 9 |
| >30 k | FN | 0 | 3 | 1 | 2 | 2 | 2 | 5 |
| >30 k | Precision | 16.36% | 42.86% | 28.57% | 30.43% | 19.44% | 30.43% | 30.77% |
| >30 k | Recall | 100.00% | 66.67% | 88.89% | 77.78% | 77.78% | 77.78% | 44.44% |
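The precision and recall values in the table follow from the TP, FP, and FN counts as precision = TP / (TP + FP) and recall = TP / (TP + FN). The short sketch below recomputes them for the stLFRsv + smoove column and pools the four size ranges; the function name `precision_recall` and the dictionary layout are illustrative choices, while the counts are copied from the table.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall); a value is None when its denominator is 0."""
    calls = tp + fp   # everything the caller reported
    truth = tp + fn   # everything in the benchmark
    precision = tp / calls if calls else None
    recall = tp / truth if truth else None
    return precision, recall


# (TP, FP, FN) for the stLFRsv + smoove column, per deletion size range.
combined_counts = {
    "50-1k": (724, 248, 3995),
    "1k-10k": (436, 120, 141),
    "10k-30k": (30, 26, 1),
    ">30k": (7, 16, 2),
}

per_range = {
    size: precision_recall(*counts) for size, counts in combined_counts.items()
}

# Pool the counts over all size ranges for the whole-genome figures.
total_tp, total_fp, total_fn = (sum(col) for col in zip(*combined_counts.values()))
overall_precision, overall_recall = precision_recall(total_tp, total_fp, total_fn)

print(f"overall precision: {overall_precision:.1%}")
print(f"overall recall:    {overall_recall:.1%}")
```

Pooling gives 1,197 true positives out of 1,607 calls and 5,336 benchmark deletions, which is about 74.5% precision and 22.4% recall, the figures quoted in the abstract. Callers with no calls in a size range have an undefined precision, shown as "–" in the table and as `None` here.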
Detection capability and estimated parameters of different HG002 libraries.
| Input DNA amount | | 1 ng | 1 ng | 1.5 ng | 1.5 ng |
| Reads count | | 2,525,286,352 | 3,029,968,430 | 2,172,780,252 | 2,994,596,020 |
| Average sequencing depth (after duplication removed) | | 44.34 | 35.77 | 46.73 | 44.38 |
| High-quality read ratio | | 89.57% | 78.03% | 79.15% | 75.55% |
| Read pairs per segment | | 32.33 | 18.30 | 18.40 | 17.21 |
| Barcode conflict (segments per barcode) | | 1.55 | 1.41 | 2.04 | 1.70 |
| Estimated parameters | | 1,500 | 1,500 | 2,500 | 1,900 |
| | | 4 | 4 | 4 | 4 |
| | | 13,100 | 13,900 | 22,200 | 13,800 |
| Detection capability | Deletion (bp) | 13,500 | 13,500 | 22,500 | 13,300 |
| | Inversion/duplication (bp) | 48,100 | 28,600 | 46,700 | 32,200 |
FIGURE 4. Resource usage of four pipelines when processing 100× stLFR reads of HG002 with the parameters given in Supplementary Table 1. *The resource usage of Long Ranger is estimated from its log file because it is a fully integrated pipeline.
FIGURE 5. (A) Venn diagram of detected deletions from four structural variant (SV) callers in HG001. (B) The heatmap and structure for a complex inversion in HG001. The signs of two breakpoints are marked by blue and yellow circles in the heatmap.