Fatih Karaoğlanoğlu, Camir Ricketts, Ezgi Ebren, Marzieh Eslami Rasekh, Iman Hajirasouliha, Can Alkan.
Abstract
Most existing methods for structural variant detection focus on discovery and genotyping of deletions, insertions, and mobile elements. Detection of balanced structural variants with no gain or loss of genomic segments, for example, inversions and translocations, is a particularly challenging task. Furthermore, there are very few algorithms to predict the insertion locus of large interspersed segmental duplications and characterize translocations. Here, we propose novel algorithms to characterize large interspersed segmental duplications, inversions, deletions, and translocations using linked-read sequencing data. We redesign our earlier algorithm, VALOR, and implement our new algorithms in a new software package, called VALOR2.
Keywords: Linked-reads; Structural variation; WGS
Year: 2020 PMID: 32192518 PMCID: PMC7083023 DOI: 10.1186/s13059-020-01975-8
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 Split molecule and read pair sequence signatures used in VALOR2. a Deletion. b Inversion. c Interspersed duplication in direct orientation. d Inverted duplication. e Translocation. Note that e shows only non-reciprocal translocations (for reciprocal translocations, please refer to Additional file 1: Figure S1). In each case, the large molecules that span the SV breakpoints are split into two mapped regions. Note that it is not possible to determine the mapped strand of the split molecules shown here. In e, the section including B and C is moved to between A and D. We do not show the inverted translocations here for simplicity. From the perspective of the reference genome (i.e., mapping), A, B, C, D, E, and F are defined as submolecules; the A/B, C/D, and E/F pairs are candidate splits; and the A/B-C/D quadruple is a split molecule pair
Fig. 2 Building the SV graph from split molecule pairs for an interspersed duplication. a Four pairs of split molecules that signal the event. b Corresponding SV graph, where each vertex denotes a pair of submolecules that signal the SV, and edges show “agreement” between pairs. The shaded area corresponds to the quasi-clique selected as representative of the putative SV
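The graph construction described in this caption can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from VALOR2: the `agree` predicate (overlap of hypothetical breakpoint intervals), the density threshold `gamma`, and the greedy peeling heuristic that stands in for whatever quasi-clique selection the software actually performs.

```python
from itertools import combinations
from typing import Callable, Dict, Iterable, Set

Graph = Dict[str, Set[str]]


def build_graph(vertices: Iterable[str], agree: Callable[[str, str], bool]) -> Graph:
    """One vertex per split molecule pair, one undirected edge per agreeing pair of vertices."""
    vertices = list(vertices)
    graph: Graph = {v: set() for v in vertices}
    for u, v in combinations(vertices, 2):
        if agree(u, v):
            graph[u].add(v)
            graph[v].add(u)
    return graph


def density(graph: Graph, members: Set[str]) -> float:
    """Fraction of possible edges present among `members` (1.0 means a clique)."""
    n = len(members)
    if n < 2:
        return 1.0
    edges = sum(len(graph[v] & members) for v in members) // 2
    return edges / (n * (n - 1) / 2)


def greedy_quasi_clique(graph: Graph, gamma: float = 0.8) -> Set[str]:
    """Assumed heuristic: drop the lowest-degree vertex until density >= gamma."""
    members = set(graph)
    while len(members) > 1 and density(graph, members) < gamma:
        worst = min(members, key=lambda v: (len(graph[v] & members), v))
        members.remove(worst)
    return members


# Hypothetical breakpoint intervals implied by five split molecule pairs.
intervals = {
    "p1": (100, 200), "p2": (150, 250), "p3": (180, 260),
    "p4": (190, 300), "p5": (900, 1000),
}


def overlaps(u: str, v: str) -> bool:
    """Assumed agreement test: the two intervals share at least one position."""
    (s1, e1), (s2, e2) = intervals[u], intervals[v]
    return max(s1, s2) < min(e1, e2)


graph = build_graph(intervals, overlaps)
cluster = greedy_quasi_clique(graph)
print(sorted(cluster))
```

In this toy input the four mutually overlapping pairs form the selected cluster, while the isolated pair `p5` is peeled off because it lowers the edge density below the assumed threshold.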
Fig. 3 Molecule size histogram mapped to chromosome 1 as observed in the linked-read sequencing data generated from the genome of the NA12878 sample [18]
Table 1 Prediction performance evaluation using simulated structural variants
| SV type | Method | Simulated | Predicted | TP | FP | FN | Pr. | Rec. | F1 |
|---|---|---|---|---|---|---|---|---|---|
| Duplications (direct) | VALOR2 | 111 | 103 | 89 | 14 | 22 | | | |
| Duplications (inverted) | VALOR2 | 49 | 51 | 43 | 8 | 6 | | | |
| Inversions | VALOR2 | 90 | 65 | 54 | 11 | 36 | 0.83 | 0.60 | 0.70 |
| | VALOR1 | 90 | 63 | 47 | 13 | 43 | 0.78 | 0.52 | 0.63 |
| | LUMPY/smoove | 90 | 35 | 27 | 7 | 63 | 0.79 | 0.30 | 0.44 |
| | DELLY | 90 | 358 | 39 | 293 | 51 | 0.12 | 0.43 | 0.18 |
| | TARDIS | 90 | 43 | 34 | 1 | 56 | 0.97 | 0.38 | 0.54 |
| | Sniffles | 90 | 787 | 72 | 603 | 18 | 0.11 | | 0.19 |
| | Long Ranger | 90 | 75 | 54 | 20 | 36 | 0.73 | 0.60 | 0.66 |
| | Long Ranger ∪ VALOR2 | 90 | 102 | 70 | 31 | 20 | 0.69 | 0.78 | |
| | Long Ranger ∩ VALOR2 | 90 | 38 | 38 | 0 | 52 | | 0.42 | 0.59 |
| Deletions | VALOR2 | 85 | 81 | 74 | 7 | 11 | 0.91 | 0.87 | 0.89 |
| | LUMPY/smoove | 85 | 292 | 66 | 226 | 19 | 0.23 | 0.78 | 0.35 |
| | DELLY | 85 | 496 | 72 | 424 | 13 | 0.15 | 0.85 | 0.25 |
| | TARDIS | 85 | 152 | 70 | 82 | 15 | 0.46 | 0.82 | 0.59 |
| | Sniffles | 85 | 467 | 72 | 395 | 13 | 0.15 | 0.85 | 0.26 |
| | Long Ranger | 85 | 262 | 79 | 175 | 6 | 0.31 | 0.93 | 0.47 |
| | Long Ranger ∪ VALOR2 | 85 | 270 | 163 | 185 | 3 | 0.47 | | 0.63 |
| | Long Ranger ∩ VALOR2 | 85 | 84 | 79 | 5 | 6 | | 0.93 | |
| Translocations | VALOR2 | 38 | 27 | 27 | 0 | 11 | | 0.71 | 0.83 |
| | LUMPY/smoove | 38 | 4 | 2 | 2 | 36 | 0.50 | 0.05 | 0.10 |
| | DELLY | 38 | 116 | 30 | 86 | 8 | 0.26 | 0.79 | 0.39 |
| | Long Ranger | 38 | 29 | 26 | 3 | 12 | 0.90 | 0.68 | 0.78 |
| | Long Ranger ∪ VALOR2 | 38 | 38 | 53 | 3 | 3 | 0.95 | | |
| | Long Ranger ∩ VALOR2 | 38 | 18 | 18 | 0 | 20 | | 0.47 | 0.64 |
We evaluate the prediction performance of only large SVs (> 80 kbp for inversions, > 40 kbp for duplications, > 100 kbp for deletions, and > 100 kbp for translocations). Note that VALOR1, LUMPY, DELLY, Sniffles, and Long Ranger are not able to call interspersed duplications, and TARDIS can call only duplications < 10 kb, which are smaller than the variants shown in this table. Precision is calculated as $\mathrm{Pr.} = \frac{TP}{TP + FP}$, and recall is defined as $\mathrm{Rec.} = \frac{TP}{TP + FN}$, where TP is the number of true positives, FP is the number of false positives, FN is the number of false negatives, Pr. is the precision, and Rec. is the recall. F1-score (shown as F1) is calculated as $F1 = \frac{2 \cdot \mathrm{Pr.} \cdot \mathrm{Rec.}}{\mathrm{Pr.} + \mathrm{Rec.}}$. SV calls predicted by both Long Ranger and VALOR2 (> 50% reciprocal overlap) are merged into a single call. Best values are highlighted with boldface font
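As a worked illustration of these definitions, the following minimal sketch recomputes the metrics for the VALOR2 inversion row (TP = 54, FP = 11, FN = 36) and shows one common way to measure reciprocal overlap between two calls. The function names, the half-open `(start, end)` interval convention, and the example coordinates are assumptions for illustration, not part of the VALOR2 evaluation code.

```python
from typing import NamedTuple, Tuple


class Metrics(NamedTuple):
    precision: float
    recall: float
    f1: float


def evaluate(tp: int, fp: int, fn: int) -> Metrics:
    """Precision, recall, and F1 from raw counts; empty denominators yield 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return Metrics(precision, recall, f1)


def reciprocal_overlap(a: Tuple[int, int], b: Tuple[int, int]) -> float:
    """Smaller of the two overlap fractions for half-open intervals (start, end)."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))


# VALOR2 inversion row of the table: TP = 54, FP = 11, FN = 36.
m = evaluate(54, 11, 36)
print(round(m.precision, 2), round(m.recall, 2), round(m.f1, 2))

# Two hypothetical calls; they would be merged if the value exceeds 0.5.
ro = reciprocal_overlap((100_000, 200_000), (140_000, 220_000))
print(ro > 0.5)
```

The rounded values agree with the 0.83, 0.60, and 0.70 reported for that row. The two hypothetical calls share 60 kbp, which is 60% of the first call and 75% of the second, so they pass the > 50% reciprocal overlap criterion.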
Fig. 4 Comparison of the size distribution of detected true (i.e., known) calls in simulation data, shown as a density plot. The SV size range detected by VALOR2 is complementary to that of WGS-based approaches
Table 2 Large structural variants found in biological data sets
| Deletions | NA19238 | 8 | 8 | 1 | 1 | 81 | 49 | 192 | 127 | 14 | 13 |
| | NA19239 | 10 | 10 | 3 | 3 | 104 | 64 | 232 | 157 | 17 | 14 |
| | NA19240 | 11 | 11 | 2 | 2 | 95 | 59 | 228 | 157 | 15 | 14 |
| | NA12878 | 14 | 14 | 18 | 18 | 138 | 62 | 273 | 170 | 20 | 20 |
| | CHM1 | 9 | 8 | 109 | 72 | 106 | 47 | 226 | 113 | 20 | 19 |
| | CHM13 | 7 | 7 | 95 | 65 | 78 | 43 | 660 | 423 | 10 | 8 |
| Inversions | NA19238 | 56 | 17 | 2 | 2 | 3 | 0 | 407 | 37 | 14 | 1 |
| | NA19239 | 49 | 15 | 1 | 1 | 4 | 0 | 406 | 33 | 11 | 0 |
| | NA19240 | 89 | 25 | 3 | 2 | 4 | 0 | 435 | 31 | 9 | 1 |
| | NA12878 | 33 | 12 | 5 | 1 | 3 | 0 | 415 | 37 | 43 | 1 |
| | CHM1 | 35 | 26 | 2 | 2 | 3 | 0 | 259 | 23 | 22 | 1 |
| | CHM13 | 40 | 28 | 2 | 2 | 5 | 0 | 1496 | 65 | 50 | 0 |
| Duplications | NA19238 | 9 | 5 | 3 | 3 | 142 | 91 | 307 | 183 | 77 | 46 |
| | NA19239 | 9 | 5 | 0 | 0 | 158 | 96 | 298 | 189 | 79 | 42 |
| | NA19240 | 19 | 8 | 2 | 2 | 139 | 91 | 284 | 187 | 82 | 47 |
| | NA12878 | 6 | 4 | 20 | 19 | 196 | 93 | 341 | 184 | 293 | 133 |
| | CHM1 | 5 | 3 | 0 | 0 | 164 | 83 | 289 | 138 | 131 | 64 |
| | CHM13 | 7 | 3 | 0 | 0 | 519 | 276 | 1425 | 784 | 329 | 196 |
| Translocations | NA19238 | 1 | 0 | 0 | 0 | 336 | 0 | 8788 | 0 | N/A | N/A |
| | NA19239 | 3 | 0 | 0 | 0 | 368 | 0 | 8946 | 0 | N/A | N/A |
| | NA19240 | 1 | 0 | 0 | 0 | 362 | 0 | 9250 | 0 | N/A | N/A |
| | NA12878 | 1 | 0 | 1 | 0 | 842 | 0 | 9770 | 0 | N/A | N/A |
| | CHM1 | 0 | 0 | 0 | 0 | 320 | 0 | 6511 | 0 | N/A | N/A |
| | CHM13 | 0 | 0 | 0 | 0 | 184 | 0 | 117667 | 0 | N/A | N/A |
Similar to Table 1, we only report large SVs we discovered in real data sets (> 80 kbp for inversions, > 40 kbp for duplications, > 100 kbp for deletions, and > 100 kbp for translocations). We ran LUMPY using the smoove wrapper as recommended by the authors. Note that TARDIS does not predict translocations. We merged tandem and interspersed duplications in this table since Long Ranger, LUMPY, and DELLY do not differentiate between them. ∗For CNVs (deletions and duplications), known variants refer to those that are reported in dbVar [52] non-redundant call set (https://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/). For balanced rearrangements (inversions and translocations), we used the gnomAD [53] v2.1.1 call set, lifted over to GRCh38 (https://storage.googleapis.com/gnomad-public/papers/2019-sv/gnomad_v2.1_sv.sites.vcf.gz)