Cheng Quan, Hao Lu, Yiming Lu, Gangqiao Zhou.
Abstract
Population-scale studies of structural variation (SV) are growing rapidly worldwide with the development of long-read sequencing technology, yielding a considerable number of novel SVs and complete, gap-closed genome assemblies. Herein, we highlight recent studies that use a hybrid sequencing strategy and present the challenges that reference bias poses for large-scale SV genotyping. Genotyping SVs at a population scale remains challenging, which severely affects genotype-based population genetic studies and genome-wide association studies of complex diseases. We summarize academic efforts to improve genotype quality through linear or graph representations of reference and alternative alleles. Graph-based genotypers, which can integrate diverse genetic information, have been applied effectively to large and diverse cohorts and contribute to unbiased downstream analysis. Meanwhile, the field still urgently needs efficient tools for constructing complex graphs and performing sequence-to-graph alignment.
Keywords: Genotyping; Long-read sequencing; Pan-genome; Structural variation
Year: 2022 PMID: 35685364 PMCID: PMC9163579 DOI: 10.1016/j.csbj.2022.05.047
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155
Fig. 1. An overview of the hybrid sequencing strategy. A small number of long-read sequencing (LRS) samples are used for variant detection, followed by a large number of short-read sequencing (SRS) samples for genotyping. After the discovery of SV collections, including deletions (DEL), duplications (DUP), insertions (INS), inversions (INV), inter-chromosomal translocations (CTX), and complex SVs (CPX), these variants are added to the reference to construct a linear representation of alternative alleles or a graph representation of all alleles. Two strategies are then used to perform the genotyping: aligning short reads to the primary contig along with the alternative sequences, or performing a sequence-to-graph alignment.
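The two representations described in the caption can be illustrated with a toy example. The following is a minimal sketch, not the procedure of any cited study: the `SV` record, the function names, the 0-based coordinates, and the three-node graph layout are all assumptions made for illustration, and only deletions and insertions are handled.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class SV:
    """Hypothetical SV record (0-based coordinates on the reference)."""
    kind: str         # "DEL" or "INS"
    pos: int          # DEL: first deleted base; INS: inserted before ref[pos]
    length: int = 0   # number of deleted bases (DEL only)
    seq: str = ""     # inserted sequence (INS only)


def alt_contig(ref: str, sv: SV, flank: int) -> str:
    """Linear representation: the alternative allele plus flanking reference."""
    left = ref[max(0, sv.pos - flank):sv.pos]
    if sv.kind == "DEL":
        end = sv.pos + sv.length
        return left + ref[end:end + flank]
    if sv.kind == "INS":
        return left + sv.seq + ref[sv.pos:sv.pos + flank]
    raise ValueError(f"unsupported SV type: {sv.kind}")


def build_graph(ref: str, sv: SV) -> tuple[dict[str, str], set[tuple[str, str]]]:
    """Graph representation: one bubble with a reference and an alternative path."""
    if sv.kind == "DEL":
        end = sv.pos + sv.length
        nodes = {"L": ref[:sv.pos], "V": ref[sv.pos:end], "R": ref[end:]}
    elif sv.kind == "INS":
        nodes = {"L": ref[:sv.pos], "V": sv.seq, "R": ref[sv.pos:]}
    else:
        raise ValueError(f"unsupported SV type: {sv.kind}")
    # L->V->R traverses the variable node; L->R skips it.
    edges = {("L", "V"), ("V", "R"), ("L", "R")}
    return nodes, edges


def spell(nodes: dict[str, str], path: list[str]) -> str:
    """Concatenate node sequences along a path through the graph."""
    return "".join(nodes[n] for n in path)
```

For a deletion, the path `L -> V -> R` spells the reference allele and `L -> R` spells the alternative allele; for an insertion the roles are swapped. In the linear strategy, each `alt_contig` would be appended to the reference as an extra sequence for read alignment.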
An overview of population-scale structural variation studies using a hybrid sequencing strategy.
| Study | Discovery | Genotyping | Genotyper | Genotyping rate | Recall rate |
|---|---|---|---|---|---|
| Lu et al. (2022) | 35 (20–40×)a | 35 (HGSVC) | danbing-tk v1.3 | – | – |
| Beyter et al. (2021) | 3,622 (17×)b | 10,000 (deCODE, 34×) | GraphTyper v2.6 | – | 36% |
| Ebert et al. (2021) | 35 (20–40×)a | 3,202 (1KG, 34×) | Paragraph v2.4 | 79% | 74% |
| Quan et al. (2021) | 25 (10–20×)b | 276 (40×) | Paragraph v2.4 | 69% | 54% |
| Sirén et al. (2021) | 16 (>50×)a | 2,000 (MESA, 20×) | toil-vg | – | – |
| Yan et al. (2021) | 15 (>50×)a | 2,504 (1KG, 30×) | Paragraph v2.2 | 86% | 73% |
| Ouzhuluobu et al. (2020) | ZF1 (70×)a | 77 (30×) | CNVnator | – | – |
| Soto et al. (2020) | 2 nonhumansb | 8 (SGDP, 42×) | SVTyper v0.7 | 96% | 45% |
| Audano et al. (2019) | 15 (>50×)a | 174 (1KG, 25×) | SMRT-SV v2 | 55% | 97% |
| Chaisson et al. (2019) | 9 (>50×)ab | 238 (SGDP, 40×) | SMRT-SV | > 92% | > 96% |
| Kronenberg et al. (2018) | CHM13 (>65×)a | 16 (SGDP) | SVTyper v0.1 | – | – |
| Huddleston et al. (2017) | CHM1 (62×)a, CHM13 (66×)a | 30 (1KG, 30×) | SMRT-SV | 79% | 93% |
Genotyping rate is the proportion of SVs successfully genotyped, usually determined by a missing-rate threshold and a Hardy-Weinberg equilibrium test. Recall rate is the proportion of SVs whose alternative allele is present in at least one haplotype. HGSVC, Human Genome Structural Variation Consortium. GTEx, the Genotype-Tissue Expression project. 1KG, 1000 Genomes Project. SGDP, Simons Genome Diversity Project. MESA, Multi-Ethnic Study of Atherosclerosis cohort. a, PacBio long-read sequencing. b, Nanopore sequencing. – indicates that the information was not addressed in the paper.
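The two metrics in the legend above can be made concrete with a short sketch. It assumes genotypes are coded as alternative-allele counts (0, 1, 2) with `None` for a missing call; the thresholds (`max_missing=0.05`, `max_chi2=3.84`), the chi-square form of the Hardy-Weinberg test, and all function names are illustrative assumptions, not values or methods taken from the studies in the table.

```python
from __future__ import annotations
from typing import Optional, Sequence

Genotype = Optional[int]  # 0, 1 or 2 alternative alleles; None = missing call


def missing_rate(gts: Sequence[Genotype]) -> float:
    """Fraction of samples without a genotype call."""
    return sum(g is None for g in gts) / len(gts)


def hwe_chi2(gts: Sequence[Genotype]) -> float:
    """Chi-square statistic comparing observed and Hardy-Weinberg expected counts."""
    called = [g for g in gts if g is not None]
    n = len(called)
    if n == 0:
        return float("inf")
    p = sum(called) / (2 * n)  # alternative-allele frequency
    q = 1.0 - p
    expected = [n * q * q, 2 * n * p * q, n * p * p]
    observed = [called.count(0), called.count(1), called.count(2)]
    # Skip classes with zero expectation (monomorphic sites give chi2 = 0).
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def genotyping_rate(sites: dict[str, Sequence[Genotype]],
                    max_missing: float = 0.05,   # assumed threshold
                    max_chi2: float = 3.84) -> float:  # ~p = 0.05, 1 d.f. (assumed)
    """Proportion of SVs passing both the missing-rate and the HWE filter."""
    passed = sum(
        missing_rate(g) <= max_missing and hwe_chi2(g) <= max_chi2
        for g in sites.values()
    )
    return passed / len(sites)


def recall_rate(sites: dict[str, Sequence[Genotype]]) -> float:
    """Proportion of SVs whose alternative allele is called in at least one sample."""
    seen = sum(any(g is not None and g > 0 for g in gts) for gts in sites.values())
    return seen / len(sites)
```

Individual studies apply their own filters and thresholds, so the rates reported in the table are not necessarily comparable with the output of this sketch.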
An overview of structural variation genotypers based on the linear reference genome. The last five columns give the supported SV types.
| Tools | Input | Feature | Genotyping model | INS | DEL | DUP | INV | TRA |
|---|---|---|---|---|---|---|---|---|
| STIX | SRS | RP, SR | – | × | ✓ | ✓ | ✓ | ✓ |
| muCNV | SRS | RP, SR, RD | GMM | × | ✓ | ✓ | ✓ | × |
| NPSV | SRS | Realignment features | SVM/RF | ✓ | ✓ | × | × | × |
| Nebula | SRS | Unique and affected k-mers | GMM | ✓ | ✓ | × | ✓ | × |
| CNV-JACG | SRS | RP, SR, RD, and other sequence features | RF | × | ✓ | ✓ | × | × |
| SMRT-SV | SRS | Realignment features | SVM | ✓ | ✓ | × | × | × |
| SV2 | SRS | RP, SR, RD, HAR | SVM | × | ✓ | ✓ | × | × |
| Genome STRiP | SRS | RP, SR, RD | GMM | × | ✓ | ✓ | × | × |
| SVTyper | SRS | RP, SR | BM | × | ✓ | ✓ | ✓ | ✓ |
| Delly2 | SRS | RP, SR, RD | BM | × | ✓ | ✓ | ✓ | ✓ |
| CNVnator | SRS | RP | SGM | × | ✓ | ✓ | × | × |
| SVJedi | LRS | Realignment features | BM | ✓ | ✓ | ✓ | ✓ | ✓ |
| Sniffles | LRS | SR, alignment events | BST | ✓ | ✓ | ✓ | ✓ | ✓ |
| svviz2 | LRS | Realignment MAPQ | BM | ✓ | ✓ | ✓ | ✓ | ✓ |
RP, read pair. SR, split-read. RD, read-depth. HAR, heterozygous allele ratio. MAPQ, mapping quality. GMM, Gaussian mixture models. SVM, support vector machine. RF, random forest. MLE, maximum likelihood estimation. SGM, single Gaussian models. BM, Bayesian model. BST, binary search tree. SRS, short-read sequencing. LRS, long-read sequencing. TRA, translocation.
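As a rough illustration of what a Bayesian genotyping model (BM in the table above) does with read evidence, the sketch below computes genotype posteriors from counts of reference-supporting and alternative-supporting reads under a binomial likelihood. It is not the implementation of any listed tool: the expected alternative-read fractions (0.05, 0.5, 0.95), the uniform priors, and the function names are assumptions chosen for the example.

```python
from __future__ import annotations
from math import comb

GENOTYPES = ("0/0", "0/1", "1/1")


def genotype_posteriors(ref_reads: int, alt_reads: int,
                        alt_fraction: tuple[float, float, float] = (0.05, 0.5, 0.95),
                        priors: tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3),
                        ) -> list[float]:
    """Posterior probability of each genotype given read support.

    alt_fraction[i] is the assumed probability that a read supports the
    alternative allele when the true genotype is GENOTYPES[i].
    """
    n = ref_reads + alt_reads
    likelihoods = [
        comb(n, alt_reads) * f ** alt_reads * (1 - f) ** ref_reads
        for f in alt_fraction
    ]
    joint = [lk * pr for lk, pr in zip(likelihoods, priors)]
    total = sum(joint)
    return [j / total for j in joint]


def call_genotype(ref_reads: int, alt_reads: int) -> tuple[str, float]:
    """Return the most probable genotype and its posterior ('./.' if no reads)."""
    if ref_reads + alt_reads == 0:
        return "./.", 0.0
    post = genotype_posteriors(ref_reads, alt_reads)
    best = max(range(len(post)), key=post.__getitem__)
    return GENOTYPES[best], post[best]
```

For example, `call_genotype(10, 10)` favors the heterozygous genotype because an even split of reads is far more likely under an expected alternative fraction of 0.5 than under 0.05 or 0.95. Real genotypers differ in how reads are assigned to alleles (read pairs, split reads, realignment) and in the priors they use.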
An overview of graph-based genotypers for structural variation.
| Tools | Graph construction | Graph Indexing strategy | Sequence-to-Graph | Genotyping algorithm |
|---|---|---|---|---|
| Gramtools | NDAG | vBWT | Variation-aware backward search | Coverage model |
| Minos | NDAG | vBWT | Variation-aware backward search | Coverage model |
| toil-vg | VG | GCSA2, GBWT, XG, snarl | SMEM seeds | Coverage model |
| PanGenie | DAG | k-mer hash table | – | HMM |
| GraphTyper2 | DAG | k-mer hash table | Matching k-mers as seeds | Coverage model |
| Paragraph | DAG | Path families | GSSW | Coverage model |
| BayesTyper | VG | Variant cluster groups | Heuristic search | Generative Model |
DAG, directed acyclic graph. NDAG, nested DAG. VG, variation graph. BWT, Burrows–Wheeler transform. vBWT, variation BWT. SMEM, super-maximal exact match. HMM, Hidden Markov Model. GSSW, graph SIMD Smith-Waterman algorithm.
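The "k-mer hash table" indexing strategy listed above can be sketched in a few lines: k-mers from each allele path are stored in a hash table, and reads are then scanned for k-mers that occur in only one allele. This is a simplified illustration under stated assumptions (a single bubble with named allele sequences, exact k-mer matches, hypothetical function names), not the data structure used by PanGenie or GraphTyper2.

```python
from __future__ import annotations
from collections import Counter, defaultdict
from typing import Iterable, Iterator


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every k-mer of seq in order."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_kmer_index(alleles: dict[str, str], k: int) -> dict[str, set[str]]:
    """Map each k-mer to the set of allele names whose sequence contains it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, seq in alleles.items():
        for km in kmers(seq, k):
            index[km].add(name)
    return dict(index)


def allele_specific_counts(index: dict[str, set[str]],
                           reads: Iterable[str], k: int) -> Counter:
    """Count read k-mers that are unique to a single allele."""
    counts: Counter = Counter()
    for read in reads:
        for km in kmers(read, k):
            owners = index.get(km)
            if owners is not None and len(owners) == 1:
                counts[next(iter(owners))] += 1
    return counts
```

The resulting per-allele counts are the kind of evidence a coverage model or an HMM could consume; k-mers shared by both alleles carry no information about the genotype and are ignored here.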
An overview of tools for graph construction and sequence-to-graph alignment.
| Category | Tools | Graph | Output format | Description |
|---|---|---|---|---|
| Graph construction | seqwish | VG | GFA | Builds a VG from a set of sequences and the alignments between them |
| Graph construction | Cuttlefish | DBG | GFA | Builds a colored compacted DBG from a collection of genome references |
| Graph construction | ODGI | VG | ODGI | A suite of tools that implements scalable algorithms |
| Graph construction | Pandora | DAG | FASTA | A pan-genome graph structure and algorithms for identifying variants |
| Graph construction | Simplitigs | DBG | FASTA | A compact representation of DBG |
| Graph construction | Bifrost | DBG | GFA | A parallel algorithm enabling the direct construction of the compacted DBG |
| Graph construction | libbdsg | VG | GFA | Tools that allow construction and manipulation of genome graphs with dense variation |
| Graph construction | minigraph | VG | GFA | A graph-based data model to represent multiple genomes |
| Graph construction | SevenBridges | DAG | – | A computational graph genome implementation |
| Graph construction | vg | VG | VG | A toolkit of computational methods for creating and manipulating VG |
| Graph construction | Wheeler graphs | DBG | DOT | A framework for BWT-based data structures |
| Graph alignment | GraphChainer | VG | GAM | An algorithm to co-linearly chain a set of seeds in an acyclic VG |
| Graph alignment | BlastFrost | DBG | GFA | Queries the Bifrost data structure for sequences of interest |
| Graph alignment | A* | – | ALN | A seed heuristic enabling fast and optimal sequence-to-graph alignment |
| Graph alignment | Giraffe | VG | VG | A pangenome short-read mapper that can map to a collection of haplotypes |
| Graph alignment | GraphAligner | VG | GAF | A tool for aligning long reads to genome graphs |
| Graph alignment | SPAligner | DBG | GPA | A tool for aligning long diverged nucleotide and amino acid sequences to assembly graphs |
| Graph alignment | Vargas | DAG | SAM | A heuristic-free algorithm to find the highest-scoring alignment |
| Graph alignment | PaSGAL | DAG | TSV | A parallel algorithm for computing sequence-to-graph alignments |
| Graph alignment | HISAT2 | DBG | SAM | A tool that can align both DNA and RNA sequences using a graph Ferragina-Manzini index |
| Graph alignment | V-ALIGN | DAG | TXT | A tool based on dynamic programming that allows gapped alignment directly on the input graph |
VG, variation graph. DBG, de Bruijn graph. DAG, directed acyclic graph.