Jana Wold1, Klaus-Peter Koepfli2,3,4, Stephanie J Galla1,5, David Eccles6, Carolyn J Hogg7, Marissa F Le Lec8, Joseph Guhlin8,9, Anna W Santure10, Tammy E Steeves1.
Abstract
Structural variants (SVs) are large rearrangements (>50 bp) within the genome that impact gene function and the content and structure of chromosomes. As a result, SVs are a significant source of functional genomic variation, that is, variation at genomic regions underpinning phenotype differences, that can have large effects on individual and population fitness. While there are increasing opportunities to investigate functional genomic variation in threatened species via single nucleotide polymorphism (SNP) data sets, SVs remain understudied despite their potential influence on fitness traits of conservation interest. In this future-focused Opinion, we contend that characterizing SVs offers the conservation genomics community an exciting opportunity to complement SNP-based approaches to enhance species recovery. We also leverage the existing literature (predominantly in human health, agriculture and eco-evolutionary biology) to identify approaches for readily characterizing SVs and consider how integrating these into the conservation genomics toolbox may transform the way we manage some of the world's most threatened species.
Keywords: conservation genetics; fitness traits; functional diversity; genome-wide diversity; pangenomes; structural variation
Year: 2021 PMID: 34424587 PMCID: PMC9290615 DOI: 10.1111/mec.16141
Source DB: PubMed Journal: Mol Ecol ISSN: 0962-1083 Impact factor: 6.622
FIGURE 1 Structural variant (SV) types and how they differ from a reference genome: (a) deletion, where a segment of DNA is not present in an individual, but present in the reference; (b) duplication, a rearrangement where there is more than one copy of a particular region of the genome, often in tandem. These can be either intrachromosomal, as shown here, or interchromosomal; (c) insertion, a DNA sequence present in an individual sample, but not present in the reference; (d) inversion, a segment of the chromosome that has had a double‐stranded break at an upstream and downstream location and become reversed in order; (e) translocation, a rearrangement where a portion of one chromosome has been transposed onto another
FIGURE 2 Decision tree for characterizing structural variants (SVs), including large and/or complex SVs. See text for details. Chromatin conformation capture methods include Hi‐C, Omni‐C, and Pore‐C. PacBio, Pacific Biosciences; ONT, Oxford Nanopore Technologies. Footnote references: 1, Choi et al. (2009); 2, Iannucci et al. (2021)
FIGURE 3 Structural variant (SV) types and common problems associated with short‐read sequence data: (a) deletion, called when reads do not map to, and/or are split across, a given region on the reference genome. Generally the most straightforward SV to detect with short‐read data, but complex rearrangements may preclude mapping and result in a false call; (b) duplication, typically identified by an increase in read depth; however, reference error may result in a false call, preclude assessments of copy number variation, or cause sequence variation to be missed due to unmapped reads; (c) insertion, short reads may be mapped if the majority of each read aligns to the reference genome, but reads composed mostly of the insertion sequence may remain unmapped; (d) inversion, breakpoints (i.e., exact positions of double‐stranded DNA breaks) are difficult to resolve as they typically occur in highly repetitive regions; (e) translocation, breakpoints are difficult to resolve as they commonly occur in highly repetitive regions, and reads may also incorrectly map to the “original” chromosome from which the translocation arose. Further, it is common for more than one SV or different SV types to occur in close proximity. Resolving these complex SVs is particularly challenging with short‐read sequence data as multiple mapping signals may contradict one another (e.g., multiple deletions in close proximity, deletions that overlap with inversion/translocation breakpoints), unless large sample sizes are used (Collins et al., 2020)
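The read‐depth signal described in panels (a) and (b) can be illustrated with a minimal sketch. It assumes per‐window mean depths have already been computed from an alignment, and it uses hypothetical ratio thresholds of 0.5 and 1.5 (roughly what a heterozygous loss or gain of one copy would produce in a diploid). The function name `depth_calls` and the example depths are illustrative only; real callers combine depth with read‐pair and split‐read evidence.

```python
from statistics import median

def depth_calls(window_depths, low=0.5, high=1.5):
    """Label each window by its depth relative to the median window depth.

    The thresholds are illustrative assumptions, not values from the source.
    """
    baseline = median(window_depths)
    if baseline == 0:
        raise ValueError("median depth is zero; cannot normalise")
    labels = []
    for depth in window_depths:
        ratio = depth / baseline
        if ratio <= low:
            labels.append("candidate_deletion")     # fewer reads than expected
        elif ratio >= high:
            labels.append("candidate_duplication")  # more reads than expected
        else:
            labels.append("normal")
    return labels

# Hypothetical mean depths for ten consecutive windows along one scaffold.
depths = [30, 31, 29, 14, 15, 30, 46, 47, 30, 29]
labels = depth_calls(depths)
for depth, label in zip(depths, labels):
    print(depth, label)
```

A depth‐only approach like this cannot distinguish a true duplication from a collapsed repeat in the reference, which is one of the reference errors the caption warns about.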
FIGURE 4 Characterization of SVs using assembled whole genomes and genome graphs built from multiple reference genomes: (a) Schematic of structural variant (SV) discovery using alignment of assembled whole genomes. (b) Schematic of genome graphs, which can then be used to characterize a pangenome and facilitate rapid genotyping of individuals, as “core” regions (genomic regions that do not vary among individuals) are readily distinguished from “accessory” regions (genomic regions that do vary among individuals). Genome graphs can also facilitate accurate genotyping as more than one allele may be considered at once. The use of multiple reference genomes can reduce reference bias and better capture insertions, inversion haplotypes and complex SVs (i.e., genomic regions where multiple SVs occur in close proximity)
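The distinction between “core” and “accessory” regions in panel (b) can be sketched as a presence/absence comparison. The example below is a simplification that assumes each region has already been scored as present or absent per individual, and that every sampled individual carries at least one region. The function name `split_core_accessory`, the region names, and the individuals are hypothetical; a real genome graph encodes this information as paths through nodes rather than as sets.

```python
def split_core_accessory(presence):
    """Split regions into core (carried by all individuals) and accessory.

    presence maps a region name to the set of individuals carrying it.
    """
    # The sampled individuals are inferred from the union of all carriers.
    individuals = set().union(*presence.values())
    core = sorted(r for r, carriers in presence.items() if carriers == individuals)
    accessory = sorted(r for r, carriers in presence.items() if carriers != individuals)
    return core, accessory

# Hypothetical presence/absence data for four regions in three individuals.
presence = {
    "region_A": {"ind1", "ind2", "ind3"},
    "region_B": {"ind1", "ind3"},
    "region_C": {"ind1", "ind2", "ind3"},
    "region_D": {"ind2"},
}
core, accessory = split_core_accessory(presence)
print("core:", core)
print("accessory:", accessory)
```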
Comparison of short‐read sequencing, long‐read sequencing, multiple reference genomes and optical mapping approaches for characterizing structural variants (SVs)
| Feature | Short‐reads | Long‐reads | Multiple reference genomes | Optical mapping |
|---|---|---|---|---|
| Type of SV that can be assessed | Deletions, insertions, inversions, copy number variants, and some translocations | Deletions, insertions, inversions, translocations, duplications, complex SVs | Deletions, insertions, inversions, translocations, duplications, complex SVs | Deletions, insertions, inversions, translocations, duplications |
| High‐quality reference genome | Required | Required | Required (multiple) | Optional |
| Technology used to generate data for SV discovery and genotyping | Illumina HiSeq, NovaSeq, NextSeq | Pacific Biosciences (PacBio) Sequel; Oxford Nanopore Technologies (ONT) | Multiple reference genomes generated using a combination of short‐, long‐, linked‐reads (e.g., Hi‐C, Pore‐C) for representative individuals; short‐ and/or long‐reads generated for population‐level genotyping | Bionano Genomics Saphyr |
| DNA quality | Moderate to high molecular weight | High to ultra‐high molecular weight | High to ultra‐high molecular weight | Ultra‐high molecular weight |
| Length of SVs detected | 50 bp to <1 Mb | 50 bp to >1 Mb | 50 bp to >1 Mb | 500 bp to >1 Mb |
| Sequence coverage required | ≥10× | ≥10× | ≥10× to generate multiple reference genomes; ≥10× for population‐level genotyping | >30× |
| Method of SV discovery | Alignment‐based, including read‐pair, read depth, split‐reads and local assembly | Alignment‐based with local assembly | Alignment of multiple de novo assembled whole reference genomes; genome graphs | Alignment‐based, either to reference genome or to optical maps from different samples |
| SV genotype evidence | Multiple sources of evidence required depending on SV type (read pair, read depth, split read) | Single source of evidence (long continuous reads) usually sufficient | Two sources of evidence required (multiple references and population‐level sampling) | Order, position, and orientation of fluorescently‐labelled sequence motifs |
| Example algorithms/programs | BreakDancer; LUMPY; Manta | SMRT‐SV (PacBio); NanoVar (ONT) | Minimap2; MUMmer; GraphTyper2 | Bionano SVCaller |
| Proportion of SVs discovered across the genome | Moderate | High | High | High |
| Recall rate (proportion of “true” variants that are detected) | Low | High | High | High |
| False positive rate for detection | High | Low | Low to moderate | Low |
| SV discovery challenges | Choice of method influences number and types of SVs detected; benchmark data mostly lacking for non‐model species to verify SV accuracy; reference genome bias; low precision for detecting complex SVs; low performance in low complexity regions, near SNPs and small indels. | Sequencing error rate can complicate read mapping and SV detection. | Total number and type of SVs discovered dependent on number of genomes sequenced and compared (sample size); precision affected by complexity of SV type. | Sequence motif used for fluorescent labelling is highly limited and thus may reduce application to some species with complex genomes; absence of nucleotide information may hinder detection of complex SVs. |
| Computational resources required | High | Moderate, depending on sample number | High, depending on sample number | Moderate, depending on sample number |
| Cost per sample | Low | High | Moderate to high | Moderate |
Note: high molecular weight DNA refers to fragments ≥20 kb; ultra‐high molecular weight DNA refers to fragments ≥100 kb.
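The recall rate defined in the table (the proportion of “true” variants that are detected) can be made concrete with a small sketch that compares a call set against a truth set. It assumes a call matches a true SV when both share chromosome and SV type and overlap reciprocally by at least 50%, a matching rule chosen here for illustration rather than taken from the source. It also reports the share of calls with no match in the truth set, which is a proxy for the false positive rate in the table rather than a formal false positive rate. The `SV` record, the function names, and all coordinates are hypothetical.

```python
from typing import NamedTuple

class SV(NamedTuple):
    chrom: str
    start: int
    end: int
    svtype: str

def reciprocal_overlap(a, b):
    """Smallest fraction of either interval covered by their shared span."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))

def evaluate(calls, truth, min_overlap=0.5):
    """Return (recall, share of calls that match no true SV)."""
    detected = sum(
        any(reciprocal_overlap(t, c) >= min_overlap for c in calls) for t in truth
    )
    false_calls = sum(
        not any(reciprocal_overlap(c, t) >= min_overlap for t in truth) for c in calls
    )
    recall = detected / len(truth) if truth else 0.0
    false_share = false_calls / len(calls) if calls else 0.0
    return recall, false_share

# Hypothetical truth set and call set.
truth = [
    SV("chr1", 1000, 2000, "DEL"),
    SV("chr1", 5000, 5600, "DUP"),
    SV("chr2", 300, 900, "DEL"),
    SV("chr2", 4000, 4500, "INV"),
]
calls = [
    SV("chr1", 1050, 2000, "DEL"),
    SV("chr1", 5000, 5500, "DUP"),
    SV("chr2", 7000, 7400, "DEL"),
]
recall, false_share = evaluate(calls, truth)
print(f"recall = {recall:.2f}, unmatched calls = {false_share:.2f}")
```

As the table notes, benchmark truth sets are mostly lacking for non‐model species, so in practice the truth set would often have to come from an orthogonal method such as long reads or optical mapping.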