Wouter De Coster, Matthias H Weissensteiner, Fritz J Sedlazeck.
Abstract
Long-read sequencing technologies have now reached a level of accuracy and yield that allows their application to variant detection at a scale of tens to thousands of samples. Concomitant with the development of new computational tools, the first population-scale studies involving long-read sequencing have emerged over the past 2 years and, given the continuous advancement of the field, many more are likely to follow. In this Review, we survey recent developments in population-scale long-read sequencing, highlight potential challenges of a scaled-up approach and provide guidance regarding experimental design. We provide an overview of current long-read sequencing platforms, variant calling methodologies, and approaches for de novo assembly and reference-based mapping. Furthermore, we summarize strategies for variant validation, genotyping and predicting functional impact and emphasize challenges remaining in achieving long-read sequencing at a population scale.
Year: 2021 PMID: 34050336 PMCID: PMC8161719 DOI: 10.1038/s41576-021-00367-3
Source DB: PubMed Journal: Nat Rev Genet ISSN: 1471-0056 Impact factor: 53.242
Fig. 1 | Overview of population-scale studies using long-read sequencing.
Studies published in 2019–2021 in which five or more samples were sequenced are included. The genome sizes of the study organisms are grouped into three categories (<500 Mbp, 500–2,000 Mbp and >2,000 Mbp), and the methodological approach taken to investigate genetic variation (comparison of assemblies, read mapping against a reference or both) is illustrated by the different colours. For further details, see Table 1.
Table 1 | An overview of long-read-based population studies
| Study | Organism | Category | Technology^a | Analysis approach | Sample size^b | Genome size (Mbp) | Ref. |
|---|---|---|---|---|---|---|---|
| Kou et al. (2020) | Rice | Agriculture | PacBio | Assembly comparison and read mapping | 15 (LR); 393 (SR) | 430 | [ |
| Weissensteiner et al. (2020) | Crow | Evolution | PacBio | Read mapping | 33 (LR); 127 (SR) | 1,300 | [ |
| Chakraborty et al. (2019) | | Evolution | PacBio | Assembly comparison | 14 (LR) | 180 | [ |
| Jiao & Schneeberger (2020) | | Evolution | PacBio | Assembly comparison | 7 (LR) | 135 | [ |
| Alonge et al. (2020) | Tomato | Agriculture | ONT | Read mapping | 100 (LR) | 950 | [ |
| Beyter et al. (2020) | Human | Human evolution | ONT | Read mapping | 3,622 (LR) | 3,200 | [ |
| Tusso et al. (2019) | Yeast | Evolution | ONT and PacBio | Assembly comparison and read mapping | 17 (LR); 161 (SR) | 12 | [ |
| Liu et al. (2020) | Soy bean | Agriculture | PacBio | Assembly comparison | 26 (LR) | 1,150 | [ |
| Chawla et al. (2020) | Rapeseed | Agriculture | ONT and PacBio | Read mapping | 12 (LR) | 1,132 | [ |
| Hiatt et al. (2020) | Human | Human evolution | PacBio | Assembly comparison and read mapping | 18 (LR) | 3,200 | [ |
| Mitsuhashi et al. (2020) | Human | Human evolution | ONT and PacBio | Read mapping | 37 (LR) | 3,200 | [ |
| Shafin et al. (2020) | Human | Human evolution | ONT | Assembly comparison | 11 (LR) | 3,200 | [ |
| De Roeck et al. (2020) | Human | Human evolution | ONT | Read mapping | 11 (LR) | 3,200 | [ |
| Chaisson et al. (2019) | Human | Human evolution | ONT and PacBio | Assembly comparison | 9 (LR) | 3,200 | [ |
| Morena-Barrio et al. (2020) | Human | Human evolution | ONT | Read mapping | 19 (LR) | 3,200 | [ |
| Song et al. (2020) | Rapeseed | Agriculture | PacBio | Assembly comparison | 8 (LR) | 1,132 | [ |
| Sone et al. (2019) | Human | Human evolution | ONT and PacBio | Read mapping | 17 (LR) | 3,200 | [ |
| Kim et al. (2020) | | Evolution | ONT | Assembly comparison | 101 (LR) | 180 | [ |
| Pauper et al. (2020) | Human | Human evolution | PacBio | Read mapping | 15 (LR) | 3,200 | [ |
| Ebert et al. (2020) | Human | Human evolution | PacBio | Assembly comparison | 64 (LR) | 3,200 | [ |
| Quan et al. (2020) | Human | Human evolution | ONT | Read mapping | 25 (LR) | 3,200 | [ |
| Hufford et al. (2021) | Maize | Agriculture | PacBio | Assembly comparison | 26 (LR) | 2,200 | [ |
| Hu et al. (2021) | Maize | Agriculture | PacBio | Assembly comparison | 6 (LR) | 2,200 | [ |
| Wu et al. (2021) | Human | Human evolution | ONT and PacBio | Read mapping | 405 (LR) | 3,200 | [ |
^a Two main platforms are used in long-read sequencing projects: Pacific Biosciences (PacBio) high fidelity (HiFi) and Oxford Nanopore Technologies (ONT) PromethION. ^b Sample sizes for long-read (LR) and short-read (SR) sequencing are specified.
Fig. 2 | Overview of long-read population study design.
a | The experimental design of three different approaches is outlined. In the first approach (left), all samples are sequenced at medium to high coverage by long-read sequencing. In the second approach (middle), a proportion of the samples are sequenced at medium to high coverage and the remainder at low coverage by long-read sequencing (similar to the initial 1000 Genomes project). In the third approach (right), a proportion of the samples are sequenced at medium to high coverage by long-read sequencing and the remainder by short-read sequencing. The choice of approach affects the ability to detect common (red symbols) or rare (grey symbols) events in the population. The choice also depends on the available budget, existing data and the availability of sample DNA. b | Overview of currently established sequencing technologies based on CHM13 sequencing data[79]: Illumina, Pacific Biosciences (PacBio) High Fidelity (HiFi) reads or ultra-long reads from Oxford Nanopore Technologies (ONT). The N50 read length and average read accuracy are highlighted in orange. Although each technology has advantages and disadvantages, HiFi and ONT are the most promising for future applications. c | Overview of analysis strategies. Although multiple approaches are available, the main decision is whether to use an alignment-based approach or a de novo assembly-based approach, which has implications for sequencing requirements and for the approaches, resolution and comprehensiveness of downstream computational analysis.
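Panel b summarizes each technology by its N50 read length. As a minimal sketch of how that statistic is commonly computed (the function name `n50` and the example read lengths are illustrative assumptions, not values from the figure), the N50 is the read length at which reads of that length or longer contain at least half of all sequenced bases:

```python
def n50(read_lengths):
    """Return the N50 of a collection of read lengths.

    The N50 is the length L such that reads of length >= L together
    contain at least half of all sequenced bases.
    """
    if not read_lengths:
        raise ValueError("need at least one read length")
    total = sum(read_lengths)
    running = 0
    # Walk from the longest read downwards until half the bases are covered.
    for length in sorted(read_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical read lengths in base pairs (not taken from CHM13 data).
lengths = [2_000, 4_000, 6_000, 8_000, 10_000]
print("mean:", sum(lengths) / len(lengths))
print("N50:", n50(lengths))
```

Because long reads contribute more bases than short ones, the N50 (8,000 bp here) is typically larger than the mean read length (6,000 bp here), which is one reason it is preferred for describing long-read data.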
An overview of software tools for analysing long-read sequencing data
| Category | Tool name | Description | Ref. |
|---|---|---|---|
| De novo assembly | (Hi)Canu | Versatile de novo assembler | [ |
| | Flye | Fast de novo assembler that can also operate on low coverage data | [ |
| | Shasta | Fast ONT assembler | [ |
| | Falcon Unzip | PacBio assembler for phased assemblies | [ |
| | Peregrine | Optimized assembler for HiFi data only | [ |
| | hifiasm | Optimized assembler for HiFi data only | [ |
| | PGAS | Phased assembly including Strand-seq | [ |
| Genomic alignment | LAST | Versatile method to align contigs or genomes | [ |
| | MUMmer | Long-standing genomic aligner | [ |
| | minimap2 | Pairwise alignment method for long reads up to genomes | [ |
| | Cactus | Progressive genomic alignment method allowing integration of more than two genomes at a time | [ |
| | SibeliaZ | Fast genome aligner of multiple genomes | [ |
| Read alignment | minimap2 | Pairwise alignment method for long reads up to genomes | [ |
| | NGMLR | Convex gap cost implementation | [ |
| | Winnowmap | Improvements for mapping in repetitive regions | [ |
| | lra | Efficient convex-cost gap penalty sequence and contig aligner | [ |
| Graph genome methods | Giraffe | Rapid reads to graph aligner | [ |
| | vg | Toolkit to construct and convert graphs with methods to genotype and call variants | [ |
| | minigraph | A sequence-to-graph mapper and graph constructor based on minimap2 | [ |
| | GraphAligner | Sequence-to-graph aligner for long reads | [ |
| | GraphTyper2 | Genotyping variants in a graph genome from short reads | [ |
| | Paragraph | Genotyping structural variants in a regional graph genome from short reads | [ |
| | PanGenie | k-mer-based genotyping of short reads in a haplotype-resolved graph | [ |
| Phasing | WhatsHap | Phasing method for SNVs and smaller indels | [ |
| | HapCUT2 | Phasing method for SNVs | [ |
| SV calling from alignment | pbsv | Joint calling of SVs across samples | [ |
| | Sniffles | Automatic parameter estimation | [ |
| | cuteSV | Highly parallelized SV calling | [ |
| | SVIM | Uses graph-based clustering of candidates | [ |
| SV calling from assemblies | dipcall | Deletion and insertion calling from de novo assembly | [ |
| | SVIM-asm | SV calling from (diploid) de novo assembly | [ |
| | PAV | Compares phased assemblies with a reference genome | [ |
| SNV calling | Clair | Uses a convolutional neural net | [ |
| | DeepVariant | Neural network-based SNV caller | [ |
| | Longshot | Partitioning reads in haplotypes and calling variants in accordance with those haplotypes | [ |
| | Pepper | Phasing-based SNV calling | [ |
| SV merging | SURVIVOR | Merging that allows breakpoint inaccuracies | [ |
| | SVanalyzer | Assembly based, two samples only | [ |
| | Truvari | Parameterized stepwise merging including sequence similarity | [ |
| | Jasmine | Merging SV based on sequence similarity | [ |
| SV genotyping | cuteSV | Force-calling of variants from a VCF file | [ |
| | Sniffles | Uses split reads to identify known SVs over shared breakpoints | [ |
| | SVJedi | Compares the alignment of reads against the reference genome and alternative contigs representing the SV to determine the best match | [ |
| | LRcaller | Genotypes variants of long reads | [ |
| Other | TRiCoLOR | Detects and genotypes repeat lengths separated by phase | [ |
| | Iris | Local assembly of insertions | [ |
| | SVCollector | Optimized sample selection | [ |
| | NanoComp | Comparison of sequencing data | [ |
HiFi, high fidelity; indel, insertions–deletions; ONT, Oxford Nanopore Technologies; PacBio, Pacific Biosciences; SNV, single-nucleotide variant; SV, structural variant; VCF, variant call format.
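The "SV merging" category above describes merging calls across samples while tolerating breakpoint inaccuracies. The following is a deliberately simplified sketch of that general idea, not the algorithm of SURVIVOR or of any other listed tool: calls of the same type on the same chromosome are grouped greedily when neighbouring start positions differ by at most a chosen distance. The `SV` record, the `merge_svs` function, the 500 bp default and the example calls are all hypothetical.

```python
from collections import namedtuple

# Hypothetical minimal SV record; real callers report far more fields.
SV = namedtuple("SV", ["sample", "chrom", "pos", "svtype", "length"])


def merge_svs(calls, max_dist=500):
    """Group SV calls that plausibly describe the same event.

    Calls are sorted by chromosome, type and position; a call joins the
    current group if it has the same chromosome and type as the previous
    call and starts within max_dist bp of it. Note that this chains
    neighbours, so a group can span more than max_dist in total.
    """
    groups = []
    ordered = sorted(calls, key=lambda c: (c.chrom, c.svtype, c.pos))
    for call in ordered:
        last = groups[-1][-1] if groups else None
        if (
            last is not None
            and last.chrom == call.chrom
            and last.svtype == call.svtype
            and call.pos - last.pos <= max_dist
        ):
            groups[-1].append(call)
        else:
            groups.append([call])
    return groups


calls = [
    SV("s1", "chr1", 10_000, "DEL", 300),
    SV("s2", "chr1", 10_120, "DEL", 310),  # same deletion, shifted breakpoint
    SV("s3", "chr1", 10_050, "INS", 250),  # different type, kept separate
    SV("s1", "chr2", 5_000, "DEL", 1_000),
]
for group in merge_svs(calls):
    print(group[0].chrom, group[0].svtype, sorted(c.sample for c in group))
```

Tools that additionally compare SV length or inserted sequence, as the table notes for Truvari and Jasmine, would add further criteria to the grouping condition.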
Fig. 3 | Potential problems for different genome comparison approaches.
a | Schematic depiction of a potential problem in a de novo assembly-based approach. The presence of a novel segment N1 at different locations in two de novo assemblies and, even more so, a sequence variant within it (red x) make correct reporting difficult for current state-of-the-art methods and variation formats. b | Similar representation of the N1 problem in an alignment-based approach, where the coordinates of N1 are shared, but identification of the single-nucleotide variant (SNV) or of the entire N1 sequence remains challenging. c | A graph-based representation of N1, which enables a clearer comparison of the variant across the samples, illustrating the potential benefits of graph genomes. R1–R3 represent the backbone of the graph genome, and N1 and its SNV represent novel sequence for a given sample set.
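To make the graph-based representation in panel c concrete, the sketch below models a toy graph genome as labelled sequence nodes plus one walk per genome. Only the node names R1–R3 and N1 come from the figure. The sequences, the sample names, the placement of N1 between R1 and R2, and the choice to model the SNV as a separate node `N1_snv` (rather than as a small bubble inside N1) are simplifying assumptions.

```python
# Toy graph genome: node name -> sequence (all sequences are hypothetical).
nodes = {
    "R1": "ACGT",
    "R2": "GGCA",
    "R3": "TTAG",
    "N1": "CCATG",      # novel segment absent from the backbone
    "N1_snv": "CCTTG",  # the same segment carrying one substitution (A>T)
}

# Each genome is a walk through the graph.
paths = {
    "reference": ["R1", "R2", "R3"],
    "sample_A": ["R1", "N1", "R2", "R3"],
    "sample_B": ["R1", "N1_snv", "R2", "R3"],
}


def spell(path):
    """Reconstruct the linear sequence spelled by a walk."""
    return "".join(nodes[name] for name in path)


def carriers(node):
    """List the genomes whose walk visits the given node."""
    return sorted(name for name, path in paths.items() if node in path)


def edges(path):
    """Return the set of directed edges used by a walk."""
    return set(zip(path, path[1:]))


for node in ("N1", "N1_snv"):
    print(node, "carried by", carriers(node))
print("sample_A sequence:", spell(paths["sample_A"]))
print("edges unique to sample_A:", edges(paths["sample_A"]) - edges(paths["reference"]))
```

In this representation, both samples share the same insertion point in the backbone, so the presence of N1 and the allele of its SNV can be compared directly without reconciling per-assembly coordinates.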
Fig. 4 | Genotyping of SVs and SNVs across a population set.
a | Graph genome-based genotyping of a region with multiple alleles between two genome segments (green and pink). Insertions of different sizes (yellow) can be genotyped at the same locus using spanning reads (blue and purple) to identify the presence of two different alleles. b | An example of structural variants (SVs) and single-nucleotide variants (SNVs) across different unique and repeat regions being correctly or incorrectly genotyped depending on read length. c | A phylogenetically informed filtering approach for SVs. Assuming that, after a sufficiently long time (4Ne generations, where Ne is the effective population size), most or all genetic variation is fully sorted between two clades, variants that do not adhere to this assumption and are polymorphic across clades (for example, variant 3) can be removed. Although this approach is very conservative and ignores the fact that some types of variation exhibit repeated mutations at the same locus, it can be considered a first step towards more reliable genotyping of SVs.
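The filtering idea in panel c can be sketched in a few lines. The example below rests on one interpretation, stated here as an assumption: a variant is considered "polymorphic across clades" when both the reference and the alternate allele are observed within each of two or more clades, and such variants are dropped. The function names, sample names and genotypes are hypothetical, and genotypes are encoded as tuples of 0 (reference) and 1 (alternate) alleles.

```python
def is_polymorphic(genotypes):
    """True if both reference (0) and alternate (1) alleles are observed."""
    alleles = {allele for genotype in genotypes for allele in genotype}
    return {0, 1} <= alleles


def filter_shared_polymorphisms(variants, clade_of):
    """Keep variants that segregate in at most one clade.

    variants: variant name -> {sample name -> genotype tuple}
    clade_of: sample name -> clade label
    """
    kept = []
    for name, genotypes in variants.items():
        by_clade = {}
        for sample, genotype in genotypes.items():
            by_clade.setdefault(clade_of[sample], []).append(genotype)
        n_polymorphic = sum(is_polymorphic(g) for g in by_clade.values())
        if n_polymorphic < 2:  # fixed difference, or polymorphic in one clade only
            kept.append(name)
    return kept


# Hypothetical samples from two clades, A and B.
clade_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
variants = {
    # Fixed difference between the clades: kept.
    "variant_1": {"a1": (1, 1), "a2": (1, 1), "b1": (0, 0), "b2": (0, 0)},
    # Segregating in clade A only: kept.
    "variant_2": {"a1": (0, 1), "a2": (0, 0), "b1": (0, 0), "b2": (0, 0)},
    # Segregating in both clades: removed as a likely genotyping artefact.
    "variant_3": {"a1": (0, 1), "a2": (0, 0), "b1": (0, 1), "b2": (1, 1)},
}
print(filter_shared_polymorphisms(variants, clade_of))
```

As the caption notes, this filter is conservative: it would also discard genuine variants that arose repeatedly at the same locus, so it is better suited as a first-pass quality check than as a final call set definition.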