Towards population-scale long-read sequencing.

Wouter De Coster^1,2, Matthias H Weissensteiner^3, Fritz J Sedlazeck^4.

Abstract

Long-read sequencing technologies have now reached a level of accuracy and yield that allows their application to variant detection at a scale of tens to thousands of samples. Concomitant with the development of new computational tools, the first population-scale studies involving long-read sequencing have emerged over the past 2 years and, given the continuous advancement of the field, many more are likely to follow. In this Review, we survey recent developments in population-scale long-read sequencing, highlight potential challenges of a scaled-up approach and provide guidance regarding experimental design. We provide an overview of current long-read sequencing platforms, variant calling methodologies and approaches for de novo assemblies and reference-based mapping approaches. Furthermore, we summarize strategies for variant validation, genotyping and predicting functional impact and emphasize challenges remaining in achieving long-read sequencing at a population scale.

Year: 2021 PMID: 34050336 PMCID: PMC8161719 DOI: 10.1038/s41576-021-00367-3

Source DB: PubMed Journal: Nat Rev Genet ISSN: 1471-0056 Impact factor: 53.242

Introduction

Sequencing the DNA or mRNA of multiple individuals of one or more species (that is, population-scale sequencing) aims to identify genetic variation at a population level to address questions in the fields of evolutionary, agricultural and medical research. Previous population studies, including genome-wide association studies (GWAS), have not been able to exhaustively characterize the genetic factors underlying human traits and diseases[1]. There has been much speculation about the source of this ‘missing heritability’, often pointing to both structural variants (SVs) and rare variants[2,3]. SVs account for a greater total number of nucleotide changes in human genomes than the far more numerous single-nucleotide variants (SNVs)[4]. To date, such population studies have relied mostly on high-throughput short-read sequencing technologies, which produce reads ranging from 25 bp to 400 bp in length[5]. However, short reads have important limitations in characterizing repetitive regions[6,7]. DNA repeats act as the genomic substrate to facilitate SV formation[8] while also hampering SV discovery owing to read alignment inaccuracies. Even in a non-repetitive genome, variations such as insertions (especially for alleles longer than the read length[7]) or other modifications (for example, methylation) would be missed by an approach relying solely on short reads. Long-read sequencing has emerged as superior to short-read sequencing and other methods (for example, arrays) for the identification of structural variation, as shown by the Genome in a Bottle (GIAB) and Human Genome Structural Variation (HGSV) consortia, which combined multiple technologies to comprehensively characterize structural variation in human genomes[9,10]. These studies highlighted that a substantial proportion of hidden variation can be discovered with long-read sequencing. 
Indeed, recent long-read sequencing studies of Icelandic and Chinese populations have already identified previously undetected variants associated with height, cholesterol level and anaemia[11,12]. Analysis of 26 maize genomes[13] revealed that more SVs are involved in causing diseases than in conferring agronomically important traits. In addition, long-read sequencing is beneficial for improving the continuity, accuracy and range of variant phasing[14-16], assessing complex small variants[17] and has been applied to find disease-associated alleles[18-20]. For de novo assemblies, multiple methods have been published over recent years to promote the use of long reads[21-25]. Ongoing advances in sequencing technology and bioinformatics have paved the way to achieving long-read sequencing on a population scale[26]. The two main competitors driving innovation in the field are Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT). PacBio high fidelity (HiFi) reads are generated by their Sequel II system; HiFi reads are both long (15–20 kbp) and highly accurate[27]. The ONT PromethION platform can produce much longer reads (up to 4 Mbp[28]), has a higher throughput at lower cost, but produces less accurate reads than the Sequel II system. Recent comparisons show an equivalent performance for SV calling with the two platforms[29,30] (in-depth technical review and further comparison of long-read sequencing platforms available elsewhere[31]). Within the past 2 years, multiple studies have applied long-read sequencing to answer various questions in multiple different organisms[32-35] (Fig. 1; Table 1). The largest human-focused long-read sequencing study to date investigated the genomic diversity of 3,622 Icelandic genomes[11], with many other studies to follow, such as the NIH All of Us research programme and the NIH Center for Alzheimer’s and Related Dementias (CARD) in the USA and similar efforts in China, Abu Dhabi and Qatar. 
Long-read sequencing of a global diversity cohort is also being carried out as part of the Human Pangenome project[36]. Aside from human studies, long-read sequencing has been applied on a population scale to discover structural variation associated with phenotypes in crops[32,33], fruitflies[34] and songbirds[35], and increasingly has a role in metagenomic studies (Box 1). Here, we restrict our discussion to eukaryotic organisms, as long-read sequencing studies of bacteria and other prokaryotes require specific laboratory and bioinformatics approaches, and the challenges are inherently different.

Fig. 1

Overview of population-scale studies using long-read sequencing.

Table 1

An overview of long-read-based population studies

Study	Organism and category	Technology^a and analysis approach	Sample size^b	Genome size (Mbp)	Ref.
Kou et al. (2020)	Rice Agriculture	PacBio Assembly comparison and read mapping	15 (LR); 393 (SR)	430	[129]
Weissensteiner et al. (2020)	Crow Evolution	PacBio Read mapping	33 (LR); 127 (SR)	1,300	[35]
Chakraborty et al. (2019)	Drosophila Evolution	PacBio Assembly comparison	14 (LR)	180	[34]
Jiao & Schneeberger (2020)	Arabidopsis Evolution	PacBio Assembly comparison	7 (LR)	135	[130]
Alonge et al. (2020)	Tomato Agriculture	ONT Read mapping	100 (LR)	950	[32]
Beyter et al. (2020)	Human Human evolution	ONT Read mapping	3,622 (LR)	3,200	[11]
Tusso et al. (2019)	Yeast Evolution	ONT and PacBio Assembly comparison and read mapping	17 (LR); 161 (SR)	12	[30]
Liu et al. (2020)	Soy bean Agriculture	PacBio Assembly comparison	26 (LR)	1,150	[33]
Chawla et al. (2020)	Rapeseed Agriculture	ONT and PacBio Read mapping	12 (LR)	1,132	[131]
Hiatt et al. (2020)	Human Human evolution	PacBio Assembly comparison and read mapping	18 (LR)	3,200	[18]
Mitsuhashi et al. (2020)	Human Human evolution	ONT and PacBio Read mapping	37 (LR)	3,200	[132]
Shafin et al. (2020)	Human Human evolution	ONT Assembly comparison	11 (LR)	3,200	[25]
De Roeck et al. (2020)	Human Human evolution	ONT Read mapping	11 (LR)	3,200	[133]
Chaisson et al. (2019)	Human Human evolution	ONT and PacBio Assembly comparison	9 (LR)	3,200	[10]
Morena-Barrio et al. (2020)	Human Human evolution	ONT Read mapping	19 (LR)	3,200	[19]
Song et al. (2020)	Rapeseed Agriculture	PacBio Assembly comparison	8 (LR)	1,132	[134]
Sone et al. (2019)	Human Human evolution	ONT and PacBio Read mapping	17 (LR)	3,200	[20]
Kim et al. (2020)	Drosophila Evolution	ONT Assembly comparison	101 (LR)	180	[135]
Pauper et al. (2020)	Human Human evolution	PacBio Read mapping	15 (LR)	3,200	[136]
Ebert et al. (2020)	Human Human evolution	PacBio Assembly comparison	64 (LR)	3,200	[46]
Quan et al. (2020)	Human Human evolution	ONT Read mapping	25 (LR)	3,200	[137]
Hufford et al. (2021)	Maize Agriculture	PacBio Assembly comparison	26 (LR)	2,200	[13]
Hu et al. (2021)	Maize Agriculture	PacBio Assembly comparison	6 (LR)	2,200	[138]
Wu et al. (2021)	Human Human evolution	ONT and PacBio Read mapping	405 (LR)	3,200	[12]

^a Two main platforms are used in long-read sequencing projects, Pacific Biosciences (PacBio) high fidelity (HiFi) and Oxford Nanopore Technologies (ONT) PromethION. ^b Sample sizes for long-read (LR) and short-read (SR) sequencing are specified.

Studies published in 2019–2021 in which five or more samples were sequenced are included. Genome size of study organisms is viewed in three different categories (<500 Mbp, 500–2,000 Mbp and >2,000 Mbp), and the methodological approach taken to investigate genetic variation (comparison of assemblies, read mapping against a reference or both) is illustrated by the different colours. For further details, see Table 1.
In this Review, we discuss the approach of long-read, population-scale, whole-genome sequencing and highlight its advantages, point out challenges and provide an overview of different experimental setups. We define population-scale sequencing here as sequencing of more than five genomes, although in the case of more limited genomic diversity in some organisms, a lower number of individual genomes may be sufficient. We focus on technologies that produce continuous sequence reads and do not address other long-range technologies, such as linked reads or optical mapping (for example, Bionano Genomics). However, both these technologies may be useful and applicable in a population setting[37,38]. When sequencing of the highest number of samples is required, targeted sequencing may be a cost-efficient alternative to whole-genome approaches (Box 2). Similarly to most population-scale sequencing projects, we focus on germline variants, as somatic variants require higher genome coverage and access to the relevant tissues.

Box 1

Metagenomic studies do not address populations in a traditional sense, yet they nevertheless assess genetic information stemming from separate (organismal) entities and chromosomes. Long-read sequencing is seemingly ideal to study prokaryotic organisms and viruses contained in metagenomic (for example, stool, gut and environmental) samples, since their genomes are usually much smaller than the currently achievable average read length in these technologies[143]. However, for metagenomics, factors such as the generally higher amount of required input DNA, high sequence similarity between taxonomic units and higher cost per base pair have thus far hampered the widespread application of long-read sequencing. Recent improvements in high molecular weight (HMW) DNA extraction specific to metagenomic samples seem to hold the potential to facilitate a more widespread application of long-read sequencing in metagenomics. 
For example, one workflow improves the yield of HMW DNA from human stool samples and also provides a bioinformatic pipeline that incorporates base-calling, assembly, error correction and genome circularization with ONT reads[144]. Other efforts have been directed at improving the assembly step. metaFlye[145] is the first metagenomics-specific genome assembler, dealing with highly uneven coverage as well as sequence similarity between closely related genomes typical of metagenomic samples, and it seems to greatly enhance the ability to generate bacterial genomes in single contigs. Furthermore, others have sequenced the 16S rRNA gene as a species identifier, benefitting from the longer read length to improve the classification[146,147]. To improve cost efficiency, a hybrid strategy using both short and long reads seems to be a valid way of assessing metagenomic samples. Overholt et al.[148] have demonstrated that by combining Illumina and ONT reads, twice and four times more high-quality assemblies were recovered from a water column sample than by using each technology alone, respectively. Although these hybrid approaches will continue to be used, long-read-only approaches are likely to succeed in the long run[149].

Box 2

Sample numbers can be scaled up at a lower cost using target enrichment approaches. Several methods have been introduced to enrich for a particular region of a genome, ranging from traditional capture and PCR amplicons[150] to using the Cas9 system[151] and an in silico sequencer-based selection (for example, Uncalled[152] or Readfish[153]). These approaches typically can target 10–20 kbp regions, although sequencer-based selection methods potentially enable larger targets to be sequenced. The Cas9 system can enrich a region without amplification and thus also enables the assessment of methylation patterns and sequences that are hard to target, such as repeats[151]. 
All these laboratory enrichment methods work for both long-read sequencing platforms, namely Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT). However, the in silico enrichment is unique to the ONT platform and is of interest for many future applications, as it does not require laboratory enrichment. Both Uncalled and Readfish sequence the first ~1 kbp of every read and if this read does not overlap with a targeted region, the DNA molecule is ejected and the next molecule is read. However, if the read matches the sequence of the targeted region, sequencing continues, resulting in a modest on-target enrichment. Multiple projects that use this more cost-efficient methodology to study specific diseases with known gene targets have been published[150,154]. The analysis of these data sets is often very similar to full genome analysis, but is computationally less demanding. The coverage per target typically exceeds that of whole-genome approaches, achieving hundreds of fold coverage for the targeted regions. Furthermore, off-target reads (sequences that have not been fully depleted) must be taken into account and filtered out so that they do not affect the analysis. Depending on the targeting method (for example, amplicon versus the Cas9 approach), off-target reads can occur more frequently with some methods than with others owing to the different efficiencies in off-target depletion. For example, both the Cas9 system and sequencer-based targeting of regions often retain off-target reads (~30% enrichment on target)[151]. By counting the reads within and outside the targeted region, it is possible to assess the efficiency of the chosen method. Another very common application of these targeted sequencing approaches that has recently become very important is enriching for a specific pathogen or virus, such as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the virus responsible for the coronavirus disease 2019 (COVID-19) global pandemic. 
The most commonly applied protocol in this context is ARTIC, which aims to amplify ~400 bp RNA segments of the virus[155]. In addition, loop-mediated isothermal amplification (LAMP) and/or capture methods have been very effective in studying the diversity of SARS-CoV-2 isolates[156,157]. Another interesting development from ONT is a targeted approach to detect the presence of SARS-CoV-2 using the LAMP-based assay LamPORE. LamPORE targets three regions of the viral genome (ORF1a and the E and N genes) and a control (human actin), which allows testing of ~96 patients in a single MinION run in ~1 h (ref.[158]).
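The efficiency check described above, counting reads within and outside the targeted region, can be sketched in a few lines. This is a minimal illustration, not part of any published pipeline: the function names, the tuple layout and the example coordinates are hypothetical, and intervals are assumed to be half-open (BED-like), so a read that starts exactly where a target ends does not overlap it.

```python
def overlaps(read, target):
    """True if two half-open (chrom, start, end) intervals share at least one base."""
    return read[0] == target[0] and read[1] < target[2] and target[1] < read[2]


def on_target_summary(reads, targets):
    """Count reads inside and outside the targeted regions and keep the on-target ones."""
    kept = [r for r in reads if any(overlaps(r, t) for t in targets)]
    total = len(reads)
    return {
        "on_target": len(kept),
        "off_target": total - len(kept),
        "on_target_fraction": len(kept) / total if total else 0.0,
        "kept_reads": kept,  # reads to carry forward after filtering off-target ones
    }


# Hypothetical example: one 20 kbp target and four aligned reads.
targets = [("chr1", 1000, 21000)]
reads = [
    ("chr1", 500, 1500),     # overlaps the start of the target
    ("chr1", 21000, 30000),  # starts where the target ends: off-target
    ("chr2", 1000, 5000),    # different chromosome: off-target
    ("chr1", 5000, 15000),   # fully inside the target
]
summary = on_target_summary(reads, targets)
print(summary["on_target"], summary["off_target"], summary["on_target_fraction"])
# prints: 2 2 0.5
```

In practice the read coordinates would come from an alignment file, and the on-target fraction would be compared between runs or enrichment methods.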

Project strategies

The total number of sequenced individuals (or rather chromosomes) should in general be as high as possible. However, the different underlying questions that motivate population-scale sequencing studies have vastly different sample size requirements. Although estimating the degree of genetic differentiation or ancestral population size is already possible with a sample size as low as ten chromosomes (five individuals of a diploid organism)[39], the identification of rare variants (and potentially associated diseases) in a population usually requires sample sizes that are many orders of magnitude higher[40]. Regardless of the approach taken, it is crucial to keep track of metadata and control for covariates in the cohort selection. There are multiple commonly applied strategies with specific budget requirements to be considered at the beginning of a large population-scale sequencing project (Fig. 2a). Here, we discuss three main strategies that allow for different scaling and budgeting and thus have an impact on the level of resolution in detecting genetic variation. Across virtually all sequencing technologies, the cost per sequenced base pair is consistently decreasing. To be able to compare the strategies discussed below, we use the required long-read sequencing output as a proxy for costs (Supplementary table 1). Although we assume a diploid genome with a size similar to the haploid human genome (3.2 Gbp), we note that for genomes with higher ploidy (for example, hexaploid plants), the overall coverage must be adapted to the ploidy of the organism (that is, the number of homologous chromosomes). Furthermore, we assume a sample size of ~2,500 individuals, similar to that of the 1000 Genomes project[41]. 
At the time of writing (early 2021), the least expensive option to generate long-read data is the ONT PromethION platform, with a yield of roughly 100–150 Gbp per flow cell at a price between US$650 and US$2,100, depending on the discount obtained when multiple flow cells are purchased simultaneously. Of note, PacBio HiFi reads are of adequate length and high accuracy, and although not formally assessed, it is reasonable to expect that lower coverage would be sufficient with this technology. However, at the time of writing (early 2021) this still equates to a higher cost than with the ONT PromethION platform, as one PacBio single-molecule real-time (SMRT) cell costs ~US$1,300 and yields ~500 Gbp (continuous long reads) or ~30 Gbp (HiFi) of data.
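To make the cost comparison above concrete, the following sketch converts the quoted early-2021 yields and prices into a cost per Gbp and a number of cells per sample. Several things here are assumptions rather than statements from the text: the 20-fold target is only illustrative, the best and worst cases simply pair the extremes of price and yield, and the same coverage target is applied to both platforms even though lower coverage may suffice for HiFi reads.

```python
import math

GENOME_GBP = 3.2  # haploid human genome size quoted in the text

# Early-2021 figures quoted above: yield range (Gbp) and price range (US$) per cell.
PLATFORMS = {
    "ONT PromethION": {"yield_gbp": (100, 150), "price_usd": (650, 2100)},
    "PacBio HiFi": {"yield_gbp": (30, 30), "price_usd": (1300, 1300)},
}


def cost_per_gbp(yield_gbp, price_usd):
    """Return (best case, worst case) cost in US$ per Gbp."""
    low_yield, high_yield = yield_gbp
    low_price, high_price = price_usd
    return low_price / high_yield, high_price / low_yield


def cells_per_sample(coverage, yield_gbp, genome_gbp=GENOME_GBP):
    """Return (fewest, most) cells needed to reach the target coverage for one sample."""
    required_gbp = coverage * genome_gbp
    low_yield, high_yield = yield_gbp
    return math.ceil(required_gbp / high_yield), math.ceil(required_gbp / low_yield)


for name, spec in PLATFORMS.items():
    best, worst = cost_per_gbp(spec["yield_gbp"], spec["price_usd"])
    fewest, most = cells_per_sample(20, spec["yield_gbp"])
    print(f"{name}: US${best:.2f} to US${worst:.2f} per Gbp, "
          f"{fewest} to {most} cell(s) for 20-fold coverage")
```

Under these assumptions the ONT range is roughly US$4 to US$21 per Gbp and the HiFi figure roughly US$43 per Gbp, which is consistent with the statement that HiFi was the more expensive option at the time.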

Fig. 2

Overview of long-read population study design.

a | The experimental design of three different approaches is outlined. In the first strategy (left), all samples are sequenced at medium to high coverage by long-read sequencing. In the second approach (middle), a proportion of the samples are sequenced with medium to high coverage and the remainder using low coverage by long-read sequencing (similar to the initial 1000 Genomes project). In the third approach (right), a proportion of the samples are sequenced at medium to high coverage by long-read sequencing and the remainder by short-read sequencing. The decision of which approach to take will affect the ability to detect common (red symbols) or rare (grey symbols) events in the population. The decision also depends on the available budget, existing data and the sample DNA availability. b | Overview of current established sequencing technologies based on CHM13 sequencing data[79]: Illumina, Pacific Biosciences (PacBio) High Fidelity (HiFi) reads or ultra-long reads from Oxford Nanopore Technologies (ONT). The N50 read length and average read accuracy are highlighted in orange. Although each technology has advantages and disadvantages, HiFi and ONT are the most promising for future applications. c | Overview of analysis strategies. Although multiple approaches are available, the main decision is whether to use an alignment-based approach or a de novo assembly-based approach, which has implications for sequencing requirements and the approaches, resolution and comprehensiveness of downstream computational analysis.

A full coverage approach

Although the most expensive of the three approaches, the highest level of resolution is obtained with a strategy that aims to sequence every sample of the population with medium to high coverage (a ‘full coverage’ approach; Fig. 2a). The main criterion for deciding on the coverage required per sample is whether a de novo assembly (>40-fold coverage required) or reference-based alignment approach (>12-fold coverage required[42]) is planned. The advantage of this strategy is its comprehensiveness, the simplicity of the study design and the relatively straightforward computational workflow. Furthermore, samples receive similar coverage and are therefore equally well studied, and rare variations in each sample can be easily detected. Sequencing all 2,500 individuals at 20-fold coverage requires 150 Tbp of sequencing data.

A mixed coverage approach

In the ‘mixed coverage’ approach (Fig. 2a), a subset of samples that are representative of the subgroups in the cohort (for example, ethnicities or subpopulations) are sequenced at high coverage (for example, 30-fold) and the remaining samples at low coverage (for example, >5-fold). Although this approach is generally less expensive than the full coverage approach, it still achieves high overall detection sensitivity and is thus particularly suitable for studies with a high number of individuals or a limited budget. However, several analytical challenges remain, especially in achieving high accuracy of genotypes across multiple samples or differentiating somatic from heterozygous germline variants, which is further complicated by regions exhibiting recurrent mutations. In addition, there will certainly be a bias towards common alleles with this mixed coverage approach, as many rare alleles can be missed, especially if a locus is heterozygous and the alternative allele is thus sparsely covered. Assuming that in this second strategy 200 individuals are sequenced at 30-fold coverage and the remainder of the cohort at 8-fold coverage, this approach requires 73 Tbp of data and is thus potentially half as expensive as the full coverage strategy.
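The data volumes quoted for the full coverage and mixed coverage strategies can be reproduced with a short calculation. The sketch below assumes a genome size rounded to 3 Gbp, because that value matches the stated 150 Tbp and 73 Tbp; this rounding is an inference from the numbers, not something the text states, and with 3.2 Gbp the results would be about 160 Tbp and 78 Tbp. The function name is hypothetical.

```python
def required_tbp(n_samples, coverage, genome_gbp=3.0):
    """Sequencing output in Tbp for n_samples at a given fold coverage."""
    return n_samples * coverage * genome_gbp / 1000


COHORT = 2500  # cohort size assumed in the text

# Full coverage: every sample at 20-fold.
full = required_tbp(COHORT, 20)

# Mixed coverage: 200 samples at 30-fold, the remainder at 8-fold.
high_subset = required_tbp(200, 30)
low_remainder = required_tbp(COHORT - 200, 8)
mixed = high_subset + low_remainder

print(f"full coverage:   {full:.1f} Tbp")
print(f"mixed coverage:  {mixed:.1f} Tbp "
      f"({high_subset:.1f} high + {low_remainder:.1f} low)")
print(f"mixed / full:    {mixed / full:.2f}")
```

The ratio of roughly 0.49 corresponds to the "potentially half as expensive" statement, provided that cost scales linearly with sequencing output.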

A mixed sequencing approach

The ‘mixed sequencing’ approach (Fig. 2a) involves long-read sequencing of just a few samples (for example, 10–20% of all samples) and short-read sequencing of the remaining samples to genotype variants that are discovered by long-read sequencing. The rationale behind this approach, similar to the selection of individuals for high coverage in the mixed coverage strategy, is to identify a small subset of samples (either randomly or by known diversity[43], ethnicity or phenotype) and sequence only these to higher coverage. This mixed sequencing approach was effective in elucidating germline SVs that predispose to cancer, whereby short-read sequencing was used to identify evidence of SVs followed by long-read sequencing of selected samples[44]. Phylogenetic analysis of variants detected by short-read sequencing has also been used to select a representative set of soybean accessions for long-read sequencing and de novo assembly[33]. Other studies have used SVCollector[43] to automatically select samples for long-read sequencing to complement existing short-read sequencing data[25,32]; the selection proceeds iteratively, choosing the most diverged sample and then re-ranking the remaining samples by the variation not yet captured. Once a subset of samples has been sequenced with long-read technologies, yielding a set of identified SVs, these SVs (for example, insertions) can be genotyped at their breakpoint coordinates across the short-read data sets. In this way, robust allele frequencies for the identified variants can be obtained, albeit with a bias towards variants identified by long-read sequencing, which means that rare variants contained in other samples may be missed. It may not be possible to directly genotype all types of SV using short reads, especially in repetitive regions, but knowledge of the haplotypes on which the SVs of interest are found will enable imputation of these variants based on short-read SNV genotypes[11]. 
This strategy has already been applied using diversity panels of human SVs to discover novel expression quantitative trait loci (eQTLs)[45,46] and signatures of evolutionary adaptation[47]. If for this strategy no additional short-read data need to be generated, then this approach is likely to be the most affordable, as sequencing 200 of the 2,500 individuals to 30-fold coverage only requires 18 Tbp of data.
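The iterative sample selection mentioned above can be illustrated with a simple greedy procedure: pick the sample carrying the most variants not yet represented, then re-rank the rest. This is a simplified sketch of the general idea only and should not be read as SVCollector's actual implementation; the function name, the input layout and the toy variant identifiers are hypothetical.

```python
def select_samples(variants_by_sample, n_select):
    """Greedily pick samples that add the most not-yet-covered variants."""
    remaining = {name: set(variants) for name, variants in variants_by_sample.items()}
    covered = set()
    chosen = []
    for _ in range(min(n_select, len(remaining))):
        # Re-rank remaining samples by the number of variants they would add;
        # iterating over sorted names makes ties resolve alphabetically.
        best = max(sorted(remaining), key=lambda name: len(remaining[name] - covered))
        if not remaining[best] - covered:
            break  # no remaining sample adds anything new
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


# Hypothetical short-read call sets for four samples.
calls = {
    "A": {"sv1", "sv2", "sv3"},
    "B": {"sv2", "sv3"},
    "C": {"sv4", "sv5"},
    "D": {"sv1", "sv5", "sv6"},
}
chosen, covered = select_samples(calls, n_select=2)
print(chosen, len(covered))
# prints: ['A', 'C'] 5
```

With a budget of two samples, five of the six variants in this toy cohort are represented; the variant private to the unselected sample D is missed, which mirrors the bias against rare variants noted above.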

Sequencing logistics

Efficiently operating long-read sequencers at scale, from logistics to sample preparation, loading optimizations and run monitoring, is not a trivial task. ONT and PacBio have different advantages but also challenges in almost every step in this process given their different designs of flow cells and sequencing instruments (Fig. 2b). The per-sample sequencing process and the characteristics of each technology are reviewed elsewhere[31]. A substantial amount of highly pure, high molecular weight (HMW) input DNA is crucial for these methods. Achieving this DNA quality requires specific extraction methods and is often challenging for samples for which only limited or degraded material is available (for example, non-contemporary samples or samples from very small organisms). Amplification-free low-input DNA kits exist for both PacBio[48] and ONT (https://nanoporetech.com/products/kits) sequencing platforms, with a minimum input DNA amount of 150 ng and 400 ng, respectively. However, these machines frequently require much more DNA to produce optimal sequencing yields. At the time of writing, it is often necessary to perform a nuclease flush and library reloading on an ONT flow cell to recover blocked pores to obtain the highest yield, which is an additional preparation step that is not necessary for PacBio cells. Importantly, ONT flow cells and PacBio SMRT cells have a limited shelf life, which is logistically challenging when sequencing many samples. Depending on the organism and its features, such as its physical size, the presence of a cell wall or secondary metabolites, high-quality DNA extraction can be a major constraint. Variability in DNA quality and molecular weight is a common issue, and pre-sequencing quality control is necessary to ensure that inadequate samples are omitted and that other technical covariates are recorded so that they can be taken into account in downstream statistical analysis. 
ONT sequencers store the raw data as hdf5 files (in the fast5 format), requiring base calling to obtain the more commonly used and much smaller fastq and BAM formats. Currently, incremental updates to the ONT base-calling algorithm regularly improve the read accuracy[49], which suggests that repeating the base calling of older data is valuable. This reanalysis requires long-term storage of the fast5 files, which can be up to 1.5 TB for a single PromethION flow cell, although further compression is possible[50]. By contrast, the PacBio base-calling process is highly mature, and BAM files containing unaligned reads are produced directly from the sequencing machine. For HiFi reads, post-processing of the subreads is essential to collapse consecutive sequenced DNA molecules down to a high-quality consensus sequence, which is also done on the latest version of the machine (Sequel IIe system), and thus the overall data storage requirement is much reduced.
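To give a sense of the scale of raw-signal storage, the following back-of-the-envelope sketch combines figures quoted in this Review: the 150 Tbp full coverage estimate, a PromethION yield of 100–150 Gbp per flow cell and up to 1.5 TB of fast5 data per flow cell. Because 1.5 TB is an upper limit and further compression is possible, the result is a rough upper bound under these assumptions, not a measured requirement; the function name is hypothetical.

```python
import math


def raw_signal_storage_tb(total_gbp, yield_per_cell_gbp, tb_per_cell=1.5):
    """Return (flow cells, upper-bound raw storage in TB) for a sequencing target."""
    cells = math.ceil(total_gbp / yield_per_cell_gbp)
    return cells, cells * tb_per_cell


TARGET_GBP = 150_000  # 150 Tbp, the full coverage estimate given earlier

for yield_gbp in (150, 100):
    cells, storage_tb = raw_signal_storage_tb(TARGET_GBP, yield_gbp)
    print(f"{yield_gbp} Gbp per flow cell: {cells} flow cells, "
          f"up to {storage_tb / 1000:.2f} PB of fast5 data")
```

Under these assumptions, keeping all raw signal for a cohort of this size would need on the order of one to a few petabytes, which is why the decision to retain fast5 files for future base calling should be budgeted for at the start of a project.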

Analytical considerations

Arguably the main challenge in population-level studies is a scalable and streamlined analysis. Multiple recent reviews have discussed approaches at the single sample level[6,7,21]. Table 2 lists computational tools that are commonly used in long-read sequencing projects and these are reviewed in-depth elsewhere[6,7]. Of note, in this very rapidly developing area of genomics, new tools are introduced constantly while established ones quickly become outdated. As we do not assume that matching short-read sequencing data are available for every individual, the integration of long-read and short-read data is not discussed. Nevertheless, we highlight the important role of short reads for the polishing of long reads[51] and assemblies[52] or in fine-scale resolution of SV breakpoints[11]. These applications may lose their relevance as the accuracy of long-read sequencing improves, as is already the case for PacBio HiFi data.

Table 2

An overview of software tools for analysing long-read sequencing data

Category	Tool name	Description	Ref.
De novo assembly	(Hi)Canu	Versatile de novo assembler	[23]
	Flye	Fast de novo assembler that can also operate on low coverage data	[24]
	Shasta	Fast ONT assembler	[25]
	Falcon Unzip	PacBio assembler for phased assemblies	[22]
	Peregrine	Optimized assembler for HiFi data only	[128]
	hifiasm	Optimized assembler for HiFi data only	[139]
	PGAS	Phased assembly including strand seq	[46]
Genomic alignment	LAST	Versatile method to align contigs or genomes	[57]
	MUMmer	Long-standing genomic aligner	[87]
	minimap2	Pairwise alignment method for long reads up to genomes	[58]
	Cactus	Progressive genomic alignment method allowing integration of more than two genomes at a time	[90]
	SibeliaZ	Fast genome aligner of multiple genomes	[140]
Read alignment	minimap2	Pairwise alignment method for long reads up to genomes	[58]
	NGMLR	Convex gap cost implementation	[42]
	Winnowmap	Improvements for mapping in repetitive regions	[59]
	lra	Efficient convex-cost gap penalty sequence and contig aligner	[60]
Graph genome methods	Giraffe	Rapid reads to graph aligner	[45]
	vg	Toolkit to construct and convert graphs with methods to genotype and call variants	[96]
	minigraph	A sequence-to-graph mapper and graph constructor based on minimap2	[97]
	GraphAligner	Sequence-to-graph aligner for long reads	[141]
	GraphTyper2	Genotyping variants in a graph genome from short reads	[100]
	Paragraph	Genotyping structural variants in a regional graph genome from short reads	[101]
	PanGenie	k-mer-based genotyping of short reads in a haplotype-resolved graph	[99]
Phasing	WhatsHap	Phasing method for SNVs and smaller indels	[15]
	HapCut2	Phasing method for SNVs	[16]
SV calling from alignment	pbsv	Joint calling of SVs across samples	[62]
	Sniffles	Automatic parameter estimation	[42]
	cuteSV	Highly parallelized SV calling	[63]
	SVIM	Uses graph-based clustering of candidates	[61]
SV calling from assemblies	dipcall	Deletion and insertion calling from de novo assembly	[89]
	SVIM-asm	SV calling from (diploid) de novo assembly	[142]
	PAV	Compares phased assemblies with a reference genome	[46]
SNV calling	Clair	Uses a convolutional neural net	[69]
	DeepVariant	Neural network-based SNV caller	[67]
	Longshot	Partitioning reads in haplotypes and calling variants in accordance with those haplotypes	[70]
	Pepper	Phasing-based SNV calling	[68]
SV merging	SURVIVOR	Merging that allows breakpoint inaccuracies	[113]
	SVanalyzer	Assembly based, two samples only	[98]
	Truvari	Parameterized stepwise merging including sequence similarity	[9]
	Jasmine	Merging SV based on sequence similarity	[32]
SV genotyping	cuteSV	Force-calling of variants from a VCF file	[63]
	Sniffles	Uses split reads to identify known SVs over shared breakpoints	[42]
	SVJedi	Compares the alignment of reads against the reference genome and alternative contigs representing the SV to determine the best match	[66]
	LRcaller	Genotypes variants of long reads	[11]
Other	TRiCoLOR	Detects and genotypes repeat lengths separated by phase	[76]
	Iris	Local assembly of insertions	[32]
	SVCollector	Optimized sample selection	[43]
	NanoComp	Comparison of sequencing data	[53]

HiFi, high fidelity; indel, insertions–deletions; ONT, Oxford Nanopore Technologies; PacBio, Pacific Biosciences; SNV, single-nucleotide variant; SV, structural variant; VCF, variant call format.

An overview of software tools for analysing long-read sequencing data.

For population-scale projects, the choice of analytical tools often involves balancing sensitivity and computational efficiency. Before downstream analysis, it is crucial to perform quality control of experimental factors that directly affect the performance of assembly, SV detection and read phasing, such as DNA fragment length and sequencing yield. Multiple tools have been developed for this purpose[53,54]. Changes in sequencing chemistry or technical equipment during the project may lead to artefacts in the analysis and can thus potentially affect the findings. As such, it is important to randomly assign samples to batches, for example, sequencing runs, to reduce technical covariates. Two main strategies for downstream analysis are available: aligning reads from individual samples to a single reference genome or comparing de novo assemblies (Fig. 2c). These two approaches are very different in their computational and coverage requirements, which in turn depend to a large extent on genome size and complexity. For both approaches, the goal is to apply the same set and versions of methods to all samples. The results need to be generated in a consistent way using correct version control and reproducible pipelines to avoid additional artefacts in the analysis. In the following sections, we discuss alignment-based and de novo assembly approaches and graph genome-based methods.
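The recommendation to randomly assign samples to batches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample names, the batch size and the function name `assign_batches` are hypothetical, and a real project would probably also stratify by phenotype or population.

```python
import random

def assign_batches(sample_ids, batch_size, seed=0):
    """Shuffle samples and split them into batches of at most batch_size."""
    rng = random.Random(seed)  # a fixed seed keeps the assignment reproducible
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]

# Hypothetical sample names; each batch could correspond to one sequencing run.
samples = [f"sample_{i:03d}" for i in range(10)]
batches = assign_batches(samples, batch_size=4, seed=42)
for run_number, batch in enumerate(batches, start=1):
    print(f"run {run_number}: {batch}")
```

Recording the seed alongside the pipeline version makes the batch assignment itself reproducible.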

Read alignment-based analysis

Alignment-based approaches are often the method of choice for population-scale studies, as they facilitate the comparison of all samples against a common coordinate system (that is, the reference genome), which is illustrated by the fact that more than half of population studies (Fig. 1; Table 1) employ these approaches. Furthermore, these approaches are often less computationally demanding and require substantially less coverage than assembly-based methods. Alignment-based approaches rely on matching sequencing reads with a reference genome, the overall correctness of which will affect the analysis of read data[7]. If the reference genome is incomplete, incorrect, fragmented or too divergent from the focal sample, it will lead to biases in the downstream analysis[55,56]. The software for long-read sequence data analysis is under constant development, and alignment methods in particular have become much faster in recent years (Table 2). The NGMLR[42] and LAST[57] methods speed up the alignment process and improve the accuracy of long-read alignment. The minimap2 aligner is considerably faster than its competitors while often delivering similar results, and thus it is currently the most popular, widely accepted long-read aligner[58]. Two noteworthy recent innovations are Winnowmap, which improves alignments (specifically in repetitive regions)[59], and lra, which improves the alignment in the presence of SVs[60]. The choice of tools for the detection of genetic variation is arguably of equal importance. For SVs, several tools are currently available, such as Sniffles[42], SVIM[61], PBHoney[62], CuteSV[63] and pbsv (Table 2). One of the remaining challenges is the accurate representation of SV breakpoints, which is particularly difficult in the context of more complex events involving multiple variants in repetitive regions, such as segmental duplications or large tandem repeat arrays (SV detection methods are comprehensively reviewed elsewhere[7,64]). 
Recently developed tools are removing the need for high sequencing coverage by enabling SV calling[42,65] and genotyping[42,66] at lower coverage, although the associated risk of incomplete or erroneous SV detection and genotyping cannot be ignored. Owing to the different error profiles of long reads, naive pile-up approaches or SNV and small insertion–deletion (indel) calling methods that were developed for short-read sequencing are usually inadequate or suboptimal for long reads. Over the past few years, multiple strategies have been developed to improve the detection of small variants with sophisticated machine learning models for each of the long-read sequencing technologies (Table 2). Current methods include, for example, DeepVariant[67], Pepper[68] and Clair[69] (all using deep learning) and Longshot[70] (which explicitly requires alleles to be concordant with the haplotype structure), which also outperforms Illumina-based SNV calling[71]. PacBio HiFi, in contrast to ONT, is also competitive with Illumina for small indels. Expansions and contractions of tandem repeat arrays are a highly challenging and frequent type of variation[72]. As these repetitive DNAs, which include short tandem repeats (1–6 bp repeat unit) and minisatellites (>6 bp repeat unit), are known to contain disease-causing alleles, accurate characterization of them is crucial[73]. Some tools have been developed specifically for this purpose[74], such as tandem-genotypes[75] and TRiCoLOR[76]. Similar challenges remain for accurate characterization of other repeats. For example, the LPA locus (encoding apolipoprotein(a)) consists of 8 kbp tandem repeat units (encoding kringle IV domains) that are repeated 5–10 times in human genomes[77], making it notoriously difficult to assess. To date, most reference genomes consist of a haplotype-collapsed representation, in which two or more chromosomal haplotypes are collapsed during assembly to a single artificial consensus sequence.
Phased genome assemblies, in which the haplotype structure of each chromosome is fully resolved, have the potential to more accurately represent the genome. The human Telomere-to-Telomere (T2T) consortium effort aims to produce the first full chromosome assembly of the human genome from the essentially haploid complete hydatidiform mole (CHM13) genome and has already completed assembly of chromosome 8 (ref.[78]) and chromosome X (ref.[79]). In another example, a single haplotype from a haplotype-resolved de novo assembly was used as the reference for read alignment in a population genetic study in crows[35].

Population-scale de novo assemblies

Many reference genomes based on short-read sequencing are incomplete or highly fragmented with many gaps[80]. Furthermore, hundreds of megabases of population- and individual-specific sequences are absent from the human reference genome[81]. These missing sequences are often repetitive, but also include coding sequences. As a consequence, a fraction of reads derived from a sample cannot be aligned to the reference genome or they align to paralogous sequences, leading to tens of thousands of false-positive and false-negative variants for each individual[82]. Therefore, creating and comparing de novo assemblies is desirable (Fig. 1). The increased availability and affordability of long-read sequencing data have led to an explosion of faster and more accurate genome assembly tools (Table 2), of which haplotype-resolved de novo assembly is commonly considered the most comprehensive representation of a genome. This competition to produce improved de novo assembly methods has led to the rapid development of new tools, usually focusing on either computational demand, contiguity, completeness or correctness, indicating that genome assembly represents (at present) a trade-off between these key parameters. De novo assembly-based approaches are often more sensitive and better for reconstructing highly diverse regions of the genome than alignment-based approaches, but can also lead to a collapse of highly similar segmental duplications[83]. For such duplicated regions, specific algorithms have been developed that leverage SNVs that differentiate multiple copies of repeats and thereby can recover medically relevant duplicated genes[84,85]. The dependence of de novo assembly on high read coverage and more computationally demanding methods has made it historically very challenging for large population-scale sequencing. 
However, the ever-increasing yield of sequencing technologies will enable the sequencing of each sample to sufficient coverage to obtain a high-quality de novo assembly[86] (Fig. 1; Table 1). Single-genome projects iteratively test multiple parameters or different methods to optimize a de novo assembly, which is neither realistic nor desirable in a population context. Multiple projects have integrated proximity-ligation or strand-specific short-read sequencing methods for substantial improvements of the contiguity of the assemblies[25,46], but such approaches do not scale well to large populations. De novo assembly-based approaches are typically also more computationally demanding, which becomes especially relevant for large numbers of samples. Large cloud storage infrastructures might improve the scalability, but the computing cost will rise substantially. The recent development of less computationally demanding assemblers may be able to mitigate this limitation[25]. Another important consideration is the scalability of the downstream computational approaches. Although the process of genome assembly already requires considerable computational resources, these demands increase linearly with the addition of more individuals. To infer genomic variation, de novo assemblies are usually compared with a chosen reference genome, yielding a standard variant call format (VCF) file. Currently, genomic alignment tools and dedicated variant callers (such as MUMmer[87], Assemblytics[88], minimap2 or dipcall[89] and SVIM-asm[61]) are designed to provide a pairwise comparison of two genomes, such as the assembled and a reference genome (Table 2). However, in a project with multiple (diploid) genomes, this is clearly not ideal, as a whole-genome alignment-based approach likely suffers from the same biases as a read alignment-based approach. 
For example, in the case of novel sequence insertions in samples compared with a single reference genome, these variants will often be more challenging to compare across all samples of the population (Fig. 3a). This issue might be further amplified by gaps in the reference assembly, which potentially reduces the number of regions that can be compared across the population. Although troublesome for comparisons across samples, assembly-based SV calling will more likely correctly represent complex SVs that are longer than the read length and therefore harder to correctly identify with alignment-based methods (Fig. 3b). The likely most comprehensive option would be a compare-all-with-all approach (Fig. 3a), in which unique pairwise comparisons increase quadratically, meaning that with 100 samples there are already 4,950 possible ways to compare samples with each other. Clearly, such an approach would currently not be feasible for most projects, and alternative strategies have to be developed. Most recently, the introduction of progressive Cactus[90], a tool that constructs an ancestral genome when comparing two assemblies based on a guide tree, has enabled comparison across multiple genomes. However, to date this tool has mainly been tested across species and not between individuals of a species.
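The growth in the number of comparisons can be checked with a short calculation. The sketch below only evaluates the standard formula for unordered pairs, n(n-1)/2; the function name and the toy sample list are illustrative.

```python
from itertools import combinations
from math import comb

def n_pairwise_comparisons(n_samples):
    """Number of unique unordered pairs among n_samples assemblies: n(n-1)/2."""
    return comb(n_samples, 2)

# Cross-check the formula by enumerating pairs for a small, made-up sample list.
toy_samples = ["s1", "s2", "s3", "s4"]
assert len(list(combinations(toy_samples, 2))) == n_pairwise_comparisons(4)

for n in (10, 100, 1000):
    print(f"{n} samples: {n} comparisons against one reference, "
          f"{n_pairwise_comparisons(n)} all-against-all")
```

For 100 samples this gives the 4,950 comparisons quoted above, against only 100 when every sample is compared with a single reference.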

Fig. 3

Potential problems for different genome comparison approaches.


a | Schematic depiction of a potential problem in a de novo assembly-based approach. The presence of a novel segment N1 in two de novo assemblies, at different locations and, even more so, a sequence variant (red x), poses a challenge to correct reporting by current state-of-the-art methods and variation formats. b | Similar representation of the N1 problem in an alignment-based approach, where the coordinates of N1 are shared, but identification of the single-nucleotide variant (SNV) or the entire N1 sequence remains challenging. c | A graph-based representation of N1, which enables a clearer comparison of the variant across the samples, illustrating the potential benefits of graph genomes. R1–R3 represent the backbone of the graph genome, and N1 and its SNV represent novel sequence for a given sample set.

Another, perhaps even greater, challenge in de novo assembly approaches is the correct representation of ploidy. Many organisms have diploid genomes (for example, humans and many animals) and even higher ploidies exist, such as in some crops. Tools optimized for diploid (that is, haplotype-aware) de novo assembly are available to reconstruct both haplotypes[22]. This reconstruction is essential to recover all heterozygous variation, as two different haplotypes may otherwise be collapsed to a single artificial and incorrect representation of the chromosome. However, haplotype-resolved de novo assemblies often require higher coverage and computational cost. The correct genotyping of both heterozygous and homozygous variants is of utmost importance for subsequent population genetic analysis. A recent solution is to first create an unphased assembly, then identify variants and partition reads into haplotypes before creating phased contigs[86,91]. Even if complete and accurate haplotype-resolved assembly is achieved, SV calling from assembly-to-assembly comparison might not be straightforward in highly complex regions.
For example, the human LPA[77] and SMN1 and SMN2 (ref.[92]) loci with their highly repetitive structure lead to problems in genomic alignments. As such, the main challenge may shift to genomic alignments and methods to interpret the detected differences between multiple assemblies.

Graph genome methods

Both read alignment and de novo assembly approaches can have systematic issues with complex structural variation, inserted sequences missing from the reference genome, repeat variability and highly polymorphic loci (Fig. 3). Linear reference genomes only represent one allele and thus, do not incorporate polymorphisms and complexity of a population. Reference pan-genome approaches, which combine genomes from multiple individuals within a species, are a better fit to represent genomic diversity[93,94] (Fig. 3c). Variant catalogues for pan-genome structures are obtained by ongoing projects using high-quality haplotype-resolved assemblies of diversity panels for the discovery of variants[46]. A reduction of the alignment bias against non-reference alleles is achieved by explicitly taking known population variants into account in the read alignment step. As such, the analysis does not rely on a single reference genome. This goal is realized by graph genome-based tools and their associated data formats, as a way to represent a collection of possible (alternative) sequences[95]. Examples of tools for this purpose include vg[96], minigraph[97], the SevenBridges Graph Genome Pipeline[98], the DRAGEN Graph Mapper and PanGenie[99]. These implementations provide tools to build graphs based on the linear reference genome and a collection of known variants, or alternatively use (haplotype-resolved) assembled contigs. Although a detailed discussion of the methods to construct such pan-genome graphs is beyond the scope of this Review, we note that there are important differences in implementation and data format with regard to compatibility with coordinates on the linear reference genome and storing information of the individual haplotypes that contributed to the included variation[97]. An additional benefit of graph genome methods is that they enable a more correct representation of nested variation, such as smaller variants within inserted sequences[94]. 
A major benefit of graph genomes is the genotyping of SVs using short reads. Multiple tools, such as GraphTyper2[100], Paragraph[101] and tools from the vg package[45,96], have been developed specifically for alignment of short-read sequencing data to graph genome structures. SNVs, small indels or SVs within a sample are genotyped as reads following a certain path (‘walk’) through the pan-genome graph[96,101] (Fig. 4a). Graph genotyping methods enable the assessment of variants that remain undetected by the current state-of-the-art short-read SV discovery methods[46]. In the next step, variants that were not yet explicitly encoded in the graph can be identified, with the option to incrementally augment the graph structure with the newfound variation to further improve accuracy[98,102]. Graph genome methods are reviewed in greater depth elsewhere[94,95,103].
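The idea of genotyping reads as walks through a graph can be illustrated with a deliberately simplified toy model. Everything in this sketch is an assumption made for illustration: the node names, the allele walks, the read-support threshold and the function `genotype` are hypothetical and do not reproduce the algorithm of vg, GraphTyper2 or Paragraph.

```python
from collections import Counter

# Hypothetical toy graph: R1 and R2 form the backbone, INS_A and INS_B are
# two alternative insertions at the same locus. Each allele is one walk.
ALLELE_WALKS = {
    "ref": ("R1", "R2"),
    "ins_short": ("R1", "INS_A", "R2"),
    "ins_long": ("R1", "INS_B", "R2"),
}

def genotype(read_walks, min_reads=2):
    """Call a diploid genotype from reads that span the whole locus.

    Reads whose walk matches no complete allele walk are ignored.
    Returns a sorted pair of allele names, or None without enough support.
    """
    walk_to_allele = {walk: name for name, walk in ALLELE_WALKS.items()}
    support = Counter(walk_to_allele[w] for w in read_walks if w in walk_to_allele)
    called = [allele for allele, n in support.most_common(2) if n >= min_reads]
    if not called:
        return None
    if len(called) == 1:
        called = called * 2  # only one supported allele: homozygous call
    return tuple(sorted(called))

reads = (
    [("R1", "INS_A", "R2")] * 3
    + [("R1", "INS_B", "R2")] * 3
    + [("R1", "INS_A")]  # does not span the locus, so it is not counted
)
print(genotype(reads))  # ('ins_long', 'ins_short')
```

Real graph genotypers work on base-level alignments and model sequencing error and coverage statistically; the sketch only shows why spanning reads can distinguish several alleles at one locus.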

Fig. 4

Genotyping of SVs and SNVs across a population set.


a | Graph genome-based genotyping of a region with multiple alleles between two genome segments (green and pink). Insertions of different sizes (yellow) can be genotyped at the same locus using spanning reads (blue and purple) to identify the presence of two different alleles. b | An example of structural variants (SVs) and single-nucleotide variants (SNVs) across different unique and repeat regions being correctly or incorrectly genotyped based on read length. c | A phylogenetically informed filtering approach for SVs. Assuming that after a sufficiently long time (4Ne generations, where Ne = effective population size) most or all genetic variation should be fully sorted between two clades, variants that do not adhere to this assumption and are polymorphic across clades (for example, variant 3) can be removed. Although this approach is certainly very conservative and ignores the fact that some types of variation exhibit repeated mutations on the same locus, it can be considered a first step towards more reliable genotyping of SVs.

With such graph-based approaches, the often discussed dichotomy of either using an existing reference genome for alignment or constructing a novel reference genome through de novo assembly can potentially be avoided for population studies, as downstream of this step all sequences have to be compared with a single (reference) assembly or a backbone of a pangenomic graph, for identification of variation, annotation and statistical evaluation. However, these approaches are less straightforward in practice than the use of a linear reference genome and are not entirely mature, with competing implementations and data formats. Although graph genome methods are good candidates to solve biases when assessing (structural) genomic diversity, it remains unclear whether these methods will become mainstream in clinical or diagnostic applications, in which a single reference genome is an attractive simplification.

Variant validation and genotyping

To determine whether any given variant constitutes the biological reality and is not just an artefact, it is important to perform validation. Ideally, this is done using orthogonal approaches, to capitalize on the strengths of different technologies. Traditionally, PCR validation of variants has been the method of choice[104]; however, for complex SVs that contain highly repetitive regions, other, non-sequencing-based methods such as optical mapping might be more suitable[46]. Visual inspection of alignments and subsequent manual curation of variant sets are arguably a very accurate validation approach but certainly not feasible for more than a few hundred variants. A semi-automated pipeline, SV-plaudit, has been developed to enable rapid, streamlined and efficient curation of thousands of SVs[105]. Of similar importance is variant genotyping, which we define as determining the presence and zygosity of a variant. Although the initial discovery of variation is relatively straightforward, obtaining reliable genotypes for a given variant across a population is usually much more difficult. However, knowing the alleles (that is, the genotypes) of variants for a given sample is particularly important in population genetic and evolutionary studies, in which population size estimation and measures of genetic differentiation (such as the fixation index FST) rely on obtaining accurate allele frequencies of variants[106]. In particular, variants in repetitive regions are more readily genotyped using long reads than using short reads (Fig. 4b). For SNVs, sophisticated genotyping approaches have been developed that consider important parameters such as mutation dynamics (for example, transition to transversion ratios) and information about non-variant sites to improve genotype accuracy[107]. 
The concept of a genomic variant call format (gVCF) has been implemented in applications such as freebayes[108] and GATK[109], which has improved the efficiency of the comparison and made multiple rounds of genotyping obsolete. Another approach is to completely abandon genotype calling and instead calculate posterior probabilities of genotypes to directly incorporate uncertainty in the downstream analysis (for example, ANGSD[110]). Merging SNVs is typically done with tools such as bcftools[111] and RTGTools[112]. For SVs, the situation is much more complicated, as establishing homology of variants between samples is not straightforward. One of the first approaches to be developed is based on 50% reciprocal overlap, which allows two SVs to be merged if they overlap substantially. Although this works well for large copy number variation events, there may be some limitations for smaller SVs (for example, 50 bp to 1 kbp) with more localized breakpoints. Another approach is to require breakpoints from each individual to be approximately in agreement to establish that a variant in two samples is indeed homologous (for example, SURVIVOR merge[113]). In some cases, such as when two insertions are homologous, but their sequence slightly deviates, an approach based on breakpoints may be too conservative, and some tools have been used to attempt to address this issue (for example, Truvari[9], SVanalyzer and Jasmine[32]). However, at present, no universal standards are available for the thresholds. Thus, approaches rely on arbitrary thresholds of breakpoint distances and sequence similarity. Deletions are arguably the most straightforward type of variation to genotype, but calling heterozygotes for even this seemingly simple type of SV can be difficult[114]. Tools such as Sniffles and SVJedi are capable of genotyping SVs based on a candidate VCF file, following an initial step of SV discovery based on the long-read alignments[66]. 
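The two merging criteria described above, reciprocal overlap and breakpoint agreement, can be compared on toy intervals. This is a minimal sketch, not the implementation of SURVIVOR or any other tool: the function names are illustrative, the 1,000 bp breakpoint window is an arbitrary assumption, and the coordinates are invented.

```python
def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions for intervals given as (start, end)."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))

def same_by_overlap(a, b, min_fraction=0.5):
    """Merge criterion 1: at least 50% reciprocal overlap."""
    return reciprocal_overlap(a, b) >= min_fraction

def same_by_breakpoints(a, b, max_distance=1000):
    """Merge criterion 2: both breakpoints within max_distance bp (assumed value)."""
    return abs(a[0] - b[0]) <= max_distance and abs(a[1] - b[1]) <= max_distance

# Invented coordinates: a large copy number variant seen in two samples ...
cnv_1, cnv_2 = (100_000, 200_000), (120_000, 210_000)
# ... and a 60 bp deletion whose breakpoints are shifted by 40 bp.
del_1, del_2 = (5_000, 5_060), (5_040, 5_100)

print(same_by_overlap(cnv_1, cnv_2), same_by_breakpoints(cnv_1, cnv_2))  # True False
print(same_by_overlap(del_1, del_2), same_by_breakpoints(del_1, del_2))  # False True
```

The small deletion fails the reciprocal-overlap test although its breakpoints differ by only 40 bp, which illustrates the limitation for smaller SVs. Insertions have no length on the reference, so interval overlap alone cannot handle them and their sequences need to be compared instead.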
Another potentially very powerful approach to improve SV genotypes is to harness the information contained in a sampling scheme consisting of phylogenetically distant populations (Fig. 4c). In this approach, basic population genetic assumptions are made to reduce the number of false positives for genotyped SVs. After a sufficient number of generations (4Ne, where Ne = effective population size), variation is likely fully sorted and no polymorphisms should occur across lineages any more, assuming that there are no repeated mutations at the same locus (that is, the infinite sites model)[115]. Any variants exhibiting polymorphic genotypes across the divergent lineages are excluded. Although this approach neglects the fact that certain types of SV have much higher mutation rates and thus indeed have the potential for repeated mutations (for example, variation within tandem repeat arrays), it provides a first step towards more reliable SV genotyping. This approach has recently been successfully applied in corvids (crows and jackdaws)[35].
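The phylogenetically informed filter can be sketched as follows. The sketch rests on one reading of the rule, labelled here as an assumption: a variant is removed when it is polymorphic (both alleles present) within each of two divergent clades. Sample names, genotypes and the function names are hypothetical.

```python
def is_polymorphic(genotypes):
    """True if a set of diploid genotypes contains more than one allele."""
    alleles = {allele for genotype in genotypes for allele in genotype}
    return len(alleles) > 1

def phylogenetic_filter(calls, clade_a, clade_b):
    """Keep variants that are not polymorphic in both clades at once."""
    kept = []
    for variant_id, per_sample in calls.items():
        in_a = is_polymorphic(per_sample[s] for s in clade_a)
        in_b = is_polymorphic(per_sample[s] for s in clade_b)
        if not (in_a and in_b):
            kept.append(variant_id)
    return kept

# Hypothetical genotypes: 0 = reference allele, 1 = alternative allele.
clade_a = ["a1", "a2"]
clade_b = ["b1", "b2"]
calls = {
    "variant_1": {"a1": (0, 0), "a2": (0, 0), "b1": (1, 1), "b2": (1, 1)},  # fixed difference
    "variant_2": {"a1": (0, 1), "a2": (0, 0), "b1": (0, 0), "b2": (0, 0)},  # private to clade A
    "variant_3": {"a1": (0, 1), "a2": (0, 0), "b1": (0, 1), "b2": (1, 1)},  # shared polymorphism
}
print(phylogenetic_filter(calls, clade_a, clade_b))  # ['variant_1', 'variant_2']
```

As noted above, such a filter is conservative: genuinely recurrent variants, for example in tandem repeat arrays, would be discarded together with genotyping errors.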

Prediction of functional impact

The mathematical framework for the analysis of (small) genetic variants predates the advent of high-throughput sequencing by almost a century and is therefore well established. Large-scale single-nucleotide polymorphism (SNP)-array-based GWAS projects enabled the interrogation of thousands of variants and haplotypes for their association with disease. Although quality assessment steps such as principal component analysis and testing for Hardy–Weinberg equilibrium still hold for indel variants (that is, <50 bp), these models do not necessarily cover all types of SV, for example, in the case of a continuous spectrum of repeat lengths[116]. A solution, albeit with loss of resolution, would be to binarize the distribution into ‘reference’ and ‘expanded’ alleles, but historically it has been difficult to unambiguously establish a cut-off length. Association testing of the role of partially overlapping variants for a certain trait requires an approach conceptually similar to that used for burden analysis in rare variant association studies. Whereas classification of the functional impact of small variants on protein function for synonymous, missense and loss-of-function variants is relatively mature with tools such as the Ensembl VEP[117], it is less straightforward to judge the impact of SVs on the expression of nearby genes. This is mainly because it is unclear how the length of an SV impacts the surrounding genomic region and it is often hard to obtain robust allele frequencies for SVs[114]. For functional annotation and pathogenicity prediction, approaches using joint linear models[118], supervised learning[119] and existing databases[120] have been developed, and there are promising examples demonstrating that SVs are indeed associated with important traits of interest[118,119].
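Two of the steps mentioned in this section, binarizing repeat lengths and testing for Hardy–Weinberg equilibrium, can be illustrated for a biallelic variant. This is a minimal sketch using the textbook chi-square test; the cut-off of 30 repeat units, the genotype counts and the function names are assumptions made for illustration only.

```python
def binarize_repeat_lengths(lengths, cutoff):
    """Label each repeat length 'expanded' if above the cutoff, else 'reference'."""
    return ["expanded" if length > cutoff else "reference" for length in lengths]

def hwe_chi_square(n_ref_ref, n_ref_alt, n_alt_alt):
    """Chi-square statistic (1 degree of freedom) for deviation from Hardy-Weinberg."""
    n = n_ref_ref + n_ref_alt + n_alt_alt
    p = (2 * n_ref_ref + n_ref_alt) / (2 * n)  # frequency of the reference allele
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_ref_ref, n_ref_alt, n_alt_alt)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

# Hypothetical repeat lengths (in repeat units) and an arbitrary cutoff of 30.
print(binarize_repeat_lengths([12, 18, 45, 31], cutoff=30))

# Hypothetical genotype counts for 100 individuals.
print(hwe_chi_square(25, 50, 25))  # counts match expectation exactly: 0.0
print(hwe_chi_square(40, 20, 40))  # strong heterozygote deficit: 36.0
```

A statistic above 3.84 corresponds to P < 0.05 at one degree of freedom. For SVs, a deficit or excess of heterozygotes can point to systematic genotyping error rather than biology.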
Conclusions

Ongoing significant technological improvements have paved the way to apply long-read sequencing to population-scale sequencing projects and demonstrate that this sequencing approach is here to stay. This process already started with the first larger data sets generated by targeted sequencing of certain genes (Box 2) and continues with an increasing number of projects that leverage long reads at scale (Fig. 1; Table 1). The analysis of population-scale long-read sequencing data sets remains challenging, with the read alignment-based approach currently being the most feasible. Nevertheless, we anticipate this to change to alignment of either haplotype-resolved de novo assemblies or individual sequencing reads to graph genome structures. This development will have a profound impact on the field and holds the promise of improved variant representation and complexity of the underlying biology, but would require a paradigm shift from a linear to a more complex version of the reference genome. PacBio and ONT lead the current development of long-read sequencing for multiple applications. However, other companies (for example, Base4, Quantapore and Omniome) are developing novel long-read technologies, the viability of which remains to be evaluated in the coming years. Although not discussed here, improved DNA extraction, conservation and library preparation is also adding to the rapid growth of long-read sequencing population studies[31]. Among the biggest achievements in recent years is the generation of sequence reads of 4 Mbp and longer, although this is not yet routinely possible without compromising yield[28]. Once sequencing reads routinely approach chromosome length, the process of de novo assembly seems obsolete; however, whether such reads can be directly used in a framework that is based on de novo assemblies instead of read alignment remains to be seen.

Future directions

The future of long-read population-scale sequencing holds many opportunities for multiple types of omics assays. For example, both the PacBio and ONT platforms are able to simultaneously detect the nucleotide sequence and modifications of DNA such as 5-methylcytosine[121]. The identification of such modifications has unprecedented implications for epigenetics and the analysis of DNA damage. More recent versions of the ONT base callers are trained to detect common nucleotide modifications, which together with the plateauing accuracy potentially alleviates the need to store raw data. Several studies have shown excellent reproducibility and correlation with bisulfite sequencing, suggesting that nanopore sequencing could become the gold standard for detecting methylation patterns[122]. Although methods tailored to short-read bisulfite sequencing exist, there is a lack of statistical methods for differential methylation assessment that leverage the unique features of long-distance phasing of modifications in parental haplotypes. Detection of nucleotide modifications further opens up a wealth of opportunities for specialized assays such as chromatin accessibility profiling[123] and replication fork detection[124]. Complementary to DNA-based population sequencing, long-read sequencing of mRNA and complementary DNA (cDNA) also enables the identification of isoform diversity[125]. Multiple pipelines have been developed to investigate known and novel isoforms, but the field is far from mature. A survey of multiple tissues has already been undertaken[125], and an extension of this to the population scale, such as in the short-read GTEx project, is highly likely to yield valuable information about transcript structure and the influence of regulatory (structural) variation. Long-read sequencing approaches have also been extended to the direct sequencing of proteins[126] and single-cell transcriptomics[127].
Although these applications are likely to lead to biologically fascinating insights, the implications for population studies remain unclear[127]. Alongside the technological improvements in long-read sequencing, computational analysis has also improved, which is key to enabling population-scale projects. Analyses that took weeks to months to accomplish a year ago can now be completed within a day to a week and at a lower cost[24,86,128]. However, some conceptual challenges remain, such as the representation of nested and highly complex variation[97]. Recent advances, such as pan-genome graphs, have the potential to address this challenge[97]. Furthermore, the use of pan-genome graphs could indeed improve the analysis itself, as they overcome the problem of a linear reference bias by including different alleles[96,100,101]. Another related computational challenge is the accurate and rapid genotyping of complex alleles. Here, graph genomes have already shown significant benefits, although the process of obtaining a fully genotyped population-level VCF is still far from trivial. This is due to the lack of a gVCF for SV representation, to represent information not only about the alternative alleles (that is, SV) but also about reference alleles. For SNV, this allows the easy comparison of samples and is a requirement for future SV studies. Despite significant advances in long-read sequencing, several challenges remain to be addressed. The frequently discussed issue regarding the lack of precision and lack of sensitivity in identifying SNVs and small indels, especially involving homopolymers, is likely to be resolved by advancements in sequencing accuracy[27,68]. However, difficulties remain in assessing variation in complex regions such as segmental duplications, ribosomal DNA (rDNA) tandem arrays, telomeres or centromeres. 
Spurred by the efforts led by the T2T consortium, which aims to provide the full linear nucleotide sequence of all human chromosomes, new software tools are being developed that specifically aim to resolve these large tandem arrays and also to assess the allelic variation within them. However, whether this solves the problem completely remains to be determined, as at the time of writing even the T2T reference genome has a few gaps remaining and only represents one ethnicity. In this Review, we provide a snapshot of the present state of large-scale long-read sequencing and discuss the exciting developments in biotechnology and bioinformatics. Despite its challenges, we argue that long-read sequencing has contributed immensely to the advancement of genomics in humans, model organisms and beyond, and that this is the way forward for population-scale studies.
