Literature DB >> 33575574

Critical length in long-read resequencing.

Wouter De Coster¹, Mojca Strazisar¹, Peter De Rijk¹.

Abstract

Long-read sequencing has substantial advantages for structural variant discovery and phasing of variants compared to short-read technologies, but the required and optimal read length has not been assessed. In this work, we used long reads simulated from human genomes and evaluated structural variant discovery and variant phasing using current best practice bioinformatics methods. We determined that optimal discovery of structural variants from human genomes can be obtained with reads of minimally 20 kb. Haplotyping variants across genes only reaches its optimum from reads of 100 kb. These findings are important for the design of future long-read sequencing projects.

Entities: Chemical Disease Gene Species

Year: 2020 PMID： 33575574 PMCID： PMC7671308 DOI： 10.1093/nargab/lqz027

Source DB: PubMed Journal: NAR Genom Bioinform ISSN： 2631-9268

INTRODUCTION

Long-read sequencing using Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) platforms has profound implications for genomics and genetics (1–4). In contrast to earlier generations of sequencing technologies, the read length routinely reaches tens to hundreds of kilobases and even up to megabases (5,6). Long-read sequencing leads to more continuous de novo genome assemblies, but this does not necessarily make assembly a trivial task, especially for long segmental duplications and heterozygosity. Long reads also have advantages for genome resequencing in the context of structural variant (SV) discovery and variant phasing. It enables more comprehensive detection of genome-wide structural variation, owing to their higher mappability in repetitive regions and their ability to anchor alignments to both sides of the break point (7–9). SVs are defined as genomic variability of at least 50 bp with a change in copy number or location and include deletions, insertions, inversions and translocations (10). It has been shown that ∼29 000 SVs can be identified per human genome by combining multiple technologies (11), showing that current short-read sequencing approaches leave thousands of variants undiscovered. Phasing variants gains from long reads because of the higher chance of finding variants inherited from the same haplotype on a single read. Phasing has important implications in determining the pathogenicity of pairs of compound heterozygous variants and the effect of cis-regulation (12). To our knowledge, the dependence of SV detection and variant phasing on the read length has not been formally assessed. In this work, we evaluated the influence of the read length on the accuracy and sensitivity of SV detection and on the length of contiguous stretches of phased nucleotides based on simulated long-read data from recent human genome assemblies.

MATERIALS AND METHODS

Commands for processing and analysis are included in the Supplementary Data. All scripts and commands are available at https://github.com/wdecoster/read_length_SV_discovery. We include a high-quality phased genome assembly (2.9 Gb) of the Puerto Rican reference individual HG00733, obtained by combining 75× genome coverage of PacBio long reads with additional long-range information of conformational capture sequencing (Hi-C) assembled with FALCON-Phase (13,14) (NCBI Assembly identifier GCA_003634875.1, BioProject PRJNA483067), CHM13hTERT (draft v0.6), a complete hydatidiform mole (46,XX karyotype) (15) combining ONT and PacBio long-read sequencing, 10X Genomics linked-reads and Bionano Genomics optical maps assembled with Canu (16) and NA12878 (OCVW02, PRJEB23027), a well-characterized genome from European descent assembled using nanopore sequencing data assembled with Canu (6,16). The quality metrics and contiguity of the genome assemblies used in this work were evaluated using D-GENIES (17) and QUAST (18) (Supplementary Table S1). We used SimLoRD (v1.0.2) (19) for simulation of 40× coverage of PacBio reads with a defined length between 100 bp and 700 kb. Reproducibility was assessed by simulating reads for HG00733 in triplicate. Simulated reads were aligned to the GRCh38 reference genome using minimap2 (v2.14) (20) followed by SV calling using Sniffles (v1.0.10) (21). Alignment files were sorted, indexed and downsampled using SAMtools (22). The obtained read depth was assessed with mosdepth (0.2.3) (23). The truth set of SVs was determined using paftools variant calling based on the alignment of the assembly to the GRCh38 reference alignment using minimap2 with the asm5 alignment presets and specific parameters for assembly-to-reference alignment, after splitting the diploid assembly by parental haplotypes when applicable (20,24). The performance metrics precision, recall and F-measure (harmonic mean of precision and recall) were evaluated for SVs with a minimal length of 50 nucleotides using surpyvor (7), which uses SURVIVOR (v1.0.5) for merging VCF (variant call format) files of SVs (25) and cyvcf2 (0.10.0) for parsing VCF files (26). VCF files of SVs were annotated with the read depth of the variants and their flanking sequences using duphold (v0.0.9) (27) and filtered based on the fold change for the read depth of copy number variants (CNVs) relative to its flanking regions using bcftools (v1.9) (28), to enrich for CNVs supported by deviation in read depth. Single-nucleotide variants (SNVs) from HG00733 as identified in the 1000 Genomes Project (29,30) were phased with the simulated reads by WhatsHap (31), after which contiguous haplotyped segments (phase blocks) were compared to the Ensembl transcript annotation (GRCh38, v95) (32) using BEDTools (33). Data analysis and visualization was performed in Python and Jupyter Notebooks (34) using pandas (v0.23.4) (35), matplotlib (v3.0.0) (36) and joypy (L. Taccari; https://github.com/sbebo/joypy). Commands were parallelized using GNU Parallel (v20181022) (37).

RESULTS AND DISCUSSION

Long-read resequencing has promising applications for genomics, as it enables direct observation of SVs and inference of haplotypes (7,11,38). In this work, we formally assess the impact of increasing read length on the accuracy of SV identification and haplotyping of SNVs. While current long-read sequencing platforms allow sequencing of tens of kb to Mb reads, longer read lengths come with a number of disadvantages: They require more laborious manual DNA extraction from fresh tissue, which may not always be available. Avoiding fragmentation prior to and during library preparation is also essential. Furthermore, striving for ultra-long read lengths also seems to reduce the total yield (6). Due to these limitations and challenges, it is valuable to assess what the required and sufficient read length is to obtain an optimal balance between read length, accuracy and sensitivity. We approach this problem by simulating long reads based on recently assembled human genomes of HG00733 (14), CHM13 (15) and NA12878 (6). The highest quality assembly in terms of contiguity is the one from HG00733 (Supplementary Table S1), which is also the only separated in maternal and paternal haplotypes. CHM13 is effectively a haploid genome, and NA12878 has both haplotypes superimposed, raising the possibility of incorrectly represented SVs. Notably, repetitive regions play a significant role in this analysis. These regions cannot always be resolved in assemblies, leading to separate contigs. Furthermore, these regions are known hot spots of structural variation, while spurious read alignment also leads to an inflated occurrence of false-positive and false-negative variants. As the HG00733 assembly is the most complete and thus presumably most correctly represents SVs, we mainly base our conclusions on this dataset. The truth set of SVs was based on a direct comparison of the assembly with the reference genome. In the case of HG00733, this alignment and variant calling was done separately per haplotype. This results in the identification of 25 139, 16 776 and 24 653 SVs larger than 50 bp for HG00733, CHM13 and NA12878, respectively, with a variant length distribution comparable to earlier reports (Supplementary Figure S1) (7,11,39). For each genome, multiple datasets with 40× target coverage and a specific read length starting at 100 bp and up to 700 kb were simulated. A limitation of our analysis is that we use a fixed read length per dataset, while real long-read sequencing experiments typically produce a long-tailed log-normal distribution. Simulation of log-normal distributed reads requires additional complexity of specifying three parameters for the shape of the read length distribution. Testing of this variable read length distribution for HG00733 resulted in the same conclusion as using a single fixed read length (Supplementary Figure S2). We anticipate this simplification is therefore justified to provide approximate guidelines of optimal read length. After alignment of the simulated reads to the human reference genome GRCh38 with minimap2 (20), we obtained the expected read coverage of ∼40× (Supplementary Figure S3). SVs from simulated reads were called using Sniffles and compared to the truth set to calculate the precision, recall and F-measure (Figure 1). An SV is considered concordant (true positive) if it is of the same type and has maximally a pairwise distance of 500 bp between the beginning and end coordinates in the test set and the truth set.

Figure 1.

Precision (with and without filtering on duphold annotation), recall and F-measure (y-axis) for SV call sets of simulated reads from the HG00733 assembly with increasing read length (x-axis). Average of triplicate simulations. For the HG00733 assembly, the SV precision reaches its maximum already at reads of 1000 bp, while recall no longer increases substantially after 20 kb (Figure 1). The F-measure indicates that optimal performance is reached approximately from reads of 15 kb and longer. Replicate simulations showed a high correlation of performance (Pearson's correlation coefficient >0.97). For the lesser contiguous assemblies, a shorter read length is sufficient to saturate the F-measure. For the haploid CHM13 genome, the performance metrics follow the same shape, but the F-measurement is already saturated at 4-kb reads (Supplementary Figure S4). For this dataset, the precision is higher than HG00733 at shorter read lengths and similar from 1-kb reads onward, while recall saturates at 20 kb, analogous to the findings for HG00733. CHM13 provides a valuable simplification as no heterozygous events are expected, while a diploid genome has been shown before to complicate SV discovery (40). The NA12878 assembly has both haplotypes compressed in a single haploid assembly, as such likely misrepresenting SVs. For this dataset, the F-measure already reaches its maximum from 1-kb reads (Supplementary Figure S5), while strikingly the precision is low, suggesting many false positives, and recall substantially higher. Interestingly, for all datasets recall decreases after 150 kb (discussed later). It is also worth considering that our results might be affected by differences in structural variability between individuals and populations relative to the reference genome (41). For HG00733, two additional evaluations were performed using (i) lower coverage and (ii) variants filtered based on duphold annotation (27), which adds confidence to CNVs based on read depth information. The conclusion after downsampling the HG00733 alignments to 20× coverage is similar, although recall in general is lower (Supplementary Figure S6). Filtering false-positive CNVs on duphold annotation of read depth changes versus their flanking sequences substantially improved precision, especially for shorter read lengths (Figure 1), while only mildly penalizing recall (Supplementary Figure S7). As read depth changes are only applicable to CNVs, only the accuracy of deletions and duplications is improved, of which the latter is rarely identified by Sniffles in favor of more common insertion variants. It is worth mentioning that in none of the variant sets from simulated data all variants from the assembly-based truth set are identified, highlighting the limitations of the variant caller or suggesting that some variants can only be identified using de novo assembly- or read depth-based methods. Alternatively, it cannot be fully excluded that the assembly-based SV identification contains false-positive events, is incomplete, or that coordinates of events that are inferred differently. Notably, as the size of the assemblies is 86.6–89.5% compared to the human reference genome GRCh38, some genomic content remained unassembled, presumably containing complex repetitive sequences for which longer reads are beneficial for both assembly and SV calling. As such, our estimate of minimal read length is probably an underestimation for these very long segmental duplications, but a sufficiently accurate guideline for the majority of the SVs in the nonrepetitive genome. With chromosome-scale, haplotype-resolved assemblies, a more accurate guideline could be calculated, a feature that is the promise of more recent ultra-long read libraries, highly accurate reads and tailored genome assembly methods (15,42–44). Phasing 3.5 million SNVs called from short-read sequencing data shows a continuous increase in the length of phase blocks (contiguous haplotyped genomic fragments) with increasing read length, without reaching a point of saturation within the sizes we tested (Figure 2). Phasing variants across the length of genes is an important application to assess pathogenicity. With reads of 10 kb, ∼50% of the genes can be completely phased. This fraction of completely phased genes increases with read length up to a maximum of 90% with reads of 100 kb or longer (Figure 3). The longest phase blocks are megabases long but are limited by repetitive sequences, regions without identified small variants and structural variation leading to split read alignment. We anticipate that accurate SNV calling methods for long reads would further improve the length of phase blocks, as variants in repetitive sequences cannot be identified by short reads due to ambiguous alignments.

Figure 2.

Figure 3.

The fraction of genes entirely contained in a single phase block, reaching a plateau at 100 kb, enabling phasing of variants across the entire gene.

Ridge plot showing the distribution of the length of phase blocks with increasing read length simulated from HG00733. The x-axis is the genomic size of phase blocks, and the y-axis shows the length distribution. Datasets are stacked vertically on separate lines. The fraction of genes entirely contained in a single phase block, reaching a plateau at 100 kb, enabling phasing of variants across the entire gene. With very long reads (>150 kb), we surprisingly see that SV recall decreases while precision increases, and to a lesser extent, the proportion of phased genes decreases. As software, including aligners and SV callers, was not developed based on such extremely long read sizes, we hypothesize this reduction in performance is an analytic limitation and not due to the increased fragment length itself; e.g. highly complex combinations of SVs with multiple break points per read may be missed, leading to inaccurate alignment or break phase blocks. We can assume this can be mitigated by changing the assumptions of tools used for alignment, variant calling and phasing.

CONCLUSION

In the context of human long-read resequencing, our results show optimal performance for SV discovery for read lengths of 20 kb and longer and best phasing across genes from reads of 100 kb only, crucially guiding the experimental design of future long-read sequencing studies. Click here for additional data file.

2 in total

1. Phasing analysis of lung cancer genomes using a long read sequencer.

Authors: Yoshitaka Sakamoto; Shuhei Miyake; Miho Oka; Akinori Kanai; Yosuke Kawai; Satoi Nagasawa; Yuichi Shiraishi; Katsushi Tokunaga; Takashi Kohno; Masahide Seki; Yutaka Suzuki; Ayako Suzuki
Journal: Nat Commun Date: 2022-06-16 Impact factor: 17.694

2. Long-Read Sequencing Improves the Detection of Structural Variations Impacting Complex Non-Coding Elements of the Genome.

Authors: Ghausia Begum; Ammar Albanna; Asma Bankapur; Nasna Nassir; Richa Tambi; Bakhrom K Berdiev; Hosneara Akter; Noushad Karuvantevida; Barbara Kellam; Deena Alhashmi; Wilson W L Sung; Bhooma Thiruvahindrapuram; Alawi Alsheikh-Ali; Stephen W Scherer; Mohammed Uddin
Journal: Int J Mol Sci Date: 2021-02-19 Impact factor: 5.923

2 in total