Kar-Tong Tan, Michael K Slevin, Matthew Meyerson, Heng Li.
Abstract
Nanopore long-read sequencing is an emerging approach for studying genomes, including long repetitive elements like telomeres. Here, we report extensive basecalling-induced errors at telomere repeats across nanopore datasets, sequencing platforms, basecallers, and basecalling models. We find that telomeres in many organisms are frequently miscalled. We demonstrate that tuning of nanopore basecalling models leads to improved recovery and analysis of telomeric regions, with minimal negative impact on other genomic regions. We highlight the importance of verifying nanopore basecalls in long, repetitive, and poorly defined regions, and showcase how artefacts can be resolved by improvements in nanopore basecalling models.
Keywords: Basecalling; Long-reads; Nanopore-sequencing; Telomere
Year: 2022 PMID: 36028900 PMCID: PMC9414165 DOI: 10.1186/s13059-022-02751-6
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 17.906
Fig. 1 Strand-specific nanopore basecalling errors are pervasive at telomeres. a, b IGV screenshots illustrating the three types of basecalling errors found on the forward and reverse strands of telomeres in nanopore sequencing. (TTAGGG)n on the forward strand of nanopore sequencing data was basecalled as (TTAAAA)n, while (CCCTAA)n on the reverse strand was basecalled as (CTTCTT)n and (CCCTGG)n. PacBio HiFi data generated from the same cell line (CHM13) is depicted as a control. The reference genome indicated in the plot is the CHM13 draft genome assembly (v1.0). c Co-occurrence heatmap illustrating how frequently repeats corresponding to natural telomeres or to basecalling errors co-occur in PacBio HiFi and nanopore long reads found at chromosomal ends (within 10 kb of the annotated end of the reference genome). The diagonal of the co-occurrence matrix represents counts of long reads in which only a single type of repeat was observed. d Basecalling errors at telomeres are observed across different nanopore datasets and sequencing platforms. e Basecalling errors at telomeres are observed for different nanopore basecallers and basecalling models. The Guppy5 and Bonito basecallers, with different basecalling models for each, were used to basecall telomeric reads in the CHM13 PromethION dataset (reads that mapped to the flanking 10 kb regions of the CHM13 reference genome). f Basecalling errors share similar nanopore current profiles with telomeric repeats. Current profiles for telomeric and basecalling-error repeats were plotted based on known mean current profiles for each k-mer (“Methods”). g Summary of organisms assessed and the types of repeat errors observed. Note that S. pombe and D. melanogaster could not be readily assessed for the presence of error repeats by visualization in IGV, as their telomeric sequences are more complex.
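The repeat co-occurrence count described for panel c can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `min_copies` threshold of 4 tandem copies, and the use of regular expressions are assumptions made here for illustration. Only the five repeat units are taken from the caption.

```python
import re
from collections import Counter
from itertools import combinations

# Repeat units named in the Fig. 1 caption; the labels are for readability only.
MOTIFS = {
    "TTAGGG": "telomere, forward strand",
    "CCCTAA": "telomere, reverse strand",
    "TTAAAA": "basecalling error, forward strand",
    "CTTCTT": "basecalling error, reverse strand",
    "CCCTGG": "basecalling error, reverse strand",
}


def count_repeat_units(seq, min_copies=4):
    """Count copies of each motif that sit in tandem runs of >= min_copies.

    min_copies is a hypothetical threshold chosen for this sketch.
    """
    seq = seq.upper()
    counts = {}
    for motif in MOTIFS:
        pattern = re.compile("(?:%s){%d,}" % (motif, min_copies))
        counts[motif] = sum(
            len(m.group(0)) // len(motif) for m in pattern.finditer(seq)
        )
    return counts


def cooccurrence(reads, min_copies=4):
    """Tally repeat types seen together in each read.

    Off-diagonal keys (a, b) count reads containing both repeat types;
    diagonal keys (a, a) count reads containing only that single type,
    matching the convention stated in the caption.
    """
    matrix = Counter()
    for seq in reads:
        counts = count_repeat_units(seq, min_copies)
        present = sorted(m for m, n in counts.items() if n > 0)
        if len(present) == 1:
            matrix[(present[0], present[0])] += 1
        for a, b in combinations(present, 2):
            matrix[(a, b)] += 1
    return matrix
```

Calling `cooccurrence` on the sequences of reads that map near chromosome ends would give the raw counts behind a heatmap of this kind. How reads are selected and how repeats are detected in the paper is described in its Methods, which this sketch does not reproduce.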
Fig. 2 Selective re-basecalling of telomeric reads resolves basecalling errors at telomeres. a Approach for tuning the Bonito basecalling model to improve basecalls at telomeres. b The tuned Bonito basecalling model improves basecalls at telomeric regions. IGV screenshots of a telomeric region (chr2q) in the CHM13 dataset, basecalled using the default Bonito model and the tuned Bonito model, are depicted. c Overall approach for selecting and fixing telomeric reads in nanopore sequencing datasets. Telomeric reads are selected (“Methods”) and re-basecalled using the tuned Bonito basecalling model. d The selective tuning approach leads to improved recovery of telomeric reads and a decrease in the number of reads with basecalling artefacts. Evaluation was performed on the held-out test dataset (run226). e The “selective tuning” approach has little detected negative impact on basecalling of other genomic regions. The sequence similarity of all reads to the reference genome was evaluated for three basecalling approaches: applying the default Bonito basecalling model to all reads (untuned Bonito model), applying the tuned Bonito basecalling model to all reads (tuned Bonito model), and applying the tuned Bonito basecalling model selectively to telomeric reads only (selective tuning of telomeric reads). The density plot depicts the sequence similarity of each read to the CHM13 reference genome, as assessed using minimap2.
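The select-then-re-basecall flow of panel c can be sketched in Python as follows. This is an illustration under stated assumptions, not the published pipeline: `looks_telomeric` is a hypothetical selection rule (the actual criterion is given in the paper's Methods), and `rebasecall` is a placeholder callable standing in for the tuned basecalling model, not a Bonito API.

```python
import re

# Natural telomeric repeats plus the error repeats reported in Fig. 1.
CANDIDATE_REPEATS = ("TTAGGG", "CCCTAA", "TTAAAA", "CTTCTT", "CCCTGG")


def looks_telomeric(seq, min_copies=4):
    """Hypothetical selection rule: the read contains a tandem run of at
    least min_copies copies of a telomeric or error repeat."""
    seq = seq.upper()
    return any(
        re.search("(?:%s){%d,}" % (motif, min_copies), seq)
        for motif in CANDIDATE_REPEATS
    )


def selective_rebasecall(default_calls, rebasecall, select=looks_telomeric):
    """Keep default basecalls except for reads flagged by `select`.

    default_calls: dict mapping read ID to the default-model sequence.
    rebasecall: placeholder callable taking a read ID and returning the
        sequence produced by the tuned model for that read.
    """
    merged = {}
    for read_id, seq in default_calls.items():
        merged[read_id] = rebasecall(read_id) if select(seq) else seq
    return merged
```

Because only the flagged reads pass through the tuned model, every other read keeps its default basecall, which is the property panel e evaluates.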