Cheng Yong Tham1, Roberto Tirado-Magallanes1, Yufen Goh1, Melissa J Fullwood1,2, Bryan T H Koh3, Wilson Wang3,4, Chin Hin Ng5, Wee Joo Chng1,5,6, Alexandre Thiery7, Daniel G Tenen1,8, Touati Benoukraf9,10.
Abstract
The recent advent of third-generation sequencing technologies brings promise for better characterization of genomic structural variants by virtue of having longer reads. However, long-read applications are still constrained by their high sequencing error rates and low sequencing throughput. Here, we present NanoVar, an optimized structural variant caller utilizing low-depth (8X) whole-genome sequencing data generated by Oxford Nanopore Technologies. NanoVar exhibits higher structural variant calling accuracy when benchmarked against current tools using low-depth simulated datasets. In patient samples, we successfully validate structural variants characterized by NanoVar and uncover normal alternative sequences or alleles which are present in healthy individuals.
Keywords: Long reads; Low depth; Oxford Nanopore sequencing; SV characterization; Structural variants; Third-generation sequencing; WGS
Year: 2020 PMID: 32127024 PMCID: PMC7055087 DOI: 10.1186/s13059-020-01968-7
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 The NanoVar workflow. a About 2 μg of human genomic DNA is set aside for library preparation and nanopore sequencing to generate 3GS long sequencing reads. Long reads from all sequencing runs are combined into a single FASTQ/FASTA file (at least 12 Gb, recommended 24 Gb) which is used as input into NanoVar. b NanoVar SV characterization process. (Left) Long reads are aligned to a reference genome using HS-BLASTN to identify anchor sequences (blue) and divergent sequences or gaps (red) within each read. Next, the alignment information is used to detect and characterize the different SV classes. (Right) For each characterized SV, read-depth coverage is calculated at its breakend site(s) for the number of breakend-supporting and breakend-opposing reads. The breakend read depth, together with other alignment information, is employed as features in a neural network model to infer a confidence score for each SV. c NanoVar outputs all characterized SVs in a VCF file and produces an HTML report for QC and result visualization. The following figures can be found in the HTML report. (Top-left) Histogram showing the length distribution QC of the input sequencing reads. (Top-middle) Donut chart showing the distribution of SV classes characterized in the dataset (after confidence score filtering). Breakends represent translocation or genomic insertion SVs. (Top-right) Scatter plot displaying the confidence score and breakend read ratio (fraction of breakend-supporting reads at a breakend) of each SV, also showing the confidence score threshold parameter used for filtering (red line). (Bottom) Table showing the details of all characterized SVs, which can be sorted, filtered, and extracted in CSV or MS Excel formats
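The breakend read ratio and the confidence score filter described in panel c can be illustrated with a minimal Python sketch. This is not NanoVar's code: the `SVCall` record, its field names, the example scores, and the use of supporting plus opposing reads as the denominator are assumptions made for illustration, as is the choice of `>=` for the threshold comparison.

```python
from dataclasses import dataclass


@dataclass
class SVCall:
    """Hypothetical record for one characterized SV (field names are illustrative)."""
    sv_id: str
    supporting_reads: int  # breakend-supporting reads at the breakend
    opposing_reads: int    # breakend-opposing reads at the breakend
    confidence: float      # confidence score assigned to the SV


def breakend_read_ratio(call: SVCall) -> float:
    """Fraction of breakend-supporting reads among all reads counted at the breakend.

    Assumes the denominator is supporting + opposing reads.
    """
    total = call.supporting_reads + call.opposing_reads
    return call.supporting_reads / total if total else 0.0


def filter_by_confidence(calls: list[SVCall], threshold: float) -> list[SVCall]:
    """Keep SVs whose confidence score meets the threshold (assumed inclusive)."""
    return [c for c in calls if c.confidence >= threshold]


# Made-up example values, not results from the paper.
calls = [
    SVCall("sv-a", supporting_reads=6, opposing_reads=2, confidence=2.5),
    SVCall("sv-b", supporting_reads=1, opposing_reads=7, confidence=0.4),
]
kept = filter_by_confidence(calls, threshold=1.0)
```

With these made-up values, `sv-a` has a breakend read ratio of 6 / 8 and is the only call retained at a threshold of 1.0.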
Fig. 2 NanoVar performance benchmarking. a, b SV breakend precision and recall by SV caller tools in simulation data with homozygous or heterozygous SVs. There are three homozygous SV datasets with 42,000 SVs each and one heterozygous SV dataset with 42,000 SVs at different sequencing depths. For tools with SV scoring, the optimal score threshold was selected based on the F1 score at 4X sequencing depth (NanoVar, 1.0; NanoSV, 0; SVIM, 0; novoBreak, 27.5). The markers of different tools are separated by color, while different sequencing depths are separated by shapes. c, d Radar charts showing the F1 scores for each SV class characterized by each tool for homozygous and heterozygous SVs in simulation 1. DUP tandem duplication, DEL deletion, INS insertion, BND breakend, INV inversion. We presented the tools separately according to their utilization of 3GS and 2GS data. SV class annotation evaluation was included in this analysis. e, f Recall of homozygous and heterozygous SVs intersecting with SINE and LINE regions as detected by the different tools. The tools are separated by the same color code as the other plots in this figure. SV analysis of the other repetitive sequence families can be found in Additional file 1: Fig. S3
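As a reminder of how the metrics in this figure relate, the following Python sketch computes precision, recall, and F1 from true-positive, false-positive, and false-negative counts, and picks the score threshold that maximizes F1. It uses the standard definitions only; the function names, the `(score, is_true)` input layout, and the example values are hypothetical and do not reproduce the paper's benchmarking pipeline.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard precision, recall, and F1 (harmonic mean of the two)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def best_threshold(scored_calls, n_true, thresholds):
    """Return (threshold, f1) with the highest F1; ties go to the earliest threshold.

    scored_calls: iterable of (score, is_true) pairs, one per called SV.
    n_true: number of SVs in the truth set.
    """
    results = []
    for t in thresholds:
        kept = [is_true for score, is_true in scored_calls if score >= t]
        tp = sum(kept)          # called SVs that match the truth set
        fp = len(kept) - tp     # called SVs that do not
        fn = n_true - tp        # truth-set SVs not recovered at this threshold
        results.append((t, precision_recall_f1(tp, fp, fn)[2]))
    return max(results, key=lambda r: r[1])


# Made-up example: five calls scored against a truth set of three SVs.
example_calls = [(3.0, True), (2.0, True), (1.5, False), (0.5, True), (0.2, False)]
chosen_threshold, chosen_f1 = best_threshold(example_calls, n_true=3, thresholds=[0, 1, 2])
```

In this toy example the strictest threshold wins because it removes both false positives while losing only one true call.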
Fig. 3 Precise patients’ SV characterization by NanoVar. a Scatter plots showing the confidence score and breakend read ratio of each SV characterized in patient 1 (top) and patient 2 (bottom). SVs selected for validation are labeled on the plots by their SV id. The red horizontal line indicates the confidence score threshold used for filtering. b Table displaying the details of SVs selected for validation for patient 1 and patient 2. The in silico PCR size refers to the expected size of the PCR amplicon without the SV. c Gel electrophoresis images of PCR products corresponding to each of the SVs in table b, amplified from the genomic DNA of patients 1 and 2, normal donors (normal A and normal B) and cell lines (HCT116 and MCF10A). Sample names in red (left image lane 1, right image lane 2) indicate the sample where the SV was initially detected. d Schematic illustrating a 409-bp deletion (SV 1-2) in the intronic region of the gene BPGM in patient 1, supported by 3GS nanopore reads (top), 2GS Illumina reads (middle), and 1GS Sanger sequencing chromatogram (bottom). Blue and red arrows represent the primer locations used for PCR amplification. For each nanopore read, base substitutions and base insertions are represented by red and orange markers respectively. Base deletions are represented by gaps. All nanopore reads have at least 90% alignment identity. Illumina paired-end short reads are represented by pink (forward) and blue (reverse) small rectangles, and the read coverages are displayed in gray above all the reads. The red dotted line on the sequencing chromatogram marks the precise breakpoint of the deletion at single nucleotide resolution