Lavanya Singh, James E San, Houriiyah Tegally, Pius M Brzoska, Ugochukwu J Anyaneji, Eduan Wilkinson, Lindsay Clark, Jennifer Giandhari, Sureshnee Pillay, Richard J Lessells, Darren Patrick Martin, Manohar Furtado, Anmol M Kiran, Tulio de Oliveira.
Abstract
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is adaptively evolving to ensure its persistence within human hosts. It is therefore necessary to continuously monitor the emergence and prevalence of novel variants. Importantly, some mutations have been associated with both molecular diagnostic failures and reduced or abrogated next-generation sequencing (NGS) read coverage in some genomic regions. Such impacts are particularly problematic when they occur in genomic regions such as those that encode the spike (S) protein, which are crucial for identifying and tracking the prevalence and dissemination dynamics of concerning viral variants. Targeted Sanger sequencing presents a fast and cost-effective means to accurately extend the coverage of whole-genome sequences. We designed a custom set of primers to amplify a 401 bp segment of the receptor-binding domain (RBD) (between positions 22698 and 23098 relative to the Wuhan-Hu-1 reference). We then designed a Sanger sequencing wet-laboratory protocol. We applied the primer set and wet-laboratory protocol to sequence 222 samples that were missing positions with key mutations K417N, E484K and N501Y due to poor coverage after NGS. Finally, we developed SeqPatcher, a Python-based computational tool that analyses the trace files yielded by Sanger sequencing to generate consensus sequences, or takes preanalysed consensus sequences in fasta format, and merges them with their corresponding whole-genome assemblies. We successfully sequenced 153 of 222 samples (69 %) using Sanger sequencing and confirmed the occurrence of key beta variant mutations (K417N, E484K, N501Y) in the S genes of 142 of 153 (93 %) samples. Additionally, one sample had the Y508F mutation and four samples had the S477N mutation. Samples with RT-PCR Ct scores ranging from 13.85 to 37.47 (mean=25.70) could be Sanger sequenced efficiently. These results show that our method and pipeline can be used to improve the quality of whole-genome assemblies produced using NGS and can be used with any pairing of the most commonly used NGS and Sanger sequencing platforms.
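The coordinates and counts in the abstract can be sanity-checked with a little arithmetic. The sketch below assumes that the S gene begins at position 21563 of the Wuhan-Hu-1 reference (taken from the NC_045512.2 annotation, not from this abstract); the function names are illustrative only and are not part of SeqPatcher.

```python
# Assumption (not stated in the abstract): the S gene starts at 1-based
# position 21563 of Wuhan-Hu-1 (NC_045512.2).
S_GENE_START = 21563
AMPLICON = (22698, 23098)  # 1-based, inclusive, as given in the abstract


def codon_span(residue, gene_start=S_GENE_START):
    """Return the 1-based genome positions (first, last) of a spike codon."""
    first = gene_start + (residue - 1) * 3
    return first, first + 2


def in_amplicon(residue, amplicon=AMPLICON):
    """True if the whole codon lies inside the amplified segment."""
    first, last = codon_span(residue)
    return amplicon[0] <= first and last <= amplicon[1]


# Inclusive coordinates, so the length is end - start + 1.
amplicon_length = AMPLICON[1] - AMPLICON[0] + 1

mutations = {"K417N": 417, "S477N": 477, "E484K": 484, "N501Y": 501, "Y508F": 508}
for name, residue in mutations.items():
    print(name, codon_span(residue), in_amplicon(residue))

# Reported proportions, rounded to whole percentages.
sequenced_pct = round(100 * 153 / 222)
confirmed_pct = round(100 * 142 / 153)
print(amplicon_length, sequenced_pct, confirmed_pct)
```

Under that assumption, every mutation named in the abstract falls inside the 401 bp amplicon, which is consistent with a single primer pair covering all of them.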
Keywords: Illumina; S-gene; SARS-CoV-2 spike gene; Sanger; diagnostic failure; mutations; primer binding site; whole-genome sequencing
Year: 2022 PMID: 35294336 PMCID: PMC9176282 DOI: 10.1099/mgen.0.000774
Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858
Fig. 1. Visual representation of the SeqPatcher workflow. (a) If the input is an .ab1 file then SeqPatcher analyses and infers consensus sequences in fasta format. (b) fasta sequences are aligned to the reference gene and (c) the longest overlap is determined. The location of the overlap relative to the gene is also determined. (d) The reference gene is then aligned to the respective whole genome to determine the location and the precise point of insertion is calculated. (e) Sanger sequence is inserted into the whole genome. SeqPatcher tries to report the correct/highly probable base in cases of conflict and keeps the locations of indels (insertions and deletions) intact during integration. r, ref gene length; rl, ref gene start position; rr, ref gene end position; gl, gene start position in assembly; gr, gene end position in assembly; black solid line, reference gene; dots, missing or gapped region in the sequence.
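Step (e) of the workflow can be illustrated with a deliberately simplified sketch. It assumes the insertion point is already known, that the assembly and the Sanger fragment share one coordinate system, and that there are no indels; the real tool derives the insertion point by alignment and handles indels and conflicts more carefully. The function `patch_assembly` and its `prefer_fragment` option are hypothetical names, not SeqPatcher's API.

```python
def patch_assembly(assembly, fragment, start, prefer_fragment=True):
    """Write `fragment` into `assembly` beginning at 1-based position `start`.

    Simplifying assumptions (not SeqPatcher's actual behaviour): no indels,
    and conflicts are settled by one global preference rather than by
    inspecting trace quality.
    """
    seq = list(assembly.upper())
    offset = start - 1
    if offset < 0 or offset + len(fragment) > len(seq):
        raise ValueError("fragment does not fit inside the assembly")
    for i, base in enumerate(fragment.upper()):
        current = seq[offset + i]
        # Always fill missing assembly bases; overwrite called bases only
        # when the fragment is preferred and itself has a real call.
        if current == "N" or (prefer_fragment and base != "N"):
            seq[offset + i] = base
    return "".join(seq)


# Toy example: positions 5-8 of the assembly are missing (N).
toy_assembly = "ACGTNNNNACGT"
patched = patch_assembly(toy_assembly, "TGGCCA", start=4)
print(patched)
```

Setting `prefer_fragment=False` keeps any base the assembly already called and uses the Sanger fragment only to fill gaps.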
Derivation of the Sanger consensus sequence from the trace files. A fasta sequence is determined for each read and a consensus calculated by alignment of the reads to the reference gene sequence. Base refers to nucleotides A, C, T and G and A/B refer to different bases at that position. A dash (–) represents an indel.

Sanger sequence with forward and reverse reads

| Reference | Forward read | Reverse read | Final outcome |
|---|---|---|---|
| – | Any base | Any base | Any base |
| Base A | Base A | Base A | Based on user input, i.e. based on peak heights or ambiguous base (default: ambiguous base) |
| Base A | Base A | Base B | Based on user input, i.e. based on peak heights or ambiguous base (default: ambiguous base) |
| Base A | Base B | Base A | Base A |
| Base A | Base B | Base B | Base B |
| Base A | Base B | – | Base B |
| Base A | Base A | – | Base A |
| Base A | Ambiguous | Base B | Base B |
| Base A | Base B | Ambiguous | Base B |

Sanger sequence with only one read (either forward or reverse)

| Reference | Read | Final outcome |
|---|---|---|
| – | Any base | Any base |
| Base A | – | – |
| Base A | Base B | Base B |
| Base A | Ambiguous | Based on user input: base A, peak max base, neighbour base or ambiguous base (default: ambiguous base) |
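The single-read rules in the table can be expressed as a small decision function. This is a minimal sketch, not SeqPatcher's implementation: the function name and the `on_ambiguous` option values are hypothetical, "base A" is read here as the reference base, and the peak-height and neighbour-base options are omitted because they need trace data. An ASCII hyphen stands in for the table's dash.

```python
UNAMBIGUOUS = set("ACGT")
GAP = "-"


def resolve_single_read(ref_base, read_base, on_ambiguous="ambiguous"):
    """Pick the output base for one aligned column of a single Sanger read.

    on_ambiguous: "ambiguous" keeps the read's ambiguity code (the default
    in the table); "reference" falls back to the reference base. Both
    option names are illustrative assumptions.
    """
    ref_base, read_base = ref_base.upper(), read_base.upper()
    if ref_base == GAP:
        # Insertion relative to the reference: keep whatever the read has.
        return read_base
    if read_base == GAP:
        # Deletion relative to the reference: keep the gap.
        return GAP
    if read_base in UNAMBIGUOUS:
        # A clear base call wins, whether or not it matches the reference.
        return read_base
    # Ambiguous call: defer to the user's choice.
    if on_ambiguous == "ambiguous":
        return read_base
    if on_ambiguous == "reference":
        return ref_base
    raise ValueError("unknown on_ambiguous option: " + on_ambiguous)


columns = [("-", "G"), ("A", "-"), ("A", "C"), ("A", "R")]
print([resolve_single_read(ref, read) for ref, read in columns])
```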
Fig. 2. Sanger sequencing and analysis. (a) The Sanger sequencing workflow. cDNA, complementary DNA; PCR, polymerase chain reaction; BDT, BigDye Terminator; BDX, BigDye Xterminator; CE, capillary electrophoresis. (b) Violin plot showing the distribution of RT-PCR cycle threshold (Ct) scores for 222 samples; the Ct score for each sample was based on the mean Ct of the three SARS-CoV-2 targets (S gene, N gene, Orf1ab). Quartiles are represented by different shades within each plot; mean and median Ct values are represented by yellow and pink points, respectively. The pink line shows the trend line.
Fig. 3. Summary of the workflow for improving the coverage of WGS data using Sanger sequencing. Data from the sequencer are preprocessed to determine coverage. Sequences with gaps in regions of interest are sent for Sanger sequencing. The results from Sanger sequencing are inserted into the NGS whole genomes using SeqPatcher and the improved consensus of these is published to support genomic epidemiology.
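The triage step in this workflow, deciding which assemblies go for Sanger sequencing, might be sketched as below. It assumes each assembly is already in reference coordinates with missing bases written as N (or a gap character); the function names and the `max_missing` threshold are hypothetical and are not described in the source.

```python
def missing_positions(assembly, region, missing="N-"):
    """Return 1-based positions inside `region` where the assembly has no call.

    `region` is a (start, end) pair, 1-based and inclusive. Assumes the
    assembly shares the reference coordinate system.
    """
    start, end = region
    window = assembly[start - 1:end]
    return [start + i for i, base in enumerate(window) if base.upper() in missing]


def needs_sanger(assembly, region, max_missing=0):
    """Flag an assembly whose region of interest has too many missing bases."""
    return len(missing_positions(assembly, region)) > max_missing


# Toy example; for real genomes the region would be (22698, 23098).
toy = "ACGTNNACGT"
print(missing_positions(toy, (4, 7)), needs_sanger(toy, (4, 7)))
```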