Susana Posada-Céspedes, David Seifert, Ivan Topolsky, Kim Philipp Jablonski, Karin J Metzner, Niko Beerenwinkel.
Abstract
MOTIVATION: High-throughput sequencing technologies are used increasingly, not only in viral genomics research but also in clinical surveillance and diagnostics. These technologies facilitate the assessment of the genetic diversity in intra-host virus populations, which affects transmission, virulence, and pathogenesis of viral infections. However, there are two major challenges in analysing viral diversity. First, amplification and sequencing errors confound the identification of true biological variants, and second, the large data volumes represent computational limitations.
Year: 2021 PMID: 33471068 PMCID: PMC8289377 DOI: 10.1093/bioinformatics/btab015
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Workflow of V-pipe for the analysis of viral HTS data. As input, the pipeline requires raw sequencing data (FASTQ format) and a reference sequence (FASTA format), which defines the indexing frame for the reporting of variants. (A) For quality control, low-quality bases are removed from both termini. (B) Reads are aligned employing a reference-guided approach using ngshmmalign. (C) For the read alignment, the reference sequence may be provided or it can be built de novo from the read data. (D) Alternatively, reads can be aligned using BWA MEM or Bowtie 2. (E) Intermediate results are provided in the form of a consensus sequence per sample, a multiple sequence alignment of all consensus sequences, and the minor allele frequencies (MAFs) for all samples and all loci. (F) Single-nucleotide variants (SNVs) are identified, and (G) local and (H) global haplotypes are reconstructed. Whenever applicable, the pertinent reporting frame (con.: consensus, prof.: profile, ref.: reference) is indicated in red. Dotted lines indicate alternative processing steps included in V-pipe.
Fig. 2. Performance of SNV detection on simulated datasets. (A) We compare ngshmmalign, BWA MEM and Bowtie 2 for read alignment, and fix ShoRAH for mutation calling. (B) We use ngshmmalign for the read alignment, and compare ShoRAH with LoFreq for mutation calling. In both panels, scores are averaged over datasets with various numbers of haplotypes based on HIV-1 subtype B sequences from subjects 1051 and BORI0637, and HCV genotype 1a sequences. Results are shown for a read coverage of 10 000× and for the different distributions of haplotype frequencies described in Methods Section 2.4 (Dirichlet; Uniform; Equal prop.: all haplotype frequencies equal). The error bars correspond to the standard error.
Evaluating mutation calling using V-pipe
| Dataset | Recall | Precision | Specificity |
|---|---|---|---|
| A-10k | 0.860 | 0.944 | 0.992 |
| A-100k | 0.873 | 0.631 | 0.925 |
| B-10k | 1 | 0.770 | 0.977 |
| B-100k | 1 | 0.403 | 0.885 |
Note: Dataset A was generated using five overlapping amplicons, and dataset B using only the second amplicon, which mainly covers HIV-1 pol. The suffix indicates the initial number of RNA copies.
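The three scores in the table follow the standard confusion-matrix definitions. The sketch below shows how they could be computed from per-position call counts; the function name and the example counts are hypothetical and are not taken from the study.

```python
def snv_calling_metrics(tp, fp, fn, tn):
    """Return recall, precision and specificity from confusion-matrix counts.

    tp: true SNVs that were called       fp: called SNVs that are not real
    fn: true SNVs that were missed       tn: non-variant positions left uncalled
    A score is None when its denominator is zero.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "recall": ratio(tp, tp + fn),        # fraction of true SNVs recovered
        "precision": ratio(tp, tp + fp),     # fraction of calls that are true
        "specificity": ratio(tn, tn + fp),   # fraction of negatives kept negative
    }


# Hypothetical counts: every true SNV is found, but some false calls are made.
example = snv_calling_metrics(tp=8, fp=2, fn=0, tn=90)
```

With counts like these, recall is 1 while precision stays below 1, which is the same qualitative pattern as the B datasets in the table.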
Fig. 3. Evaluation of the performance of V-pipe on HTS data derived from the five-virus-mix. (A) Precision of SNV calls as a function of SNV frequency. (B) Distribution of inferred frequencies of unique SNVs for each haplotype in the mix for dataset A-10k. Green diamonds show the corresponding average frequency. The values reported in Di Giallonardo et al. for the mean and standard deviation of the frequencies estimated from Illumina reads are indicated in red.
Fig. 4. Representation of the reconstructed viral haplotypes in a region of the HIV-1 p17 gene from longitudinal samples of patient p2 of Zanini et al. Discs represent inferred haplotypes, and their sizes reflect the relative abundances. Haplotypes are placed in a two-dimensional plane with the aim of preserving all pairwise Hamming distances as much as possible, using multi-dimensional scaling as implemented in scikit-learn (Pedregosa et al., 2011). The colours indicate the number of days after infection, and encircled discs denote haplotypes identified by V-pipe but not by Zanini et al.
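The layout described in the Fig. 4 caption can be sketched as follows, assuming equal-length, pre-aligned haplotype sequences. The helper names `hamming`, `hamming_matrix` and `embed_2d` are hypothetical, and the MDS settings are illustrative defaults, not the ones used for the figure; parameter names of `sklearn.manifold.MDS` may also differ between scikit-learn versions.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def hamming_matrix(haplotypes):
    """Symmetric matrix of pairwise Hamming distances."""
    return [[hamming(a, b) for b in haplotypes] for a in haplotypes]


def embed_2d(distances, seed=0):
    """Place items in 2D so that pairwise distances are preserved as well as possible.

    Requires NumPy and scikit-learn; both are imported lazily so the distance
    helpers above remain usable without them.
    """
    import numpy as np
    from sklearn.manifold import MDS

    # Assumption: the matrix is passed as a precomputed dissimilarity.
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=seed)
    return mds.fit_transform(np.asarray(distances, dtype=float))
```

Calling `embed_2d(hamming_matrix(haplotypes))` returns one pair of coordinates per haplotype, which could then be drawn as discs scaled by relative abundance.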