Literature DB >> 33471068

V-pipe: a computational pipeline for assessing viral genetic diversity from high-throughput data.

Susana Posada-Céspedes^1,2, David Seifert^1,2, Ivan Topolsky^1,2, Kim Philipp Jablonski^1,2, Karin J Metzner^3,4, Niko Beerenwinkel^1,2.

Abstract

MOTIVATION: High-throughput sequencing technologies are used increasingly, not only in viral genomics research but also in clinical surveillance and diagnostics. These technologies facilitate the assessment of the genetic diversity in intra-host virus populations, which affects transmission, virulence, and pathogenesis of viral infections. However, there are two major challenges in analysing viral diversity. First, amplification and sequencing errors confound the identification of true biological variants, and second, the large data volumes represent computational limitations.
RESULTS: To support viral high-throughput sequencing studies, we developed V-pipe, a bioinformatics pipeline combining various state-of-the-art statistical models and computational tools for automated end-to-end analyses of raw sequencing reads. V-pipe supports quality control, read mapping and alignment, low-frequency mutation calling, and inference of viral haplotypes. For generating high-quality read alignments, we developed a novel method, called ngshmmalign, based on profile hidden Markov models and tailored to small and highly diverse viral genomes. V-pipe also includes benchmarking functionality providing a standardized environment for comparative evaluations of different pipeline configurations. We demonstrate this capability by assessing the impact of three different read aligners (Bowtie 2, BWA MEM, ngshmmalign) and two different variant callers (LoFreq, ShoRAH) on the performance of calling single-nucleotide variants in intra-host virus populations. V-pipe supports various pipeline configurations and is implemented in a modular fashion to facilitate adaptations to the continuously changing technology landscape. AVAILABILITY: V-pipe is freely available at https://github.com/cbg-ethz/V-pipe. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Entities: Chemical Disease Gene Species

Year: 2021 PMID： 33471068 PMCID： PMC8289377 DOI： 10.1093/bioinformatics/btab015

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 Introduction

RNA viruses are regarded as important models in evolutionary biology, because they exhibit short generation times and higher mutation rates compared to cellular organisms (Duffy ). As a result, they evolve rapidly and often exist within their host as a collection of distinct, yet genetically related, viral strains (Lauring and Andino, 2010). This heterogeneity impacts viral transmission, virulence and pathogenesis (Rozera ; Tsibris et al., 2009; Vignuzzi et al., 2006), and is of relevance in the context of immune escape (Kuroda ; Nowak ), vaccine design (Gaschen, 2002) and drug resistance (Mason ). High-throughput sequencing (HTS) technologies have opened up new possibilities for in-depth characterization of the genetic diversity of virus samples (Barzon ; Capobianchi ). However, analysing viral HTS data is complicated by large volumes of data, short length of the sequencing reads and the high amplification and sequencing error rates relative to the expected intra-host viral diversity (Goodwin ). Therefore, statistical and computational challenges remain in disentangling true biological variation from technical errors, especially for low-frequency variants (Beerenwinkel ). Several methods have been proposed for studying genetic diversity in virus populations (Eliseev ; Posada-Céspedes ), but some of these tools appear no longer actively maintained or limited in terms of robustness and usability. Also, the analysis of HTS data involves additional steps implemented by separate tools, e.g. quality control and read alignment. A prerequisite for incorporating such HTS data into routine diagnostics is to standardize the processing steps end-to-end from raw data input to final output. To this end, several bioinformatics pipelines have been developed for, e.g. virus discovery and metagenomics applications (Ho and Tzanetakis, 2014; Li ; Maarala ; Naccache ; Zhao et al., 2017; Zheng et al., 2017). Other pipelines focus on either (i) the construction of the consensus sequence via reference-guided assembly (Wan et al., 2015; Wymant ), (ii) the identification of single nucleotide variants (SNVs) (Howison ; Huber ; Taylor et al., 2019) or (iii) the reconstruction of viral haplotypes (Jayasundara ; Mangul ). Often these tools either target specific viruses (Howison ; Huber ; Taylor et al., 2019; Wymant ) or provide limited support for broader applications. For HIV drug resistance testing, Lee have recently evaluated various bioinformatics pipelines for this specific application. An important step in inferring viral genetic diversity is the alignment of HTS reads. There are two general strategies for read alignment, namely reference-based approaches and de novo assembly. Shortcomings of the former are the introduction of biases due to low similarity between the reference genome and viral haplotypes (Archer ), as well as inaccurate alignment of reads containing insertions or deletions (indels) (Posada-Céspedes ). On the other hand, limitations of de novo assembly include increased sensitivity to chimeric reads leading to erroneous contigs, and a segmented coverage of the genome by the assembled contigs (Wymant ). To address these limitations, we have developed a read aligner, called ngshmmalign. The aligner borrows ideas from the alignment of protein families to align HTS reads from small and highly diverse genomes. Ngshmmalign models the multiple read alignment as a profile hidden Markov model (HMM) and aligns all reads to this profile. By doing so, ngshmmalign accounts for local heterogeneity, including structural variations. To support reproducible viral genomics studies for basic research and clinical diagnostics, we have developed V-pipe, a flexible bioinformatics pipeline integrating several tools for analysing viral HTS data. V-pipe allows for assessing viral diversity at the level of SNVs, short variant sequences (or local haplotypes) and long-range haplotypes (or global haplotypes). To provide the flexibility required for adapting to future methodological and technological developments, V-pipe provides a modular and extensible framework which facilitates the introduction of new tools. V-pipe also supports the comparative assessment of different workflows in a standardized environment. To this end, it contains modules to generate synthetic data and to assess the accuracy of the computational inference. We demonstrate the benchmark capabilities of V-pipe by assessing the impact of different read aligners and variant callers on the accuracy of SNV calling. While our focus here is on SNV calling, the performance of various methods for viral haplotype reconstruction has been recently evaluated by Eliseev . We validate V-pipe using sequencing data from a control sample composed of five well-defined HIV-1 strains (Di Giallonardo ), and to demonstrate its applicability, we process 92 HIV-1 whole-genome sequencing samples from 11 patients previously reported in Zanini et al. (2015).

2 Materials and methods

We first summarize the core components of V-pipe. We then present the read aligner ngshmmalign and explain the benchmarking functionalities of the pipeline. Lastly, we describe the simulation setup and the control sequencing datasets used to assess the performance of different components of V-pipe.

2.1 Computational pipeline

V-pipe uses the Snakemake workflow management system (Köster and Rahmann, 2012), which enables the well-controlled and scalable execution of the pipeline in local as well as in high-performance computing (HPC) environments. The pipeline integrates various open-source software packages developed for analysing virus samples. For deployment, we provide Conda (https://conda.io) environments to automatically download and install the required tools. The environments include all dependencies and versions which is key for full reproducibility of the analysis settings. As input, V-pipe requires the raw sequencing data, a reference sequence and a configuration file containing user-defined options. V-pipe supports both single-end and paired-end FASTQ files. In short, the analysis workflow implemented by V-pipe involves the following main steps: (i) quality control, (ii) reference-guided mapping and alignment of sequencing reads and (iii) identification of SNVs and reconstruction of viral haplotypes (Fig. 1).

Fig. 1.

Workflow of V-pipe for the analysis of viral HTS data. As input, the pipeline requires raw sequencing data (FASTQ format) and a reference sequence (FASTA format), which defines the indexing frame for the reporting of variants. (A) For quality control, low-quality bases are removed from both termini. (B) Reads are aligned employing a reference-guided approach using ngshmmalign. (C) For the read alignment, the reference sequence may be provided or it can be built de novo from the read data. (D) Alternatively, reads can be aligned using BWA MEM or Bowtie2. (E) Intermediate results are provided in the form of a consensus sequence per sample, a multiple sequence alignment of all consensus sequences, and the minor allele frequencies (MAFs) for all samples and all loci. (F) Single-nucleotide variants (SNVs) are identified, and (G) local and (H) global haplotypes are reconstructed. Whenever applicable, the pertinent reporting frame (con.: consensus, prof.: profile, ref: reference) is indicated in red. Dotted lines indicate alternative processing steps included in V-pipe Relatively high error rates are a limitation of HTS technologies. Error sources include the reverse transcription of viral RNA into cDNA, the amplification of the material by multiple cycles of PCR, and the sequencing process itself (Beerenwinkel ). To avoid propagating biases to the downstream analysis steps, it is imperative to include quality checks and filtering steps. To this end, we include FastQC (Andrews, 2019) to provide quality control reports and PRINSEQ (Schmieder and Edwards, 2011) for removing low-quality or ambiguous bases from both termini of reads. For aligning sequencing reads, we developed a novel read aligner called ngshmmalign which is described below. This aligner requires as input a reference sequence to approximately locate reads. The reference sequence can be provided by the user or constructed de novo using the VICUNA software (Yang et al., 2012). We also include two alternative aligners, namely BWA MEM (Li, 2013) and Bowtie2 (Langmead and Salzberg, 2012), to choose from according to particular needs and computational resources. In addition to the alignment file, ngshmmalign outputs two types of consensus sequences constructed from the aligned reads, namely by (i) using a majority vote at each position, and (ii) incorporating ambiguous bases. V-pipe produces these consensus sequences in case an alternative aligner is chosen. We use lowercase characters in the consensus sequences to mark positions with read counts below 50 reads, by default. To report ambiguous bases, every base with a relative frequency greater or equal to a certain threshold, 5% by default, is accounted for, and the corresponding IUPAC ambiguity code is used. V-pipe also reports basic summary statistics, including the number of reads retained after quality control, the number of aligned reads, the region of the genome covered with a minimum number of aligned reads, and the minor allele frequencies per locus for all analysed samples. We derive SNV calls from local haplotype reconstruction using ShoRAH (Zagordi et al., 2011). By considering co-occurring variants at multiple loci simultaneously, ShoRAH can effectively lower the SNV detection limit, i.e. the minimum frequency at which variants can be called reliably (McElroy ). Alternatively, V-pipe also includes the variant caller LoFreq (version 2) (Wilm et al., 2012), which uses a position-wise and hence faster approach. A more complete characterization of the structure of a virus population consists in reconstructing the sequences and relative abundances of all viral haplotypes. For this purpose, we incorporate the quasispecies assembly tools HaploClique (Töpfer et al., 2014) and SAVAGE (Baaijens ) to support reference-based and de novo haplotype reconstruction, respectively. Both methods use the overlap between reads to iteratively extend local haplotypes into full-length haplotypes. As an additional feature, V-pipe includes a module to detect flow-cell cross contamination. To do so, reads are aligned to a panel of reference sequences containing suspected contaminants. We then detect and report the number of reads preferentially mapped to other references. The code developed to support this functionality, as well as various other steps of our pipeline, are maintained as an independent Python package called smallgenomeutilities (Supplementary Section S1.1). V-pipe allows for constructing, maintaining and using reproducible and traceable data analysis pipelines. In addition to providing a reproducible workflow and pinning dependency versions, V-pipe supports numerical reproducibility by fixing the seeds of all processing steps involving random sampling. Traceability is achieved by providing a well-documented pipeline in a well-defined framework and by ensuring meaningful log output in all processing steps. We have also made additional efforts to improve the reliability of individual tools used as part of our pipeline, such as ShoRAH and HaploClique (Supplementary Section S1.2). V-pipe is an open source project, the source code is freely available at https://github.com/cbg-ethz/V-pipe. Documentation including user guides can be found on the Wiki (https://github.com/cbg-ethz/V-pipe/wiki) and tutorials for specific viruses are available on our project page (https://cbg-ethz.github.io/V-pipe/). We also promote community involvement by maintaining a mailing list, providing user support and offering workshops.

2.2 ngshmmalign: read aligner

Ngshmmalign performs a three-step alignment (Supplementary Fig. S1). In the first step, reads are mapped to the reference genome to roughly determine their position. This initial mapping is done using a k-mer index of the reference genome. We then determine the mean and standard deviation of the returned location on the genome. If the standard deviation is above a certain threshold, the k-mer index match is considered suboptimal, and we perform a full genome-wide exhaustive alignment (Supplementary Section S1.3.2). After this initial mapping, reads are aligned in a semi-global mode using the Smith-Waterman algorithm. In the second step, the genome is partitioned into overlapping windows, and reads are assigned to windows based on the read-window overlaps. A multiple sequence alignment of the reads is performed independently for each of the windows by employing the L-INS-i iterative refinement approach implemented by MAFFT (Katoh and Standley, 2013) (Supplementary Section S1.3.3). We then infer the parameters of the profile HMM in a supervised manner, by assuming that the multiple read alignment represents a local sample of the profile HMM. In the third step, the final read alignment is obtained by re-aligning all reads to the profile HMM (Supplementary Section S1.3.4). The source code of ngshmmalign is available at https://github.com/cbg-ethz/ngshmmalign. For every dataset analysed within a single execution of the pipeline, ngshmmalign reports the read alignment using a position numbering relative to the profile. To standardize the position numbering, V-pipe performs a lift-over to report the alignment relative the user-specified reference sequence. To do so, we construct and use a multiple sequence alignment, containing the reference sequence and all consensus sequences from datasets included in the data analysis. This multiple sequence alignment is also an intermediate result of V-pipe, which can be used for other applications, such as phylogenetic analyses.

2.3 Benchmarking

V-pipe provides benchmarking functionality by including two additional modules: (i) simBench can simulate sequencing reads from a mock virus population, and (ii) testBench can evaluate the accuracy of the inference results. SimBench supports three modes. The first mode can be regarded as a simple quasispecies model (Domingo ). A master sequence may be provided or generated by sampling a user-defined number of nucleotides uniformly at random. Simulated haplotype sequences are generated from the master sequence by introducing substitutions, insertions and deletions with user-configurable rates. Additionally, hyper-variable regions are simulated in the form of long deletions of variable length. In the second mode, the simulated haplotype sequences are sampled from a perfect binary tree. Every child node is generated from the parent sequence by introducing substitutions, insertions and deletions with user-configurable rates. Here, we emulate explicitly the hierarchical evolutionary relationships among viral haplotypes. In the third mode, haplotype sequences are defined a priori and given as input to the pipeline, e.g. based on other models of viral evolution or on known viral sequences. The haplotype relative abundances are either (i) set to equal proportions (i.e. where n is the number of haplotypes), (ii) drawn from a Dirichlet distribution with concentration parameters for all haplotypes by default or (iii) obtained using a geometric series with a given common ratio (0.75 by default). After haplotype sequences have been generated, we use the ART software (Huang ) to simulate either single-end or paired-end reads with configurable read length. ART is a read simulator with various built-in, technology-specific read error models and supports common sequencing errors, such as base substitutions, insertions and deletions. We simulate reads from every individual haplotype with read coverage proportional to its relative abundance. Instead of simulating sequencing reads, a user can also provide FASTQ files as input for the benchmark. This option allows for providing sequences simulated by different means, and it supports the analysis of control sequencing experiments with known haplotypes and relative abundances. The evaluation module testBench is used to assess the accuracy of the inference results by comparing them to the ground truth. For SNV calling, we obtain a list of true SNVs by constructing a multiple sequence alignment of the underlying haplotypes and the reference sequence. Positions of expected SNVs are reported relative to the reference sequence. We then report the number of true positive, false positive, false negative and true negative SNVs, as well as the inferred versus the expected SNV frequencies, the frequencies of false positives and the number of false negatives per underlying haplotype. V-pipe supports two modes for carrying out the benchmark, either evaluating a single pipeline arrangement at a time (vpipeBench), or multiple pipeline arrangements simultaneously (vpipeBenchRunner). Although the generation of synthetic datasets is reproducible, the latter mode should be preferred to ensure that other configuration settings are kept constant across different pipeline arrangements (i.e. combinations of processing steps).

2.4 Simulated datasets

We employ simBench to generate simulated reads, varying the number of haplotypes (ranging from 8 to 60), their relative abundances and the total read coverage (either 10 000× or 40 000×, Supplementary Section S1.4). We use equal proportions for the haplotype abundances (denoted as Equal prop.) or sample their frequencies from a Dirichlet distribution. We either use a symmetric Dirichlet distribution with for all i (denoted as Uniform), or choose one haplotype at random and assign to it a greater weight ( and for all ) to emulate a dominant viral strain (denoted as Dirichlet). For all datasets, we simulate paired-end reads with 250 bp read length by using ART with the built-in quality profile for the MiSeq platform. The datasets are based on HIV-1 or HCV sequences (Supplementary Table S1). For HIV-1, we employ sequences of the subtype B envelope glycoprotein (env) gene obtained for subjects 1051 and BORI0637 in Lee by single-genome amplification. For HCV, we use sequences from naturally occurring HCV genotype 1 subtype a (1a) E1E2 genes (El-Diwany ). Emulating populations using sequences derived from plasma samples of individual patients allows us to mimic the structure of viral populations more faithfully. Moreover, testing different viruses is crucial, as they may display different mechanisms of evolution and, hence, distinct forms of genetic variation.

2.5 Control sample for pipeline validation on real datasets

The control sample consists of an in vitro mixture of five known HIV-1 strains mixed at equal proportions. Four sequencing experiments were carried out starting with approximately 104 (denoted 10 K) and 105 (denoted 100 K) HIV-1 RNA copies. The samples were sequenced by using the Illumina MiSeq platform in paired-end read mode (2 × 250 bp length, v2 kit). The protocol described by Di Giallonardo was employed for the amplification and sequencing. Primers were designed to cover almost the full HIV genome in five overlapping segments. Two types of sequencing experiments were carried out: one including all five amplicons (denoted A) and another one using only amplicon B (denoted B). This 2 × 2 design gives rise to four datasets referred to as A-10K, A-100K, B-10K and B-100K.

3 Results

We demonstrate the benchmarking capabilities of V-pipe and evaluate the accuracy of SNV detection using different read aligners and mutation callers. In addition, we analyse an in vitro mixture of well-defined viral strains to assess the performance of one particular pipeline configuration on actual sequencing data. We also employ V-pipe to process publicly available longitudinal data from 11 patients (Zanini ) to demonstrate the applicability of the pipeline as well as its run time performance.

3.1 Simulation studies

We simulate reads from in silico mixtures of sequences derived from two different viruses, namely HIV-1 and HCV (Section 2.4). We align simulated reads using ngshmmalign and compare it against two widely used read aligners, namely BWA MEM and Bowtie 2. For this comparison, we employ vpipeBenchRunner for the simultaneous execution of all pipeline configurations. Averaged over all simulated datasets, the evaluated tools align more than 89% of the read pairs concordantly (Supplementary Table S2). Although BWA MEM reports the highest percentage of aligned reads, ngshmmalign aligns a larger portion of sequenced bases resulting in a higher average coverage (Supplementary Fig. S2). To investigate potential read alignment bias due to differences in sequence similarity to the reference sequence, we report the fraction of aligned reads and aligned bases per haplotype. For the HCV-based datasets, sequences exhibit a broad range of divergence from the reference strain (0.005–0.112). BWA MEM aligns most of the reads regardless of the divergence from the reference (Supplementary Fig. S3A), but a large fraction of the bases are soft-clipped, whereas ngshmmalign aligns a higher fraction of the bases for all haplotypes (Supplementary Fig. S3B). Since the main focus of V-pipe is to infer viral genetic diversity, we evaluate the accuracy of the read aligners based on the score of detecting SNVs using ShoRAH. The score is the harmonic mean of precision and recall. To make the scores comparable, we report the performance metric for the union of all genomic loci covered by at least one of the aligners for each of the evaluated conditions. We evaluate two read coverages (10 000× and 40 000×) and three strategies to generate the underlying haplotype abundances. In most cases, ngshmmalign outperforms BWA MEM and Bowtie 2 in terms of the score (Fig. 2A), and the difference in the scores is statistically significant (corrected P-value < 0.05, Wilcoxon signed-rank test, Supplementary Section S1.5.1). On the other hand, we do not observe substantial differences in the performance of the individual aligners while varying the distribution of haplotype frequencies, the number of haplotypes or the evaluated coverages (Supplementary Figs S4–S6).

Fig. 2.

Performance of SNV detection on simulated datasets. (A) We compare ngshmmalign, BWA MEM and Bowtie 2 for read alignment, and fix ShoRAH for mutation calling. (B) We use ngshmmalign for the read alignment, and compare ShoRAH with LoFreq for mutation calling. In both panels, scores are averaged over datasets with various numbers of haplotypes based on HIV-1 subtype B sequences from subjects 1051 and BORI0637, and HCV genotype 1a sequences. Results are shown for a read coverage of 10 000× and for different distributions of haplotype frequencies as described in the Methods Section 2.4 (Dirichlet: . Uniform: . Equal prop.: all haplotype frequencies equal). The error bar corresponds to the standard error For the evaluations, in addition to single-nucleotide substitutions, we account for position-wise deletions, by including the gap symbol as an alternative base. We further compare the performance in identifying such deletions and find that ngshmmalign outperforms BWA MEM and Bowtie2 for all evaluated conditions (Supplementary Fig. S10A). In addition to the position-wise performance, we also account for the length of individual deletion events and again find ngshmmalign to perform best (Supplementary Fig. S11A). Next, we focus on mutation calling and compare the accuracy of SNVs obtained by using ShoRAH versus LoFreq, while fixing ngshmmalign for the read alignment. We observe significant differences in the scores for most datasets based on HIV-1 sequences from subject 1051 (corrected P-value < 0.05, Wilcoxon signed-rank test, Supplementary Table S3); otherwise both tools appear to perform equally well (Fig. 2B and Supplementary Fig. S7). Similar results are obtained when comparing performance based on detection of deletions; ShoRAH displays higher scores than LoFreq for most datasets based on HIV-1 sequences from subject 1051 (Supplementary Figs S10B and S11B). Small discrepancies in the scores can be attributed to differences in recall, whereas both tools show almost perfect precision (Supplementary Figs S8 and S9). Although aligning reads with ngshmmalign and performing mutation calling with ShoRAH resulted in better scores in most cases, we note that there is a trade-off between accuracy and computational resources (Supplementary Section S1.5).

3.2 Validation of V-Pipe on a mixture of five HIV-1 strains

We employ V-pipe using ngshmmalign with a de novo constructed reference sequence, and ShoRAH for SNV calling. In all cases, V-pipe detected more than 86% of the expected SNVs, with almost perfect specificity (Table 1). In datasets for which only amplicon B is sequenced (B-10k and B-100k), V-pipe reports perfect recall, whereas for datasets sequenced with all sets of primers (A-10k and A-100k), V-pipe misses a small fraction of the expected SNVs. The missed variants are predominately located at the genome termini, which correspond to regions of lower coverage. We also observe a decrease in precision for datasets A-100k, B-10k and B-100k. Most of the falsely reported SNVs correspond to single-nucleotide deletions at very low frequencies, whereas such errors are less prominent in sample A-10k. When inspecting the precision of the mutation calls as a function of the variant frequencies, we find that a precision higher than 98% can be attained for SNVs with frequencies at least 0.5% for all the analysed datasets (Fig. 3A). In addition to detecting most of the expected variants, we find the frequencies of unique SNVs per haplotype reported by ShoRAH to be in good agreement with the relative abundances originally reported by Di Giallonardo (Fig. 3B).

Table 1.

Evaluating mutation calling using V-pipe

Dataset	Recall	Precision	Specificity
A-10k	0.860	0.944	0.992
A-100k	0.873	0.631	0.925
B-10k	1	0.770	0.977
B-100k	1	0.403	0.885

Note: Datasets A and B result from using five overlapping amplicons (A) or only the second amplicon covering mainly HIV-1 pol (B), respectively, and the suffix indicates the initial amount of RNA copies.

Fig. 3.

Evaluation of the performance of V-pipe on HTS data derived from the five-virus-mix. (A) Precision of SNV calls as a function of SNV frequency. (B) Distribution of inferred frequencies of unique SNVs for each haplotype in the mix for dataset A-10k. Green diamonds show the corresponding average frequency. The values reported in Di Giallonardo for the mean and standard deviation of the frequencies estimated from Illumina reads are indicated in red Evaluating mutation calling using V-pipe Note: Datasets A and B result from using five overlapping amplicons (A) or only the second amplicon covering mainly HIV-1 pol (B), respectively, and the suffix indicates the initial amount of RNA copies.

3.3 Application to clinical samples

To test V-pipe on clinical samples, we analyse publicly available datasets corresponding to longitudinal samples from one of the patients studied by Zanini (ENA study accession number PRJEB9618). We employ V-pipe to infer local haplotypes on the 6 longitudinal datasets available for patient p2, using ngshmmalign and ShoRAH for local haplotype reconstruction. We compare the inferred haplotypes to the haplotype sequences reported by Zanini on a region of the p17 gene spanning nucleotides 872–1072 with respect to the HXB2 reference genome. This region has been arbitrarily chosen from regions displaying genetic diversity. After filtering out inferred haplotypes with a posterior probability smaller than 0.9 and an average read count of less than 10 reads, we find a perfect match between haplotype sequences reconstructed using V-pipe and the sequences previously reported by Zanini using manually curated read alignments. However, V-pipe finds 1, 11, 3 and 8 additional haplotypes in the samples taken 561, 1255, 1628 and 2018 days after the estimated date of infection, respectively. The estimated frequencies of these haplotypes range from 0.08% to 6%. For the matching haplotypes, the absolute error in the estimated abundances is below 4% for all the considered datasets (Supplementary Fig. S12). We visualise the reconstructed local haplotypes by representing pairwise Hamming distances in a two-dimensional plane (Fig. 4). We observe a drift in sequence space from the initial haplotype (in red, Fig. 4) as time progresses. Sequences of haplotypes recovered after 561 and 1255 days of infection are situated closer to the initial haplotype sequence, whereas sequences corresponding to later time points (after 1628 and 2018 days of infection) are further away.

Fig. 4.

Representation of the reconstructed viral haplotypes in a region of the HIV-1 p17 gene from longitudinal samples of patient p2 of Zanini . Discs represent inferred haplotypes, and their size reflect the relative abundances. Haplotypes are placed in a two-dimensional plane with the aim of preserving all pairwise Hamming distances as much as possible, using multi-dimensional scaling as implemented in scikit-learn (Pedregosa, 2011). The colours indicate the number of days after infection and encircled discs denote haplotypes identified by V-pipe but not by Zanini ) To illustrate the large-scale applicability of the pipeline, we note that using V-pipe to process all 92 datasets reported in Zanini yielded an average run time per sample of 16 h on 12-core Intel Xeon E5-2680v3 processors (2.5–3.3 GHz), for an average coverage across samples of . Depending on the computational resources available, the total throughput can be largely independent of the number of samples, because the processing steps for individual datasets can be executed in parallel. Moreover, when computational resources are limited, V-pipe can be executed using alternative aligners, such as BWA MEM. For these particular datasets, BWA MEM aligned reads within 86 s on average, whereas ngshmmalign took on average 4 h 38 min. Similarly, LoFreq can be used for mutation calling as opposed to ShoRAH. In this case, ShoRAH took on average 11 h 7 min using 12 threads, whereas LoFreq took 2 h 42 min executed as a single-threaded program.

4 Discussion and conclusions

Incorporating HTS technologies into viral genomics studies provides new opportunities to characterize intra-host virus populations in unprecedented depth. However, the applicability of these technologies is challenged by the amount and quality of the resulting data. In a clinical diagnostics setting, these large volumes of data need to be analysed accurately and efficiently within a short period of time. To ensure reproducibility and traceability of the analysis workflows, we developed and implemented the bioinformatics pipeline V-pipe. V-pipe is tailored to studying intra-host diversity from viral HTS data. Although workflows for either the identification of minor variants (Huber ) or the reconstruction of viral haplotypes (Jayasundara ) have been proposed, to our knowledge, V-pipe is the first pipeline assessing viral diversity at all spatial levels of the viral genome, i.e. SNVs, as well as local and global haplotypes. We also present a novel read alignment approach tailored to relatively small, yet highly diverse genomes. The aligner prioritizes accuracy over speed and is based on profile hidden Markov models, which allow for capturing features shared among related sequences, such as mutations and indels, through position-specific probabilities. We evaluated ngshmmalign on simulated reads from mock viral populations that attempt to mimic hyper-variable regions of HIV-1 (env gene) and HCV (E1E2 region). The performance of the aligner was indirectly evaluated based on the detection of SNVs from the aligned reads. In most cases, the F1 score was found to be larger than 0.8, and SNV calls based on alignments obtained using ngshmmalign also reported a higher score compared to alignments by BWA MEM or Bowtie 2. Furthermore, comparing performance outcomes based on HIV-1 datasets, showing variation in the form of indels, with HCV-based datasets, the advantage of using ngshmmalign over conventional aligners became apparent. We further evaluated the performance based solely on detection of deletions, finding that ngshmmalign outperforms the other evaluated aligner. We thus argue that ngshmmalign should be the preferred choice for populations containing indels as sources of variation. Although ngshmmalign performed better under the evaluation conditions, it is possible that the alignments produced by BWA MEM or Bowtie 2 could be further improved by fine-tuning their hyper-parameters. Nevertheless, our results suggest that the SNV detection accuracy is to a larger extent influenced by the upstream alignment step than by the choice of the SNV caller. More importantly, with this simulation study, we demonstrated the benchmarking component implemented in V-pipe, which can be useful for assessing the actual run times of individual tools and adjusting the pipeline to meet specific requirements, e.g. on accuracy of the results or availability of computational resources. Intrinsic parameters of the analysed sample such as diversity and number of mutations can generally impact the performance of the individual tools integrated in V-pipe. In our simulation study, we analysed synthetic data from populations ranging from 8 to 60 haplotypes, as well as different distributions for the haplotype frequencies, and did not find substantial changes in the F1 score (Supplementary Figs S4 and S7). Using V-pipe, we were able to reconstruct the genetic diversity present in a five-virus-mix control sample dataset with high recall. While we observed lower precision, this could be improved substantially by filtering out deletion calls with frequencies below 0.5%. Choosing such thresholds for mutation and structural variant calling is highly dependent on the HTS protocols, but detecting SNVs at frequencies as low as 1% is already very close to the Illumina sequencing error rate. In addition to simulated datasets and control sequencing samples, we demonstrated the applicability of the pipeline by processing data from 92 clinical samples resulting in typical run times of 16 h per sample. Thus, we believe that total turnaround times of a few days are feasible using V-pipe’s default configuration (i.e. with ngshmmalign and ShoRAH) and including sample collection and preparation, sequencing and data analysis. Hence, the accuracy as well as the run times of V-pipe make it suitable for many applications. Most studies on viral genetic diversity have been limited to the identification of SNVs. Yet, the occurrence of mutations on the same genome might not be independent and the combined effect of co-occurring mutations might not be additive. On the other hand, it is not entirely obvious how to incorporate haplotype information, e.g. into monitoring epidemics, drug resistance surveillance or supporting treatment decisions. There is, however, supporting evidence for using viral haplotypes to infer transmission pairs (Poon ). V-pipe can support such research towards the understanding of the missing links in viral haplotyping. In general, viral haplotype reconstruction is an active research field. Consequently, new computational methods for this task are regularly being proposed. In addition to novel methods, technological improvements are deployed at a high rate. Therefore, any pipeline needs to be actively maintained and constantly adapted to new requirements. We addressed this aspect from two directions. First, additional efforts have been directed into making the individual components of our pipeline, such as ShoRAH and HaploClique, more performant to handle current sequencing throughputs. Second, V-pipe features a modular and extensible architecture, such that the pipeline can be adapted to incorporate new tools. Furthermore, there is a pressing need to establish standards for data analysis. We thus introduced a benchmark component to support testing of different pipeline configurations. While we mainly focused on performance assessment of mutation calling, the benchmark utilities can be employed in a similar fashion to evaluate haplotype reconstruction, which is subject to future work. One limitation of the pipeline is that it currently focuses on Illumina data. Other sequencing technologies such as Pacific Biosciences and Oxford Nanopore can produce longer reads which have the potential to reduce the complexity of the haplotype reconstruction problem. However, the higher error rates, compared to Illumina sequencing, can be a limiting factor. Combining data from both short-read and long-read sequencing is a promising direction (Viehweger et al., 2019), and future developments should include extending the pipeline to support Pacific Biosciences and Oxford Nanopore data. Given that long-read technologies often do not provide an output in FASTQ format, the minimum changes required entail input file conversion as well as a customized pipeline configuration, e.g. using BWA-MEM as the aligner. As the need arises, dedicated long-read tools such as minimap2 (Li, 2018) or pbmm2 (https://github.com/PacificBiosciences/pbmm2), read assemblers and even base callers should be added to the pipeline. Click here for additional data file.

49 in total

Review 1. Rates of evolutionary change in viruses: patterns and determinants.

Authors: Siobain Duffy; Laura A Shackelton; Edward C Holmes
Journal: Nat Rev Genet Date: 2008-03-04 Impact factor: 53.242

2. MinVar: A rapid and versatile tool for HIV-1 drug resistance genotyping by deep sequencing.

Authors: Michael Huber; Karin J Metzner; Fabienne D Geissberger; Cyril Shah; Christine Leemann; Thomas Klimkait; Jürg Böni; Alexandra Trkola; Osvaldo Zagordi
Journal: J Virol Methods Date: 2016-11-17 Impact factor: 2.014

3. Snakemake--a scalable bioinformatics workflow engine.

Authors: Johannes Köster; Sven Rahmann
Journal: Bioinformatics Date: 2012-08-20 Impact factor: 6.937

4. ART: a next-generation sequencing read simulator.

Authors: Weichun Huang; Leping Li; Jason R Myers; Gabor T Marth
Journal: Bioinformatics Date: 2011-12-23 Impact factor: 6.937

5. Evaluation of haplotype callers for next-generation sequencing of viruses.

Authors: Anton Eliseev; Keylie M Gibson; Pavel Avdeyev; Dmitry Novik; Matthew L Bendall; Marcos Pérez-Losada; Nikita Alexeev; Keith A Crandall
Journal: Infect Genet Evol Date: 2020-03-06 Impact factor: 3.342

Review 6. Diversity considerations in HIV-1 vaccine selection.

Authors: Brian Gaschen; Jesse Taylor; Karina Yusim; Brian Foley; Feng Gao; Dorothy Lang; Vladimir Novitsky; Barton Haynes; Beatrice H Hahn; Tanmoy Bhattacharya; Bette Korber
Journal: Science Date: 2002-06-28 Impact factor: 47.728

7. The evolutionary analysis of emerging low frequency HIV-1 CXCR4 using variants through time--an ultra-deep approach.

Authors: John Archer; Andrew Rambaut; Bruce E Taillon; P Richard Harrigan; Marilyn Lewis; David L Robertson
Journal: PLoS Comput Biol Date: 2010-12-16 Impact factor: 4.475

8. De novo assembly of viral quasispecies using overlap graphs.

Authors: Jasmijn A Baaijens; Amal Zine El Aabidine; Eric Rivals; Alexander Schönhuth
Journal: Genome Res Date: 2017-04-10 Impact factor: 9.043

9. Performance comparison of next generation sequencing analysis pipelines for HIV-1 drug resistance testing.

Authors: Emma R Lee; Neil Parkin; Cheryl Jennings; Chanson J Brumme; Eric Enns; Maria Casadellà; Mark Howison; Mia Coetzer; Santiago Avila-Rios; Rupert Capina; Eric Marinier; Gary Van Domselaar; Marc Noguera-Julian; Don Kirkby; Jeff Knaggs; Richard Harrigan; Miguel Quiñones-Mateu; Roger Paredes; Rami Kantor; Paul Sandstrom; Hezhao Ji
Journal: Sci Rep Date: 2020-01-31 Impact factor: 4.379

10. LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets.

Authors: Andreas Wilm; Pauline Poh Kim Aw; Denis Bertrand; Grace Hui Ting Yeo; Swee Hoe Ong; Chang Hua Wong; Chiea Chuen Khor; Rosemary Petric; Martin Lloyd Hibberd; Niranjan Nagarajan
Journal: Nucleic Acids Res Date: 2012-10-12 Impact factor: 16.971

21 in total

1. Unlocking capacities of genomics for the COVID-19 response and future pandemics.

Authors: Sergey Knyazev; Karishma Chhugani; Varuni Sarwal; Ram Ayyala; Harman Singh; Smruthi Karthikeyan; Dhrithi Deshpande; Pelin Icer Baykal; Zoia Comarova; Angela Lu; Yuri Porozov; Tetyana I Vasylyeva; Joel O Wertheim; Braden T Tierney; Charles Y Chiu; Ren Sun; Aiping Wu; Malak S Abedalthagafi; Victoria M Pak; Shivashankar H Nagaraj; Adam L Smith; Pavel Skums; Bogdan Pasaniuc; Andrey Komissarov; Christopher E Mason; Eric Bortz; Philippe Lemey; Fyodor Kondrashov; Niko Beerenwinkel; Tommy Tsan-Yuk Lam; Nicholas C Wu; Alex Zelikovsky; Rob Knight; Keith A Crandall; Serghei Mangul
Journal: Nat Methods Date: 2022-04 Impact factor: 47.990

2. Quantification of the spread of SARS-CoV-2 variant B.1.1.7 in Switzerland.

Authors: Chaoran Chen; Sarah Ann Nadeau; Ivan Topolsky; Marc Manceau; Jana S Huisman; Kim Philipp Jablonski; Lara Fuhrmann; David Dreifuss; Katharina Jahn; Christiane Beckmann; Maurice Redondo; Christoph Noppen; Lorenz Risch; Martin Risch; Nadia Wohlwend; Sinem Kas; Thomas Bodmer; Tim Roloff; Madlen Stange; Adrian Egli; Isabella Eckerle; Laurent Kaiser; Rebecca Denes; Mirjam Feldkamp; Ina Nissen; Natascha Santacroce; Elodie Burcklen; Catharine Aquino; Andreia Cabral de Gouvea; Maria Domenica Moccia; Simon Grüter; Timothy Sykes; Lennart Opitz; Griffin White; Laura Neff; Doris Popovic; Andrea Patrignani; Jay Tracy; Ralph Schlapbach; Emmanouil T Dermitzakis; Keith Harshman; Ioannis Xenarios; Henri Pegeot; Lorenzo Cerutti; Deborah Penet; Anthony Blin; Melyssa Elies; Christian L Althaus; Christian Beisel; Niko Beerenwinkel; Martin Ackermann; Tanja Stadler
Journal: Epidemics Date: 2021-08-09 Impact factor: 5.324

Review 3. Computational strategies to combat COVID-19: useful tools to accelerate SARS-CoV-2 and coronavirus research.

Authors: Franziska Hufsky; Kevin Lamkiewicz; Alexandre Almeida; Abdel Aouacheria; Cecilia Arighi; Alex Bateman; Jan Baumbach; Niko Beerenwinkel; Christian Brandt; Marco Cacciabue; Sara Chuguransky; Oliver Drechsel; Robert D Finn; Adrian Fritz; Stephan Fuchs; Georges Hattab; Anne-Christin Hauschild; Dominik Heider; Marie Hoffmann; Martin Hölzer; Stefan Hoops; Lars Kaderali; Ioanna Kalvari; Max von Kleist; Renó Kmiecinski; Denise Kühnert; Gorka Lasso; Pieter Libin; Markus List; Hannah F Löchel; Maria J Martin; Roman Martin; Julian Matschinske; Alice C McHardy; Pedro Mendes; Jaina Mistry; Vincent Navratil; Eric P Nawrocki; Áine Niamh O'Toole; Nancy Ontiveros-Palacios; Anton I Petrov; Guillermo Rangel-Pineros; Nicole Redaschi; Susanne Reimering; Knut Reinert; Alejandro Reyes; Lorna Richardson; David L Robertson; Sepideh Sadegh; Joshua B Singer; Kristof Theys; Chris Upton; Marius Welzel; Lowri Williams; Manja Marz
Journal: Brief Bioinform Date: 2021-03-22 Impact factor: 11.622

4. Expasy, the Swiss Bioinformatics Resource Portal, as designed by its users.

Authors: Séverine Duvaud; Chiara Gabella; Frédérique Lisacek; Heinz Stockinger; Vassilios Ioannidis; Christine Durinx
Journal: Nucleic Acids Res Date: 2021-07-02 Impact factor: 16.971

5. PipeCoV: a pipeline for SARS-CoV-2 genome assembly, annotation and variant identification.

Authors: Renato R M Oliveira; Tatianne Costa Negri; Gisele Nunes; Inácio Medeiros; Guilherme Araújo; Fabricio de Oliveira Silva; Jorge Estefano Santana de Souza; Ronnie Alves; Guilherme Oliveira
Journal: PeerJ Date: 2022-04-13 Impact factor: 2.984

6. Heritability of the HIV-1 reservoir size and decay under long-term suppressive ART.

Authors: Chenjie Wan; Nadine Bachmann; Venelin Mitov; François Blanquart; Susana Posada Céspedes; Teja Turk; Kathrin Neumann; Niko Beerenwinkel; Jasmina Bogojeska; Jacques Fellay; Volker Roth; Jürg Böni; Matthieu Perreau; Thomas Klimkait; Sabine Yerly; Manuel Battegay; Laura Walti; Alexandra Calmy; Pietro Vernazza; Enos Bernasconi; Matthias Cavassini; Karin J Metzner; Huldrych F Günthard; Roger D Kouyos
Journal: Nat Commun Date: 2020-11-02 Impact factor: 14.919

Review 7. Managing the life cycle of a portfolio of open data resources at the SIB Swiss Institute of Bioinformatics.

Authors: Chiara Gabella; Severine Duvaud; Christine Durinx
Journal: Brief Bioinform Date: 2022-01-17 Impact factor: 11.622

8. Variant abundance estimation for SARS-CoV-2 in wastewater using RNA-Seq quantification.

Authors: Jasmijn A Baaijens; Alessandro Zulli; Isabel M Ott; Mary E Petrone; Tara Alpert; Joseph R Fauver; Chaney C Kalinich; Chantal B F Vogels; Mallery I Breban; Claire Duvallet; Kyle McElroy; Newsha Ghaeli; Maxim Imakaev; Malaika Mckenzie-Bennett; Keith Robison; Alex Plocik; Rebecca Schilling; Martha Pierson; Rebecca Littlefield; Michelle Spencer; Birgitte B Simen; William P Hanage; Nathan D Grubaugh; Jordan Peccia; Michael Baym
Journal: medRxiv Date: 2021-09-02

9. Viral Haplotypes in COVID-19 Patients Associated With Prolonged Viral Shedding.

Authors: Yingping Wu; Shufa Zheng; Tian Liu; Xueke Liu; Huina Tang; Yutong He; Wei Xu; Lele Li; Wenxu Yu; Ke Xing; Xiaoping Xia
Journal: Front Cell Infect Microbiol Date: 2021-11-03 Impact factor: 5.293

10. Detection and Genome Sequencing of SARS-CoV-2 in a Domestic Cat with Respiratory Signs in Switzerland.

Authors: Julia Klaus; Marina L Meli; Barbara Willi; Sarah Nadeau; Christian Beisel; Tanja Stadler; Herman Egberink; Shan Zhao; Hans Lutz; Barbara Riond; Nina Rösinger; Hanspeter Stalder; Sandra Renzullo; Regina Hofmann-Lehmann
Journal: Viruses Date: 2021-03-17 Impact factor: 5.048