Applying next-generation sequencing to unravel the mutational landscape in viral quasispecies.

Abstract

Next-generation sequencing (NGS) has revolutionized the scale and depth of biomedical sciences. Because of its unique ability for the detection of sub-clonal variants within genetically diverse populations, NGS has been successfully applied to analyze and quantify the exceptionally-high diversity within viral quasispecies, and many low-frequency drug- or vaccine-resistant mutations of therapeutic importance have been discovered. Although many works have intensively discussed the latest NGS approaches and applications in general, none of them has focused on applying NGS in viral quasispecies studies, mostly due to the limited ability of current NGS technologies to accurately detect and quantify rare viral variants. Here, we summarize several error-correction strategies that have been developed to enhance the detection accuracy of minority variants. We also discuss critical considerations for preparing a sequencing library from viral RNAs and for analyzing NGS data to unravel the mutational landscape.

Keywords: Consensus-based error correction; Next-generation sequencing (NGS); Quasispecies; RNA; Rare variants; Viruses

Year: 2020 PMID: 32278821 PMCID: PMC7144618 DOI: 10.1016/j.virusres.2020.197963

Source DB: PubMed Journal: Virus Res ISSN: 0168-1702 Impact factor: 3.303

Background

Generating low-frequency variants or mutations is an evolutionarily conserved, self-protective strategy that allows biological entities at a variety of scales, from mitochondria to tumor cells to viruses, to survive under stressful conditions (Andino and Domingo, 2015; He et al., 2010; Mwenifumbo and Marra, 2013; Salehi et al., 2015; Salk et al., 2018; Woo and Reifman, 2012). Viruses, particularly RNA viruses, possess a great capability to evolve and mutate in order to rapidly respond to host immune selection pressure. Consequently, they generate a population with a large number of variable but closely related genomes, also known as quasispecies (Andino and Domingo, 2015; Woo and Reifman, 2012). Accurate characterization of low-frequency variants could not only provide invaluable insights into molecular mechanisms but also aid clinical decision making (Andino and Domingo, 2015; Godoy et al., 2019; Parker and Chen, 2017; Pawlotsky, 2002; Suzuki et al., 2017). Minority variants in RNA viruses are often generated by error-prone replication (Domingo et al., 2012). Previous studies have shown that, in influenza viruses, minority variants, even those with a frequency below the detection limit of conventional surveillance methods, are associated with antibody escape in vaccinated humans (Dinis et al., 2016) and could cause a large global public health burden (Chambers et al., 2015). However, enhancing the sensitivity and specificity for identifying minority variants remains one of the major challenges in virology.
These viruses, including coronavirus (Li et al., 2020; Wu et al., 2020; Zhu et al., 2020), cytomegalovirus (CMV) (Sahoo et al., 2013), human immunodeficiency virus (HIV) (James et al., 2019; Kyeyune et al., 2016; Rawson et al., 2017), influenza virus (Chambers et al., 2015; Zaraket et al., 2010), hepatitis C virus (HCV) (Itakura et al., 2015), poliovirus (Acevedo et al., 2014), and others, have a superior ability to adapt to a new environment and emerge as drug- and vaccine-resistant mutants. Kyeyune et al. (2016) have shown that poor prognosis can be foreseen by the detection of drug-resistant mutations at a frequency as low as 1 % in an HIV-infected patient. Applications of next-generation sequencing (NGS) in basic and clinical virology research have grown rapidly over the past decade (Houldcroft et al., 2017), particularly for virus discovery (Datta et al., 2015) and diagnosis (Barzon et al., 2013, 2011; Capobianchi et al., 2013; Gardy and Loman, 2018; Kuroda et al., 2010; Prachayangprecha et al., 2014). Compared with conventional gold-standard Sanger sequencing, NGS provides considerably more sequencing reads for a lower cost and allows multiplexing of samples (Shendure et al., 2017). Although NGS technologies enable acquisition of a vast amount of sequencing data, high error rates from 0.1 % to 15 %, depending on platforms and applications, often impede the detection of rare mutations (Salk et al., 2018). To improve the accuracy of NGS for identifying low-frequency viral variants, a variety of error-correction approaches have been developed and applied to investigate viral quasispecies (Table 1).
Several bioinformatics tools or pipelines for variant calling have been developed specifically for studying viral variants (Huber et al., 2017; McElroy et al., 2013; Verbist et al., 2015; Zagordi et al., 2010) and for calculating the complexity of a quasispecies as well as measuring the genetic distance between two similar quasispecies (Marinier et al., 2019). Many more variant callers have been discussed and compared in dedicated reviews or methodological comparison papers (Hwang et al., 2015; Lee et al., 2020; Pereira et al., 2020). Implementing an existing variant-calling tool in NGS data analysis is relatively simple and avoids additional costs for sample preparation. However, error corrections made by variant-calling tools have a low positive predictive value. They are not optimal for amplicon analysis because they are mostly based on the assumption that errors are randomly distributed (Posada-Cespedes et al., 2017). Therefore, another more innovative and accurate approach, named the consensus-based error-correction method, has become increasingly popular in NGS studies (Salk et al., 2018). Three major related approaches have so far been applied in virus quasispecies studies: tag-based sequencing (Geller et al., 2016; Hauck et al., 2018; Jabara et al., 2011; Seifert et al., 2016), circular sequencing (CirSeq) (Acevedo et al., 2014), and intramolecular-ligated nanopore consensus sequencing (INC-Seq) (Li et al., 2016).

Table 1

Comparison of various NGS approaches in virus quasispecies analysis.

Principle	Strengths	Weaknesses	Error frequency
Unique molecular identifiers (UID or Safe-SeqS): • Randomly generated UID • Allowing identification of every single reverse-transcribed viral RNA • Only mutations that exist in a majority of sequences with an identical UID considered as true variants	• Preservation of minor variant frequency • Multiplexing possible	• Incapable of correcting reverse transcription polymerase chain reaction (RT-PCR) errors • Risk of tag clashes when tag diversity is inadequate	1.4 × 10⁻⁵
Duplex sequencing (DupSeq): • Molecular barcodes applied to each double-stranded DNA molecule • Simultaneously identifying the two complementary strands and distinguishing them • True mutations present in a majority of sequences in each strand group and the complementary strand group	• Multiplexing possible	• Incapable of correcting PCR errors that occur during reverse transcription • Risk of tag clashes when tag diversity is inadequate • DupSeq cannot be applied directly to RNA templates, which can cause the loss of preservation of minor variant frequency of RNA viruses	5 × 10⁻⁸
Circular sequencing (CirSeq): • Fragments of viral RNA followed by self-ligation into circularized RNAs for rolling circle amplification. The amplicon composed of many tandem repeats of the circularized RNA • Mutations present in most of the repeats on the same molecule considered as true variants	• No probe or primer design required • Preservation of minor variant frequency	• A tendency towards G-to-A and C-to-T errors in the absence of uracil-DNA glycosylase and formamidopyrimidine-DNA glycosylase • Large amounts of viral RNA (>1 μg) required for library preparation • Very limited length of sequences that can be genotyped as tandem copies on short-read platforms	7.6 × 10⁻⁶
Intramolecular-ligated nanopore consensus sequencing (INC-Seq): • Viral RNAs directly self-ligated into closed loops for rolling-circle amplification • Each amplicon composed of concatenated repeats of a starting viral molecule • Similar to CirSeq but with many more copies of much longer fragments	• Capability of extremely long-read sequencing (possible to identify multidrug-resistant variants in a single viral genome) • Multiplexing possible • Rapid and field-deployable • No probe or primer design required	• High single-read error rates (about 1%–5%) • Requirement of high coverage to minimize the effect of sequencing errors	3 × 10⁻²

In this mini-review, we mainly deliberate on the consensus-based error-correction approaches to characterize the population structure of single-strand RNA viruses. Although identification of novel viruses is also extremely important and challenging, it requires very different techniques and approaches (Houldcroft et al., 2017; Illingworth et al., 2017; McCrone and Lauring, 2016), which are beyond the scope of the current review. Along with an increasing number of applications in viral quasispecies research, it is important to evaluate various approaches used in NGS for improving the information–noise ratios of the obtained NGS data. Details of each method together with their major advantages and disadvantages are discussed and summarized in this work. Obtaining high-accuracy NGS in studies of viral quasispecies not only relies on error corrections of NGS data, but also depends on the well-tailored design of experimental and computational analysis approaches. Thus, we also briefly address the technical and analytical considerations when applying NGS to unravel the mutational landscape in viral quasispecies, particularly in comparative studies.

Approaches for enhancing the accuracy of NGS in virus quasispecies studies

In the last decade, several NGS approaches and platforms have been developed for viral whole-genome sequencing (WGS) and quasispecies studies in order to enhance infection control and disease management (Houldcroft et al., 2017). A majority of these studies have focused on specific short amplicons that can be sequenced on a short-read platform, such as Roche 454 (Sopena et al., 2018), Illumina (Sutar et al., 2019), or Ion Torrent technology (Goodwin et al., 2016). These amplicon strategies require a relatively simpler analysis workflow because only short regions of the viral genome are in focus. In comparison to single amplicon deep sequencing, WGS involves markedly more data processing procedures, such as de novo assembly and alignment within existing genome databases, but it can deliver a more complete view of the heterogeneity within viral populations, which is particularly important for the identification of novel viruses (Goodwin et al., 2016; Marston et al., 2013). Long-read sequencing has the advantage of directly obtaining information in repetitive sequences with a single read and consequently eliminating ambiguous information in those repetitive regions, but it still suffers from relatively high error rates (Amarasinghe et al., 2020). A high coverage could significantly reduce the error rates, but that entails a higher cost, relatively more computational power and longer times for analysis. To facilitate its comprehension, in this review we mainly concentrate on short-read approaches by discussing consensus-based error-correction methods (Table 1) for enhancing the accuracy of NGS data in virus quasispecies studies.

Tag-based sequencing

This is the most commonly used error-correction approach in short-read NGS platforms, in which a DNA library is typically amplified by polymerase chain reaction (PCR) before sequencing. Zhou et al. (2015), Hauck et al. (2018), Jabara et al. (2011), and Seifert et al. (2016) have applied randomly generated unique molecular identifiers (UIDs, also known as Safe-SeqS (Fig. 1a), “molecular barcodes”, “primer IDs”, or “tags”). UIDs are linked to the primer for reverse transcription in order to label each single-stranded viral cDNA derived from a particular RNA molecule before PCR amplification. Each UID is passed on to all its derivative PCR copies, thus allowing the grouping of all sequence reads derived from the same viral RNA molecule template. Sequences with the same UID are then collapsed to a consensus sequence. Thus, each of these collapsed sequences corresponds to one original viral RNA strand. Differences between sequences within a family of sequences with the same UID are due to technical substitution errors during PCR or sequencing and can be easily corrected (Hiatt et al., 2010; Kinde et al., 2011). Applying UIDs for error correction can decrease the sequencing error frequency to 1.4 × 10⁻⁵ (Fox et al., 2014).
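The grouping and collapsing step described above can be illustrated with a minimal Python sketch. It is not the implementation used in any of the cited studies: the function name `consensus_by_uid`, the toy reads, and the strict-majority rule are assumptions made for illustration, and the reads of one family are assumed to be already aligned and of equal length.

```python
from collections import Counter, defaultdict


def consensus_by_uid(reads, min_family_size=3):
    """Collapse (uid, sequence) pairs into one consensus sequence per UID.

    Assumption: all reads of a family are aligned and have the same length.
    Families with fewer than min_family_size reads are discarded.
    """
    families = defaultdict(list)
    for uid, seq in reads:
        families[uid].append(seq)

    consensus = {}
    for uid, seqs in families.items():
        if len(seqs) < min_family_size:
            continue  # too few copies to out-vote a PCR or sequencing error
        bases = []
        for column in zip(*seqs):
            base, count = Counter(column).most_common(1)[0]
            # Keep a base only if a strict majority of reads supports it.
            bases.append(base if count > len(column) / 2 else "N")
        consensus[uid] = "".join(bases)
    return consensus


# Hypothetical toy data: (UID, read sequence)
reads = [
    ("AAGT", "ACGTAC"),
    ("AAGT", "ACGTAC"),
    ("AAGT", "ACGAAC"),  # one copy carries a technical error
    ("CCTA", "ACTTAC"),
    ("CCTA", "ACTTAC"),
    ("CCTA", "ACTTAC"),  # variant shared by the whole family
    ("GGGA", "ACGTAC"),  # singleton family, dropped
]
result = consensus_by_uid(reads)
print(result)
```

In this toy input the isolated error in the first family is voted out, whereas the substitution shared by every read of the second family survives as a candidate true variant.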

Fig. 1

Library preparation approaches of consensus-based error correction for investigating virus quasispecies. (a) Safe-SeqS uses primers linked to unique molecular identifiers (UIDs) and mouse identifiers (MIDs) for reverse transcription, which not only enables the recognition of every original viral RNA strand after PCR amplification, but also allows multiplexing of samples in the same sequencing run. (b) DupSeq applies randomized duplex tags to each double-stranded DNA molecule in a way that derivative PCR products of the two strands can be informatively related to each other but also distinguishable. Consensus wild-type or mutation sequences are reached only if the reads of each of the double strands show identical sequences. (c) CirSeq begins by circularizing single-stranded DNA fragments without any exogenous molecular barcodes, followed by rolling-circle amplification, fragmentation and sequencing. (d) INC-Seq also entails circularization of single-stranded DNA fragments followed by rolling-circle amplification of the loop; however, the end product is a long DNA strand (>10 kb) comprising concatenated copies of one of the strands of the starting molecule to be sequenced on a long-read platform. For INC-Seq, only in-silico fragmentation is performed for analysis following sequencing. For CirSeq and INC-Seq, the random fragmentation points of the starting molecules serve as endogenous UIDs for consensus-based error correction. For all four above-mentioned methods, after library preparation, pooling and sequencing, sequences originating from the same viral RNA strand of the same sample are collapsed to a single consensus sequence. True mutations (pink circle) can be distinguished from PCR errors (purple star). Due to limited space, sequencing errors are not marked here.

Another tag-based error-correction approach, named duplex sequencing (DupSeq, Fig. 1b), has been applied to study the genetic variation of HCV (Geller et al., 2016). DupSeq utilizes special tags to label each double-stranded cDNA molecule derived from the same viral RNA after reverse transcription and subsequent complementary DNA synthesis by DNA polymerase so that derivative PCR copies of the two strands can be informatively related to each other but remain distinct (Schmitt et al., 2012). Consensuses are first generated for each single-strand group with the same tag and then compared to that of the complementary strand. Sequencing or PCR errors are extremely unlikely to take place at the same positions of the two DNA strands by chance. The double-checking principle, as indicated by the name of DupSeq, can thus significantly reduce the sequencing error frequency down to 5 × 10⁻⁸ (Fox et al., 2014). However, compared with UID approaches, DupSeq cannot be directly applied to RNA templates. Before inserting tags, it requires additional reverse transcriptase PCR and second-strand PCR, which can significantly impact the low-frequency RNA templates in the samples (Head et al., 2014). Therefore, DupSeq might particularly suffer from the loss of preservation of variant frequency of RNA viruses. For both UID/Safe-SeqS and DupSeq error-correction approaches, mistakes that occur during reverse transcription, second-strand synthesis, and PCR recombination will escape correction (Zanini et al., 2017). Moreover, there is the risk of tag clash when the diversity of barcodes is too low to label each independent molecule. On the other hand, tags with too many random nucleotides could also directly contribute to PCR biases (Kou et al., 2016).
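The strand-comparison step of DupSeq can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the helper name `duplex_consensus` is hypothetical, each strand is assumed to have already been collapsed into a single-strand consensus covering the same region, and disagreeing positions are simply masked with "N" rather than handled by the cited pipelines' own rules.

```python
def duplex_consensus(strand_a, strand_b):
    """Combine the two single-strand consensus sequences of one duplex tag.

    strand_b is given as read from the complementary strand, so it is
    reverse-complemented before comparison. Positions where the two
    strands disagree are masked with "N".
    """
    complement = str.maketrans("ACGTN", "TGCAN")
    b_as_a = strand_b.translate(complement)[::-1]
    if len(strand_a) != len(b_as_a):
        raise ValueError("strand consensuses must cover the same region")
    return "".join(a if a == b else "N" for a, b in zip(strand_a, b_as_a))


# Variant present on both strands: retained.
print(duplex_consensus("ACTTAC", "GTAAGT"))
# Change seen on one strand only: masked as a likely technical error.
print(duplex_consensus("ACTTAC", "GTACGT"))
```

A change is accepted only when both strands support it, which is why an error confined to one strand does not reach the final consensus.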

CirSeq

CirSeq (Fig. 1c) is another consensus sequencing method used in short-read NGS. In this case, viral RNAs are fragmented into very short pieces and self-ligated into many circularized RNAs that serve as templates for complementary DNA (cDNA) synthesis. CirSeq incorporates rolling-circle reverse transcription of circularized viral RNA to generate tandem repeat cDNA in order to enrich the target sequences (Acevedo et al., 2014; Whitfield and Andino, 2016). Thus, unlike the tag-based sequencing approach that requires exogenous barcodes to label each viral RNA or cDNA copy, CirSeq makes use of physically joined copies of the sequence for consensus calling. True mutations can be distinguished from either amplification or sequencing errors by building a consensus sequence based on the linked copies of a single molecule. CirSeq, however, has a tendency towards G-to-A and C-to-T errors derived from base damage due to cytosine deamination; therefore, it is necessary to add uracil-DNA glycosylase and formamidopyrimidine-DNA glycosylase during rolling-circle amplification in order to eliminate such errors caused by DNA damage (Lou et al., 2013). The sequencing error frequency of CirSeq is about 7.6 × 10⁻⁶ (Fox et al., 2014). Because CirSeq is built on sequencing tandem repeats on the single-end Illumina sequencing platform, only short sequence fragments (<150 base pairs (bp)) can be genotyped in this approach. Due to this in-built requirement, CirSeq particularly suffers from the constraint on short-length fragments and the inability to perform paired-end sequencing relative to the other major approaches. Moreover, CirSeq requires the input of large amounts of viral RNA (>1 μg) for library preparation (Whitfield and Andino, 2016).
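Consensus calling from physically linked copies can be sketched as follows. The sketch assumes that the repeat length is already known and that the read starts at a repeat boundary; a real CirSeq analysis has to infer the periodicity of each read first. The function name `tandem_repeat_consensus` and the threshold of three copies are assumptions for illustration.

```python
from collections import Counter


def tandem_repeat_consensus(read, repeat_length, min_copies=3):
    """Majority-vote consensus over the tandem copies contained in one read.

    Assumptions: repeat_length is known and the read begins at a repeat
    boundary. A trailing partial copy is ignored. Returns None when the
    read holds fewer than min_copies complete copies.
    """
    copies = [
        read[i:i + repeat_length]
        for i in range(0, len(read) - repeat_length + 1, repeat_length)
    ]
    if len(copies) < min_copies:
        return None
    consensus = []
    for column in zip(*copies):
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base if count > len(copies) / 2 else "N")
    return "".join(consensus)


# Hypothetical read with three copies; the third copy carries one error.
read = "ACTTAC" + "ACTTAC" + "ACTGAC"
print(tandem_repeat_consensus(read, repeat_length=6))
```

The same voting logic applies in principle to INC-Seq reads, which contain more and longer copies per molecule.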

INC-Seq

INC-Seq (Fig. 1d) is a direct consensus sequencing approach based on long-read nanopore sequencing, a platform developed by Oxford Nanopore Technologies (Li et al., 2016; Mikheyev and Tin, 2014). Akin to the CirSeq technique, INC-Seq begins by intramolecular circularizing of RNA molecules to form closed loops. Each RNA loop molecule further undergoes rolling-circle reverse transcription (RT)-PCR amplification to form a long cDNA product comprising concatenated repeats descended from the starting RNA molecule. After sequencing, the resultant reads consist of a long string of tandem copies similar to the results of the CirSeq technique but with many more copies of much longer fragments. True mutations are identified as the variants present in the majority of tandem repeats on the same single molecule, whereas technical substitution errors from RT-PCR or sequencing should not be found in a majority of repeats. The challenge is, however, that this approach has a high raw-read error rate of 5 %–20 % (Salk et al., 2018). Therefore, high coverage is required to reduce the impact of sequencing errors (Houldcroft et al., 2017). It is worth noting that the aforementioned approaches are mainly applied to studies of single-strand RNA viruses, which tend to mutate much faster than double-strand RNA viruses and hence represent a major challenge in deciphering viral population structures. Characterization of variants of double-strand RNA viruses that only account for a small fraction of pathological viruses could benefit from particular approaches, such as DupSeq, which has been reported to detect ultralow-frequency variants from double-strand DNA samples (Kennedy et al., 2014; Schmitt et al., 2012).

Improving detection of rare variants in comparative studies

The genetic diversity of RNA viruses facilitates their adaptation to new environments and evasion of host immunity. Monitoring quasispecies evolution in infected hosts under treatment or after vaccination is important for the early detection of escape mutants. This analysis is complicated by the need not only to minimize technical sequencing artifacts, but also to enhance the comparability among different samples. Technical artifacts/biases correspond to systematic PCR or sequencing errors due to variability in sample processing and experimental design (Head et al., 2014). Artifacts, which cannot be otherwise eradicated, must be eliminated by experimental design. There are several ways of improving NGS data quality for comparing heterogeneous samples in a virus quasispecies study, including (1) sample and library preparation protocols that limit experimental biases, (2) single-molecule consensus sequencing that allows for the identification of true mutations, but exclusion of sequencing errors, (3) computational strategies for read normalization. We describe here mainly two of the strategies.

Sample and library preparation

When it comes to sample and library preparation for analyzing complex populations such as virus quasispecies, it is especially critical to reduce sequencing biases to obtain a faithful picture of the analyzed samples (Acevedo and Andino, 2014; Chen et al., 2018; Forth and Hoper, 2019; Head et al., 2014; Verhoeven et al., 2018). Thus, even within the same experiment, only viral samples with similar RNA quantity and quality should be compared. Following RNA extraction and viral RNA enrichment from the host RNA (Forth and Hoper, 2019; Houldcroft et al., 2017; Sathiamoorthy et al., 2018; Singanallur et al., 2019), the first critical quality control (QC) step (Fig. 2) is to test both quantity and integrity of the starting viral RNA (Hauck et al., 2018; Ng et al., 2018; Yang et al., 2016). The necessity to control virus titer or genome copy numbers in comparative studies has been demonstrated in several studies, as false-positive variant calls become more evident with lower material inputs (Gallet et al., 2017; Illingworth et al., 2017; McCrone and Lauring, 2016). Ideally, in one comparative study, all the final concentrations of sequencing libraries should be identical in order to reduce false-positive calls. Furthermore, following the library preparation, the quantity of the library should also be examined (Ng et al., 2018) and, in principle, only libraries of similar quantity should be directly compared. Since fragmentation is required for the CirSeq approach (Fig. 1c), the fragment size distribution should also be analyzed, at least in the CirSeq workflow, to allow for sensible comparison (Acevedo and Andino, 2014; Lou et al., 2013).

Fig. 2

A general experimental and computational workflow for improving NGS data quality of virus quasispecies studies. For comparative studies or clinical samples, we start from different samples (1, 2, …, i, …, n). One first has to go through different experimental steps, such as sample preparation, library preparation, library quality control, sample indexing, library pooling and sequencing. Computational steps then follow, such as data cleaning, consensus-based error correction, variant calling and annotation.

Barcoding technique and multiplex sequencing

In addition to sample and library QC, bias can be further reduced by pooling various indexed or barcoded samples. Molecular barcodes on short-read NGS platforms allow consensus-based error corrections and the detection of low-frequency variants down to ∼0.001 % mutations per base pair; the total error rate varies depending on the length of the target sequences (Geller et al., 2016; Salk et al., 2018). In addition, multiplex sequencing is possible by assigning an additional barcode, e.g., mouse identifier (MID) (Fig. 1a), to the library of each sample (Hauck et al., 2018). Multiplexing is possible for all four aforementioned sequencing methods (Salk et al., 2018). This can be done in comparative studies, in which the libraries of treated (e.g., vaccinated or infected) and control mice are each tagged with a distinguishable barcode, allowing samples to be pooled into the same sequencing run (Fig. 1a), which reduces potential technical bias associated with run-to-run variability. To further reduce substitution errors, one can implement one of the approaches in Fig. 1 to ensure the unique identification of every original viral RNA strand (or cDNA strand) after PCR amplification. After library preparation and deep sequencing, all sequences obtained are grouped by UIDs or by endogenous fragmentation points and compared for different types of errors. Sequences with the same UID or fragmentation point, i.e., originating from the same viral RNA strand, are collapsed into a single sequence. If within a family of grouped sequences there are differences between sequences due to late PCR errors or sequencing errors, these differences should be corrected using statistical approaches, such as cumulative binomial distribution, in order to assign and rank the probability of a correct base at each given position for various numbers of read copies.
Thus, this tag-based error-correction approach eliminates most amplification biases and identifies the majority of substitution errors (Fig. 2). Each NGS platform has its specific error profile that requires particular downstream computational and statistical handling. For instance, the Ion Torrent technology is prone to insertion–deletion (indel) errors in homopolymeric stretches of DNA (Goodwin et al., 2016). Although the widely used Illumina technology usually possesses an accuracy rate higher than 99.5 %, the platform displays a tendency towards substitution errors (Allhoff et al., 2013; Minoche et al., 2011). While indel errors may be a problem in the case of de novo sequencing, they can be easily identified and removed by comparison to the corresponding reference viral sequence, as demonstrated by several works (Hauck et al., 2018; Song et al., 2017; Yeo et al., 2012). Using this technique, the chance that the same sequencing error recurs within sequences of the same UID family is extremely low. According to the estimations of Kinde et al. (2011), the PCR amplification errors are around 2.2 × 10⁻⁶ distinct alterations/bp. With such an extremely low error rate, even with a read number of 7.8 × 10⁵ in a sample, for a targeted sequence with a length of 165 bp [e.g., the conserved long α-helix (LAH) domain of influenza hemagglutinin protein (Hauck et al., 2018)], the estimated total PCR amplification error is only around 280 reads (2.2 × 10⁻⁶ × 165 × 7.8 × 10⁵). Therefore, the low-frequency viral variants [with a frequency higher than 1 % out of the entire population as demonstrated by Peng et al. (2015)] can be confidently considered as biologically mutated sequences rather than artifacts. In the absence of financial and computational constraints, one could simply increase the sequencing coverage or depth to increase the confidence level.
However, experimental cost, computational capacity and budget will usually constrain a study, so the experimental design must balance experimental and computational cost against sequencing output. Estimating the number of raw reads, or correspondingly the sequencing coverage/depth, is not a trivial issue. A classic formula estimates the sequencing coverage as the number of total reads × the read length / the length of the target sequence or genome (Lander and Waterman, 1988). However, this estimation does not work properly when the virus variant is very rare. For low-frequency variants, the number of raw reads and the sequencing depth need to be increased to raise the confidence and reduce the errors. However, apart from empirical numbers obtained by different groups, it is not known exactly which sequencing depth or number of raw reads is needed to accurately characterize viral quasispecies. For instance, as demonstrated by Griffith et al. (2015), the standard depth of 50x coverage can only detect 10 % of single nucleotide polymorphisms (SNPs) with a minor variant frequency (≤15 %) in tumor samples. They concluded that coverage as high as 10,000x could be required to validate rare variants. Characterization of viral quasispecies might encounter similar issues to those met in deciphering tumor clonal architecture. For more details about the estimation of sequencing depth, please refer to the dedicated review published elsewhere (Sims et al., 2014).
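The arithmetic in this section can be reproduced with a short Python sketch. The error rate, target length and read number are the values quoted in the text; the function names and the genome size, read length and variant frequency used in the coverage example are hypothetical and chosen only for illustration.

```python
def expected_pcr_error_reads(error_rate_per_bp, target_length_bp, n_reads):
    """Expected number of reads carrying a PCR amplification error."""
    return error_rate_per_bp * target_length_bp * n_reads


def coverage(n_reads, read_length_bp, target_length_bp):
    """Classic coverage estimate: total reads x read length / target length."""
    return n_reads * read_length_bp / target_length_bp


def expected_variant_reads(depth, variant_frequency):
    """Mean number of reads supporting a variant at one position."""
    return depth * variant_frequency


# Values quoted in the text: 2.2e-6 errors/bp, 165 bp target, 7.8e5 reads.
errors = expected_pcr_error_reads(2.2e-6, 165, 7.8e5)
print(f"expected error reads: {errors:.0f}")
print(f"as a fraction of all reads: {errors / 7.8e5:.2e}")

# Hypothetical example: 1,000,000 reads of 150 bp on a 10,000 bp genome.
depth = coverage(1_000_000, 150, 10_000)
print(f"coverage: {depth:.0f}x")

# At 50x depth, a hypothetical 1 % variant is expected in less than one read.
print(expected_variant_reads(50, 0.01))
```

The expected error fraction is well below the 1 % variant frequency discussed above, while the last line illustrates why an average coverage figure alone says little about the detectability of rare variants.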

Specific considerations in bioinformatics data analyses to detect low-frequency viral variants

General computational workflows to analyze and correct NGS data have been discussed elsewhere (Dolled-Filhart et al., 2013; Pabinger et al., 2014; Reinert et al., 2015; Salk et al., 2018; Treangen and Salzberg, 2011) and are beyond the scope of this review. In general, bioinformatics analysis of short NGS reads involves five main steps: (i) quality assessment of raw sequencing data, such as trimming, filtering, and others; (ii) read alignment; (iii) variant calling (Fig. 2); (iv) annotation by comparing with knowledge databases; and (v) visualization of aligned reads and mutations (Pabinger et al., 2014; Posada-Cespedes et al., 2017). Several particular issues should be addressed in virus quasispecies studies. One of the key issues in computational analysis is related to variant calling. There are several popular variant callers. One example is MinVar (Huber et al., 2017), which is based on LoFreq (Wilm et al., 2012). Other variant callers include ShoRAH (Zagordi et al., 2011) and its extension (McElroy et al., 2013), SNVer (Wei et al., 2011), deepSNV (Gerstung et al., 2012), SAMtools (Li, 2011), GATK (McKenna et al., 2010), the Ion Torrent-specific TVC and others. For comparison and evaluation of different methods, please refer to the dedicated reviews or comparative work (Hwang et al., 2015; Lee et al., 2020; Pereira et al., 2020). In short, each method has its own pros and cons. Investigation of viral diversity is very sensitive to the variant-calling method used (McCrone and Lauring, 2016). None of them alone can reliably identify authentic minority variants or mutations, and therefore a combination of several variant callers is often required to reach better results (Leung et al., 2014). While UID-based barcoding NGS approaches can significantly improve the identification of low-frequency variants, the sampling bias on original templates that is introduced during PCR amplification remains challenging.
It is also challenging to remove errors introduced in the UIDs during PCR amplification. In this context, Kou et al. tried to correct false UIDs to avoid the false identification of mutations (Kou et al., 2016). They clustered UIDs that differed only in one or two nucleotides into a single UID family. In addition, it has been observed that the first nucleotide of the sample UID and the last nucleotide of the UID are more error prone at least on some sequencing platforms (Brodin et al., 2015). Therefore, UIDs with minor differences are grouped into the same UID family, and the positions of potential error bases should be taken into consideration. Construction of consensus sequences is another critical step in analyzing UID-derived sequences. The consensus sequences are often constructed from the sequencing reads labelled with the same UIDs that have been retrieved a minimal number of times, e.g., ≥3 times (Brodin et al., 2015). So far, most approaches treat multiple-sequence alignments in the same way irrespective of the UID family size. However, statistically, the higher the number of read copies for a given UID, the lower the probability that reads with identical bases occur by chance. This should be integrated into analysis pipelines to further refine error corrections. One could at least rank viral variants by confidence by including such approaches.
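Both ideas in this paragraph, merging near-identical UIDs and weighting a consensus by family size, can be sketched in Python. The sketch is a simplification and not the algorithm of Kou et al. (2016) or Brodin et al. (2015): the greedy merging rule, the function names, and the assumption that read errors are independent with a fixed per-read rate are all illustrative (late PCR errors, in particular, are not independent between reads).

```python
from math import comb


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def cluster_uids(uid_counts, max_mismatches=1):
    """Greedily merge each UID into the most abundant UID within max_mismatches."""
    centers = {}
    ordered = sorted(uid_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for uid, count in ordered:
        for center in centers:
            if len(center) == len(uid) and hamming(center, uid) <= max_mismatches:
                centers[center] += count  # treat uid as an erroneous copy
                break
        else:
            centers[uid] = count  # no close neighbour: start a new family
    return centers


def prob_majority_error(family_size, per_read_error_rate):
    """Binomial upper tail: probability that more than half of the reads in a
    family are wrong at one position, assuming independent read errors."""
    p = per_read_error_rate
    k_min = family_size // 2 + 1
    return sum(
        comb(family_size, k) * p**k * (1 - p) ** (family_size - k)
        for k in range(k_min, family_size + 1)
    )


# Hypothetical UID read counts; "AAGA" differs from "AAGT" at one position.
families = cluster_uids({"AAGT": 10, "AAGA": 1, "CCTA": 5})
print(families)

# Assumed per-read error rate of 1 %; the tail shrinks as families grow.
for size in (3, 5, 7, 9):
    print(size, prob_majority_error(size, 0.01))
```

Under these assumptions the tail probability could serve as a simple confidence score for ranking consensus bases by UID family size.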

Concluding remarks and outlook

Dissecting viral population structures has important biomedical applications but is subject to a wide range of experimental and computational challenges. Improved NGS approaches provide the opportunity to better identify low-frequency yet clinically relevant viral variants (Houldcroft et al., 2017). Several major experimental methods applying consensus-based error correction to enhance data accuracy have been proposed and discussed, and more powerful instruments and creative approaches are under development. Consensus-based error-correction approaches can identify errors that occur during PCR and sequencing. However, these approaches require a relatively high sequencing coverage and are compromised by inefficient tag labelling. During library preparation for NGS, which includes adapter ligation and multiple clean-up cycles, some loss of the starting material is usually inevitable. As a result, minority-variant frequencies might not be preserved, especially for viral or clinical samples that contain a limited amount of material (Illingworth et al., 2017). To circumvent some of the labelling-related issues, a non-consensus-based error-correction approach (named overlapping paired-end read sequencing) has been shown to significantly scale down the sequencing error frequency (5 × 10⁻⁴) and to improve the accuracy of rare-variant detection (Chen-Harris et al., 2013). In overlapping paired-end read sequencing, as indicated by the name, the two reads of each pair derive from the same viral RNA and should therefore be exactly complementary. If they are not exactly complementary, the reads are regarded as errors, which reduces the false-positive discovery of minority variants. From a computational point of view, variant identification and sequence annotation are currently performed in separate steps.
In the near future, the steps of variant calling and viral protein structural annotation may be integrated into a single iterative analysis loop. For instance, with advances in methods for predicting protein structure from mutations, viral variants that might cause viral protein structural changes and reduce viral fitness in the host would be ranked lower. This may require more computational power, including cloud computing (Langmead and Nellore, 2018). In the context of translational virology, the current approaches for the diagnosis of viral infection need to be applied (Barzon et al., 2013, 2011; Capobianchi et al., 2013). Clinical samples suffer from high host-to-virus ratios of genetic input and a low amount of starting material (Fernandez-Cassi et al., 2018). The first step is therefore to enrich and purify the viral material from biopsies (Houldcroft et al., 2017). To compensate for a relatively small number of starting templates, the number of PCR amplification cycles might need to be slightly increased, which in turn might increase PCR-related errors. In routine clinical tests, measurement speed is another critical factor, as results are often required within hours rather than days (Capobianchi et al., 2013). This indicates the need for instruments with even higher throughput than the currently available machines. Computational analysis can also constitute a bottleneck. Sequence assembly is particularly computationally intensive and demands much more computational power in clinical settings (Shendure et al., 2017). Even considering the unparalleled pace of development of NGS and third-generation sequencing approaches, there is still a long way to go to address all these clinic-related challenges (Editorial, 2018; Lavezzo et al., 2016).
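The filtering principle of overlapping paired-end read sequencing mentioned in these remarks can be illustrated with a minimal Python sketch. It is not the implementation of Chen-Harris et al.: the function names and toy reads are hypothetical, and the sketch assumes that both reads cover exactly the same fragment, have equal length, contain only upper-case A, C, G, T or N, and have already been trimmed of adapters.

```python
from typing import Optional

# Translation table for complementing upper-case DNA bases.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_fully_overlapping_pair(read1: str, read2: str) -> Optional[str]:
    """Return read1 if read2 is its exact reverse complement, otherwise None.

    A pair that disagrees at any position is discarded as a likely
    sequencing error instead of being passed on to variant calling.
    """
    if len(read1) != len(read2):
        return None
    return read1 if reverse_complement(read2) == read1 else None


# Toy pairs: the first is exactly complementary, the second is not.
print(merge_fully_overlapping_pair("AACCGGTA", "TACCGGTT"))  # AACCGGTA
print(merge_fully_overlapping_pair("AACCGGTA", "TACCGGTA"))  # None
```

Discarding the whole pair is the strictest choice; a pipeline could instead mask only the disagreeing positions, at the cost of a more complex downstream analysis.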

Author contributions

I.L. proposed and drafted the manuscript. F.H. conceptualized the framework and revised the manuscript. C.M. revised the manuscript.

Declaration of competing interest

None declared.

97 in total

1. Comparison of error correction algorithms for Ion Torrent PGM data: application to hepatitis B virus.

Authors: Liting Song; Wenxun Huang; Juan Kang; Yuan Huang; Hong Ren; Keyue Ding
Journal: Sci Rep Date: 2017-08-14 Impact factor: 4.379

Review 2. DNA sequencing at 40: past, present and future.

Authors: Jay Shendure; Shankar Balasubramanian; George M Church; Walter Gilbert; Jane Rogers; Jeffery A Schloss; Robert H Waterston
Journal: Nature Date: 2017-10-11 Impact factor: 49.962

3. The long view on sequencing.

Authors:
Journal: Nat Biotechnol Date: 2018-04-05 Impact factor: 54.908

4. The Number of Target Molecules of the Amplification Step Limits Accuracy and Sensitivity in Ultradeep-Sequencing Viral Population Studies.

Authors: Romain Gallet; Frédéric Fabre; Yannis Michalakis; Stéphane Blanc
Journal: J Virol Date: 2017-07-27 Impact factor: 5.103

5. Measurements of Intrahost Viral Diversity Are Extremely Sensitive to Systematic Errors in Variant Calling.

Authors: John T McCrone; Adam S Lauring
Journal: J Virol Date: 2016-07-11 Impact factor: 5.103

6. Heteroplasmic mitochondrial DNA mutations in normal and tumour cells.

Authors: Yiping He; Jian Wu; Devin C Dressman; Christine Iacobuzio-Donahue; Sanford D Markowitz; Victor E Velculescu; Luis A Diaz; Kenneth W Kinzler; Bert Vogelstein; Nickolas Papadopoulos
Journal: Nature Date: 2010-03-03 Impact factor: 49.962

7. CHOPER filters enable rare mutation detection in complex mutagenesis populations by next-generation sequencing.

Authors: Faezeh Salehi; Roberta Baronio; Ryan Idrogo-Lam; Huy Vu; Linda V Hall; Peter Kaiser; Richard H Lathrop
Journal: PLoS One Date: 2015-02-18 Impact factor: 3.240

8. Benefits and Challenges with Applying Unique Molecular Identifiers in Next Generation Sequencing to Detect Low Frequency Mutations.

Authors: Ruqin Kou; Ham Lam; Hairong Duan; Li Ye; Narisra Jongkam; Weizhi Chen; Shifang Zhang; Shihong Li
Journal: PLoS One Date: 2016-01-11 Impact factor: 3.240

9. LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets.

Authors: Andreas Wilm; Pauline Poh Kim Aw; Denis Bertrand; Grace Hui Ting Yeo; Swee Hoe Ong; Chang Hua Wong; Chiea Chuen Khor; Rosemary Petric; Martin Lloyd Hibberd; Niranjan Nagarajan
Journal: Nucleic Acids Res Date: 2012-10-12 Impact factor: 16.971

10. Next-generation sequencing library preparation method for identification of RNA viruses on the Ion Torrent Sequencing Platform.

Authors: Guiqian Chen; Yuan Qiu; Qingye Zhuang; Suchun Wang; Tong Wang; Jiming Chen; Kaicheng Wang
Journal: Virus Genes Date: 2018-05-09 Impact factor: 2.332
