Bioinformatics resources for SARS-CoV-2 discovery and surveillance.

Tao Hu¹, Juan Li¹, Hong Zhou¹, Cixiu Li¹, Edward C Holmes², Weifeng Shi¹.

Abstract

In early January 2020, the novel coronavirus (SARS-CoV-2) responsible for a pneumonia outbreak in Wuhan, China, was identified using next-generation sequencing (NGS) and readily available bioinformatics pipelines. In addition to virus discovery, these NGS technologies and bioinformatics resources are currently being employed for ongoing genomic surveillance of SARS-CoV-2 worldwide, tracking its spread, evolution and patterns of variation on a global scale. In this review, we summarize the bioinformatics resources used for the discovery and surveillance of SARS-CoV-2. We also discuss the advantages and disadvantages of these bioinformatics resources and highlight areas where additional technical developments are urgently needed. Solutions to these problems will be beneficial not only to the prevention and control of the current COVID-19 pandemic but also to infectious disease outbreaks of the future.

Keywords: COVID-19; SARS-CoV-2; bioinformatics; next-generation sequencing; pathogen discovery; phylogenetic analysis

Year: 2021 PMID: 33416890 PMCID: PMC7929396 DOI: 10.1093/bib/bbaa386

Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622

Introduction

In late December 2019, pneumonia of unidentified cause was first reported in Wuhan, China [1]. Clinical diagnosis using various commercialized assays targeting multiple common respiratory pathogens failed to identify the causative agent [2]. However, the next-generation sequencing (NGS) of clinical samples, particularly bronchoalveolar lavage fluid from the first group of patients, soon identified a novel coronavirus [1-3], later named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [4]. In this review article, we briefly summarize the NGS technologies and bioinformatics resources employed in the discovery and surveillance of SARS-CoV-2, as well as their advantages and disadvantages, and highlight areas where future work will be particularly profitable.

NGS approaches for SARS-CoV-2 discovery and surveillance

NGS, also termed high-throughput sequencing, has become one of the most widely used approaches in virus research, especially in the areas of the diagnosis of infectious diseases of unidentified cause [1], virus evolution [5] and virus discovery [6, 7]. Multiple sequencing strategies have been applied to discover and monitor the causative agent of the ongoing COVID-19 pandemic—SARS-CoV-2 (Figure 1). Among these, metagenomics has proven itself to be a simple, unbiased and highly efficient approach to virus discovery [7, 8, 9]. The metagenomic approach works best when the abundance of the target virus (SARS-CoV-2) is relatively high and other microorganisms in the samples also need to be analyzed (Figure 1A). Importantly, the proportion of virus-related reads can be greatly increased if the total RNA of clinical samples from COVID-19 patients is subject to ribosomal RNA (rRNA) depletion during the library preparation step [10]. Alternatively, a hybrid capture method can be used to enrich SARS-CoV-2 by using a mixture of RNA probes corresponding to SARS-CoV-2-specific fragments following library construction (Figure 1B).

Figure 1

The workflow of different NGS approaches currently available for virus discovery and genomic surveillance. The library construction scheme employed in (A) metatranscriptomic sequencing, (B) a hybrid capture-based approach based on a metatranscriptomic library, (C) multiplex PCR amplification for NGS platforms and (D) the Oxford Nanopore sequencing platform.

After many SARS-CoV-2 genomes were obtained using metagenomic sequencing during the early stages of the outbreak, a multiplex polymerase chain reaction (PCR) amplification technology targeting SARS-CoV-2 was developed (Figure 1C and D): total RNA is reverse transcribed to synthesize cDNA, and a PCR is then run using multiple amplification primer pairs targeting SARS-CoV-2, followed by a ligation reaction to add the indexes/barcodes. The libraries are subsequently sequenced on Illumina, MGI or Nanopore platforms (https://artic.network/ncov-2019) [11]. In particular, the multiplex PCR amplification technology is efficient for samples with low viral load [12, 13], when the cycle threshold (Ct) value of SARS-CoV-2 quantitative real-time (qRT)-PCR ranges from 24.5 to 31.8 (1–100 viral genome copies per microliter) [12]. Facilitated by the multiplex PCR amplification technology, the MinION device is widely used to diagnose and identify SARS-CoV-2 within hours with high sensitivity [14, 15]. Importantly, however, multiplex PCR amplification sequencing cannot be used to sequence highly diverse or recombinant viruses because the primers are designed according to the reference genomes. PCR may also be limited by primer dimer formation and a non-optimized reaction system. In addition, the error rate of Nanopore sequencing is higher than that of most other NGS platforms, with many deletions accumulating in homopolymers [16-18]. It may therefore be preferable to use the Oxford Nanopore platform supplemented with Sanger or Illumina and MGI platforms to obtain viral genomes with higher accuracy and coverage [19].

Bioinformatics resources for SARS-CoV-2 discovery

As NGS will usually generate millions of sequencing reads with or without a priori knowledge of SARS-CoV-2, the efficiency of virus discovery is heavily dependent on the downstream bioinformatics tools employed. Unfortunately, there is still not a fully integrated bioinformatics pipeline available that is able to automatically analyze NGS data and identify those reads potentially related to viruses. In the context of virus discovery, a typical NGS data analysis workflow consists of several essential steps, including quality control of the NGS data, removal of host/rRNA data, reads assembly, taxonomic classification and virus genome verification (Figure 2). Fortunately, numerous applications are now available for every step (Table 1).

Figure 2

Table 1

Summary of the available bioinformatics resources for SARS-CoV-2 discovery and genomic surveillance

Databases and software	URL	Reference
Data quality control
Trimmomatic	http://www.usadellab.org/cms/index.php?page=trimmomatic	[20]
Cutadapt	https://cutadapt.readthedocs.io/en/stable/	[21]
SOAPnuke	https://github.com/BGI-flexlab/SOAPnuke	[22]
AfterQC	http://www.github.com/OpenGene/AfterQC	[23]
Fastp*	https://github.com/OpenGene/fastp	[24]
Cut_Multi_Primer.py	https://github.com/MGI-tech-bioinformatics/SARS-CoV-2_Multi-PCR_v1.0	-
NanoPack	https://github.com/wdecoster/nanopack	[45]
Porechop	https://github.com/rrwick/Porechop	-
Read mapping
Hisat2	https://daehwankimlab.github.io/hisat2/	[25]
BWA	http://bio-bwa.sourceforge.net/	[26]
Bowtie2*	http://bowtie-bio.sourceforge.net/bowtie2/index.shtml	[27]
KMA	https://bitbucket.org/genomicepidemiology/kma	[28]
SortmeRNA	http://bioinfo.lifl.fr/RNA/sortmerna	[29]
Minimap2	https://github.com/lh3/minimap2	[46]
NGMLR	https://github.com/philres/ngmlr	[47]
MarginAlign	https://github.com/benedictpaten/marginAlign	[48]
De novo assembly
Trinity*	http://www.nature.com/nbt/index.html.	[31]
Megahit	https://hku-bal.github.io/megabox/	[32]
SPAdes	http://bioinf.spbau.ru/spades	[33]
Trans-ABySS	https://github.com/bcgsc/transabyss	[34]
PEHaplo	https://github.com/chjiao/PEHaplo	[35]
SAVAGE	https://bitbucket.org/jbaaijens/savage/src	[36]
coronaSPAdes	http://cab.spbu.ru/software/coronaspades/	[38]
Canu	https://github.com/marbl/canu	[49]
Falcon	https://github.com/PacificBiosciences/falcon	-
Miniasm	https://github.com/lh3/miniasm	[50]
Blast
Diamond*	https://www.wsi.uni-tuebingen.de/lehrstuehle/algorithms-in-bioinformatics/software/diamond/	[39]
Blastn*	ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST	[40]
Phyre2	http://www.sbg.bio.ic.ac.uk/phyre2/html	[42]
Genome visualization
IGV	http://software.broadinstitute.org/software/igv/	[43]
Geneious*	https://www.geneious.com/	-
QUAST	https://sourceforge.net/projects/quast/	[44]
SEQMAN	https://www.dnastar.com/software/molecular-biology/	-
Database
GISAID*	https://www.epicov.org/	[51]
NCBI*	https://www.ncbi.nlm.nih.gov/	[52]
CNCB/NGDC database	https://bigd.big.ac.cn/ncov/	[53]
Genome Warehouse (GWH)	https://bigd.big.ac.cn/gwh/	-
Virus Pathogen Resource (ViPR)	https://www.viprbrc.org/	[54]
Sequence alignment
CLUSTALW	https://www.genome.jp/tools-bin/clustalw	[56]
MAFFT*	https://mafft.cbrc.jp/alignment/software/	[57]
MUSCLE	http://drive5.com/muscle/	[58]
T-Coffee	http://www.tcoffee.org/	[59]
ProbCons	http://probcons.stanford.edu/	[60]
PRANK	http://wasabiapp.org/software/prank/	[62]
Bali-Phy	http://www.bali-phy.org/	[63]
StatAlign	https://dl.acm.org/doi/10.1093/bioinformatics/btn457	[64]
JABAWS	http://www.jalview.org/	[65]
EMBL-EBI	https://www.ebi.ac.uk/	[66]
webPRANK	https://www.ebi.ac.uk/goldman-srv/webprank/	[67]
Jalview	http://www.jalview.org/getdown/release/	[69]
MSAViewer	http://msa.biojs.net/index.html	[70]
AliView	http://www.ormbunkar.se/aliview/	[71]
Bioedit*	http://www.mbio.ncsu.edu/BioEdit/	[72]
Phylogenetic analysis
jMODELTEST	http://evomics.org/learning/phylogenetics/jmodeltest/	[77]
ProtTest	http://darwin.uvigo.es/software/prottest.html	[78]
TempEst	http://tree.bio.ed.ac.uk/software/tempest/	[82]
BIONJ	http://www.atgc-montpellier.fr/bionj/	[83]
PhyML	http://www.atgc-montpellier.fr/phyml/	[84]
RAxML*	http://www.exelixis-lab.org/software.html	[85]
IQ-TREE	http://www.iqtree.org/	[86]
MrBayes	http://nbisweden.github.io/MrBayes/	[87]
PhyloBayes	http://www.atgc-montpellier.fr/phylobayes/	[88]
BEAST1	http://beast.community/	[89]
BEAST2	http://www.beast2.org/	[90]
PAUP	http://paup.csit.fsu.edu/	[91]
MEGA	https://www.megasoftware.net/	[92]
PhyloSuite	http://phylosuite.jushengwu.com/	[93]
Tree visualization
Dendroscope	http://www-ab.informatik.uni-tuebingen.de/software/dendroscope	[94]
FigTree*	http://tree.bio.ed.ac.uk/software/figtree/	-
ggtree	https://yulab-smu.github.io/treedata-book/	[95]
iTOL	https://itol.embl.de/	[96]
Evolview	http://www.evolgenius.info/evolview	[97]
Genomic analysis
Pangolin COVID-19 Lineage Assigner	https://pangolin.cog-uk.io/	-
Nextstrain analysis platform*	https://nextstrain.org/	[104]
Conserved Domain Database*	https://www.ncbi.nlm.nih.gov/cdd/	[108]
UCSC	http://genome.ucsc.edu/	-
GFF2PS	http://genome.imim.es/software/gfftools/GFF2PS.html	[109]
Vector NTI	https://www.winsite.com/vector/vector+nti/	[110]
IBS	http://ibs.biocuckoo.org/	[111]
PHYLIP	https://evolution.genetics.washington.edu/phylip.html	[112]
SimPlot*	https://sray.med.som.jhmi.edu/SCRoftware/simplot/	[114]
RDP	http://web.cbio.uct.ac.za/~darren/rdp.html	[115]
Swiss-Model program*	https://swissmodel.expasy.org/	[116]
PyMOL*	https://www.lfd.uci.edu/~gohlke/pythonlibs/	[117]

*Computer programs used by us in the discovery of SARS-CoV-2 [2, 118].

Figure 2. A schematic workflow and the bioinformatics resources used in novel virus discovery. Each key step in the workflow is shown with a different background. Computational tools used in the SARS-CoV-2 discovery by our group are colored in orange.

Quality profiling

The quality control and preprocessing of raw FASTQ files is critical for subsequent analyses, especially for degraded samples, and involves removing adapter sequences, filtering low quality/complexity reads, error correction, etc. Sequence matching-based adapter trimming tools like Trimmomatic [20], Cutadapt [21] and SOAPnuke [22] (Table 1) can be employed as adapter trimmers and can also perform sliding window or maximum information quality filtering. Recently, all-in-one FASTQ preprocessors, such as AfterQC [23] and fastp [24] (Table 1), provide a variety of functions, including quality profiling, adapter and polyG/polyX tail trimming, base correction, per-read quality pruning and unique molecular identifier preprocessing. These efficient computer applications can support both single-end and paired-end (PE) sequencing data, with the exception of PE data requiring some additional steps based on overlapping analysis. For instance, adapter sequences are detected using the overlapping detection algorithm of each pair and can be trimmed with even only one base in the tail, whereas most sequence matching-based tools require at least three bases. Fastp also performs sequence matching-based adapter trimming when setting specific adapter sequences. In addition, for multiplex PCR amplification technology that targets SARS-CoV-2, multiple amplification primer pairs will be removed after general quality control steps using a particular Python script (Cut_Multi_Primer.py) (Table 1) developed by MGI Tech (https://github.com/MGI-tech-bioinformatics/SARS-CoV-2_Multi-PCR_v1.0).
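The sliding-window quality filtering mentioned above can be illustrated with a minimal sketch. It is not the implementation used by Trimmomatic or fastp; the function names, the window size of 4 and the mean quality threshold of 20 are assumptions chosen for illustration, and Phred+33 encoding is assumed for the quality string.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in quality_line]


def sliding_window_trim(seq, qual, window=4, min_mean_q=20):
    """Cut the read at the start of the first window whose mean quality
    falls below min_mean_q; reads shorter than the window are returned as-is.

    `window` and `min_mean_q` are illustrative defaults, not tool defaults.
    """
    scores = phred_scores(qual)
    for start in range(0, max(len(scores) - window + 1, 0)):
        win = scores[start:start + window]
        if sum(win) / window < min_mean_q:
            return seq[:start], qual[:start]
    return seq, qual


if __name__ == "__main__":
    # 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
    print(sliding_window_trim("ACGTACGTAC", "IIIIII####"))
```

In this toy read the low-quality tail drags the window mean below the threshold once three of the four bases in the window are Q2, so the read is cut at that window's first base.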

Removal of host/rRNA data

The next challenge is to efficiently process the immense amount of data and identify potential virus-related sequences after quality control of the raw data. As virus genetic material will normally only comprise a tiny proportion of the total nucleic acids present in any sequencing run, the (more) abundant host reads need to be removed by mapping all reads to a host reference genome (if available) using mapping and alignment tools (such as Hisat2 [25], BWA [26], Bowtie2 [27] or KMA [28]) (Table 1). rRNA also needs to be removed using Bowtie2 or SortmeRNA [29] (Table 1), although it is also possible to perform rRNA depletion at the library preparation stage. However, for samples with low concentration or low quality, the host/rRNA depletion step can be skipped to increase the chances of obtaining viral reads.
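As a toy stand-in for the host-depletion step, the sketch below discards reads that share most of their k-mers with a host sequence. Real pipelines use the indexed aligners listed above rather than exact k-mer lookup; the function names, k = 8 and the 50% cutoff are assumptions made for illustration only.

```python
def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_host_index(host_seqs, k=8):
    """Collect every k-mer present in the host reference sequences."""
    index = set()
    for s in host_seqs:
        index |= kmers(s.upper(), k)
    return index


def filter_host_reads(reads, host_index, k=8, max_host_fraction=0.5):
    """Keep reads whose fraction of host-matching k-mers is at most
    max_host_fraction (a hypothetical cutoff)."""
    kept = []
    for read in reads:
        read_kmers = kmers(read.upper(), k)
        if not read_kmers:
            kept.append(read)  # too short to judge, so keep it
            continue
        shared = sum(1 for km in read_kmers if km in host_index)
        if shared / len(read_kmers) <= max_host_fraction:
            kept.append(read)
    return kept


if __name__ == "__main__":
    host = "ACGTACGTTAGCTAGCTAGGATCCGATCGATCG"  # made-up host fragment
    index = build_host_index([host])
    reads = [host[5:25], "TTTTTTGGGGGGCCCCCCAAAAAA"]
    print(filter_host_reads(reads, index))
```

Exact k-mer matching tolerates no sequencing errors, which is one reason alignment-based tools are preferred in practice.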

Reads assembly

Without a priori knowledge of a novel virus genome, a routine approach is to de novo assemble the reads into contigs. Generally, there are two different assembly algorithms [30]: (i) the de Bruijn graph approach is usually used to assemble short reads by converting them to k-mers, which is employed in programs like Trinity [31], Megahit [32], SPAdes [33] and Trans-ABySS [34] (Table 1); and (ii) the overlap–layout–consensus (OLC) approach, which is normally used for the assembly of long reads and is applicable to highly similar genomes such as different viral variants or haplotypes, and employed in programs like PEHaplo [35] and SAVAGE [36] (Table 1). De novo assembly is the best approach in the context of emerging infectious diseases where no reference genomes are available, such as COVID-19. Read assembly can be challenging under virus discovery settings because both sequence divergence and background noise may be extensive, and no tools are always guaranteed to give the best results [37]. Recently, a specialized assembler, coronaSPAdes [38] (Table 1), was developed to recover genome sequences of the Coronaviridae (including both novel and known species), employing algorithmic assembly from rnaviralSPAdes and the HMM-guided algorithms of biosyntheticSPAdes, based on the genome organization from fragmented assemblies.
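The de Bruijn graph idea described above can be sketched in a few lines. This is a simplified illustration, not the algorithm of Trinity, Megahit or SPAdes: it ignores sequencing errors, coverage and reverse complements, and the function names and example reads are invented for the purpose.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a graph whose nodes are (k-1)-mers; each k-mer in a read adds
    an edge from its prefix (k-1)-mer to its suffix (k-1)-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous_path(graph, start):
    """Extend a contig from `start` for as long as the current node has
    exactly one successor and that successor has not been visited."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        nxt = next(iter(graph[node]))
        if nxt in visited:
            break
        contig += nxt[-1]
        visited.add(nxt)
        node = nxt
    return contig


if __name__ == "__main__":
    # Three overlapping toy reads drawn from the sequence ATGGCGTGCAA.
    reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
    graph = de_bruijn_graph(reads, k=4)
    print(walk_unambiguous_path(graph, "ATG"))
```

Branching nodes, which arise from repeats, variants or errors, are where real assemblers spend most of their effort; this sketch simply stops at them.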

Taxonomic classification

Once the reads are assembled into contigs, the next step is to assign contigs to a specific taxon (i.e. species, genus and family) as a means of taxonomic classification. The most common approach is to blast the individual contigs against a nucleotide/protein database, such as the non-redundant protein sequence database (nr) or the reference virus sequence database (RefSeq_viruses). Diamond [39] (Table 1) is one of the most popular tools used for aligning translated short reads against the nr database and is much faster than Blastx (a gold standard tool for protein alignment) with a similar degree of sensitivity. Blastn [40] (Table 1), as a traditional nucleotide-to-nucleotide search program, is still widely used for nucleotide sequence alignment. As there will be tens of thousands of contigs, it is usually advisable to create a local database that contains protein sequences of all known reference viruses to further reduce the computational burden and accelerate the blast process. However, the results from local blast searches should be interpreted with caution because of false hits to non-viral proteins that share homology with viral counterparts. It is therefore important to perform a confirmatory Blastx search against the nr database to avoid false positives. Another strategy is to directly align all remaining reads to reference databases using alignment tools for contigs: this will greatly reduce the computational resources required for data analysis, especially with libraries constructed using human-related samples. Reads from the virus-positive library are then de novo assembled as described above. The remaining unannotated contigs are tentatively assigned as ‘orphan’ contigs [41]. Although divergent in primary sequence, such orphaned contigs can be further analyzed using protein structure-informed approaches such as that implemented in Phyre2 [42] (Table 1).
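Once a search has been run, assigning each contig to its best-scoring subject is a small parsing task. The sketch below assumes the standard 12-column tabular output of BLAST+ (`-outfmt 6`); the E-value cutoff, the function name and the contig and subject identifiers in the example are hypothetical.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST6_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch",
                  "gapopen", "qstart", "qend", "sstart", "send",
                  "evalue", "bitscore"]


def best_hits(tabular_text, max_evalue=1e-5):
    """Return {contig: subject} keeping, per contig, the hit with the
    highest bit score among hits passing the E-value cutoff."""
    best = {}
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(BLAST6_COLUMNS, row))
        if float(hit["evalue"]) > max_evalue:
            continue
        bitscore = float(hit["bitscore"])
        current = best.get(hit["qseqid"])
        if current is None or bitscore > current[1]:
            best[hit["qseqid"]] = (hit["sseqid"], bitscore)
    return {query: subject for query, (subject, _) in best.items()}


if __name__ == "__main__":
    # Invented example rows; refA/refB/refC are placeholder subject names.
    example = (
        "contig1\trefA\t89.5\t300\t30\t1\t1\t300\t10\t309\t1e-50\t250\n"
        "contig1\trefB\t95.0\t300\t15\t0\t1\t300\t5\t304\t1e-80\t400\n"
        "contig2\trefC\t40.0\t50\t30\t0\t1\t50\t1\t50\t0.5\t30\n"
    )
    print(best_hits(example))
```

Contigs left without any hit after filtering would correspond to the 'orphan' contigs discussed above.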

Virus genome verification

After virus-associated contigs are extracted, the quality of the contigs can be examined by read mapping. The reliable contigs with unassembled overlaps, or those from the same scaffold, are then merged to form longer viral contigs using contig assembly tools (SEQMAN or Geneious), followed by iterative read mapping for further extension of the genome at both ends. Results in sam/bam format can be visualized using programs like IGV [43], Geneious (http://www.geneious.com) or QUAST [44] (Table 1). If necessary, gaps can be filled by RT-PCR and Sanger sequencing, and genome termini can be determined by RNA circularization or 5′/3′ RACE kits. The consensus sequence determined from the final assembly of the mapped reads can be used as the newly identified virus genome for downstream analyses. For sequence data generated from third-generation sequencing platforms (e.g. Oxford Nanopore sequencing), the data analysis workflow is basically the same as that described above, but different programs are used at each step because the reads are longer. For instance, NanoPack [45] (Table 1) is a comprehensive preprocessing tool with several individual scripts for long-read sequencing data, providing multiple quality profiling features (NanoStat, NanoPlot and NanoComp), read filtering and trimming (NanoFilt) and the removal of contamination (NanoLyse). Porechop (https://github.com/rrwick/Porechop) (Table 1) functions as an adapter trimmer and is able to find and remove adapters from Oxford Nanopore reads through alignment-based strategies, even with low sequence identity. In addition, new alignment tools (e.g. Minimap2 [46], NGMLR [47], MarginAlign [48]) and multiple de novo assembly tools (e.g. Canu [49], Falcon (https://github.com/PacificBiosciences/falcon), Miniasm [50]) (Table 1) based on OLC approaches have now also been developed specifically for long-read data.
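Deriving a consensus sequence from mapped reads, as described above, amounts to a per-position majority vote. The following sketch assumes ungapped reads given as (0-based start, sequence) pairs rather than real SAM/BAM records; the function name and the minimum depth of 2 are illustrative assumptions.

```python
from collections import Counter


def call_consensus(aligned_reads, genome_length, min_depth=2):
    """Majority-vote consensus from ungapped reads.

    aligned_reads: list of (start, sequence), start being 0-based on the
    reference. Positions covered by fewer than min_depth reads become 'N'.
    """
    columns = [Counter() for _ in range(genome_length)]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < genome_length:
                columns[pos][base] += 1
    consensus = []
    for col in columns:
        depth = sum(col.values())
        if depth < min_depth:
            consensus.append("N")  # insufficient coverage
        else:
            consensus.append(col.most_common(1)[0][0])
    return "".join(consensus)


if __name__ == "__main__":
    reads = [(0, "ACGT"), (0, "ACGA"), (2, "GTTA"), (2, "GTTA")]
    print(call_consensus(reads, genome_length=6))
```

Positions masked as 'N' are the natural candidates for the RT-PCR and Sanger gap filling mentioned in the text.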

Bioinformatics resources for genomic and evolutionary analyses

SARS-CoV-2-related databases

A number of SARS-CoV-2-related databases are available, such as the Global Initiative on Sharing All Influenza Data (GISAID, https://www.gisaid.org/) [51], the National Center for Biotechnology Information (NCBI, https://www.ncbi.nlm.nih.gov/sars-cov-2/) [52], Genome Warehouse (https://bigd.big.ac.cn/gwh/), the China National Center for Bioinformation (CNCB)/National Genomics Data Center (NGDC) database (https://bigd.big.ac.cn/ncov/) [53] and the Virus Pathogen Resource (https://www.viprbrc.org/) [54]. Among them, GISAID holds the largest number of SARS-CoV-2 genome sequences. These databases play important roles in sequence archiving, homology searching, variation discovery, disease phenotype association, etc.

Multiple sequence alignment

Accurate multiple sequence alignment (MSA) is the foundation of all comparative genome sequence analyses. The number of available MSA methods has increased in recent decades [55], although they can be classified into three major categories: (i) progressive-based methods (including CLUSTALW [56], MAFFT [57] and MUSCLE [58]), (ii) consistency-based methods (including T-Coffee [59], ProbCons [60] and some versions of MAFFT [61]) and (iii) evolution-based methods (including PRANK [62], Bali-Phy [63] and StatAlign [64]) (Table 1). JABAWS [65] integrates a variety of MSA tools (e.g. MUSCLE, MAFFT and ClustalW) (Table 1), which can be conveniently packaged to run locally. Web services, such as EMBL-EBI [66], also provide free access to online applications of popular sequence analysis tools (e.g. MUSCLE, MAFFT, ClustalW, T-Coffee and webPRANK [67]) (Table 1). Different MSA algorithms can produce different alignments, which affects all downstream analyses [68]. Fortunately, because SARS-CoV-2 genomes are so similar in sequence, with few insertion–deletion events (indels), MSA is normally straightforward. The resulting alignment can then be visualized, analyzed, annotated and manually edited using Jalview [69], MSAViewer [70], AliView [71], Bioedit [72] or Geneious (https://www.geneious.com) (Table 1).
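Progressive MSA methods build on pairwise dynamic programming alignment. As a rough illustration of that building block, the sketch below implements Needleman-Wunsch global alignment with a linear gap penalty. The scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions, and none of the tools above is claimed to use this exact scheme.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment with a linear gap penalty.

    Returns (aligned_a, aligned_b, score). Scoring values are illustrative.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


if __name__ == "__main__":
    print(needleman_wunsch("ACGTACGT", "ACGACGT"))
```

The quadratic time and memory cost of this table is negligible for two short sequences but explains why aligning many full genomes requires the heuristics used by dedicated MSA tools.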

Phylogenetic and evolutionary analyses

Phylogenetic trees are central to understanding the emergence and evolution of SARS-CoV-2 and can be estimated using a variety of approaches, particularly distance-based methods such as neighbor joining (NJ) [73] and character-based methods including maximum parsimony (MP) [74], maximum likelihood (ML) [75] and Bayesian inference (BI) [76]. The NJ, ML and BI methods use explicit statistical models of nucleotide or amino acid substitution, which can be compared and evaluated using programs such as jMODELTEST [77] and ProtTest [78] (Table 1). Although they are both based on substitution models, BI methods differ from ML methods in that they use statistical distributions to quantify uncertainties (posterior distributions) both in the tree and model parameters [79]. Substitution models are not employed in MP, which instead attempts to minimize the number of evolutionary changes across the tree in accord with the parsimony principle [80]. Bayesian methods have been extended to determine the patterns of virus spread in both space (i.e. phylogeography) and time [81]. Both applications require sequence evolution to proceed at an approximately constant rate, the so-called molecular clock of evolution. Before running analyses assuming a molecular clock, it is advisable to test its presence through a regression of root-to-tip genetic distances against date of sampling (e.g. using the TempEst program [82]). A number of computer programs and packages implementing these phylogenetic methods have been developed, such as BIONJ [83] for NJ; PhyML [84], RAxML [85] and IQ-TREE [86] for ML; and MrBayes [87], PhyloBayes [88], BEAST1 [89] and BEAST2 [90] for BI (Table 1). Multiple methods are included in packages such as PAUP [91] and MEGA [92] (Table 1). Recently, a novel integrated desktop platform, PhyloSuite [93] (Table 1), has been developed for streamlined molecular sequence data management and phylogenetics studies.
A variety of approaches are also available for visualization of the resultant phylogeny, such as Dendroscope v3.5.10 [94], FigTree v1.4.3 (http://tree.bio.ed.ac.uk/software/figtree/) and ggtree [95] (Table 1). Online services such as iTOL [96] and Evolview [97] (Table 1) can also be used to annotate phylogenetic trees. Both ML and BI approaches have been used extensively to study the evolution of SARS-CoV-2 [98, 85, 99–103]. When a large number of SARS-CoV-2 genome sequences are being analyzed, IQ-TREE v1.6.8 and RAxML v8.2.9 (Table 1) are recommended as they can utilize many computation nodes with high efficiency and hence are applicable to large datasets. For example, an ML-based analysis platform, Nextstrain [104], analyzes the latest SARS-CoV-2 data from GISAID [51] and visualizes the spread and evolution of all available SARS-CoV-2 strains in real time (https://nextstrain.org/ncov/global/zh). Finally, although alignment-free approaches have recently been proposed to enable genome-scale phylogenetic inference, such as the average common subsequence [105], composition vector (CVTree) [106] and k-mer [107] methods, to the best of our knowledge, they have not been used in the phylogenetic analysis of SARS-CoV-2.
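The root-to-tip regression used to check for a molecular clock signal can be written as an ordinary least-squares fit. The sketch below is not TempEst's implementation; it assumes that root-to-tip distances have already been extracted from a rooted tree, and the function name and the numbers in the example are invented. The slope estimates the evolutionary rate and the x-intercept the date of the root.

```python
def root_to_tip_regression(dates, distances):
    """Least-squares regression of root-to-tip distance on sampling date.

    dates: decimal years; distances: substitutions per site.
    Returns (slope, intercept, r_squared, root_date), where root_date is
    the x-intercept of the fitted line.
    """
    n = len(dates)
    mean_x = sum(dates) / n
    mean_y = sum(distances) / n
    sxx = sum((x - mean_x) ** 2 for x in dates)
    syy = sum((y - mean_y) ** 2 for y in distances)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dates, distances))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy ** 2) / (sxx * syy) if syy > 0 else 0.0
    root_date = -intercept / slope if slope != 0 else float("nan")
    return slope, intercept, r_squared, root_date


if __name__ == "__main__":
    # Synthetic, perfectly clock-like data (not real SARS-CoV-2 estimates).
    dates = [2020.0, 2020.25, 2020.5, 2020.75]
    distances = [0.001 * (d - 2019.9) for d in dates]
    print(root_to_tip_regression(dates, distances))
```

A low R² or a negative slope would suggest that the data carry little temporal signal, in which case clock-based analyses should be interpreted with caution.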

Virus genome annotation and analysis

Genome annotation

For a new virus, genome annotation can be challenging. The open reading frames of SARS-CoV-2 were initially predicted using Geneious v11.1.5 and annotated using the Conserved Domain Database (https://www.ncbi.nlm.nih.gov/cdd/) [108]. Subsequently, online services, such as the popular genome browser UCSC (http://genome.ucsc.edu/covid19.html), Ensembl (https://covid-19.ensembl.org/index.html) and NCBI SARS-CoV-2 Resources (https://www.ncbi.nlm.nih.gov/sars-cov-2/) [52], were developed to facilitate SARS-CoV-2 genome annotation. For multiple SARS-CoV-2 sequences, cross-referencing against the reference virus genome, NC_045512.2 (strain Wuhan-Hu-1), from GenBank (https://www.ncbi.nlm.nih.gov/), simplifies gene annotation. In addition, there are several offline applications (e.g. GFF2PS [109], Vector NTI [110] and IBS [111]) (Table 1) that are able to annotate virus genomes.
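Open reading frame prediction, the first annotation step mentioned above, can be illustrated with a naive scan for ATG-to-stop stretches. This sketch only searches the three forward frames and ignores features such as ribosomal frameshifting; it is not how Geneious predicts ORFs, and the function name and minimum length are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return sorted (start, end) 0-based half-open intervals of ORFs that
    run from the first ATG in a frame to the next in-frame stop codon
    (stop included). Only the forward strand is scanned."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


if __name__ == "__main__":
    toy = "CCATGAAATTTTAGGG"  # invented sequence with one short ORF
    for start, end in find_orfs(toy):
        print(start, end, toy[start:end])
```

For a positive-sense RNA virus such as SARS-CoV-2 the forward strand is the relevant one, but a general-purpose finder would also scan the reverse complement.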

Detection of genetic variation

Levels of genetic identity provide a simple impression of the relationships between viruses. Several computer programs can be used to calculate pairwise sequence identities between sequences, such as Geneious v11.1.5 [2]. The DNADIST program of PHYLIP v3.697 [112] (Table 1) can also be used to estimate the genetic distance matrix of SARS-CoV-2. When the number of sequences is small, visual inspection of the alignment is sufficient to identify mutational changes. For example, by inspecting the alignment of full-length SARS-CoV-2 genomes, mutational sites including nucleotide substitutions and indels could be readily identified in each virus genome using MEGA X [102, 103]. However, when the number of sequences to be analyzed is large, visual inspection becomes challenging. Online resources, including the CNCB/NGDC database (https://bigd.big.ac.cn/ncov/), Nextstrain website [104] and the UCSC Genome Browser for SARS-CoV-2 (http://genome.ucsc.edu/covid19.html), can display the single-nucleotide polymorphisms at more than 10 000 sites across the SARS-CoV-2 genome.
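Both pairwise identity and the list of variable sites can be computed directly from an existing alignment. The sketch below assumes equal-length aligned sequences with '-' as the gap character; the function names and the toy alignment are illustrative only and do not reproduce the calculations of Geneious, PHYLIP or MEGA.

```python
def pairwise_identity(a, b):
    """Percent identity over alignment columns where neither sequence
    has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue
        compared += 1
        matches += (x == y)
    return 100.0 * matches / compared if compared else 0.0


def variable_sites(alignment):
    """Return {1-based column: sorted distinct characters} for every
    column in which the sequences differ (gaps count as a state)."""
    seqs = [s.upper() for s in alignment.values()]
    sites = {}
    for col, column in enumerate(zip(*seqs), start=1):
        states = set(column)
        if len(states) > 1:
            sites[col] = sorted(states)
    return sites


if __name__ == "__main__":
    aln = {"ref": "ACGTACGT", "s1": "ACGTACGA", "s2": "AC-TACGT"}
    print(pairwise_identity(aln["ref"], aln["s1"]))
    print(variable_sites(aln))
```

Scanning columns this way scales linearly with alignment size, which is what makes automated variant tables practical when visual inspection is not.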

Detecting coronavirus recombination

Coronaviruses are well known to have undergone frequent recombination [113] so that the occurrence of this process should be considered carefully. For example, we used SimPlot v3.5.1 [114] (Table 1) to detect potential inter-lineage recombination in betacoronaviruses, in which a sliding window analysis is employed to determine the changing patterns of sequence similarity between sequences which can then be verified by phylogenetic analysis. RDP4 [115] (Table 1) is a popular package to detect recombination events in a specific dataset and contains a number of common and important algorithms used for recombination detection, such as RDP, GENECONV, 3Seq, Chimaera, SiScan, MaxChi and LARD. Generally, a recombination event is regarded as reliable when it is detected by multiple independent methods.
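The sliding-window similarity analysis described above can be sketched as follows. This is a simplified illustration in the spirit of a similarity plot, not SimPlot's implementation: it uses raw identity without a substitution model, and the function names, window size and step are assumptions. A change in the closest reference along the genome is the kind of pattern that would then be checked by phylogenetic analysis.

```python
def window_similarity(query, reference, window=200, step=20):
    """Return [(window midpoint, identity)] along two aligned sequences,
    skipping columns where either sequence has a gap."""
    if len(query) != len(reference):
        raise ValueError("aligned sequences must have equal length")
    points = []
    for start in range(0, len(query) - window + 1, step):
        compared = matches = 0
        for x, y in zip(query[start:start + window],
                        reference[start:start + window]):
            if x == "-" or y == "-":
                continue
            compared += 1
            matches += (x == y)
        identity = matches / compared if compared else 0.0
        points.append((start + window // 2, identity))
    return points


def closest_reference_per_window(query, references, window=200, step=20):
    """For each window, name the reference most similar to the query."""
    if not references:
        return []
    profiles = {name: window_similarity(query, ref, window, step)
                for name, ref in references.items()}
    names = list(profiles)
    result = []
    for i, (midpoint, _) in enumerate(profiles[names[0]]):
        best = max(names, key=lambda name: profiles[name][i][1])
        result.append((midpoint, best))
    return result


if __name__ == "__main__":
    # Toy mosaic: first half resembles refA, second half resembles refB.
    query = "A" * 10 + "C" * 10
    refs = {"refA": "A" * 20, "refB": "C" * 20}
    print(closest_reference_per_window(query, refs, window=10, step=10))
```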

Homology modeling

Homology modeling is a useful tool to predict protein structures depending on the degree of similarity between the target sequence and the template sequences available in the databases (e.g. PDB), thereby helping to make inferences on protein function. The three-dimensional structures of SARS-CoV-2 proteins have been modeled using the Swiss-Model web server [116] and displayed using PyMOL v2.1 [117] (Table 1): these studies revealed that SARS-CoV-2 may also use human ACE2 as its binding receptor [2]. Similarly, by using homology modeling, we showed that the Spike protein of a bat coronavirus, RmYN02, might not be able to bind to human ACE2 [118].

Current challenges and future directions

A combination of NGS technology and available bioinformatics tools successfully identified SARS-CoV-2 within days of the report of a novel pneumonia. In addition, genomic epidemiology, based on a solid bedrock of phylogenetic analysis, has played an irreplaceable role in investigating the origins, tracing the spread and monitoring the evolution and variation of SARS-CoV-2, and will clearly play a key role in helping to contain the COVID-19 pandemic as well as future outbreaks.

Along with the development of sequencing technologies, the output of sequencing devices such as the Illumina NovaSeq 6000 now reaches terabases per flow cell. Sample multiplexing should therefore be employed to maximize efficiency and reduce costs, although this also leads to index hopping/swapping that can wrongly assign viruses to samples [119, 120]. The misassignment rates of Exclusion Amplification (ExAmp) chemistry instruments (HiSeq X, HiSeq 4000 and NovaSeq) are estimated to be 0.2 to 6%, approximately 10-fold higher than those of random cluster amplification instruments such as MiSeq [120, 121]. A high input of adapters and no dilution of PCR-free libraries often result in a high ratio of residual free adapters. Therefore, index swapping rates using the PCR-free library construction method are higher than those of PCR-based methods, and one must be careful when interpreting those viruses found in pooled samples sequenced on a single lane [119]. It is also advisable that libraries with comparably high viral load are constructed in the same batch and loaded in the same lane to help eliminate amplicon contamination in other samples with low viral loads [11]. In contrast to Illumina technologies, MGI sequencers utilize DNA nanoball technology that can reduce the misassignment rate to 0.0001–0.0004% under recommended procedures [120], although single indexed adapters somewhat limit these advantages. To mitigate cross-contamination, a non-redundant dual-indexing approach has been designed for the Illumina sequencing platform, which would sharply reduce the index hopping rate.

Challenges remain when employing NGS data for novel virus discovery, in particular when the proportion of virus reads is very low. The situation is especially complex for highly divergent viruses that exhibit so little similarity in primary sequence that they need to be identified by homology-based or protein structure-based methods. As a consequence, a better understanding of the virosphere clearly requires more adequately validated methods and advances in computational analyses.

At the time of writing, there are >144 000 full-length genome sequences of SARS-CoV-2 available from GISAID. The generation of such an enormous amount of data represents both an achievement and a challenge. In particular, phylogenetic analysis and visualization of SARS-CoV-2 genomes is a cumbersome exercise with such a huge amount of data. In this case, neither offline applications nor online services can work efficiently because of the high demand for computational resources (memory/CPU). Splitting the whole dataset into several smaller sub-datasets, aligning the sub-datasets independently and combining them after MSA is a simple but feasible way to proceed in these circumstances. Even if phylogenetic analysis is feasible (e.g. [122]), such large trees are extremely difficult to visualize and interpret. Subsampling may therefore be the optimal way to proceed, and algorithms for this purpose are urgently needed.

In sum, technical advances in NGS and bioinformatics enabled us to rapidly identify the causative agent of COVID-19 and track its global spread. However, the marked increase in the number of SARS-CoV-2 genome sequences has highlighted the technical obstacles in analyzing such ‘big’ datasets. Solutions to these problems will assist not only in the control of the current COVID-19 pandemic but also in those future outbreaks of infectious disease that will undoubtedly occur.

A variety of next-generation sequencing technologies have been applied for the discovery and genomic surveillance of SARS-CoV-2, with metatranscriptomic sequencing suitable for virus discovery and amplicon/probe hybridization-based approaches more effective in genomic surveillance. No fully integrated bioinformatics pipeline is currently available for virus discovery. However, there are many tools available for each component step, from quality control of the raw genomic sequence data to virus genome verification. Current bioinformatics resources in multiple sequence alignment, phylogenetics, tree visualization and genomic analysis are robust and reliable. However, the sharp increase in the amount of SARS-CoV-2 genome sequence data available poses serious challenges for data storage and analysis that urgently need to be resolved.
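One simple way to approach the subsampling problem raised above is to stratify genomes by place and time and draw a fixed number from each stratum. The sketch below is only one possible scheme under stated assumptions: records are (sequence ID, country, ISO date) tuples, strata are country-month pairs, and the function name, group size and example records are hypothetical.

```python
import random
from collections import defaultdict


def stratified_subsample(records, per_group=2, seed=0):
    """Pick at most per_group sequence IDs from each (country, month)
    stratum. records: iterable of (sequence_id, country, 'YYYY-MM-DD').
    A fixed seed keeps the selection reproducible."""
    groups = defaultdict(list)
    for seq_id, country, date in records:
        groups[(country, date[:7])].append(seq_id)
    rng = random.Random(seed)
    selected = []
    for key in sorted(groups):
        ids = sorted(groups[key])
        if len(ids) > per_group:
            ids = rng.sample(ids, per_group)
        selected.extend(sorted(ids))
    return selected


if __name__ == "__main__":
    # Invented metadata: five genomes from country A, one from country B.
    records = [("A%d" % i, "A", "2020-03-%02d" % (i + 1)) for i in range(5)]
    records.append(("B0", "B", "2020-03-15"))
    print(stratified_subsample(records, per_group=2, seed=1))
```

Stratifying before sampling keeps sparsely sequenced regions and periods represented, whereas uniform random sampling would mostly reproduce the sequencing effort of the best-sampled countries.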
