CovidPhy: A tool for phylogeographic analysis of SARS-CoV-2 variation.

Xabier Bello¹, Jacobo Pardo-Seco¹, Alberto Gómez-Carballa¹, Hansi Weissensteiner², Federico Martinón-Torres³, Antonio Salas⁴.

Abstract

The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is the pathogen responsible for the coronavirus disease 2019 (COVID-19) pandemic. SARS-CoV-2 genomes have been sequenced on a massive scale worldwide and are now available in several public genome repositories. There is much interest in developing bioinformatic tools capable of analyzing and interpreting SARS-CoV-2 variation. We have designed CovidPhy (http://covidphy.eu), a web interface that can process SARS-CoV-2 genome sequences supplied in plain fasta text format or through identity codes from the Global Initiative on Sharing Avian Influenza Data (GISAID) or GenBank. CovidPhy aggregates information available in the large GISAID database (>1.49 M genomes). Sequences are first aligned against the reference sequence, and the interface then provides different sources of information, including automatic classification of genomes into a pre-computed phylogeny, phylogeographic information, haplogroup/lineage frequencies, and sequence variation; it also indicates whether the genome contains known variants of concern (VOC). Additionally, CovidPhy allows searching for variants and haplotypes entered by the user and includes a list of genomes that are good candidates for being responsible for large outbreaks worldwide, most likely mediated by important superspreading events, indicating their possible geographic epicenters and their relative impact as recorded in the GISAID database.

Keywords: COVID-19; Phylogeny; RNA; SARS-CoV-2; Superspreading events; Variants of concern

Year: 2021 PMID: 34419470 PMCID: PMC8376833 DOI: 10.1016/j.envres.2021.111909

Source DB: PubMed Journal: Environ Res ISSN: 0013-9351 Impact factor: 6.498

Introduction

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a single-stranded RNA virus responsible for the coronavirus disease 2019 (COVID-19) pandemic. There has been massive interest in sequencing the genomes of coronaviruses circulating in COVID-19 patients worldwide since the first genome was sequenced in December 2019 (Wu et al., 2020). Genomes are stored in public repositories such as GenBank (https://www.ncbi.nlm.nih.gov/sars-cov-2/) or, more specifically, in the Global Initiative on Sharing Avian Influenza Data (GISAID; https://www.gisaid.org) (Shu and McCauley, 2017), as well as the 2019 Novel Coronavirus Resource (2019nCoVR; https://bigd.big.ac.cn/ncov/online/tools). During the last few months, several software applications and web tools have been developed that aim at understanding SARS-CoV-2 variation and its dissemination on a worldwide scale. One of the most popular tools is Nextstrain (https://nextstrain.org; Hadfield et al., 2018), which provides a maximum likelihood phylogeny built on a massive number of SARS-CoV-2 genomes and allows the phylodynamics of the virus to be investigated since the beginning of the pandemic. However, the nomenclature of the Nextstrain phylogeny is limited to a few nodes among hundreds, it does not follow systematic criteria for naming clades, and it is not stable, having undergone several changes over the last few months. Moreover, mutational pathways along branches from the root to a given node can only be reconstructed partially in most of the tree; therefore, although very informative from the phylodynamics point of view, the phylogeny does not allow a genome to be classified into a phylogenetic node, with the exception of a few nodes of interest (see, in contrast, the human mtDNA phylogeny, which includes 5,500 haplogroups (van Oven, 2015), and related classification tools such as Haplogrep (Weissensteiner et al., 2016)).
The tool Nextclade, which is part of the Nextstrain project, classifies SARS-CoV-2 fasta sequences into 18 major clades. Besides mutation calling and clade assignment, it performs some quality checks and phylogenetic placement in a global tree. The largest SARS-CoV-2 clade dataset is provided by the Pango Network; it comprises 1,801 virus lineages (of which 514 have been withdrawn) and has its own lineage naming rules (Rambaut et al., 2020). The network provides several tools for online and offline classification, lineage suggestion (with its own lineage designation committee), tree modification and reporting. Other analytical tools for the analysis of SARS-CoV-2 variation include CoV-Seq, a web interface written in Python and JavaScript (Liu et al., 2020a) that aggregates data from different repositories to extract information on genetic variants that, ultimately, can be downloaded by users. The varSEAK database (https://varseak.bio) offers a variety of analyses, providing information on variants and lineages as well as a splice site prediction tool. Other software includes Viral Genome ORF Reader (VIGOR; Wang et al., 2010), which focuses on gene annotation; VAPiD (Shean et al., 2019), a pipeline that facilitates genome submissions to NCBI GenBank; and the platform from the National Genomics Data Center (Gong et al., 2020), an online tool that includes BLAST alignment, genome annotation, and variant identification modules, among other utilities. We have developed CovidPhy (www.covidphy.eu), a web tool that processes and analyzes complete SARS-CoV-2 genomes. CovidPhy implements a pipeline that accepts newly generated fasta sequence files as well as the identification codes of genomes stored in the GISAID or GenBank repositories. It classifies genomes into main phylogenetic nodes and offers information on viral variants and clade frequencies worldwide.
By inspecting the large GISAID database, CovidPhy makes it possible to identify specific SARS-CoV-2 sequences as strong candidates for being responsible for notable COVID-19 outbreaks. The first attempt was carried out by Gómez-Carballa et al. (2020a); in this early article, we explored the database available at that time (containing >4.7 K SARS-CoV-2 genomes) for identical genomes that showed a high frequency within a short time frame (only a few days) and occurred in specific geographic areas. This signature, coupled with the substitution rate of SARS-CoV-2 (which generates a substitution approximately every two weeks) and the incubation period needed for the development of COVID-19 symptoms (5–6 days, according to WHO on November 8, 2020), signaled the likely presence of sudden local outbreaks that could have originated from superspreading events. Topological inspection of the phylogenies generated by these candidates added further support to a model of germ transmission compatible with superspreading and not with alternative modes of transmission (e.g. chains). This procedure was subsequently extended in Gómez-Carballa et al. (2020b) to explore the important outbreaks that occurred in Spain during the first wave of the pandemic, which was particularly devastating in this country. Further investigations corroborated the important role of superspreading in the COVID-19 pandemic (Adam et al., 2020; Althouse et al., 2020; Liu et al., 2020b; Walker et al., 2020). The article by Lemieux et al. (2020) specifically addressed two important outbreaks that occurred in Boston in the early weeks of the pandemic; notably, one of these events had been recorded by our early analysis in Gómez-Carballa et al. (2020a), as pointed out in Salas et al. (2021). Therefore, the identification of genomes that were responsible for important outbreaks in different countries and locations by simply inspecting the GISAID database constitutes another useful feature of CovidPhy.

Material and methods

Data source and processing

CovidPhy uses data from GISAID to compute variant and clade (haplogroup) frequencies and infer SARS-CoV-2 candidates responsible for outbreaks. Genomes are aligned against the reference genome with GenBank accession number MN908947.3 (submitted on January 5, 2020) corresponding to the first SARS-CoV-2 genome released on GenBank (GISAID ID #402125). Most of the sequences used by CovidPhy were incrementally downloaded from GISAID since our initial publications and aligned to the reference sequence as previously indicated (Gómez-Carballa et al. 2020a, b; Pardo-Seco et al., 2021; Salas et al., 2021). A total of 1,493,746 genomes (downloaded on February 5, 2021 from GISAID) are now being processed in CovidPhy. We extracted the differences between any sequence and the reference, and the genomes were classified into the nodes of a pre-generated phylogeny.
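The step of extracting differences between a sequence and the reference can be illustrated with a minimal Python sketch. It is not the CovidPhy implementation (which is written in Nim and aligns with a modified MAFFT); it assumes the two sequences have already been pairwise aligned, and the function name call_variants is hypothetical.

```python
from typing import List, Tuple


def call_variants(ref_aln: str, qry_aln: str) -> List[Tuple[int, str, str]]:
    """Return (reference_position, ref_allele, alt_allele) for each difference.

    Both arguments are rows of the same pairwise alignment: equal length,
    with '-' marking gaps. Positions are 1-based on the ungapped reference.
    An insertion is reported at the position of the preceding reference base.
    Undetermined query bases ('N') are skipped rather than called.
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")
    variants = []
    ref_pos = 0
    for r, q in zip(ref_aln.upper(), qry_aln.upper()):
        if r != "-":
            ref_pos += 1  # only reference bases advance the coordinate
        if r == q or q == "N":
            continue
        variants.append((ref_pos, r, q))
    return variants


if __name__ == "__main__":
    # Toy alignment: one substitution, one insertion, one deletion, one N.
    for variant in call_variants("ACGT-ACGT", "ACTTGAC-N"):
        print(variant)
```

In this simplified form, unsequenced genome ends that appear as gaps in the query would be reported as deletions; a production pipeline would need to mask them first.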

Phylogeny and nomenclature

We implemented the phylogeny (and nomenclature) built by Gómez-Carballa et al. (2020a, 2020b) to classify SARS-CoV-2 genomes into clades; this phylogeny is also available and navigable in the web interface. Although built on data produced during the first wave of the pandemic, to the best of our knowledge this phylogeny remains the most elaborate one available. The lack of a consensus nomenclature has generated great controversy among the scientific community (https://www.newscientist.com/article/mg24933242-900-coronavirus-variant-names-are-too-confusing-there-is-a-better-way/ (Callaway, 2021)), as also echoed by the media (e.g. https://www.nytimes.com/2021/03/02/health/virus-variant-names.html), especially with the identification of new variants of concern (VOC) in late December 2020. We had already warned about this controversy in our early publication (Gómez-Carballa et al., 2020a). The differing names given to SARS-CoV-2 genetic lineages by GISAID, Nextstrain and Pango will therefore remain in use within the scientific community, so it seems most reasonable to keep the consistent nomenclature employed in the source mentioned. In addition, note that the minimal nomenclature scheme employed by Nextstrain does not follow systematic criteria for naming branches and that its nomenclature is dynamic (see comments in Gómez-Carballa et al., 2020a); this is also a common feature of the nomenclature used by Rambaut et al. (2020, 2021). Separately, as of June 2021, the World Health Organization (WHO) has proposed a new naming convention for VOCs and variants of interest (https://www.who.int/en/activities/tracking-SARS-CoV-2-variants/) based on letters of the Greek alphabet; CovidPhy classifies sequences into the four popular VOCs, namely, Alpha, Beta, Gamma and Delta.
Together with this classification scheme, CovidPhy also identifies worrying mutations according to the CoVariants resource (https://covariants.org/shared-mutations); many of these mutations are relevant to VOCs.

Software stack

The whole stack was written in the Nim programming language (https://nim-lang.org). Nim is one of the best performing languages (https://github.com/kostya/benchmarks; https://github.com/def-/nim-benchmarksgame), usually on par with C, without sacrificing readability and expressiveness. The alignment of the input sequences is performed with a version of MAFFT (Katoh and Standley, 2013) modified so that it can be called directly through a foreign function interface (FFI). The database of choice is SQLite (https://www.sqlite.org/), as we deemed that the strengths of that database (easy installation and management, capable of handling web traffic over 500 K hits/day) outweighed its weaknesses (not ready for highly concurrent writes, no client/server structure). Nim can also compile to JavaScript, targeting the browser and allowing the developer to write the full stack of a web service in a single language. The graphics for the webpage are created using Plotly (https://plotly.com/javascript/), interfaced using Nim both in the frontend and in the backend. The web stack is a single binary, but the core (the aligner and the classifier) is decoupled, and therefore it could also be reused to build a graphical user interface (GUI) and a command line interface (CLI) (Fig. 1). We provide three main programs in the repository:

Fig. 1

Pipeline of CovidPhy. CovidPhy offers three interfaces: a web interface, a CLI and a GUI. All three can be fed with a fasta file (top left) that is aligned using libdistfast.so against the reference (GISAID ID 402125) and scanned for differences that allow classification in a precomputed phylogeny (top, red square marked “core”). The output varies for each program: the CLI and the GUI only output the haplogroup and the variants found (bottom black square), while the web interface offers additional information: haplogroup frequencies in regions (e.g. countries), candidates for important outbreaks as inferred from database searches, and VOCs. (For interpretation of the references to colour in this figure legend, the reader is referred to the Web version of this article.)

covidphy, the web server that can be reached at www.covidphy.eu; covidphy_cli, a command line interface that is used to classify sequences in our internal pipelines; covidphy_gui, a simple graphic interface for those unfamiliar with the CLI, who may prefer to select the input with buttons.

Detection of outbreak candidates

Outbreak candidates were investigated in GISAID by searching for identical haplotypes detected at least 30 times within a period of five consecutive days in a specific country or state (note, however, that these values might change in future updates of the tool depending on, e.g., database size). Once an outbreak is detected, we determine its length by adding consecutive days after the identified event, as long as the shifted 5-day window still meets the condition of at least 30 identical sequences; alternatively, we shorten the interval at its extremes if the count on those days equals 0, provided that the shortened period still satisfies the minimum criterion of at least 30 identical haplotypes. These analyses were carried out using R software (R Core Team, 2019).
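The window rule described above can be sketched in Python as follows. This is an illustrative reading of the text, not the authors' R code: the function names, the record layout (haplotype, region, collection date) and the exact way zero-count days are trimmed from both ends are assumptions.

```python
from collections import Counter, defaultdict
from datetime import date, timedelta

WINDOW_DAYS = 5   # length of the sliding window, as stated in the text
MIN_COUNT = 30    # minimum number of identical genomes inside the window
ONE_DAY = timedelta(days=1)


def window_count(daily, start):
    """Number of genomes collected in the WINDOW_DAYS days beginning at start."""
    return sum(daily.get(start + timedelta(days=i), 0) for i in range(WINDOW_DAYS))


def find_outbreaks(dates):
    """Return (first_day, last_day, n_genomes) for each candidate outbreak.

    `dates` holds the collection dates of identical genomes from one region.
    """
    daily = Counter(dates)
    if not daily:
        return []
    outbreaks = []
    day, last = min(daily), max(daily)
    while day <= last:
        if window_count(daily, day) < MIN_COUNT:
            day += ONE_DAY
            continue
        start = day
        # Extend: shift the window forward while it still meets the threshold.
        while window_count(daily, day + ONE_DAY) >= MIN_COUNT:
            day += ONE_DAY
        end = day + timedelta(days=WINDOW_DAYS - 1)
        # Trim days with zero counts from both extremes of the interval.
        while daily.get(start, 0) == 0:
            start += ONE_DAY
        while daily.get(end, 0) == 0:
            end -= ONE_DAY
        total = sum(n for d, n in daily.items() if start <= d <= end)
        outbreaks.append((start, end, total))
        day = end + ONE_DAY
    return outbreaks


def outbreak_candidates(records):
    """Group (haplotype, region, collection_date) records and scan each group."""
    grouped = defaultdict(list)
    for haplotype, region, day in records:
        grouped[(haplotype, region)].append(day)
    result = {}
    for key, days in grouped.items():
        found = find_outbreaks(days)
        if found:
            result[key] = found
    return result
```

For example, ten identical genomes per day on three consecutive days in one country would be reported as a single three-day candidate with 30 genomes, whereas 29 genomes on a single day would not be reported.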

Results

Analysis carried out by CovidPhy

CovidPhy accepts fasta files and can also analyze genomes specified through a GISAID or a GenBank identification code. The sequences are aligned against the SARS-CoV-2 reference sequence MN908947.3. New genomes are then investigated individually by exploring the variants present, their frequency information as extracted from the large GISAID database, and their phylogenetic allocation with regard to the reference phylogeny implemented in CovidPhy. It takes about 0.1 s to align a single genome and carry out the clade classification; the rest of the analysis is even faster because the results displayed have been precomputed. While sequences can be retrieved automatically from NCBI by their codes, GISAID does not allow sharing genomes stored in its database; therefore, to investigate these sequences, the user must first register on the GISAID platform and download the fasta files directly. When a GISAID code is entered, CovidPhy only provides information on its lineage/haplogroup assignment and frequency, but not on the variants involved. For genomes uploaded directly in the fasta text box or indicated by an NCBI code, the user also obtains information on the variants present relative to the reference sequence (MN908947.3). This basic information includes the nucleotide position, the reference allele and the alternative allele, the ORF assignment, the predicted severity of the variant for the virus (low: if synonymous; medium: if non-synonymous; high: if it leads to stop gain/loss, frameshift, or start loss), its functional description (missense, synonymous, etc.), and its frequency in the database. In addition, the user is informed if the sequence carries worrying mutations, with the nucleotide and the amino acid change (if any) indicated; e.g. A23063T (N501Y), which is characteristic of B.1.1.7 and other VOCs (Davies et al., 2021).
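The severity scale described above (low, medium, high) can be illustrated for single-nucleotide substitutions with a short Python sketch. It is a simplified illustration rather than CovidPhy's annotation code: the function name snv_effect and the effect labels are hypothetical, the standard genetic code is assumed, and frameshifts (caused by indels) appear only in the lookup table because a substitution cannot produce them.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Severity scale as described in the text.
SEVERITY = {
    "synonymous": "low",
    "missense": "medium",
    "stop_gained": "high",
    "stop_lost": "high",
    "start_lost": "high",
    "frameshift": "high",  # indels only; never returned by snv_effect
}


def snv_effect(ref_codon, offset, alt_base, is_start_codon=False):
    """Classify a substitution at `offset` (0, 1 or 2) within `ref_codon`.

    Returns (effect, severity, ref_amino_acid, alt_amino_acid).
    """
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        effect = "synonymous"
    elif is_start_codon:
        effect = "start_lost"
    elif alt_aa == "*":
        effect = "stop_gained"
    elif ref_aa == "*":
        effect = "stop_lost"
    else:
        effect = "missense"
    return effect, SEVERITY[effect], ref_aa, alt_aa


if __name__ == "__main__":
    # N501Y: A>T at the first codon position; the reference codon is assumed
    # here to be AAT for illustration.
    print(snv_effect("AAT", 0, "T"))
```

Under these assumptions, the N501Y substitution is labeled a missense change of medium severity.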
Information on the length and coverage of the genomes is also provided (note that the 5′ and 3′ ends of the genomes stored in GISAID are usually missing). Genomes are automatically classified into a clade according to the phylogeny provided by Gómez-Carballa et al. (2020a, 2020b), and CovidPhy also indicates whether the sequence belongs to any of the four most important VOCs; the algorithm, however, can be applied to any classification tree. Once a sequence is classified into a clade, information is provided on the geographic location of this clade by displaying a continental map and including information on continental or country haplogroup frequencies. The phylogeny used by CovidPhy is provided in full in a tab. Briefly, there are two main clades that are phylogenetically located at the same level. Although there is no clear consensus on the root of the tree (see a discussion in Gómez-Carballa et al., 2020a), the nomenclature is based on classical cladistics and, for practical purposes, it assumes the root in haplogroup A. Haplogroups are organized and named hierarchically from this root (A > A1 > A1a > A1a2), and the branch variants are indicated. By clicking on a haplogroup label, the user is moved to another tab indicating the geographic frequencies of this lineage/haplogroup as well as the diagnostic mutational path that characterizes it. Finally, another notable feature of CovidPhy is the search for genomes carrying a particular variant or set of variants in the GISAID database, with the option of selecting by country. For a specific query, the tool reports the frequency over time, since the beginning of the pandemic, of all the haplotypes in the database that contain the queried variant(s).
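Classification into a hierarchical, precomputed phylogeny can be sketched as a descent from the root that follows a branch only when all of its diagnostic variants are present in the genome. The Python example below is a minimal illustration under that assumption; the tree, its variant labels (e.g. C100T) and the function name classify are invented for the example and do not correspond to the real CovidPhy phylogeny or code.

```python
# Toy phylogeny: every haplogroup lists the variants that define its branch
# and its child haplogroups. All variant labels here are invented.
TOY_TREE = {
    "A":    {"variants": set(),       "children": ["A1", "A2"]},
    "A1":   {"variants": {"C100T"},   "children": ["A1a"]},
    "A1a":  {"variants": {"G200A"},   "children": ["A1a2"]},
    "A1a2": {"variants": {"T300C"},   "children": []},
    "A2":   {"variants": {"C400T"},   "children": []},
}


def classify(variants, tree, root="A"):
    """Return the path from the root to the deepest haplogroup supported.

    A child branch is followed only if all of its diagnostic variants are
    present in the genome; private variants of the genome are ignored.
    """
    observed = set(variants)
    node = root
    path = [root]
    while True:
        matches = [child for child in tree[node]["children"]
                   if tree[child]["variants"] <= observed]
        if not matches:
            return path
        # If several children match, prefer the one with most diagnostic variants.
        node = max(matches, key=lambda child: len(tree[child]["variants"]))
        path.append(node)


if __name__ == "__main__":
    genome = {"C100T", "G200A", "T300C", "A999G"}
    print(" > ".join(classify(genome, TOY_TREE)))
```

A real classifier would also need to handle back-mutations, missing data and recurrent variants, which this sketch ignores.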

Outbreaks recorded in SARS-CoV-2 databases

By inspecting the large database of GISAID, it is possible to identify specific SARS-CoV-2 sequences as good candidates for being responsible for large sudden COVID-19 outbreaks triggered by superspreading events. These events are listed in CovidPhy by geographic region and by country. Information is provided on the number of sequences represented in the database for the exponential growth period, and the relative frequency of the responsible genome against the other genomes circulating in the same region during the same timeframe. Lineage assignation of the candidate genome is provided, and the continental frequency of this haplogroup can also be graphically displayed on a map.
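The relative frequency mentioned above is a simple proportion: the number of copies of the candidate genome divided by all genomes sampled in the same region during the same period. A minimal Python sketch follows; the record layout (haplotype, region, collection date) and the function name relative_impact are assumptions made for illustration.

```python
from datetime import date


def relative_impact(records, candidate, region, start, end):
    """Return (n_candidate, relative_frequency) for one region and period.

    `records` is an iterable of (haplotype, region, collection_date) tuples;
    the period is inclusive on both ends.
    """
    in_window = [hap for hap, reg, day in records
                 if reg == region and start <= day <= end]
    if not in_window:
        return 0, 0.0
    n_candidate = in_window.count(candidate)
    return n_candidate, n_candidate / len(in_window)


if __name__ == "__main__":
    toy = [("hapX", "Spain", date(2020, 3, 1))] * 3 + [("hapY", "Spain", date(2020, 3, 2))]
    print(relative_impact(toy, "hapX", "Spain", date(2020, 3, 1), date(2020, 3, 5)))
```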

Discussion

The COVID-19 pandemic has impacted every region of the world. There is much interest in investigating and tracking the evolutionary characteristics of SARS-CoV-2 variants and lineages. Apart from the interest of CovidPhy for research, there is also demand from, e.g., microbiology units in hospitals that lack bioinformatic tools for processing the genome sequences they generate on a daily basis. There are similar tools available that can process SARS-CoV-2 genome sequences and carry out different kinds of analyses. Compared to previous developments, CovidPhy offers additional features. For instance, it automatically classifies a given genome into a clade of a SARS-CoV-2 phylogeny and provides phylogeographic information for this genome. Additionally, it allows variant searches in the large GISAID database, providing information on the frequencies of haplotypes containing the queried variation. It also provides information on lineages that have played a critical role in the dispersal of the SARS-CoV-2 pathogen by initiating rapid and sudden outbreaks across the world. CovidPhy has been specifically designed for handling information on SARS-CoV-2 sequences, but it can easily be extended to other microorganisms of interest for which large datasets are available (e.g. influenza in GISAID).

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

1. 'A bloody mess': Confusion reigns over naming of new COVID variants.

Authors: Ewen Callaway
Journal: Nature Date: 2021-01 Impact factor: 49.962

2. Clustering and superspreading potential of SARS-CoV-2 infections in Hong Kong.

Authors: Dillon C Adam; Peng Wu; Jessica Y Wong; Eric H Y Lau; Tim K Tsang; Simon Cauchemez; Gabriel M Leung; Benjamin J Cowling
Journal: Nat Med Date: 2020-09-17 Impact factor: 53.440

3. MAFFT multiple sequence alignment software version 7: improvements in performance and usability.

Authors: Kazutaka Katoh; Daron M Standley
Journal: Mol Biol Evol Date: 2013-01-16 Impact factor: 16.240

4. CoV-Seq, a New Tool for SARS-CoV-2 Genome Analysis and Visualization: Development and Usability Study.

Authors: Boxiang Liu; Kaibo Liu; He Zhang; Liang Zhang; Yuchen Bian; Liang Huang
Journal: J Med Internet Res Date: 2020-10-02 Impact factor: 5.428

5. VIGOR, an annotation program for small viral genomes.

Authors: Shiliang Wang; Jaideep P Sundaram; David Spiro
Journal: BMC Bioinformatics Date: 2010-09-07 Impact factor: 3.169

6. VAPiD: a lightweight cross-platform viral annotation pipeline and identification tool to facilitate virus genome submissions to NCBI GenBank.

Authors: Ryan C Shean; Negar Makhsous; Graham D Stoddard; Michelle J Lin; Alexander L Greninger
Journal: BMC Bioinformatics Date: 2019-01-23 Impact factor: 3.169

7. Phylogenetic analysis of SARS-CoV-2 in Boston highlights the impact of superspreading events.

Authors: Jacob E Lemieux; Katherine J Siddle; Glen R Gallagher; Lawrence C Madoff; Sandra Smole; Virginia M Pierce; Eric Rosenberg; Pardis C Sabeti; Daniel J Park; Bronwyn L MacInnis; Bennett M Shaw; Christine Loreth; Stephen F Schaffner; Adrianne Gladden-Young; Gordon Adams; Timelia Fink; Christopher H Tomkins-Tinch; Lydia A Krasilnikova; Katherine C DeRuff; Melissa Rudy; Matthew R Bauer; Kim A Lagerborg; Erica Normandin; Sinéad B Chapman; Steven K Reilly; Melis N Anahtar; Aaron E Lin; Amber Carter; Cameron Myhrvold; Molly E Kemball; Sushma Chaluvadi; Caroline Cusick; Katelyn Flowers; Anna Neumann; Felecia Cerrato; Maha Farhat; Damien Slater; Jason B Harris; John A Branda; David Hooper; Jessie M Gaeta; Travis P Baggett; James O'Connell; Andreas Gnirke; Tami D Lieberman; Anthony Philippakis; Meagan Burns; Catherine M Brown; Jeremy Luban; Edward T Ryan; Sarah E Turbett; Regina C LaRocque; William P Hanage
Journal: Science Date: 2020-12-10 Impact factor: 47.728

8. Genetic structure of SARS-CoV-2 reflects clonal superspreading and multiple independent introduction events, North-Rhine Westphalia, Germany, February and March 2020.

Authors: Andreas Walker; Torsten Houwaart; Tobias Wienemann; Malte Kohns Vasconcelos; Daniel Strelow; Tina Senff; Lisanna Hülse; Ortwin Adams; Marcel Andree; Sandra Hauka; Torsten Feldt; Björn-Erik Jensen; Verena Keitel; Detlef Kindgen-Milles; Jörg Timm; Klaus Pfeffer; Alexander T Dilthey
Journal: Euro Surveill Date: 2020-06

9. Phylogeography of SARS-CoV-2 pandemic in Spain: a story of multiple introductions, micro-geographic stratification, founder effects, and super-spreaders.

Authors: Alberto Gómez-Carballa; Xabier Bello; Jacobo Pardo-Seco; María Luisa Pérez Del Molino; Federico Martinón-Torres; Antonio Salas
Journal: Zool Res Date: 2020-11-18

10. An online coronavirus analysis platform from the National Genomics Data Center.

Authors: Zheng Gong; Jun-Wei Zhu; Cui-Ping Li; Shuai Jiang; Li-Na Ma; Bi-Xia Tang; Dong Zou; Mei-Li Chen; Yu-Bin Sun; Shu-Hui Song; Zhang Zhang; Jing-Fa Xiao; Yong-Biao Xue; Yi-Ming Bao; Zheng-Lin Du; Wen-Ming Zhao
Journal: Zool Res Date: 2020-11-18
