
Genotyping SARS-CoV-2 through an interactive web application.

Hassaan Maan1,2, Hamza Mbareche3,4, Amogelang R Raphenya5,6, Arinjay Banerjee7,8, Jalees A Nasir5,6, Robert A Kozak3, Natalie Knox9,10, Samira Mubareka3,4, Andrew G McArthur5,6, Bo Wang1,2,11,12.   

Year:  2020        PMID: 32835191      PMCID: PMC7292579          DOI: 10.1016/S2589-7500(20)30140-0

Source DB:  PubMed          Journal:  Lancet Digit Health        ISSN: 2589-7500


The ongoing COVID-19 pandemic is the greatest health-care challenge of this generation. Early viral genome sequencing studies of small cohorts have indicated the possibility of distinct severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genotypes. If these subtypes result in an altered virus tropism or pathogenesis in infected hosts, this could have immediate implications for vaccine design, drug development, and efforts to control the pandemic. Therefore, the genomic surveillance and characterisation of circulating viral strains is a high priority for research and development. To facilitate the epidemiological tracking of SARS-CoV-2, researchers worldwide have created various web portals and tools, such as the Johns Hopkins University COVID-19 dashboard. An unprecedented effort to make COVID-19-related data accessible in near real-time has resulted in more than 25 000 publicly available genome sequences of SARS-CoV-2 on the Global Initiative on Sharing All Influenza Data (GISAID) platform. Although platforms to survey epidemiological data are prevalent, tools that summarise publicly available viral genome data are scarce, and those that are available do not offer users the ability to analyse in-house sequencing data.
To address this gap, we have developed an accessible application, the COVID-19 Genotyping Tool (CGT). A video demonstration of CGT is available in appendix 1. CGT uses publicly deposited SARS-CoV-2 consensus genome sequences from the GISAID EpiCoV™ database, and summarises relevant information through three visualisations: Uniform Manifold Approximation and Projection (UMAP), Minimum Spanning Trees (MST) of sequence networks, and allele frequencies of annotated high-prevalence non-synonymous Single-Nucleotide Polymorphisms (SNPs) within structural protein-coding genomic regions (envelope, membrane, nucleocapsid, and spike proteins; figure). New sequencing data from GISAID are added to CGT once a week.
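To make the allele-frequency summary concrete, the following is a minimal sketch of how a minor allele frequency could be computed per column of an already aligned set of sequences. It is not the CGT implementation: the function name `minor_allele_frequencies` and the `min_maf` threshold are hypothetical, and the sketch does not decide whether a variant is non-synonymous or which gene it falls in, both of which would need a genome annotation.

```python
from collections import Counter


def minor_allele_frequencies(aligned, min_maf=0.0):
    """Return {position: (major, minor, maf)} for variable alignment columns.

    `aligned` is a list of equal-length nucleotide strings (hypothetical input
    format). The minor allele is taken as the second most common base.
    """
    if not aligned:
        return {}
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")

    result = {}
    for pos in range(length):
        # Count only unambiguous bases; gaps and N are ignored.
        counts = Counter(
            seq[pos].upper() for seq in aligned if seq[pos].upper() in "ACGT"
        )
        if len(counts) < 2:
            continue  # invariant column, nothing to report
        (major, _), (minor, minor_count) = counts.most_common(2)
        maf = minor_count / sum(counts.values())
        if maf >= min_maf:
            result[pos] = (major, minor, maf)
    return result


if __name__ == "__main__":
    toy_alignment = ["ACGT", "ACGT", "ACGT", "ATGT"]
    print(minor_allele_frequencies(toy_alignment))
```

In this toy alignment only the second column varies (three C, one T), so it is the only site reported. Ranking sites by the returned frequency would give the "most prevalent" variants in the sense used in the text.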
Currently, three metadata types can be overlaid on the visualisations: the region of sample collection, the country of sample collection, and the sample collection date with respect to the start of the pandemic (heuristically defined as Dec 1, 2019; figure; appendix 2). Users can upload post-assembly consensus FASTA sequences of SARS-CoV-2 for interactive analysis. The 3' and 5' untranslated regions of genomes are trimmed because of low sequence identity, caused by difficulties in their amplification and sequencing. After sequence alignment, the DNA distance is calculated using the Kimura-80 (Kimura two-parameter) model of nucleotide substitution. After the calculation of the distance and annotation of SNPs, visualisations are reactively reprocessed with user-uploaded data.
We use UMAP and network analysis instead of phylogenetics because of their faster computation time and ease of interpretation for large datasets. UMAP reduces the high-dimensional representation of DNA sequences to a two-dimensional embedding in which each point is a SARS-CoV-2 genome and the distance between points indicates sequence similarity. The MST of the SARS-CoV-2 genome network is the subset of network edges that connects all the nodes while minimising the total distance. MSTs have been used before in outbreak analysis to identify the most probable transmission events between hosts, and can therefore offer epidemiological insight into SARS-CoV-2 transmission. Lastly, heterogeneity within structural proteins is particularly notable in terms of the host immune response, and has direct implications for vaccine development. With this in mind, we present the most prevalent (based on minor allele frequency) non-synonymous SNPs in the genomic regions of structural proteins.
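The distance and MST steps described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the CGT pipeline: it assumes the sequences are already aligned and trimmed, uses the standard Kimura two-parameter formula d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q) (P and Q being the proportions of transitions and transversions), and builds the tree with Prim's algorithm, which is one common choice and may differ from what CGT uses. The function names are hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k80_distance(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError if the sequences are too divergent (saturated).
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


def pairwise_distances(seqs):
    """Dense symmetric matrix of K80 distances."""
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = k80_distance(seqs[i], seqs[j])
    return dist


def minimum_spanning_tree(dist):
    """Prim's algorithm on a dense distance matrix; returns edges (i, j, d)."""
    n = len(dist)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge linking each node to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, dist[parent[u]][u]))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges


if __name__ == "__main__":
    toy = ["ACGTACGTAC", "ACGTACGTAT", "ACGTACGTGT"]  # made-up sequences
    for i, j, d in minimum_spanning_tree(pairwise_distances(toy)):
        print(f"{i} -- {j}: {d:.4f}")
```

In the toy input, sequence 1 differs from each of the other two by a single transition, so the tree links 0 to 1 and 1 to 2 rather than joining 0 and 2 directly. The dense O(n²) matrix is only practical for small inputs; at the scale of tens of thousands of genomes a sparse or approximate approach would likely be needed.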
Figure

The COVID-19 Genotyping Tool user interface

There are three main tabs: Input, which allows users to upload SARS-CoV-2 genome sequences and adjust the visualisations; Details, which allows users to view details of the analysis algorithms; and Data, which allows users to view data sources and privacy information. The main panel consists of a tab interface with the three visualisations in the application (UMAP, MST, and SNP). Metadata corresponding to the region of sample origin, the country of sample origin, and the sample collection date can be toggled and overlaid on each visualisation. MST=minimum spanning trees. SARS-CoV-2=severe acute respiratory syndrome coronavirus 2. SNP=single-nucleotide polymorphisms. UMAP=uniform manifold approximation and projection.

Our results indicate that there are distinct viral isolate clusters for SARS-CoV-2 sequences uploaded to GISAID. Larger outbreak clusters and hubs from UMAP and MST probably reflect outbreak epicentres (figure). Smaller clusters might be indicative of isolated outbreaks, and singleton samples might be indicative of isolated cases (figure). The analysis of SNPs reveals variants with notable minor allele frequencies involving missense substitutions in structural protein-coding genome regions (figure). Our novel tool is accessible to those who might not be trained in bioinformatics and epidemiological analysis, and thus it serves as a platform for aiding in a pivotal aspect of the global research effort against the COVID-19 pandemic. The current limitations of CGT include the sequence input limit and processing time, both of which result from the size of the public dataset (>25 000 GISAID sequences) that must be processed together with user-input sequences. Our team is working to continuously integrate optimisations to the CGT data processing pipeline through code parallelisation and the refinement of deployment infrastructure.
Future releases of the application will aim to decrease the processing time, increase the user-input sequence limit, incorporate the input and processing of raw SARS-CoV-2 sequencing data (eg, FASTQ files), and add information related to sequence epidemiology, such as travel history. Complete documentation of the CGT analysis pipeline and the application source code are available in a GitHub repository. Our application does not store any user-uploaded sequence data on the server side or client side; CGT simply processes the data to create updated visualisations, and information does not persist after the user disconnects. SARS-CoV-2 genome sequence data and linked metadata from GISAID are not published on our website, as per the GISAID data usage policy. Up-to-date acknowledgments for the usage of GISAID uploaded sequences for analysis are available in the GitHub repository. Our team is grateful to all the researchers who have shared SARS-CoV-2 viral genome sequencing data on GISAID.
  4 in total

1.  A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences.

Authors:  M Kimura
Journal:  J Mol Evol       Date:  1980-12       Impact factor: 2.395

2.  Use of the minimum spanning tree model for molecular epidemiological investigation of a nosocomial outbreak of hepatitis C virus infection.

Authors:  Enea Spada; Luciano Sagliocca; John Sourdis; Anna Rosa Garbuglia; Vincenzo Poggi; Carmela De Fusco; Alfonso Mele
Journal:  J Clin Microbiol       Date:  2004-09       Impact factor: 5.948

3.  Data, disease and diplomacy: GISAID's innovative contribution to global health.

Authors:  Stefan Elbe; Gemma Buckland-Merrett
Journal:  Glob Chall       Date:  2017-01-10

4.  An interactive web-based dashboard to track COVID-19 in real time.

Authors:  Ensheng Dong; Hongru Du; Lauren Gardner
Journal:  Lancet Infect Dis       Date:  2020-02-19       Impact factor: 25.071

  2 in total

1.  Emerging SARS-CoV-2 Diversity Revealed by Rapid Whole-Genome Sequence Typing.

Authors:  Ahmed M Moustafa; Paul J Planet
Journal:  Genome Biol Evol       Date:  2021-09-01       Impact factor: 4.065

2.  A Comparison of Whole Genome Sequencing of SARS-CoV-2 Using Amplicon-Based Sequencing, Random Hexamers, and Bait Capture.

Authors:  Jalees A Nasir; Robert A Kozak; Patryk Aftanas; Amogelang R Raphenya; Kendrick M Smith; Finlay Maguire; Hassaan Maan; Muhannad Alruwaili; Arinjay Banerjee; Hamza Mbareche; Brian P Alcock; Natalie C Knox; Karen Mossman; Bo Wang; Julian A Hiscox; Andrew G McArthur; Samira Mubareka
Journal:  Viruses       Date:  2020-08-15       Impact factor: 5.048
