An online coronavirus analysis platform from the National Genomics Data Center.

Zheng Gong1,2, Jun-Wei Zhu1,3, Cui-Ping Li1,3, Shuai Jiang1,3, Li-Na Ma1,3, Bi-Xia Tang1,3, Dong Zou1,3, Mei-Li Chen1,3, Yu-Bin Sun1,3, Shu-Hui Song1,3, Zhang Zhang1,3,2, Jing-Fa Xiao1,3,2, Yong-Biao Xue1,3,2, Yi-Ming Bao1,3,2, Zheng-Lin Du1,4, Wen-Ming Zhao1,3,5.   

Abstract

Since the first reported severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection in December 2019, coronavirus disease 2019 (COVID-19) has become a global pandemic, spreading to more than 200 countries and regions worldwide. With continued research progress and virus detection, SARS-CoV-2 genomes and sequencing data have been reported and accumulated at an unprecedented rate. To meet the need for fast analysis of these genome sequences, the National Genomics Data Center (NGDC) of the China National Center for Bioinformation (CNCB) has established an online coronavirus analysis platform, which includes de novo assembly, BLAST alignment, genome annotation, variant identification, and variant annotation modules. The online analysis platform can be freely accessed at the 2019 Novel Coronavirus Resource (2019nCoVR) (https://bigd.big.ac.cn/ncov/online/tools).

Keywords:  Coronavirus; Genome annotation; High-throughput sequencing; Variant identification; de novo assembly

Year:  2020        PMID: 33045776      PMCID: PMC7671910          DOI: 10.24272/j.issn.2095-8137.2020.065

Source DB:  PubMed          Journal:  Zool Res        ISSN: 2095-8137


DEAR EDITOR,
As of 1 October 2020, the Global Initiative on Sharing All Influenza Data (GISAID, https://www.gisaid.org/) (Shu & McCauley, 2017) contained 131 424 SARS-CoV-2 sequences, the 2019 Novel Coronavirus Resource (2019nCoVR) (Song et al., 2020; Zhao et al., 2020) contained 135 979 genome sequences, and the National Center for Biotechnology Information (NCBI) (Leinonen et al., 2011) contained 61 551 high-throughput sequencing runs. In addition, the Genome Sequence Archive (GSA) (Wang et al., 2017) has released more than 200 accessions of SARS-CoV-2 sequencing runs. These data provide important information for SARS-CoV-2-based studies on viral classification, viral tracing, viral mutations, genome evolution, and antiviral drug development. Thus, there is an urgent need for a comprehensive online analysis platform to deal with the massive amount of data available.
To promote studies and applications based on SARS-CoV-2 sequencing data, specific sequence analysis tools have been established in several online platforms worldwide. For example, NCBI has provided the BLAST alignment tool (Altschul et al., 1990) in SARS-CoV-2 Resources (https://www.ncbi.nlm.nih.gov/sars-cov-2/). The University of California, Santa Cruz (UCSC) SARS-CoV-2 Genome Browser has integrated the visualization browser with BLAT alignment and variant annotation tools (https://genome.ucsc.edu/covid19.html) (Fernandes et al., 2020). The National Microbiology Data Center (NMDC) has provided various analysis tools, such as BLAST alignment and phylogenetic analysis, in the Global Coronavirus Data Sharing and Analysis System (http://nmdc.cn/coronavirus/). The Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences (CAS), has established the Virus Identification Cloud (VIC, https://www.biosino.org/vic/), offering online analysis services for viral sequence identification and genome assembly. The Genome Detective webserver has also provided a virus identification workflow for high-throughput sequencing data (https://www.genomedetective.com/) (Cleemput et al., 2020). Although the above SARS-CoV-2 analysis tools provide online services, their functions are relatively limited and do not cover all aspects of SARS-CoV-2 research (Table 1).
Table 1

Analysis function comparison of SARS-CoV-2 online resources

Resources compared (columns): 2019nCoVR; NCBI SARS-CoV-2 resources; UCSC SARS-CoV-2 browser; Genome Detective; NMDC; VIC
Functions or features (rows):
  Genome sequences: Sequence comparison; Gene annotation; Variant identification; Phylogenetic tree
  NGS raw reads: De novo assembly; Variant identification; Variant annotation
  Open access: Do not need login
*: 2019nCoVR: 2019 Novel Coronavirus Resource; NCBI: National Center for Biotechnology Information; UCSC: University of California, Santa Cruz; NMDC: National Microbiology Data Center; VIC: Virus Identification Cloud.
Thus, to provide a unified and convenient approach for processing SARS-CoV-2 sequencing data, the National Genomics Data Center (NGDC) of the China National Center for Bioinformation (CNCB) established an online coronavirus analysis platform based on viral genomes collected in 2019nCoVR (https://bigd.big.ac.cn/ncov/online/tools), offering free analysis services for researchers. The platform includes five functional modules (Figure 1), which cover various SARS-CoV-2 genomic data analyses.
Figure 1

Processing workflow and webpage demonstration of analysis results

A: Analysis modules are in the middle of the figure. Main software used in the workflow is shown beside each module. B–D: Analysis demonstration of de novo assembly, variant identification, and genome annotation modules. N/A: Not available.
1. De novo assembly module
This module can be used for de novo assembly of next-generation sequencing (NGS) data. First, raw reads are trimmed for quality using Trimmomatic (Bolger et al., 2014) with the settings SLIDINGWINDOW:4:15, LEADING:3, TRAILING:3, and MINLEN:36. Megahit (Li et al., 2015) is then used for sequence assembly with default parameters. The assembled sequences are compared with the SARS-CoV-2 reference genome (NC_045512.2) using BLASTN (Altschul et al., 1990) to identify target sequence(s), and assembly quality is evaluated using QUAST (Gurevich et al., 2013). The assembly results depend on the quality of the samples and sequencing data and may consist of a complete genome or several contigs. In the future, we plan to assemble those contigs into a single sequence by alignment with the reference genome, and to support genome assembly for third-generation sequencing data.
2. BLAST module
To compare sequences among virus strains, the analysis platform includes a BLAST alignment module with three algorithms (BLASTN, Mega BLAST, and discontinuous Mega BLAST) (Altschul et al., 1990). Users can select the SARS-CoV-2 reference genome, the 2019nCoVR genome database, or the coronavirus genome database (including the alpha, beta, delta, and gamma genera) for online BLAST.
3. Genome annotation module
To perform sequence comparison and evolutionary analysis on specific viral genes, gene annotations are required. However, most viral genomes in the above SARS-CoV-2 databases are not annotated. Therefore, we built a genome annotation module based on VAPiD (Shean et al., 2019), which can identify coding sequences (CDS) or protein sequences and generate a GenBank annotation file.
4. Variant identification modules
The variant identification function consists of the Genome-to-Variants and Fastq-to-Variants modules. Both modules use the genome NC_045512.2 as the default reference, but users can customize the reference by uploading a genome file. Genome-to-Variants can detect mutation sites from complete or partial genomes, using Muscle (Edgar, 2004) for sequence alignment. Fastq-to-Variants can identify genome variants from NGS raw data and connects seamlessly to the GSA system to load massive raw sequencing data to the server automatically. Sequencing reads are aligned to the SARS-CoV-2 reference genome (NC_045512.2) using BWA (Li & Durbin, 2009), after which Picard (http://broadinstitute.github.io/picard/) is used to remove duplicate reads and calculate the aligned read number, error rate, sequencing depth, and genome coverage. Single nucleotide polymorphisms (SNPs) and insertions and deletions (indels) are identified using GATK (McKenna et al., 2010).
5. Variation annotation module
To clarify the influence of mutations on gene function, the variation annotation module integrates the Ensembl Variant Effect Predictor (VEP) (McLaren et al., 2016) to show codon and amino acid changes, and then calculates the degree of functional influence.
It is worth mentioning that the parameters for the data analysis modules have been highly optimized to improve efficiency and reduce computing time. For example, when testing the running time of the Fastq-to-Variants module on one 24-core server, it took ~1 min to process 1 Gb of NGS data and less than 4 min to handle 8 Gb of NGS data (Table 2). For this online platform, we established five servers to provide public service, which indicates that the platform has the capacity to analyze 7 200 NGS datasets in one day if each dataset is smaller than 1 Gb. In general, a notification email is automatically sent to users when computing jobs are finished.
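As an illustration of the comparison step behind the Genome-to-Variants module, the sketch below lists SNPs and indels from a reference and a query sequence that have already been aligned (for example, two rows of a Muscle alignment). It is not the platform's actual code: the function name call_variants, the tuple layout, and the decision to ignore unaligned query ends (so that partial genomes do not produce spurious deletions) are assumptions made for this example.

```python
from typing import List, Tuple

Variant = Tuple[int, str, str, str]  # (1-based reference position, ref, alt, kind)


def call_variants(ref_aln: str, qry_aln: str) -> List[Variant]:
    """List SNPs and indels between two gapped, equal-length aligned sequences.

    Hypothetical helper: positions refer to the ungapped reference.
    An insertion is reported at the reference base that precedes it.
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")
    n = len(ref_aln)
    # Columns before the first / after the last query base are treated as
    # uncovered (partial genome), not as deletions.
    first = next((k for k, c in enumerate(qry_aln) if c != "-"), n)
    last = n - 1 - next((k for k, c in enumerate(reversed(qry_aln)) if c != "-"), n)

    variants: List[Variant] = []
    ref_pos = 0  # number of reference bases consumed so far
    i = 0
    while i < n:
        r, q = ref_aln[i], qry_aln[i]
        if i < first or i > last:
            # Outside the region covered by the query: only advance the reference.
            if r != "-":
                ref_pos += 1
            i += 1
        elif r == "-":
            # Run of gap columns in the reference: insertion in the query.
            j = i
            while j < n and ref_aln[j] == "-":
                j += 1
            inserted = qry_aln[i:j].replace("-", "")
            if inserted:
                variants.append((ref_pos, "-", inserted, "insertion"))
            i = j
        elif q == "-":
            # Run of gap columns in the query: deletion relative to the reference.
            j = i
            while j <= last and qry_aln[j] == "-" and ref_aln[j] != "-":
                j += 1
            deleted = ref_aln[i:j]
            variants.append((ref_pos + 1, deleted, "-", "deletion"))
            ref_pos += len(deleted)
            i = j
        else:
            ref_pos += 1
            # Ambiguous query bases (N) are not reported as SNPs.
            if r.upper() != q.upper() and q.upper() != "N":
                variants.append((ref_pos, r, q, "SNP"))
            i += 1
    return variants


# Toy alignment (not real SARS-CoV-2 sequence).
EXAMPLE = call_variants("ACGT-ACGTACGT", "ACTTGAC--ACG-")
```

On the toy alignment, EXAMPLE contains one SNP (G to T at position 3), one insertion (G after position 4), and one deletion (GT starting at position 7); the trailing query gap is skipped as uncovered sequence.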
Table 2

Reference running time

                      Data1        Data2        Data3        Data4        Data5
NCBI accession No.    SRR11247077  SRR11092064  SRR11092057  SRR11092058  SRR10971381
Calculation time*     0 m 37 s     0 m 55 s     1 m 10 s     1 m 36 s     3 m 42 s
Data size (bp)        118 M        1.0 G        1.5 G        2.2 G        8.0 G
*: Run on 24 CPU cores.
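
The capacity figure of 7 200 runs per day quoted in the text can be checked with simple arithmetic, sketched below. The assumptions are ours, not the authors': each of the five servers processes one run at a time, back to back, and a run of up to 1 Gb takes about 1 min. The names daily_capacity, seconds_per_gb, and TABLE2 are hypothetical; the numbers in TABLE2 are copied from Table 2.

```python
def daily_capacity(servers: int, minutes_per_run: float) -> int:
    """Runs per day, assuming each server handles one run at a time, back to back."""
    minutes_per_day = 24 * 60
    return int(servers * minutes_per_day // minutes_per_run)


# Table 2 values: accession -> (data size in Gb, calculation time in seconds).
TABLE2 = {
    "SRR11247077": (0.118, 37),
    "SRR11092064": (1.0, 55),
    "SRR11092057": (1.5, 70),
    "SRR11092058": (2.2, 96),
    "SRR10971381": (8.0, 222),
}


def seconds_per_gb(size_gb: float, seconds: float) -> float:
    """Processing time normalized by data size."""
    return seconds / size_gb


if __name__ == "__main__":
    print(daily_capacity(servers=5, minutes_per_run=1.0))
    for accession, (size_gb, seconds) in TABLE2.items():
        print(accession, round(seconds_per_gb(size_gb, seconds), 1), "s/Gb")
```

With five servers and 1 min per run, daily_capacity returns 7200, matching the stated capacity. The per-Gb times also show that larger runs are processed more efficiently per base (about 55 s/Gb at 1.0 Gb versus about 28 s/Gb at 8.0 Gb).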
For future applications, we will continue to improve this specialized online platform by integrating more tools, software, and pipelines for SARS-CoV-2 data analysis and provide one-click and public data analysis services for coronavirus researchers.

COMPETING INTERESTS

The authors declare that they have no competing interests.

AUTHORS’ CONTRIBUTIONS

W.M.Z., Y.M.B., Y.B.X., J.F.X., and Z.Z. designed the research. Z.L.D., Z.G., L.N.M., S.J., S.H.S., M.L.C., and C.P.L. implemented the analysis modules. J.W.Z., B.X.T., D.Z., and Y.B.S. built the web server. Z.L.D. and Z.G. wrote the manuscript. Y.B.X., W.M.Z., and Z.L.D. revised the manuscript. All authors read and approved the final version of the manuscript.
  14 in total

1.  MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph.

Authors:  Dinghua Li; Chi-Man Liu; Ruibang Luo; Kunihiko Sadakane; Tak-Wah Lam
Journal:  Bioinformatics       Date:  2015-01-20       Impact factor: 6.937

2.  QUAST: quality assessment tool for genome assemblies.

Authors:  Alexey Gurevich; Vladislav Saveliev; Nikolay Vyahhi; Glenn Tesler
Journal:  Bioinformatics       Date:  2013-02-19       Impact factor: 6.937

3.  The 2019 novel coronavirus resource.

Authors:  Wen-Ming Zhao; Shu-Hui Song; Mei-Li Chen; Dong Zou; Li-Na Ma; Ying-Ke Ma; Ru-Jiao Li; Li-Li Hao; Cui-Ping Li; Dong-Mei Tian; Bi-Xia Tang; Yan-Qing Wang; Jun-Wei Zhu; Huan-Xin Chen; Zhang Zhang; Yong-Biao Xue; Yi-Ming Bao
Journal:  Yi Chuan       Date:  2020-02-20

4.  The sequence read archive.

Authors:  Rasko Leinonen; Hideaki Sugawara; Martin Shumway
Journal:  Nucleic Acids Res       Date:  2010-11-09       Impact factor: 16.971

5.  GISAID: Global initiative on sharing all influenza data - from vision to reality.

Authors:  Yuelong Shu; John McCauley
Journal:  Euro Surveill       Date:  2017-03-30

6.  Genome Detective Coronavirus Typing Tool for rapid identification and characterization of novel coronavirus genomes.

Authors:  Sara Cleemput; Wim Dumon; Vagner Fonseca; Wasim Abdool Karim; Marta Giovanetti; Luiz Carlos Alcantara; Koen Deforche; Tulio de Oliveira
Journal:  Bioinformatics       Date:  2020-06-01       Impact factor: 6.937

7.  The UCSC SARS-CoV-2 Genome Browser.

Authors:  Jason D Fernandes; Angie S Hinrichs; Hiram Clawson; Jairo Navarro Gonzalez; Brian T Lee; Luis R Nassar; Brian J Raney; Kate R Rosenbloom; Santrupti Nerli; Arjun A Rao; Daniel Schmelter; Alastair Fyfe; Nathan Maulding; Ann S Zweig; Todd M Lowe; Manuel Ares; Russ Corbet-Detig; W James Kent; David Haussler; Maximilian Haeussler
Journal:  Nat Genet       Date:  2020-10       Impact factor: 38.330

8.  Fast and accurate short read alignment with Burrows-Wheeler transform.

Authors:  Heng Li; Richard Durbin
Journal:  Bioinformatics       Date:  2009-05-18       Impact factor: 6.937

9.  MUSCLE: a multiple sequence alignment method with reduced time and space complexity.

Authors:  Robert C Edgar
Journal:  BMC Bioinformatics       Date:  2004-08-19       Impact factor: 3.169

10.  Trimmomatic: a flexible trimmer for Illumina sequence data.

Authors:  Anthony M Bolger; Marc Lohse; Bjoern Usadel
Journal:  Bioinformatics       Date:  2014-04-01       Impact factor: 6.937
