Shuhui Song, Lina Ma, Dong Zou, Dongmei Tian, Cuiping Li, Junwei Zhu, Meili Chen, Anke Wang, Yingke Ma, Mengwei Li, Xufei Teng, Ying Cui, Guangya Duan, Mochen Zhang, Tong Jin, Chengmin Shi, Zhenglin Du, Yadong Zhang, Chuandong Liu, Rujiao Li, Jingyao Zeng, Lili Hao, Shuai Jiang, Hua Chen, Dali Han, Jingfa Xiao, Zhang Zhang, Wenming Zhao, Yongbiao Xue, Yiming Bao.
Abstract
On January 22, 2020, the China National Center for Bioinformation (CNCB) released the 2019 Novel Coronavirus Resource (2019nCoVR), an open-access information resource for the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). 2019nCoVR features a comprehensive integration of sequence and clinical information for all publicly available SARS-CoV-2 isolates, which are manually curated with value-added annotations and quality-evaluated by an automated in-house pipeline. Of particular note, 2019nCoVR offers systematic analyses to generate a dynamic landscape of SARS-CoV-2 genomic variations at a global scale. It provides all identified variants and their detailed statistics for each virus isolate, and aggregates the quality score, functional annotation, and population frequency for each variant. The spatiotemporal change of each variant can be visualized, and historical viral haplotype network maps for the course of the outbreak are also generated from all complete and high-quality genomes available. Moreover, 2019nCoVR provides a full collection of SARS-CoV-2-relevant literature on coronavirus disease 2019 (COVID-19), including published papers from PubMed as well as preprints from services such as bioRxiv and medRxiv through Europe PMC. Furthermore, by linking with relevant databases in CNCB, 2019nCoVR offers data submission services for raw sequence reads and assembled genomes, and data sharing with NCBI. Collectively, 2019nCoVR is updated daily to collect the latest information on genome sequences, variants, haplotypes, and literature for a timely reflection, making it a valuable resource for the global research community. 2019nCoVR is accessible at https://bigd.big.ac.cn/ncov/.
Keywords: 2019nCoVR; Database; Genomic variation; Haplotype; SARS-CoV-2
Year: 2020 PMID: 33704069 PMCID: PMC7836967 DOI: 10.1016/j.gpb.2020.09.001
Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691
Comparison of functional modules between two versions of 2019nCoVR
Figure 1. Statistics and distribution of all released SARS-CoV-2 genomes in 2019nCoVR as of July 14, 2020
A. Number and percentage of complete and high-quality genomes. B. Distribution of sequence number across different ranges of Ns for low-quality genomes. C. Frequency distribution of Ns across the whole genome. A sequence is defined as “complete” if it is longer than 29,000 bp and covers all protein-coding regions of SARS-CoV-2 (nt 266–29674 of GenBank: MN908947.3); otherwise, it is defined as “partial”. A sequence is considered “high-quality” if it contains ≤ 15 Ns and ≤ 50 Ds, and “low-quality” otherwise. N, unknown base; D, degenerate base.
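The completeness and quality definitions in this legend can be expressed as a small classifier. The following is a minimal sketch, not the in-house pipeline itself: the function name, the `QualityCall` type, and the `ref_start`/`ref_end` inputs (the reference coordinates a sequence covers, assumed to come from a prior alignment against MN908947.3) are hypothetical, and treating "D" as the IUPAC ambiguity codes other than N is an assumption.

```python
from dataclasses import dataclass

# Thresholds taken from the Figure 1 legend.
MIN_COMPLETE_LENGTH = 29_000        # "longer than 29,000 bp"
CDS_START, CDS_END = 266, 29_674    # protein-coding span on MN908947.3 (1-based)
MAX_N = 15                          # at most 15 unknown bases
MAX_D = 50                          # at most 50 degenerate bases

# Assumption: "degenerate base" means any IUPAC ambiguity code except N.
DEGENERATE_BASES = set("RYSWKMBDHV")


@dataclass
class QualityCall:
    completeness: str  # "complete" or "partial"
    quality: str       # "high-quality" or "low-quality"
    n_count: int
    d_count: int


def classify_genome(seq: str, ref_start: int, ref_end: int) -> QualityCall:
    """Classify one assembled genome.

    ref_start and ref_end are hypothetical inputs: the 1-based reference
    coordinates covered by the sequence, taken from some earlier alignment.
    """
    seq = seq.upper()
    n_count = seq.count("N")
    d_count = sum(1 for base in seq if base in DEGENERATE_BASES)

    covers_cds = ref_start <= CDS_START and ref_end >= CDS_END
    is_complete = len(seq) > MIN_COMPLETE_LENGTH and covers_cds
    is_high_quality = n_count <= MAX_N and d_count <= MAX_D

    return QualityCall(
        completeness="complete" if is_complete else "partial",
        quality="high-quality" if is_high_quality else "low-quality",
        n_count=n_count,
        d_count=d_count,
    )
```

Under these assumptions, a full-length sequence with 16 Ns would still be called "complete" but "low-quality", since the two labels are assigned independently.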
Figure 2. Landscape of genomic variants
A. Number of mutations in different mutation types. The orange bar indicates the number of all mutations, and the blue bar indicates the number of mutations with PMF > 0.001. B. 3D structure display for nonsynonymous mutations in S protein (PDB: 6VSB, http://www.rcsb.org/structure/6VSB). The structure is shown in sphere (left panel) and stick (right panel). The three chains (A, B, and C) of S protein are displayed in blue, yellow, and green, respectively. The binding region (amino acid residues: 336–516) of the S protein with its receptor human ACE2 is shown in cyan for all three chains. C. Pie chart showing variant annotation for each gene of SARS-CoV-2. D. Distribution of PMF for all variants. Coordinate information for representative variants (including positions 241, 1059, 3037, 8782, 14408, and 23403) is provided. PMF, population mutation frequency; ACE2, angiotensin-converting enzyme 2.
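The legend does not spell out how PMF is calculated. A minimal sketch, assuming PMF is simply the fraction of analysed genomes that carry a given variant, is shown here; the function name and the input layout (isolate identifier mapped to a set of variant positions) are hypothetical.

```python
from collections import Counter


def population_mutation_frequency(isolate_variants):
    """Return {variant position: PMF} for a collection of isolates.

    isolate_variants maps an isolate identifier to the set of genome
    positions at which that isolate differs from the reference.
    Assumption: PMF = (isolates carrying the variant) / (all isolates).
    """
    total = len(isolate_variants)
    if total == 0:
        return {}
    carriers = Counter()
    for variants in isolate_variants.values():
        carriers.update(set(variants))  # count each isolate once per position
    return {position: count / total for position, count in carriers.items()}


# Toy input using positions named in the legend; the isolates are made up.
example = {
    "isolate_a": {241, 3037, 14408, 23403},
    "isolate_b": {23403},
    "isolate_c": {8782},
    "isolate_d": set(),
}
pmf = population_mutation_frequency(example)

# Keep only variants above the PMF > 0.001 cut-off used in panel A.
frequent = {pos: freq for pos, freq in pmf.items() if freq > 0.001}
```

Computing the same quantity per sampling date or per country, rather than over the whole collection, would give the kind of values plotted in the spatiotemporal figures.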
Figure 3. Spatiotemporal dynamics of genomic variants
A. Heatmap of variant PMF (PMF > 0.01) over sampling date. B. Distribution of PMF and cumulative growth curve of the sequence with mutation at position 23403 (D614G). C. Cumulative growth of the sequence with mutation at position 23403 (D614G) in top 10 countries. Data were downloaded from 2019nCoVR on July 14, 2020.
Figure 4. PMF of variant 23403 for each country across different sampling dates
The number of accumulated sequences as of July 14, 2020 is given in parentheses after each country name.
Figure 5. Haplotype network and clade identification and distribution
A. Snapshot of haplotype network dashboard, dynamically showing the development of haplotype (I) across countries (II) and over time (III). Each node in the network represents a haplotype and the node size is proportional to the number of viral genome sequences. The edge between any two nodes represents the genetic distance between two haplotypes. Number of newly-released genome sequences each day is dynamically displayed on the respective date. B. Schematic diagram of haplotype clades (C01–C09). C. Schematic diagram of three lineages and nine clades, and the common mutation sites for each clade. D. Percentage of sequences in clades C01–C09 across different continents. E. Sequence number distribution of different lineages (S, L, and G) and clades (C01–C09) throughout the globe and in three representative countries (United States, United Kingdom, and China).
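The node and edge semantics described in panel A can be illustrated with a toy data structure. This is only a sketch under stated assumptions: a haplotype is taken to be the set of variant positions shared by a group of genomes, and genetic distance is taken to be the number of positions at which two haplotypes differ. The function name is hypothetical, and the code computes all pairwise distances rather than reproducing whichever network construction method 2019nCoVR actually uses.

```python
from collections import Counter
from itertools import combinations


def build_haplotype_graph(isolate_variants):
    """Group isolates into haplotypes and measure pairwise distances.

    Returns (node_sizes, edge_distances):
      node_sizes     maps haplotype (frozenset of positions) -> genome count
      edge_distances maps frozenset({hap1, hap2}) -> number of differing sites
    """
    node_sizes = Counter(frozenset(v) for v in isolate_variants.values())
    edge_distances = {}
    for hap1, hap2 in combinations(node_sizes, 2):
        # Symmetric difference: sites mutated in exactly one of the two.
        edge_distances[frozenset({hap1, hap2})] = len(hap1 ^ hap2)
    return node_sizes, edge_distances


# Made-up isolates; the positions are ones mentioned in the figure legends.
isolates = {
    "iso1": {8782},
    "iso2": {8782},
    "iso3": {241, 3037, 14408, 23403},
    "iso4": set(),
}
sizes, distances = build_haplotype_graph(isolates)
```

In a drawing, `sizes` would control the node radius and `distances` the edge length or label; a real network would typically keep only a subset of edges, for example those of a minimum spanning tree.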
Signature mutations of haplotype clades
Note: AA, amino acid; NA, not applicable.