The Global Landscape of SARS-CoV-2 Genomes, Variants, and Haplotypes in 2019nCoVR.

Shuhui Song¹, Lina Ma², Dong Zou², Dongmei Tian³, Cuiping Li³, Junwei Zhu³, Meili Chen², Anke Wang³, Yingke Ma³, Mengwei Li¹, Xufei Teng¹, Ying Cui¹, Guangya Duan¹, Mochen Zhang¹, Tong Jin¹, Chengmin Shi⁴, Zhenglin Du², Yadong Zhang¹, Chuandong Liu⁴, Rujiao Li², Jingyao Zeng², Lili Hao², Shuai Jiang³, Hua Chen⁵, Dali Han⁶, Jingfa Xiao¹, Zhang Zhang⁷, Wenming Zhao⁸, Yongbiao Xue⁹, Yiming Bao¹⁰.

Abstract

On January 22, 2020, the China National Center for Bioinformation (CNCB) released the 2019 Novel Coronavirus Resource (2019nCoVR), an open-access information resource for the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). 2019nCoVR features a comprehensive integration of sequence and clinical information for all publicly available SARS-CoV-2 isolates, which are manually curated with value-added annotations and quality-evaluated by an automated in-house pipeline. Of particular note, 2019nCoVR offers systematic analyses to generate a dynamic landscape of SARS-CoV-2 genomic variations at a global scale. It provides all identified variants and their detailed statistics for each virus isolate, and aggregates the quality score, functional annotation, and population frequency for each variant. The spatiotemporal change of each variant can be visualized, and historical viral haplotype network maps for the course of the outbreak are also generated based on all complete and high-quality genomes available. Moreover, 2019nCoVR provides a full collection of SARS-CoV-2-relevant literature on the coronavirus disease 2019 (COVID-19), including published papers from PubMed as well as preprints from services such as bioRxiv and medRxiv through Europe PMC. Furthermore, by linking with relevant databases in CNCB, 2019nCoVR offers data submission services for raw sequence reads and assembled genomes, and data sharing with NCBI. Collectively, 2019nCoVR is updated daily to collect the latest information on genome sequences, variants, haplotypes, and literature, making it a valuable and timely resource for the global research community. 2019nCoVR is accessible at https://bigd.big.ac.cn/ncov/.

Keywords: 2019nCoVR; Database; Genomic variation; Haplotype; SARS-CoV-2

Year: 2020 PMID: 33704069 PMCID: PMC7836967 DOI: 10.1016/j.gpb.2020.09.001

Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691

Introduction

Coronavirus disease 2019 (COVID-19) is a severe respiratory disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [1]. It has spread rapidly as a pandemic since its outbreak in late December 2019. As of July 14, 2020, 12,964,809 confirmed cases had been reported in 216 countries/territories/areas (WHO Situation Report Number 176; https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports/). SARS-CoV-2 samples have been extensively isolated and sequenced by different laboratories across many countries [2], resulting in a considerable number of viral genome sequences worldwide. Public sharing of, and free access to, a comprehensive collection of SARS-CoV-2 genome sequences is therefore of great significance: it helps accelerate scientific research and knowledge discovery, and supports the development of medical countermeasures and sensible decision-making [3]. Unfortunately, SARS-CoV-2 genome sequences generated worldwide are scattered across different database resources, primarily the Global Initiative on Sharing All Influenza Data (GISAID) [4] repository and NCBI GenBank [5]. Many sequences are available in multiple repositories, but their updates are not synchronized. This makes it extremely challenging for users worldwide to retrieve a non-redundant and up-to-date set of sequence data, and to respond collaboratively and rapidly to this global pandemic. To this end, we constructed the 2019 Novel Coronavirus Resource (2019nCoVR, https://bigd.big.ac.cn/ncov/) [6]. Through comprehensive integration and value-added annotation and analysis, we provide public, free, and rapid access to a complete collection of non-redundant global SARS-CoV-2 genomes. Since its inception on January 22, 2020, 2019nCoVR has been updated on a daily basis, leading to an unprecedented data expansion from 86 genomes in its first release to 64,789 genomes in its current version (as of July 14, 2020). Moreover, it has been substantially upgraded with enhanced data curation and analysis pipelines and online functionalities. Specifically, we have enriched 2019nCoVR with data quality evaluation, variant calling, spatiotemporal dynamic tracking of variants, viral haplotype construction, and interactive visualization with more user-friendly web interfaces (Table 1). Here we report these significant updates of 2019nCoVR and present the global landscape of SARS-CoV-2 genomes, variants, and haplotypes.

Table 1

Comparison of functional modules between two versions of 2019nCoVR

Data collection and processing

Data collection and integration

All genome sequences as well as their related metadata were integrated from SARS-CoV-2 resources worldwide, including GISAID [4], NCBI [5], National Genomics Data Center (NGDC) [7], National Microbiology Data Center (NMDC) [8], and China National GeneBank (CNGB) [9]. To provide a non-redundant dataset, duplicated records from different databases were identified and merged.
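The merging step can be illustrated with a minimal Python sketch. The text does not state how duplicates are identified, so matching on a normalized strain name is purely an assumption here; the record fields (`strain`, `source`, `accession`) and the accession values are hypothetical placeholders, not real identifiers.

```python
def merge_duplicates(records):
    """Merge records that describe the same isolate, keeping every source ID."""
    merged = {}
    for rec in records:
        # Assumption: two records are duplicates if their strain names match
        # after trimming whitespace and ignoring case.
        key = rec["strain"].strip().lower()
        entry = merged.setdefault(
            key, {"strain": rec["strain"].strip(), "accessions": {}}
        )
        # Cross-reference the accession under the database it came from.
        entry["accessions"][rec["source"]] = rec["accession"]
    return list(merged.values())


# Toy records with placeholder strain names and accessions.
records = [
    {"strain": "Example/Isolate-1/2020", "source": "GISAID", "accession": "PLACEHOLDER_A"},
    {"strain": "example/isolate-1/2020 ", "source": "GenBank", "accession": "PLACEHOLDER_B"},
    {"strain": "Example/Isolate-2/2020", "source": "NGDC", "accession": "PLACEHOLDER_C"},
]
merged = merge_duplicates(records)
```

Keeping one entry per isolate with a per-database accession map is one simple way to preserve the cross-references while presenting a non-redundant dataset.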

Quality control and curation

To ensure the integrity of genome sequences, a sequence is defined as “complete” if it is longer than 29,000 bp and covers all protein-coding regions of SARS-CoV-2 (nt 266–29674 of GenBank: MN908947.3); otherwise, it is defined as “partial”. Furthermore, to examine the quality of genome sequences, unknown bases (Ns) and degenerate bases (Ds; positions with more than one possible base, sometimes referred to as “mixed bases”) were counted for each sequence. By our definition, a sequence is considered “high-quality” if it contains ≤ 15 Ns and ≤ 50 Ds, and “low-quality” otherwise. In addition, a sequence is explicitly flagged if its number of variants is ≥ 15, if its total number of deletions is ≥ 2, or if its sequence variations are densely clustered (the number of variants divided by the total number of bases in a window is ≥ 0.25).
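The completeness and quality rules above can be sketched in Python as follows. This is an illustration of the stated thresholds, not the in-house pipeline: the function name, the `QualityReport` type, and the `ref_start`/`ref_end` inputs (the reference coordinates covered by the sequence, which would come from an alignment step) are assumptions introduced here. Degenerate bases are taken to be the IUPAC ambiguity codes other than N.

```python
from dataclasses import dataclass

# IUPAC ambiguity codes other than N (assumed definition of "degenerate" bases).
DEGENERATE = set("RYSWKMBDHV")


@dataclass
class QualityReport:
    length: int
    n_count: int
    d_count: int
    complete: bool
    high_quality: bool


def assess(seq, ref_start, ref_end, cds_start=266, cds_end=29674):
    """Apply the completeness and quality thresholds to one genome sequence.

    ref_start/ref_end are hypothetical inputs: the first and last reference
    (MN908947.3) positions covered by the sequence after alignment.
    """
    s = seq.upper()
    n_count = s.count("N")
    d_count = sum(1 for base in s if base in DEGENERATE)
    # Complete: longer than 29,000 bp and spanning all protein-coding regions.
    complete = len(s) > 29000 and ref_start <= cds_start and ref_end >= cds_end
    # High-quality: at most 15 Ns and at most 50 degenerate bases.
    high_quality = n_count <= 15 and d_count <= 50
    return QualityReport(len(s), n_count, d_count, complete, high_quality)


# Synthetic sequences for illustration only.
good = assess("A" * 29500 + "N" * 10 + "R" * 3, ref_start=50, ref_end=29800)
noisy = assess("A" * 29500 + "N" * 16, ref_start=50, ref_end=29800)
short = assess("ACGT" * 100, ref_start=1000, ref_end=1399)
```

Here `good` is complete and high-quality, `noisy` is complete but low-quality because it exceeds 15 Ns, and `short` is partial.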

Variant identification and haplotype network construction

Only complete and high-quality genome sequences were used for downstream analyses, including sequence comparison, variant identification, functional annotation, and haplotype network construction. Genome sequences were aligned with MUSCLE (3.8.31) [10] against the earliest released SARS-CoV-2 genome (GenBank: MN908947.3). Sequence variations were identified directly using an in-house Perl program. The effects of variants were determined using the Ensembl Variant Effect Predictor (VEP) [11]. SARS-CoV-2 haplotypes were constructed from short pseudo sequences that consist of all variants (after filtering out variations located in UTRs). All pseudo sequences were then clustered into groups, with each group (a haplotype) representing a unique sequence pattern. The haplotype network was inferred from all identified haplotypes: the reference sequence haplotype was set as the starting node, and its relationship with other haplotypes was determined according to the inheritance of mutations. As a result, nine major haplotype network clades (denoted as C01–C09) were obtained according to the phylogenetic tree-and-branch structure and the shared landmark mutations (Table 2). Specifically, mutations with population mutation frequency (PMF) ≥ 0.05 (plus the ATG deletion at position 1605, PMF ≈ 0.03) were selected, and the co-occurring mutations were determined by linkage disequilibrium (LD) analysis. A clade refers to sequences sharing the co-occurring landmark mutations.
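As a rough illustration of the pseudo-sequence grouping described above, the following Python sketch clusters genomes by their set of non-UTR variants. It is not the authors' Perl implementation. The variant representation `(position, ref, alt)` and the function name are assumptions, and the non-UTR range is taken to be nt 266–29674, the coding span given in the quality-control section.

```python
from collections import defaultdict

# Assumed non-UTR range on MN908947.3 (the coding span quoted in the text).
CDS_START, CDS_END = 266, 29674


def build_haplotypes(variants_by_genome):
    """Group genomes that share an identical pattern of non-UTR variants.

    variants_by_genome maps a genome ID to an iterable of
    (position, ref, alt) tuples relative to the reference genome.
    """
    groups = defaultdict(list)
    for genome_id, variants in variants_by_genome.items():
        # The "pseudo sequence": the sorted tuple of variants outside the UTRs.
        pseudo = tuple(sorted(
            v for v in set(variants) if CDS_START <= v[0] <= CDS_END
        ))
        groups[pseudo].append(genome_id)
    return dict(groups)


# Toy input; the positions are taken from the text, the genomes are invented.
haplotypes = build_haplotypes({
    "g1": [(241, "C", "U"), (23403, "A", "G")],
    "g2": [(23403, "A", "G")],
    "g3": [(241, "C", "U")],
})
```

In this toy input, `g1` and `g2` fall into the same haplotype because the variant at position 241 lies in the 5′UTR and is filtered out, while `g3` collapses onto the reference haplotype (the empty pattern).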

Implementation

2019nCoVR was built on a browser/server (B/S) architecture. Web interfaces were developed using Java Server Pages (JSP), HTML, Cascading Style Sheets (CSS), Asynchronous JavaScript and XML (AJAX), jQuery (a cross-platform and feature-rich JavaScript library; http://jquery.com), and Semantic-UI (an open source web development framework; https://semantic-ui.com). The database server was implemented using Spring Boot (a rapid application development framework based on Spring; https://spring.io). MySQL (https://mysql.com) was used for data storage. For interactive visualization, we used HighCharts (a modern SVG-based multi-platform charting library; https://highcharts.com), D3.js (a JavaScript library for manipulating documents based on data; https://d3js.org), and 3Dmol.js (a JavaScript library for visualizing protein structure associated with mutated amino acid residues) [12]. The haplotype network was visualized using D3.js, Leaflet (http://leafletjs.com), and Echarts (http://echarts.baidu.com/).

Database content and features

Statistics of SARS-CoV-2 genome assemblies

Since the outbreak of COVID-19, the number of SARS-CoV-2 genome sequences released globally has been increasing at an unprecedented rate. To facilitate free public access to all genome assemblies and help researchers worldwide better understand the variation and transmission of SARS-CoV-2, we update 2019nCoVR daily by integrating all available genomes from around the world and conducting value-added curation and analysis. As of July 14, 2020, 2019nCoVR hosted a total of 64,789 non-redundant genome sequences and provided their global distribution across 97 countries/regions on 6 continents. Duplicated sequences from different databases are merged, with all IDs cross-referenced. Sequences are contributed primarily by the United Kingdom (28,823, 44.5%), the United States (13,556, 20.9%), Australia (2351, 3.6%), Spain (1852, 2.9%), the Netherlands (1605, 2.5%), India (1581, 2.4%), and China (1431, 2.2%). According to our statistics, the number of SARS-CoV-2 genome sequences started to grow rapidly from mid-March (https://bigd.big.ac.cn/ncov/release_genome), concordant with the onset of the global COVID-19 pandemic. A full list of our sequence datasets, including strain name, accession number, and source, is provided in Table S1. To provide the high-quality genome sequences that are essential for downstream analyses (ranging from variant calling to haplotype construction), we perform sequence integrity and quality assessment for all newly collected sequences. Among all the human-derived genome sequences released (64,700), 60,970 (94%) are complete, and 31,689 (49%) are high-quality (Figure 1A). Most of the low-quality sequences (29,281, 99.7%) contain different numbers of unknown bases (Ns). Among these sequences, 60% have 16–500 Ns (median 258), and 40% have more than 500 Ns (Figure 1B). Further investigation of the genomic locations reveals that some genomic regions have a high frequency of Ns (Figure 1C). Sequence integrity and quality assessment data are available for all genome sequences and can be used as filters when browsing and searching sequences.

Figure 1

Statistics and distribution of all released SARS-CoV-2 genomes in 2019nCoVR as of July 14, 2020

A. Number and percentage of complete and high-quality genomes. B. Distribution of sequence number across different ranges of Ns for low-quality genomes. C. Frequency distribution of Ns across the whole genome. A sequence is defined as “complete” if it is longer than 29,000 bp and covers all protein-coding regions of SARS-CoV-2 (nt 266–29674 of GenBank: MN908947.3); otherwise, it is defined as “partial”. A sequence is considered “high-quality” if it contains ≤ 15 Ns and ≤ 50 Ds, and “low-quality” otherwise. N, unknown base; D, degenerate base.

Landscape of genomic variants

Based on the 31,689 human-derived high-quality complete genome sequences obtained globally (only high-quality complete genome sequences are used for downstream analysis unless indicated otherwise), we investigate the landscape of SARS-CoV-2 genomic variants in comparison with the reference genome (GenBank: MN908947.3) (Figure 2). By July 14, 2020, a total of 13,428 variants had been identified, including 12,828 (95.5%) single-nucleotide polymorphisms (SNPs), 437 deletions, 116 insertions, and 47 indels (a combination of an insertion and a deletion, affecting 2 or more nucleotides) (Figure 2A). More than half of these SNPs (6770, 50.4%) are nonsynonymous, causing amino acid changes. To evaluate the impact of missense variants in the S protein on its interaction with the receptor, human angiotensin-converting enzyme 2 (ACE2) (e.g., in the key binding region), mutated amino acids are projected onto protein 3D structures, which can be rotated through 360 degrees for viewing (Figure 2B). We further explore the distribution of variants across different genes. Notably, three genes, ORF1ab, S, and N, accumulate more variants than the others (Figure 2C). In addition, SNP densities (i.e., the number of mutations per nucleotide in the genic region) are higher in several genic regions, including ORF7a, ORF3a, ORF6, and N (https://bigd.big.ac.cn/ncov/variation/annotation).

Figure 2

Landscape of genomic variants

A. Number of mutations in different mutation types. The orange bar indicates the number of all mutations, and the blue bar indicates the number of mutations with PMF > 0.001. B. 3D structure display for nonsynonymous mutations in S protein (PDB: 6VSB, http://www.rcsb.org/structure/6VSB). The structure is shown in sphere (left panel) and stick (right panel). The three chains (A, B, and C) of S protein are displayed in blue, yellow, and green, respectively. The binding region (amino acid residues: 336–516) of the S protein with its receptor human ACE2 is shown in cyan for all three chains. C. Pie chart showing variant annotation for each gene of SARS-CoV-2. D. Distribution of PMF for all variants. Coordinate information for representative variants (including positions 241, 1059, 3037, 8782, 14408, and 23403) is provided. PMF, population mutation frequency; ACE2, angiotensin-converting enzyme 2.

For each variant, we investigate its PMF (the ratio of the number of mutated genomes to the total number of complete high-quality genomes) (Figure 2D). There are 62 variants with PMF > 0.01 and 18 variants with PMF > 0.05. In particular, there are 4 variants with PMF > 0.75: position 241 in the 5′UTR, positions 3037 and 14408 in ORF1ab, and position 23403 in S. These may represent the main prevalent virus genotypes across the globe. All identified variants and their functional annotations are publicly available in the database. In addition, an online pipeline for variant identification and functional annotation is freely accessible at https://bigd.big.ac.cn/ncov/analysis [13].
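The PMF definition above (number of mutated genomes divided by the total number of complete high-quality genomes) can be written as a short Python sketch. The function name and the `(position, ref, alt)` variant tuples are illustrative assumptions, and the four toy genomes are invented.

```python
from collections import Counter


def population_mutation_frequency(variants_by_genome):
    """Return {variant: PMF} over a set of complete high-quality genomes."""
    total = len(variants_by_genome)
    if total == 0:
        return {}
    counts = Counter()
    for variants in variants_by_genome.values():
        # Count each variant at most once per genome.
        counts.update(set(variants))
    return {variant: n / total for variant, n in counts.items()}


# Toy data: four genomes, one of them identical to the reference.
genomes = {
    "g1": [(241, "C", "U"), (23403, "A", "G")],
    "g2": [(23403, "A", "G")],
    "g3": [(8782, "C", "U")],
    "g4": [],
}
pmf = population_mutation_frequency(genomes)
# Variants above an arbitrary illustrative threshold.
common = sorted(v for v, f in pmf.items() if f > 0.25)
```

With these toy genomes, the variant at position 23403 has a PMF of 0.5 and is the only one above the 0.25 threshold; the thresholds used in the text (0.01, 0.05, 0.75) would be applied in the same way.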

Spatiotemporal dynamics of genomic variants

To track the dynamics of SARS-CoV-2 genomic variants, particularly de novo mutations, we explore the spatiotemporal change of PMF for each variant according to sampling dates and locations (Figure 3). Among the 18 sites with PMF > 0.05, a few mutations occurred simultaneously in multiple sequences and in a linked manner (Figure 3A), such as the mutations at positions 8782 and 28144 reported previously [14]. Notably, mutations at these two sites appeared in the early stage of the outbreak, on December 30, 2019. Their mutation frequencies reached ~ 0.33 around January 22, 2020, and then gradually declined to 0.10 by July 14, 2020. In contrast, some variants appeared only at the middle stage, around March 3, 2020. For instance, the mutation at position 23403 (resulting in the amino acid change D614G in the S protein) is accompanied by three other mutations: a C-to-U mutation at position 241 in the 5′UTR of the SARS-CoV-2 genome, a silent C-to-U mutation at position 3037 in the gene nsp3, and a missense C-to-U mutation at position 14408 in the gene RdRp (P4715L). To make it easier for users to investigate any variant of interest, we provide an interactive heatmap in 2019nCoVR (https://bigd.big.ac.cn/ncov/variation/heatmap) that dynamically displays and clusters the mutation patterns over all sampling dates, with options that allow users to filter by variant frequency, annotated gene/region, variant effect type, and transcription regulatory sequence (TRS).
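Tracking a variant's PMF over sampling dates can be sketched as below. The sketch assumes the PMF at a given date is cumulative, i.e., computed over all genomes sampled up to and including that date; the text does not spell this out, so treat it as an assumption. The function name and the toy samples are hypothetical.

```python
from datetime import date


def cumulative_pmf(samples, position):
    """Cumulative PMF of one variant position over sampling dates.

    samples is an iterable of (sampling_date, set_of_mutated_positions).
    Returns a list of (date, PMF over all genomes sampled up to that date).
    """
    per_day = {}
    for day, positions in samples:
        total, mutated = per_day.get(day, (0, 0))
        per_day[day] = (total + 1, mutated + (position in positions))

    series = []
    total = mutated = 0
    for day in sorted(per_day):
        day_total, day_mutated = per_day[day]
        total += day_total
        mutated += day_mutated
        series.append((day, mutated / total))
    return series


# Invented samples; only the variant positions come from the text.
samples = [
    (date(2020, 1, 5), {8782, 28144}),
    (date(2020, 1, 5), set()),
    (date(2020, 3, 3), {241, 3037, 14408, 23403}),
    (date(2020, 3, 10), {241, 3037, 14408, 23403}),
]
series = cumulative_pmf(samples, 23403)
```

Running the same function once per variant and per country/region would give the rows of a heatmap like the one described above.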

Figure 3

Spatiotemporal dynamics of genomic variants

A. Heatmap of variant PMF (PMF > 0.01) over sampling date. B. Distribution of PMF and cumulative growth curve of the sequence with mutation at position 23403 (D614G). C. Cumulative growth of the sequence with mutation at position 23403 (D614G) in top 10 countries. Data were downloaded from 2019nCoVR on July 14, 2020.

Moreover, we investigate dynamic patterns of SARS-CoV-2 genomic variants across different sampling locations over time. Taking the variant at position 23403 (D614G) as an example, its PMF increased dramatically from 0 at the end of February to 0.76 in the middle of July, and the mutant form G614 gradually became dominant as the pandemic developed (Figure 3B), presumably indicating that the mutated genotype may have higher transmissibility [15]. In terms of the absolute number of mutations across different countries/regions, the G614 form was predominantly reported in Europe and North America (Figure 3C; https://bigd.big.ac.cn/ncov/variation/annotation/variant/23403). When investigating the mutation pattern for each country (Figure 4), we find that sequences from some Asian countries (such as South Korea, Malaysia, and Nepal) include no or very few instances of the G614 mutant form, whereas in countries from Europe and America (e.g., Argentina, Czech Republic, and Serbia) the G614 form is dominant among the available samples. In some countries, such as Australia, Belgium, Canada, Chile, France, Israel, the United States, and the United Kingdom, both the D614 and G614 forms co-existed early in the epidemic, but the mutant form quickly became dominant. The accumulation of this mutation varies in different parts of the world, possibly due to the prevention and control measures implemented in some countries/regions. Taken together, 2019nCoVR features spatiotemporal dynamics tracking of SARS-CoV-2 genomic variants, and thus has great potential to help decipher viral transmission and adaptation to the host.

Figure 4

PMF of the variant at position 23403 for each country across different sampling dates

The number of accumulated sequences as of July 14, 2020 is provided in parentheses after each country name.

Haplotype network construction and characterization

To better characterize the diversity of virus sequences, we built SARS-CoV-2 haplotypes based on all identified variants in non-UTR regions. As a result, 17,624 haplotypes were identified from 31,689 complete high-quality genome sequences as of July 14, 2020. We then constructed a haplotype network for SARS-CoV-2 (Figure 5), a graphical representation of the relationships between individual genotypes inferred from genomic variations. The haplotype network is built on the principle of finding the shortest set of connections that link all nodes (genotypes), where the length of each connection represents the genetic distance [16]. The SARS-CoV-2 haplotype network can be visualized according to sample collection date and across different countries/regions, thus providing an overview of pandemic transmission in a spatiotemporal manner. It not only allows users to easily obtain a landscape of SARS-CoV-2 haplotypes and their relationships, but also helps users navigate the set of haplotypes for a specific country/region. In addition to the haplotype network, associated information can also be accessed, such as the number of genomes, sampling time, and sampling location (Figure 5A).
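The "shortest set of connections that link all nodes" principle can be illustrated with a minimum spanning tree over haplotypes. The Python sketch below uses Prim's algorithm with the number of differing variants as the genetic distance. It is a simplification chosen for illustration, not the network method of reference [16] or the 2019nCoVR implementation; the function name is hypothetical, and each haplotype is represented simply as a set of mutated positions.

```python
def minimum_spanning_edges(haplotypes, root=0):
    """Prim's algorithm over haplotypes given as frozensets of variant positions.

    The distance between two haplotypes is the number of variants present in
    one but not the other. Returns a list of (parent_index, child_index,
    distance) edges, grown outward from the root (the reference haplotype).
    """
    n = len(haplotypes)
    if n == 0:
        return []
    in_tree = {root}
    edges = []
    while len(in_tree) < n:
        best = None
        for i in sorted(in_tree):
            for j in range(n):
                if j in in_tree:
                    continue
                dist = len(haplotypes[i] ^ haplotypes[j])
                if best is None or dist < best[2]:
                    best = (i, j, dist)
        edges.append(best)
        in_tree.add(best[1])
    return edges


# Toy haplotypes built from positions mentioned in the text.
haps = [
    frozenset(),                                   # reference haplotype
    frozenset({8782, 28144}),
    frozenset({241, 3037, 14408, 23403}),
    frozenset({241, 1059, 3037, 14408, 23403}),
]
edges = minimum_spanning_edges(haps)
```

In this toy example the last haplotype attaches to the third one (distance 1) rather than directly to the reference (distance 5), which is the kind of parent-child relationship a haplotype network is meant to expose. This quadratic-per-step search is only suitable for small inputs.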

Figure 5

Haplotype network and clade identification and distribution

A. Snapshot of haplotype network dashboard, dynamically showing the development of haplotype (I) across countries (II) and over time (III). Each node in the network represents a haplotype and the node size is proportional to the number of viral genome sequences. The edge between any two nodes represents the genetic distance between two haplotypes. Number of newly-released genome sequences each day is dynamically displayed on the respective date. B. Schematic diagram of haplotype clades (C01–C09). C. Schematic diagram of three lineages and nine clades, and the common mutation sites for each clade. D. Percentage of sequences in clades C01–C09 across different continents. E. Sequence number distribution of different lineages (S, L, and G) and clades (C01–C09) throughout the globe and in three representative countries (United States, United Kingdom, and China).

According to the haplotype network, we classified all genome sequences into nine major clades (labeled as C01–C09; see Methods for details) (Figure 5B and C; Table 2). As the pandemic spread of SARS-CoV-2 is still ongoing, new branches that evolve and spread faster are constantly emerging, such as clades C04, C06, C08, and C09 (Table 2). The dominant clades are C06 (8681, 27.4%), C08 (7889, 24.9%), and C09 (6940, 21.9%) (Figure 5D), which are characterized by the signature C-to-U mutations at positions 241, 3037, and 14408, and the A-to-G mutation at position 23403. These clades are defined as the G lineage (as the mutation at position 23403 leads to the amino acid change D614G in the S protein). G lineage sequences have been reported in 82 countries across the globe, and the G lineage has become the main epidemic virus type in most countries in Europe, North America, South America, Africa, and West Asia. For example, there are 6827 (71.5%), 8305 (83.4%), and 970 (18.5%) sequences from the G lineage reported in the United States, United Kingdom, and China, respectively (Figure 5E). The wide spread and prevalence of the G lineage in different countries suggest the adaptability of this lineage to humans [15].
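The G lineage definition above lends itself to a small lookup sketch in Python. The four signature mutations and the G clades (C06, C08, C09) come from this section; the S clades (C02 and C04) and the rule that the remaining clades belong to L follow the grouping given in the Discussion. The function names and the variant tuple format are assumptions made for illustration.

```python
# Signature mutations of the G lineage as listed in the text.
G_SIGNATURE = {
    (241, "C", "U"),
    (3037, "C", "U"),
    (14408, "C", "U"),
    (23403, "A", "G"),
}

VALID_CLADES = {f"C{i:02d}" for i in range(1, 10)}  # C01 to C09
CLADE_TO_LINEAGE = {"C02": "S", "C04": "S", "C06": "G", "C08": "G", "C09": "G"}


def lineage_of_clade(clade):
    """Map a clade label (C01 to C09) to its lineage: S, G, or L."""
    if clade not in VALID_CLADES:
        raise ValueError(f"unknown clade: {clade}")
    # Clades not listed under S or G are grouped into the L lineage.
    return CLADE_TO_LINEAGE.get(clade, "L")


def carries_g_signature(variants):
    """True if a genome carries all four G lineage signature mutations."""
    return G_SIGNATURE <= set(variants)


example = [(241, "C", "U"), (3037, "C", "U"), (14408, "C", "U"),
           (23403, "A", "G"), (1059, "C", "U")]
is_g = carries_g_signature(example)
```

The alternative allele for position 1059 in `example` is a made-up value for the demonstration; only the position itself appears in the text.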

Table 2

Signature mutations of haplotype clades

Note: AA, amino acid; NA, not applicable.

Discussion

Genome sequencing is vital for understanding the epidemiology of SARS-CoV-2: it is not only useful for deciphering the genomic composition of the virus and investigating its evolution and transmission, but also highly effective for determining whether individuals belong to the same transmission chain [17]. According to 2019nCoVR, the ratio of the number of sequenced samples to the number of confirmed cases is very low in some countries/regions (Figure S1), and genome sequences are unavailable altogether in some affected countries/regions. SARS-CoV-2 sampling bias and limited sequencing depth may lead to inaccurate transmission patterns and phylogenetic relationships [18]. For example, sequencing all infected cases in a single region revealed that transmission of Clostridium difficile from symptomatic patients accounted for only one third of all infected cases [19]. Given that our current understanding of SARS-CoV-2 is still limited, we call for more efforts and collaborations in sequencing more SARS-CoV-2 genomes from both symptomatic and asymptomatic cases.

The SARS-CoV-2 genome sequences currently released were generated by many different laboratories on different sequencing platforms. This raises concerns about the quality of genome sequences, such as the number of Ns in a genome, which may affect variant calling and lead to biased population frequency estimation. As mentioned above, the frequency of Ns in some genomic regions is high, possibly due to low sequencing coverage, low sequence complexity, low efficiency of the PCR primers used in sequencing library construction, or the presence of RNA secondary structure. However, sequencing coverage information is largely unavailable, making it challenging to evaluate whether the Ns are due to low sequencing coverage. We further investigated the genomic regions with a high frequency of Ns and had the following findings. (1) GC and AG contents of these regions are close to the average GC and AG contents of the whole genome, excluding the possibility of low sequence complexity. (2) The length of these regions ranges from 210 bp to 320 bp (similar to the length of a PCR product), and more than 60% of the related sequences were generated on the Illumina platform (based on PCR amplification), suggesting that these Ns may result from low efficiency of PCR primers during sequencing library construction. (3) The minimum free energy of the RNA secondary structure of these regions is lower than that of randomly extracted regions, indicating that the secondary structure of these regions is more stable and may affect the determination of genome sequences (Figure S2). In the future, we plan to construct a gold-standard benchmark dataset for quality assessment and data filtration.

Compared to the early, overly simplified L-S classification [14] and the comprehensive lineages defined by Rambaut et al. [20], our classification scheme with nine clades provides a system of intermediate granularity that can be correlated with both of these classifications (Table 2). The nine clades can also be grouped into three lineages defined previously [14], namely, S (C02 and C04), G (C06, C08, and C09), and L (the remaining clades). Although a haplotype network cannot provide a precise evolutionary position as phylogenetic trees do, it can be used to quickly inform the clustering of viruses according to the signature mutations in each haplotype. New clades will undoubtedly be introduced as the virus continues to evolve.

A data-driven response to SARS-CoV-2 requires a public, free, and open-access data resource that contains complete high-quality genome sequence data and is equipped with automated online pipelines to rapidly analyze genome sequences. To this end, 2019nCoVR (together with other resources in CNCB) provides a wide range of data services, including raw sequencing data archiving, genome sequence and meta information management with quality control and curation, variation analysis, and data presentation and visualization. Additionally, compared to GISAID and NCBI Virus, 2019nCoVR features spatiotemporal dynamic tracking for all identified variants. This makes it easier for users worldwide to monitor any variant that may be associated with rapid transmission and high virulence. To better understand the epidemiology of SARS-CoV-2, future efforts are needed to collect ever more genome sequences worldwide, to include other types of omics data (such as transcriptome and epitranscriptome, if available) [21], and to provide more user-friendly interfaces and online tools in support of research activities worldwide.

Data availability

SARS-CoV-2 genomes, variants (in VCF format), and their annotations are publicly available at https://bigd.big.ac.cn/ncov/.

CRediT author statement

Shuhui Song: Conceptualization, Methodology, Writing - original draft. Lina Ma: Data curation, Methodology, Writing - original draft. Dong Zou: Resources, Visualization, Writing - original draft. Dongmei Tian: Methodology. Cuiping Li: Methodology. Junwei Zhu: Software. Meili Chen: Data curation. Anke Wang: Software. Yingke Ma: Resources. Mengwei Li: Methodology. Xufei Teng: Visualization. Ying Cui: Data curation. Guangya Duan: Data curation. Mochen Zhang: Data curation. Tong Jin: Data curation. Chengmin Shi: Methodology. Zhenglin Du: Methodology. Yadong Zhang: Methodology. Chuandong Liu: Methodology. Rujiao Li: Data curation. Jingyao Zeng: Data curation. Lili Hao: Data curation. Shuai Jiang: Methodology. Hua Chen: Supervision. Dali Han: Supervision. Jingfa Xiao: Supervision, Methodology. Zhang Zhang: Conceptualization, Supervision, Writing - review & editing. Wenming Zhao: Conceptualization, Supervision, Methodology. Yongbiao Xue: Conceptualization, Supervision. Yiming Bao: Conceptualization, Supervision, Writing - review & editing. All authors read and approved the final manuscript.

Competing interests

The authors have declared no competing interests.

20 in total

1. Median-joining networks for inferring intraspecific phylogenies.

Authors: H J Bandelt; P Forster; A Röhl
Journal: Mol Biol Evol Date: 1999-01 Impact factor: 16.240

2. Increased interactivity and improvements to the GigaScience database, GigaDB.

Authors: Si Zhe Xiao; Chris Armit; Scott Edmunds; Laurie Goodman; Peter Li; Mary Ann Tuli; Christopher Ian Hunter
Journal: Database (Oxford) Date: 2019-01-01 Impact factor: 3.451

3. Database Resources of the National Genomics Data Center in 2020.

Authors:
Journal: Nucleic Acids Res Date: 2020-01-08 Impact factor: 16.971

4. The application of genomics to tracing bacterial pathogen transmission (review).

Authors: Nicholas J Croucher; Xavier Didelot
Journal: Curr Opin Microbiol Date: 2014-11-22 Impact factor: 7.934

5. 3Dmol.js: molecular visualization with WebGL.

Authors: Nicholas Rego; David Koes
Journal: Bioinformatics Date: 2014-12-12 Impact factor: 6.937

6. GISAID: Global initiative on sharing all influenza data - from vision to reality.

Authors: Yuelong Shu; John McCauley
Journal: Euro Surveill Date: 2017-03-30

7. Sampling bias and incorrect rooting make phylogenetic network tracing of SARS-COV-2 infections unreliable.

Authors: Carla Mavian; Sergei Kosakovsky Pond; Simone Marini; Brittany Rife Magalis; Anne-Mieke Vandamme; Simon Dellicour; Samuel V Scarpino; Charlotte Houldcroft; Julian Villabona-Arenas; Taylor K Paisie; Nídia S Trovão; Christina Boucher; Yun Zhang; Richard H Scheuermann; Olivier Gascuel; Tommy Tsan-Yuk Lam; Marc A Suchard; Ana Abecasis; Eduan Wilkinson; Tulio de Oliveira; Ana I Bento; Heiko A Schmidt; Darren Martin; James Hadfield; Nuno Faria; Nathan D Grubaugh; Richard A Neher; Guy Baele; Philippe Lemey; Tanja Stadler; Jan Albert; Keith A Crandall; Thomas Leitner; Alexandros Stamatakis; Mattia Prosperi; Marco Salemi
Journal: Proc Natl Acad Sci U S A Date: 2020-05-07 Impact factor: 11.205

8. The species Severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2.

Authors:
Journal: Nat Microbiol Date: 2020-03-02 Impact factor: 17.745

9. gcMeta: a Global Catalogue of Metagenomics platform to support the archiving, standardization and analysis of microbiome data.

Authors: Wenyu Shi; Heyuan Qi; Qinglan Sun; Guomei Fan; Shuangjiang Liu; Jun Wang; Baoli Zhu; Hongwei Liu; Fangqing Zhao; Xiaochen Wang; Xiaoxuan Hu; Wei Li; Jia Liu; Ye Tian; Linhuan Wu; Juncai Ma
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971

10. Tracking Changes in SARS-CoV-2 Spike: Evidence that D614G Increases Infectivity of the COVID-19 Virus.

Authors: Bette Korber; Will M Fischer; Sandrasegaram Gnanakaran; Hyejin Yoon; James Theiler; Werner Abfalterer; Nick Hengartner; Elena E Giorgi; Tanmoy Bhattacharya; Brian Foley; Kathryn M Hastie; Matthew D Parker; David G Partridge; Cariad M Evans; Timothy M Freeman; Thushan I de Silva; Charlene McDanal; Lautaro G Perez; Haili Tang; Alex Moon-Walker; Sean P Whelan; Celia C LaBranche; Erica O Saphire; David C Montefiori
Journal: Cell Date: 2020-07-03 Impact factor: 66.850
