Yiwei Wang1, Jiaxin Yang2, Xinhao Zhuang2, Yunchao Ling2, Ruifang Cao2, Qingwei Xu3, Peng Wang4,5, Ping Xu6, Guoqing Zhang7.
Abstract
The outbreak of Coronavirus Disease 2019 (COVID-19) at the end of 2019 turned into a global pandemic. To help analyze the spread and evolution of the virus, we collated and analyzed data from GISAID on the viral genome, sequence variations, and their temporal and spatial distribution. Information from the Wikipedia web page and published research papers was categorized and mined to extract epidemiological data, which was then integrated with the public dataset. Genomic and epidemiological data were matched with public information, and the data quality was verified by manual curation. Finally, an online database centered on virus genomic information and epidemiological data is freely accessible at https://www.biosino.org/kgcov/ ; it can help identify relevant knowledge and devise epidemic prevention and control policies in collaboration with disease control personnel.
Year: 2022 PMID: 35354824 PMCID: PMC8967863 DOI: 10.1038/s41597-022-01237-1
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Data sources.
| Data source | Data type | Data Format | Recorded version | Provider | URL |
|---|---|---|---|---|---|
| Wikipedia | Epidemiology | text | 19 Apr. 2020 | — | |
| Xu | Epidemiology | csv | 16 Nov. 2020 | — | |
| PubMed SARS-CoV-2 Literature | Epidemiology | text | Jul. 2020 | National Institutes of Health | |
| GISAID | Genomic | fasta | Nov. 2020 | Max Planck Institute for Informatics | |
| UniProt | Viral domain information | csv | Nov. 2020 | UniProt | |
Fig. 1 Distribution of data rank and date. (a) The ratio distribution of unique entries in the top 10 ranks of case or genome data. The ratio of unique case or genome entries to all entries was calculated for three kinds of combination of 4 indicators. (b) The temporal distribution of the number of cases or genomes. Blue represents the cases and orange represents the genomes. Note that because cases come from multiple sources, the number of cases shown does not represent the actual number of reported cases; it is higher than the actual number.
Fig. 2 Manual curation flow chart.
Fig. 3 Top 1–15 rank Code matching accuracy.
Fig. 4 Variation in epidemiological cases. The green nodes in the figure represent the variation information, the red nodes represent the genomic information, the orange nodes represent the SARS-CoV-2 protein information, and the blue nodes represent the case information.
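The unique-entry ratio mentioned in the caption of Fig. 1(a) could be computed along the following lines. This is only a minimal sketch and not the authors' implementation: the record does not name the 4 indicators or the three combinations, so the field names (`country`, `date`, `age`, `gender`) and the sample records are hypothetical, and "unique" is assumed to mean that an entry's combination of indicator values occurs exactly once in the dataset.

```python
from collections import Counter


def unique_ratio(records, indicators):
    """Return the share of records whose indicator values occur exactly once.

    `records` is a list of dicts; `indicators` is the list of field names
    that form one combination. Both are illustrative assumptions.
    """
    keys = [tuple(record[name] for name in indicators) for record in records]
    if not keys:
        return 0.0
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(keys)


# Hypothetical sample entries; the field names are not taken from the paper.
records = [
    {"country": "China", "date": "2020-01-20", "age": 35, "gender": "F"},
    {"country": "China", "date": "2020-01-20", "age": 35, "gender": "M"},
    {"country": "Italy", "date": "2020-02-21", "age": 60, "gender": "M"},
    {"country": "Italy", "date": "2020-02-21", "age": 60, "gender": "M"},
]

# Three example combinations of the 4 assumed indicators.
combinations_to_try = [
    ["country", "date"],
    ["country", "date", "age"],
    ["country", "date", "age", "gender"],
]

ratios = {tuple(c): unique_ratio(records, c) for c in combinations_to_try}
```

Under these assumptions, a combination with a higher ratio distinguishes more entries from one another, which is the property that matters when matching case records to genome records.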
| Measurement(s) | Viral Epidemiology • genetic sequence variation analysis |
| Technology Type(s) | digital curation • Bioinformatics |
| Sample Characteristic - Organism | Severe acute respiratory syndrome coronavirus 2 |