Xu Zuo1, Yong Chen2, Lucila Ohno-Machado3, Hua Xu1.
Abstract
OBJECTIVE: This study reviews novel coronavirus disease (COVID-19) datasets extracted from PubMed Central articles and provides a quantitative analysis to answer questions about dataset contents, accessibility and citations.
Keywords: COVID-19; data sharing; review
Year: 2021 PMID: 33757278 PMCID: PMC7799277 DOI: 10.1093/bib/bbaa331
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Figure 1. The workflow of screening and collecting publications and datasets from 18 332 PMC articles.
Descriptions and examples of metadata variables collected for each dataset
| Question | Variable | Description | Examples |
|---|---|---|---|
| Content | Data type | Type of dataset | Epidemiological, clinical, etc. |
| Content | Geographic region | The region where the data were collected | Worldwide, China, United States, etc. |
| Accessibility | Download | Whether users can download the dataset | Immediately downloadable versus Request needed |
| Accessibility | Data format | Format of the dataset | CSV, XLSX, etc. |
| Accessibility | Data hosting | Data repository where the dataset was hosted | GitHub, Mendeley, etc. |
| Accessibility | Data update frequency | Whether the dataset was being updated regularly and the last date of update | Regularly updated versus One time only |
| Accessibility | License | The type of license used | CC BY 4.0, MIT, etc. |
| Accessibility | Metadata availability | Whether the metadata are provided and the metadata format | Machine readable, unstructured or unavailable |
| Citation | Dataset article | Whether there was a PMC paper that described the dataset | The JHU dataset was described in the PMID 32087114 article |
| Citation | Citation count | The number of times that the dataset was cited by PMC articles (either via URL links or via data articles) | The JHU dataset was cited by 454 PMC articles |
JHU, Johns Hopkins University; CSV, comma separated values; XLSX, Microsoft Excel Open XML Spreadsheet; CC BY 4.0, Creative Commons Attribution 4.0; MIT, The MIT License.
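As a rough illustration of how the metadata variables in the table above could be held in code, the following is a minimal sketch of one record per dataset. The class name `DatasetRecord` and all field names are hypothetical and chosen only to mirror the table's variables; the example instance fills in only the two values the table itself states for the JHU dataset and leaves everything else unset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetRecord:
    """One row of collected metadata (hypothetical field names)."""
    name: str
    # Content
    data_type: Optional[str] = None              # e.g. "Epidemiological", "Clinical"
    geographic_region: Optional[str] = None      # e.g. "Worldwide", "China"
    # Accessibility
    download: Optional[str] = None               # "Immediately downloadable" or "Request needed"
    data_format: Optional[str] = None            # e.g. "CSV", "XLSX"
    data_hosting: Optional[str] = None           # e.g. "GitHub", "Mendeley"
    update_frequency: Optional[str] = None       # "Regularly updated" or "One time only"
    license: Optional[str] = None                # e.g. "CC BY 4.0", "MIT"
    metadata_availability: Optional[str] = None  # "Machine readable", "Unstructured", "Unavailable"
    # Citation
    dataset_article_pmid: Optional[str] = None   # PMID of the paper describing the dataset
    citation_count: Optional[int] = None         # number of citing PMC articles


# Only the values given as examples in the table are filled in here.
jhu = DatasetRecord(
    name="JHU dataset",
    dataset_article_pmid="32087114",
    citation_count=454,
)
print(jhu.citation_count)
```

Keeping unknown variables as `None` rather than guessing a value makes missing metadata explicit, which matters for a review that also counts unavailable licenses and metadata.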
The distribution of data types among 128 COVID-19 datasets
| Data type | Number | Percent |
|---|---|---|
| Epidemiology | 69 | 53.9% |
| Genomics | 19 | 14.8% |
| Clinical | 15 | 11.7% |
| Imaging | 3 | 2.3% |
| Mobility | 7 | 5.5% |
| Social science | 4 | 3.1% |
| Healthcare administration | 2 | 1.6% |
| Literature | 2 | 1.6% |
| Other | 10 | 7.8% |
Note: Imaging is considered as a subset of clinical datasets.
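The percentages in this table can be reproduced from the counts, and the note about imaging explains why the rows appear to exceed 128. A small sketch, assuming each percentage is the count divided by 128 and rounded to one decimal place (the variable and function names are illustrative only):

```python
TOTAL_DATASETS = 128

# Counts copied from the data type table.
counts = {
    "Epidemiology": 69,
    "Genomics": 19,
    "Clinical": 15,
    "Imaging": 3,  # subset of Clinical, per the note
    "Mobility": 7,
    "Social science": 4,
    "Healthcare administration": 2,
    "Literature": 2,
    "Other": 10,
}


def percent(count: int, total: int = TOTAL_DATASETS) -> float:
    """Share of the total, as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


for data_type, count in counts.items():
    print(f"{data_type}: {count} ({percent(count)}%)")

# Imaging is already counted within Clinical, so it is excluded from the total.
total_without_imaging = sum(n for t, n in counts.items() if t != "Imaging")
print(total_without_imaging)
```

With imaging excluded, the remaining categories sum to exactly 128, consistent with the note that imaging datasets are a subset of the clinical ones.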
Figure 2. Distribution of geographic regions where COVID-19 datasets were collected.
List of data repositories used by datasets in this study
| Repository | Number | Percent |
|---|---|---|
| GitHub | 57 | 44.5% |
| Google Drive | 7 | 5.5% |
| Mendeley | 6 | 4.7% |
| Kaggle | 3 | 2.3% |
| Individual web page | 55 | 43.0% |
Figure 4. The time of last updates for datasets that were not being updated regularly, among 128 COVID-19 datasets.
The number of times that a data license is used in dataset collection
| License | Number | Percent |
|---|---|---|
| MIT | 12 | 9.4% |
| Creative Commons Attribution 4.0 | 12 | 9.4% |
| GNU General Public License v3.0 | 7 | 5.5% |
| Apache License 2.0 | 4 | 3.1% |
| Creative Commons Zero v1.0 Universal | 3 | 2.3% |
| Creative Commons Attribution-NonCommercial-ShareAlike 4.0 | 2 | 1.6% |
| Mozilla Public License 2.0 | 1 | 0.8% |
| Self-defined data usage policy | 48 | 37.5% |
| Citation required | 11 | 8.6% |
| Unavailable | 30 | 23.4% |
Note: One dataset [111] used multiple licenses, so the percentages in this table may not add up to 100%. Self-defined data usage policy: data owners defined their own data usage policy. Citation required: data owners only require users to cite their associated papers when using the data.
Figure 5. Number of citations for each dataset. The horizontal axis indicates the number of citations of a dataset. The vertical axis label corresponds to the dataset ID in our dataset summary list (included in the Supplementary Data available online at https://academic.oup.com/bib).
Top 10 cited datasets
| Dataset | Overall citations | URL citations | Article citations |
|---|---|---|---|
| Johns Hopkins University Dashboard | 454 | 416 | 275 |
| Real-time estimation of the novel coronavirus incubation time | 239 | 0 | 239 |
| Worldometers | 231 | 231 | 0 |
| Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV-2) | 189 | 0 | 189 |
| Estimates of the severity of coronavirus disease 2019: a model-based analysis | 132 | 0 | 132 |
| Feasibility of controlling COVID-19 outbreaks by isolation of cases and contacts | 104 | 0 | 104 |
| Pattern of early human-to-human transmission of Wuhan 2019 novel coronavirus | 102 | 1 | 102 |
| Early dynamics of transmission and control of COVID-19: a mathematical modelling study | 97 | 0 | 97 |
| The effect of travel restrictions on the spread of the 2019 novel coronavirus (COVID-19) outbreak | 90 | 0 | 90 |
| CDC | 87 | 87 | 0 |
Note: URL citations and article citations may not add up to the number of overall citations, because an article that cited a dataset both via URL and via its data article was counted only once.
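The merging described in the note amounts to counting the union of two sets of citing articles. The sketch below is illustrative only: the function names and the toy article IDs are hypothetical, and the overlap figure is an inference from the table's three columns rather than a number reported by the authors.

```python
def overall_citations(url_citers: set, article_citers: set) -> int:
    """Count each citing article once, even if it cites by URL and by data article."""
    return len(url_citers | article_citers)


def implied_overlap(overall: int, url: int, article: int) -> int:
    """Articles counted in both columns, by inclusion-exclusion."""
    return url + article - overall


# Toy example with made-up article IDs: B and C cite the dataset both ways.
url_citers = {"A", "B", "C"}
article_citers = {"B", "C", "D"}
toy_overall = overall_citations(url_citers, article_citers)
print(toy_overall)

# Applied to the Johns Hopkins row of the table (454 overall, 416 URL, 275 article).
jhu_overlap = implied_overlap(overall=454, url=416, article=275)
print(jhu_overlap)
```

Under this reading, 416 + 275 exceeds 454 because the difference corresponds to articles that cited the Johns Hopkins dataset in both ways.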