
How do we share data in COVID-19 research? A systematic review of COVID-19 datasets in PubMed Central Articles.

Xu Zuo¹, Yong Chen², Lucila Ohno-Machado³, Hua Xu¹.

Abstract

OBJECTIVE: This study aims at reviewing novel coronavirus disease (COVID-19) datasets extracted from PubMed Central articles, thus providing quantitative analysis to answer questions related to dataset contents, accessibility and citations.
METHODS: We downloaded COVID-19-related full-text articles published until 31 May 2020 from PubMed Central. Dataset URL links mentioned in full-text articles were extracted, and each dataset was manually reviewed to provide information on 10 variables: (1) type of the dataset, (2) geographic region where the data were collected, (3) whether the dataset was immediately downloadable, (4) format of the dataset files, (5) where the dataset was hosted, (6) whether the dataset was updated regularly, (7) the type of license used, (8) whether the metadata were explicitly provided, (9) whether there was a PubMed Central paper describing the dataset and (10) the number of times the dataset was cited by PubMed Central articles. Descriptive statistics about these 10 variables were reported for all extracted datasets.
RESULTS: We found that 28.5% of 12 324 COVID-19 full-text articles in PubMed Central provided at least one dataset link. In total, 128 unique dataset links were mentioned in these articles. Further analysis showed that epidemiological datasets accounted for the largest portion (53.9%) of the dataset collection, and most datasets (84.4%) were available for immediate download. GitHub was the most popular repository for hosting COVID-19 datasets. CSV, XLSX and JSON were the most popular data formats. Additionally, citation patterns varied considerably from one dataset to another.
CONCLUSION: PubMed Central articles are an important source of COVID-19 datasets, but there is significant heterogeneity in the way these datasets are mentioned, shared, updated and cited.

Keywords: COVID-19; data sharing; review

Year: 2021 PMID: 33757278 PMCID: PMC7799277 DOI: 10.1093/bib/bbaa331

Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622

Introduction

The novel coronavirus disease (COVID-19) outbreak was first reported in Wuhan, China, on 31 December 2019. On 11 March 2020, the World Health Organization officially declared COVID-19 a pandemic, marking the recognition of a global crisis [1]. To fight the COVID-19 pandemic, researchers worldwide have quickly investigated different aspects of this disease and reported novel scientific findings on a daily basis. According to LitCovid [2], a curated literature hub for tracking COVID-19 publications, 34 890 new articles (as of 25 July 2020) have been published in the past seven months. Along with published articles, massive and heterogeneous datasets have been created, ranging from testing and case statistics at various locations (medical centers, cities, counties, states, countries), clinical data from studies (e.g. ‘omics, imaging, assays, questionnaires) or from electronic health records, surveys for patient-reported outcomes, administrative data [e.g. ventilators, hospitalizations, intensive care unit (ICU) beds] and vital statistics (e.g. obituaries, death certificates) to sociodemographic, environmental, economic, individual mobility and transportation data. Efficient sharing of biomedical data is an important component of successful data-driven research on COVID-19 [3]. For example, researchers reconstructed the early evolutionary paths of COVID-19 by genetic network analysis using existing virus genome data collected across the world, providing insights into virus transmission patterns [4]. Nevertheless, it is challenging for researchers to find and identify reliable datasets for novel scientific discoveries, given the large volume of, and sometimes contradictory information about, available datasets (e.g. from non-peer-reviewed sources).
Principles such as FAIR (Findable, Accessible, Interoperable and Reusable) [5] and TRUST (Transparency, Responsibility, User focus, Sustainability and Technology) [6] have been proposed for sharing digital data and digital repositories, with applications to COVID-19 datasets as well (e.g. the Virus Outbreak Data Network) [7]. Here, we conduct a systematic review of COVID-19 datasets that are associated with published literature. Our study aims at identifying a comprehensive list of available COVID-19 datasets across domains and at providing insights on how researchers share datasets as they publish COVID-19 research articles. We also assess the accessibility, sustainability and impact of published datasets. More specifically, we attempt to answer the following research questions about COVID-19 datasets that are associated with publications:

Q1. Contents: What types of data are published to support different studies and where are those data collected from?
Q2. Accessibility: How can users access datasets and where are the data hosted?
Q3. Citation: How are datasets cited by others and what are the top high-impact datasets, by citation count?

Our ultimate goal is to promote data sharing and data reuse through careful analyses of current practice by researchers. Through a systematic review, we provide researchers with a comprehensive list of reliable datasets that are available to the public. Additionally, we provide insights about data sharing strategies to aid those who plan to develop and publish new COVID-19 datasets.

Methods

COVID-19 publication collection

To identify and collect COVID-19-related articles, we leveraged LitCovid [2], a newly established literature database for tracking the latest scientific articles about COVID-19, developed by the National Library of Medicine in the United States. LitCovid provides essential bibliographic information such as PubMed ID, title, abstract and journal of publications related to COVID-19. In this review, we included all LitCovid articles published before 31 May 2020, resulting in 18 332 articles. As the recognition of associated datasets requires access to full-text articles, we further limited articles to those with full text available in PubMed Central (PMC), which is one of the most significant open access repositories of full-text biomedical articles. We then removed the errata notes of 16 articles. These two steps reduced the number of articles to 12 324, from which we carried out our dataset collection process.

COVID-19 dataset collection

We manually reviewed 100 PMC full-text articles and identified the following patterns for mentioning datasets: (1) Dataset information is available in the Data Availability Statement section provided by PMC, allowing the authors to disclose information about data availability and access, which often contains URL links to data sources, or (2) When the Data Availability Statement section was missing, datasets could have been mentioned in the full text as (a) external URL links to the data sources, (b) supplemental files (e.g. additional tables, sometimes in PDF) and (c) textual statements about data availability (e.g. ‘available upon request’). As datasets from categories 2b and 2c often required additional effort before they could be used in calculations, we limited our data collection to categories 1 and 2a, which led to the task of identifying URL links in PMC full-text articles. Of course, external URL links in PMC articles do not always refer to datasets. Therefore, we developed a process that combines automatic extraction with manual review to identify dataset links mentioned in articles. We first downloaded the full texts of 12 324 PMC articles in XML format using E-Fetch queries [8]. All URLs tagged with the markup ‘ext-link’ were then automatically extracted from articles. This included URLs both in the main text and in the citations. These URLs then underwent a normalization step, where components such as the ‘HTTP’ prefix and the ‘htm’ extension were removed, which resulted in a list of 23 467 URLs in total. We then manually reviewed all of them and identified 144 links directing to actual datasets. We noticed that a single dataset can be associated with multiple links. For example, the Johns Hopkins University Dashboard [9] was cited in articles using four different URLs. After merging the data links that directed to the same dataset, we obtained 128 unique datasets from the verified data links.
The complete process of extracting COVID-19 publications from LitCovid and extracting datasets mentioned in full-text COVID-19 publications in PMC is shown in Figure 1.
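
The automatic extraction and normalization steps can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: it assumes the article XML has already been retrieved (for example via E-Fetch), that each URL sits in the xlink:href attribute of an ‘ext-link’ element, and that normalization means lowercasing and stripping the scheme, a leading ‘www.’, trailing slashes and a trailing ‘.htm’ or ‘.html’. The exact normalization rules are not given in the text, and the function names and the sample XML below are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Attribute name for xlink:href once ElementTree has resolved the namespace.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def extract_ext_links(article_xml):
    """Return the href of every <ext-link> element in one article's XML."""
    root = ET.fromstring(article_xml)
    return [el.get(XLINK_HREF) for el in root.iter("ext-link") if el.get(XLINK_HREF)]


def normalize_url(url):
    """Apply assumed normalization rules (the study's exact rules are not stated)."""
    u = url.strip().lower()
    u = re.sub(r"^https?://", "", u)  # drop the scheme
    u = re.sub(r"^www\.", "", u)      # drop a leading "www."
    u = u.rstrip("/")                 # drop trailing slashes
    u = re.sub(r"\.html?$", "", u)    # drop a trailing .htm or .html
    return u


# Hypothetical article fragment citing the same resource under two spellings.
SAMPLE_XML = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <body><p>Data are available at
    <ext-link ext-link-type="uri" xlink:href="HTTP://www.example.org/covid/data.htm">site</ext-link> and
    <ext-link ext-link-type="uri" xlink:href="https://example.org/covid/data">mirror</ext-link>.
  </p></body>
</article>"""

raw_links = extract_ext_links(SAMPLE_XML)
unique_links = sorted({normalize_url(u) for u in raw_links})
print(unique_links)
```

Lowercasing the whole URL is a simplification that could merge distinct paths on case-sensitive hosts, and normalization alone cannot tell whether a link points to a dataset, so the manual review step remains necessary.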

Figure 1

The workflow of screening and collecting publications and datasets, starting from 18 332 LitCovid articles.

COVID-19 dataset review and analysis

For each of the 128 COVID-19 datasets, we manually reviewed its web pages. We extracted information for 10 descriptive variables: (1) type of the dataset (e.g. epidemiological or genomic data), (2) geographic region where the data were collected, (3) whether the dataset was immediately downloadable, (4) file format (e.g. CSV), (5) where the dataset was hosted, (6) whether the data were updated regularly, (7) the type of license used, (8) whether the metadata were explicitly provided, (9) whether there was a PMC paper describing the creation of the dataset and (10) the number of times the dataset was cited by PMC articles (either via URL links or via articles). The definitions and examples of values for the 10 variables are shown in Table 1.

Table 1

Descriptions and examples of metadata variables collected for each dataset

Question	Variable	Description	Examples
Content	Data type	Types of dataset	Epidemiological, clinical, etc.
	Geographic region	The region from where the data were collected	Worldwide, China, United States, etc.
Accessibility	Download	Can user download the dataset	Immediately downloadable versus Request needed
	Data format	Format of the dataset	CSV, XLSX, etc.
	Data hosting	Data repository where the dataset was hosted	GitHub, Mendeley, etc.
	Data update frequency	Whether the dataset was being updated regularly and the last date of update	Regularly updated versus One time only
	License	The type of license used	CC BY 4.0, MIT, etc.
	Metadata availability	Whether the metadata are provided and the metadata format	Machine readable, unstructured or unavailable
Citation	Dataset article	Whether there was a PMC paper that described the dataset	The JHU dataset was described in the PMID 32087114 article
	Citation count	The number of times that the dataset was cited by PMC articles (either via URL links or via data articles)	The JHU dataset was cited by 454 PMC articles

JHU, Johns Hopkins University; CSV, comma separated values; XLSX, Microsoft Excel Open XML Spreadsheet; CC BY 4.0, Creative Commons Attribution 4.0; MIT, The MIT License.

Results

Among 12 324 PMC articles screened, 249 papers included Data Availability Statement sections, of which 23 provided valid online data sources. Of the papers without the Data Availability Statement (12 075), 3486 papers contained at least one dataset link in the full text. The proportion of COVID-19 articles in PMC that provided at least one dataset link was therefore 28.5% (3509/12 324). In total, 128 unique dataset links were mentioned in the 12 324 COVID-19 articles in PMC.
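
The arithmetic behind these screening figures can be checked with a short script. The counts are taken directly from the paragraph above; the variable names are ours.

```python
total_articles = 12324
with_das = 249               # articles with a Data Availability Statement (DAS)
das_with_valid_source = 23   # DAS articles that gave a valid online data source
no_das_with_link = 3486      # non-DAS articles with at least one dataset link

without_das = total_articles - with_das
articles_with_link = das_with_valid_source + no_das_with_link
share = articles_with_link / total_articles

print(without_das, articles_with_link, f"{share:.1%}")
```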

Q1. Content

Data types

Table 2 presents the distribution of types of datasets. As expected, epidemiological datasets (N = 69; 53.9%) constituted the largest portion of our dataset collection. Some of them were created by governmental sources and others by independent statistics suppliers, and were aimed at tracking the latest COVID-19 case updates, including confirmed cases, death cases, recovered cases and the number of tests conducted [9–13, 18–23, 31–50, 53–57, 61, 66, 67, 74–76, 132]. Some epidemiological studies focused on the modeling [15, 16, 26, 29, 51, 58, 64, 135] and prediction of transmission patterns [24, 30, 52, 60, 63, 65, 70–73]. Of all datasets, 14.8% provided COVID-19 genome [76–91, 93] or protein sequences [92, 94, 95, 98]. Clinical datasets (N = 15; 11.7%) largely concentrated on three aspects: (1) incubation period as well as other clinical characteristics of COVID-19 patients [96, 104, 107], (2) potential treatments such as vaccines [97, 105] and medications [99] and (3) imaging (N = 3; 2.3%), with datasets containing chest computed tomography (CT) images of COVID-19 patients, among others [110–112]. Mobility datasets (N = 7; 5.5%), used to track movement trends over time by geography, were also present [114–120]. Social studies datasets (N = 4; 3.1%) gathered information about people’s responses when facing the pandemic [121–124]. LitCovid [2] and CORD-19 [113] were the two major COVID-19 literature databases that appeared in PMC articles. Health administration data [100–103, 108] mainly describe hospital capacities, for example the number of ventilators [103] and ICU beds [108] available in hospitals. Other non-biomedical datasets (N = 10; 7.8%), such as climate [133, 134], economic [127], geographical [128] and population data [129, 131], were also discovered in articles investigating the effects of disease transmission and long-term impacts of the pandemic.

Table 2

The distribution of data types among 128 COVID-19 datasets

Data type	Number	Percent
Epidemiology	69	53.9%
Genomics	19	14.8%
Clinical	15	11.7%
Imaging	3	2.3%
Mobility	7	5.5%
Social science	4	3.1%
Healthcare administration	2	1.6%
Literature	2	1.6%
Other	10	7.8%

Note: Imaging is considered as a subset of clinical datasets.
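
Because imaging is counted inside the clinical category, the rows of Table 2 add up to more than 128 unless the imaging row is excluded. The following sketch recomputes the percentages from the counts in the table and shows that the top-level categories sum to 128. The dictionary and variable names are ours.

```python
TOTAL_DATASETS = 128

# Counts as listed in Table 2.
counts = {
    "Epidemiology": 69,
    "Genomics": 19,
    "Clinical": 15,
    "Imaging": 3,  # subset of Clinical, per the table note
    "Mobility": 7,
    "Social science": 4,
    "Healthcare administration": 2,
    "Literature": 2,
    "Other": 10,
}
subset_rows = {"Imaging"}

# Sum of the top-level categories only (imaging is already inside Clinical).
top_level_sum = sum(n for name, n in counts.items() if name not in subset_rows)

# Percentages relative to all 128 datasets, rounded to one decimal place.
percents = {name: round(100 * n / TOTAL_DATASETS, 1) for name, n in counts.items()}

print(top_level_sum)
for name, pct in percents.items():
    print(f"{name}: {pct}%")
```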

Geographic region

Figure 2 illustrates the geographic regions that the datasets covered. More than half of the datasets (N = 68; 53.1%) incorporated data from around the world. In total, 18 (14.1%) datasets involved data from China. Multiple epidemiological datasets were reported for the United States [66, 69, 90, 102], United Kingdom [40, 54, 62] and India [35, 36, 57] as the coronavirus disease spread to these countries. Aside from country-level data, datasets covering the entire continents of Africa [59] and Europe [109] were also created. There were also smaller datasets that covered only states [39, 47], counties [56] and cities [14, 16, 17, 43, 53]. Such datasets were often created by local health departments and incorporated detailed demographic breakdowns of COVID-19 patients.

Figure 2

Distribution of geographic regions where COVID-19 datasets were collected.

Q2. Accessibility

Download

Among the 128 datasets in our study, 20 datasets did not provide clear downloading information. Users who wish to use these datasets need to contact the owners for download instructions. Therefore, we marked the accessibility of such datasets as ‘request needed’. The remaining 108 (84.4%) datasets were instantly downloadable, although 9 of them required registration before the data could be accessed.

Data format

Of 108 datasets that could be downloaded instantly, 19 were available to download in multiple formats. CSV (N = 53; 49.1%), XLSX (N = 27; 25.0%) and JSON (N = 10; 9.3%) were the three most popular formats for dataset exchange. Almost all genetic studies shared data in FASTA. RDS and RDA were two common data formats in studies that utilized the R programming language [28, 60, 68, 106, 108, 126]. Imaging datasets typically shared CT images as JPG files. Datasets of protein structures offered data in PDB files [92, 95]. GeoTIFF files were provided in a worldwide population dataset, allowing the data to be projected onto a geographical map [129].

Figure 3

Data formats used by 108 downloadable datasets, among 128 COVID-19 datasets.

Data hosting

As shown in Table 3 below, the most popular data repository was GitHub, which hosted 57 (44.5%) datasets. Six (4.7%) datasets were stored on Mendeley Data, a cloud-based repository for research data from scholarly articles. The category of individual web pages (N = 55; 43.0%) refers to datasets accessible only via stand-alone websites, as opposed to those deposited in established data repositories.

Table 3

List of data repositories used by datasets in this study

Repository	Number	Percent
GitHub	57	44.5%
Google drive	7	5.5%
Mendeley	6	4.7%
Kaggle	3	2.3%
Individual web page	55	43.0%

Data update frequency

More than half (N = 74; 57.8%) of the datasets were being updated regularly (often daily or weekly). If data depositors did not offer any information regarding the updating frequency, we treated those datasets as not being updated on a regular basis. We recorded the date of the last update on those datasets. Figure 4 illustrates the number of datasets that stopped updating in each month.

Figure 4

The time of last updates for datasets that were not being updated regularly, among 128 COVID-19 datasets.

License

Table 4 shows the statistics for data licensing. Among the 128 datasets we collected, 39 (30.5%) clearly specified data licenses that define the permitted use of the datasets. The COVID-19 Image Data Collection [111] used multiple licenses for different subsets of data. A further 37.5% (N = 48) stated their own terms and policies for data usage in detail online, 8.6% (N = 11) required users to cite their associated papers when using the data but did not offer other information on data sharing and usage, and 23.4% (N = 30) did not release any information regarding data usage.

Table 4

The number of times that a data license is used in dataset collection

License	Number	Percent
MIT	12	9.4%
Creative Commons Attribution 4.0	12	9.4%
GNU General Public License v3.0	7	5.5%
Apache license 2.0	4	3.1%
Creative Commons Zero v1.0 Universal	3	2.3%
Creative Commons Attribution-NonCommercial-ShareAlike 4.0	2	1.6%
Mozilla Public License 2.0	1	0.8%
Self-defined data usage policy	48	37.5%
Citation required	11	8.6%
Unavailable	30	23.4%

Note: One dataset [111] used multiple licenses, thus percentages in this table may not add up. Self-defined data usage: data owners defined their own data usage policy; citation required: data owners only require users to cite their associated papers when using the data.

Metadata availability

Of the 108 datasets that are immediately downloadable, 77.8% (N = 84) provide metadata in machine-readable formats. Several datasets [40, 74, 78, 125, 130] and data deposited on established data repositories (GitHub, Mendeley and Kaggle) offer application programming interfaces (APIs) to automatically retrieve metadata. A further 9.3% (N = 10) provide metadata in free text, which includes information such as dataset names, data owners and data descriptions, and 13.0% (N = 14) do not release any metadata.

Q3. Citation

Dataset article availability

Of the 128 datasets, 41.4% (N = 53) were described in detail in publications on PMC. Of the 53 articles describing datasets, 5 described extensively the purpose and techniques of building COVID-19 databases. The main purpose of the remaining 48 articles was to carry out modeling, prediction or other types of analysis in diverse domains, with some description of the datasets used in the study. These were often the datasets that were not updated on a regular basis: the data were collected, standardized and maintained by the authors themselves for the specific studies.

Citation count

Figure 5 shows the number of citations for each dataset. Typically, a dataset can be cited in two ways in articles: (1) as a URL in the full text and (2) as an article that describes the dataset. It is possible for an article to cite both the URL and the article of the dataset. The number shown in Figure 5 is the overall number of citations (both articles and URLs), with duplicated citations removed. The number of citations varied heavily across datasets. The dataset available in the Johns Hopkins University Dashboard [9] was the most popular and was cited 454 times. Of the top 10 datasets, 9 were from the epidemiology domain. However, a low number of citations does not necessarily indicate that a dataset has little impact; it may simply reflect that the dataset has not yet had enough time to accrue citations (i.e. more recently published datasets).

Figure 5

Number of citations for each dataset. The horizontal axis indicates the number of citations of a dataset. The vertical axis label corresponds to the dataset ID in our dataset summary list (included in the Supplementary Data available online at https://academic.oup.com/bib).

Table 5 presents the top 10 cited datasets in our study. The Johns Hopkins University Dashboard [9] had a large number of citations both as an online data link and as a publication. Worldometers [33] and CDC [31] are high-impact data sources for COVID-19 case updates and were cited frequently as external data links. They are used in the Johns Hopkins University Dashboard but, since they do not accrue citations indirectly, their impact may be underestimated. The remaining seven datasets were almost all cited as articles published on PubMed and had no or few URL citations.

Table 5

Top 10 cited datasets

Dataset	Overall citations	URL citations	Article citations
Johns Hopkins University Dashboard [9]	454	416	275
Real-time estimation of the novel coronavirus incubation time [93]	239	0	239
Worldometers [33]	231	231	0
Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV-2) [22]	189	0	189
Estimates of the severity of coronavirus disease 2019: a model-based analysis [60]	132	0	132
Feasibility of controlling COVID-19 outbreaks by isolation of cases and contacts [25]	104	0	104
Pattern of early human-to-human transmission of Wuhan 2019 novel coronavirus [14]	102	1	102
Early dynamics of transmission and control of COVID-19: a mathematical modelling study [61]	97	0	97
The effect of travel restrictions on the spread of the 2019 novel coronavirus (COVID-19) outbreak [27]	90	0	90
CDC [31]	87	87	0

Note: The number may not add up to the number of overall citations as we merged URL citations and article citations from the same article.
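
The relationship between the three columns of Table 5 follows from inclusion-exclusion. The sketch below assumes that ‘overall citations’ counts distinct citing articles, as the note above suggests, so that the number of articles citing a dataset both ways equals URL citations plus article citations minus overall citations. The function name and the toy article identifiers are ours.

```python
def citing_both_ways(overall, url_citations, article_citations):
    """Articles that cite a dataset both by URL and by its describing article.

    Inclusion-exclusion: |URL or article| = |URL| + |article| - |both|.
    """
    return url_citations + article_citations - overall


# First row of Table 5: overall 454, URL 416, article 275.
top_row_overlap = citing_both_ways(454, 416, 275)

# The same idea with explicit sets of (hypothetical) citing-article IDs.
url_citers = {"A", "B", "C"}
article_citers = {"B", "C", "D"}
overall_toy = len(url_citers | article_citers)   # distinct citing articles
both_toy = len(url_citers & article_citers)      # articles citing both ways

print(top_row_overlap, overall_toy, both_toy)
```

Under this assumption, 237 of the 454 articles citing the top-ranked dataset did so both by URL and by article.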

Discussion

Although no extensive analyses have been carried out on the availability, accessibility and types of COVID-19 datasets, the collection and sharing of COVID-19 data have received great attention in the scientific community. Alamo et al. [136] highlighted a variety of significant open data sources and evaluated the limitations and readability of available data. They concluded that notable progress was achieved by certain scientific communities, particularly among epidemiologists, healthcare specialists, the machine learning community and data scientists. Several studies also reviewed and explored available COVID-19 data in specific domains. Kalkreuth and Kaufmann [137] reviewed publicly available medical imaging resources for COVID-19 cases worldwide. Rubin [138] reported on recent progress in collecting data of ventilated patients confirmed with COVID-19. Robinson and Yazdany [139] described an initiative to collect data about COVID-19 patients with rheumatic diseases. Khalatbari-Soltani et al. [140] listed a series of important socioeconomic characteristics often overlooked when collecting and reporting social science data related to the pandemic. In our study, we took a different approach and conducted a systematic review of COVID-19 literature in PMC to identify associated datasets. Our analysis produced a number of interesting findings. First, although PMC implemented the Data Availability Statement section, only about 2% (249/12 324) of articles explicitly provided such information, indicating a low adoption rate. Nevertheless, 3509 papers (about 28.5%) provided at least one URL link to datasets used in the study, demonstrating that a significant portion of researchers are aware of the importance of sharing data.
Additionally, all 128 datasets identified in PMC articles allowed users to access their data, and 84.4% (108/128) were available for immediate download, indicating the level of openness of data sharing in COVID-19 research. Epidemiology datasets constituted more than half of our dataset collection, while imaging datasets accounted for only 2.3%, indicating the need to develop more datasets for imaging and related domains, which will probably require worldwide collaboration in order to grow to the same size as epidemiology datasets. As for data format, although FAIR [5] recommends the RDF (Resource Description Framework) format, no dataset in this study has adopted RDF, probably because common machine-readable formats such as CSV, JSON and TXT are easier to understand. We observed two major types of practices in licensing data usage. Data owners who use established data repositories often use a variety of existing data licenses to grant data usage and sharing. On the other hand, data owners who publish datasets on individual web pages prefer to specify their own terms and policies. Overall, 76.6% (98/128) of data owners allow non-commercial use of data and specify the degree of openness by releasing data usage policies. The data update frequency relied heavily on the objectives of creating the dataset. Among the 75 datasets only available as online sources, the majority were updated regularly for public use. However, for the 41.4% (53/128) of datasets that are associated with publications, the authors collected and maintained the datasets themselves for different purposes. Five articles aimed at describing how the COVID-19 databases were built, and they discuss data collection, storage and visualization. The remaining 48 articles focused on modeling, predictions or other analysis related to COVID-19. The authors of these analysis articles kept not only the data but also the code and tools they used in their own studies.
The datasets mentioned in these articles represent the collection of raw data that authors used as input for their analysis. Such data are often limited to a period of time and contain a relatively small number of cases. We observed two approaches for citing datasets: (1) URL citations: citing URLs that led to the data sources and (2) Article citations: citing the article that describes the dataset. After examining the articles that cited datasets in the full text, we also discovered two major purposes of citing datasets: (1) citing a dataset as the data source used in the study and (2) citing a dataset as a general reference. Researchers are typically more likely to have used the dataset if they cite it directly as a URL. On the other hand, when citing a dataset as an article, the authors are more likely to mention it as a general reference instead of citing it as a data source. This suggests that a larger number of URL citations to a dataset indicates higher reuse. However, we also saw that datasets that aggregate data from several sources can be popular and highly cited, while the data sources they use may not always receive citations. This indicates that indirect citations may need to be considered when assessing the true impact of a dataset in terms of its utility. Additionally, if a dataset is associated with a dedicated description paper, e.g. the Johns Hopkins University Dashboard [9] or the Epidemiological Data from the nCoV-2019 Outbreak [18], other papers that used the dataset may cite it both as a URL and as a paper. One limitation of this study is that we limited our analysis to full-text articles in PMC. Although PMC is the largest full-text article repository in the biomedical domain, about one-third (5992/18 332) of LitCovid papers are not included in this study because they are unavailable in PMC.
Considering that LitCovid collects articles from PubMed only, the actual number of COVID-19 articles that are not included in this study could be even higher. In the future, we plan to look into other sources of full-text articles to study COVID-19 dataset status. Additionally, our study did not take into account high-impact datasets cited often by preprints, such as the Public Coronavirus Twitter Dataset [141]. Furthermore, we reviewed only the URLs extracted from articles, instead of other potential types of references that could be revealed had we reviewed the whole text. There is a chance that we missed data source information stated in plain text. We hope to resolve this problem and to expand the dataset collection by introducing natural language processing techniques in our future studies.

Conclusion

We screened 12 324 COVID-19-related full-text articles in PMC and collected 128 unique dataset URLs. By systematically analyzing the collected datasets in terms of content, accessibility and citation, we observed significant heterogeneity in the way these datasets are mentioned, shared, updated and cited. These findings on current practice in generating, sharing and citing datasets for COVID-19 research can provide valuable insights for future improvements.

128 COVID-19 datasets from 12 324 COVID-19 articles were collected for this systematic review.
We conducted a quantitative analysis of dataset contents, accessibility and citations.
84.4% of COVID-19 scholarly datasets are available for immediate download.
The number of dataset URL citations is a valuable indicator of dataset utility.
