Alex Reinhart1, Logan Brooks2, Maria Jahja3,2, Aaron Rumack2, Jingjing Tang4, Sumit Agrawal5, Wael Al Saeed6, Taylor Arnold7, Amartya Basu8, Jacob Bien9, Ángel A Cabrera10, Andrew Chin2, Eu Jing Chua2, Brian Clark2, Sarah Colquhoun5, Nat DeFries2, David C Farrow5, Jodi Forlizzi10, Jed Grabman5, Samuel Gratzl2, Alden Green3, George Haff2, Robin Han10, Kate Harwood5, Addison J Hu3,2, Raphael Hyde5, Sangwon Hyun9, Ananya Joshi6, Jimi Kim11, Andrew Kuznetsov10, Wichada La Motte-Kerr2, Yeon Jin Lee12,13, Kenneth Lee14, Zachary C Lipton2, Michael X Liu10, Lester Mackey15, Kathryn Mazaitis2, Daniel J McDonald16, Phillip McGuinness5, Balasubramanian Narasimhan17,18, Michael P O'Brien5, Natalia L Oliveira3,2, Pratik Patil3,2, Adam Perer10, Collin A Politsch2, Samyak Rajanala17, Dawn Rucker6, Chris Scott5, Nigam H Shah19, Vishnu Shankar20, James Sharpnack14, Dmitry Shemetov2, Noah Simon21, Benjamin Y Smith5, Vishakha Srivastava2, Shuyi Tan16, Robert Tibshirani17,18, Elena Tuzhilina17, Ana Karina Van Nortwick2, Valérie Ventura3, Larry Wasserman3,2, Benjamin Weaver5, Jeremy C Weiss22, Spencer Whitman5, Kristin Williams10, Roni Rosenfeld2, Ryan J Tibshirani3,2.
Abstract
The COVID-19 pandemic presented enormous data challenges in the United States. Policy makers, epidemiological modelers, and health researchers all require up-to-date data on the pandemic and relevant public behavior, ideally at fine spatial and temporal resolution. The COVIDcast API is our attempt to fill this need: Operational since April 2020, it provides open access to both traditional public health surveillance signals (cases, deaths, and hospitalizations) and many auxiliary indicators of COVID-19 activity, such as signals extracted from deidentified medical claims data, massive online surveys, cell phone mobility data, and internet search trends. These are available at a fine geographic resolution (mostly at the county level) and are updated daily. The COVIDcast API also tracks all revisions to historical data, allowing modelers to account for the frequent revisions and backfill that are common for many public health data sources. All of the data are available in a common format through the API and accompanying R and Python software packages. This paper describes the data sources and signals, and provides examples demonstrating that the auxiliary signals in the COVIDcast API present information relevant to tracking COVID activity, augmenting traditional public health reporting and empowering research and decision-making.
Keywords: digital surveillance; internet surveys; medical insurance claims; open data
Year: 2021 PMID: 34903654 PMCID: PMC8713778 DOI: 10.1073/pnas.2111452118
Source DB: PubMed Journal: Proc Natl Acad Sci U S A ISSN: 0027-8424 Impact factor: 12.779
Data sources available in the COVIDcast API (19), as of date of publication
| Data source | Signals available | First date | Resolution |
| --- | --- | --- | --- |
| Change Healthcare | Percentage of outpatient visits with COVID diagnostic codes or codes indicating COVID-like symptoms; based on deidentified claims data processed by Change Healthcare | 1 February 2020 | County* |
| Doctor visits | Percentage of outpatient visits primarily about COVID-like symptoms, based on deidentified claims data provided by health system partners | 1 February 2020 | County* |
| Hospital admissions | Percentage of new hospital admissions with COVID diagnostic codes, based on deidentified claims data provided by health system partners | 1 February 2020 | County** |
| Quidel | Test positivity rates for COVID-19 antigen tests produced by Quidel | 26 May 2020 | County** |
| SafeGraph | Mobility metrics, such as time away from home or visits to bars and restaurants, based on cell phone mobility data collected by SafeGraph | 1 January 2019 | County |
| COVID-19 Trends and Impact Survey | COVID symptoms, social distancing behaviors, mental health, economic impact, behavior (e.g., mask wearing, vaccination attitudes), and COVID testing signals based on daily surveys conducted nationally by Delphi through Facebook | 6 April 2020 | County** |
| Health and Human Services | Counts of hospital admissions due to confirmed or suspected COVID-19, as reported by the Department of Health and Human Services | 31 December 2019 | State |
| CovidActNow | COVID-19 testing results, such as positivity rate and number of tests, compiled by CovidActNow from CDC reporting | 2 March 2020 | County* |
| Google symptoms | Trends in Google search volume for terms related to anosmia and ageusia (loss of smell or taste), which correlate with COVID activity, based on data shared by Google | 13 February 2020 | County*** |
| Cases and deaths | Confirmed COVID-19 cases and deaths, compiled by JHU CSSE | 22 January 2020 | County |
| NCHS mortality | Weekly totals of deaths broken down by cause, such as COVID, flu, or pneumonia, compiled by the National Center for Health Statistics | 26 January 2020 | State |
The first group of data sources is produced from data not otherwise available publicly (or only available in limited form); the second group is mirrored from public sources. Asterisks denote availability: *available at > 60% of counties; **available at 20 to 60% of counties; ***available at < 20% of counties. For some signals, location availability varies over time, for example, due to variable reporting volume.
Fig. 1. National trends, from April 2020 to April 2021, of four signals in the COVIDcast API. The auxiliary signals, based on medical claims data and massive surveys, track changes in officially reported cases quite well. (They have all been placed on the same range as reported cases per 100,000 people.)
Fig. 2. Geo-wise correlations with case rates, from April 15, 2020 to April 15, 2021, calculated over all counties for which all signals were available and which had at least 500 cumulative cases by the end of this period.
Fig. 3. Time-wise correlations with case rates, from April 15, 2020 to April 15, 2021, calculated over all counties for which all signals were available and which had at least 500 cumulative cases by the end of this period.
Fig. 4. (Left) Reported cases per day in Bexar County, Texas, during the summer of 2020. On July 16, 4,810 backlogged cases were reported, although they actually occurred over the preceding 2 wk (this shows up as a prolonged spike due to the 7-d trailing averaging applied to the case counts). (Right) Daily CTIS estimates of CLI-in-community showed more stable underlying trends.
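To see why a single day of backlogged reports appears as a prolonged spike, consider the following sketch of a 7-d trailing average. The baseline of 100 cases per day and the series length are hypothetical; only the backlog size of 4,810 comes from the caption. How partial windows at the start of a series are handled is also an assumption here.

```python
def trailing_average(values, window=7):
    """Mean of each value and up to `window - 1` preceding values."""
    out = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start : i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# Hypothetical daily counts: a flat 100 cases per day, plus a one-day
# backlog dump of 4,810 cases (the size given in the caption) on day 10.
daily = [100] * 21
daily[10] += 4810

smoothed = trailing_average(daily)

# Days on which the smoothed series sits above the true baseline.
elevated_days = [i for i, v in enumerate(smoothed) if v > 100]
```

In this toy series the one-day dump inflates the smoothed value for seven consecutive days (indices 10 through 16), because each of those trailing windows still contains the dump day.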
Fig. 5. Estimated percentage of outpatient visits due to COVID-like illness (DV-CLI), displayed across multiple issue dates; later issue dates add new data and revise values from prior issue dates.
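The idea of issue dates can be sketched as a small versioned data structure: each record carries both the date it describes and the date it was published, and an "as of" query returns the most recent value known at a given time. The record layout, the function name `as_of`, and all values below are hypothetical and are not taken from the API or its client packages.

```python
from datetime import date

# Hypothetical version records: (reference_date, issue_date, value).
# Each later issue may add new reference dates and revise earlier ones.
records = [
    (date(2020, 6, 1), date(2020, 6, 5), 2.1),
    (date(2020, 6, 1), date(2020, 6, 12), 2.6),  # revision of June 1
    (date(2020, 6, 2), date(2020, 6, 6), 1.9),
    (date(2020, 6, 2), date(2020, 6, 12), 2.4),  # revision of June 2
    (date(2020, 6, 8), date(2020, 6, 12), 3.0),  # first published later
]

def as_of(records, when):
    """Return {reference_date: value} as it was known on date `when`."""
    latest = {}
    for ref, issue, value in records:
        if issue > when:
            continue  # not yet published on `when`
        if ref not in latest or issue > latest[ref][0]:
            latest[ref] = (issue, value)
    return {ref: value for ref, (issue, value) in latest.items()}

early = as_of(records, date(2020, 6, 7))   # before the June 12 issue
late = as_of(records, date(2020, 6, 15))   # after the June 12 issue
```

Querying the same reference dates at two different times gives different answers, which is the behavior a modeler needs in order to train and evaluate on the data that was actually available at forecast time.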
Fig. 6. The 95th percentiles of relative error of early reported values of key signals compared to final values reported much later. For each date between October 15, 2020 and April 15, 2021, the values for each state reported between 10 and 90 d later are compared to “final” versions recorded as of August 13, 2021. Even officially reported case and death data can have large revisions 30 d to 60 d or more after initial reporting; much of this is driven by individual large revisions affecting specific states and dates, rather than by systematic changes affecting all states and dates.
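A revision summary of the kind shown in Fig. 6 can be sketched as follows. This assumes relative error is defined as |early − final| / |final|, that entries with a final value of zero are skipped, and that the percentile uses linear interpolation; the caption does not specify these details, and the state and date values below are invented for illustration.

```python
def percentile(values, q):
    """q-th percentile (0-100), linearly interpolating between order statistics."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac

def relative_errors(early, final):
    """|early - final| / |final| for shared keys, skipping final == 0 (assumption)."""
    return [
        abs(early[k] - final[k]) / abs(final[k])
        for k in early
        if k in final and final[k] != 0
    ]

# Hypothetical values keyed by (state, reference date); not real data.
final = {
    ("PA", "2020-11-01"): 200.0, ("TX", "2020-11-01"): 500.0,
    ("PA", "2020-11-02"): 250.0, ("TX", "2020-11-02"): 400.0,
}
# Values as first seen some fixed number of days after each reference date.
early_10d = {
    ("PA", "2020-11-01"): 190.0, ("TX", "2020-11-01"): 500.0,
    ("PA", "2020-11-02"): 245.0, ("TX", "2020-11-02"): 300.0,
}

errors = relative_errors(early_10d, final)
p95 = percentile(errors, 95)
```

In this toy example one large revision (300 revised to 400) dominates the upper percentile while the other entries barely change, mirroring the caption's point that individual large revisions drive much of the error.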
Fig. 7. CTIS estimates of the percentage of people willing to get vaccinated as of January 20, 2021, compared to CDC reporting of the percentage of people vaccinated as of July 20, 2021. Each point is a county (with at least 250 survey responses between January 14 and 20, 2021), colored by its parent United States Census region.