Literature DB >> 35358171

Near real-time surveillance of the SARS-CoV-2 epidemic with incomplete data.

Pablo M De Salazar1, Fred Lu2,3, James A Hay1, Diana Gómez-Barroso4,5, Pablo Fernández-Navarro4,5, Elena V Martínez5,6, Jenaro Astray-Mochales7, Rocío Amillategui4, Ana García-Fulgueiras8, Maria D Chirlaque8, Alonso Sánchez-Migallón8, Amparo Larrauri4,5, María J Sierra6,9, Marc Lipsitch1, Fernando Simón5,6, Mauricio Santillana1,2,3,10, Miguel A Hernán11.   

Abstract

When responding to infectious disease outbreaks, rapid and accurate estimation of the epidemic trajectory is critical. However, two common data collection problems affect the reliability of the epidemiological data in real time: missing information on the time of first symptoms, and retrospective revision of historical information, including right censoring. Here, we propose an approach to construct epidemic curves in near real time that addresses these two challenges by 1) imputation of dates of symptom onset for reported cases using a dynamically-estimated "backward" reporting delay conditional distribution, and 2) adjustment for right censoring using the NobBS software package to nowcast cases by date of symptom onset. This process allows us to obtain an approximation of the time-varying reproduction number (Rt) in real time. We apply this approach to characterize the early SARS-CoV-2 outbreak in two Spanish regions between March and April 2020. We evaluate how these real-time estimates compare with more complete epidemiological data that became available later. We explore the impact of the different assumptions on the estimates, and compare our estimates with those obtained from commonly used surveillance approaches. Our framework can help improve accuracy, quantify uncertainty, and evaluate frequently unstated assumptions when recovering the epidemic curves from limited data obtained from public health systems in other locations.


Year:  2022        PMID: 35358171      PMCID: PMC9004750          DOI: 10.1371/journal.pcbi.1009964

Source DB:  PubMed          Journal:  PLoS Comput Biol        ISSN: 1553-734X            Impact factor:   4.475


Introduction

Assessing the effectiveness of interventions during outbreaks requires real-time characterization of new infections. Epidemic curves, which describe the number of individuals infected over time, are frequently used to monitor the dynamics of an outbreak. Constructing these curves is particularly challenging for novel pathogens because testing protocols and surveillance systems may not be repurposed quickly enough. Ideally, epidemic curves should document new infections based on the date of exposure to the infectious agent for each individual [1]. Because information on the date of exposure for each patient is usually unavailable, epidemic curves are often constructed based on the first detectable clinical event: onset of symptoms. In practice, however, detected cases are rarely documented at onset of symptoms. Rather, surveillance procedures tend to rely on confirmed diagnoses, which are typically reported with a delay of days or weeks after symptom onset [2-4]. As a result, the number of cases based on the onset of symptoms on any given day is unknown until those cases are reported days or weeks later. This problem is sometimes referred to as “backfill bias”, a term from economics [5] that was later applied to infectious disease tracking [6,7]. Reporting delays complicate timely decision-making [8]. Further, similar populations with different notification procedures may have different reporting delays [9], and the distribution of delays may change over time [10,11]. One way to address the problem of delayed notification is to statistically predict the number of cases with onset of symptoms today based on historical observations of how the case count for a given day was later revised to reflect updated information. This methodology is referred to as “nowcasting” [6,10].
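As a toy illustration of the right-censoring problem that nowcasting addresses, the sketch below inflates the partially observed case count for each onset day by the empirical probability that a case with that onset day would already have been reported. This is written in Python for illustration (the paper's analyses are in R), with made-up counts and reporting fractions, and it is a naive inflation-factor adjustment, not the Bayesian model used later in the paper:

```python
import numpy as np

# Fraction of cases reported within d days of symptom onset (empirical CDF F(d));
# illustrative numbers, not taken from the paper.
F = np.array([0.10, 0.35, 0.60, 0.80, 0.92, 0.98, 1.00])

def inflate_recent_counts(counts, F):
    """Naive right-censoring adjustment: divide the partially observed count for
    onset day t by the probability that a case with that onset day would have
    been reported by 'now' (the last day of the series)."""
    T = len(counts) - 1  # index of 'today'
    adjusted = []
    for t, c in enumerate(counts):
        d = min(T - t, len(F) - 1)  # days elapsed since onset day t
        adjusted.append(c / F[d])
    return np.array(adjusted)
```

For a flat true epidemic of 100 daily cases, the raw counts for the most recent days drop towards zero purely because of reporting lags, while the adjusted series recovers the underlying level (at the cost of large variance for the most recent days, which is why model-based nowcasting is preferred).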
A recently proposed approach to nowcasting, NobBS (Nowcasting by Bayesian Smoothing) [10], complements the estimated delay distribution and historical data with the intrinsic autocorrelation of the transmission process. NobBS has been shown to perform better than previous nowcasting methods for different infectious diseases [10]. Nowcasting requires that the date of symptom onset be collected for all cases that are eventually reported. However, real-time surveillance systems cannot guarantee the ascertainment of the date of the clinical event for all reported cases, even if all cases are reported. Therefore, nowcasting the epidemic curve based on symptom onset requires imputation of the missing dates of symptom onset. The nowcast epidemic curve can then be used to estimate the time-varying reproductive number, that is, the number of secondary infections arising from a single infection on a particular day [12,13]. The estimation of the reproductive number also requires parametric assumptions on the generation interval (the time between a primary and a secondary infection).

Here, we present a three-step approach to estimate, in near real time, the epidemic curve and the time-varying reproductive number (R) in the presence of reporting delays and incomplete data on the date of symptom onset. First, we impute the missing dates of symptom onset using historical delay distributions derived from reported line-list data. Second, we use NobBS to estimate case counts up to the present while adjusting for reporting delays. Third, we estimate the time-varying reproductive number from the nowcast epidemic curve. We apply the approach to data reported during the early stages of the SARS-CoV-2 outbreak in two regions of Spain.

Methods

Ethics statement

The surveillance protocol was approved by the Inter-territorial Council of the Spanish National Health System. Although individual informed consent was not required, all data were pseudonymised to protect patient privacy and confidentiality. The study was also reviewed by the Institutional Review Board of the Harvard T.H. Chan School of Public Health, Boston, MA (US).

Surveillance data

We applied our methodology to Madrid and Murcia, two regions of Spain with very different characteristics. Madrid has 6.7 million residents, is highly interconnected both nationally and internationally, has the highest population density and urbanicity in the country, is situated in the geographic (inland) center, and had a seroprevalence for SARS-CoV-2 of 11.5% at the end of the study period [14]. Murcia has 1.4 million residents, average connectivity, population density, and urbanicity, is geographically situated in the coastal periphery, and had a seroprevalence of 1.6% at the end of the study period. Each region reported daily counts of PCR-confirmed COVID-19 cases to the Spanish Ministry of Health [15] and individualized data on the date of report (DOR) and, for a proportion of cases, the date of symptom onset (DOS) to the Spanish System for Surveillance at the National Center of Epidemiology (RENAVE) through the Web platform SiViEs (System for Epidemiologic Surveillance in Spain) [16]. We conducted the analyses in each region using cumulative data available at three overlapping periods of the outbreak: an early analysis period when reported cases reached maximum counts (spanning March 1-March 27), an intermediate analysis period shortly after the peak of the epidemic curve (March 1-April 9), and a late analysis period when the epidemic curve was close to zero (March 1-April 16). We chose these three periods because their ending points correspond to distinct times when decisions about epidemic control were considered in Spain, and thus the limitations arising from data availability were especially relevant at those points. All analyses were implemented in R version 4.0.2. Our approach has three steps.

Step 1: Imputation of missing data

We imputed the missing DOS by randomly assigning values drawn from the distribution of the reporting delay (the period between DOS and DOR), conditional on DOR, in individuals with known DOS. Of note, the reporting delay distribution conditional on the DOR is different from both the unconditional delay distribution (the distribution of all reporting delays) and the “forward” delay distribution conditional on the DOS (if my DOS is today, how long do I wait until DOR?). Inferring reporting delays from the unconditional delay distribution or the "forward" delay distribution conditional on the DOS is known to generate biased epidemic curves [17]. We first assumed that missing DOS occurred at random with respect to symptom onset date, and that reporting delays conditional on DOR can be modeled over time t and location i as a negative binomial distribution with mean parameter μ and dispersion parameter θ. Therefore, samples of the missing backward reporting delays can be obtained from the parametric approximation of the observed backward reporting delays at the same time and location, resulting in the following model:

delay(t,i) ~ NegBin(μ(t,i), θ(t,i))

We dynamically estimated the parametric delay distribution at each day by fitting, via maximum likelihood, the backward delay distribution from all cases with available DOS pooled over a period of time τ comprising all dates between the day of imputation d and a lag u (i.e., τ = d−u,…,d) [9]. Using observations over τ, instead of only the observations at d, increases the number of observed delays available at each location for the fitting step. However, to adjust for variation in reporting patterns due to the day of the week (i.e., reduced reporting during weekends compared to weekdays) as well as different short-term dynamical trends (i.e., an increasing/decreasing trend over a week), we modeled the mean delay conditional on the categorical predictor w indicating whether the reporting date is a weekday or a weekend day:

log(μ(t,i)) = β0 + β1·w

For the main analysis we pooled from a 3-day period when the number of available observations was 50 or more, and sequentially increased τ to 7 or 10 days if the total number of observations in τ was smaller than 30. When the number of observations in the longest period of time (10 days) was less than 30, we simply imputed the missing delays by subtracting the observed mean delay from the DOR of each case with incomplete information. Last, to impute the missing DOS at each reporting day d while reliably estimating the uncertainty around the epidemic trends, we sampled from a randomly generated negative binomial distribution in which μ and θ were each modeled as normally distributed, with the mean set to the estimated μ and θ and the standard deviation set to the standard error of the mean. We resampled 100 times to generate 100 time series of cases with complete DOS-DOR for each region i. The sum of the observed and imputed cases became the total cases used for nowcasting. For further details on the imputation model see S1 Text.

We conducted several sensitivity analyses to explore the impact of the method's assumptions on the estimates. First, we repeated the main model approach after masking DOS (hiding original data as missing values) in a random 10% and a random 40% sample of the cases with available reporting delay. Second, after masking DOS in a random 40%, we used two alternative models for imputing the missing reporting delays: a) using a single mean parameter μ over days and regions (a stronger assumption than the main approach); b) estimating the mean parameter μ and dispersion parameter θ always from the distribution of the observed delays pooled over a 7-day period (τ = d−6,…,d) instead of a 3-day period (a stronger assumption than the main approach but weaker than a). Third, we masked DOS in a random 40% sample of cases and then imputed the missing DOS by subtracting the observed mean reporting delay from the reporting date for each case [17] (a procedure hereinafter referred to as backshifting). Last, we evaluated how deviations from the missing-at-random assumption would impact the imputation by randomly masking 20% of cases with delays shorter than the median delay of each analysis period, or masking cases with delays longer than the median.
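The imputation step can be sketched as follows. This is an illustrative Python re-implementation (the paper's code is in R), it uses a simpler method-of-moments negative binomial fit rather than the maximum-likelihood regression described above, and the function names and the fallback threshold handling are ours:

```python
import numpy as np

rng = np.random.default_rng(42)

def fit_negbin_moments(delays):
    """Method-of-moments fit of a negative binomial (mean mu, dispersion theta)
    to observed backward reporting delays."""
    mu = delays.mean()
    var = delays.var(ddof=1)
    # var = mu + mu^2/theta  =>  theta = mu^2 / (var - mu); Poisson limit if underdispersed
    theta = mu ** 2 / (var - mu) if var > mu else np.inf
    return mu, theta

def sample_delays(mu, theta, n):
    """Draw n delays from NB(mu, theta); NumPy parameterizes NB by (n, p)."""
    if np.isinf(theta):
        return rng.poisson(mu, size=n)
    p = theta / (theta + mu)
    return rng.negative_binomial(theta, p, size=n)

def impute_onsets(report_day, observed_delays, n_missing, min_obs=30):
    """Impute dates of symptom onset (DOS = DOR - delay) for cases missing DOS.
    Falls back to mean-delay backshifting when too few observations are pooled."""
    if len(observed_delays) < min_obs:
        delay = int(round(np.mean(observed_delays)))
        return np.full(n_missing, report_day - delay)
    mu, theta = fit_negbin_moments(np.asarray(observed_delays, float))
    return report_day - sample_delays(mu, theta, n_missing)
```

In the paper's fuller procedure, the pooling window τ widens adaptively (3, 7, then 10 days) until enough observations are available, and μ and θ are themselves resampled to propagate estimation uncertainty into the 100 imputed series.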

Step 2: Nowcasting the epidemic curve

After imputation, all reported cases have a DOS. However, a number of cases with DOS before day t will be reported after day t. We therefore used NobBS [10] to nowcast the number of yet-unreported cases at t. Briefly, NobBS uses historical information on the reporting delay to predict the number of not-yet-reported cases in the present using a log-linear model of the number of cases. We used the NobBS R package (v1.2), which fits the model in JAGS via the rjags package (v4.10), to compute the number of cases reported with a particular delay. Cases are modeled as a negative binomial process. Further, the implementation of NobBS requires the specification of 1) a sliding window for the time-varying reporting delay, 2) a maximum delay allowed for the window, and 3) a set of priors for the negative binomial distribution parameters. We specified the moving window as 75% of the total number of days in each period, the maximum delay as the maximum window minus 1 day, and weakly informative priors for the NobBS parameters. A detailed formulation of the nowcast model and prior distributions can be found in S1 Text. To assess uncertainty in the estimation, we produced a nowcast series with 10,000 posterior samples for each time point of each of the 100 imputed case count series in each region. We then pooled the samples and calculated the nowcast median and 2.5 and 97.5 percentiles for each day [18]. In sensitivity analyses, we a) used a fixed window of 4 weeks, b) varied the length of the window depending on the observed reporting delay over time, and c) generated the nowcasts using cases by report day, backshifted by the mean reporting delay. Further, we evaluated the nowcast performance in more detail by assessing the changes in the nowcast estimates each day between March 25 and April 8, 2020.
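The pooling of posterior samples across imputed series in this step amounts to the following (a Python sketch, with synthetic Poisson draws standing in for NobBS output; the actual analyses used R, and fewer samples are drawn here than the 10,000 used in the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for NobBS output: nowcast draws of daily case counts by
# onset date, for 100 imputed series x 1,000 posterior samples x 30 days.
draws = rng.poisson(lam=100, size=(100, 1000, 30))

# Pool the samples over the imputed series, then summarize per day.
pooled = draws.reshape(-1, draws.shape[-1])           # (100*1000, 30)
nowcast_median = np.median(pooled, axis=0)
lo, hi = np.percentile(pooled, [2.5, 97.5], axis=0)   # 95% uncertainty band
```

Pooling first and summarizing second is what lets the final uncertainty band reflect both the imputation variability (across the 100 series) and the posterior uncertainty of the nowcast (across samples within each series).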

Step 3: Estimation of time-varying reproductive number R

We estimated the time-varying reproduction number R using the approaches of Wallinga and Teunis (WT) and Cori et al. (C) [13,19], both implemented in the R package EpiEstim (v2.2.1). We used the nowcast estimates and a mean generation interval (the time between a primary and a secondary infection) of 5 days with a standard deviation of 1.9 days [20]. Details on the generation interval distribution can be found in S1 Text. We computed R from the 10,000 NobBS samples to produce the mean and 95% credible interval of R (for computational ease, we used a random sample of size 100 from the 10,000 samples, since results did not materially change with a larger sample). Because WT and C use forward- and backward-looking computational approaches, respectively, a relative delay between them approximating the mean generation interval is to be expected [17]. Further, downward-biased WT estimates are expected for the last period of the nowcast, of length equivalent to the generation time, as they lack sufficient data for computation. In sensitivity analyses, we used available cases by DOS without imputation and nowcasting (i.e., without adjusting for missingness and censoring), observed cases by DOR with and without backshifting by the mean reported delay, and a longer generation interval (mean = 7.5 days, SD = 3.5 days) [21]. Last, we evaluated how significant changes in ascertainment (50% lower for the first 2 or 4 weeks of transmission) can impact the nowcast reconstruction and the R estimates.
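The core of the Cori-style calculation can be sketched as a simple ratio of today's incidence to the infection pressure from past incidence weighted by the generation interval. The Python sketch below is a simplified point estimator, not EpiEstim's Bayesian implementation with its smoothing window; it uses a gamma generation interval with mean 5 and SD 1.9 days as in the main analysis, discretized by a midpoint approximation of our own choosing:

```python
import math
import numpy as np

def discretize_generation_interval(mean=5.0, sd=1.9, max_s=20):
    """Daily weights w_1..w_max_s from a gamma generation interval
    (midpoint approximation of the density over each day, renormalized)."""
    shape, scale = (mean / sd) ** 2, sd ** 2 / mean
    def pdf(x):
        return x ** (shape - 1) * math.exp(-x / scale) / (math.gamma(shape) * scale ** shape)
    w = np.array([pdf(s - 0.5) for s in range(1, max_s + 1)])
    return w / w.sum()

def rt_ratio(incidence, w):
    """Simplified Cori-style point estimate R_t = I_t / sum_s I_{t-s} * w_s,
    without EpiEstim's Bayesian smoothing window."""
    I = np.asarray(incidence, float)
    R = np.full(len(I), np.nan)
    for t in range(1, len(I)):
        lam = sum(I[t - s] * w[s - 1] for s in range(1, min(t, len(w)) + 1))
        if lam > 0:
            R[t] = I[t] / lam
    return R
```

With a flat incidence series the estimator settles at R_t = 1 once a full generation interval of history is available, and an exponentially growing series yields R_t > 1, which is the qualitative behavior expected of the C approach.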

Results

Imputation of missing data

The proportion of missing DOS among reported cases varied by region and epidemic phase. In Madrid, the percentage of missingness was 53% of 32,723 reported cases in the early analysis period, 37% of 50,745 in the intermediate analysis period, and 16% of 56,057 in the late analysis period. The corresponding numbers in Murcia were 25% of 831, 23% of 1,433, and 11% of 1,602. The distribution of the reporting delay was also region-specific and changed over time (S1 Fig). The first row of Fig 1 for each region shows the weekly counts of cases by DOR with observed and missing DOS. The second row of Fig 1 shows the epidemic curve by DOS after the median of imputed cases (grey) was added to the observed cases with available DOS (blue). As expected, the uncertainty of the imputation increases with the proportion of missingness. Compared with the main analysis, the simpler but less flexible negative binomial model showed some limitations when a high proportion of DOS were missing (S1 and S2 Texts and S2 Fig). Additionally, the curve was skewed when the missing-at-random assumption was violated (S2 Text and S3 Fig).
Fig 1

Epidemic curves and reproductive numbers estimated using the data available during the early, intermediate, and late analysis of the initial SARS-CoV-2 outbreak in the regions of Madrid and Murcia, Spain, March 1-April 16, 2020.

DOS: date of onset of symptoms; DOR: date of report; Lines are median estimates, ribbons span 2.5 and 97.5 percentiles. Vertical lines indicate the day when R <1 (red dashed line for WT, purple dashed line for C).


Nowcasting the epidemic curve

The second row of Fig 1 shows the epidemic curve after nowcasting (yellow line). Nowcasting reconstructed the epidemic curve more reliably than unadjusted case counts, whether by DOR or DOS, even in the earlier phases. This is more easily seen in Fig 2: nowcast case counts in the intermediate period more accurately match those estimated in the late period, compared with either the curve of raw observed (non-missing) case counts (B and E) or the curve modeled by mean backshifting (C and F). However, the nowcasting procedure can be biased when there is a high proportion of missing data and sudden changes in the reported case numbers close to the end of the observation period (S3 Text).
Fig 2

Epidemic curves estimated using the data available during the early and intermediate analysis of the initial SARS-CoV-2 outbreak in the regions of Madrid and Murcia, Spain, March 1-April 9, 2020, and comparison with curves obtained in late period of analysis March 1-April 16 (dashed grey line and grey ribbon).

Showing nowcast estimates (orange) in Madrid and Murcia for the early (A and D) and intermediate (B and E) period analysis, observed cases with known date of onset of symptoms for the same period (blue columns), and for comparison with more complete data, those estimated in the late period of analysis (dashed grey line and ribbon); C and F showing cases by date of report back shifted by the mean delay (red line) together with nowcast estimates for the late period analysis and observed cases with known date of onset of symptoms (shadowed blue columns). DOS: date of onset of symptoms; DOR: date of report; Lines are median estimates, ribbons span 2.5 and 97.5 percentiles.

Nowcast curves are smoother than curves of case counts by report date because of the removal of noise related to the reporting process (such as weekday dependency). Also, the peak of case counts by symptom onset was several days earlier in the nowcast curves (around March 16 in both regions) than in the curves of reported cases (around March 25-26). The uncertainty in the nowcast increases with uncertainty in the imputation of DOS, as can be seen in the large uncertainty band in the early analysis. The nowcast curve trends were relatively robust to small-to-moderate changes in missingness and to different parameterizations of the NobBS function, such as the selection of the sliding window of analysis. However, the timing of the case-count peak can vary significantly with data availability and assumptions (S2 Text, S2 Fig). Further, major changes in ascertainment and/or reporting rates can substantially bias the nowcast estimates (S4 Text, S8 Fig).

Estimation of the time-varying reproductive number

The third row of Fig 1 shows the R estimates obtained using the nowcast curve, including uncertainty from the previous steps. The precision of R increased as more observed cases became available over time, as seen in the third row of Fig 1 for both regions. R estimates computed from nowcast curves show an earlier reduction of R than those obtained from raw case counts by DOR, as seen in Fig 3; further, R estimates from case counts by DOR showed significantly higher values followed by a steeper slope for both approaches, and more noise when using Cori's approach, compared with the estimates from the nowcast curves. R estimates obtained using observed case counts by DOS (i.e., without performing imputation and nowcasting) shifted towards lower values in both locations, particularly for the first periods of analysis, which in turn led to estimates of the critical point where R becomes <1 that were 6-9 days earlier; the estimates improved once more information was available in the last period (S4 Text, S7 and S8 Figs).
Fig 3

Reproductive numbers estimated from nowcasted cases by DOS vs observed cases by DOR using Cori et al (A,C) and Wallinga and Teunis (B,D) approaches for Madrid and Murcia, March 1-April 09, 2020.

Showing the reproductive numbers estimated for Madrid (A,B) and Murcia (C,D). Estimates were computed using the Cori et al (A,C) or Wallinga and Teunis (B,D) approaches and either nowcasted cases by DOS or observed cases by DOR. A generation interval of mean 5 days (SD 1.9) was used. A log2 scale is used for the y-axis to facilitate visualization.

Though the WT and C approaches had similar trajectories, WT reached lower values earlier, with a delay between them approximating the mean generation time (5 days), the rationale having been described previously [17]. This led to a consistent 4-6 day difference in the median estimated time for R becoming <1, which can be seen in the lower row of Fig 1 for both regions. This consistency was lost when using case counts by DOR, as seen in Fig 3. As expected, R estimates using the WT method for the last days of analysis were biased downward, which precludes its use for the last period of length equal to the generation interval.

Discussion

We proposed a three-step approach to characterize an outbreak in near real time by adjusting for incomplete data and reporting delays. We applied this approach to the early SARS-CoV-2 outbreak in two regions of Spain. Our findings showed that the country-wide lockdown was followed by a substantial decline in diagnosed cases shortly thereafter, around March 14-20 in Madrid and around March 17-23 in Murcia. Nowcast case counts were more accurate and more consistent with true transmission than the unadjusted curves by DOS or DOR (as documented using the more complete data that became available after the study period). For example, our approach could identify SARS-CoV-2 epidemic control by lockdown in Spain almost a week earlier than when using reported cases by DOR. Our approach has several limitations. First, its validity relies on the assumptions that the date of symptom onset is missing at random, that reporting delays can be approximated using a parametric approach, and that the available historical data are sufficient to parameterize the unknown reporting delay. These assumptions might be violated in many different ways depending on factors such as the accuracy, quality, or procedures of the surveillance systems. However, our sensitivity analyses indicate that the overall trajectory of the epidemic curve was relatively robust to small-to-moderate departures from these assumptions. Second, our approach underperforms when little information is available for training the nowcasting algorithm, especially when good estimates of the reporting delay distribution cannot be obtained.
This limitation might be particularly important a) in the very early stages of an emerging outbreak, when deficits in surveillance procedures, such as challenges in data submission to surveillance systems, may heavily impact the availability and consistency of data over time, and b) when the difference in the number of reported cases between consecutive days is large, for example when there is a substantial reduction in case reporting during weekends. This limitation might preclude public health action based on day-to-day changes in the estimated number of infections; instead, changes in the number of infections over longer periods should be evaluated. Alternatively, more refined nowcasting models dealing with the forward reporting dynamics [22] could improve the reliability of the estimates. Third, our estimates were sensitive to the choice of imputation and nowcasting procedures when the date of symptom onset was unknown for a high percentage of confirmed cases. Fourth, our approach requires that the degree of underreporting of cases be relatively constant over time. Major changes in case reporting will bias the R estimates [23,24]; however, R estimates remain unbiased if the proportion of unascertained or underreported observations is time-invariant [17,25]. All of the previous limitations underscore that good performance of the nowcasting approach can only be achieved by adequately specifying the model to account for the actual reporting process in the region under analysis. Future approaches that include additional terms in the regression models could be explored to better account for the reporting dynamics and the challenge of imputing from a small number of observations, while formal evaluation procedures can help support the selection of the models with the highest nowcasting accuracy [22].
Last, our estimates could be improved by reconstructing the epidemic curve by the date of infection rather than the date of symptom onset, though this would require more complex methods given that the temporal delay from infection to symptom onset is much harder to characterize [17,26-29]. The overall findings of our work are consistent with the evaluation of a similar three-step approach proposed recently [22] and with analyses using synthetic data [17]. Nevertheless, we extend these analyses by focusing on, and illustrating, key aspects of the method and assumptions that support further adaptation of the approach to surveillance in other settings; by comparing different regions and periods of analysis; and by providing alternative models for reconstructing the epidemic curve, and their evaluation, using existing computational tools. Development of ready-to-use approaches for modelling epidemic dynamics helps surveillance services present data appropriately for efficient epidemic control, but understanding the limitations of the procedure and the impact of prespecified assumptions is critical for interpretation. Our approach provides a systematic analysis of key assumptions and implementation procedures frequently used to characterize emerging outbreaks. We propose a disease surveillance framework that acknowledges and adjusts for biases arising from real-world observational challenges, and that is capable of providing objective, quantifiable, and systematic information to aid decision-making in real-time outbreak mitigation efforts.

Additional information on model specification.


Sensitivity of the imputation step.


Sensitivity of the nowcasting step.


Sensitivity of the R estimation step.


Empirical distribution and approximated functions of the reporting delay conditional on report date in the regions of Madrid and Murcia, Spain, March 1-April 16, 2020.

The blue columns represent the true observed proportion. The grey line and ribbon represent the median and 95% uncertainty interval of the prediction under the fitted negative binomial distribution.

Sum of imputed and observed epidemic curves (black line: median; ribbon: 95% CI) in the regions of Madrid and Murcia, Spain, March 1-April 16, 2020, estimated after A,F) randomly masking 10% of available reporting delays and using the main imputation approach; B,G) randomly masking 40% of available reporting delays and using the main imputation approach; C,H) randomly masking 40% of available reporting delays, estimating from a 7-day window of observations, and allowing additional variance for the dispersion parameter; D,I) randomly masking 40% of available reporting delays and using a single value for the mean of the negative binomial; and E,J) randomly masking 40% of available reporting delays and imputing by mean-delay backshifting. Blue columns represent true observed case counts by day of symptom onset.

Sum of imputed and observed epidemic curves (black line: median; ribbon: 95% CI) in the regions of Madrid and Murcia, Spain, March 1-April 16, 2020, estimated after A,B) randomly masking 20% of available reporting delays among cases with delays longer than the median and C,D) randomly masking 20% of available reporting delays among cases with delays shorter than the median. Blue columns represent true observed case counts by day of symptom onset.

Evaluation of the reconstructed epidemic curves at each day between March 25- April 8 using the data available during the intermediate period of analysis of the initial SARS-CoV-2 outbreak in the regions of Madrid and Murcia, Spain.

Nowcast estimates in Madrid and Murcia (orange lines are median estimates, ribbons span the 2.5 and 97.5 percentiles), observed cases with known date of onset of symptoms for the late period of analysis (blue columns), cases with imputed date of onset of symptoms for the late period of analysis (grey columns), and nowcast uncertainty range for the late period of analysis (grey ribbon). Mon: Monday; Tue: Tuesday; Wed: Wednesday; Thu: Thursday; Fri: Friday; Sat: Saturday; Sun: Sunday.

Epidemic curves estimated using alternative values for the NobBS sliding window and the data available during the early analysis of the initial SARS-CoV-2 outbreak in the regions of Madrid and Murcia, Spain, March 1–27, 2020; comparison with backshifted epidemic curves by report date and with curves obtained in the late period of analysis (March 1–April 16) using the main approach.

Nowcast estimates in the intermediate period of analysis (yellow lines represent the median; ribbons span the 2.5 and 97.5 percentiles): 1) using a fixed window of 28 days (A and E), 2) using a window that accounts for 75% of the delays available from the latest period of observations (B and F), 3) using a window that accounts for 99% of the observed delays (C and G), and 4) imputing through backshifting the report date by the mean delay (D and H). Dashed lines and the light grey ribbon represent cases nowcasted later in time using the main approach (late period of analysis). Faded blue columns represent observed cases by DOS; faded grey columns represent median imputed cases by DOS; dark grey ribbons (in A, B, C, D, E and F) represent the 2.5 and 97.5 percentiles of observed plus imputed cases. (PDF)

Epidemic curves estimated using alternative values for the NobBS sliding window and the data available during the intermediate analysis of the initial SARS-CoV-2 outbreak in the regions of Madrid and Murcia, Spain, March 1–April 9, 2020; comparison with backshifted epidemic curves by report date and with curves obtained in the late period of analysis (March 1–April 16) using the main approach.

Showing nowcast estimates in the intermediate period of analysis (yellow lines represent the median; ribbons span the 2.5 and 97.5 percentiles): 1) using a fixed window of 28 days, 2) using a window that accounts for 75% of the delays available from the latest period of observations (B and F), 3) using a window that accounts for 99% of the observed delays (C and G), and 4) imputing through backshifting the report date by the mean delay (D and H). Dashed lines and the light grey ribbon represent cases nowcasted later in time using the main approach (late period of analysis). Faded blue columns represent observed cases by DOS; faded grey columns represent median imputed cases by DOS; dark grey ribbons (in A, B, C, D, E and F) represent the 2.5 and 97.5 percentiles of observed plus imputed cases. (PDF)

Reproductive numbers estimated by comparing two different generation intervals on nowcasted curves and on curves by report date, using the data available during the intermediate analysis of the initial SARS-CoV-2 outbreak in the regions of Madrid and Murcia, Spain, March 1–April 9, 2020.

Showing a sensitivity analysis of the R estimates. Rows A and D plot R for the WT (red) and C (purple) approaches using the nowcast estimates and a generation interval with mean 5 (SD 1.9). Rows B and E show estimates obtained using a longer generation interval with mean 7.5 (SD 3.4); rows C and F show R estimates using the shorter serial interval but calculated from the observed cases by date of report. Vertical lines indicate the day when R < 1 (red dashed line for WT, purple dashed line for C). (PDF)

Reproductive numbers estimated using only the available cases by DOS vs. nowcasted curves and curves by report date, using the data available during the three periods of analysis of the initial SARS-CoV-2 outbreak in the regions of Madrid and Murcia, Spain, March 1–April 9, 2020.

Lines are median estimates; ribbons span the 2.5 and 97.5 percentiles. Vertical lines indicate the day when R < 1 (R estimated from nowcasted curves is shown as a red dashed line for WT and a purple dashed line for C; R estimated from available cases by DOS is shown in blue). (PDF)

Epidemic curves and R estimates using the data available during the intermediate analysis of the initial SARS-CoV-2 outbreak in the regions of Madrid and Murcia, Spain, March 1–April 9, 2020 (A,D and G,J), and comparison of the same estimation approach after randomly subtracting 50% of cases during the first 2 weeks (without including peak transmission; B,E and H,K) and during the first 4 weeks (including peak transmission; C,F and I,L). DOR: date of report. Lines are median estimates; ribbons span the 2.5 and 97.5 percentiles. Vertical lines indicate the day when R < 1 (red dashed line for WT, purple dashed line for C). (PDF)

15 Feb 2021

Dear Dr. Martinez,

Thank you very much for submitting your manuscript "Near real-time surveillance of the SARS-CoV-2 epidemic with incomplete data" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation. When you are ready to resubmit, please upload the following: [1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.
Please note while forming your response that, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. [2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file). Important additional instructions are given below your reviewer comments. Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts. Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Benjamin Muir Althouse, Associate Editor, PLOS Computational Biology
Tom Britton, Deputy Editor, PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors: Please note here if the review is uploaded as an attachment.

Reviewer #1: This paper proposed a two-step approach to reconstruct the epidemic curve in nearly real time, first imputing the cases with missing symptom onset dates based on those with complete data and second nowcasting the epidemic curve to address the right-censoring issue. This method addresses a very common issue faced by infectious disease epidemiologists worldwide, especially during the ongoing outbreak of COVID-19. The paper is well written and clearly presented. I have two comments that I hope the authors can address during revision.
First, since an important application of the method is to estimate Rt in real time, I would like to see a comparison of the Rt estimates with and without the proposed method. Currently, many reported Rt values for COVID-19 were based on the observed number of cases by symptom onset date (e.g., early papers based on data from Wuhan: PMID: 32275295 and 32674112). It will be good to make a direct comparison in this paper and discuss the implications for current COVID-19 surveillance. Second, the authors pointed out that one limitation of their approach is that it requires the ascertainment rate not to change significantly over time. But testing capacity in many countries has changed a lot since the outbreak began. It will be good to use simulations to understand the impact of violations of this assumption so that we can better interpret the results when applying their method.

Reviewer #2: The authors propose a three-step analysis of COVID-19 case reporting data for real-time monitoring of the epidemic situation and illustrate it based on the example of surveillance data from the first phase of the pandemic in Spain. The proposed analysis is based on three steps: 1) imputation of missing dates of symptom onset (DOS) based on the observed "backward" reporting delay distribution conditional on the date of report (DOR) from all cases with available DOS; 2) nowcasting of the epidemic curve (number of cases with DOS on a given day) using a previously published Bayesian model (Nowcasting by Bayesian Smoothing); 3) estimation of the time-varying reproduction number R(t) based on two different approaches. The proposed analysis is illustrated in two Spanish regions at three different time points, and the results of the procedure for the first two time points are evaluated based on retrospectively available data (at time point 3) to compare and evaluate different modeling choices.
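One of the two R(t) approaches the paper uses, the Cori et al. instantaneous reproduction number, reduces to a renewal-equation ratio that is easy to sketch. The code below is a minimal illustrative point estimate only, not the authors' implementation or the full EpiEstim method (it omits the smoothing window and gamma prior); `gi_pmf` is an assumed generation-interval pmf whose first entry corresponds to a lag of one day.

```python
import numpy as np

def cori_rt(incidence, gi_pmf):
    """Instantaneous reproduction number via the renewal equation:
    R_t = I_t / sum_s w_s * I_{t-s}, where w is the generation-interval
    pmf (gi_pmf[0] is the probability of a one-day generation interval)."""
    incidence = np.asarray(incidence, dtype=float)
    rt = np.full(len(incidence), np.nan)
    for t in range(1, len(incidence)):
        # total infectiousness: past incidence weighted by the GI pmf
        lam = sum(gi_pmf[s] * incidence[t - 1 - s]
                  for s in range(min(t, len(gi_pmf))))
        if lam > 0:
            rt[t] = incidence[t] / lam
    return rt
```

Applying this to the raw curve by DOS and to the imputed-plus-nowcasted curve gives exactly the with/without comparison the reviewer asks for; with a flat incidence series the estimator returns R = 1, and with daily doubling and a one-day generation interval it returns R = 2.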
Research on the adequate interpretation of COVID-19 case reporting numbers and real-time surveillance data is a timely and very important topic. Nowcasting can be a valuable analysis tool to adjust reported case numbers for reporting delays. The authors provide an interesting application of such a nowcasting approach; there are, however, several major aspects that require clarification/revision:

1) Missing references and novelty: the proposed three-step analysis is not new; a similar approach of imputation of missing DOS, Bayesian nowcasting, and estimation of R(t) was already proposed and published in Günther et al. (2020). There it was shown that it is necessary to adjust for changes in the "forward" reporting delay distribution over time to avoid bias in the estimation of the epidemic curve.

2) Missing methodological details and modeling choices:

2.1. Imputation: the authors model the reporting delay distribution given DOR based on a negative binomial model with a varying mean per day and region and a constant overdispersion parameter per region. Especially in times of low reporting numbers, modeling a single mean parameter per day might yield rather unstable results. On the other hand, the overdispersion of the negative binomial distribution might vary over time as well (as also described by the authors based on empirical data). Did the authors perform any (formal) model selection for the imputation model? The imputation implies a missing-at-random assumption, and the authors show in Supplemental section S1 that this imputation works quite well when the assumption is fulfilled (DOS missing for randomly selected individuals). In reality, the assumption might however be violated, e.g., if DOS is missing mainly because of pre-symptomatic cases. The current sensitivity analysis does not address this problem, and this should at least be discussed.
Furthermore, the possible extent of such a violation might be quantified in a sensitivity analysis by setting the DOS for cases with missing DOS to the DOR or, e.g., DOR+2.

2.2. Nowcasting: the description of the nowcasting model is very brief and hard for me to understand (even when additionally consulting the original NobBS publication). The selection and specification of the hyperparameters and priors is not motivated and is incomplete. What was the maximum considered delay? Does the model assume a delay distribution that is independent of day t in the whole moving window? It is unclear to me what exactly the parameters alpha_t and beta_d as defined by the authors are and what assumptions the mentioned uniform priors imply. The notation and/or the priors do not seem to correspond to the original publication of NobBS.

2.3. Estimation of R(t): the authors estimate R(t) based on the methods of Wallinga and Teunis and of Cori et al. and note correctly that they are conceptually different. It would be helpful to provide information on the correct interpretation of the different estimated R(t)'s at a given time point t. Is there any advantage of one approach over the other? Two further questions/remarks: How was the generation time/serial interval distribution specified? Because its definition is "forward looking", the Wallinga and Teunis method is biased downwards for all days t for which no information on the number of disease onsets on day t+d is available, for all numbers of days d that have relevant probability mass in the generation time distribution (e.g., because t+d > current day T). For those days t it appears not to be meaningful to report the estimated R(t) based on WT.

3) Study design: How and why did the authors choose the 3 specific time points for showing the results of the nowcast?
For the sake of illustration this might be a reasonable choice, but a thorough evaluation of the performance of the proposed approach should be made based on nowcasts for every day in the study period (i.e., comparison with the "retrospective truth"). In this case, coverage frequencies of prediction intervals can be investigated and systematic biases might be detectable. The performance of different model specifications might also be compared quantitatively, e.g., based on scoring rules.

4) Performance of analysis/nowcasting model: the available information on the performance of the proposed analysis is insufficient. Especially around the first time point in the Madrid region, the estimated epidemic curve appears to be strongly biased upwards, and the (retrospectively observable) decrease in the epidemic curve is not identified by the nowcasting. This appears to be the most interesting time point with respect to real-time surveillance. To adequately judge the performance of the proposed analysis approach and its value for real-time surveillance, a more formal quantitative evaluation would be necessary (see point 3). Where does this strong bias at time point 1 come from? Is it due to changes in the reporting delay distribution over time (e.g., increased reporting speed due to fewer cases/less workload for the health authorities)? On what day is the peak in the epidemic curve in Madrid around March 15 identified based on the nowcasting?

5) Nowcasting with different moving windows: the presented results with respect to the specification of the moving window are difficult to follow. Based on Figure S3, it appears that the maximum considered reporting delay was changed to 5 or 6 days in panels A/D, and that for all days before T-5 or T-6 the so-far-available case counts are considered as final numbers.
Based on the data presented in Figure S1, this appears not to be a reasonable analysis, and it is therefore not possible to compare the effect of different moving windows for the estimation of the delay distribution. Furthermore, results for the first time point of analysis are missing and would be of main interest.

6) Unfortunately, there is a lack of data and code to reproduce the analysis. This makes it difficult to understand the analyses performed and hinders reproducibility and application to data from different regions.

Literature: F. Günther, A. Bender, K. Katz, H. Küchenhoff, and M. Höhle. Nowcasting the COVID-19 pandemic in Bavaria. Biometrical Journal, 2020. https://doi.org/10.1002/bimj.202000112

**********

Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: No: The authors refer to a website (in Spanish) that provides aggregated Spanish COVID-19 case numbers. However, as far as I could see, it does not seem to be directly possible to obtain the person-specific or aggregated information on disease onset and reporting date, which are necessary for the proposed analyses of the manuscript. It is also not clear whether the historical data for the three different time points of analysis can be retrieved. Also, no code is available to reproduce the analyses. A code and data repository for reproducing and adapting the analyses would be desirable.

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.
If you choose "no", your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us.

Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility: To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future.

30 Mar 2021

Submitted filename: Response_Nowcasting_Plos_Comp_Bio.docx

3 May 2021

Dear Dr. Martinez,

Thank you very much for submitting your manuscript "Near real-time surveillance of the SARS-CoV-2 epidemic with incomplete data" for consideration at PLOS Computational Biology.
As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation. When you are ready to resubmit, please upload the following: [1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. [2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file). Important additional instructions are given below your reviewer comments. Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts. Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments. 
Sincerely,

Benjamin Muir Althouse, Associate Editor, PLOS Computational Biology
Tom Britton, Deputy Editor, PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors: Please note here if the review is uploaded as an attachment.

Reviewer #1: I think the authors misunderstand my first comment. The authors provided new results by applying the WT and Cori methods on DOR for comparison. But what I am asking for is a comparison with a direct estimate of Rt on DOS without imputation and nowcasting (as has been done in many studies during the pandemic). From Figure 1, we can see that imputation and nowcasting can change the epidemic curves substantially, but the impact on the Rt estimate is not clear. It will be good for the authors to show the difference and discuss the implications for epidemic control.

Reviewer #2: Dear Authors, Thank you very much for providing a revised version of the manuscript and for the answers to my previous comments, through which several of my questions were answered. I have, however, some remaining questions/concerns that have to be addressed before I can recommend publication:

Despite the additional results and analyses, I am still not convinced about the performance and the added value of the proposed approach in the presented application scenarios, especially for the data in the first period in the Madrid region, which appears to me as the most important with respect to a (near) real-time assessment of the pandemic situation. This is due to the following reasons:

1) I am confused by the changes of the nowcast results for Madrid and Murcia at March 27 (first period) compared to the initial submission. The results in Figure 1 and especially Fig 2A/D do not correspond to the results of Fig 2A/D in the initial submission of the manuscript.
The bias in the results is not visible anymore, but you did not state any changes with respect to the model or the utilized data in your response to the reviewer comments or in the manuscript.

2) The results in the new sensitivity analysis (Fig S4) that incorporate more data (since they are based on the data from the second period restricted to what was available until March 27) appear to be worse than the results presented in (the new) Fig 2A. This is very confusing, and the introductory sentences for Supplementary Section S3 do not currently make sense to me. You state "As shown in Figure 1 in the main analysis, for the early period of analysis in Madrid, the nowcasting approach did not perform sufficiently well", but Figures 1/2A do now show quite accurate results. I am, however, not sure where they are actually coming from.

3) I am skeptical about your interpretation of the results of Figure S4. You state "We also observed that successive increases in the availability of data rapidly corrected the nowcast estimates as soon as March 28 (Figure S4 B and F)." The nowcast of March 29 (Figure S4-C) shows, however, a strong bias in the opposite direction (i.e., an underestimation of the epidemic curve). The results from data up until March 27 and up until March 29 are qualitatively very different (and the situation was even worse with the data that was actually available until March 27, with more missing DOS). Altogether, in two out of the four days of the new sensitivity analysis, the 95% PIs do not cover the retrospective "true" epidemic curve (based on information available at the third period) for several days close to "now". This raises the question whether the approach would really be helpful in a real-time assessment of the current situation (and is capable "to aid decision-making process" as you state in the discussion) during the first period in Madrid, and whether the uncertainty quantification of the method is adequate.
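The adequacy of the uncertainty quantification questioned here can be checked by computing the empirical coverage of the nowcast prediction intervals against the retrospectively observed counts. A minimal sketch (function and variable names are illustrative, not from the paper's code):

```python
import numpy as np

def interval_coverage(lower, upper, truth):
    """Fraction of days on which the retrospective 'true' case count
    falls inside the nominal [lower, upper] prediction interval.
    For well-calibrated 95% intervals this should be close to 0.95."""
    lower, upper, truth = map(np.asarray, (lower, upper, truth))
    return float(np.mean((truth >= lower) & (truth <= upper)))
```

Repeating this for nowcasts issued on every day of the study period, rather than at a few hand-picked dates, is the kind of systematic evaluation the reviewer requests.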
Based on these results, and despite the difficulties you have with the availability of the data, I would really recommend doing a proper quantitative evaluation of the performance of the nowcast model over a longer period of time to prove adequate performance. In addition, you should clarify the issues around Figs 1 and 2 in your manuscript and discuss the problems with the nowcasting approach during the first period in more detail.

Some additional comments with respect to the imputation of missing DOS: Based on the description of the imputation model in the manuscript, I still do not completely understand what the main model for imputation is. You write: "We assumed that missing DOS occur at random with respect to symptom onset date, and that reporting delays conditional on DOR can be modeled over time t and location i as a negative binomial distribution with mean parameter \\mu_{i,t} and dispersion parameter \\theta_i estimated using maximum likelihood [9]." Based on the code from GitHub, it looks like you utilized a model log(\\mu_{i,t}) = \\beta_{0,i} + \\beta_{1,i}*t, i.e., you assumed a linear time trend for modeling the (log-)expectation of the reporting delay. This seems to be a rather inflexible model, but it might be flexible enough for your data. However, I am not sure if you really used this model, as you later say with respect to the sensitivity analyses: "mean parameter mu_{i,t} and dispersion parameter theta_i,t [were] estimated from the distribution of observed delays pooled from t − tau to t (weaker assumption than the main approach with \\tau=7)". I am somewhat confused about what the final "main approach" is; please clarify this and specify the model using a clear description or formula!
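The model the reviewer infers from the code, a log-linear time trend for the mean backward delay with negative binomial variation, can be sketched as follows together with the parameter-resampling step discussed next. This is an illustrative reading of the reviewer's interpretation, not the authors' actual code; the coefficients and standard errors would come from the ML fit.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_imputed_delays(t, b0, b1, se_b0, theta, n_series=100, n_cases=50):
    """Sketch of the inferred imputation model: log(mu_t) = b0 + b1*t,
    delays ~ NegBinom(mean=mu_t, dispersion=theta).  Parameter uncertainty
    is propagated by redrawing b0 from a normal distribution centred at
    its estimate, with sd equal to its standard error."""
    draws = np.empty((n_series, n_cases), dtype=int)
    for k in range(n_series):
        b0_k = rng.normal(b0, se_b0)   # one parameter draw per imputed series
        mu = np.exp(b0_k + b1 * t)     # mean backward delay on day t
        p = theta / (theta + mu)       # numpy's (n, p) NB parameterization
        draws[k] = rng.negative_binomial(theta, p, size=n_cases)
    return draws
```

Each row is one imputed series of reporting delays; subtracting a sampled delay from the DOR yields an imputed DOS. Writing the model out this explicitly is essentially what the reviewer asks the authors to do.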
For incorporating the imputation uncertainty into the nowcast, you write that you "resampled 100 times to generate 100 time series of cases with complete DOS-DOR for each region i, allowing \\mu_{i,t} and \\theta_i to vary randomly under a normal distribution with the standard deviation \\sigma set to the sampling error." I am not sure what you mean by "sampling error". I assume that you sample from the negative binomial distribution with parameters coming from draws of the normal distribution with parameters \\hat{mu} and \\hat{SE} of the corresponding parameters from the imputation model? It would then probably be better to replace "sampling error" by "standard error" and also to specify the expectation of the normal distribution you are sampling from.

Additional general comments:

- What do the vertical lines in Figure 3BD indicate? They do not seem to correspond to the time points when R(t) crosses 1. Furthermore, you might consider plotting R(t) on a log scale, as this better describes the multiplicative interpretation of the reproductive number (and increases the readability of your figures with very high values from the raw case counts).

- In the discussion, you mention that "Our findings showed that a country-wide lockdown control led to a substantial decline in cases shortly thereafter, around March 14-20 in Madrid and around March 16-21 in Murcia." I think that such causal statements are not covered by the presented analyses, and it would be better to speak of a temporal association. I would also refrain from comparing the nowcasting results and the curve of case counts with respect to their consistency with the date of the country-wide lockdown in the Results section.

- The description of the first-order random walk in Section S1 seems to include a typo. The expectation of the normal distribution should probably be \\alpha_{t-1}.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?
The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code — e.g., participant privacy or use of data from a third party — those must be specified.

Reviewer #1: Yes

Reviewer #2: No: Due to data confidentiality reasons, only anonymized line-list data is provided, in addition to code that enables reproduction/mimicking of parts of the presented analyses.

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose "no", your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

5 Jul 2021

Submitted filename: Response_v2.docx

4 Aug 2021

Dear Dr. Martinez,

Thank you very much for submitting your manuscript "Near real-time surveillance of the SARS-CoV-2 epidemic with incomplete data" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. Please explicitly address all of reviewer 2's comments, and more fully justify the novelty of this approach. In addition, while you explore the limitations and potential problems of your analysis, you do not offer any solutions or suggestions on how to deal with such uncertainties in an actual applied analysis scenario; please address this as well. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments.
Your revised manuscript is also likely to be sent to reviewers for further evaluation. When you are ready to resubmit, please upload the following: [1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response that, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. [2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file). Important additional instructions are given below your reviewer comments. Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts. Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Benjamin Muir Althouse, Associate Editor, PLOS Computational Biology
Tom Britton, Deputy Editor, PLOS Computational Biology

***********************

Please explicitly address all of reviewer 2's comments, and more fully justify the novelty of this approach. In addition, while you explore the limitations and potential problems of your analysis, you do not offer any solutions or suggestions on how to deal with such uncertainties in an actual applied analysis scenario; please address this as well.
Reviewer's Responses to Questions

Comments to the Authors: Please note here if the review is uploaded as an attachment.

Reviewer #1: All my comments have been addressed.

Reviewer #2: Dear authors, I still have some major comments and remarks w.r.t. the current form of the manuscript that should be addressed:

Imputation model

Thanks for the clarifications with respect to the imputation model(s) in your last version of the manuscript. Based on these, I have some comments:

A) General description of the imputation model:

1. The formulation of the imputation model based on Eq 1-Eq 3 in S1 continues to be unclear; Eq 2 and Eq 3 seem to be erroneous. In a way, they imply that the parameters are supposed to follow an additional model (as, e.g., in a Bayesian hierarchical model), but this seems not to be the case based on the written text. Furthermore, it does not make sense that mu_{i,t} follows a normal distribution with expectation mu_{i,t} (i.e., itself), nor why lambda_{i,t} is supposed to follow a normal distribution with expectation mu_{i,t}. It also does not make sense that the variance parameter/standard deviation is shared between mu_{i,t} and theta_{i,t}.

2. From what I understand from the written text, the backward delay distribution for reporting day t, t=1,…,T, is just estimated by ML: you estimate the expectation and overdispersion parameter of a negative binomial distribution, where the data correspond to the (observed) delays of all individuals with reporting date t-1, t, and t+1. In this case, the parameters do not have any distribution but are assumed to be fixed, and only their estimates hat(mu) and hat(theta) have an approximate distribution/associated standard error based on ML theory.

3.
You then perform the imputation by sampling parameters of a negative binomial distribution based on the (marginal) approximate normal distribution of the estimated parameters hat(mu) and hat(theta), and sampling reporting delays for each individual from the corresponding distribution. Please provide a valid description of your mathematical models!

B) General comments w.r.t. the model (in real-time surveillance) as I understood it:

1. Estimating the reporting delay distribution for day t = 1, …, T based on the days t-1 and t+1 appears to be questionable for two reasons. First, in real-time surveillance the data for T+1 are not available (where T is the current date, i.e., "now"). Second, infectious disease reporting data usually follow strong weekly patterns, and the authors mention this also for the Spanish data. This translates also to the backward delay distribution (e.g., if Sundays have few reported cases, the (average) reporting delays should be bigger on Mondays). It might therefore be problematic to estimate the reporting delay distribution for, e.g., a Sunday by aggregating over the collected data from Saturday-Monday, or for a Tuesday by aggregating over Monday-Wednesday. In future work you might consider modelling the backward reporting delay distribution based on, e.g., a parametric statistical (regression) model where you can account for, e.g., changes over time and week-day effects. It is then also possible to perform a formal statistical model evaluation/selection. You could add this option to the discussion.

2. The different approaches to model the reporting delay distribution for (3 consecutive) days with fewer than 50 reported cases compared to days with more than 50 reported cases raise some questions (this seems to be mainly relevant for Murcia). You write that theta is set to 1 for such days, and that this implies that P(D=a)>P(D=b) for all a<b, e.g., P(D=0)>P(D=1).
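The two-step imputation procedure the reviewer reconstructs in comments A)2-3 (ML point estimates of a negative binomial delay distribution per reporting day, then propagation of estimation uncertainty by sampling parameters from the approximate normal distribution of the estimates) can be sketched as follows. This is an illustrative reading of that description, not the authors' published code; the function name and all numeric values are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_imputed_delays(mu_hat, mu_se, theta_hat, theta_se, n_cases, rng):
    """One imputation draw for cases with missing onset dates.

    Step 1: propagate estimation uncertainty by sampling (mu, theta)
    from the approximate normal distributions of the ML estimates.
    Step 2: draw individual reporting delays from the resulting
    negative binomial distribution.
    """
    mu = max(rng.normal(mu_hat, mu_se), 1e-6)          # keep mean positive
    theta = max(rng.normal(theta_hat, theta_se), 1e-6)  # keep dispersion positive
    # NumPy parameterisation: n = dispersion theta, p = theta / (theta + mu)
    p = theta / (theta + mu)
    return rng.negative_binomial(theta, p, size=n_cases)

# e.g. 12 cases reported on day t with missing dates of symptom onset:
delays = sample_imputed_delays(mu_hat=4.0, mu_se=0.5,
                               theta_hat=2.0, theta_se=0.4,
                               n_cases=12, rng=rng)
# imputed onset = reporting date minus the sampled delay, per case
```

Repeating this draw 100 times yields 100 imputed time series whose spread reflects the standard errors of hat(mu) and hat(theta), which appears to be what the manuscript's "100 resamples" refers to.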
This seems to be a questionable assumption, as reporting delay distributions usually have a peak of probability mass at a specific delay >0, as also seen in the data (Figure S1). Furthermore, I am irritated by the estimated delay distribution for Murcia for March 8 (Figure S1). The reported cases from March 7-March 9 seem to be <50 (Figure 1), but the estimated delay distribution seems to have the biggest probability mass at a delay of d=2. This seems to be inconsistent with the described imputation model. I do not think that this aspect plays a major role for the general results, but presented results should be consistent with the described modelling approach! Of note: a more general statistical model could also help with the problems arising on days with only few observations.

C) Reporting of the results:

- The caption in Figure S2 C,H seems to be wrong; the model is not more flexible than the revised main model.

D) Code:

- The code in the GitHub repository still does not correspond to the imputation model you are describing in the manuscript (as of July 27, 2021)!

Nowcasting:

- Thanks for showing the results of your nowcasting method for additional days; this helps to judge the performance of the model better, and the finding of a bias due to short-term changes in reporting activities on weekends is interesting and important to discuss.

- I agree with the conclusion that such biases complicate the interpretation of daily results, and I think it is good that this aspect is now being discussed. Your formulation "when there are large daily changes in the reporting case counts, such as that observed related to weekends, which might preclude public health action based on daily estimates" is, however, a bit vague and seems to be grammatically wrong.

- In addition, it might be worth mentioning that consistent patterns in the reporting activity can also be considered in more refined nowcasting models.
This can, e.g., be done by letting the probability of being reported on day t+d (d=1, …, D) for an individual with disease onset at day t vary based on the weekday of day t+d, and estimating those weekday effects within the nowcasting model. In the NobBS model this would correspond to modelling beta_{t,d} instead of only beta_d, e.g., based on a parametric model that accounts for weekday changes.

- I think you should mention in the discussion that a good performance of a nowcasting approach can only be achieved when the model for estimating the reporting delays is adequately specified to account for the actual reporting process in the region that is analysed.

- Minor comment: In Supp. Note S3 you write "As shown in Figure 1 in the main analysis": this should probably refer to Figure 2.

Estimation of R(t), new Figure 3:

- Please add informative headings to the sub-figures and a legend to the plots. Use unique or reasonably shared colours for the different estimates of R(t); it is confusing that there are two blue lines.

- Why does the Wallinga-Teunis-based estimate for Murcia based on DOR (?, blue line in the new Fig 3D) end at April 4? In addition, I find it surprising that the two estimates in 3D are so similar (until around March 25); do you have any explanation for this? Is this sub-figure and/or the caption correct, or is this the estimate based on all available DOS without imputation and nowcasting?

Discussion

- You mention the central limitation of the case reporting data, a potentially time-varying dark figure in reported cases, and write "However, Rt estimates remain unbiased if the proportion of incomplete observations remains time invariant [17,25], as it is likely the case in our analysis." It is, however, not clear why the proportion of underreporting is "likely time invariant". This should be either motivated more clearly or removed from the discussion. In Supp.
Note 4 you write that "a 50% lower ascertainment occurring during the initial period of analysis […] is a plausible change, especially for the first weeks of transmission of COVID-19". In addition, the wording "proportion of incomplete observations" appears to be unclear in the context of nowcasting. From my understanding, the issue is about the proportion of missing observations/cases (underreporting) and not incomplete observations (i.e., cases without disease onset information).

General comments

A) You should revise the entire manuscript and supplementary texts for clarity of wording, correct use of statistical terms, and accurate definition of technical terms and mathematical/statistical variables. Please ensure that the captions of the figures are correct and self-explanatory, that colour choices are consistent and comprehensible, and that each figure has a legend. Besides the aspects I mentioned already above (e.g., the mathematical model for reporting delays and various formulations), I found the following formulations in the current version of the main text to be unclear:

Methods, Step 1: Imputation of missing data:

- Despite my comment on the last version of the manuscript, you continue to use the term "sampling error" wrongly. You write: "We resampled 100 times to generate 100 time series of cases with complete DOS-DOR for each region, allowing mu_{i,t} and sigma_{i,t} to vary randomly under a normal distribution with mean set to their estimated values, and standard deviation set to the sampling error". The sampling error is usually defined as "the difference between a sample statistic used to estimate a population parameter and the actual but unknown value of the parameter". This error cannot be estimated based on a single dataset. I think you should replace this with the "standard error of the estimate".
- For the third imputation model in the sensitivity analysis, you write: "Third, we masked DOS in a random 40% of cases and then imputed the same DOS for all cases reported at any given t as the difference between the observed mean delay and the reported date". This is unclear, and the formulation "difference between the observed mean delay and the reported date" does not really make sense. You could describe this approach more clearly as an imputation by "subtracting the observed mean reporting delay from the reporting date for each case". Here you might also define the term "backshift" that you are using later in the manuscript.

Methods, Step 2: Nowcasting the epidemic curve

- Please define your mathematical notation! What does lower-case t stand for? In the previous paragraph on imputation, you used it for each day in the whole analysis period. For describing the problem of nowcasting, it should probably stand for the last day in the observation period; this is often referred to as capital T.

Results, Nowcasting the epidemic curve:

- You write "However, when there is a high proportion of missing data the nowcasting procedure is sensitive to large changes in reported cases". It is not clear what this means. Being sensitive to large changes is not necessarily something bad. I guess you mean that large and sudden changes in reported case numbers close to the end of the current observation period (e.g., due to an unmodelled change in reporting activity during weekends) can bias/distort the estimates of your nowcasting procedure.

- Directly in the following you write "the bias decreases as more data becomes available". Here it is not clear what bias you are referring to and when it decreases. After seeing more data from consecutive days?
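The "backshift" imputation the reviewer paraphrases above (subtracting the observed mean reporting delay from each case's reporting date) is simple to state in code. A minimal sketch, with the function name, dates, and delay value invented for illustration:

```python
from datetime import date, timedelta

def backshift(report_dates, mean_delay):
    """Assign every case an onset date equal to its reporting date
    minus the (rounded) observed mean reporting delay."""
    shift = timedelta(days=round(mean_delay))
    return [d - shift for d in report_dates]

# two cases reported on consecutive days, mean observed delay of 5.4 days
reports = [date(2020, 3, 20), date(2020, 3, 21)]
onsets = backshift(reports, mean_delay=5.4)
# onsets == [date(2020, 3, 15), date(2020, 3, 16)]
```

Unlike the probabilistic imputation, this shifts all cases by the same constant, so it preserves the shape of the reporting curve while ignoring individual-level delay variability.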
- In the last sentence you write: "Further, the occurrence of major changes in ascertainment rates relative to the true epidemic trends might bias the nowcast estimates". I am not exactly sure what you mean by "relative to the true epidemic trends".

Results, Estimation of R(t)

- "The precision of Rt increased over time when more information became available". Please be more precise about what you mean by more information. Is it more reported disease onset dates and less imputation, or just a bigger number of cases (higher incidence)? Or both?

- Please add a legend to Fig 3!

B) Please update the GitHub repository with the code that you are actually using in your analysis; currently it seems to be outdated, and your data availability statement is therefore not correct!

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g., participant privacy or use of data from a third party), those must be specified.

Reviewer #1: Yes

Reviewer #2: No: Available code does not match described methods and addresses only parts of the analysis.

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose "no", your identity will remain anonymous but your review may still be made public.
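For readers following the R(t) discussion in this review: once an epidemic curve by date of symptom onset has been imputed and nowcast, R(t) can be estimated from it. A crude sliding-window renewal-equation estimator is sketched below as an illustration only; it is not the Wallinga-Teunis method or the authors' implementation, and the serial-interval weights and window length are arbitrary choices for the example.

```python
import numpy as np

def rt_renewal(incidence, si_pmf, window=7):
    """Crude sliding-window renewal-equation estimate of R(t):
    Rt ~ cases in the window / total infection pressure in the window,
    where the pressure on day t is lambda_t = sum_s I_{t-s} * w_s."""
    I = np.asarray(incidence, dtype=float)
    w = np.asarray(si_pmf, dtype=float)
    w = w / w.sum()  # normalise serial-interval weights to a pmf
    # infection pressure lambda_t for each day t (w[0] is the weight for lag 1)
    lam = np.array([np.sum(I[max(0, t - len(w)):t][::-1] * w[:min(t, len(w))])
                    for t in range(len(I))])
    rt = np.full(len(I), np.nan)
    for t in range(window, len(I)):
        denom = lam[t - window + 1:t + 1].sum()
        if denom > 0:
            rt[t] = I[t - window + 1:t + 1].sum() / denom
    return rt

# sanity check: flat incidence should give Rt close to 1 once the window fills
rt = rt_renewal([10] * 30, si_pmf=[0.2, 0.3, 0.3, 0.2])
```

The window averages out day-to-day reporting noise, which is exactly why the reviewer's point about weekend artefacts in the nowcast matters: any residual bias at the end of the curve propagates directly into the most recent R(t) values.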
Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us.

3 Jan 2022 Submitted filename: Response_To_reviewer.pdf

24 Feb 2022

Dear Dr.
Martinez, We are pleased to inform you that your manuscript 'Near real-time surveillance of the SARS-CoV-2 epidemic with incomplete data' has been provisionally accepted for publication in PLOS Computational Biology. Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated. IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript. Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS. Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. Best regards, Benjamin Althouse Associate Editor PLOS Computational Biology Tom Britton Deputy Editor PLOS Computational Biology *********************************************************** Please make sure that the symbols appear as they should -- the version I have has boxes in place of many symbols. 28 Mar 2022 PCOMPBIOL-D-20-02097R3 Near real-time surveillance of the SARS-CoV-2 epidemic with incomplete data Dear Dr De Salazar, I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. 
Your manuscript is now with our production department and you will be notified of the publication date in due course. The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript. Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers. Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work! With kind regards, Zsofia Freund PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

1.  Estimating in real time the efficacy of measures to control emerging communicable diseases.

Authors:  Simon Cauchemez; Pierre-Yves Boëlle; Guy Thomas; Alain-Jacques Valleron
Journal:  Am J Epidemiol       Date:  2006-08-03       Impact factor: 4.897

2.  Improving the evidence base for decision making during a pandemic: the example of 2009 influenza A/H1N1.

Authors:  Marc Lipsitch; Lyn Finelli; Richard T Heffernan; Gabriel M Leung; Stephen C Redd
Journal:  Biosecur Bioterror       Date:  2011-06

3.  Epidemiology of measles in Taiwan: dynamics of transmission and timeliness of reporting during an epidemic in 1988-9.

Authors:  M S Lee; C C King; C J Chen; S Y Yang; M S Ho
Journal:  Epidemiol Infect       Date:  1995-04       Impact factor: 2.451

Review 4.  Extending backcalculation to analyse BSE data.

Authors:  C A Donnelly; N M Ferguson; A C Ghani; R M Anderson
Journal:  Stat Methods Med Res       Date:  2003-06       Impact factor: 3.021

5.  Reporting errors in infectious disease outbreaks, with an application to Pandemic Influenza A/H1N1.

Authors:  Laura F White; Marcello Pagano
Journal:  Epidemiol Perspect Innov       Date:  2010-12-15

6.  Quantifying reporting timeliness to improve outbreak control.

Authors:  Axel Bonačić Marinović; Corien Swaan; Jim van Steenbergen; Mirjam Kretzschmar
Journal:  Emerg Infect Dis       Date:  2015-02       Impact factor: 6.883

7.  Early Transmission Dynamics in Wuhan, China, of Novel Coronavirus-Infected Pneumonia.

Authors:  Qun Li; Xuhua Guan; Peng Wu; Xiaoye Wang; Lei Zhou; Yeqing Tong; Ruiqi Ren; Kathy S M Leung; Eric H Y Lau; Jessica Y Wong; Xuesen Xing; Nijuan Xiang; Yang Wu; Chao Li; Qi Chen; Dan Li; Tian Liu; Jing Zhao; Man Liu; Wenxiao Tu; Chuding Chen; Lianmei Jin; Rui Yang; Qi Wang; Suhua Zhou; Rui Wang; Hui Liu; Yinbo Luo; Yuan Liu; Ge Shao; Huan Li; Zhongfa Tao; Yang Yang; Zhiqiang Deng; Boxi Liu; Zhitao Ma; Yanping Zhang; Guoqing Shi; Tommy T Y Lam; Joseph T Wu; George F Gao; Benjamin J Cowling; Bo Yang; Gabriel M Leung; Zijian Feng
Journal:  N Engl J Med       Date:  2020-01-29       Impact factor: 176.079

8.  Nowcasting by Bayesian Smoothing: A flexible, generalizable model for real-time epidemic tracking.

Authors:  Sarah F McGough; Michael A Johansson; Marc Lipsitch; Nicolas A Menzies
Journal:  PLoS Comput Biol       Date:  2020-04-06       Impact factor: 4.475

9.  Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV-2).

Authors:  Ruiyun Li; Sen Pei; Bin Chen; Yimeng Song; Tao Zhang; Wan Yang; Jeffrey Shaman
Journal:  Science       Date:  2020-03-16       Impact factor: 47.728

10.  Nowcasting the COVID-19 pandemic in Bavaria.

Authors:  Felix Günther; Andreas Bender; Katharina Katz; Helmut Küchenhoff; Michael Höhle
Journal:  Biom J       Date:  2020-12-01       Impact factor: 1.715


Review 1.  [Lessons learnt from COVID-19 surveillance. Urgent need for a new public health surveillance. SESPAS Report 2022].

Authors:  María José Sierra Moros; Elena Vanessa Martínez Sánchez; Susana Monge Corella; Lucía García San Miguel; Berta Suárez Rodríguez; Fernando Simón Soria
Journal:  Gac Sanit       Date:  2022       Impact factor: 2.479

2.  Nowcasting COVID-19 deaths in England by age and region.

Authors:  Shaun R Seaman; Pantelis Samartsidis; Meaghan Kall; Daniela De Angelis
Journal:  J R Stat Soc Ser C Appl Stat       Date:  2022-06-15       Impact factor: 1.680

3.  Estimation of heterogeneous instantaneous reproduction numbers with application to characterize SARS-CoV-2 transmission in Massachusetts counties.

Authors:  Zhenwei Zhou; Eric D Kolaczyk; Robin N Thompson; Laura F White
Journal:  PLoS Comput Biol       Date:  2022-09-01       Impact factor: 4.779

4.  Tracking changes in SARS-CoV-2 transmission with a novel outpatient sentinel surveillance system in Chicago, USA.

Authors:  Reese Richardson; Emile Jorgensen; Philip Arevalo; Tobias M Holden; Katelyn M Gostic; Massimo Pacilli; Isaac Ghinai; Shannon Lightner; Sarah Cobey; Jaline Gerardin
Journal:  Nat Commun       Date:  2022-09-22       Impact factor: 17.694

