
Measuring the unknown: An estimator and simulation study for assessing case reporting during epidemics.

Christopher I Jarvis1,2, Amy Gimma1,2, Flavio Finger1,2,3, Tim P Morris4, Jennifer A Thompson1, Olivier le Polain de Waroux2, W John Edmunds1,2, Sebastian Funk1,2, Thibaut Jombart1,2,5,6.   

Abstract

The fraction of cases reported, known as 'reporting', is a key performance indicator in an outbreak response, and an essential factor to consider when modelling epidemics and assessing their impact on populations. Unfortunately, its estimation is inherently difficult, as it relates to the part of an epidemic which is, by definition, not observed. We introduce a simple statistical method for estimating reporting, initially developed for the response to Ebola in Eastern Democratic Republic of the Congo (DRC), 2018-2020. This approach uses transmission chain data typically gathered through case investigation and contact tracing, and uses the proportion of investigated cases with a known, reported infector as a proxy for reporting. Using simulated epidemics, we study how this method performs for different outbreak sizes and reporting levels. Results suggest that our method has low bias, reasonable precision, and despite sub-optimal coverage, usually provides estimates within close range (5-10%) of the true value. Being fast and simple, this method could be useful for estimating reporting in real-time in settings where person-to-person transmission is the main driver of the epidemic, and where case investigation is routinely performed as part of surveillance and contact tracing activities.


Year:  2022        PMID: 35604952      PMCID: PMC9166360          DOI: 10.1371/journal.pcbi.1008800

Source DB:  PubMed          Journal:  PLoS Comput Biol        ISSN: 1553-734X            Impact factor:   4.779


This is a PLOS Computational Biology Methods paper.

Introduction

The response to infectious disease outbreaks increasingly relies on the analysis of various data sources to inform operations in real time [1,2]. Outbreak analytics can be used to characterise key factors driving epidemics, such as transmissibility, severity, or important delays like the incubation period or the serial interval [2]. Amongst these factors, the number of infections remaining undetected in the affected populations is a crucial indicator for assessing the state of an epidemic, and yet this quantity is often hard to estimate in real time [3-6]. Indeed, estimation of the overall proportion of individuals infected (attack rates) typically requires time-consuming serological surveys [7-9], which may not be achievable in resource-limited, large-scale emergencies such as the 2014–2016 Ebola virus disease (EVD) outbreak in West Africa [10], or the more recent EVD outbreak in the Eastern provinces of the Democratic Republic of the Congo (DRC) [11,12].

As an alternative, one may attempt to quantify reporting, i.e. the proportion of all infections which result in notified cases. Unfortunately, this quantity is also hard to estimate, and usually requires the analysis of epidemiological and genomic data through complex methods for reconstructing transmission trees [13-15] or transmission clusters [16]. Such requirements can mean that by the time estimates are available, decisions have already been made, or the outbreak situation has changed [17-19]. Therefore, simpler approaches are needed for estimating reporting and helping inform outbreak response operations.

Methods for estimating reporting during an outbreak should ideally exploit data which are routinely collected as part of the outbreak response. In diseases whose dynamics are mostly governed by person-to-person transmission, case investigation and contact tracing can be powerful tools for understanding past transmission events as well as detecting new cases as early as possible [11,20-23]. For vaccine-preventable diseases, contact tracing can also be used for designing ring vaccination strategies, as seen in recent EVD outbreaks in the DRC [11,20]. These data also contain information about reporting. Intuitively, the frequency of cases whose infector is a known and reported case is indicative of the level of reporting: the more frequently case investigation identifies a known infector, the higher the corresponding case reporting should be. Conversely, cases with no known epidemiological link after investigation are indicative of unobserved infections, and therefore of under-reporting.

In this article, we introduce a method to estimate case reporting from contact tracing data. This approach, designed during the Ebola outbreak in Eastern DRC [11,12], was originally aimed at assessing case reporting in a context where insecurity made surveillance difficult, and under-reporting likely [12]. The approach utilised transmission chain data, calculating the proportion of cases with a known epidemiological link as a proxy for reporting. We provide a derivation of the estimator, explain the rationale of the approach, and assess its performance using simulated outbreaks of different sizes with varying levels of reporting. Based on the simulation results, we make some suggestions regarding the use of this method to inform strategic decision making during an outbreak response.

Methods

We present the analytical derivation of our method for estimating reporting, defined as the proportion of cases actually notified during an outbreak. We then describe the simulation study, structured using the ADEMP (Aim, Data-generating mechanism, Estimand, Methods, Performance measures) framework described by Morris et al 2019 [24,25], used to evaluate the performance of the method under various conditions.

Estimating reporting from epidemiological links

Our method exploits transmission chains derived from case investigation and contact tracing data. The data considered are secondary cases for which epidemiological investigation was successfully carried out, and for which a single likely infector could be clearly identified. We thus distinguish i) cases for which the identified infector is listed amongst reported cases (cases with a known infector) and ii) cases for which the identified infector is not listed amongst the reported cases (cases without a known infector). Importantly, cases without any known exposure, or cases for which multiple epidemiological links make it hard to disentangle a single likely infector, are excluded from the analysis. The rationale for the approach is to consider the proportion of cases with a known infector as a proxy for the proportion of infections (including asymptomatic but infectious individuals) reported. The proportion of cases with a known infector is by definition the proportion of infectors who were reported (Fig 1), so that the reporting probability π can be estimated as

π̂ = n_k / (n_k + n_u)

where n_k is the number of secondary cases (infectees) with a known infector and n_u is the number of secondary cases without a known infector.
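As a minimal, hypothetical sketch (not the authors' code), the point estimate is just this proportion of investigated secondary cases:

```python
def estimate_reporting(n_known: int, n_unknown: int) -> float:
    """Estimate the reporting probability as the proportion of
    investigated secondary cases whose identified infector appears
    amongst reported cases."""
    total = n_known + n_unknown
    if total == 0:
        raise ValueError("no investigated secondary cases")
    return n_known / total

# e.g. 60 secondary cases with a known infector, 40 without
print(estimate_reporting(60, 40))  # → 0.6
```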
Fig 1

Rationale of the method for estimating reporting.

This diagram illustrates transmission events inferred by case investigation of reported secondary cases, with arrows pointing from infectors to infectees. Darker shades are used to indicate documented transmission events, while lighter shades show unknown infectors. Numbers of secondary cases with (blue) or without (orange) known infectors are used to estimate the reporting probability. This example uses an approximate reporting of 50%.


Derivation of estimator for reporting

We define:

m_r — number of reported infectors
m_u — number of unreported infectors
n_k — number of secondary cases (infectees) with a known infector
n_u — number of secondary cases without a known infector
R — reproduction number, i.e. the average number of secondary cases per case; we assume reported and unreported infectors have the same distribution of R
π — reporting probability, following some unspecified probability distribution with unknown probability parameter, such that

E[m_r] = π (m_r + m_u)

where secondary cases are assumed to follow the same reporting distribution as primary infections. The expected number of reported infectees with a known infector is

E[n_k] = π R m_r

Similarly, the expected number of reported infectees without a known infector is

E[n_u] = π R m_u

From this we have that

E[n_k] / (E[n_k] + E[n_u]) = π R m_r / (π R m_r + π R m_u) = m_r / (m_r + m_u)

By definition

π = m_r / (m_r + m_u)

Therefore

π = E[n_k] / (E[n_k] + E[n_u])

and, replacing the expectations with their estimates from the data, we get the estimator

π̂ = n_k / (n_k + n_u)
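The key step of the derivation, that the expected share of reported secondary cases with a known infector equals π, can be checked numerically. The sketch below is illustrative only; the parameter values (π = 0.5, mean R = 1.7) and the simple Poisson offspring model are assumptions for the check, not the paper's simulation setup:

```python
import numpy as np

rng = np.random.default_rng(42)

def known_infector_share(pi=0.5, mean_r=1.7, n_infectors=200_000):
    """Among reported secondary cases, compute the share whose infector
    was also reported; under the derivation this should converge to pi."""
    # each infector is reported with probability pi
    infector_reported = rng.random(n_infectors) < pi
    # offspring counts; reported and unreported infectors share the
    # same offspring distribution, as assumed in the derivation
    offspring = rng.poisson(mean_r, n_infectors)
    # each secondary case is itself reported (hence investigated)
    # independently with probability pi
    reported_offspring = rng.binomial(offspring, pi)
    n_known = reported_offspring[infector_reported].sum()
    n_unknown = reported_offspring[~infector_reported].sum()
    return n_known / (n_known + n_unknown)

print(known_infector_share())  # ≈ 0.5
```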

Uncertainty for reporting

The uncertainty associated with this estimation can be quantified using various methods for computing confidence intervals of proportions. Using the standard approach for the standard error of a proportion, we have

SE(π̂) = √( π̂ (1 − π̂) / n )

where n = n_k + n_u is the total number of secondary cases. Here, we used exact (Clopper–Pearson) binomial confidence intervals, which can be calculated as:

π_low = n_k / ( n_k + (n − n_k + 1) F(1 − α/2; 2(n − n_k + 1), 2 n_k) )
π_upp = (n_k + 1) F(1 − α/2; 2(n_k + 1), 2(n − n_k)) / ( (n − n_k) + (n_k + 1) F(1 − α/2; 2(n_k + 1), 2(n − n_k)) )

where F(c; d1, d2) is the c quantile from an F-distribution with d1 and d2 degrees of freedom, and 1 − α is the confidence level.
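For illustration, the exact interval can be computed through the equivalent Beta-quantile form of the Clopper–Pearson interval; this is a hypothetical sketch, not the code used in the study:

```python
from scipy.stats import beta

def exact_binomial_ci(n_known: int, n_total: int, alpha: float = 0.05):
    """Clopper-Pearson (exact binomial) confidence interval for the
    reporting proportion, using Beta quantiles, which are equivalent
    to the F-distribution formulation."""
    if n_known == 0:
        lower = 0.0
    else:
        lower = beta.ppf(alpha / 2, n_known, n_total - n_known + 1)
    if n_known == n_total:
        upper = 1.0
    else:
        upper = beta.ppf(1 - alpha / 2, n_known + 1, n_total - n_known)
    return lower, upper

# 60 of 100 secondary cases with a known infector
low, upp = exact_binomial_ci(60, 100)
print(round(low, 3), round(upp, 3))  # interval around the estimate 0.6
```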

Simulation study

Aim

We aim to test the performance of the method for different outbreak sizes and actual reporting, in terms of bias, coverage, and precision (in an operational context) using simulated outbreaks.

Data generating mechanism

We considered twelve data-generating mechanisms (three reporting levels by four reported outbreak sizes) and performed 4000 repetitions per mechanism. Each repetition corresponded to a hypothetical outbreak with a known transmission tree. To simulate the reporting process, cases were removed at random from the transmission chains using a Binomial process with probability (1 − reporting). We thus distinguish the total outbreak size, which represents all cases in the outbreak, from the reported outbreak size, which represents the number of cases reported. For simplicity, we assumed that all reported cases were investigated, so that it is known whether or not they had a documented epidemiological link amongst reported cases. For each outbreak (repetition) we removed observations so that reporting was 25%, 50%, or 75%; a single simulated outbreak therefore gives three different observed outbreaks. We categorised the simulations into reported outbreak sizes of 10–99, 100–499, 500–999, and 1000+.
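The reporting process amounts to binomially thinning a complete transmission tree. The snippet below is a toy sketch (the tree and function name are hypothetical; the study itself used the R package simulacr):

```python
import random

random.seed(0)

# toy transmission tree: case id -> infector id (None = index case)
tree = {1: None, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 5}

def observe(tree, reporting=0.5):
    """Binomially thin the tree: keep each case with probability
    `reporting`, then count reported secondary cases whose infector is
    also reported (known infector) versus not (unknown infector)."""
    reported = {c for c in tree if random.random() < reporting}
    n_known = sum(1 for c in reported
                  if tree[c] is not None and tree[c] in reported)
    n_unknown = sum(1 for c in reported
                    if tree[c] is not None and tree[c] not in reported)
    return n_known, n_unknown

# with full reporting, every secondary case has a known infector
print(observe(tree, reporting=1.0))  # → (6, 0)
```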

Outbreak simulation

We used the R package simulacr [26] to simulate outbreaks, the reporting process, and the subsequently observed transmission chains. simulacr implements and extends individual-based simulations of epidemics previously used to evaluate transmission tree reconstruction methods [13,14,27]. In its basic form, simulacr implements a Poisson branching process in which the reproduction number (R) is combined with the infectious period to determine individual rates of infection. Here, to account for potential heterogeneity in transmission, we drew individual values of R from a Gamma distribution fitted to empirical data from the North Kivu EVD epidemic (rate: 1.2; shape: 2; corresponding mean: 1.7). The resulting branching process, being a combination of Poisson processes with Gamma-distributed rates, is therefore a Negative Binomial branching process. The infectiousness of a given individual i at time t, noted λ_{i,t}, is calculated as

λ_{i,t} = R_i w(t − s_i)

where R_i is the reproduction number for individual i, s_i is their date of symptom onset, and w is the probability mass function of the duration of infectiousness (time interval between onset of symptoms and new secondary infections). New cases generated at time t + 1 are drawn from a Poisson distribution with a rate Λ_t summing the infectiousness of all cases:

Λ_t = (n_s / n) Σ_i λ_{i,t}

where n_s is the number of susceptible individuals and n the total population size, so that the branching process includes a density dependence in which rates of infection decline as the proportion of susceptibles decreases. Transmission trees are built by assigning infectors to newly infected individuals according to a multinomial distribution in which potential infectors have a probability λ_{i,t} / Σ_j λ_{j,t} of being drawn. The dates of symptom onset and case notification are generated for each new case using user-provided distributions for the incubation time and reporting delays. Simulations run until the set duration of the simulation (here, 365 days) is reached.
Here, we used parameter values and distributions in line with estimates from the Eastern DRC Ebola outbreak [12,28]; the details are provided in Table 1. All code used for running these simulations is available from https://github.com/jarvisc1/2020-reporting.
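This branching process can be illustrated as follows. The code is a simplified sketch under stated assumptions (a toy three-day infectiousness pmf standing in for the fitted distributions, no incubation or reporting delays), not the simulacr implementation:

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_outbreak(pop_size=1000, r_rate=1.2, r_shape=2.0,
                      max_days=365, w=(0.2, 0.5, 0.3)):
    """Poisson branching process with Gamma-distributed individual R
    (a Negative Binomial process overall) and density dependence.
    `w` is an assumed toy pmf of infectiousness over days since onset.
    Returns a transmission tree {case_id: infector_id}."""
    tree, onset = {0: None}, {0: 0}            # index case, onset day 0
    R = {0: rng.gamma(r_shape, 1 / r_rate)}    # individual R values
    for t in range(max_days):
        n_susc = pop_size - len(tree)
        if n_susc <= 0:
            break
        # individual infectiousness lambda_{i,t} = R_i * w(t - s_i)
        lam = {i: R[i] * w[t - s] for i, s in onset.items()
               if 0 <= t - s < len(w)}
        total = sum(lam.values())
        if total == 0:                         # outbreak has died out
            break
        # density-dependent force of infection Lambda_t
        n_new = min(int(rng.poisson(total * n_susc / pop_size)), n_susc)
        infectors = list(lam.keys())
        probs = np.array(list(lam.values())) / total
        for _ in range(n_new):                 # multinomial infector draw
            new_id = len(tree)
            tree[new_id] = infectors[rng.choice(len(infectors), p=probs)]
            onset[new_id] = t + 1
            R[new_id] = rng.gamma(r_shape, 1 / r_rate)
    return tree

outbreak = simulate_outbreak()
print(len(outbreak), "cases")
```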
Table 1

Parameters used for simulating outbreaks.

This table details input parameters used for simulating outbreaks using the R package simulacr. Fixed values were used for all simulations, and reflect the natural history of the 2018–2020 Eastern DRC Ebola outbreak. Variable values changed across simulations.

Fixed values:
- Maximum duration of the outbreak: 365 days
- Incubation time distribution: discretised Gamma distribution, mean = 9.7 days, sd = 5.5 days
- Infectious period distribution: discretised Gamma distribution, mean = 5 days, sd = 4.7 days
- Reproduction number distribution: Gamma distribution, rate = 1.2, shape = 2

Variable values:
- Population size*: 200, 500, 1000, 2000, 5000, 7500, 10000, 15000, 20000
- Outbreak size*: 10–99, 100–499, 500–999, 1000+
- Proportion of cases not reported: 0.25, 0.50, 0.75

*Population size is controlled in each simulation; outbreak sizes are determined after the outbreaks have been simulated and the proportion of cases not reported has been removed.


Estimand: Reporting

We considered a single estimand, π, the level of reporting.

Method

For each repetition we calculated the proportion of cases with a known infector over the total number of reported secondary cases, that is, the estimator π̂ = n_k / (n_k + n_u). We further calculated the standard error and the 95% exact binomial confidence interval.

Performance measures

The performance of the method was measured using bias, coverage, and precision. For bias and coverage, the Monte-Carlo standard errors were calculated to quantify uncertainty about the estimates of the performance [29]. The equations used are detailed in Table 2 and were taken from Morris et al [24]. In addition, results were classified according to different ranges of absolute error, for a more operational interpretation of the results.
Table 2

Metrics used to measure performance in the simulation study.

Performance measure: Definition

Bias: δ = E[θ̂] − θ, where θ is the true value and θ̂ is the estimate.
Coverage: the probability P(θ̂_low ≤ θ ≤ θ̂_upp) that a confidence interval (θ̂_low, θ̂_upp) contains the true value; for a 95% CI, the nominal coverage is P(θ̂_low ≤ θ ≤ θ̂_upp) = 0.95.
Precision:
  Model-based standard error: √( E[ Vâr(θ̂) ] )
  Empirical standard error: √( Var(θ̂) )
  Absolute error: |θ̂_i − θ|
Bias is the difference between the expected value and the true value. It was measured by taking the difference between the average estimate of reporting and the true reporting. Unbiasedness is a desirable statistical quality, but a small amount of bias may be tolerated in exchange for other desirable qualities of an estimator. The estimates of reporting were presented visually by displaying the estimates from all 4000 simulations for each scenario.

Coverage is the percentage of CIs containing the true value; a 95% CI should contain the true value 95% of the time. We counted the number of repetitions where the true value was contained in the 95% CI and divided by the total number of repetitions. Coverage was visualised through zip plots, a visualisation introduced by Morris et al [24] that helps assess the coverage of a method by viewing the CIs directly. Assessing an expected 95% coverage with a Monte-Carlo standard error of 0.35 requires 3877 simulations [24], which is well within our 4000 simulations.

Precision represents how close the estimates are to each other. The model-based and empirical standard errors were calculated to provide an indication of precision: the model-based standard error is the root of the mean estimated variance, and the empirical standard error represents the spread of the estimates. This gives an indication of how much the point estimates vary across simulations depending on the level of reporting and sample size. Although a method may give unbiased estimates with good coverage under repeated sampling, an imprecise method could lead to large differences from the true value when applied to a single dataset (that is, confidence intervals may cover the true value honestly but be wide).

We further explored the joint impact of bias and precision of the estimator by considering the deviations of the estimates from the true value, termed absolute error. The absolute error is defined as the absolute difference between the estimated reporting and its true value, expressed in percentage points. For instance, estimates of 43% and 62% for a true reporting of 50% would correspond to absolute errors of 7% and 12%, respectively. During a disease outbreak, decisions are frequently made in the face of large uncertainties, and small absolute differences in the estimated level of reporting are unlikely to result in strategic changes. Therefore, as a perhaps more operationally relevant metric, we categorised results according to how far estimates were from the true value, using an arbitrary scale: very close (≤5% absolute error), close (≤10%), approximate (≤15%) or inaccurate (≤20%).
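These performance measures can be computed from the collection of simulation estimates as in the sketch below (a hypothetical helper, not the study code):

```python
import numpy as np

def performance(estimates, ci_low, ci_upp, truth):
    """Compute bias (with its Monte-Carlo SE), coverage, empirical SE,
    and absolute-error bands for a set of simulation estimates."""
    est = np.asarray(estimates, dtype=float)
    bias = est.mean() - truth
    bias_mcse = est.std(ddof=1) / np.sqrt(len(est))  # MCSE of the bias
    covered = (np.asarray(ci_low) <= truth) & (truth <= np.asarray(ci_upp))
    abs_err = np.abs(est - truth)
    return {
        "bias": bias,
        "bias_mcse": bias_mcse,
        "coverage": covered.mean(),
        "empirical_se": est.std(ddof=1),
        "abs_error_bands": {p: float((abs_err <= p / 100).mean())
                            for p in (5, 10, 15, 20)},
    }

res = performance([0.50, 0.50, 0.60, 0.40],
                  [0.40, 0.40, 0.55, 0.30],
                  [0.60, 0.60, 0.70, 0.45],
                  truth=0.50)
print(res["coverage"])  # → 0.5 (two of four intervals contain 0.5)
```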

Sensitivity to R values

In order to explore the sensitivity of the method to the distribution of R, we repeated the simulations and analyses using different distributions of the reproduction number, using Gamma(rate = 0.95, shape = 2) and Gamma(rate = 1.475, shape = 2), resulting in average R values of 2.1 and 1.3, respectively, broadly in line with values reported in the literature for other EVD outbreaks [30].

Results

Bias

There was very little bias across all the simulated scenarios (Table 3 and Fig 2). For outbreaks with over 100 reported cases, all estimates of bias were 0, with the Monte-Carlo error decreasing from 0.04 to 0.01 as the size of the reported outbreak increased. For outbreaks with fewer than 100 reported cases, the bias was -0.01 for reporting of 0.50 and 0.75, and 0 for 0.25, with a Monte-Carlo error of 0.07. Table 3 presents the bias for each scenario; all of these estimates were within one standard error of zero, suggesting reasonable confidence that this is an overall unbiased estimator.
Table 3

Performance measures from 4000 simulations by reported outbreak size and true reporting level.

Estimate (Monte-Carlo standard error).

Performance measures (MCSE), by proportion reported, for reported outbreak sizes 10–99 / 100–499 / 500–999 / 1000 or more:

Bias
  0.25: 0 (0.07) / 0 (0.03) / 0 (0.02) / 0 (0.01)
  0.50: -0.01 (0.07) / 0 (0.04) / 0 (0.02) / 0 (0.01)
  0.75: -0.01 (0.07) / 0 (0.04) / 0 (0.02) / 0 (0.01)

Coverage
  0.25: 95.7% (0.3) / 94.1% (0.4) / 94.4% (0.4) / 93% (0.4)
  0.50: 92.6% (0.4) / 92.4% (0.4) / 91.3% (0.4) / 91.2% (0.4)
  0.75: 92.3% (0.4) / 91.5% (0.4) / 89.2% (0.5) / 88.6% (0.5)

Model standard error
  0.25: 0.065 (0) / 0.024 (0) / 0.015 (0) / 0.01 (0)
  0.50: 0.061 (0) / 0.038 (0) / 0.019 (0) / 0.011 (0)
  0.75: 0.059 (0.001) / 0.036 (0) / 0.014 (0) / 0.011 (0)

Empirical standard error
  0.25: 0.071 (0.001) / 0.025 (0) / 0.016 (0) / 0.01 (0)
  0.50: 0.07 (0.001) / 0.044 (0) / 0.022 (0) / 0.012 (0)
  0.75: 0.068 (0.001) / 0.043 (0) / 0.017 (0) / 0.013 (0)
Fig 2

Comparison of estimated versus actual reporting.

This graph shows the results of reporting estimated by the method for 4000 simulated outbreaks, broken down by outbreak size category (y-axis). Each dot corresponds to an independent simulation. The vertical red bars indicate the average within each category. True reporting used in the simulations is indicated by colors.



Coverage

Coverage varied across the simulated scenarios, with all but one scenario (reported outbreak size 10–99 with reporting at 0.25) displaying under-coverage (Fig 3). Coverage was poor overall, with all coverage estimates more than one standard error away from 95%, and most several standard errors away (Table 3). There was some suggestion of the counterintuitive pattern that coverage decreased both as reporting increased and as the outbreak size increased.
Fig 3

Zip plot showing coverage results.

This graph shows the 95% confidence intervals estimated by the method, broken down by reported outbreak size category and true reporting value. The vertical axis represents the fractional centile of |Z|, where Z = (π̂ − π) / SE(π̂) and π is reporting. The confidence intervals are ranked by their level of coverage, so the vertical axis can be used to determine the proportion of confidence intervals that contain the true value, where 0.95 would represent a coverage of 95%.


Precision

The model-based standard error was below 0.07 for all estimates and below 0.04 for reported outbreaks of over 100 cases. Similar patterns were seen for the empirical standard error. Imprecision was most marked when reported outbreaks had fewer than 100 cases and 0.75 reporting. Precision increased (the model-based and empirical standard errors decreased) as the reported outbreak size increased (Fig 2 and Table 3). Overall, precision appears reasonable when reported outbreaks are larger than 100 cases.

Absolute error

Results showed that the estimates were rarely more than 15% away from the true reporting value in all simulation settings (Fig 4 and Table 4). The absolute error was negligible in all larger reported outbreaks (500 cases and above), with nearly all estimates very close (within 5%) to the true reporting value. Performance decreased in smaller outbreaks, but most estimates remained close (within 10%) to the true value. Results were worse in smaller outbreaks (10–99 reported cases), but even there about half of the estimates were very close (within 5%) to the true value, and more than 80% of estimates were within 10% of the target.
Fig 4

Absolute error in reporting estimation.

This graph shows, for different simulation settings, the proportion of results within a given margin of absolute error, expressed as the absolute difference between the true and the estimated reporting (in %). Rows correspond to different outbreak size categories (outbreak size as reported). True reporting is indicated in color.

Table 4

Comparison of absolute error from 4000 simulations between true reporting levels and estimates of reporting, by reported outbreak size and true reporting level.

Number (percentage) of simulations within each absolute-error band (≤5% / ≤10% / ≤15% / ≤20% from the true value), by proportion reported and reported outbreak size:

Proportion reported 0.25
  10–99: 2213 (55.3%) / 3376 (84.4%) / 3849 (96.2%) / 3973 (99.3%)
  100–499: 3817 (95.4%) / 4000 / 4000 / 4000
  500–999: 3995 (99.9%) / 4000 / 4000 / 4000
  1000+: 3999 (100%) / 4000 / 4000 / 4000

Proportion reported 0.5
  10–99: 2110 (52.8%) / 3430 (85.8%) / 3860 (96.5%) / 3978 (99.4%)
  100–499: 2981 (74.5%) / 3899 (97.5%) / 3998 (100%) / 4000
  500–999: 3905 (97.6%) / 4000 / 4000 / 4000
  1000+: 4000 / 4000 / 4000 / 4000

Proportion reported 0.75
  10–99: 2400 (60%) / 3575 (89.4%) / 3835 (95.9%) / 3942 (98.6%)
  100–499: 3067 (76.7%) / 3890 (97.2%) / 3991 (99.8%) / 4000
  500–999: 3988 (99.7%) / 4000 / 4000 / 4000
  1000+: 3992 (99.8%) / 4000 / 4000 / 4000

Repeating the analyses with different R distributions, in line with reproduction numbers reported in other outbreaks [30], showed a negligible impact of R on bias, coverage, precision, and absolute error (Tables A and B in S1 Text). Coverage was the most sensitive to the change in R, decreasing slightly with higher mean R values. Overall, though, these variations were negligible compared to the variation of coverage with epidemic size.

Discussion

We have presented a new estimator for the level of reporting in an outbreak based on the proportion of cases with known infectors, which can be derived from case investigation data. Using simulated outbreaks to assess the performance of the method, we found that this approach generally had little bias and reasonable precision, but poor coverage. Across all simulations, estimated reporting was most often within 10% of the true value, suggesting the method will retain operational relevance in different settings. The results were not sensitive to the range of reproduction numbers simulated in the scenarios, suggesting that the method can be applied in settings of somewhat higher and lower transmission.

Simulation results indicate that a first limitation of the method lies in the analysis of smaller outbreaks. Overall, the approach performed better in larger outbreaks, with all metrics pointing to improved results in outbreaks with more than 100 case investigations. This observation suggests that our method may struggle to identify heterogeneities in reporting across time, space, or sections of the population (e.g. age groups) if the corresponding strata have small numbers of reported cases. It also means that estimates of reporting made in the early stages of an outbreak, when few cases have been reported and investigated, will be prone to larger statistical uncertainty.

Our approach also assumes a uniform sampling of the transmission tree over the time period on which the analysis is carried out. It would in theory be prone to under-estimating reporting when entire branches of the transmission tree remain unobserved. For instance, if an epidemic is spreading in a location where surveillance is totally absent, a substantial number of cases may remain unnoticed, and such under-reporting would not be accounted for in our estimates.
As a consequence, our method is best applied for estimating the overall reporting over geographic areas and time periods where surveillance has not varied drastically, and which are large enough to yield sufficient case investigations (typically at least 100) for reporting to be accurately estimated. Note that in terms of outbreak response, decisions for altering surveillance strategies would generally be made at coarse geographic scales and considering months of data, so that our method should retain operational value despite its inability to detect changes in reporting at small temporal or spatial scales. Similarly, the fact that a single overall value of reporting is estimated for the data considered also implies that changes in reporting across different transmission chains will be overlooked. In situations where reporting varies across chains, for instance if super-spreading events are systematically ‘better’ investigated, the estimated reporting would effectively be an average of the reporting levels of the different chains weighted by their respective numbers of successful case investigations. We also assumed that the reproduction number (R) was independent from the reporting process, so that reported source cases cause the same average number of secondary cases as non-reported ones. This condition may not always be met, for instance if unreported individuals tend to cause more super-spreading events. In the context of Ebola, this may occur through community deaths, in which funeral exposure of a large number of relatives may give rise to a new cluster of cases from a single, unreported source case. Under such circumstances, we would expect our method to under-estimate reporting, although this should be further quantified by dedicated simulation studies. Another limitation of our method relates to data availability and quality. 
Our approach relies on case investigation data, a time-consuming but often standard process of contact tracing usually requiring interviews of patients and/or their close relatives. There are several possible outcomes from such investigation: i) identifying a single likely infector amongst reported cases (cases with a known infector) ii) establishing that the infector was not amongst the reported cases (cases without a known infector) iii) failing to identify a single likely infector. Our approach requires case investigations to fall within the first two categories. In practice, the second and third situations may be easily confused—the third likely being the most frequent. To avoid such confusion, we would recommend recording investigation outcomes as two separate questions: Has a single likely infector been identified? And if yes, is this individual listed amongst reported cases? In our simulations, we assumed for simplicity that all reported cases were successfully investigated, so that the reported outbreak size effectively corresponds to the number of data points available for the estimation. In practice, the actual sample size will be the number of case investigations which led to identifying a single source case (reported, or not). As our method performs better in larger datasets, the data requirement for estimating reporting from transmission chains will involve substantial field work. This also implies that case investigations need to be thorough. Indeed, in situations where the infector has actually been reported, but investigations failed to identify the epidemiological links with their secondary cases, our approach will tend to under-estimate reporting by a factor directly proportional to the frequency of mis-identified links. As a consequence, the proposed methodology is mostly applicable to diseases for which person-to-person transmission can be reliably traced through epidemiological investigation such as EVD. 
In disease settings where transmission chains are harder to establish, such as COVID-19 where pre-symptomatic and asymptomatic transmission plays an important role, we recommend resorting to other surveillance approaches such as serological surveys to estimate reporting. Unfortunately, alternative approaches for estimating under-reporting are very demanding in terms of data, typically needing to combine information on dates of onset, location of the cases, full genome sequences of the pathogen for nearly all cases, good prior knowledge on key delays (e.g. incubation period, serial interval) [13,16], and ideally contact tracing data [14]. These methodologies are also much more complex and computer-intensive, as they either involve the reconstruction of transmission trees [13,14] or of outbreak clusters [16]. In contrast, the approach introduced here is fast and simple, and can be used in real time to estimate reporting based on data routinely collected as part of contact tracing activities and surveillance. We evaluated the performance of the method using simulated EVD outbreaks in line with estimates of transmissibility and epidemiological delays of the Eastern DRC Ebola epidemic [12,28], as this was the original context in which the method was developed. Further work should be devoted to investigating the method’s performance for other diseases and different epidemic contexts. In particular, it would be interesting to study the potential impact of correlations between transmissibility and under-reporting, i.e. situations in which non-reported cases may exhibit increased infectiousness and cause super-spreading events.

Conclusion

In this paper, we derived a straightforward and pragmatic estimator for real-time estimation of case reporting in outbreak settings, and tested this approach under a range of simulated conditions. The method exhibited little bias and reasonable precision, and while coverage was suboptimal in some settings (large outbreaks with higher reporting), most estimates were within a reasonable range (10–15%) of the true value. This suggests the method will be useful for informing the response to outbreaks in which person-to-person transmission is the main driver of the epidemic, and where enough (ideally > 100) transmission chains can be retraced through epidemiological investigation.

Supporting information

Table A. Performance measures from 4000 simulations by the mean of the R distribution, reported outbreak size, and true reporting level. Table B. Comparison of absolute error from 4000 simulations between true reporting levels and estimates of reporting by the mean of the R distribution, reported outbreak size, and true reporting level. (DOCX)

Peer review history

5 Mar 2021 Dear Dr Jombart, Thank you very much for submitting your manuscript "Measuring the unknown: an estimator and simulation study for assessing case reporting during epidemics" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation. 
When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments. Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts. Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments. Sincerely, Benjamin Muir Althouse Associate Editor PLOS Computational Biology Tom Britton Deputy Editor PLOS Computational Biology

*********************** Reviewer's Responses to Questions

Comments to the Authors: Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors here seek to solve a critical problem in outbreak settings, the difficulty of understanding the reporting fraction in near-real time, without costly and time-consuming serological and genomic studies. The approach is simple (assuming extensive contact tracing and case investigation data exist) and intuitive, and the authors have structured the paper clearly. 
My comments below seek to improve an approach for which I have high enthusiasm and which I believe could be very useful to the field. More clarity in the derivation/formalism and more extensive sensitivity analyses are needed to show the validity of the approach, and several parts of the approach and simulation should be tailored to better address real-world outbreak scenarios (e.g., time-varying reporting, shorter time scale simulations). 1. I was confused whether this method accounts for undetected secondary cases. It seems the expression E(n_u + n_k) = (m_r + m_k)*R would imply complete detection of the secondary cases. The expressions for E(n_k) and E(n_u), though, both have a pi term that I believe accounts for under-reporting of secondary cases. Why is that not present in E(n_u + n_k)? Perhaps time indexing the estimator pi would help resolve some confusion. My understanding is that:

pi_t = m_{r,t} / (m_{r,t} + m_{u,t})
n_{t+1} = (m_{r,t} + m_{u,t})*R (where n_{t+1} is the total number of secondary cases caused by index cases in generation t)
n_{u, t+1} = m_{u,t}*R
n_{r, t+1} = m_{r,t}*R
n_{u, detected, t+1} = m_{u,t}*R*pi_{t+1}
n_{r, detected, t+1} = m_{r,t}*R*pi_{t+1},

where I've replaced n_u from the paper with n_{u, detected, t+1}, the number of detected secondary cases infected by an unknown infector of the previous generation, t, and n_k with n_{r, detected, t+1} for the number of detected secondary cases infected by a known infector of generation t. The derivation then relies on the assumption that, at least over the interval of a generation time, pi_t ~ pi_{t+1}. Does this fit with the authors' logic? I think the assumption pi_t ~ pi_{t+1} would be mostly valid in many scenarios, but should be explicitly stated. 2. 
Time-indexing the reporting fraction would help indicate whether or how this method can be used to estimate a time-varying reporting fraction, as reporting fractions seem rarely to be constant (and cause the most frustration when they are not constant). This would be a critical expansion of the methods, and additional simulations with a non-constant reporting fraction would improve the paper. This formalism could also help clarify to which time interval estimates apply. That is, should we consider pi_hat = n_k / (n_k + n_u), to use the original notation, as a lagged estimator by one generation time? How do reporting delays factor in? It's a fine point, but given recent discussion of cohort vs case reproductive numbers in this journal (https://doi.org/10.1371/journal.pcbi.1008409) and the relationship between time-varying reporting and R_t, perhaps of interest to the intended audience. 3. I am concerned about the situation where an infector is reported, but not known to be the infector of a given case. That is, what happens when the assumption that "all cases reported were investigated, so that it is known if they had a documented epidemiological link, or not, amongst reported cases" breaks down? I understand this may be less common in the context of EVD, but seems more likely in the case of COVID, for example, for which this method would be very useful. In this situation, the unknown but detected infector would lead to inflation of the denominator (i.e., over-counting of n_u) and underestimation of pi. Some discussion or simulation (e.g. by pruning known transmission links, not solely index cases) would be useful. This would be a larger expansion, but it does seem like it would be possible to incorporate some probabilistic reconstruction (rather than just known/unknown links) to account for scenarios of several likely infectors, or an unknown infector, as described in the discussion. 4. 
I can't speak as much to the theoretical validity of using standard errors for a proportion and exact binomial confidence intervals, though I am concerned by the low coverage. My two primary concerns, both partially addressed in the limitations, are (1) possible links/correlation of detection in a transmission chain (i.e., a secondary case of an undetected index is less likely to be detected itself; it would seem this might have less influence on uncertainty and may be more of a structural issue, that pi_{t+1} depends on m_{u,t} vs m_{r,t}) and (2) when there is a relationship between R and pi (e.g., super-spreading events more likely to be detected), or even just overdispersion in R leading to larger variance in n and hence pi. The discussion does partially address #1, saying the approach would "be prone to under-estimating reporting when entire branches of the transmission tree remain unobserved"; but what if there is not complete non-report of certain chains, and just systematic under-reporting? Similarly, the authors note that if super-spreaders were more likely to be undetected, "we would expect our method to under-estimate reporting, although this should be further quantified by dedicated simulation studies". Are the authors not able to perform these simulations? Does simulacr allow for negative binomial branching? 5. The definitions of bias and coverage are very clear. I was confused, though, by the statement that "the model based standard error is mean of the square of the bias", when Table 2 lists the model based standard error as the square root of Var(pi). How are the authors defining Var(pi) (presumably distinct from the standard error of pi_hat on pg 6)? 6. It seems all models were run for 365 days with a fixed distribution of R, and thus it was differences in population size and reporting that dictated reported outbreak size. 
It would greatly improve the paper to see the patterns in bias/precision broken down by R and time scale, not just a proxy of population size. That is, does a reported outbreak of n=100 across one year due to low R and small population behave the same as a reported outbreak of n=100 across one month with higher R? Exploring these shorter time frames would seem to be highly relevant for the outbreak scenarios in which this method would be applied. I also suspect that short time scales and smaller populations are where differences in a Poisson vs negative binomial branching process (or even just differences in the distribution of R) could become most important in appropriately considering uncertainty. Reviewer #2: The authors describe a novel method for estimating the proportion of cases that are observed during an epidemic. Overall the manuscript is well presented and the methods and results described clearly. I raise (major) points below about the appropriateness of the measure used to estimate reporting (cases with "known" infectors) and the generalisability of their findings across outbreaks of different pathogens. 1. The key measure from the epidemiological data used is "cases with a known infector". I initially looked for information on how a "known infector" is classified in the Methods, but found it later in the Discussion. "Our approach relies on case investigation data, a time-consuming process usually requiring interviews of patients and/or their close relatives. There are several possible outcomes from such investigation: i) a single infector can be identified..." etc. As this definition and the description of the underlying data is important it should be visible much earlier in the paper. My impression is that determining "who infected whom" is difficult in the absence of genomic data. Contact tracing methods may be effective when a pathogen is "rare", i.e. 
a primary infection occurs in a village in a person who has travelled from outside, and then a second case is reported in the same household within the serial interval. But such examples are likely rare? When transmission is widespread and there are multiple transmission chains circulating simultaneously, inferring who infects whom is very difficult. Therefore the accuracy of interviews to determine infector/infectee pairs is likely to decrease as the outbreak size increases. Do the authors agree that this is a potential source of bias that should be accounted for in their simulations? 2. It is implied that the method is generalisable across outbreaks with different pathogens. In using data only from a single pathogen in a single location, however, this claim is suspect and not proven. Could simulations be performed which are calibrated to the natural history of other pathogens? In particular, I wonder if the method would be successful at inferring the level of reporting in infections with a high proportion of asymptomatic cases or with a variable serial interval. In both of these cases, I suspect it would be more difficult to correctly assign infector/infectee pairs. 3. The underlying model does not account for an over-dispersed distribution of secondary cases (negative binomial distribution, or "super-spreading"). Although this is mentioned as a limitation in the Discussion, it does raise questions about the appropriateness of the epidemiological model, given that overdispersion in the secondary case distribution is a key driver of epidemics. There are estimates of the dispersion parameter k for Ebola virus (e.g. 0.37 in Lau et al. PNAS 2017), therefore the authors could use this information in their model. For instance, if one person is identified as an infector, what is the probability that they did indeed only infect one other person? When k is small, this probability is low and there are likely several other non-reported cases. 4. 
In the introduction the authors characterise alternative measures to estimate the reporting rate as difficult or impractical, although the data collecting method for their own analysis is later described as "a time-consuming process". There is no mention of prevalence surveys to estimate the true number of cases - e.g. the REACT study which is ongoing in the UK for COVID-19. How would (a small number of) cluster-randomised prevalence surveys perform as a method to estimate reporting, given that interviewing cases is "time consuming", and likely subject to bias (see point 1)? ********** Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information. Reviewer #1: Yes Reviewer #2: Yes ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at . 
Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5. Reproducibility: To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see

27 Jul 2021 Submitted filename: response letter.pdf

19 Sep 2021 Dear Dr. Jarvis, Thank you very much for submitting your manuscript "Measuring the unknown: an estimator and simulation study for assessing case reporting during epidemics" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. Sincerely, Benjamin Muir Althouse Associate Editor PLOS Computational Biology Tom Britton Deputy Editor PLOS Computational Biology

*********************** Reviewer's Responses to Questions

Reviewer #1: I thank the authors for their thorough responses to my comments. This paper would continue to benefit from refinement of how the method and its valid applications are presented, as discussed below. 1. I still find there to be a bit of an internal conflict in the described applications of this method. The method is described in the abstract and introduction as "useful for estimating reporting in real-time" and as a tool that can be used "to inform strategic decision making during an outbreak response". 
The proposed use case in the authors' response, however, is to guide decisions for altering surveillance strategies using "months of data", rendering concerns about temporal and spatial variations of lesser importance. To me, that's not exactly a 'real-time' application and doesn't allow for operational results in the critical early weeks or months (depending on case rate) of an outbreak. Relatedly, while I think a popular use of reporting fraction estimates is to understand current burden (think of the extensive "nowcasting" COVID literature), this method would appear to be ill-suited except in the case that extensive investigation of multiple parallel transmission chains can be completed quickly. Clarifying the valid uses of this method earlier in the paper, specifically the sample size limitations and associated limitations in temporal and spatial resolution and the bias from incomplete tracing, would improve the framing of this paper. I would soften language in the introduction, particularly in paragraph 2 when describing "timely estimation" amidst changing outbreak situations. 2. Thank you to the authors for providing the reference for their definitions of model based standard error and empirical standard error. The definition of the empirical SE is now more clear. I am still confused, though, by the definition of the model-based standard error. The definition E[(theta_hat - theta)^2] given in Table 3 is the classical definition of the mean squared error; this is also the name used in Table 6 of Morris et al. The written definition that the model-based square error (the MSE) is "the mean of the square of the bias" is confusing, in that MSE = E[Bias_i^2], if Bias_i = theta_{hat,i} - theta, but MSE != Bias^2, when Bias = E(theta_hat)-theta, as defined earlier in Table 3. Did the authors intend to use a different quantity than the MSE? If not, it would be more clear and better match common terminology to refer to the model-based square error as the MSE. 
If the model standard error is in fact the MSE, I would expect MSE = Var(theta_hat) + Bias^2, where Bias = E(theta_hat)-theta. This relationship does not appear to hold in Table 3 (e.g., for pi=0.25 and outbreak size 10-99, I would expect MSE ~ 0.071^2 = 0.005). The derived quantity in the authors' code may be incorrect. The model based standard error (`mod_se_bias`, or `model_se` as used in Table 3) appears to be the square root of the mean across all simulations of the squared reporting probability standard errors [that is, sqrt( mean( reporting_probability * (1 - reporting_probability) / n_reported ) )]. What is defined as `rmse` or `root_mean_squared_error` appears to be correct [that is, sqrt( mean( (reporting_fraction - true_value)^2 ) )], but does not appear to be used in Table 3. 3. Do the authors mean "ii) established that the infector was NOT amongst the reported cases (cases without a known infector)."? I think it is worth addressing the difference between establishing that the infector was not amongst reported cases and being unable to establish whether the infector is among reported cases; the former implies more certainty than I think there often is in the case of unidentified infectors. 4. The sentence "As a consequence, the proposed methodology is mostly applicable to diseases for which person-to-person transmission can be achieved through epidemiological investigation such as EVD" is unclear; do the authors mean that "person-to-person transmission can be reliably traced" or similar? This is a critical assumption that merits more treatment in the methods (e.g., following the definitions of cases with/without known infectors in "Estimating reporting from epidemiological links"). 5. I would explicitly state in the derivation that reported and unreported infectors are assumed to have the same distribution of R (or do the authors believe that only the average R must be constant, in which case this should be stated). 6. 
I believe the definition of the exact (Clopper Pearson) interval includes the multiplier (n_k + 1) in the denominator of the second term, rather than n_k alone, for the upper bound. The correct definition was implemented in the code. 7. Code for reproducing simulations and manuscript figures is generally well commented. This could be user error, but I was unable to run the parallelized outbreak simulations without adding the pkg:: operator whenever `simulacr` functions were required (simulacr::make_disc_gamma in create_raw_simulation_list and simulacr::simulate_outbreak in recursive_minimum_outbreak, specifically). Better documentation for which version/branch to use and/or a code release would be helpful. Reviewer #2: The authors have provided robust responses to the reviewer comments and made modest alterations to their manuscript. My remaining concern is that the method has not been applied to estimate reporting in the original EVD dataset. To pose as a question - what was the fraction of cases reported during the EVD epidemic in DRC? Without this application the manuscript feels incomplete, particularly as there are references in both the paper and the response to reviewers that this study is EVD focussed rather than a generalisable tool. It also undermines the point made in the Introduction that the fraction of cases reported is a "key epidemiological indicator" and an "essential factor to consider". If reporting the level of reporting is so important, then why not report it? As the method is "fast and simple" this should presumably not take long. In addition, some context on the original dataset would be beneficial. There are pointers to other studies, but the Methods should include, as a minimum, the locations where contact tracing data were collected, the dates of collection, and by whom they were collected (Public health ministry/ NGO). Ideally a template of the interview form used for contact tracing should be included as supplementary information. 
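Regarding the exact (Clopper-Pearson) interval discussed in point 6 above: rather than the closed-form F- or Beta-quantile expressions, the same interval can be obtained by numerically inverting the binomial CDF. A standard-library Python sketch (helper names are ours, not the authors' code):

```python
import math

def binom_cdf(k, n, p):
    # P(X <= k) for X ~ Binomial(n, p)
    return sum(math.comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(k, n, alpha=0.05, iters=100):
    """Exact (Clopper-Pearson) two-sided CI for a binomial proportion,
    found by bisection on the binomial CDF."""
    if k == 0:
        lo = 0.0
    else:
        # lower bound solves P(X >= k | p) = alpha/2; the tail is increasing in p
        a, b = 0.0, 1.0
        for _ in range(iters):
            m = (a + b) / 2.0
            if 1.0 - binom_cdf(k - 1, n, m) < alpha / 2.0:
                a = m
            else:
                b = m
        lo = (a + b) / 2.0
    if k == n:
        hi = 1.0
    else:
        # upper bound solves P(X <= k | p) = alpha/2; the CDF is decreasing in p
        a, b = 0.0, 1.0
        for _ in range(iters):
            m = (a + b) / 2.0
            if binom_cdf(k, n, m) > alpha / 2.0:
                a = m
            else:
                b = m
        hi = (a + b) / 2.0
    return lo, hi

# e.g. 30 of 40 investigated cases had a known, reported infector
lo, hi = clopper_pearson(30, 40)
```

For k = 0 the lower bound is 0 and the upper bound has the closed form 1 - (alpha/2)^(1/n), which is a useful check on any implementation.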
********** Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code, e.g. participant privacy or use of data from a third party, those must be specified. Reviewer #1: Yes Reviewer #2: None

Do you want your identity to be public for this peer review? Reviewer #1: No Reviewer #2: No

3 Feb 2022 Submitted filename: reporting_response letter_v2.docx

20 Apr 2022 Dear Dr. Jarvis, We are pleased to inform you that your manuscript 'Measuring the unknown: an estimator and simulation study for assessing case reporting during epidemics' has been provisionally accepted for publication in PLOS Computational Biology. Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated. IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript. 
Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS. Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. Best regards, Benjamin Althouse Associate Editor PLOS Computational Biology Tom Britton Deputy Editor PLOS Computational Biology

*********************************************************** Reviewer's Responses to Questions

Reviewer #1: I thank the authors again for their thoughtful responses and edits. I have no further comments. This is an excellent paper - congratulations to all.

Reviewer #2: The study is insufficient in its current form as the main parameter of interest, the proportion of cases reported, is not inferred from contact tracing data. It is not sufficient to use the method only on simulated data given the epidemiological focus of the manuscript, and the method is not interesting enough by itself to warrant publication in PLoS Comp Biol. The authors need to apply their method to at least one real world dataset to demonstrate that i) it is feasible to perform with routinely collected data from case investigations, and ii) that estimating "reporting" is insightful for outbreak response.

********** Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available? Reviewer #1: Yes Reviewer #2: None

Do you want your identity to be public for this peer review? Reviewer #1: No Reviewer #2: No

17 May 2022 PCOMPBIOL-D-21-00139R2 Measuring the unknown: an estimator and simulation study for assessing case reporting during epidemics

Dear Dr Jarvis, I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course. The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript. Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. 
The final article will be published to the same URL, and all versions of the paper will be accessible to readers. Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work! With kind regards, Anita Estes PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol
References (27 in total)

1.  Modelling under-reporting in epidemics.

Authors:  Kokouvi M Gamado; George Streftaris; Stan Zachary
Journal:  J Math Biol       Date:  2013-08-13       Impact factor: 2.259

2.  Epidemiological and viral genomic sequence analysis of the 2014 ebola outbreak reveals clustered transmission.

Authors:  Samuel V Scarpino; Atila Iamarino; Chad Wells; Dan Yamin; Martial Ndeffo-Mbah; Natasha S Wenzel; Spencer J Fox; Tolbert Nyenswah; Frederick L Altice; Alison P Galvani; Lauren Ancel Meyers; Jeffrey P Townsend
Journal:  Clin Infect Dis       Date:  2014-12-15       Impact factor: 9.079

3.  The Ongoing Ebola Epidemic in the Democratic Republic of Congo, 2018-2019.

Authors:  Oly Ilunga Kalenga; Matshidiso Moeti; Annie Sparrow; Vinh-Kim Nguyen; Daniel Lucey; Tedros A Ghebreyesus
Journal:  N Engl J Med       Date:  2019-05-29       Impact factor: 91.245

4.  The design of simulation studies in medical statistics.

Authors:  Andrea Burton; Douglas G Altman; Patrick Royston; Roger L Holder
Journal:  Stat Med       Date:  2006-12-30       Impact factor: 2.373

5.  When are pathogen genome sequences informative of transmission events?

Authors:  Finlay Campbell; Camilla Strang; Neil Ferguson; Anne Cori; Thibaut Jombart
Journal:  PLoS Pathog       Date:  2018-02-08       Impact factor: 6.823

6.  Outbreak analytics: a developing data science for informing the response to emerging pathogens. (Review)

Authors:  Jonathan A Polonsky; Amrish Baidjoe; Zhian N Kamvar; Anne Cori; Kara Durski; W John Edmunds; Rosalind M Eggo; Sebastian Funk; Laurent Kaiser; Patrick Keating; Olivier le Polain de Waroux; Michael Marks; Paula Moraga; Oliver Morgan; Pierre Nouvellet; Ruwan Ratnayake; Chrissy H Roberts; Jimmy Whitworth; Thibaut Jombart
Journal:  Philos Trans R Soc Lond B Biol Sci       Date:  2019-07-08       Impact factor: 6.237

7.  A graph-based evidence synthesis approach to detecting outbreak clusters: An application to dog rabies.

Authors:  Anne Cori; Pierre Nouvellet; Tini Garske; Hervé Bourhy; Emmanuel Nakouné; Thibaut Jombart
Journal:  PLoS Comput Biol       Date:  2018-12-17       Impact factor: 4.475

8.  First cases of coronavirus disease 2019 (COVID-19) in France: surveillance, investigations and control measures, January 2020.

Authors:  Sibylle Bernard Stoecklin; Patrick Rolland; Yassoungo Silue; Alexandra Mailles; Christine Campese; Anne Simondon; Matthieu Mechain; Laure Meurice; Mathieu Nguyen; Clément Bassi; Estelle Yamani; Sylvie Behillil; Sophie Ismael; Duc Nguyen; Denis Malvy; François Xavier Lescure; Scarlett Georges; Clément Lazarus; Anouk Tabaï; Morgane Stempfelet; Vincent Enouf; Bruno Coignard; Daniel Levy-Bruhl
Journal:  Euro Surveill       Date:  2020-02

9.  Estimating Chikungunya prevalence in La Réunion Island outbreak by serosurveys: two methods for two critical times of the epidemic.

Authors:  Patrick Gérardin; Vanina Guernier; Joëlle Perrau; Adrian Fianu; Karin Le Roux; Philippe Grivard; Alain Michault; Xavier de Lamballerie; Antoine Flahault; François Favier
Journal:  BMC Infect Dis       Date:  2008-07-28       Impact factor: 3.090

10.  Bayesian reconstruction of disease outbreaks by combining epidemiologic and genomic data.

Authors:  Thibaut Jombart; Anne Cori; Xavier Didelot; Simon Cauchemez; Christophe Fraser; Neil Ferguson
Journal:  PLoS Comput Biol       Date:  2014-01-23       Impact factor: 4.475


Beijing Coyote Bioscience Co., Ltd. (北京卡尤迪生物科技股份有限公司) © 2022-2023.