
Data-driven approach in a compartmental epidemic model to assess undocumented infections.

Guilherme S Costa¹, Wesley Cota¹, Silvio C Ferreira¹,².

Abstract

Nowcasting and forecasting of epidemic spreading rely on incidence series of reported cases to derive the fundamental epidemiological parameters for a given pathogen. Two relevant drawbacks for predictions are the unknown fractions of undocumented cases and the levels of nonpharmacological interventions, which vary widely across places and times. We describe a simple data-driven approach using a compartmental model including asymptomatic and pre-symptomatic contagions that allows one to estimate both the level of undocumented infections and the value of the effective reproductive number $R_t$ from time series of reported cases, deaths, and epidemiological parameters. The method was applied to epidemic series for COVID-19 across different municipalities in Brazil, allowing us to estimate how heterogeneous the under-reporting is across different places. The reproductive number derived within the current framework is only weakly sensitive to both diagnosis and infection rates during the asymptomatic states. The methods described here can be extended to more general cases if data are available and adapted to other epidemiological approaches and surveillance data.


Keywords: Data-driven modeling; Epidemic spreading; Epidemic surveillance; Undocumented infections

Year: 2022 PMID: 35996714 PMCID: PMC9385215 DOI: 10.1016/j.chaos.2022.112520

Source DB: PubMed Journal: Chaos Solitons Fractals ISSN: 0960-0779 Impact factor: 9.922

Introduction

Contemporary society has faced an unprecedented threat imposed by the COVID-19 pandemic, caused by the pathogen SARS-CoV-2, evidencing the importance, limitations, and subtleties of using compartmental epidemic models for the forecasting or nowcasting of pandemic scenarios [1], [2], [3], [4], [5]. After two years of intensive investigation, much has been learned about the virology of SARS-CoV-2 in humans [6], [7], [8], [9]. Among other achievements, several key aspects of the transmission were unveiled [5], [10], [11], [12] and efficient vaccines have been developed [13]. Variants of the original strain [14], [15] give rise to new and more aggressive outbreaks due to reinfection and raised contagion rates, and the virus tends to become endemic, circulating among humans indefinitely with new outbreaks emerging seasonally [16]. While the biology of the virus and its interaction with human hosts are better understood, other crucial aspects of the epidemiology, especially the behavioral ones, remain unpredictable even in the short term, varying across time and location. In particular, non-pharmaceutical interventions (NPIs), such as face masks, testing policies, and social distancing, have played a central role in the spreading of SARS-CoV-2 [17], [18], [19], [20]. These NPIs reduce the contagion rates in an unpredictable way, such that the effective contagion rate over time must be inferred from case count series via likelihood or other calibration methods [21], [22]. A fundamental epidemic characteristic of the SARS-CoV-2 contagion in humans is its high transmission before the onset of symptoms [5], [11], [23], by presymptomatic individuals, and even by those who never manifest relevant symptoms [24], [25], the true asymptomatic individuals. The latter could be assessed by mass testing and contact tracing, for example.
Seroprevalence studies for different phases and regions [14], [26] reveal population incidences of antibodies for SARS-CoV-2 at levels much higher than the case counts reported by the epidemiological surveillance systems. So, the case fatality ratio (CFR), defined as the ratio between the numbers of diagnosed deaths and cases, can differ substantially from the infection fatality ratio (IFR), defined as the fraction of all infections (documented or not) that evolve to death [26], [27]. The level of under-reporting, in which the CFR is greater than the IFR, varies widely in different seroprevalence inquiries [26] due to several uncontrolled factors such as the testing policies (only symptomatic cases, contact tracing, etc.), availability of tests (low or high income places), sensitivity of tests (antigen or PCR), and seeking of medical care, among others. The relation between seropositivity and immunity is not fully established and new emerging variants always open paths for reinfections and new outbreaks [28]. Therefore, estimating the level of undocumented infections across different places and times remains a challenge.

Epidemic models of statistical inference were developed to assess the amount of undocumented infections of SARS-CoV-2. For example, Pullano et al. [29] estimated that 9 out of 10 cases of symptomatic infections were not ascertained by the surveillance system in France from 11 May to 28 June 2020, during the first epidemic wave of COVID-19, suggesting that large numbers of symptomatic cases of COVID-19 did not seek medical advice. Lu et al. [30] considered four complementary approaches to estimate the cumulative incidence of symptomatic cases of COVID-19 in the US and concluded that on April 4, 2020 the estimated case count was 5 to 50 times higher than the official counts of positive tests across the different states. Subramanian et al. [31] used a model including testing information to fit the case and serology data from New York City, from March to June of 2020, to estimate a low proportion of symptomatic cases (13 to 18% of the total infections), and that the reproductive number could be larger than often assumed. Irons and Raftery [32] used a similar approach to estimate that approximately 60% of the infections were not diagnosed by tests in the USA as of March 7, 2021. Hallal et al. [33] carried out two seroprevalence studies, the first in May 2020 and the second in June 2020, in 133 municipalities of Brazil and estimated that only 10.3% of all infections were documented.

Due to the importance of asymptomatic or pre-symptomatic transmission, the corresponding compartments were soon included in mathematical models for COVID-19 [11], [34], [35], [36], [37]. However, these compartments are also an additional source of uncertainty in the initial conditions. Predictive scenarios of the first SARS-CoV-2 outbreak were either semi-quantitative [34], [38], [39] or based on Bayesian inference using reported case series [35], [40], [41]. Brazil is an example, certainly not an exception, of highly heterogeneous responses to the COVID-19 pandemic due to the lack of coordinated policies across different administrative layers [42], in addition to the intrinsic variability of socioeconomic indices across the country, which directly impacts the epidemiological outcomes. Therefore, a mechanistic approach for simulation of epidemic spreading with asymptomatic transmission calls for a systematic way to determine the initial conditions.

The contribution of asymptomatic infections and testing policies to the effective reproductive number [43] through surveillance counts is an important issue [31], [32]. The basic reproductive number is defined as the average number of secondary infections generated by a single infected individual introduced in a completely susceptible population, commonly represented by $R_0$.
The effective reproductive number $R_t$ is given by the product of $R_0$ and the fraction of the population that is susceptible (who can be infected by the pathogen) at time $t$. This definition, under the hypothesis of homogeneous mixing, is the simplest one and can be generalized to stratified compartments [43]. The reproductive number can also be estimated directly from case counts using statistical inference models [21], as reported for the COVID-19 pandemic across the world [18], [35], [42], [44].

In the present work, we describe a mechanistic approach to estimate the number of undocumented infections (symptomatic or not) using the epidemic surveillance data for confirmed cases and deaths. The method is grounded on a compartmental epidemic model including both documented and undocumented compartments, the latter not counted by epidemiological surveillance. The present approach allows one to determine the effective reproductive number, the level of under-reporting, and the initial conditions using the date of diagnosis. The approach can be promptly modified or generalized for other types of data, epidemic compartments, and population stratification. The method shares similarities with recent approaches to estimate undocumented cases [24], [29], [30], [32], such as the use of reported infections and deaths. The central difference is that our approach is essentially mechanistic and not Bayesian. We applied the method across different geographical scales of two Brazilian states, namely Paraná (PR) and Espírito Santo (ES), using time series with dates of COVID-19 diagnosis made available by the epidemiological surveillance of the respective states. The time window of investigation was from 1 January to 31 July of 2021, encompassing the second epidemic wave in Brazil driven mainly by the Gamma variant [42], [45]. We report variable levels of under-reporting across different places and times.
We were able to estimate initial conditions for the hidden compartments and effective infection rates over time, which yielded an efficient short-term forecast for the series of confirmed cases. Although the basic reproductive number depends explicitly on the asymptomatic transmission rate, the analysis indicates that undocumented infections do not seem to significantly alter the effective reproductive number for the analyzed series.

The remainder of this paper is organized as follows. The methodology is detailed in Section 2. The epidemic compartmental model and some analytical results are presented in Sections 2.1 and 2.2, respectively. The data-driven approach to estimate the under-reporting level from epidemiological surveillance counts is described in Section 2.3, while a description of the epidemic rates and their relation with testing rates is presented in Section 2.4. The eigenvalue approach to determine the initial conditions is presented in Section 2.5. Application of the method to epidemiological data is presented in Section 3 and the main conclusions of the work are discussed in Section 4.

A mechanistic approach to estimate undocumented cases

Compartmental model

Following a mechanistic approach for population fractions, an epidemic process with presymptomatic, asymptomatic, and undocumented transmissions is investigated using a compartmental model [43] under the homogeneous mixing hypothesis. Individuals are grouped according to their epidemic states in the following compartments: susceptible (S), who can be infected; exposed (E), who were infected but are not contagious yet; asymptomatic (A), who are infectious but do not present symptoms; symptomatic (I), who may seek medical care due to the presence of symptoms; undocumented recovered (R), who have been infected and healed but were not diagnosed; deceased (D), who died due to COVID-19; two compartments of diagnosed cases for SARS-CoV-2, including individuals who were asymptomatic ($C_A$) or symptomatic ($C_I$) at the moment of testing; and the corresponding recovered compartments for confirmed cases, $R_A$ and $R_I$. We assume constant rates and spontaneous transitions, implying that the time spent in a given infectious compartment is exponentially distributed [43], in contrast with the biology of infectious pathogens, where one expects a peaked distribution that excludes very short and very long exit times. However, multiple infectious compartments soften the problem, producing peaked distributions for the total infectious time with a negligible probability of recovering shortly [43], [46]. The epidemiological model and rates are schematically depicted in Fig. 1.

Fig. 1

Schematic representation of the epidemic model including the following compartments: susceptible (S), exposed (E), asymptomatic (A), symptomatic (I), recovered (R, $R_A$, and $R_I$), deceased (D), and confirmed cases ($C_A$ and $C_I$). The transitions and respective rates are indicated by arrows. The infectious compartments are depicted with the symbol . The infection processes, represented by the dashed line, involve the interaction between susceptible individuals and one of the infectious compartments X = A, I, $C_A$, and $C_I$, happening with rates that may depend on the compartment.

Susceptible persons in contact with infectious individuals (asymptomatic or symptomatic) become exposed with rates and , respectively. While the complete model allows infection by diagnosed individuals $C_A$ and $C_I$, here we set the corresponding infection rates to zero for the sake of simplicity, such that confirmed cases are assumed to be isolated and do not contribute to new infections. The remaining transitions are represented in Fig. 1. Exposed individuals become asymptomatic with rate . The latter can evolve to a symptomatic state with rate , recover with rate , or be diagnosed by tests with rate , moving to the confirmed compartment $C_A$. Similarly, the undocumented symptomatic individuals can recover with rate or be diagnosed and become $C_I$ with rate . The clinical state of confirmed cases evolves as does that of the undocumented ones, with rates represented by primed symbols , and . We assume the same recovery times for documented and undocumented cases using the following constraints A confirmed case (C) can die (D) with rate , while undocumented deaths are neglected, again, for the sake of simplicity. The true asymptomatic and the presymptomatic cases are implicitly considered with transitions A → R ($C_A$ → $R_A$) and A → I (A → $C_A$ → $C_I$), respectively. Compartments $C_A$ and $C_I$ are simplifications of a more complex dynamics including seeking a test, waiting for results, and isolation.
Assuming a constant population , where is the number of individuals in the compartment X, the above transitions can be summarized in the following set of differential equations where , , is the corresponding population fraction in the compartment X.
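The displayed equations of the model are not reproduced here, so the following is only a minimal Python sketch of a system consistent with the transitions described above. All rate names (`lam_A`, `mu_E`, `theta_A`, and so on), their numerical values, the assumption that deaths occur from the symptomatic confirmed compartment, and the specific form of the constraints on the primed rates are illustrative assumptions, not the paper's notation or calibrated values.

```python
import numpy as np

# Hypothetical rate names and placeholder values (per day); not the paper's values.
RATES = dict(
    lam_A=0.4, lam_I=0.4,                      # infection rates of A and I
    mu_E=1 / 3.0,                              # E -> A
    mu_A=1 / 3.0, beta_A=1 / 6.0, theta_A=0.02,  # A -> I, A -> R, A -> C_A
    beta_I=1 / 3.0, theta_I=0.10,              # I -> R, I -> C_I
    alpha=0.01,                                # C_I -> D (assumed exit channel)
)

def rhs(y, p):
    """Time derivatives of the population fractions; confirmed cases do not transmit."""
    s, e, a, i, r, c_a, c_i, r_a, r_i, d = y
    # Assumed constraints: confirmed cases follow the same clinical course.
    mu_Ap, beta_Ap = p["mu_A"], p["beta_A"]
    beta_Ip = p["beta_I"] - p["alpha"]
    force = s * (p["lam_A"] * a + p["lam_I"] * i)
    return np.array([
        -force,
        force - p["mu_E"] * e,
        p["mu_E"] * e - (p["mu_A"] + p["beta_A"] + p["theta_A"]) * a,
        p["mu_A"] * a - (p["beta_I"] + p["theta_I"]) * i,
        p["beta_A"] * a + p["beta_I"] * i,
        p["theta_A"] * a - (mu_Ap + beta_Ap) * c_a,
        p["theta_I"] * i + mu_Ap * c_a - (beta_Ip + p["alpha"]) * c_i,
        beta_Ap * c_a,
        beta_Ip * c_i,
        p["alpha"] * c_i,
    ])

def integrate(y0, p, days, dt=0.05):
    """Fixed-step fourth-order Runge-Kutta; returns the state after `days`."""
    y = np.asarray(y0, dtype=float)
    for _ in range(int(round(days / dt))):
        k1 = rhs(y, p)
        k2 = rhs(y + 0.5 * dt * k1, p)
        k3 = rhs(y + 0.5 * dt * k2, p)
        k4 = rhs(y + dt * k3, p)
        y = y + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
    return y

# Start with 0.1% of the population exposed and everyone else susceptible.
y0 = [0.999, 0.001, 0, 0, 0, 0, 0, 0, 0, 0]
y_end = integrate(y0, RATES, days=60)
cumulative_confirmed = y_end[5:].sum()  # C_A + C_I + R_A + R_I + D
```

Because every outflow of one compartment is an inflow of another, the fractions sum to one at all times, which is a convenient sanity check on any implementation.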

Analytical results

The basic reproductive number is straightforwardly computed and given by When infection by documented cases $C_A$ and $C_I$ is neglected, the expression simplifies to Consider a more intuitive parameterization in terms of the probabilities and that infected individuals are diagnosed during the asymptomatic or symptomatic phases, respectively, which can be computed from the compartmental model and are given by One can also show that an exposed individual ends up diagnosed with probability where . The first and second terms of Eq. (6) are due to diagnosis during asymptomatic and symptomatic phases, respectively. Recovering without diagnosis happens with probability . Therefore, we can determine a simple relation between the final number of documented () and undocumented () infections, defining the under-reporting coefficient as where We can also analytically determine the model's IFR, represented by , considering the probabilities that exposed individuals evolve to death passing through the C compartment or not, which are or , respectively. The IFR becomes
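As an illustration of how such quantities follow from competing constant-rate transitions, the sketch below computes diagnosis probabilities, an under-reporting coefficient, and a basic reproductive number for the case where confirmed cases do not transmit. The symbol names and the exact functional forms are assumptions derived from the transitions described in the text (each probability is the rate of one exit channel divided by the total exit rate), not a transcription of the paper's equations.

```python
def diagnosis_probabilities(mu_A, beta_A, theta_A, beta_I, theta_I):
    """Return (p_A, p_I, p) for hypothetical rates out of A and I.

    p_A: diagnosed while asymptomatic.
    p_I: diagnosed while symptomatic, given that the symptomatic state is reached.
    p:   an exposed individual ends up diagnosed in either phase.
    """
    out_A = mu_A + beta_A + theta_A      # total exit rate from A
    p_A = theta_A / out_A
    p_I = theta_I / (beta_I + theta_I)
    reach_I = mu_A / out_A               # leaves A towards I without diagnosis
    return p_A, p_I, p_A + reach_I * p_I

def under_reporting(p):
    """Undocumented infections per documented infection."""
    return (1.0 - p) / p

def basic_reproductive_number(lam_A, lam_I, mu_A, beta_A, theta_A, beta_I, theta_I):
    """Secondary infections while in A plus those while in I (confirmed cases isolated)."""
    out_A = mu_A + beta_A + theta_A
    return lam_A / out_A + (mu_A / out_A) * lam_I / (beta_I + theta_I)

# Placeholder rates, chosen only to give round numbers.
p_A, p_I, p = diagnosis_probabilities(0.3, 0.1, 0.1, 0.3, 0.1)
zeta = under_reporting(p)
```

With these placeholder rates, 20% of infections are diagnosed while asymptomatic, 60% reach the symptomatic state undiagnosed, and a quarter of those are then diagnosed, so 35% of infections end up documented.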

Estimating under-reporting from epidemic surveillance counts

We describe how testing probabilities can be estimated from surveillance count series with the aid of the compartmental model of Fig. 1. Let and represent the cumulative series of confirmed cases and deaths. The CFR computed for reported cases within a given time window is given by: in which and refer, respectively, to the increment in the number of cases and deaths in the interval, is the initial time chosen for calibration, and is a delay between death and positive test report. The CFR is given by the conditional probability that an infection evolves to death given that it was diagnosed. If and are events of death and diagnosis of infected individual, respectively, we have that where we have used the Bayes rule for conditional probabilities, the model hypothesis that only diagnosed individuals evolve to death, and the fact that . The under-reporting coefficient, which can be expressed as is extracted directly from data using Eq. (10) once IFR is assumed to be known.
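The reasoning above can be made concrete with a short sketch. If only diagnosed individuals can die, the probability of death equals the CFR times the probability of being diagnosed, so the diagnosed fraction is IFR/CFR and the number of undocumented infections per documented one is CFR/IFR − 1. The function names, the convention that the death window is shifted forward by the delay, and the synthetic series below are assumptions for illustration only.

```python
import numpy as np

def windowed_cfr(cum_cases, cum_deaths, t0, window, delay):
    """CFR over [t0, t0 + window], with deaths counted `delay` days later."""
    d_cases = cum_cases[t0 + window] - cum_cases[t0]
    d_deaths = cum_deaths[t0 + window + delay] - cum_deaths[t0 + delay]
    if d_cases <= 0:
        raise ValueError("no new cases in the chosen window")
    return d_deaths / d_cases

def under_reporting_from_cfr(cfr, ifr):
    """Undocumented per documented infection; requires CFR >= IFR."""
    if cfr < ifr:
        raise ValueError("CFR below IFR: the approach is not applicable")
    return cfr / ifr - 1.0

# Synthetic cumulative series: 100 cases per day, 2% of them die 10 days later.
days = np.arange(60)
cum_cases = 100.0 * days
cum_deaths = 2.0 * np.clip(days - 10, 0, None)

cfr = windowed_cfr(cum_cases, cum_deaths, t0=5, window=21, delay=10)
zeta = under_reporting_from_cfr(cfr, ifr=0.005)
```

In this synthetic example the CFR is 2% and the assumed IFR is 0.5%, so only one infection in four is documented, that is, three undocumented infections per documented one.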

Epidemic rates

The rates , , , , and are biological and can, in principle, be found in epidemiological surveys [6], [7], [8], [9], [12], [47]. The infection rates of asymptomatic and symptomatic individuals depend on behavioral aspects such as the number of potential infectious contacts per unit of time [19], [35], [39]; prophylactic attitudes by means of NPIs [48], [49], [50]; and the infectiousness and prevalence of new variants [14], [42], [51], to cite only some of the most prominent issues. Similarly, the confirmation rates of asymptomatic and symptomatic individuals depend on several behavioral and socioeconomic factors, being highly influenced by testing policies [39], [52], [53]. All these aspects are very heterogeneously distributed across time and different places. To estimate the confirmation rates and , we plug Eqs. (6), (12) to obtain Despite its simplicity, Eq. (13) is very handy since it relates the testing rates (or probabilities) with epidemiological parameters ( and ) and quantities () which can, in principle, be obtained directly from data using Eq. (10). Therefore, if the ratio is given, the testing rates can be estimated as Eq. (14) is mathematically consistent, i.e. , if the following conditions are satisfied is used to estimate and for a given ratio , which may play a role in the determination of infection rates and ; see Section 2.5. A sensitivity analysis of the ratio can be used to verify whether the results are only weakly sensitive to this choice (this was the case for all data analyzed in this work); otherwise the ratio must be determined using some calibration or likelihood method. Occasionally, surveillance data can provide a CFR smaller than the IFR estimates, which is inconsistent with the present approach, and the method is not applicable in these situations.

Assessing hidden compartments from epidemic surveillance data

Epidemiological surveillance provides the number of confirmed cases, deaths, and the dates of first symptoms or diagnosis; nothing with respect to the other compartments is commonly available. In a real scenario, the situation is much more complicated due to delays and other complex issues in surveillance counts [54], [55]. While the epidemic model is more general, allowing one to tackle different documented compartments ($C_A$, $C_I$, $R_A$, $R_I$, and D), due to the usual unavailability of all corresponding data, we aggregate all confirmed cases into a single compartment to be compared with the cumulative case series , which is reckoned by Eq. (2f) of the epidemic model. We remark that, if necessary, the partition of documented cases can be determined using the model rates. For example, the fraction of cases that end in the compartment R is . Neglecting transmission by confirmed cases, we then estimate the asymptomatic and symptomatic infection rates concomitantly with the initial conditions using the following calibration procedure:

(i) Select the time interval for which the reported case series will be analyzed. This time window should be short enough to assume that the infection rates are approximately constant, but sufficiently large to have a significant amount of data.

(ii) Using the CFR estimated with Eq. (10) and the IFR, determine the probability using Eq. (14) for a given ratio between the testing probabilities of symptomatic and asymptomatic individuals, assumed to be a parameter of the method.

(iii) Consider an adiabatic approximation, assuming that the susceptible population varies much more slowly than the other compartments, such that one can neglect its variation and take it as constant over the investigated period.

(iv) Start with guessed initial values for the products of the susceptible fraction and each infection rate (to be fitted with data).

(v) Determine the number of undocumented cases at using the under-reporting coefficient calculated using Eq. (12) and the number of confirmed cases from case counting, where is a transient time to be chosen according to the epidemiological series. Remember that encompasses all confirmed compartments; see Eq. (8). Under these conditions the compartmental model provides a closed linear system for the infectious compartments where the Jacobian is given by We assume that the solution is ruled by the leading term where is the principal eigenvector corresponding to the largest eigenvalue of , providing the following relations among initial conditions Using again , a closed system of initial conditions for is obtained with the integration of Eq. (2f) to obtain where is the increment of confirmed cases, available from data, during the interval . If we obtain

(vi) Finally, the susceptible population is determined as where , implying that , and the infection rates are self-consistently estimated as and .

(vii) Eqs. (2b) to (2f) are integrated in the interval and the dispersion with respect to the case counts is computed as

(viii) The fitted parameters are incremented iteratively and steps (iv) to (vii) are implemented using a bisection method to minimize the dispersion. In other words, a mesh with discrete values of , with mesh space , is varied searching for the minimal value of the dispersion. Then, the mesh space is reduced and the analysis repeated around the pair that yielded the lowest dispersion in the preceding step. This process is iterated 1000 times.

The choice of the transient time should compensate for new epidemic factors such as new variants and waning immunity that lead to reinfections and new outbreaks. Another factor that can alter the susceptible population is vaccination, which also confers variable levels of immunity against infections. Vaccination also impacts both the IFR and CFR, such that updated estimates of the IFR should be considered if the count series fueling the analysis is concomitant with vaccination, as is the case in our current analysis; see Section 3.1.
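The eigenvector step of the procedure above can be sketched as follows. With the susceptible fraction frozen, the exposed, asymptomatic, and symptomatic fractions obey a linear system whose leading eigenvector fixes their relative sizes; the overall scale is then set so that the diagnosis flux integrated over the window equals the observed increment of confirmed cases. The symbol names, the placeholder values, and the exact scaling rule are assumptions consistent with the description, not the paper's equations.

```python
import numpy as np

def jacobian(prod_A, prod_I, mu_E, mu_A, beta_A, theta_A, beta_I, theta_I):
    """Linearised dynamics of (E, A, I); prod_X is an infection rate times the
    susceptible fraction, held constant (adiabatic approximation)."""
    return np.array([
        [-mu_E, prod_A, prod_I],
        [mu_E, -(mu_A + beta_A + theta_A), 0.0],
        [0.0, mu_A, -(beta_I + theta_I)],
    ])

def leading_mode(J):
    """Largest-real-part eigenvalue and its eigenvector, normalised to sum to 1."""
    vals, vecs = np.linalg.eig(J)
    k = int(np.argmax(vals.real))
    v = vecs[:, k].real
    return float(vals[k].real), v / v.sum()

def initial_conditions(J, theta_A, theta_I, delta_c, T):
    """Scale the leading eigenvector so that the integrated diagnosis flux over
    a window of length T matches the observed increment `delta_c`."""
    growth, v = leading_mode(J)
    # Integral of exp(growth * t) over [0, T], with the growth -> 0 limit.
    integral = T if abs(growth * T) < 1e-12 else np.expm1(growth * T) / growth
    scale = delta_c / ((theta_A * v[1] + theta_I * v[2]) * integral)
    return growth, scale * v

# Placeholder values: per-day rates and a confirmed-case increment of 0.1% of the population.
J = jacobian(0.32, 0.32, 1 / 3, 1 / 3, 1 / 6, 0.02, 1 / 3, 0.10)
growth, (e0, a0, i0) = initial_conditions(J, 0.02, 0.10, delta_c=1e-3, T=14.0)
```

Because the off-diagonal entries of this Jacobian are non-negative and connect all three compartments, its leading eigenvalue is real and the associated eigenvector has components of a single sign, so dividing by their sum yields positive fractions.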

Results

Parameters and epidemic series

We applied the method to two types of count series available for Brazil, hereafter named Type-I and Type-II. The former consists of count series using release dates provided by epidemic surveillance departments of Brazilian federative units, which are aggregated and publicly available for all 5570 Brazilian municipalities [56]. These data do not yield the date of diagnosis, may present uncontrolled bias caused by reporting delays, and should be used with care. The Type-II data sets contain dates of diagnosis and of first symptom onset. In this work, we use the publicly available Type-II data for Paraná (PR) [57] and Espírito Santo (ES) [58] states. The data are publicly available in the cited resources, and the data aggregated for different municipalities, used in the present work, are available elsewhere [59]. A full description of these datasets can be found in the Supplementary Material (SM) [60]. We fixed the average values of the parameters d and d so that the mean incubation time is 6.4 d [6], [35]. The mean recovery time for symptomatic individuals was taken as d [61]. Following [34], [35], asymptomatic cases were assumed to have the same recovery time such that . Finally, we remark that rates , associated with the transitions between confirmed compartments, are determined by Eqs. (1a) to (1c). Uncertainty analysis was done drawing , , and from Gamma distributions with standard deviation of 1.3 d, while a uniform distribution with 10% of uncertainty was used for calibrated and . The IFR is the most critical parameter of our analysis. Since the time window we analyzed is concomitant with vaccination, a progressive reduction in the IFR is expected. To estimate the IFR reduction due to vaccines we proceeded as follows. The age-dependent IFR profile reported by Verity et al. [27], which yields an exponential increase with age and an average IFR of 0.68%, was considered.
The number of persons who completed the vaccination (two shots or one, depending on the vaccine type) as a function of time was extracted from surveillance systems, publicly available at Ref. [56]. Demographic data were obtained from Instituto Brasileiro de Geografia e Estatística (IBGE) [62]. Brazil followed a decreasing age prioritization strategy in which the elderly were vaccinated first, followed by progressively younger groups. We consider age groups, in which corresponds to yr, to yr,…, to and assume that all vaccine shots were distributed according to this sequence. Using data for states, both vaccination rates and demographics [62], we calculated the average IFR as follows. Without vaccines, the average IFR is given by where and are, respectively, the IFR and population fraction in the age group . If is the total fraction of the vaccinated population, the youngest age group that was vaccinated is given by where is the fraction of the age group that was vaccinated. Finally, if is the IFR reduction of the vaccinated population of age group , the corrected IFR becomes For the sake of simplicity, we assumed that and uniform across all age groups. These parameters are consistent with typical protection rates associated with vaccines used in Brazil (Pfizer-BioNTech, Sinovac, and AstraZeneca). The IFR as a function of time for the four investigated states is presented in Fig. 2. The lower IFR for Amazonas state reflects its young population (see SM [60]), while similar patterns are observed for the other analyzed states. Obviously, this is a simplified approach aiming at being qualitatively correct rather than quantitatively accurate. The data used are available in the SM [60].
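The age-prioritized correction described above can be sketched in a few lines. The three age groups, their IFR values, the population fractions, and the 80% reduction below are hypothetical numbers chosen for readability; they are not the values used in the paper, and the function names are illustrative.

```python
import numpy as np

def average_ifr(ifr_by_age, pop_frac):
    """Population-weighted IFR without vaccination."""
    return float(np.dot(ifr_by_age, pop_frac))

def vaccinated_fraction_by_age(pop_frac, total_vaccinated):
    """Allocate vaccines from the oldest group (last index) to the youngest;
    returns the vaccinated fraction within each group."""
    pop_frac = np.asarray(pop_frac, dtype=float)
    covered_frac = np.zeros_like(pop_frac)
    remaining = float(total_vaccinated)
    for k in range(len(pop_frac) - 1, -1, -1):
        if remaining <= 0.0:
            break
        covered = min(pop_frac[k], remaining)
        covered_frac[k] = covered / pop_frac[k] if pop_frac[k] > 0 else 0.0
        remaining -= covered
    return covered_frac

def corrected_ifr(ifr_by_age, pop_frac, total_vaccinated, reduction):
    """IFR after lowering the IFR of vaccinated people by the factor `reduction`."""
    ifr_by_age = np.asarray(ifr_by_age, dtype=float)
    pop_frac = np.asarray(pop_frac, dtype=float)
    v = vaccinated_fraction_by_age(pop_frac, total_vaccinated)
    return float(np.sum(ifr_by_age * pop_frac * (1.0 - reduction * v)))

# Hypothetical groups ordered from youngest to oldest.
ifr = [0.001, 0.005, 0.03]
pop = [0.5, 0.3, 0.2]
ifr_no_vaccine = average_ifr(ifr, pop)
ifr_vaccinated = corrected_ifr(ifr, pop, total_vaccinated=0.35, reduction=0.8)
```

Here 35% coverage vaccinates the whole oldest group and half of the middle one, which lowers the average IFR from 0.8% to 0.26% in this toy population.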

Fig. 2

Infection fatality ratio as a function of time, estimated for São Paulo (SP), Amazonas (AM), Paraná (PR), and Espírito Santo (ES) states considering their demographics and vaccination rates.

Under-reporting coefficient

The evolution of the under-reporting coefficient using Type-I count series of two capital cities of Brazil that were severely impacted by the second infection wave of COVID-19, namely Manaus and São Paulo [42], is presented in Fig. 3, for which the estimated delays between case and death confirmations were and days, respectively; see Figs. S1(a) and (b) in the SM [60]. The delay is obtained by shifting the time series such that the peaks of deaths and cases coincide. We take the initial time as January 1, 2021. The evolution patterns of the under-reporting coefficient are different for these municipalities. While Manaus presents a high level of under-reporting (10 to 25) along the whole analyzed time series, in São Paulo the coefficient increases from approximately 3 at the beginning of 2021 to 17 in June.

Fig. 3

Evolution of under-reporting coefficient for the capital cities of (a) Manaus and (b) São Paulo estimated using moving time windows of three weeks for Type-I count series (see main text) as reported by the states' surveillance departments [56]. The confidence interval of 95% is shown in the shaded region.

We analyzed Type-II count series for PR and ES states, aggregating data of municipalities into immediate regions, defined by IBGE [62] as a group of nearby municipalities of the same state with intense interchange for immediate needs (purchasing, work, healthcare, education, and so on). Case and death series for the PR state present a delay of d between death and positive test report. For the ES state this delay is 20 d. The evolution of the under-reporting coefficient computed with counts aggregated by states and for two selected immediate regions is shown in Figs. 4(a) and 4(b) for PR and ES, respectively. Curves for the 28 and 8 immediate regions of PR and ES, respectively, with the confidence intervals, are available in Figs. S2 and S3 of the SM [60]. Note that the CFR, Fig. 4(c), and the under-reporting coefficient present different temporal patterns despite the correlation stated by Eq. (12). The second relevant outcome is the substantial variation of undocumented infections over time and across different places, with the coefficient varying by approximately one order of magnitude in Figs. 4(a) and 4(b). The under-reporting coefficients for all immediate regions of both PR and ES states are presented in Figs. 4(d) and 4(e); the chosen dates correspond to low and high CFR in the respective state counts.
Immediate regions can differ largely within the same time window. The space-time variability reflects the high diversity of outbreaks across different places, due to unsynchronized and unequal responses to the pandemic besides the demographic, economic, and developmental heterogeneity of states, as predicted [34] and later observed [42] for the first epidemic wave in Brazil.

Fig. 4

Evolution of under-reporting coefficient for (a) PR and (b) ES states using time windows of three weeks. Two immediate regions of each state are presented in the corresponding panels. (c) Evolution of the CFR computed using delays d and 20 d for PR and ES states, respectively. Under-reporting coefficients for all immediate regions of (d) PR and (e) ES and for the states (indicated by arrows), computed when the CFR is low (January 2021) and high (April 2021).

Determination of the initial conditions

To apply the calibration method of Section 2.5, we performed the analysis for case counts of the PR state shown in Fig. 5; see Fig. S4 of the SM [60] for the ES state. We further simplified the analysis by assuming the same infection rate for both asymptomatic and symptomatic individuals prior to diagnosis confirmation, which leaves a single parameter to fit the data. The ratio between testing probabilities of symptomatic and asymptomatic individuals is fixed to . The calibrated curves match each other within the confidence intervals for a variation of one order of magnitude in this ratio. Typical calibration curves are presented in Fig. 5(a)–(i) for different times using a 14-day moving window of calibration. A forecast of one week is also presented to verify the calibration robustness; it reproduces very well the short-term progression of the cumulative case count time series. The method also performs very well for smaller geographical scales such as immediate regions; see Fig. S5 of the SM [60]. The goodness of the fit is quantitatively verified considering two simple statistical analyses of the regressions: the Pearson coefficient and the mean absolute percent error (MAPE) [63]. The MAPE is calculated from to d as where and . Both and MAPE indices are shown in the corresponding panels of Figs. S5 and S4 of the SM [60]. All Pearson coefficients are statistically significant ($p$-value ) for all regressions, while the MAPE values are at most of order of %, indicating an excellent match between model forecasting and data.
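For reference, the two goodness-of-fit measures named above can be computed as in the following sketch, which uses their standard textbook definitions. The function names and the four-point series are hypothetical and serve only to illustrate the calculation; the interval over which the paper evaluates the MAPE is not reproduced here.

```python
import numpy as np

def mape(observed, predicted):
    """Mean absolute percent error, in percent; observed values must be non-zero."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(100.0 * np.mean(np.abs((observed - predicted) / observed)))

def pearson_r(observed, predicted):
    """Pearson correlation coefficient between data and model."""
    return float(np.corrcoef(observed, predicted)[0, 1])

# Hypothetical cumulative case counts and model values over four days.
data = [100, 110, 120, 130]
model = [102, 108, 123, 130]
error_percent = mape(data, model)
r = pearson_r(data, model)
```

Cumulative series are monotonic, so the Pearson coefficient tends to be close to one even for mediocre fits; the MAPE is therefore the more discriminating of the two measures for this kind of data.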

Fig. 5

Calibration curves for PR state in different time windows of 14 days indicated by the vertical lines. The initial day is indicated at the top of each panel. One week of forecasting is also shown. Symbols are the cumulative case counts while lines with shaded regions represent the calibrated curves and the corresponding confidence interval of 95%.

The evolution of three undocumented epidemic compartments that are central to setting up the initial conditions (exposed, asymptomatic, and symptomatic), yielded by the calibration method for PR state from January to May 2021, is presented in Fig. 6. Note that the ratio between the total number of infected individuals and the number of confirmed cases on a given day, Fig. S1(c) of the SM [60], is much higher than the under-reporting coefficient shown in Fig. 4, since the latter refers to the end of the epidemic chain, where an infection ends up documented or not, whereas the former refers to the number of infected individuals on a given day who have not been documented yet. Comparing the undocumented infectious populations shown in Fig. 6 with the daily confirmed cases shown in Fig. S1(c) of the SM [60], one sees that the peaks of prevalence of undocumented infections happen slightly before the peaks of incidence of confirmed cases. Fig. 6 shows an increase of the unconfirmed cases in the same period (mid-April to May of 2021) when the under-reporting was higher for the PR state; Fig. 4(a). One explanation for this behavior is vaccination, which leads to less aggressive manifestations of the infection and less seeking of medical attention and testing.

Fig. 6

Evolution of the undocumented compartments (exposed, asymptomatic and symptomatic) for the PR state since January 1, 2021. Error bars represent the confidence interval of 95%.

Effective reproduction number

The effective reproduction number R_t for the PR state, calculated following the standard definition for compartmental models given by with given by Eq. (4) and given by the calibration procedure described in Section 2.5, is presented in Fig. 7. The calibration is sensitive to variations and inflections in the case count series used; the mean value of R_t oscillated between approximately 0.9 and 1.2 in the analyzed period. We performed a sensitivity analysis of R_t and verified that its value is almost independent of the testing rates of asymptomatic compartments. More precisely, the curves of R_t collapse within the confidence interval when the ratios between testing probabilities are varied by one order of magnitude. We also analyzed the sensitivity to the asymptomatic and symptomatic infection rates by varying the ratio by a factor of 2 for fixed and, again, the reproduction number coincides within the confidence intervals. According to Eq. (4), depends explicitly on the asymptomatic and symptomatic infection rates, which are affected by the choice of the ratio in the calibration procedure. However, the increase (or reduction) of the infection rates and in the calibration is compensated by the reduction (or increase) in the population such that the reproduction number is insensitive to the choice of this ratio.
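
The statement that curves "collapse within the confidence interval" can be made operational with a simple check: count the fraction of time points at which a curve obtained with alternative parameters lies inside the 95% confidence band of the reference curve. The sketch below is only an illustration of such a check, not the procedure used in the paper; the function name and all values are hypothetical.

```python
def fraction_within_ci(curve, ci_lower, ci_upper):
    """Fraction of time points where `curve` lies inside [ci_lower, ci_upper]."""
    if not (len(curve) == len(ci_lower) == len(ci_upper)):
        raise ValueError("all series must have the same length")
    inside = sum(lo <= x <= hi for x, lo, hi in zip(curve, ci_lower, ci_upper))
    return inside / len(curve)


# Hypothetical reproduction-number curve obtained with an alternative parameter
# choice, compared with the confidence band of a reference curve.
rt_alternative = [0.95, 1.02, 1.10, 1.18, 1.05]
ci_lower = [0.90, 0.98, 1.05, 1.10, 1.00]
ci_upper = [1.00, 1.08, 1.16, 1.22, 1.12]

print(fraction_within_ci(rt_alternative, ci_lower, ci_upper))
```

A value equal or close to 1 means the alternative curve is statistically indistinguishable from the reference one at the chosen confidence level.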

Fig. 7

Evolution of effective reproductive number computed for the PR state considering and . The confidence interval of 95% is shown in the shaded region for the black curve.

Discussion

The pandemic caused by SARS-CoV-2 led to unprecedented efforts bringing together the scientific community, epidemic surveillance, public authorities, and communication systems to provide almost real-time, publicly available counts of diagnosed infections, deaths, and other important statistics of the COVID-19 spread across the globe. Available epidemic series, however, are still not ideal due to our limited capacity to document all infections in due time. Moreover, these limitations vary enormously across different places and at different moments. Nevertheless, this opens new avenues for the construction and improvement of tools to extract information that is not explicit in the data. A particularly promising strategy is the data-driven approach [34], [35], [38], in which mathematical and mechanistic models are fueled by data, allowing the estimation of variables that are not explicitly available. In the case of SARS-CoV-2, the important class of asymptomatic or pre-symptomatic infections, in which individuals transmit the pathogen even without symptoms, is crucial yet very costly to detect in epidemic surveillance systems. In the present work, we follow a data-driven approach using a compartmental model to estimate the number of undocumented cases in the epidemic compartments that are not directly accessible in surveillance systems. The method allows the estimation of the fraction of undocumented infections using the case fatality ratio (CFR) and biological parameters that, in principle, can be estimated in controlled studies, in particular the infection fatality ratio (IFR). We applied the method to epidemic series of diagnosed cases and deaths of two Brazilian states where the dates of symptom onset were available. We selected the first semester of 2021, when Brazil was struck by a second epidemic wave of COVID-19, mainly driven by the Gamma variant (lineage P.1).
We calculated an under-reporting coefficient, giving the ratio between infections that end up diagnosed or not, from count series of COVID-19 cases and deaths. Our analysis reports a large variation of this coefficient over time and also across different locations in the same period. The under-reporting coefficients are used to estimate testing rates, which are inputs of a calibration method that allows the estimation of the initial conditions for the undocumented compartments, in particular the asymptomatic and exposed ones. While, on the one hand, the presented numbers should not be interpreted as accurate estimates of the actual epidemic prevalence, on the other hand, they clearly demonstrate that the infected individuals who can potentially seek medical assistance are a minor part of all cases. Interestingly, the effective reproduction number is almost insensitive to the testing rate of asymptomatic cases, indicating that undocumented infections do not substantially affect this important epidemic indicator. The method can be generalized to stratified data including age contact matrices [64] or metapopulation approaches [34], [35]. However, the main lesson is that initial conditions for undocumented compartments can be inferred using a simple mechanistic approach, based on compartmental models fueled by epidemiological series of diagnosed cases and deaths. Nonetheless, the accuracy of the method depends on good estimates of biological parameters, mainly the IFR, which changes as the epidemic scenario is altered. For example, vaccination is expected to reduce the IFR, while the emergence of more aggressive variants can increase it. We developed a simple data-driven approach to estimate the IFR evolution from time series of vaccination rates. As a forthcoming continuation of the present work, we could investigate different time distributions for epidemic transitions, akin to applied epidemiology, using, for example, Monte Carlo approaches.
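
To make the role of the CFR and IFR concrete, the following back-of-the-envelope sketch estimates the undocumented fraction of infections. It assumes that all deaths due to the infection are documented and ignores delays between confirmation and death, so that deaths = IFR × (all infections) and CFR = deaths / (confirmed cases), which gives a documented fraction of IFR/CFR. This simplified relation is an assumption made for illustration and is not necessarily the exact definition of the under-reporting coefficient used in this work; the function name and the numbers are hypothetical.

```python
def undocumented_fraction(confirmed_cases, deaths, ifr):
    """Fraction of infections never documented, assuming all deaths are documented.

    With deaths = ifr * total_infections and cfr = deaths / confirmed_cases,
    the documented fraction is confirmed_cases / total_infections = ifr / cfr.
    """
    if confirmed_cases <= 0 or deaths <= 0:
        raise ValueError("confirmed_cases and deaths must be positive")
    if not 0.0 < ifr <= 1.0:
        raise ValueError("ifr must lie in (0, 1]")
    cfr = deaths / confirmed_cases
    if cfr < ifr:
        # Would imply more documented cases than infections under these assumptions.
        raise ValueError("cfr below ifr is inconsistent with the assumptions")
    return 1.0 - ifr / cfr


# Hypothetical numbers: 10,000 confirmed cases, 200 deaths (CFR = 2%), IFR = 0.5%.
print(undocumented_fraction(10_000, 200, 0.005))
```

With these illustrative values only one in four infections is documented, which shows how a CFR well above the IFR signals substantial under-reporting.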
Code and data: The Fortran and Python codes used for calibration and for processing the epidemic series were made publicly available in [59]. A description of the datasets and codes can be found in the SM [60].

CRediT authorship contribution statement

Guilherme S. Costa: Conceptualization, Formal analysis, Data curation, Investigation, Methodology, Software, Visualization, Writing – review & editing. Wesley Cota: Conceptualization, Data curation, Methodology, Software, Validation, Visualization, Writing – review & editing. Silvio C. Ferreira: Conceptualization, Formal analysis, Funding acquisition, Methodology, Project administration, Writing – original draft.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
