New statistical model for misreported data with application to current public health challenges.

David Moriña^1,2, Amanda Fernández-Fontelo^3, Alejandra Cabaña^4, Pedro Puig^5,4.

Abstract

The main goal of this work is to present a new model able to deal with potentially misreported continuous time series. The proposed model is able to handle the autocorrelation structure in continuous time series data, which might be partially or totally underreported or overreported. Its performance is illustrated through a comprehensive simulation study considering several autocorrelation structures and three real data applications on human papillomavirus incidence in Girona (Catalonia, Spain) and Covid-19 incidence in two regions with very different circumstances: the early days of the epidemic in the Chinese region of Heilongjiang and the most current data from Catalonia.

Year: 2021 PMID: 34857815 PMCID: PMC8640038 DOI: 10.1038/s41598-021-02620-5

Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379

Introduction

Interest in time series data that are only partially registered, or underreported, has grown in recent years. The phenomenon is common in many fields and has been explored through different approaches in epidemiology and in social and biomedical research, among other contexts[1-5]. The sources and underlying mechanisms of underreporting may differ depending on the particular data. Some authors consider a situation where the registry is updated over time, so that the underreporting is mitigated[6]. That setting leads to temporary underreporting, whereas this work focuses on permanent underreporting, where the registered data are never updated to become more accurate. From the methodological point of view, several alternatives have been explored, from Markov chain Monte Carlo based methods[5] to recent discrete time series approaches[7,8]. Several attempts have been made to estimate the degree of underreporting in different contexts[9], although there is a lack of models that both incorporate continuous time series structures and handle underreporting.

Interest in addressing underreporting is particularly high in the epidemiology of infectious diseases. In the last few years, many approaches to underreported data have been suggested, with a growing level of sophistication, from the use of multiplication factors[10] to several Markov-based models[11,12] and spatio-temporal modelling[13]. An R[14] package able to fit endemic-epidemic models to underreported count data, based on approximate maximum likelihood, has also been published recently[15]. This work presents two examples where this phenomenon appears.

Human papillomavirus (HPV) is one of the most prevalent sexually transmitted infections. It is so common that nearly all sexually active people have it at some point in their lives, according to the United States’ Centers for Disease Control and Prevention (CDC)[16]. Generally, the infection disappears on its own without causing any health problem, but in some cases it can produce an abnormal growth of cells on the surface of the cervix that could lead to cervical cancer. HPV infection is also related to other cancers (vulva, vagina, penis, anus) and other diseases such as genital warts (GW). Because most cases of HPV infection are asymptomatic, public health registries might underestimate its incidence. The underreporting phenomenon in HPV data has recently been studied from the discrete time series point of view[7].

The 2019 novel coronavirus (SARS-CoV-2) infection has raised enormous global concern in the last few months, leading the World Health Organization (WHO) to declare a public health emergency[17]. As the symptoms of this infection can easily be confused with those of similar diseases such as Middle East Respiratory Syndrome Coronavirus (MERS-CoV) or Severe Acute Respiratory Syndrome Coronavirus (SARS-CoV), its incidence has been notably underreported, especially at the beginning of the outbreak in Wuhan (Hubei province, China) in December 2019.

Methods

The proposed methodology is described in detail in this section, along with an introduction of the real data examples used to illustrate its performance. All the analyses developed to generate the results reported in this paper were conducted in R and the figures were generated using the R packages ggplot2[18] and ggfortify[19].

Application examples

The first real example, discussed in detail in the “Example: HPV infection incidence” section, analyzes the series of weekly cases of HPV infection in Girona in the period 2010–2014. This data set is available from the Health Department of the Catalan Government (https://www.ics.gencat.cat/sisap/diagnosticat/principal?patologia=Papil%B7loma&lang=en). The second example (“Example: Covid-19 incidence in the region of Heilongjiang” section), on daily SARS-CoV-2 infections in the Chinese region of Heilongjiang in the period 2020/01/22–2020/02/26, was collected from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University, hosted on GitHub (https://github.com/CSSEGISandData/COVID-19/). The third real example, described in the “Example: Covid-19 incidence in Catalonia” section, again focuses on Covid-19 infection, but in Catalonia in the period 2021/05/16–2021/06/20, where the series shows a completely different behavior. This data set is freely available from the Health Department of the Catalan Government (https://dadescovid.cat/static/csv/casos_sexe_municipi.zip). These examples were chosen because there is broad consensus in the scientific community that both diseases (HPV and Covid-19) are severely underreported, and because the three series behave very differently, which allows us to illustrate the performance of the proposed methodology in very different situations. No data processing was conducted in any case beyond selecting the regions and time periods of interest. The final data sets and R code used to obtain the described results are available in the GitHub repository https://github.com/dmorinya/MisRepARMA.

Model definition

Consider an unobservable process $X_t$ with an AutoRegressive Moving Average (ARMA(p, r)) structure defined by

$$X_t = \alpha_1 X_{t-1} + \dots + \alpha_p X_{t-p} + \theta_1 \epsilon_{t-1} + \dots + \theta_r \epsilon_{t-r} + \epsilon_t, \qquad (1)$$

where $\epsilon_t$ is a Gaussian white noise process with $\epsilon_t \sim N(\mu_\epsilon, \sigma_\epsilon^2)$. ARMA processes belong to the family of so-called linear processes. Their importance rests on the fact that any stationary nondeterministic process can be written as a sum of a linear process and a deterministic component[20]. These models are very well known, have been used in many applications since their introduction in the early 1950’s and are general and flexible enough to be useful in a wide range of different contexts. The most widely used statistical software packages include functions that allow straightforward fitting of this family of models, so it seems a natural choice in the present work. In our setting, this process cannot be directly observed, and all we can see is a part of it, expressed as

$$Y_t = \begin{cases} X_t & \text{with probability } 1-\omega, \\ q\,X_t & \text{with probability } \omega. \end{cases} \qquad (2)$$

The interpretation of the parameters in Eq. (2) is straightforward: q is the overall intensity of misreporting (if $0 < q < 1$ the observed process $Y_t$ would be underreported, while if $q > 1$ the observed process $Y_t$ would be overreported). The parameter $\omega$ can be interpreted as the overall frequency of misreporting (proportion of misreported observations). The proposed model is a particular case of Hierarchical Mixtures-of-Experts (HME) modelling (see[21,22] for instance), with an ARMA process instead of a linear model in the hidden layer.
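To make the roles of q and the misreporting frequency concrete, the following minimal Python sketch simulates a hidden AR(1) process and a misreported version of it. It is an illustration only, not the authors' R code: it assumes that Eq. (2) reports each hidden value unchanged with probability 1 − ω and multiplied by q with probability ω, and the function name, seed and parameter values are hypothetical.

```python
import random


def simulate_misreported_ar1(n, alpha, mu_eps, sigma_eps, q, omega, seed=0, burn_in=200):
    """Simulate a hidden AR(1) series X and its misreported version Y.

    Assumed form (illustrative): X_t = alpha * X_{t-1} + eps_t with
    eps_t ~ N(mu_eps, sigma_eps^2); Y_t = q * X_t with probability omega
    and Y_t = X_t otherwise.
    """
    rng = random.Random(seed)
    x = mu_eps / (1.0 - alpha)  # start at the stationary mean of X
    hidden, observed, flags = [], [], []
    for t in range(n + burn_in):
        x = alpha * x + rng.gauss(mu_eps, sigma_eps)
        if t < burn_in:
            continue  # discard the warm-up period
        z = 1 if rng.random() < omega else 0  # 1 = misreported observation
        hidden.append(x)
        flags.append(z)
        observed.append(q * x if z else x)
    return hidden, observed, flags


hidden, observed, flags = simulate_misreported_ar1(
    n=5000, alpha=0.5, mu_eps=1.0, sigma_eps=0.5, q=0.3, omega=0.6)
mean_x = sum(hidden) / len(hidden)
mean_y = sum(observed) / len(observed)
print(f"mean of hidden series: {mean_x:.3f}, mean of observed series: {mean_y:.3f}")
```

Under these assumptions the ratio of the two sample means should be close to 1 − ω + qω, which is the share of the hidden level that reaches the registry.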

Model properties

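Consider that the unobserved process $X_t$ follows an ARMA(p, r) model as defined in Eq. (1), and denote its mean and variance by $\mu_X$ and $\sigma_X^2$. As can be seen in Appendix 1 (Supplementary Material), the observed process $Y_t$ has mean $E(Y_t) = (1-\omega+q\omega)\,\mu_X$ and variance $Var(Y_t) = (1-\omega+q^2\omega)(\sigma_X^2+\mu_X^2) - (1-\omega+q\omega)^2\mu_X^2$. The autocorrelation function of the observed process can be written in terms of the features of the hidden process as

$$\rho_Y(k) = \frac{(1-\omega+q\omega)^2\,\sigma_X^2}{(1-\omega+q^2\omega)(\sigma_X^2+\mu_X^2) - (1-\omega+q\omega)^2\mu_X^2}\;\rho_X(k), \qquad (3)$$

where $\rho_X$ is the autocorrelation function of the unobserved process $X_t$. A situation of particular interest is the case $\omega = 1$, meaning that all the observations might be underreported and that a simpler model for $Y_t$ excluding the parameter $\omega$ might be suitable. In this case, however, the observed process $Y_t$ would be a non-identifiable ARMA(p, r) model, as the parameter q cannot be estimated on the basis of the methodology described in the following section.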
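The effect of misreporting on the autocorrelation function can be checked numerically. The sketch below is a hedged illustration rather than the formula printed in the Appendix: it assumes that the misreporting indicators are independent of each other and of the hidden process, in which case the observed autocorrelation at any lag k ≥ 1 is the hidden one multiplied by a constant no larger than 1. The function name and the example values are hypothetical.

```python
def acf_attenuation(mu_x, var_x, q, omega):
    """Ratio rho_Y(k) / rho_X(k) for k >= 1 under the assumed mixture model.

    mu_x, var_x: mean and variance of the hidden process (assumed known here).
    """
    a = 1.0 - omega + q * omega        # expected value of the multiplier
    b = 1.0 - omega + q * q * omega    # expected value of the squared multiplier
    var_y = b * (var_x + mu_x ** 2) - (a * mu_x) ** 2  # variance of the observed process
    return a * a * var_x / var_y


print(acf_attenuation(mu_x=2.0, var_x=1.0 / 3.0, q=0.3, omega=0.6))
```

The ratio equals 1 when ω = 0, when q = 1 and also when ω = 1, which agrees with the remark that the fully misreported case behaves like an ordinary ARMA process from which q cannot be recovered.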

Estimation

The likelihood function of the observed process $Y_t$ is not easily computable, but the parameters of the model can be estimated by means of an iterative algorithm based on its marginal distribution, using the R packages mixtools[23] and forecast[24,25]. The main steps are described in detail below:

(i) Following Eq. (2), the observed process can be written as $Y_t = (1 - Z_t)\,X_t + q\,Z_t\,X_t$, where $Z_t$ is an indicator of the underreported observations, following a Bernoulli distribution with probability of success $\omega$. The marginal distribution of $Y_t$ is a mixture of two normal random variables, $N(\mu, \sigma^2)$ and $N(q\mu, q^2\sigma^2)$ respectively, where $\mu$ and $\sigma^2$ are the mean and variance of the hidden process $X_t$. This fact can be used to obtain initial estimates for q and $\omega$. Using the EM algorithm (specifically the E-step), the posterior probabilities of $Z_t$ (conditional on the data and the obtained estimates) can be computed. This can be done using, for instance, the R package mixtools.
(ii) Using the indicator $\hat{Z}_t$ obtained in the previous step, the series is divided in two: one including the underreported observations (treating the non-underreported values as missing data) and another with the non-underreported observations (treating the underreported values as missing data). An ARMA model is fitted to each of these two series, and a new $\hat{q}$ is obtained by dividing the fitted means.
(iii) A mixture of two normals is fitted to the observed series with mean and standard deviation fixed to the corresponding values obtained from the previous step, and a new $\hat{\omega}$ is estimated.
(iv) Steps (ii) and (iii) are repeated until the quadratic distance between two consecutive iterations is below a fixed tolerance level.
(v) Once the parameter estimates are stable according to the previous criterion, the underlying process is reconstructed by dividing the observations flagged as underreported by $\hat{q}$, and an ARMA model is fitted to the reconstructed process to obtain the remaining parameter estimates.

To account for potential trends or seasonal behaviour, covariates can be included in the described estimation process by expressing the observed series in terms of the covariates, so that its stationarity is ensured. The corresponding coefficients can be estimated by Ordinary Least Squares (OLS). Additionally, a parametric bootstrap procedure with 500 replicates is used to estimate standard errors and build confidence intervals based on the percentiles of the distribution of the estimates.

In order to make the described methodology easily accessible to statisticians and data scientists, it has been compiled in the form of the R package MisRepARMA[26]. Non-expert users facing this issue can also use an adapted version of the package through the web application https://dmorina.shinyapps.io/MisRepARMA/.
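The first step of the procedure (a two-component normal mixture fitted by EM to obtain initial values of q and ω) can be sketched in plain Python. This is a simplified stand-in under stated assumptions, not the MisRepARMA implementation: it ignores the ARMA structure, does not impose the constraint linking the two components, assumes underreporting (0 < q < 1) with a positive hidden mean, and uses hypothetical function names and simulated data.

```python
import math
import random


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_normal_mixture(y, n_iter=500, tol=1e-10):
    """Plain EM for a two-component normal mixture; autocorrelation is ignored."""
    ys = sorted(y)
    n = len(ys)
    centre = sum(ys) / n
    spread = max(math.sqrt(sum((v - centre) ** 2 for v in ys) / n), 1e-6)
    means = [ys[n // 4], ys[(3 * n) // 4]]  # crude start: lower and upper quartiles
    sds = [spread, spread]
    weights = [0.5, 0.5]
    post = [0.5] * n
    for _ in range(n_iter):
        # E-step: posterior probability that each point comes from component 0.
        post = []
        for v in y:
            p0 = weights[0] * normal_pdf(v, means[0], sds[0])
            p1 = weights[1] * normal_pdf(v, means[1], sds[1])
            post.append(p0 / (p0 + p1) if p0 + p1 > 0.0 else 0.5)
        # M-step: weighted means, standard deviations and mixing weights.
        n0 = max(sum(post), 1e-12)
        n1 = max(n - n0, 1e-12)
        new_means = [sum(p * v for p, v in zip(post, y)) / n0,
                     sum((1.0 - p) * v for p, v in zip(post, y)) / n1]
        var0 = sum(p * (v - new_means[0]) ** 2 for p, v in zip(post, y)) / n0
        var1 = sum((1.0 - p) * (v - new_means[1]) ** 2 for p, v in zip(post, y)) / n1
        sds = [max(math.sqrt(var0), 1e-6), max(math.sqrt(var1), 1e-6)]
        weights = [n0 / n, n1 / n]
        shift = sum((a - b) ** 2 for a, b in zip(new_means, means))
        means = new_means
        if shift < tol:
            break
    return weights, means, sds, post  # post comes from the last E-step


def initial_estimates(y):
    """Initial q and omega plus a rough reconstruction of the hidden series."""
    weights, means, sds, post = fit_two_normal_mixture(y)
    low = 0 if means[0] < means[1] else 1  # lower-mean component = underreported
    q_hat = means[low] / means[1 - low]
    omega_hat = weights[low]
    z_prob = [p if low == 0 else 1.0 - p for p in post]
    x_hat = [v / q_hat if zp > 0.5 else v for v, zp in zip(y, z_prob)]
    return q_hat, omega_hat, x_hat


# Hypothetical data: hidden values around 10, 60% of them scaled by q = 0.3.
rng = random.Random(1)
y = []
for _ in range(2000):
    x = rng.gauss(10.0, 1.0)
    y.append(0.3 * x if rng.random() < 0.6 else x)
q_hat, omega_hat, x_hat = initial_estimates(y)
print(f"q_hat = {q_hat:.3f}, omega_hat = {omega_hat:.3f}")
```

In the procedure described above these values would only be a starting point, to be refined by the alternating ARMA and mixture fits.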

Results

The results of the proposed methodology over a comprehensive simulation study and its application to three real data sets are shown in this section.

Simulation study

A thorough simulation study has been conducted to ensure that the model behaves as expected, including AR(p), MA(r) and ARMA(p, r) structures for the hidden process, with values for the parameters $\alpha$, $\theta$, q and $\omega$ ranging from 0.1 to 0.9 for each parameter (some combinations of parameters have been omitted for higher-order models to ensure stationarity). For higher-order ARMA(p, r) structures the parameters covered the same range (0.1 to 0.9) but in steps of 0.2 instead of 0.1 for computational feasibility. Only the average absolute bias and the average length and coverage of the 95% confidence intervals corresponding to first-order models are shown in Table 1, as higher-order models behave in a very similar manner (see Supplementary Material for details). These values are averaged over all combinations of parameters. Additionally, standard AR(1), MA(1) and ARMA(1, 1) models were fitted to the same simulated series without accounting for their underreporting structure.

Table 1

Model performance measures (average absolute bias, average interval length and average coverage) summary based on a simulation study.

Structure	Parameter	Bias	AIL	Coverage (%)
AR(1)	$\hat{\alpha}$	0.004	0.100	94.92
	$\hat{q}$	$<10^{-3}$	$<10^{-3}$	93.14
	$\hat{\omega}$	$<10^{-3}$	0.050	93.69
Standard AR(1)	$\hat{\alpha}$	0.500	0.124	0.96
MA(1)	$\hat{\theta}$	$<10^{-3}$	0.116	96.02
	$\hat{q}$	$<10^{-3}$	$<10^{-3}$	94.79
	$\hat{\omega}$	−0.001	0.050	90.26
Standard MA(1)	$\hat{\theta}$	0.499	0.124	1.23
ARMA(1, 1)	$\hat{\alpha}$	0.003	0.161	95.66
	$\hat{\theta}$	0.005	0.211	96.97
	$\hat{q}$	$<10^{-3}$	0.001	94.91
	$\hat{\omega}$	$<10^{-3}$	0.050	94.06
Standard ARMA(1, 1)	$\hat{\alpha}$	0.492	3.056	52.48
Standard ARMA(1, 1)	$\hat{\theta}$	0.509	3.055	51.14

For each autocorrelation structure and parameter combination, a random sample has been generated using the function arima.sim from R package forecast[24,25]. Different sample sizes have also been considered to study the impact of sample size on accuracy, and the results are reported in the Supplementary Material (Tables S1–S4, with sample sizes up to 1000). Average absolute bias is similar regardless of the sample size, while average interval lengths (AIL) are higher and interval coverages are poorer (down to around 75%) for lower sample sizes, as could be expected. Several bootstrap sizes were also considered and the differences between them were negligible, so only results corresponding to 500 bootstrap replicates are reported. It is clear from Table 1 that ignoring the underreported nature of the data (labeled as Standard models in the table) leads to highly biased estimates with extremely low coverage rates, even with larger average interval lengths. This is especially relevant when the intensity or frequency of underreported observations is high.
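The three performance measures reported in Table 1 can be computed from the replicate-level output of a simulation along the following lines. This is a sketch under assumptions: the exact definitions used by the authors are not spelled out in the text, so bias is taken here as the mean absolute difference between estimate and true value, and the function name and toy numbers are hypothetical.

```python
def performance_summary(results, true_value):
    """Summarise simulation output for one parameter.

    results: list of (estimate, ci_lower, ci_upper), one tuple per simulated series.
    Returns (average absolute bias, average interval length, coverage in %).
    """
    n = len(results)
    bias = sum(abs(est - true_value) for est, _, _ in results) / n
    ail = sum(upper - lower for _, lower, upper in results) / n
    covered = sum(1 for _, lower, upper in results if lower <= true_value <= upper)
    return bias, ail, 100.0 * covered / n


# Toy input: three replicates for a parameter whose true value is 0.5.
toy = [(0.48, 0.40, 0.56), (0.55, 0.45, 0.65), (0.30, 0.20, 0.40)]
bias, ail, coverage = performance_summary(toy, true_value=0.5)
print(f"bias = {bias:.3f}, AIL = {ail:.3f}, coverage = {coverage:.1f}%")
```

Averaging these summaries over all parameter combinations would then give one row of a table such as Table 1.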

Example: HPV infection incidence

The series of weekly cases of HPV infection in Girona in the period 2010–2014 was previously analyzed as a discrete INAR(1) hidden Markov process[7]. In a similar way, we aim to analyze the corresponding series of incidence, and an AR process of order 1 seems to be adequate (see Fig. 1). Additionally, the AR(1) structure has the lowest AIC when compared to similar alternative models like AR(2), ARMA(1, 1) and MA(1) (the AIC is 299.31 for AR(1), against 300.47, 300.49 and 299.68 respectively). According to Eq. (3), the autocorrelation function of the observed process when the hidden process has an AR(1) structure takes the form $\rho_Y(k) = c\,\alpha^k$, where $c$ is a constant that depends on the model parameters and equals 1 in the absence of misreporting. In particular, in this case we can write $\log(\rho_Y(k)) = \log(c) + k\log(\alpha)$, so a statistically significant intercept of this linear regression model (estimating the parameters by the ordinary least squares method) could be interpreted as evidence of underreporting, as in this case $c \neq 1$ ($\log(c) \neq 0$). It is clear from Fig. 1 that the estimated regression line does not cross the origin, so the behavior of the observed process is consistent with an underlying underreported AR(1) process.

Figure 1

Sample autocorrelation coefficients (red points) and estimated regression line (black solid line) of $\log(\hat{\rho}_Y(k))$ against the lag $k$.

The estimation method described in the “Estimation” section yields an AR(1) model for the hidden process together with the corresponding misreported observed process. The estimated parameters are reported in Table 2.
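The diagnostic based on regressing the log-autocorrelations on the lag can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function names and the choice of four lags are assumptions, and the regression is only defined when all the sample autocorrelations used are positive.

```python
import math


def sample_acf(series, max_lag):
    """Sample autocorrelations at lags 1..max_lag."""
    n = len(series)
    mean = sum(series) / n
    c0 = sum((v - mean) ** 2 for v in series) / n
    return [sum((series[t] - mean) * (series[t + k] - mean) for t in range(n - k)) / (n * c0)
            for k in range(1, max_lag + 1)]


def log_acf_regression(series, max_lag=4):
    """OLS fit of log(rho_hat(k)) = intercept + slope * k for k = 1..max_lag.

    For a correctly reported AR(1) series the intercept should be close to 0;
    a clearly negative intercept is compatible with misreporting.
    """
    acf = sample_acf(series, max_lag)
    if min(acf) <= 0.0:
        raise ValueError("log-ACF regression needs positive sample autocorrelations")
    lags = list(range(1, max_lag + 1))
    logs = [math.log(r) for r in acf]
    lag_mean = sum(lags) / max_lag
    log_mean = sum(logs) / max_lag
    slope = (sum((k - lag_mean) * (v - log_mean) for k, v in zip(lags, logs))
             / sum((k - lag_mean) ** 2 for k in lags))
    intercept = log_mean - slope * lag_mean
    return intercept, slope
```

Calling `log_acf_regression(observed_series)` returns the fitted intercept and slope; the exponential of the slope gives a rough estimate of the autoregressive coefficient of the hidden process. A formal test of the intercept would additionally need its standard error, which this sketch does not compute.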

Table 2

Bootstrap means and standard errors of the proposed model for the HPV example.

Parameter	Bootstrap mean	Bootstrap SE
$\hat{\mu}_{\epsilon}$	0.575	0.100
$\hat{\alpha}$	0.114	0.056
$\hat{\omega}$	0.832	0.135
$\hat{q}$	0.238	0.068

These results are highly consistent with those previously reported in the literature for the number of HPV cases obtained through a discrete time series approach[7] and can be interpreted in a very straightforward way. Moreover, this new methodology can be used to model the incidence of the disease instead of the number of cases, accounting for potential changes in the underlying population. The estimated intensity of underreporting is $\hat{q} = 0.238$ (Table 2), with 95% confidence interval (0.106, 0.371). The registered and estimated evolution of HPV incidence within the study period (2010–2014) can be seen in Fig. 2.

Figure 2

Registered (black solid line) and estimated (red dotted line) HPV incidence in Girona in the period 2010–2014.

These results indicate that only 33% of the estimated HPV incidence in the considered period was actually recorded. Since public health cervical cancer prevention strategies are often designed on the basis of simulation models calibrated to registered HPV data[27], providing decision makers with accurate data on HPV incidence is key to ensuring optimal allocation of scarce public health funds.

Example: Covid-19 incidence in the region of Heilongjiang

The betacoronavirus SARS-CoV-2 has been identified as the causative agent of an unprecedented world-wide outbreak of pneumonia starting in December 2019 in the city of Wuhan (China)[17], named as Covid-19. Considering that many cases run without developing symptoms beyond those of MERS-CoV, SARS-CoV or pneumonia due to other causes, it is reasonable to assume that the incidence of this disease has been underregistered, especially at the beginning of the outbreak[28]. This section focuses on the Covid-19 incidence registered in Heilongjiang province (north-eastern China) in the period (2020/01/22–2020/02/26), and it can be seen in Fig. 3 that the registered data (black color) reflect only a fraction of the estimated actual incidence (red color).

Figure 3

Registered (black solid line) and estimated (red dotted line) COVID-19 incidence in the region of Heilongjiang in the period 2020/01/22–2020/02/26.

Another respiratory disease caused by a coronavirus (MERS-CoV) has been modeled in a previous work as an ARMA(3, 1)[29], so we evaluated the performance of this model and similar ones. Probably because of the shortness of the available series, this autoregressive structure was not observed, and in our case the best performing model was an MA(1) (AIC of −140.17 against −136.1 for the ARMA(3, 1)), consistent with the residuals profile shown in Fig. 4, obtained by fitting an MA(1) model to the most likely process reconstructed following step (v) in the “Estimation” section.

Figure 4

Residual analysis (raw residuals (upper graph), autocorrelation coefficients (lower graph left) and histogram (lower graph right)) after fitting a MA(1) model to the Heilongjiang COVID-19 data.

The estimation method described in the “Estimation” section yields an MA(1) model for the hidden process together with the corresponding misreported observed process. The estimated parameters are reported in Table 3.

Table 3

Bootstrap means and standard errors of the proposed model for the Heilongjiang Covid-19 example.

Parameter	Bootstrap mean	Bootstrap SE
$\hat{\mu}_{\epsilon}$	0.057	0.012
$\hat{\theta}$	0.528	0.173
$\hat{\omega}$	0.436	0.160
$\hat{q}$	0.157	0.076

Example: Covid-19 incidence in Catalonia

The Covid-19 incidence in Catalonia in the period 2021/05/16–2021/06/20 looks totally different. As can be seen in Fig. 5, these data present a slight decreasing trend and weekly seasonality. The decreasing trend is probably a consequence of a successful vaccination campaign, while the weekly seasonality is artificially created by issues in the notification process: the lowest numbers of cases are consistently observed at weekends, while the weekly peak is observed on Mondays. To account for the trend, a simple linear regression on time was included as a covariate, and a trigonometric function was used to incorporate the observed periodic behaviour.

Figure 5

Registered (black solid line) and estimated (red dotted line) COVID-19 incidence in Catalonia in the period 2021/05/16–2021/06/20.

In this case, the best fitting model according to AIC and residuals profile is an AR(2). As shown in Table 4, the estimates related to underreporting reveal a lower intensity (although a higher frequency, probably related to its periodicity) of the issue compared to the previous example, as could be expected. In fact, $\hat{\alpha}_1$ is not significantly different from zero, so a simpler model for uncorrelated misreported data like the one proposed in[30] might be enough.

Table 4

Bootstrap means and standard errors of the proposed model for the Catalonia Covid-19 example.

Parameter	Bootstrap mean	Bootstrap SE
$\hat{\beta}_0$	11.513	1.251
$\hat{\beta}_1$	−0.078	0.013
$\hat{\beta}_2$	−1.037	0.246
$\hat{\beta}_3$	−2.599	0.234
$\hat{\alpha}_1$	0.0173	0.184
$\hat{\alpha}_2$	−0.372	0.187
$\hat{\omega}$	0.782	0.230
$\hat{q}$	0.712	0.089

Discussion

In biomedical and epidemiological research, the usage of disease registries in order to analyze the impact and incidence of health issues is very common. However, the accuracy and data quality of such registries is in many cases at least doubtful. This is the case, for instance, for rare diseases[31] or health issues that clear asymptomatically in most cases like HPV infection. In the case of HPV incidence in Girona in the period 2010–2014, the registered weekly average is 0.17 cases per 100,000 individuals, while the reconstructed process has a weekly average of 0.51 cases per 100,000 individuals, showing that only 33% of the estimated real incidence is recorded by the public health system. It must be considered that HPV infection is related to subsequent complications such as cervical cancer in some cases and that public health cervical cancer prevention strategies are often designed on the basis of simulation models which are calibrated to registered HPV data[27] and therefore the optimal allocation of scarce public health resources cannot be ensured if the under-reporting issue is not accounted for. This result is very consistent with that of[7], where the authors claim that only 38% of the HPV cases were registered in the same area and period of time. The Heilongjiang region Covid-19 data reveal that in average about 60% of the estimated actual incidence in the period 2020/01/22–2020/02/26 was reported. The unavailable data estimated by the proposed methodology are crucial to provide public health decision-makers with reliable information, which can also be used to improve the accuracy of dynamic models aimed to estimate the spread of the disease[28]. In China and almost globally afterwards, different non-pharmaceutical interventions were undertaken in order to minimise the impact of the disease on the general population and especially over the health systems, which were put to the limit of their capacity by the pandemic. 
In this context, one of the main challenges in predicting the evolution of the disease or evaluating the impact of these strategies is to use data that are as accurate as possible, bearing in mind that many Covid-19 cases are asymptomatic or mild and that there is a generalized shortage of testing kits[32], so the registered number of affected individuals might be severely underestimated. The analysis of Covid-19 incidence in a completely different context (very recent daily data from a European region) shows that the model behaves as expected and is capable of handling trends and seasonality. In the Catalan case, the model reveals that more than 74% of the cases in the period 2021/05/16–2021/06/20 were registered. These examples are only used to illustrate the performance of the proposed methodology; to properly analyze the evolution of an infectious disease with the behaviour shown by Covid-19, models that take the spreading dynamics into account are probably more appropriate (see[33,34] for instance). Concerns about the accuracy of registered data have recently led to the publication of recommendations to improve data collection and ensure the accuracy of registries (see for instance[35,36]). Nonetheless, these recommendations are very recent and may be difficult for the public health services of many countries to fully implement, due to operational or structural issues. The proposed methodology is able to deal with underreported (or overreported) data in a natural and straightforward way, estimating the intensity and frequency of the misreporting on a continuous time series and allowing the most likely unobserved process to be reconstructed. It is also flexible enough to handle covariates straightforwardly, so it is simple to introduce trends or seasonality if necessary, and it can be useful in the many contexts where these issues might arise.
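The 33% figure quoted for the HPV example is simply the ratio of the registered weekly average to the reconstructed one. The short sketch below reproduces that arithmetic; the function name `reported_fraction` is a hypothetical helper introduced here for illustration and is not part of the paper's software.

```python
def reported_fraction(registered_rate: float, reconstructed_rate: float) -> float:
    """Share of the estimated true incidence captured by the registry."""
    if reconstructed_rate <= 0:
        raise ValueError("reconstructed_rate must be positive")
    return registered_rate / reconstructed_rate


# Weekly averages per 100,000 individuals quoted in the text for HPV in Girona.
hpv_fraction = reported_fraction(0.17, 0.51)
print(f"HPV reported fraction: {hpv_fraction:.0%}")
```

The same ratio, computed on the reconstructed series of the other two examples, is what underlies the "about 60%" and "more than 74%" statements; only the HPV averages are given explicitly in the text, so only those are used here.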
The simulation study shows that the proposed methodology behaves as expected and that the parameters used in the simulations, under different autocorrelation structures, are properly recovered, regardless of the intensity and frequency of the underreporting. It also reveals that using standard time series models can lead to severely biased estimates and low coverage rates, while the proposed methodology can overcome the issue of underreporting and provide unbiased and efficient inference. The methods introduced in this paper could be considered a starting point for developing more general methods able to deal with non-stationary continuous time series, adapting the ideas developed in[33] for the discrete case. From the applied point of view, it would be very interesting to use this kind of model to analyze other issues that might be underreported and to analyze more thoroughly the examples used to illustrate the performance of the discussed models.
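To make the bias of a naive analysis concrete, the following is a minimal sketch rather than the paper's simulation design. It assumes a latent AR(1) process and an illustrative reporting mechanism in which each observation is independently multiplied by an intensity `Q` with probability `OMEGA`. The names `OMEGA`, `Q`, `PHI`, `MU` and `simulate`, as well as the chosen values, are assumptions made for this example.

```python
import random

# Illustrative values (assumptions, not taken from the paper).
OMEGA = 0.3   # frequency: probability that an observation is underreported
Q = 0.4       # intensity: fraction of the true value that is recorded
PHI = 0.5     # AR(1) autocorrelation of the latent process
MU = 10.0     # mean of the latent process


def simulate(n=20000, seed=1):
    """Simulate a latent AR(1) series and its underreported observed version."""
    rng = random.Random(seed)
    x = MU
    latent, observed = [], []
    for _ in range(n):
        # Latent (true) process: stationary AR(1) around MU with unit-variance noise.
        x = MU + PHI * (x - MU) + rng.gauss(0.0, 1.0)
        # Observed process: scaled by Q when the observation is underreported.
        underreported = rng.random() < OMEGA
        latent.append(x)
        observed.append(Q * x if underreported else x)
    return latent, observed


def mean(values):
    return sum(values) / len(values)


latent, observed = simulate()
naive_ratio = mean(observed) / mean(latent)
expected_ratio = 1 - OMEGA + OMEGA * Q  # average fraction recorded under this mechanism
print(f"naive mean / true mean = {naive_ratio:.3f} (expected about {expected_ratio:.2f})")
```

Under these assumptions the mean of the observed series understates the latent mean by a factor close to `1 - OMEGA + OMEGA * Q`, which is the kind of bias a standard time series model fitted directly to the registered data would inherit.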

22 in total

1. Untangling serially dependent underreported count data for gender-based violence.

Authors: Amanda Fernández-Fontelo; Alejandra Cabaña; Harry Joe; Pedro Puig; David Moriña
Journal: Stat Med Date: 2019-07-29 Impact factor: 2.373

2. Under-reported data analysis with INAR-hidden Markov chains.

Authors: Amanda Fernández-Fontelo; Alejandra Cabaña; Pedro Puig; David Moriña
Journal: Stat Med Date: 2016-07-11 Impact factor: 2.373

3. The parameter identification problem for SIR epidemic models: identifying unreported cases.

Authors: Pierre Magal; Glenn Webb
Journal: J Math Biol Date: 2018-01-13 Impact factor: 2.259

4. Towards a Core Set of Indicators for Data Quality of Registries.

Authors: Sonja Harkener; Jürgen Stausberg; Christiane Hagel; Roman Siddiqui
Journal: Stud Health Technol Inform Date: 2019-09-03

5. Impact of model calibration on cost-effectiveness analysis of cervical cancer prevention.

Authors: David Moriña; Silvia de Sanjosé; Mireia Diaz
Journal: Sci Rep Date: 2017-12-08 Impact factor: 4.379

6. Estimating the Unreported Number of Novel Coronavirus (2019-nCoV) Cases in China in the First Half of January 2020: A Data-Driven Modelling Analysis of the Early Outbreak.

Authors: Shi Zhao; Salihu S Musa; Qianying Lin; Jinjun Ran; Guangpu Yang; Weiming Wang; Yijun Lou; Lin Yang; Daozhou Gao; Daihai He; Maggie H Wang
Journal: J Clin Med Date: 2020-02-01 Impact factor: 4.241

7. CDC grand rounds: Reducing the burden of HPV-associated cancer and disease.

Authors: Eileen F Dunne; Lauri E Markowitz; Mona Saraiya; Shannon Stokley; Amy Middleman; Elizabeth R Unger; Alcia Williams; John Iskander
Journal: MMWR Morb Mortal Wkly Rep Date: 2014-01-31 Impact factor: 17.586

8. Measuring underreporting and under-ascertainment in infectious disease datasets: a comparison of methods.

Authors: Cheryl L Gibbons; Marie-Josée J Mangen; Dietrich Plass; Arie H Havelaar; Russell John Brooke; Piotr Kramarz; Karen L Peterson; Anke L Stuurman; Alessandro Cassini; Eric M Fèvre; Mirjam E E Kretzschmar
Journal: BMC Public Health Date: 2014-02-11 Impact factor: 3.295

9. Temporal dynamics of Middle East respiratory syndrome coronavirus in the Arabian Peninsula, 2012-2017.

Authors: M A Alkhamis; A Fernández-Fontelo; K VanderWaal; S Abuhadida; P Puig; A Alba-Casals
Journal: Epidemiol Infect Date: 2018-10-08 Impact factor: 2.451

10. World Health Organization declares global emergency: A review of the 2019 novel coronavirus (COVID-19).

Authors: Catrin Sohrabi; Zaid Alsafi; Niamh O'Neill; Mehdi Khan; Ahmed Kerwan; Ahmed Al-Jabir; Christos Iosifidis; Riaz Agha
Journal: Int J Surg Date: 2020-02-26 Impact factor: 6.071