Jie Hu1, Wei Liang1, Hongsheng Dai2, Yanchun Bao2. 1. School of Mathematical Science, Xiamen University, China. 2. Department of Mathematical Sciences, University of Essex, UK.
Abstract
Doubly censored data are very common in epidemiology studies. Ignoring censorship in the analysis may lead to biased parameter estimation. In this paper, we highlight that publicly available COVID19 data may involve a high percentage of double-censoring and point out the importance of dealing with such missing information in order to achieve better forecasting results. Existing statistical methods for doubly censored data may suffer from the convergence problems of EM algorithms or may not be good enough for small sample sizes. This paper develops a new empirical likelihood method to analyze the recovery rate of COVID19 based on a doubly censored dataset. The efficient influence function of the parameter of interest is used to define the empirical likelihood (EL) ratio. We prove that −2 log(EL-ratio) asymptotically follows a standard χ² distribution. This new method does not require any scale parameter adjustment for the log-likelihood ratio and thus does not suffer from the convergence problems involved in traditional EM-type algorithms. Finite sample simulation results show that this method provides much less biased estimates than existing methods when the censoring percentage is large. The application to COVID19 data will help researchers in other fields to achieve better estimates and forecasting results.
Doubly censored data, with both right and left censoring, occur when time-to-event data are censored either from above or from below. Doubly censored data are very common in studies of infectious diseases with incubation periods. Left censoring happens when the originating date of the incubation period is not fully observed due to practical sampling factors beyond experimental control. The date of the failure event is often right-censored. A particular doubly censored dataset from an AIDS study can be found in [4]. Another example is the time from symptom onset to recovery for people who contract COVID19. For COVID19 studies [20], the incubation rate and the recovery rate are key factors for understanding the epidemiology. In particular, in the current COVID19 outbreak, a better understanding of the recovery rate will help governments to take the right intervention strategy at the right time. However, much of the existing research on COVID19 is based on published information from government or ministry-of-health websites and media reports [20]. Such data have a high percentage of missing information, i.e. a high percentage of left or right censoring. This may distort the estimation of the recovery rate, which could further distort the epidemiology model forecasting; as we can see from [5], different model parameters will give very different forecasting results. The dataset used in [20] is from https://github.com/mrc-ide/COVID19_CFR_submission, which has a large amount of missing information on the symptom onset and on the date of recovery. Our main research interests here are to employ survival analysis techniques [8], [9] to study the recovery time, i.e. the time from symptom onset to recovery, and to study the sensitivity of the recovery rate in the epidemiology forecasting. The recovery times are clearly subject to right censoring, because when the data were reported, recovery had not yet happened for many patients.
Therefore the right censoring time R is the time from the symptom onset date to the reporting date, and we denote the event time of interest (from symptom onset to recovery) by T. See Fig. 1 for the scenarios in which right censoring happens. Under right censoring (T > R), we will have no information about the left-censoring time L, the time from the recorded exposure ending date to recovery. On the other hand, since symptoms usually occur after exposure to the virus, when the symptom onset date is missing and the reporting date is also missing but the date of exposure to the virus is available, we can impose the reasonable condition T ≤ L on the event time, which gives the left censoring time L. So when left censoring occurs, the time L runs from the date of exposure to the date of recovery. See Fig. 2 for details of left censoring. Under left censoring (T < L), we will have no information about R. When L ≤ T ≤ R, we will observe T but cannot observe L and R. This is shown in Fig. 3. In such cases we usually have either L = T (symptoms immediately occur after exposure; events recorded on the same day) or L < T (the recorded exposure date means the ending time of an exposure period). We also have T ≤ R, which means that recovery occurs before reporting.
Fig. 1
T > R: right censoring; only R is observed.
Fig. 2
T < L: left censoring; only L is observed; exposure date observed and symptom date missing.
Fig. 3
L ≤ T ≤ R: T is observed; the exposure date means the ending date of the exposure period.
In summary, under double censoring, the event time T is observed if L ≤ T ≤ R. We observe L in the case of left censoring with T < L, or observe R in the case of right censoring with T > R. Let (T_i, L_i, R_i), i = 1, …, n, be independent copies of (T, L, R); then the observations under double censorship can be summarized as independent pairs (Z_i, δ_i), where Z_i = max(min(T_i, R_i), L_i) and δ_i = 1, 2 or 3 according to whether T_i is observed (L_i ≤ T_i ≤ R_i), right-censored (T_i > R_i) or left-censored (T_i < L_i). Usually, we assume that the event time T is independent of the censoring vector (L, R). Denote by F the cumulative distribution function of T. Suppose that we are interested in a parameter θ defined by a functional θ = Γ(F). Many important parameters can be represented in this form, or sometimes we obtain θ via a corresponding estimating equation E[g(T, θ)] = 0. For example, if we are interested in the expectation of a known function φ, then θ = E[φ(T)], and the corresponding estimating equation is E[φ(T) − θ] = 0. Other examples include: [1.] θ is the cumulative hazard function at a given time t0, i.e. θ = Λ(t0), with estimating equation E[I(T ≤ t0)/(1 − F(T−))] − θ = 0; [2.] θ is the mean residual life time at a given time t0, i.e. θ = E[T − t0 | T > t0], with estimating equation E[(T − t0 − θ)I(T > t0)] = 0. To draw inference on the unknown parameter θ, a straightforward approach is to first estimate the distribution function F [2], [3], [17], [19]. Using the estimated distribution function, an asymptotic-normality based confidence interval for the parameter of interest can be constructed via the asymptotic variance estimator of the parameter estimate. But there are two main drawbacks associated with this method. First, the asymptotic variance usually takes a complicated form. Secondly, confidence intervals based on the asymptotic normal distribution do not always perform well for small samples. Other existing research on doubly censored data may depend on specific model assumptions, such as (quantile) regression analysis [7], [15], [23] and two-sample tests [16].
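To make the observation scheme concrete, the following sketch simulates doubly censored pairs (Z_i, δ_i). The distributions of T, L and R are illustrative, hypothetical choices, not those used later in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Illustrative (hypothetical) distributions for the event time T and the
# censoring pair (L, R) with L <= R, independent of T.
T = rng.uniform(0, 10, n)        # event time, e.g. recovery time
L = rng.uniform(0, 3, n)         # left-censoring time
R = L + rng.uniform(5, 9, n)     # right-censoring time

# Observed pair (Z, delta): Z = max(min(T, R), L), with
# delta = 1 (observed), 2 (right-censored), 3 (left-censored).
Z = np.maximum(np.minimum(T, R), L)
delta = np.where(T > R, 2, np.where(T < L, 3, 1))

prop_left = np.mean(delta == 3)
prop_right = np.mean(delta == 2)
```

Adjusting the supports of L and R − L is how the simulation studies below tune the left and right censoring proportions.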
In this paper, we will solve these estimation problems via the empirical likelihood method [12], which is a very useful tool for constructing confidence intervals for θ in nonparametric settings. Based on the estimating equation E[g(T, θ)] = 0, the original Empirical Likelihood (OEL) ratio in [12] is defined as R(θ) = max{∏_{i=1}^n n p_i : Σ_i p_i g(T_i, θ) = 0, Σ_i p_i = 1, p_i ≥ 0}. It can be proved that −2 log R(θ0) converges in distribution to a χ² distribution at the true value θ0. A very important work [13] generalized the EL method to make inference for parameters defined by general estimating equations. In general, the empirical likelihood approach has a number of advantages; for example, the shape of the confidence region is determined automatically by the data. In many cases, the log empirical likelihood ratio statistic has an asymptotic χ² distribution, therefore the confidence interval for θ can be constructed without estimating the asymptotic variance. However, applying OEL methods to incomplete data will lead to a scaled result. When the data are right censored, [21] utilized the Buckley–James estimator to define the estimating equation, and proved that the asymptotic distribution of the corresponding log-likelihood ratio is a scaled χ² distribution. This limiting distribution can be used to construct the confidence interval for θ, if the scale parameter is estimated. To avoid estimating the scale parameter, [6] used the efficient influence function of the parameter under right censorship to define the log-likelihood ratio statistic and proved that its asymptotic distribution is a χ² distribution. The confidence interval for θ based on this method is much more accurate. Under double censoring, [14] proposed the Leveraged Bootstrap Empirical Likelihood (LBEL) by combining the EL method with the bootstrap. Since the asymptotic distribution of the log-likelihood ratio based on the LBEL method is a scaled χ² distribution, the scale parameter needs to be estimated in practice as an adjustment coefficient.
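For complete data, the OEL ratio above can be computed by profiling out a Lagrange multiplier λ that solves Σ_i g_i/(1 + λ g_i) = 0. The sketch below does this for the mean (g(X, θ) = X − θ) on simulated data; it illustrates Owen's original EL only, not the doubly censored method developed later:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import chi2

def neg2_log_el_ratio(x, theta):
    """-2 log(OEL ratio) for the mean, with g(X, theta) = X - theta."""
    g = x - theta
    if g.min() >= 0 or g.max() <= 0:
        return np.inf                      # theta outside the convex hull
    eps = 1e-10
    # Solve sum(g / (1 + lam * g)) = 0 on the interval keeping 1 + lam*g > 0.
    lam = brentq(lambda l: np.sum(g / (1.0 + l * g)),
                 -1.0 / g.max() + eps, -1.0 / g.min() - eps)
    return 2.0 * np.sum(np.log1p(lam * g))

rng = np.random.default_rng(1)
x = rng.exponential(2.0, 200)              # illustrative complete data

stat_at_mean = neg2_log_el_ratio(x, x.mean())   # essentially zero
stat_true = neg2_log_el_ratio(x, 2.0)           # moderate at the true mean
stat_far = neg2_log_el_ratio(x, 3.5)            # large far from the mean
cutoff = chi2.ppf(0.95, df=1)                   # chi-square calibration
```

A candidate θ is retained in the confidence region when the statistic falls below the χ²₁ quantile (about 3.84 at the 95% level).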
Besides, the LBEL method demands that the parameter of interest be a linear functional of F. Noticing that the EL likelihood function is not the real likelihood function for doubly censored data, [11] defined the likelihood function based on the observations (Z_i, δ_i):

L_DC(F) = ∏_{i: δ_i=1} ΔF(Z_i) ∏_{i: δ_i=2} S(Z_i) ∏_{i: δ_i=3} F(Z_i),   (1)
where DC is the abbreviation for Double Censoring and S = 1 − F is the survival function. Using (1), [11] showed that this log-likelihood ratio, subject to nonparametric moment constraints, obeys the Wilks phenomenon under some assumptions. This method avoids the scale parameter, but it is computationally difficult to find the nonparametric maximum likelihood. To solve this problem, [16] proposed an EM algorithm to calculate the log-likelihood ratio statistic. However, the EM algorithm may suffer from the problem of convergence to a local maximum point. Different from [16], we investigate another approach in this paper. Inspired by [6], we develop the likelihood statistic defined by the efficient score function for the parameter of interest θ. This method is called the Efficient-EL method in our paper. Under this new approach, we demonstrate that the log empirical likelihood ratio converges to the standard χ² distribution without using any scale parameter adjustment, which means that the confidence intervals for different kinds of parameters can be obtained by a unified algorithm. In the meantime, it is computationally much more efficient than existing EL methods under double censoring. The rest of the paper is organized as follows. The Efficient-EL inference for a differentiable functional parameter under doubly censored data is given in Section 2, including the large sample properties and the computing algorithm. Simulation studies of the Efficient-EL method and the EM-EL method proposed by [16] are provided in Section 3. We find that our approach performs much better for longer-tailed distributions, which usually lead to higher censoring proportions. In the meantime, the new method still performs as well as existing methods for lighter-tailed distributions, which lead to lower censoring proportions. An application to a COVID19 study based on our proposed methodology is presented in Section 4. The paper concludes with a discussion in Section 5.
Efficient empirical likelihood inference
Denote by F and G the distributions of T and (L, R) respectively. Suppose we are interested in the estimation problem for a parameter θ, and the corresponding estimating equation for θ is E[g(T, θ)] = 0. Since T cannot be observed unless it falls in [L, R], we define the censoring-adjusted estimating function

g̃(Z, δ, θ; F) = g(Z, θ)I(δ = 1) + E_F[g(T, θ) | T > Z]I(δ = 2) + E_F[g(T, θ) | T ≤ Z]I(δ = 3).

It is easy to see that, given the distribution F, we have E[g̃(Z, δ, θ; F)] = E[g(T, θ)] = 0, which gives an estimating equation for θ based on the observed data; the EL ratio can then be defined with g̃ in place of g. Substituting the unknown F with its consistent estimator, however, will lead to a scaled χ² asymptotic distribution. [11] used the likelihood function (1) to solve this problem. Different from their idea, we will reconsider the estimating equation to overcome the scaled asymptotic distribution problem.
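The censoring-adjusted estimating function can be checked numerically: replacing g(T, θ) with its conditional expectation given the observed (Z, δ) restores a mean-zero estimating function, while naively using Z is biased. A sketch under an assumed known F (exponential with mean 2; all distribution choices here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

T = rng.exponential(2.0, n)          # event time, F = Exp(mean 2)
L = rng.uniform(0.0, 1.0, n)         # left-censoring time
R = L + rng.uniform(2.0, 6.0, n)     # right-censoring time
Z = np.maximum(np.minimum(T, R), L)

theta = 2.0                          # true mean, so g(t, theta) = t - theta

# Conditional expectations of g(T, theta) under the true F:
# E[T | T > z] = z + 2 by memorylessness, and
# E[T | T <= z] = (2 - (z + 2) exp(-z/2)) / (1 - exp(-z/2)).
cond_right = R + 2.0 - theta
Fz = 1.0 - np.exp(-L / 2.0)
cond_left = (2.0 - (L + 2.0) * np.exp(-L / 2.0)) / Fz - theta

g_tilde = np.where(T > R, cond_right, np.where(T < L, cond_left, Z - theta))

mean_adjusted = g_tilde.mean()       # close to zero
mean_naive = (Z - theta).mean()      # noticeably biased
```

In practice F is unknown, which is exactly why plugging in an estimator of F changes the limiting distribution, as discussed above.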
The main theorems
Throughout, we work on the support of F, and assume that the following assumptions (A1) and (A2) hold.
Define the score operator A and its adjoint A* as below. The following lemma provides the efficient influence function for θ.

Lemma 2.1. Let {F_η} be a submodel of the model for F which approaches F in direction h. Assume (A1) and (A2) hold and the Hadamard derivative of Γ exists, denoted by Γ̇. Then the efficient influence function for θ = Γ(F) can be expressed through A and A*, where A is the score operator and A* is its corresponding adjoint operator.

Proof. See Appendix. □

The assumptions (A1) and (A2) guarantee that the operator A*A is invertible. The following are some examples of the derivative Γ̇ (in all of the examples we let t0 be fixed): [1.] for the mean θ = E[T], the derivative is represented by t − θ; [2.] for the kth moment θ = E[T^k], it is t^k − θ; [3.] for the cumulative distribution function θ = F(t0), it is I(t ≤ t0) − θ; [4.] for the cumulative hazard function θ = Λ(t0), a similar explicit representation holds. Since the operators A and A* depend on F, we should write them more precisely as A_F and A*_F, and denote the efficient influence function accordingly; it yields an estimating equation for θ. Notice that the nuisance parameter F is unknown, so we need to estimate it first.
[3] gave the self-consistent estimators of F and of the two censoring distribution functions by solving a system of self-consistency equations, referred to below as Eqs. (2)–(4).
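The self-consistency idea behind these equations can be sketched numerically. The following is a simplified, Turnbull-type iteration for F on a grid, with δ = 1/2/3 coding exact, right-censored and left-censored observations; it is a hedged stand-in for the exact estimator of [3], run on illustrative simulated data:

```python
import numpy as np

def self_consistent_cdf(z, delta, grid, n_iter=100):
    """Turnbull-type self-consistency iteration for the CDF F under double
    censoring (delta: 1 exact, 2 right-censored, 3 left-censored)."""
    F = np.linspace(0.05, 1.0, len(grid))           # crude starting value
    for _ in range(n_iter):
        Fz = np.interp(z, grid, F)                  # F at each observation
        Fnew = np.empty_like(F)
        for j, t in enumerate(grid):
            Ft = F[j]
            # E[I(T <= t) | observation] under the current F:
            p_exact = (z <= t).astype(float)
            p_right = np.clip(Ft - Fz, 0.0, None) / np.maximum(1.0 - Fz, 1e-12)
            p_left = np.minimum(Ft, Fz) / np.maximum(Fz, 1e-12)
            p = np.where(delta == 1, p_exact,
                         np.where(delta == 2, p_right, p_left))
            Fnew[j] = p.mean()
        F = Fnew
    return F

# Illustrative doubly censored sample with T ~ Uniform(0, 10).
rng = np.random.default_rng(5)
n = 500
T = rng.uniform(0, 10, n)
L = rng.uniform(0, 3, n)
R = L + rng.uniform(5, 9, n)
Z = np.maximum(np.minimum(T, R), L)
delta = np.where(T > R, 2, np.where(T < L, 3, 1))

grid = np.linspace(0.0, 12.0, 121)
F_hat = self_consistent_cdf(Z, delta, grid)
F_at_median = F_hat[grid.searchsorted(5.0)]     # true value is 0.5
```

Each update replaces F(t) by the average conditional probability of {T ≤ t} given the observed data, which is the fixed-point property defining a self-consistent estimator.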
Based on Eq. (2), a naive and simple iterative algorithm can be used to obtain the estimator F̂_n of F, and then the estimators of the censoring distribution functions can be calculated by Eqs. (3) and (4). In order to guarantee the asymptotic consistency and normality of these estimators, we assume that F and the censoring distributions satisfy conditions (A1)–(A6) in [2] throughout this paper. The efficient influence function for θ can then be estimated by plugging F̂_n into its expression; denote the resulting estimate at the ith observation by ℓ̂_i = ℓ̂(Z_i, δ_i, θ). For simplicity of notation, the corresponding Efficient-EL ratio is defined as

R̂(θ) = max{∏_{i=1}^n n p_i : Σ_i p_i ℓ̂_i = 0, Σ_i p_i = 1, p_i ≥ 0}.
Using Lagrangian multipliers, we further have −2 log R̂(θ) = 2 Σ_i log(1 + λ ℓ̂_i), where λ is the solution of Σ_i ℓ̂_i/(1 + λ ℓ̂_i) = 0, and the following asymptotic result holds.

Theorem 2.1. Suppose the assumptions in Lemma 2.1 hold, θ0 is the true value of the parameter of interest, and the relevant Hadamard derivative exists. Then −2 log R̂(θ0) converges in distribution to the standard χ² distribution with one degree of freedom.

Proof. Using Lemma B.1 and Lemma B.2 in the Appendix, this proof is similar to the proof for the original EL and is therefore omitted. □

Theorem 2.1 shows that the estimated log empirical likelihood ratio converges to the standard χ² distribution without adjustment, which means that the confidence intervals for different kinds of parameters can be obtained by a unified algorithm. Hence, a confidence region for the parameter θ with asymptotic coverage probability 1 − α can be defined as

I_α = {θ : −2 log R̂(θ) ≤ c_{1−α}},   (6)

where c_{1−α} is the (1 − α) quantile of the χ² distribution with one degree of freedom. By recalling the definition of the efficient influence function for θ, in the following subsection we present an algorithm for the numerical calculation of the Efficient-EL ratio and the confidence region I_α.
Algorithm for efficient-EL method
Before presenting the algorithm, we need to introduce some notation. For a given θ, define the least favorable direction of the submodel; the efficient influence function is then obtained along this direction. Notice that only the values of the efficient influence function at the sample points are needed, therefore we only calculate ℓ̂(Z_i, δ_i, θ) for i = 1, …, n. The following Corollary 2.1 gives a key equation which will be used in the Efficient-EL algorithm.

Corollary 2.1. The estimator F̂_n satisfies a self-consistency-type equation involving the least favorable direction, which provides the update step in the Efficient-EL algorithm.

The Efficient-EL ratio can then be calculated by the following algorithm, and hence the confidence interval for θ in (6) can be constructed using the output of this algorithm.
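Inverting the calibrated statistic into the confidence interval (6) amounts to collecting all θ whose statistic falls below the χ²₁ quantile. The sketch below illustrates this inversion step with complete data, using Owen's EL for the mean as a stand-in for the Efficient-EL statistic (the data and grid bounds are illustrative):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import chi2

def neg2_log_el(x, theta):
    # -2 log EL ratio for the mean (complete data), used only to
    # illustrate inverting the chi-square calibration in (6).
    g = x - theta
    if g.min() >= 0 or g.max() <= 0:
        return np.inf
    lam = brentq(lambda l: np.sum(g / (1.0 + l * g)),
                 -1.0 / g.max() + 1e-10, -1.0 / g.min() - 1e-10)
    return 2.0 * np.sum(np.log1p(lam * g))

rng = np.random.default_rng(2)
x = rng.exponential(2.0, 150)

cutoff = chi2.ppf(0.95, df=1)
grid = x.mean() + np.linspace(-1.5, 1.5, 601)      # candidate theta values
inside = [t for t in grid if neg2_log_el(x, t) <= cutoff]
ci_lower, ci_upper = min(inside), max(inside)       # 95% confidence interval
```

In practice a bisection on each side of the point estimate is cheaper than a full grid scan, but the grid version makes the shape of the region explicit.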
Simulation studies
In this section, we will implement simulation studies of the recovery time distribution, which is very important for the analysis of the Susceptible–Exposed–Infectious–Resistant (SEIR) epidemiology model. The SEIR model in epidemic disease studies involves four states, susceptible (S), exposed (E), infected (I) and resistant (R), via

dS/dt = −βSI/N,  dE/dt = βSI/N − σE,  dI/dt = σE − γI,  dR/dt = γI.

In this SEIR model, the infectious rate β controls the rate of spread, representing the probability of transmitting disease between a susceptible and an infectious individual. The incubation rate σ is the rate at which latent individuals become infectious (the average duration of incubation is 1/σ). The recovery rate γ is determined by the average duration of infection. N is the total population. The basic reproductive number, R0 = β/γ, does not change in this model. Here we focus on using the proposed double censoring model to estimate the recovery time, because the infection time only involves right censoring in the data and therefore can be estimated using standard right censoring techniques [8]. Therefore, our simulation focus will be on the mean recovery time and on model forecasting, to illustrate the importance of recovery time estimation for forecasting accuracy. We will also study the mean residual recovery time, which is very important for forecasting the expected additional recovery time given that a patient has not recovered by a certain time.
Simulation studies for recovery time
In this subsection, we will illustrate the performance of our method via different simulation scenarios. We denote by Uniform(a, b) the uniform distribution on (a, b), by Exp(θ) the exponential distribution with mean θ and by LogNormal(μ, σ) the Log-Normal distribution with parameters μ and σ. There are two parameters of interest. The first is the mean of T, denoted by μ, and its corresponding estimating equation is E[T − μ] = 0. Note that μ is the inverse of the mean recovery rate parameter γ. The second is the Mean Residual Lifetime (MRL) of T given T > t0, denoted by MRL(t0), and its corresponding estimating equation is E[(T − t0 − θ)I(T > t0)] = 0. MRL(t0) stands for the remaining mean time needed for an infected patient to recover. Based on the simulated data, we use all complete data to construct the benchmark confidence interval, named the complete data EL (or Complete-EL) result. We will compare the Efficient-EL confidence interval proposed in the previous section and the EM-EL confidence interval given in [16] with the benchmark Complete-EL results.
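Both estimating equations can be written down and checked directly. In the sketch below the data are illustrative Exp(2) draws; by the memoryless property of the exponential, MRL(t0) equals the mean there, which gives a convenient sanity check:

```python
import numpy as np

def mean_ee(t, theta):
    # Estimating function for the mean: E[T - theta] = 0.
    return t - theta

def mrl_ee(t, theta, t0):
    # Estimating function for MRL(t0): E[(T - t0 - theta) I(T > t0)] = 0.
    return (t - t0 - theta) * (t > t0)

rng = np.random.default_rng(3)
t = rng.exponential(2.0, 100_000)   # illustrative recovery times, mean 2

mean_hat = t.mean()                          # root of the first equation
t0 = 1.0
mrl_hat = (t[t > t0] - t0).mean()            # root of the second equation

resid_mean = mean_ee(t, mean_hat).mean()     # ~ 0 by construction
resid_mrl = mrl_ee(t, mrl_hat, t0).mean()    # ~ 0 by construction
```

Under censoring these sample roots are no longer valid, which is exactly the gap the Efficient-EL and EM-EL methods address.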
Simulation results for mean and mean residual lifetime
A Uniform distribution is considered as the underlying lifetime distribution in this subsection. The left censoring time and the censoring interval length are uniformly distributed. We set the censoring parameters to different values to achieve 10%, 20% and 30% left censoring proportions and 10%, 20% and 30% right censoring proportions, respectively. Based on the simulated data sets, we construct Efficient-EL confidence intervals, EM-EL confidence intervals and Complete-EL confidence intervals. The coverage probabilities for the mean and MRL(t0) are summarized in Table 1.
Table 1
Coverage probabilities for the mean and for MRL at the 10% and 50% quantiles under the Uniform distribution. The two percentages in each column stand for the left censoring proportion and the right censoring proportion. Efficient-EL results with better performance than EM-EL are highlighted in bold.
Mean

                          Nominal Level = 0.90           Nominal Level = 0.95
                       10%+10%  20%+20%  30%+30%     10%+10%  20%+20%  30%+30%
n = 50   Complete-EL     0.898    0.894    0.896       0.944    0.943    0.945
         Efficient-EL    0.897    0.888    0.877       0.944    0.935    0.932
         EM-EL           0.896    0.878    0.873       0.944    0.932    0.930
n = 80   Complete-EL     0.906    0.899    0.897       0.955    0.949    0.948
         Efficient-EL    0.903    0.892    0.887       0.950    0.942    0.940
         EM-EL           0.904    0.891    0.882       0.951    0.943    0.940
n = 100  Complete-EL     0.909    0.896    0.903       0.953    0.948    0.954
         Efficient-EL    0.906    0.898    0.899       0.951    0.946    0.944
         EM-EL           0.905    0.888    0.891       0.952    0.944    0.942

MRL (t0 = 10% quantile)

                       10%+10%  20%+20%  30%+30%     10%+10%  20%+20%  30%+30%
n = 50   Complete-EL     0.907    0.898    0.906       0.954    0.943    0.953
         Efficient-EL    0.904    0.874    0.854       0.949    0.929    0.910
         EM-EL           0.898    0.851    0.831       0.947    0.914    0.894
n = 80   Complete-EL     0.895    0.894    0.892       0.949    0.948    0.945
         Efficient-EL    0.892    0.880    0.866       0.941    0.931    0.921
         EM-EL           0.886    0.859    0.835       0.936    0.921    0.907
n = 100  Complete-EL     0.896    0.889    0.897       0.949    0.948    0.947
         Efficient-EL    0.896    0.884    0.868       0.949    0.936    0.923
         EM-EL           0.898    0.859    0.838       0.946    0.925    0.907

MRL (t0 = 50% quantile)

                       10%+10%  20%+20%  30%+30%     10%+10%  20%+20%  30%+30%
n = 50   Complete-EL     0.897    0.891    0.896       0.945    0.943    0.950
         Efficient-EL    0.871    0.838    0.819       0.928    0.896    0.877
         EM-EL           0.888    0.833    0.831       0.938    0.897    0.895
n = 80   Complete-EL     0.895    0.901    0.891       0.949    0.950    0.841
         Efficient-EL    0.885    0.856    0.844       0.935    0.912    0.909
         EM-EL           0.889    0.848    0.846       0.941    0.909    0.915
n = 100  Complete-EL     0.893    0.901    0.897       0.949    0.948    0.947
         Efficient-EL    0.892    0.873    0.871       0.942    0.928    0.923
         EM-EL           0.894    0.857    0.864       0.945    0.921    0.929
From these results, we notice that as the sample size increases, all coverage probabilities converge to the nominal levels. When n is fixed, the coverage probabilities of the Efficient-EL and EM-EL confidence intervals decrease as the censoring proportion increases. The coverage probabilities of the confidence intervals for the parameter MRL(t0) decrease when t0 increases. In all cases, the performance of the Efficient-EL and EM-EL methods is close to that of the Complete-EL method when the censoring proportion is not large. In the top half of Table 1, the Efficient-EL and EM-EL methods perform similarly. The difference between these two methods and the Complete-EL method is small, especially for small censoring proportions or large sample sizes. However, the performance of these methods for the parameter MRL(t0) is different (see the bottom half of Table 1). The coverage probabilities of the Efficient-EL confidence intervals are better than those of EM-EL for almost all scenarios when t0 is the 10% quantile of T. Meanwhile, the Efficient-EL method performs as well as EM-EL when t0 is the 50% quantile, in most cases. We also plot the results of Table 1 and draw the coverage probability curves of the different methods in Fig. 4. Compared to the EM-EL method, the Efficient-EL method shows a much better convergence pattern, converging faster to the Complete-EL results.
Fig. 4
The coverage probabilities for MRL(t0) under the Uniform distribution when the nominal level is 90%. The plots from left to right show the results for different censoring percentages: left plot, 10% left-censoring and 10% right-censoring; middle plot, 20% left-censoring and 20% right-censoring; right plot, 30% left-censoring and 30% right-censoring.
The impact of different censoring proportions and different distributions
In this subsection, we investigate the impact of different censoring proportions. Here we use the Exponential distribution and the Log-Normal distribution as the underlying distributions and consider the more complicated parameter MRL(t0), where t0 is the 30% quantile of the underlying distribution. For the exponential distribution, we set the left censoring time and the censoring interval length to follow exponential distributions. For the Log-Normal case, the left censoring time follows an exponential distribution and the censoring interval length follows a Log-Normal distribution. We set the parameters to different values to achieve 20% left censoring with 40% right censoring, and 40% left censoring with 20% right censoring, respectively. Based on the simulated data sets, the coverage probabilities are summarized in Fig. 5 and Fig. 6.
Fig. 5
The coverage probabilities for MRL under Exp distribution when nominal level is 90%. The left figure shows the results for 20% left and 40% right censoring proportion, while the right figure shows the result for 40% left and 20% right censoring proportion.
Fig. 6
The coverage probabilities for MRL under LogNorm distribution when nominal level is 90%. The left figure is the coverage probability curve under 20% left and 40% right censoring proportion setting, while the right figure shows the result under 40% left and 20% right censoring proportion setting.
We can see that a higher right censoring proportion leads to lower coverage probabilities. The coverage probabilities of the confidence intervals constructed by the proposed Efficient-EL approach are much better than those of the EM-EL method under the Exponential distribution. In Fig. 5, the left plot with 20% left censoring and 40% right censoring shows that Efficient-EL has coverage probability 0.80, which is much closer to the benchmark (about 0.90), while EM-EL only has coverage probability less than 0.60. The right plot with 40% left censoring and 20% right censoring also shows that Efficient-EL is better. In particular, EM-EL seems to have the problem of not converging to the benchmark 0.90 as the sample size increases. Under the Log-Normal distribution, EM-EL appears to perform similarly to Efficient-EL, but EM-EL does not show a clear pattern of convergence (see Fig. 6). In other words, as the sample size increases, the coverage probabilities of the Efficient-EL based confidence intervals steadily increase, while the coverage probabilities of EM-EL do not have a clear increasing pattern (the coverage probabilities of EM-EL may not converge to the nominal level as the sample size becomes larger). Taking one censoring-proportion setting as a specific example, as the sample size increases from 50 to 150, the coverage probabilities of Efficient-EL increase from 0.784 to 0.815, while those of EM-EL decrease from 0.845 to 0.820. See Fig. 6 for details. In summary, under both the exponential distribution and the log-normal distribution, the new Efficient-EL approach is more reliable for highly censored data.
SEIR model forecasting
In this subsection, we will show how the SEIR forecasting results are affected by choosing different values of the recovery rate parameter γ, i.e. the sensitivity with respect to γ. In our simulation we set the population to 1,000,000 and use discrete time steps of size 0.1 of a simulated day. We consider different reproduction numbers R0 and different average durations of infection, the corresponding recovery rate γ being the reciprocal of the duration. These values are chosen according to the data analysis results in Section 4. The incubation period (1/σ) is chosen between 2 and 10 days, which mimics real COVID data analysis results [22]. From the summarized results presented in Fig. 7, we can see that under different R0 and σ values, the total number of infections is highly affected by the recovery rate γ. The maximum number of infections can differ on the scale of 20,000 to 100,000 in a population of 1,000,000. Therefore, even if the confidence interval of the recovery rate is wrong at only a very small scale, the final forecasting results will be very different.
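The sensitivity described above can be reproduced with a standard SEIR integration. The sketch below (in Python with scipy, as a stand-in for the R package deSolve used in Section 4; β, σ and the initial state are illustrative values, not the paper's) holds the contact rate fixed and varies the recovery duration:

```python
import numpy as np
from scipy.integrate import odeint

def seir(y, t, beta, sigma, gamma, N):
    # Standard SEIR right-hand side.
    S, E, I, R = y
    new_inf = beta * S * I / N
    return (-new_inf, new_inf - sigma * E, sigma * E - gamma * I, gamma * I)

N = 1_000_000
sigma = 1.0 / 5.0                   # incubation rate (5-day latency, illustrative)
beta = 0.2                          # contact rate (illustrative), held fixed
times = np.linspace(0.0, 500.0, 5001)
y0 = (N - 10.0, 0.0, 10.0, 0.0)     # start with 10 infectious individuals

peak_infectious = {}
for recovery_days in (15, 20, 25):
    gamma = 1.0 / recovery_days     # recovery rate = 1 / mean recovery time
    sol = odeint(seir, y0, times, args=(beta, sigma, gamma, N))
    peak_infectious[recovery_days] = sol[:, 2].max()
```

With β fixed, a longer recovery period raises R0 = β/γ, so the peak number of infectious individuals grows sharply with the assumed recovery duration, illustrating the sensitivity shown in Fig. 7.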
Fig. 7
Maximum number of infections curves under different quarantine protocols. Three different sets of curves represent different recovery periods.
Analysis of COVID19 data
Recovery time analysis
There is already a vast literature on COVID19 research using the Susceptible–Exposed–Infectious–Resistant (SEIR) epidemiology model, based on which the UK government's lock-down strategy was made [5]. Recovery time is a very important factor in such SEIR models. However, publicly available data can have a large proportion of missing information, which prevents a proper analysis. For example, the dataset from https://github.com/mrc-ide/COVID19_CFR_submission has a large amount of missing information on the symptom onset and on the date of recovery. It actually gives a doubly censored dataset for the recovery time. The event time of interest is the time from symptom onset to recovery. The right censoring variable is from symptom onset to the reporting date. The left censoring variable is from the date of exposure (or the ending date of the exposure period) to recovery. The total number of observations used in our analysis is 547, and the data were collected from 20th January 2020 to 28th February 2020. Firstly, we list the censoring proportions of this dataset under different groups in Table 2. Using the Efficient-EL and EM-EL methods, the confidence intervals of the recovery time for the different groups can be calculated; these results are also listed in Table 2. From Table 2, we can see that the elder groups have longer average recovery periods, but there is no significant difference between males and females. The confidence intervals based on the EM-EL method tend to be shorter than those of Efficient-EL. This corresponds to the simulation results, where EM-EL has worse coverage probability in most cases.
Table 2
The analysis of COVID19 data for different groups.
Group          Proportion                  Sample   Efficient-EL          EM-EL                 Mean
               Left    Observed   Right    size     CI lower  CI upper    CI lower  CI upper
Male           0.052   0.185      0.763    323      17.370    22.153      18.482    20.827      19.842
Female         0.063   0.184      0.753    218      18.171    21.567      18.541    22.186      20.243
Age under 30   0.140   0.215      0.645    85       9.527     22.411      15.596    20.014      17.759
Age 30–50      0.082   0.212      0.707    186      17.651    21.172      17.853    21.528      19.605
Age 50–60      0.066   0.168      0.766    115      18.947    23.813      19.856    23.955      21.731
Age 60–70      0.076   0.124      0.800    83       17.902    24.627      19.680    24.786      22.041
Age over 70    0.089   0.089      0.822    68       18.951    25.614      19.935    23.970      22.173
Overall        0.059   0.170      0.771    547      18.784    20.928      18.804    20.837      20.013
We also carry out a simulation study similar to [5] to compare the forecasting results based on different model parameter values, in order to address the importance of parameter estimation for such forecasting analyses. We set the transmission parameters according to [5]. Since the SEIR model does not include mortality, we classify death and recovered as one group, re-estimate the recovery time and obtain the 95% confidence interval with mean 20.013. Hence, three different recovery periods are considered in our simulation: a short duration of 15 days (corresponding to results without using double censoring analysis, with no right censoring, over-estimating the recovery rate), a medium duration of 20 days (corresponding to our result based on double censoring) and a long duration of 25 days (corresponding to results without using double censoring analysis, with no left censoring, under-estimating the recovery rate). We also consider two different quarantine protocols: no government interventions, following [5], and mild government interventions, which lead to different parameter values in our simulation. All of our simulations are carried out via the R package deSolve for the SEIR model. The daily new cases are plotted in Fig. 8, where the dashed, solid and dotted curves correspond to the three different recovery periods. For both quarantine protocols we can see that, with a shorter recovery time, the COVID19 outbreak will end much more quickly. Also, the peak of daily infected cases will be much smaller under the scenario of a shorter recovery time.
Fig. 8
Increased infections curves before and after quarantine. Three different sets of curves, from left to right, represent different recovery periods.
To achieve the herd immunity proposed by the UK government requires a large proportion of the UK population to be immune to the virus to stop it from spreading. It is well known that such herd immunity can be attained through vaccination or through recovery following infection. Based on our result using a sophisticated double censoring statistical model, we can see clearly that the recovery period should be much shorter than the estimated figures proposed by other existing works. The peaks of the three curves occur on substantially different days: the latter two on day 479 and day 705, respectively (each with its 95% confidence interval). Therefore, with a slight over- or under-estimation of the recovery rate, the forecast peak date will differ on a scale of months. This would imply that the outbreak could end about four months earlier than people expected.
Conclusions
Through our COVID19 forecasting analysis and [5], we can see that correct estimation of the SEIR model parameters may change the final forecasting results significantly; for example, the peak-date estimate may differ on the scale of months. For such a rapidly spreading disease, it is extremely challenging to carry out real-time monitoring of the pandemic [1]. Data collected in real time will certainly involve different kinds of censoring. This paper highlighted the importance of handling censored data and presented an efficient new statistical estimation approach. By utilizing the efficient influence function of the parameter of interest as an estimating equation, a new method of constructing EL confidence intervals for doubly censored data is proposed in this paper. This new Efficient-EL method is easy to compute since it does not need to estimate a scale parameter. Simulation studies show that the new method performs better than the EM-EL method in terms of coverage probabilities.
Comparing model predictions under our estimated recovery-rate parameter with those under parameter values used in other research works, we found that the predicted peaks of the epidemic could differ from each other by months. This could lead to wrong health-policy decisions, for example imposing or lifting lock-downs at the wrong time points, which may cause a second peak of the outbreak, or may make the lock-down period so long that it causes severe economic damage and mental-health problems for more people. Our analysis highlights that such sophisticated survival analysis provides better estimation of the parameters in the SEIR models.
To our knowledge, this is the first work to use censoring techniques from survival analysis to carry out parameter estimation for COVID19 data.
Most existing COVID19 research, such as [10] and [5], did not address the issue of data heavily contaminated by censoring, or simply used prespecified model parameters. Although only a relatively small dataset is used here, the methodology can be applied by other researchers who have access to larger COVID19 datasets with individual-level information. It will facilitate interdisciplinary collaboration between statisticians and epidemiologists and help policy makers in public health decision making.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References
Paul J Birrell, Lorenz Wernisch, Brian D M Tom, Leonhard Held, Gareth O Roberts, Richard G Pebody, Daniela De Angelis. Ann Appl Stat, March 2020.
Robert Verity, Lucy C Okell, Ilaria Dorigatti, Peter Winskill, Charles Whittaker, Natsuko Imai, Gina Cuomo-Dannenburg, Hayley Thompson, Patrick G T Walker, Han Fu, Amy Dighe, Jamie T Griffin, Marc Baguelin, Sangeeta Bhatia, Adhiratha Boonyasiri, Anne Cori, Zulma Cucunubá, Rich FitzJohn, Katy Gaythorpe, Will Green, Arran Hamlet, Wes Hinsley, Daniel Laydon, Gemma Nedjati-Gilani, Steven Riley, Sabine van Elsland, Erik Volz, Haowei Wang, Yuanrong Wang, Xiaoyue Xi, Christl A Donnelly, Azra C Ghani, Neil M Ferguson. Lancet Infect Dis, 30 March 2020.
Adam J Kucharski, Timothy W Russell, Charlie Diamond, Yang Liu, John Edmunds, Sebastian Funk, Rosalind M Eggo. Lancet Infect Dis, 11 March 2020.