Literature DB >> 34898640

Quantile estimation of semiparametric model with time-varying coefficients for panel count data.

Abstract

Panel count data frequently occurs in follow-up studies, such as medical research, social sciences, reliability studies, and tumorigenicity experiences. This type data has been extensively studied by various statistical models with time-invariant regression coefficients. However, the assumption of invariant coefficients may be violated in some reality, and the temporal covariate effects would be of great interest in research studies. This motivates us to consider a more flexible time-varying coefficient model. For statistical inference of the unknown functions, the quantile regression approach based on the B-spline approximation is developed. Asymptotic results on the convergence of the estimators are provided. Some simulation studies are presented to assess the finite-sample performance of the estimators. Finally, two applications of bladder cancer data and US flight delay data are analyzed by the proposed method.

Entities: Chemical

Mesh：

Year: 2021 PMID： 34898640 PMCID： PMC8668142 DOI： 10.1371/journal.pone.0261224

Source DB: PubMed Journal: PLoS One ISSN： 1932-6203 Impact factor: 3.240

Introduction

In longitudinal follow-up studies, panel count data is frequently encountered in many fields such as medical research, social sciences, reliability studies, and tumorigenicity experiences, which has been widely analyzed by many authors. This type data is usually collected from the discrete observations in recurrent event process, as the continuous observations might be too expensive to be carried out. Thus, we can only obtain the cumulative occurrence numbers of the events of interest at these discrete observation times. For the analysis of panel count data, [1, 2] developed the regression analysis approaches to the panel count data model. [3] studied the clustered mixed nonhomogeneous Poisson models of panel count data. [4] considered the spline-based likelihood estimation of the proportional mean model. To describe the potential correlations of the recurrent event process, [5-7] developed some joint models of panel count data by employing some frailty parameters to discuss these correlations. Recently, semiparametric transformation models with informative observation times were studied by many authors, such as [8-10]. More comprehensive introductions about this type data can be referred to the book of [11]. In general, the existing approaches in modeling panel count data are based on the time-invariant coefficients assumption, but which may be violated in practice. In some applications, coefficients may be time-varying, and sometimes it is more vital to detect the temporal impacts on the recurrent event process. For example, in medical studies, we are interested in detecting the temporal impacts of one new drug. Recently, [12, 13] proposed the varying coefficient models for recurrent events. However, the analysis of panel count data with varying coefficients is very limited. Most recently, [14] proposed a partially varying coefficient model of panel count data to account for the nonlinear interactions between covariates. [15] proposed a nonparametric proportional mean model of the panel count data with time-varying coefficients. Quantile regression is widely used in the analysis of longitudinal data. It can provide more information about the distribution shape of the response and can be used to measure the effect of variables under different percentiles of the distribution. However, quantile regression methodologies for the panel count data are lagging. As the discreteness of the panel count data, quantile regression cannot be directly used. For the first, a smoothing technique (“jitter”) is used for this type data, then the quantile regression can be applied to the smooth data. In this paper, a semiparametric time-varying coefficient model is formulated. For the inference of the unknown functions, quantile regression method is used for the panel count data, with the unknown functions approximated by the B-spline basis functions. Furthermore, the asymptotic results on the convergence of the estimators are established as well. The main contribution of the paper is that we propose a new spline-based quantile estimation procedure for the time-varying coefficient panel count data model, which has not been discussed in the literature.

Model specification

Suppose that n independent subjects are observed over time. N(t) denotes the cumulative total number of recurrent event occurring at or before time t for subject i. is a counting process with jumps at the discrete observation times, t < t < ⋯. We assume that t is in a fix interval ℜ of finite length. Besides, two follow-up times are existed: the potential censoring time and the observation endpoint T. Thus, only can be observed in the process, with . is assumed to be independent with N(t) and . Let denote the real observation process of subject i, and , i = 1, ⋯, n. Then, N(t) can be only acquired at the time points where H(t) jumps. The total number of the observations is defined as . Let Z be a p × 1 vector of covariates. So we can have the independent and identically distributed dataset {H(t), N(t)dH(t), C, δ, Z;t ≥ 0, i = 1, ⋯, n}. To describe the possible time-varying effects of covariates on N(t), the time-varying coefficient model is proposed as follows. (1) Given Z, the conditional mean function of N(t) is where λ0(u) is an unspecified smooth baseline intensity function, and β(u) is an unknown p × 1 vector of time-varying regression coefficients. (2) Conditional on Z, are mutually independent. For the model defined above, [15] developed the likelihood and pseudo-likelihood methods to get the estimation of the baseline intensity function λ0(u) and the varying coefficient functions β(u) based on the Poisson distribution assumption on N(t). However, no distribution assumption is specified in this paper and the existed methods cannot be used. In the next section, the spline-based quantile regression is proposed to acquire the estimation of the unknown functions. In the first step, the unknown baseline intensity function and the coefficients are approximated by B-splines. And then, the discrete panel count data become continuous by a smoothing technique. Quantile regression is developed for the inference in the last step.

Estimation procedure

For the inference of Eq (1), the model can be rewritten as, where , η(u) = (β(u)⊤, log{λ0(u)})⊤.

Approximations of baseline and varying coefficients

Similar as [16], we use the basis expansion method to get the estimation of the unknown functions in this paper. Suppose η(u), k = 1, 2, ⋯, p + 1, can be approximated by a basis expansion, that is where are basis functions, and L is the number of basis functions. Various basis functions can be used in the expansion such as Fourier basis functions, polynomial basis functions and B-spline functions. In this paper, the B-spline basis is selected in the estimation procedure for calculation simplicity. The tuning parameter L is selected by L = n + q + 1, where n is the number of interior knots and q is the degree of the B-spline functions. The interior knots of the splines are equally spaced or placed on the sample quantiles of the data in all simulations and applications. The tuning parameter L may be different for different k. In this paper, we assume that L = L and q = q for all η(u). Thus, we define B(u) = B(u) for simplicity presentation.

Quantile regression

As quantile regression is a good alternative to the conditional mean models, the quantile regression is considered for the panel count data model. However, quantile regression cannot be directly used as the discreteness of the data N(t). According to the method developed in [17], the “jitter” method is applied to construct continuous random variables. By adding U, which is generated from a [0, 1) uniform distribution, we can have where the noise U is independent of N(t) and Z. The uniform distribution is used because it allows computational simplifications. The uniform noise, however, is by no means a necessity to jitter the data. The noise may be generated by any continuous distribution with support on [0, 1). Thus, we can get the continuous data and there exists a one-to-one link between the quantiles of N(t) and . The regression model of can be written as where ϵ are assumed to be independent of t with unknown cumulative distribution function (cdf) G(⋅) and density function g(⋅). Besides, the τ-th conditional quantile of ϵ is b. The quantile regression loss function is defined as ρ(u) = u[τ − I(u < 0)], τ ∈ (0, 1). Then the quantile regression is applied on the smooth data to obtain the estimation of the unknown parameters. Thus, the unknown parameters ϕ = (γ⊤, b)⊤ can be estimated by minimizing the following objective function Ψ(ϕ), that is where W(u, X) = I ⊗ B(u) ⋅ X and . For the ease of calculation, Gauss-Legendre formula is used to approximate the integral. Thus, we have where ω is the Gauss coefficient, S is the number of the Gauss points and Δ is the Gauss point. The Gauss-Legendre approximation of the objective function Ψ(ϕ) can be defined as Define be the minimizers of the approximation of the objective function Ψ(ϕ). It is nature to get the estimation of the varying coefficient β(u), k = 1, ⋯, p, and the baseline intensity function of λ0(u) can be obtained by Next, we discuss how to select the tuning parameter L and the Gauss point number S. As proposed by [16], we use the leave-one-subject-out cross-validation (CV) to choose L and S. Let and denote the estimators from the data with the i-th subject deleted. So the leave-one-subject-out CV can be written as Thus, the tuning parameter L and S can be selected as Remark 1 The number L of the basis expansion of β may be different from each other. However, we assume L = L for all k, for simplicity.

Asymptotic results

The asymptotic results are concluded in this section. Before presenting the results, some regularity conditions are introduced for the first. (C1) Z is uniformly bounded. (C2) The observation number m is bounded by a constant. (C3) λ0(u) and β(u), k = 1, ⋯, p, are l-th differentiable and bounded. (C4) There exists an open subset Ω ⊂ R, which contains the true parameter ϕ*. The second derivative matrix ∇2 h(t, X;ϕ) of h(t, X;ϕ) with respect to ϕ, satisfies for all ϕ ∈ Ω, with , for all j, k. (C5) , , and 0 < d1 < λmin(Γ) ≤ λmax(Γ) < d2 < ∞, where λmin(Γ) and λmax(Γ) denote the smallest and the largest eigenvalues of Γ. (C6) ϵ is independent with unknown distribution function G(⋅) and density g(⋅). Besides, the τ-th conditional quantile of ϵ is ℓ. Under these above regularity conditions, the asymptotic results on the convergence of the estimators are displayed in the following theory. For the need of the proofs, a lemma of spline function of [18] is presented. First, define Let S be the space of splines of degree q consisting of functions η satisfying: (i) the function η to each subinterval is a polynomial spline of degree q; (ii) for q ≥ 1 and 0 ≤ q′ ≤ q, η is q′ times continuously differentiable on the support. Besides, η is assumed to satisfy the following regularity condition. Let l1 ∈ [0, q] be a nonnegative integer. The l1-th derivative, denoted as , exists and satisfies the Lipschitz condition of order v ∈ (0, 1] such that ρ = l1 + v > 0.5 and , for s, t ∈ [0, C], where δ is a positive constant. Lemma 1 There exists η ∈ S such that ‖η − η‖2 = O(L− + L1/2 m−1/2). If L = O{m1/(2}, then we have ‖η − η‖2 = O{(L/m)1/2} = O{m−}. Theorem 1 Suppose the conditions (C1)–(C6) hold and if L = O{m1/(2}, then we have Furthermore, we have Ignoring the approximation error in the B-spline basis approximation of β(u), k = 1, ⋯, p, we can have the 100(1 − α)% pointwise confidence interval of β(u) under quantile τ, where z2/ is the 100(1 − α)% percentile of the standard normal distribution and . Similar as the baseline function λ0(u).

Simulation studies

Three simulation studies are carried out to evaluate the performance of the method developed in this paper. We generated 200 datasets from the time-varying coefficient model, each of size n = 100 or 200 independent subjects. For each subject i, the endpoint of observation T is assumed to be 6 and the censoring time follows the uniform distribution of [T/2, 3T/2]. The number of observation times m is generated from a discrete uniform distribution {1, 2, 3, 4, 5}. And the observed event times, , are the order statistics of a random sample size m from the uniform distribution over (0, C). Given m and , the panel count data N(t) can be obtained by the following formula for j = 1, ⋯m and i = 1, ⋯n. is the random number generated from the Poisson distribution with mean The following three cases are considered: Case I: p = 1 and the covariate Z is generated independently from the [0, 1] uniform distribution. The baseline function is taken as λ0(u) = 2u + 1 and the varying coefficient β(u) = sin(−πu/6). Case II: p = 1 and the covariate Z is generated independently from the [0, 1] uniform distribution. The baseline function is taken as λ0(u) = 2(u + τ) and the varying coefficient β(u) = sin(−τπu/6). Case III: p = 2 and the covariates Z are generated from the [0, 1]2 uniform distribution with correlation cor(Z, Z) = 0.5|. The baseline function is taken as λ0(u) = 2u + 1 and the varying coefficient β1(u) = sin(−πu/6) and β2(u) = 2sin(−τπu/6). To estimate the smooth functions logλ0(u) and β(u), the cubic B-spline functions are selected. Under different quantiles τ = {0.25, 0.5, 0.75}, the estimations of Case I–III are presented with sample size n = 100 or 200 in Tables 1–3, respectively. The results include the average of the absolute bias values based 100 grid points (BIAS), the average of sampling standard errors based 100 grid points (SSE), the average of the bootstrap standard errors based 100 grid points (BSE) and the average of the estimated 95% coverage probabilities based 100 grid points (CP). It can be seen that the estimations are unbiased under different quantiles. The values of SSE and BSE are close and decrease with the increasing sample size n. Besides, from the results of CP, we can note that the Gaussian approximation is appropriate for the estimators.

Table 1

BIAS, SSE, BSE and CP of the estimated functions in Case I at different τ.

τ	n	Estimated function	BIAS	SSE	BSE	CP
0.25	100	β(t)	0.0558	0.8329	0.8355	0.9640
	100	log λ₀(t)	0.1276	0.4150	0.4453	0.9548
	200	β(t)	0.0319	0.6826	0.5999	0.9355
	200	log λ₀(t)	0.1044	0.3853	0.3377	0.9470
0.5	100	β(t)	0.0598	0.4472	0.4184	0.9623
	100	log λ₀(t)	0.0341	0.2460	0.2264	0.9750
	200	β(t)	0.0197	0.3905	0.3684	0.9611
	200	log λ₀(t)	0.0280	0.1886	0.1725	0.9525
0.75	100	β(t)	0.0789	0.8014	0.9118	0.9701
	100	log λ₀(t)	0.0751	0.4935	0.4955	0.9640
	200	β(t)	0.0553	0.6901	0.6785	0.9368
	200	log λ₀(t)	0.0599	0.4079	0.3808	0.9472

Table 3

BIAS, SSE, BSE and CP of the estimated functions in Case III at different τ.

τ	n	Estimated function	BIAS	SSE	BSE	CP
0.25	100	β₁(t)	0.0553	1.0039	0.9841	0.9560
		β₂(t)	0.1272	0.9849	0.9762	0.9262
		log λ₀(t)	0.1694	0.7364	0.6911	0.9735
	200	β₁(t)	0.0404	0.7205	0.6835	0.9436
		β₂(t)	0.0795	0.7557	0.7422	0.9677
		log λ₀(t)	0.1271	0.5135	0.4974	0.9595
0.5	100	β₁(t)	0.1330	0.8475	0.8618	0.9205
		β₂(t)	0.1641	0.9111	0.8852	0.9455
		log λ₀(t)	0.0431	0.5986	0.5748	0.9625
	200	β₁(t)	0.0927	0.6934	0.6520	0.9380
		β₂(t)	0.0518	0.7482	0.7249	0.9628
		log λ₀(t)	0.0445	0.4225	0.4418	0.9561
0.75	100	β₁(t)	0.1235	0.9436	0.8869	0.9256
		β₂(t)	0.1282	0.9478	0.9648	0.9460
		log λ₀(t)	0.0695	0.8331	0.8160	0.9335
	200	β₁(t)	0.0383	0.7438	0.7409	0.9510
		β₂(t)	0.0825	0.8510	0.8325	0.9246
		log λ₀(t)	0.0507	0.5684	0.5348	0.9714

Figs 1–3 display the estimation curves of the unknown functions log λ0(t) and β(t) with n = 200. In the figures, the point lines represent the estimated curves, the solid lines represent the true curves and the dotted lines represent the 95% confidence intervals. Based the figures, it is easy to find that the real curves and the estimated curves are very close, which indicates the B-spline estimations of the unknown functions work well. From the simulation results, we note that the estimations under different quantiles are reasonable for log λ0(t) and β(t).

Fig 1

Estimated curves of time-varying functions in case I at different τ with n = 200.

Fig 3

Estimated curves of time-varying functions in case III at different τ with n = 200.

Applications

Bladder cancer data

Bladder cancer data was collected by the Veterans Administration Cooperative Urological Research Group. In this study, 85 patients were randomly assigned to two treatment groups: placebo group (47) and thiotepa group (38). For each patient, the observation times and the cumulative numbers of the bladder tumors that occurring at or before the observation times are recorded. The observation endpoint is 53 month. What’s more, the initial number of the bladder tumors and the largest initial tumor size for each patient are also recorded. In the literature, the dataset has been discussed by many authors such as [5, 7, 19]. However, time-varying coefficient panel count data model is not considered for this dataset. In order to describe the temporal impacts of the covariates on the bladder cancer data, the time-varying coefficient model proposed in this paper is applied to these data. For each patient i, N(t) is denoted as the cumulative bladder tumors number occurring up to time t, and H(t) is denoted as the cumulative observation number up to time t, i = 1, ⋯, 85. Furthermore, let Z = 1 if the patient i is belonged to the thiotepa group and Z = 0 otherwise. Z is denoted as the initial tumor number and Z is the natural logarithm of the largest initial tumour size plus 1 for each patient i. Therefore, we have the model Then quantile regression estimation is applied to this data. 100 samples are drawn from the data every time and 200 times are repeated in the estimation. Similar to the numerical studies, the unknown functions λ0(t) and β(t), k = 1, 2, 3 are approximated by Cubic B-spline functions. The estimation is implemented under quantiles τ ∈ {0.25, 0.5, 0.75}. The estimation curves of log λ0(t) and β(t), k = 1, 2, 3 are displayed in Fig 4. In general, the thiotepa treatment and the tumor recurrence rate are negatively correlated at different quantiles. Patients in the thiotepa group tend to have less tumor recurrence rate than those in the placebo group. The initial tumor number is positively correlated with the recurrence rate and the largest initial tumor size is negatively correlated with the recurrence rate. These above conclusions are consistent with [19]. Furthermore, we can find the covariates impacts are varying during the observation time and the impacts are different at different quantiles. Thus, more information can be obtained from the quantile regression of the time-varying coefficient panel count data model than the other analysis in the existing literature.

Fig 4

Estimated curves of time-varying functions for bladder cancer data at different τ.

US flight delay data

In this subsection, 2015 US flight delay data (available from https://www.kaggle.com/usdot/flight-delays) is analyzed with the time-varying coefficient panel count data model. This dataset was collected from the U.S. Department of Transportation’s (DOT) monthly Air Travel Consumer Report. The report contained information about the numbers of on-time, delayed, canceled, and diverted flights. The dataset included 9794 flights which were observed during 3 months in the year of 2015. The numbers of delays for each flight are recorded between the observation times. The observation times of each flight are the same and the observation interval is 7 days. Besides, the average departure delay time and the average flight distance of each flight are also recorded. In order to describe the temporal covariates impacts on the flight delays, the time-varying coefficient model proposed in this paper is used to these data. For each flight i, N(t) is denoted as the cumulative flight delay number that had occurred up to time t, H(t) is denoted as the cumulative observation number up to time t, i = 1, ⋯, 9794. Furthermore, we define Z as the average time of the departure delay and Z as the average distance of the flight i. Therefore, we have the model Then spline-based quantile estimation is applied to this data. Similarly, the unknown functions λ0(t) and β(t), k = 1, 2 are also approximated by Cubic B-spline functions. The estimation is implemented under quantiles τ ∈ {0.25, 0.5, 0.75}. As the sample size of the dataset is large, it is time-consuming or even not possible to read the entire dataset in practice due to the limited memory. Besides, the direct analysis can be infeasible, mainly due to the computing memory or computing time. In order to deal with the massive data, parallel computing method is developed by [20, 21]. In parallel computing method, we split the original dataset into a family of disjoint sub-sample blocks with equal size for the first. More precisely, the data structure can be defined as the following form: where the original dataset S is of size n = K × m which is partitioned to K subsample blocks S each consist m samples which are randomly picked up from the dataset S. Thus, the estimation procedure proposed can be implemented for every block S, k = 1, ⋯, K and the estimated values of unknown parameters for each block S is denoted as . Similar to the method introduced in [21], the final full-sample estimators can be generated by The estimation curves of log λ0(t) and β(t), k = 1, 2, under different quantiles τ ∈ {0.25, 0.5, 0.75} are displayed in Fig 5. From Fig 5, we can find that the departure delay time is positively correlated with the cumulative flight delay numbers. Besides, the impact of the departure delay time is varying over the time under different quantiles and the impact is different at different quantiles. However, the effect of the flight distance is not significant on the flight delay numbers.

Fig 5

Estimated curves of time-varying functions for US flight delay data at different τ.

Concluding remarks

In this paper, we proposed a spline-based quantile regression estimation method in the time-varying coefficient panel count data model. This model discussed in our paper is more general than [15], with no Poisson restriction on the recurrent event process. To get the estimations, B-splines are used to approximate the unknown functions log λ0(t) and β(t) for the first, and then a smoothing technique is applied to obtain the continuation of the discrete panel count data. Finally, the spline-based quantile regression approach is developed at different quantiles. Some simulations are presented to evaluate the performance of the proposed approach and two applications are analyzed to demonstrate its effectiveness in this paper. Recently, the Enron e-mail corpus which was a massive set of the e-mail messages, have been discussed by many authors, such as [22]. If we are interested in the number of interactions of all pairs of individuals in this longitudinal observations, as usual in network analysis, the snapshots are applied to model this longitudinal networks, then, this is a standard panel count dataset with massive observations. Furthermore, in this paper, we only considered the situation with low dimensional covariates, which may be not unpracticable in the applications. As the high-dimensional covariates may be existed, variable selection methods can be considered for the time-varying coefficient model. This will be an important topic for our further studies. Besides, reliability data and traffic data have been studied by many authors, such as [23-26]. This will be interesting to study the quantile regression estimation of such data.

Proof of Theorem 1

Define as the true but unknown values of , , , , ϕ = (γ⊤, b)⊤ and Let where By the Taylor expansion, we can have r = o(1). Besides, where and is between ϕ* and . Define By the identity of [27], Hence, it can be obtained that Thus, ΔH can be denoted as ΔH = ΔH1 + ΔH2, with By calculating the expectation and variance of ΔH2, By condition (C5), E[{∇h(t, X;ϕ*)}⊗2] = Γ, we can have Next, we calculate the variance of ΔH2, Hence, we can have . Before discussing ΔH1, we first define . Then, we have E(κ) = 0 and Var(κ) = τ(1− τ)Γ. By the Cramer-Wald Theorem and the Central Limit Theorem, we can have that κ → N{0, τ(1 − τ)Γ}. Next, we define so that . By simple calculation, we have Thus κ1 → κ. By Slutsky’s theorem, κ1 → N{0, τ(1 − τ)Γ}. Then, we can have that By the epi-convergence results of [28], . Finally, the asymptotic normality is proved . Since , we have By the Lemma 1, . Thus, we can get and

Table 2

BIAS, SSE, BSE and CP of the estimated functions in Case II at different τ.

τ	n	Estimated function	BIAS	SSE	BSE	CP
0.25	100	β(t)	0.0768	0.9830	0.9890	0.9592
	100	log λ₀(t)	0.1501	0.4924	0.5356	0.9661
	200	β(t)	0.0214	0.7345	0.6980	0.9365
	200	log λ₀(t)	0.1099	0.4021	0.3875	0.9480
0.5	100	β(t)	0.0455	0.7163	0.6937	0.9250
	100	log λ₀(t)	0.0746	0.4082	0.4277	0.9425
	200	β(t)	0.0380	0.4600	0.4511	0.9389
	200	log λ₀(t)	0.0514	0.2633	0.2601	0.9694
0.75	100	β(t)	0.0552	0.8944	0.8558	0.9697
	100	log λ₀(t)	0.0722	0.5191	0.4701	0.9406
	200	β(t)	0.0398	0.6906	0.6721	0.9355
	200	log λ₀(t)	0.0529	0.3553	0.3284	0.9640

8 in total

7. Analysis of hourly crash likelihood using unbalanced panel data mixed logit model and real-time driving environmental big data.

Authors: Feng Chen; Suren Chen; Xiaoxiang Ma
Journal: J Safety Res Date: 2018-04-25

8. Crash Frequency Modeling Using Real-Time Environmental and Traffic Data and Unbalanced Panel Data Models.

Authors: Feng Chen; Suren Chen; Xiaoxiang Ma
Journal: Int J Environ Res Public Health Date: 2016-06-18 Impact factor: 3.390