
Right-Censored Time Series Modeling by Modified Semi-Parametric A-Spline Estimator.

Dursun Aydın, Syed Ejaz Ahmed, Ersin Yılmaz.

Abstract

This paper focuses on the adaptive spline (A-spline) fitting of the semiparametric regression model to time series data with right-censored observations. Typically, there are two main problems that need to be solved in such a case: dealing with censored data and obtaining a proper A-spline estimator for the components of the semiparametric model. The first problem is traditionally solved by the synthetic data approach based on the Kaplan-Meier estimator. In practice, although the synthetic data technique is one of the most widely used solutions for right-censored observations, the transformed data's structure is distorted, especially for heavily censored datasets, due to the nature of the approach. In this paper, we introduced a modified semiparametric estimator based on the A-spline approach to overcome data irregularity with minimum information loss and to resolve the second problem described above. In addition, the semiparametric B-spline estimator was used as a benchmark method to gauge the success of the A-spline estimator. To this end, a detailed Monte Carlo simulation study and a real data sample were carried out to evaluate the performance of the proposed estimator and to make a practical comparison.


Keywords:  B-splines; adaptive splines; right-censored data; semiparametric regression; synthetic data transformation; time series

Year:  2021        PMID: 34945891      PMCID: PMC8699840          DOI: 10.3390/e23121586

Source DB:  PubMed          Journal:  Entropy (Basel)        ISSN: 1099-4300            Impact factor:   2.524


1. Introduction

Time series datasets are censored from the right under specific conditions, such as a detection limit or an insufficient observation process. Consider a device that cannot measure values above a certain point, known as its detection limit. Since the device cannot determine the real value of an observation above its detection limit, such observations are recorded as right-censored data points. The hourly observed cloud ceiling height data collected by the National Center for Atmospheric Research (NCAR) and modelled by [1,2] can be used as an example of a right-censored time series. Although right-censored time series are encountered frequently in the real world, very few studies in the literature address their estimation. This may be because censorship is an unwanted data irregularity for researchers, and it is therefore often ignored or handled with outdated techniques.

To solve the censorship problem before modelling the time series, reference [1] used the Gaussian imputation technique to estimate the series using modified ARMA models. In a similar manner, references [2,3] solved the censorship problem by using data imputation techniques. The common ground of these studies is the use of imputation and data augmentation methods to estimate regression models with autoregressive errors for right-censored time series. Although data imputation techniques have some merits, they are generally based on iterative algorithms and their calculations are costly. There is, on the other hand, an easier way to handle the censorship problem, called synthetic data transformation. Reference [4] estimated temporally correlated and right-censored series nonparametrically with the Nadaraya–Watson estimator, solving the censorship problem using a data transformation technique. Various data transformation (or synthetic data) methods have been proposed and studied in the literature for independent and identically distributed (i.i.d.) datasets; for example, see [5,6,7]. Because synthetic data transformation manipulates the data structure, which is disadvantageous, this solution method is no longer the preferred technique for right-censored time series. This paper aims to propose a method that can overcome this disadvantage of the synthetic data transformation method.

Note that the studies mentioned above model time series data using parametric or nonparametric methods. The data structure of a real-world time series is generally not suitable for parametric modelling, because parametric modelling requires rigid assumptions to reach reasonable estimates. Single-index nonparametric models, on the other hand, are very flexible, which is an important advantage over parametric methods, and there are valuable studies on the subject [2,8,9]. However, nonparametric approaches lose statistical efficiency when the number of covariates increases. In addition, when a time series dataset is right-censored, the weaknesses of both methods become more pronounced. Considering these issues, this paper adopts a semiparametric regression model for estimating right-censored time series. Although several researchers have introduced different types of semiparametric estimators for time series data, such as [10,11], there remains a significant gap in the research regarding the modelling of right-censored time series data. To address this gap, our paper proposes a modified semiparametric A-spline (AS) estimator based on synthetic data transformation. Thus, the bidirectional flexibility of the semiparametric model will be used, and the censorship problem will be effectively solved.

The paper is organized as follows: the methodology and fundamental ideas about the right-censored semiparametric time series model with autoregressive errors and the synthetic data transformation method are given in Section 2. Section 3 introduces a modified AS estimator for the parametric and nonparametric components of the right-censored time series model, and a semiparametric B-spline (BS) estimator is given as a benchmark. Section 4 covers the statistical properties and evaluation criteria for both the modified AS and benchmark BS methods. Section 5 gives additional information about the penalty term of the semiparametric AS approach. Section 6 and Section 7 contain a detailed Monte Carlo simulation study and a real-world data example, respectively. Conclusions are presented in Section 8.

2. Background

The classical semiparametric model can be defined as a hybrid model with a finite dimensional parametric component and a nonparametric component having an infinite dimensional nuisance parameter. See [12,13,14,15] for additional information. In both theory and practice, the semiparametric model brings a new perspective to data modeling, since it includes both parametric and nonparametric components. As mentioned in the previous section, it is well-suited to time series data, because it brings the advantages of both components to time series analysis. Suppose that a time series dataset $\{y_t, x_t, s_t\}$, $t = 1, \dots, n$, satisfies an uncensored semiparametric time series model of the form:

$$y_t = x_t^{\top}\beta + f(s_t) + \varepsilon_t, \qquad t = 1, \dots, n, \tag{1}$$

where the $y_t$'s are the observations of a stationary time series, $x_t$ are known $p$-dimensional vectors of the explanatory variables, $\beta$ is an unknown $p$-dimensional vector of the regression coefficients to be estimated, $f(\cdot)$ is an unknown smooth function that describes the relationship between $y_t$ and a nonparametric temporal covariate $s_t$, and finally, the $\varepsilon_t$'s are the stationary autoregressive error terms generated by:

$$\varepsilon_t = \rho_1 \varepsilon_{t-1} + \dots + \rho_r \varepsilon_{t-r} + u_t, \tag{2}$$

where $\rho_1, \dots, \rho_r$ are the autoregressive coefficients, and $u_t$ denotes the independent and identically distributed random error terms with mean zero and a constant variance. Model (1) does not include lagged $y_t$'s and has auto-correlated errors. This makes it a suitable model for the semiparametric regression analysis of certain kinds of time series. A common problem in practice is that dependent observations cannot be perfectly collected due to limitations including the detection limit of an evaluation tool or the end time of the study. To express this situation algebraically, we assume that the $y_t$ are censored from the right by a non-negative random variable $c_t$ representing the detection limit. Therefore, instead of observing the values of $y_t$, we now observe:

$$z_t = \min(y_t, c_t), \qquad \delta_t = I(y_t \le c_t), \tag{3}$$

where the $\delta_t$'s denote the censoring information. Suppose that we are interested in estimating the mean semiparametric regression function.

The distribution of the observable random variables $(z_t, \delta_t)$ does not identify the mean regression function uniquely. However, this problem can be solved as follows. Let $F(v)$, $G(v)$, and $H(v)$ be the cumulative distribution functions of the non-negative random variables $y_t$, $c_t$, and $z_t$, respectively. If the random variables $y_t$ and $c_t$ are independent, then the survival function of the observed response variable $z_t$ follows from the basic relationship between $F$, $G$, and $H$:

$$1 - H(v) = \left[1 - F(v)\right]\left[1 - G(v)\right].$$

Given a random sample from the distribution of $(x_t, s_t, z_t, \delta_t)$, it is of interest to examine the explanatory variables' effect on the observations of the time series (i.e., the response variable) by estimating the regression function $x_t^{\top}\beta + f(s_t)$, the conditional mean of the time series $y_t$. However, because of the censoring, ordinary methods cannot be applied directly to estimate the regression function. To overcome censored observations, a data transformation technique should be used. One of the most widely used techniques is the synthetic data transformation, detailed in the section below.

Synthetic Data

To extend the penalized sum of squares approach to right-censored semiparametric regression analysis, we updated the synthetic data approach developed by [5]. The first step is to create an unbiased synthetic response variable whose expectation is equal to that of the original response, and then to obtain the penalized least squares estimator by means of this synthetic variable. The main goal of this transformation is to account for the censoring effect on the distribution of the response variable. In the case of censored data, the authors of [16,17] used the synthetic data approach. In the synthetic approach, we replace the observed variable $z_t$ with transformed data $y_{tG}$; the transformation maintains the conditional expectation of the original variable $y_t$. To describe this situation, it is easier to proceed directly using the cumulative distributions given in Lemma 1 below. Note also that if $G$ is known, then it is possible to transform the observed data into unbiased synthetic data, given by:

$$y_{tG} = \frac{\delta_t z_t}{1 - G(z_t)},$$

where $G$ is the distribution function of the censoring time $c_t$, as defined before. It should be noted that the distribution $G$ is rarely known. In this case, we use the Kaplan–Meier estimator $\hat{G}$ defined by:

$$1 - \hat{G}(v) = \prod_{i=1}^{n} \left(\frac{n-i}{n-i+1}\right)^{I\left[z_{(i)} \le v,\ \delta_{(i)} = 0\right]},$$

where $z_{(1)} \le \dots \le z_{(n)}$ are the sorted values of $z_t$ and $\delta_{(i)}$ is the $\delta_t$ related to $z_{(i)}$. Equation (5) has the following properties: (a) if distribution $G$ is selected arbitrarily, some $z_t$ can be identical. In this case, the ranking of $z_1, \dots, z_n$ into $z_{(1)}, \dots, z_{(n)}$ is not unique. However, the Kaplan–Meier estimator allows us to define the ranking uniquely; (b) $\hat{G}$ has jumps only at the censored observations of the time series (see [18]). Substituting $\hat{G}$ for $G$ in Equation (5), we construct the following synthetic data, given by:

$$y_{t\hat{G}} = \frac{\delta_t z_t}{1 - \hat{G}(z_t)}.$$

Then, one practical consequence of the following Lemma is that the synthetic data and the completely observed response times have the same conditional expectations, as claimed before.

Lemma 1. Consider time series data $y_t$ denoted as a response variable. If the data is censored by a random censoring variable $c_t$ with distribution $G$, transform the observed series $z_t$ to $y_{tG}$ in an unbiased form, as defined in Equation (4). Based on this information, it can be easily verified that $E[y_{tG} \mid x_t, s_t] = E[y_t \mid x_t, s_t] = x_t^{\top}\beta + f(s_t)$. However, generally, $G$ is unknown as mentioned before. Therefore, $\hat{G}$ is used, which is defined in Equation (7), instead of $G$. Because of  when , (see [.

Let us consider that , where  is defined right after Equation (3). In the literature, the convergence rate of the Kaplan–Meier estimator is examined in two classes: (i) restriction of the time interval as  with ; (ii) extension of the time interval  (see [19] for more detailed discussions). Here, the convergence rate of the Kaplan–Meier estimator is inspected with regard to case (ii). However,  cannot be used without some strong conditions that can be given by: ; ; . Details about conditions (i)–(iii) were studied by [20]. The convergence of $\hat{G}$ over the interval  can be provided. Reference [19] clearly shows both strong and weak convergences at the rate  where . The proof of Lemma 1 is given in Appendix A.

The major concern of this paper is to overcome the censoring problem and to estimate the semiparametric time series model efficiently. To achieve this goal, we used two different approaches, the BS and modified AS estimators. In the following section, we apply these approaches to the transformed data to estimate time series observations under random right-censorship.
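The synthetic data transformation described in this section can be sketched in a few lines. The block below is a minimal illustration, not the authors' implementation: it assumes the standard Kaplan–Meier product form for the censoring distribution (factors entering only at censored points) and the usual synthetic response "uncensored value divided by the estimated censoring survival, zero for censored points". The function names `km_censoring_survival` and `synthetic_response` and the guard `eps` are hypothetical.

```python
import numpy as np

def km_censoring_survival(z, delta):
    """Kaplan-Meier estimate of the censoring survival 1 - G at each z[t].

    z     : observed values, min(y, c)
    delta : 1 if the value is uncensored (y <= c), 0 if censored
    """
    z = np.asarray(z, dtype=float)
    delta = np.asarray(delta, dtype=int)
    n = len(z)
    order = np.argsort(z, kind="stable")        # ranks z into z_(1) <= ... <= z_(n)
    d_sorted = delta[order]
    ranks = np.arange(1, n + 1)
    # The factor (n - i) / (n - i + 1) enters only at censored points (delta == 0).
    factors = np.where(d_sorted == 0, (n - ranks) / (n - ranks + 1.0), 1.0)
    surv_sorted = np.cumprod(factors)
    surv = np.empty(n)
    surv[order] = surv_sorted                   # back to the original time order
    return surv

def synthetic_response(z, delta, eps=1e-10):
    """Synthetic data: delta * z / (1 - G_hat(z)); censored points become 0."""
    z = np.asarray(z, dtype=float)
    delta = np.asarray(delta, dtype=int)
    surv = km_censoring_survival(z, delta)
    # eps (hypothetical guard) avoids division by zero at the largest censored value.
    return np.where(delta == 1, z / np.maximum(surv, eps), 0.0)
```

Uncensored values observed after a censored point are inflated to compensate for the censored points that are set to zero, which is the distortion of the data structure that the text attributes to this approach under heavy censoring.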

3. Estimating the Semiparametric Model Based on the BS Estimator

We first introduce the BS considered for estimating the components of model (1). A univariate B-spline is constructed by a piecewise polynomial function of degree $q$ such that its derivatives up to order $(q-1)$ are continuous at each knot point. The set of BSs of degree $q$ over the real numbers is a vector space of dimension $k + q + 1$. Here, $k$ denotes the number of interior knots, while $q$ indicates the polynomial degree. For example, the polynomials of degree $q = 0, 1, 2, 3$ are defined as constant, linear, quadratic, and cubic BS basis functions, respectively. If the knots are equally spaced (i.e., separated by the same distance $h$), the knot points and the corresponding BSs are called uniform. Given an ordered knot vector $\kappa_1 \le \kappa_2 \le \dots$ in the domain of covariate $s$, the BS basis functions of degree $0$ and $q$ can be defined recursively, respectively, as:

$$B_{j,0}(s) = \begin{cases} 1, & \kappa_j \le s < \kappa_{j+1}, \\ 0, & \text{otherwise}, \end{cases} \tag{8}$$

$$B_{j,q}(s) = \frac{s - \kappa_j}{\kappa_{j+q} - \kappa_j}\, B_{j,q-1}(s) + \frac{\kappa_{j+q+1} - s}{\kappa_{j+q+1} - \kappa_{j+1}}\, B_{j+1,q-1}(s). \tag{9}$$

Note that if a denominator in Equation (9) is equal to zero, then the corresponding term is taken to be zero. From Equations (8) and (9), the set of basis functions has the following important properties: (a) the BS basis functions form a partition of unity, $\sum_j B_{j,q}(s) = 1$; (b) $B_{j,q}(s) \ge 0$ for all values of covariate $s$; and (c) $B_{j,q}(s) > 0$ is realized only in the interval $[\kappa_j, \kappa_{j+q+1}]$. Reference [21] proposes an algorithm to solve Equation (9). See also the work of [22] for more detailed discussions on the BS approximation. Note also that the BS curve can be uniquely represented as a linear combination of the BS basis functions in Equation (9), as given in the next section. References [23,24] are recent studies on BSs.
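The recursive definition of the BS basis can be illustrated with a short sketch. It assumes the standard Cox–de Boor recursion with the "zero denominator gives zero term" convention mentioned in the text, and a clamped knot vector (boundary knots repeated degree + 1 times); the helper names `clamped_knots` and `bspline_basis` are hypothetical.

```python
import numpy as np

def clamped_knots(a, b, n_interior, degree):
    """Knot vector on [a, b] with equally spaced interior knots and
    boundary knots repeated (degree + 1) times (an assumed convention)."""
    interior = np.linspace(a, b, n_interior + 2)[1:-1]
    return np.concatenate([np.full(degree + 1, a), interior, np.full(degree + 1, b)])

def bspline_basis(s, knots, degree):
    """Evaluate all B-spline basis functions of the given degree at points s."""
    s = np.asarray(s, dtype=float)
    knots = np.asarray(knots, dtype=float)
    m = len(knots) - 1
    # Degree 0: indicator of each half-open knot interval.
    B = np.zeros((len(s), m))
    for j in range(m):
        B[:, j] = (knots[j] <= s) & (s < knots[j + 1])
    # Attach the right boundary point to the last non-empty interval.
    last = np.max(np.nonzero(np.diff(knots) > 0)[0])
    B[s == knots[-1], last] = 1.0
    # Raise the degree one step at a time.
    for d in range(1, degree + 1):
        B_new = np.zeros((len(s), m - d))
        for j in range(m - d):
            left_den = knots[j + d] - knots[j]
            right_den = knots[j + d + 1] - knots[j + 1]
            # A term with a zero denominator is taken to be zero.
            left = (s - knots[j]) / left_den * B[:, j] if left_den > 0 else 0.0
            right = (knots[j + d + 1] - s) / right_den * B[:, j + 1] if right_den > 0 else 0.0
            B_new[:, j] = left + right
        B = B_new
    return B
```

With 4 interior knots and degree 3 this returns 4 + 3 + 1 = 8 basis functions, matching the dimension stated in the text, and each row of the returned matrix sums to one on the knot range.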

3.1. BS Estimator

As previously noted, in this paper, we fit semiparametric time series model (1) with right-censored data. For this purpose, the BS estimator can be used as an approximation method. Using the synthetic data in Equation (7), we estimated the parametric and nonparametric components of model (1). Therefore, the sum of the squares of the differences between the censored time series values and are minimum. Assume that is a smooth function that can be approximated by a linear combination of the BSs basis functions in Equations (8) and (9): where is the total number of BS basis functions being used, are estimated coefficients (or control points) for each BS, is an -dimensional matrix which includes BSs as defined by Equation (9) and is a parameter vector of the BS function. Note also that the autoregressive errors in model (1) follow an -dimensional multivariate normal distribution with a zero mean and stationary covariance matrix , that is, where the covariance matrix is a symmetric and positive definite matrix with elements: Throughout the paper, the notation is used as . Note that is generally unknown. However, its elements can be obtained by the generalized least squares (GLS) based on an iterative process. Then, as in [25] which is a penalized BS study combining BS and difference penalties, the estimates of the components of semiparametric model (1) were obtained by minimizing the penalized sum of squares () criterion: where is the first-order difference penalty on the coefficients of the BSs. The other differences can be defined as follows: and similarly: Note that if degree in Equation (12), we obtain semiparametric ridge regression based on BSs. When in Equation (12), we have the minimization equation of ordinary least squares regression with a correlated error. If , the penalty only influences the main diagonal and sub-diagonals (on both sides of the main diagonal elements) of the banded structure system due to the limited overlap of the BSs. 
We rewrite the minimization criterion described as Equation (12) in a matrix and vector notation: where denotes Euclidean norm, , is the synthetic response vector defined in Equation (7), is a smoothing parameter, and denotes the matrix notation of the difference operator defined in Equation (13). For example, is an -dimensional banded matrix that corresponds to the second-order difference penalty, given by: From simple algebraic operations, it follows that the solution to the minimization problem in Equation (15) satisfies the following block matrix equation: Given a parameter , the corresponding estimators based on BSs for vectors and can be easily obtained by: and: where It should be noted that the estimates of the unknown regression function in a censored semiparametric model are obtained by: From Equations (19) and (20), we see that the fitted values of dependent time series data can be written as: where is a hat matrix for BSs and computed as follows: where .
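The penalized least squares fit described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's partitioned estimator: it solves the joint block system for the parametric and spline coefficients in one step, takes the inverse error covariance `V_inv` as given (identity if omitted, i.e., uncorrelated errors), and expects the synthetic responses as `y`. The function names are hypothetical.

```python
import numpy as np

def difference_matrix(m, order=2):
    """Matrix D such that D @ alpha gives the order-th differences of alpha."""
    return np.diff(np.eye(m), n=order, axis=0)

def fit_penalized_semiparametric(y, X, B, lam, diff_order=2, V_inv=None):
    """Minimise (y - X b - B a)' V_inv (y - X b - B a) + lam * ||D a||^2.

    y : synthetic responses, X : parametric design, B : spline basis matrix.
    Returns (beta_hat, alpha_hat, fitted values, hat matrix).
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float).reshape(len(y), -1)
    B = np.asarray(B, dtype=float)
    n, p = X.shape
    m = B.shape[1]
    if V_inv is None:
        V_inv = np.eye(n)            # assumption: uncorrelated errors
    C = np.hstack([X, B])            # joint design [X, B]
    D = difference_matrix(m, diff_order)
    P = np.zeros((p + m, p + m))
    P[p:, p:] = lam * D.T @ D        # the penalty acts on spline coefficients only
    M = C.T @ V_inv @ C + P
    theta = np.linalg.solve(M, C.T @ V_inv @ y)
    H = C @ np.linalg.solve(M, C.T @ V_inv)   # hat matrix: fitted = H @ y
    return theta[:p], theta[p:], C @ theta, H
```

Setting `lam=0` reduces the fit to (generalized) least squares on the joint design, and the trace of the returned hat matrix gives the effective degrees of freedom needed by a criterion for choosing the smoothing parameter.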

3.2. AS Estimator

The adaptive spline (AS) applies an adaptive ridge penalty to the BS method, which makes it more flexible for knot determination. The AS concept is explained in [26] in a nonparametric context. However, in this paper, we generalized this estimation concept to the semiparametric environment based on synthetic response observations. It should be noted that the location and number of knots have crucial importance in terms of synthetic data transformation. This issue is discussed in detail in Section 4.3. The point here is that a more efficient estimator based on synthetic responses is needed, as most of the existing smoothing techniques (spline smoothing, kernel smoothing, etc.) cannot properly handle synthetic data. This article aims to solve this issue with the AS estimator. When a BS is defined on the knots such that for some knot, it may be reparametrized as a BS on the knots . Accordingly, when , we want to put a penalty on the number of non-zero differences indicated as below: where is the -order difference operator and is the -norm of the differences, that is, = 0 if , otherwise, = 1, and is a positive penalty parameter that ensures the tradeoff between the goodness of fit to the data and the smoothness of the fitted curve. This penalty enables us to remove knot that is not related to the smoothing problem, to join the neighbor intervals and , and to carry on fitting with a BS described over the remaining knot points. Note also that when , the fitted curve becomes a BS with knots and when the fitted function becomes a polynomial of degree . It should be emphasized that one of the important points about the adaptive ridge penalty is that Equation (23) cannot be differentiated due to the -norm. As a result, the fitting process is made numerically untraceable. An approximate solution to dealing with the -norm is provided by [27,28]. 
Following the studies of these authors, we approximate the -norm by using an iterative process referred to as an “adaptive ridge” based on synthetic data. The new criterion function is expressed by the following weighted penalized sum of squares: where ’s denote the positive weights. It should be noted that the penalty is close to the -norm of the differences when the weights are iteratively calculated from the parameter vector of BS following the equation: where is a constant properly determined by the researcher. There are a few important points to know about the selection of . If , then the magnitudes of ’s might be quite large, resulting in and the penalty term turning into . Furthermore, if , then . This convergence gives us a measure of how relevant the knot point is. In practice, one possible choice, suggested by [ . They select the knots (denoted as ) with a weighted difference bigger than 0.99. The number of parameters of the chosen BS is , where denotes the number of selected knot points. Note that reference [28] provides a figure to show the effects of different norm degrees () on the quality of estimation. It is seen from that the performance of estimation does not change for different values of when norm degree is zero . However, it affects the performance seriously if . For some and non-negative weights, the of Equation (26) can be rewritten as: where is a penalty matrix and written as , where and is the matrix form of the difference operator , as defined in Equation (13). Simple algebraic operations show that the solution to the minimization problem in Equation (26) satisfies the block matrix equation: By similar arguments as in the case of the BS approach, the corresponding estimators and of and , based on the right-censored semiparametric time series model (1) with correlated data, can be easily obtained, respectively, as: and: where . The proofs and derivations of Equations (28) and (29) are given in Appendix B. 
Notice that the estimates corresponding to the nonparametric part of the semiparametric model (1) are obtained using Equation (28) as described in the following equation: From Equations (29) and (30), we can see that the fitted values of the dependent time series data can be obtained as: where denotes the hat matrix, given by: with . To make the computation process efficient, all penalty terms are calculated by using the iteration process instead of finding matrix and knot set individually. The iterative algorithm is given in Algorithm 1 below. For the constant value of , the iteration process repeats between step 3 and step 9 until the pre-determined tolerance value is obtained where . From our experience, the expected number of iterations is observed as to achieve the convergence. Notice that the complexity and efficiency of Algorithm 1 is analyzed from different aspects that are given by: (i) Number of local searches: algorithm does not involve a local search procedure which is an advantage for the speed of Algorithm 1; (ii) Number of nested loops: due to the fact that there is only an iteration loop (without nested loops), if an algorithm does not include nested loops, its “order of growth” will be (iii) Asymptotic behaviors: as the former inference mentioned, Algorithm 1 has which means that the limiting case of its convergence speed is considerable when it is compared with its alternative BS method on this issue. As mentioned at the beginning of this section, the choice of an optimum smoothing parameter λ is required for both semiparametric BS and AS estimators. In this context, the improved Akaike information criterion ( proposed by [29] is used, which is computed with the following equation: where is the estimate of the model variance, which is estimated for both methods separately in the next section, and denotes the hat matrix for any of two methods. It is replaced by for the AS method and for the BS method, respectively.
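The iterative reweighting idea of this subsection can be sketched as below. This is an illustration under explicit assumptions rather than a reproduction of Algorithm 1: errors are treated as uncorrelated, the weights are updated as 1 / (difference² + eps²) (the usual adaptive ridge form), knots are flagged when the weighted squared difference exceeds 0.99 as described in the text, and `aicc` uses the Hurvich, Simonoff and Tsai form of the improved AIC, which is assumed to be the criterion of [29]. All function and parameter names are hypothetical.

```python
import numpy as np

def adaptive_ridge_fit(y, X, B, lam, diff_order=2, eps=1e-4, tol=1e-6, max_iter=200):
    """Adaptive ridge approximation of an L0 penalty on coefficient differences.

    Returns (beta_hat, alpha_hat, selected-difference mask, hat matrix).
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float).reshape(len(y), -1)
    B = np.asarray(B, dtype=float)
    p, m = X.shape[1], B.shape[1]
    C = np.hstack([X, B])
    D = np.diff(np.eye(m), n=diff_order, axis=0)
    CtC, Cty = C.T @ C, C.T @ y
    w = np.ones(D.shape[0])                  # initial weights: plain ridge
    theta = np.zeros(p + m)
    for _ in range(max_iter):
        P = np.zeros((p + m, p + m))
        P[p:, p:] = lam * D.T @ (w[:, None] * D)   # weighted difference penalty
        theta_new = np.linalg.solve(CtC + P, Cty)
        d = D @ theta_new[p:]
        w_new = 1.0 / (d ** 2 + eps ** 2)          # small differences get large weights
        converged = np.max(np.abs(theta_new - theta)) < tol
        theta = theta_new
        if converged:
            break
        w = w_new
    # w_sel * d^2 is close to 1 for a relevant knot and close to 0 otherwise.
    w_sel = 1.0 / (d ** 2 + eps ** 2)
    selected = w_sel * d ** 2 > 0.99
    H = C @ np.linalg.solve(CtC + P, C.T)          # hat matrix at the final weights
    return theta[:p], theta[p:], selected, H

def aicc(sigma2, trace_H, n):
    """Improved AIC (assumed form): log(sigma2) + 1 + 2(tr(H) + 1) / (n - tr(H) - 2)."""
    return np.log(sigma2) + 1.0 + 2.0 * (trace_H + 1.0) / (n - trace_H - 2.0)
```

In use, the fit would be repeated over a grid of `lam` values and the value minimizing `aicc` retained; the `selected` mask then indicates which knots the penalty kept.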

4. Statistical Properties of the Estimators

In this paper, we introduced the semiparametric AS and BS estimators for the estimation of the right-censored time series model. It should be noted that these two methods were used for the first time in the setting of a time series estimation procedure. Inferences were therefore carried out about their statistical properties. For example, among these, the error terms obtained from the estimates of both methods and the estimators of parametric and nonparametric components were inspected and their properties were extracted.

4.1. Properties of the Semiparametric BS Estimator

Firstly, the parametric component was inspected. As is known, in a parametric context, errors can be decomposed into the bias and the variance terms that provide the quality of the estimator. Accordingly, the estimator of the parametric coefficients vector is expanded as follows: where and matrices are as defined in Section 3.1 and . From here, bias and variance-covariance of estimator can be computed as follows: where is the variance of the fitted semiparametric model. Since the variance is not generally known, instead of , the estimation () based on the BS is used. It can be computed from the residuals sum of squares (RSS) using error terms: where denotes the degrees of freedom. In addition, needs algebraic operations. In the context of the BS, if the data have a normal distribution, is asymptotically unbiased. Secondly, the properties of estimated nonparametric component are given here. The bias of is one of the quality measurements for the estimated model. The bias is denoted as conditional expectation , given by: From that, the bias is given by: Accordingly, the covariance of can be computed as: where is defined by Equation (36). In addition, to reveal the performance of , the root square of mean squared error is used:

4.2. Properties of the Semiparametric AS Estimator

Similar to in Section 4.1, the same properties for parametric and nonparametric components are given for the AS estimator here. The necessary expansion is written as follows to derivate the bias and variance of : where and are given in Section 3.2. Now, the bias and the covariance matrix of the estimator can be provided by: where is the variance of the fitted semiparametric model. Similar to Equation (40), instead of the model variance, is obtained as follows: The properties of estimated nonparametric component for the AS method are described below. The bias and the variance of the AS estimator can be given, respectively, as: and Thus, the value of for , similar to Equation (41), is calculated as follows:

4.3. Quality Measures for the Fitted Model

After assessing the parametric and nonparametric components of the model in Section 4.1 and Section 4.2, several measurements are introduced in this section to evaluate the overall model performance. In the literature on time series modelling, mean absolute percentage error (MAPE), mean absolute error (MAE), and mean squared error (MSE) are the most commonly used performance criteria. To represent these criteria,  is preferred in this study. In addition, median absolute error (MedAE) was used, which allowed us to account for missing or censored data. Generalized MSE (GMSE) and the ratio of GMSE, proposed by [30] and [2], respectively, were used to measure the quality of the fitted time series model. The aforementioned criteria can be defined as follows: where  and  denote the fitted dependent variable values and  vector for any estimation method. Here,  and  are replaced by  and  for the BS, and  and  for the AS. In addition, to make a more informative comparison between the AS and BS estimators,  is defined below. The ratio of GMSE can be defined as follows: Regarding the  criterion, if , then it can be seen that the model fitted by the AS method shows better performance than the BS method.
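The pointwise error criteria named in this section can be computed as in the sketch below. It uses the standard textbook definitions of the mean absolute percentage error, mean absolute error, mean squared error, and median absolute error, which are assumed to match the paper's; the generalized MSE and its ratio are not reproduced because their exact form is not recoverable here. The function name `quality_measures` and the dictionary keys are hypothetical.

```python
import numpy as np

def quality_measures(y, y_hat):
    """Standard error criteria between responses y and fitted values y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    err = y - y_hat
    return {
        "MAPE": 100.0 * np.mean(np.abs(err / y)),   # percent; requires y != 0
        "MAE": np.mean(np.abs(err)),
        "MSE": np.mean(err ** 2),
        "MedAE": np.median(np.abs(err)),            # robust to a few large errors
    }
```

The median-based criterion is less affected by the few very large synthetic responses that the transformation produces, which is one reason to report it alongside the mean-based criteria.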

5. Further Information for Adaptive-Ridge Penalty

The semiparametric AS estimator proposed for the right-censored time series model, with its adaptive nature, aims for qualified estimations despite the censorship. To approach the -norm given in Equation (23), the most suitable knot locations can be chosen due to the weighted penalty term. Thus, the model avoids the disadvantages of synthetic data transformation, which gives higher magnitudes to uncensored observations. This section is designed to inspect some of the large sample properties of the modified AS estimator under right-censored data. It should be noted that adaptive ridge penalty in the setting of regression has been studied by many authors; see for example [25,26,28]. However, the aforementioned studies consider adaptive ridge penalty individually, not as a part of a semiparametric time series model. This section provides basic information for the large sample properties of the proposed AS estimator in the context of a semiparametric time series model. As previously stated, the AS approximation is a modified version of the P-splines (penalized BSs) estimator proposed by [31]. Note also that the AS method diverges from BSs with a significant difference of the -norm in the penalty term. The AS estimator is obtained by an iterative process with determining weights, as expressed in Section 3.2. In addition, apart from the usage of the AS method in the literature, it is also used for modelling censored time series. For these reasons, we can make several important assumptions. The large sample properties are written based on the assumptions given below: The minimization problem for the semiparametric AS is given in Equation (26). To make this expression more general, it can be rewritten as follows:whererepresents the-norm of the penalty term. The first assumption is, which allows approximation to the-norm with the acquisition of weights via the iterative process. 
Otherwise, the-norm needs overly complex calculations, which leads to the loss of practicality when using the method. From our knowledge of the literature, whensuch as in Equation (26), the minimization of Equation (50) works by penalizing the non-zero coefficients’s, as shown by [. Whenis examined asymptotically, the objective function of Equation (26) may not have a global minimum, since it is not clearly convex. However, if we assume that:then it is possible to point out some important aspects of asymptotic consistency. Therefore, it should be presumed thatis a non-negative definite matrix and:where elements of. , , and are assumed to be full rank matrices. Under the assumptions given above, to see asymptotic consistency of and , an equation can be obtained from Equation (50) as follows: where denote the limiting case of the estimators for . Note that Equation (52) is ensured by following Theorem 1. Based on Assumptions 1–3, and , then where: Therefore, for optimal, paircan be counted as a consistent AS estimator of. In this context, whenthen. For the proof of Theorem 1, see Appendix C. To clearly indicate the place of Assumptions 1–3 in the estimation process, the following explanations are given for each assumption. Assumption 1 is independent from the data. We assume that to provide a practical solution when minimizing Equation (50). Therefore, in both empirical and real data studies, this assumption does not impose anything to the dataset, but it is necessary to reduce the computational complexity. In real data studies, to ensure Assumption 2, “” matrix obtained by using the nonparametric covariate needs to have independent columns. Because should be identifiable and avoid the ill-posed problem, must be a full-ranked matrix. Assumption 3 confirms Assumption 2. Thus, it can be seen that asymptotic consistency can be confirmed by Assumption 3. From that it can be said that Assumption 3 is indirectly depended on the dataset.

5.1. Asymptotic Distribution and Consistency of the Proposed Estimator

In this section, the estimate of parametric component is inspected in terms of asymptotic consistency and asymptotic distribution. Assume the following regularity conditions: for non-negative matrix ; Autoregressive errors ’s given in Equation (2) are stationary with independent and identically distributed random error terms ’s that have zero mean and finite variance ; exists. Here, condition (ii) indicates that the diagonal elements of and are identical and one, because the covariates are scaled. To obtain the asymptotic distribution of , “nearly-singular” designs are performed due to for . Thus, it can be ensured that asymptotically. On the other hand, and are assumed as non-singular in Section 5.1. To show the consistency and asymptotic normality of the semiparametric AS estimator when conditions (i), (ii), and (iii) are ensured with non-singular , first the case of is considered, followed by the case of . Let be an asymptotic estimator. The consistency of can be reached by using following minimization function: The following theorem shows the consistency of for validated additional assumption . Assume that is non-singular, behaves stable, and . It can then be said that as : where is a consistent estimator of . The proofs of this theorem are given in , and therefore is a consistent estimator. It should be emphasized that the consistency of is sufficient to show that . However, this depends on the magnitude of growth of . When grows more slowly, then a limiting distribution exists. It is clear from Theorem 2 that the mean of the limiting distribution of converges to zero for the consistency of . In addition, its asymptotic variance can be obtained based on conditions (i) and (iv) as . Accordingly, the asymptotic distribution of the semiparametric AS estimator is written as: However, the limiting distribution depends on whether or . In the context of this paper, Theorem 3 is given for the limiting distribution of when . Assume that if . Then: where . 
The proofs of Theorem 3 are given in

6. Simulation Study

In this section, a simulation study was conducted to inspect the finite-sample behavior and performance of the two semiparametric estimators and under right-censored time series. These estimators were then compared through the quality measurements given in Section 4. The simulation scenarios are designed as follows: We use the model to generate datasets in the simulation experiments. The unknown smooth regression function is constructed by combining the functions that denote seasonal effects on the data, that is, , where with The design matrix is generated from a normal distribution: , where is the dimensional matrix for . Note also that the distribution need not be normal; a uniform or another distribution can be considered instead. The vector of regression coefficients is β = (β1, β2, β3) = (3, 0.5, −1). The autoregressive error terms are generated from a one-lagged process with and . Thus, as stated in (a), the completely observed dependent time series ’s are generated from the sum of the parametric, nonparametric, and error terms using (b), (c), and (d). To produce the right-censored variable , as specified in Equation (3), we generate the censoring variable from the binomial distribution with proportions or censoring levels (CLs) of 5%, 20%, and 40%. Algorithm 2 describes how the censoring variable is created. To deal with the censored observations in obtained with Algorithm 2, we use synthetic data values obtained through the Kaplan and Meier estimator [18], as described in Equation (6). An AR(1) model is used as a naïve model to estimate the right-censored time series, as in [1,2]. Thus, the finite-sample performance of the introduced methods can be assessed. For each CL in the simulation experiments, we generated 1000 random samples of size n = 50, 100, and 200. The results of the simulation study were divided into three parts: parametric components, nonparametric components, and overall model performance.
Accordingly, the outcomes of the estimated models, comparative results, and corresponding comments are given together in the following tables and figures. To illustrate the simulated datasets and scenarios, examples of some of the simulation configurations are given in Figure 1. Panel (a) shows the dataset for a small sample size and low censorship. Panel (b) shows the case in which the censoring level is very high. Panels (c) and (d) show the cases of medium and large sample sizes with censoring levels of 20% and 40%, respectively.
Figure 1

Some of the datasets generated using Algorithm 2 including both fully observed and censored data points for different censoring levels and sample sizes.
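The data-generating steps and the synthetic data transformation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' Algorithm 2: the seasonal function, the AR(1) coefficient `rho`, the innovation scale, and the way a censored value is placed below the true value are all hypothetical choices, since those details are not reproduced in this section. The function `km_synthetic` uses the common form of the Kaplan-Meier-based synthetic response, in which an observed value is divided by the estimated survival function of the censoring variable and a censored value is set to zero; Equation (6) may differ in detail.

```python
import numpy as np


def simulate_censored_series(n, cl, rng, beta=(3.0, 0.5, -1.0), rho=0.5):
    """Return (z, delta, y, X, t) for one illustrative right-censored series."""
    t = np.arange(1, n + 1) / n
    X = rng.normal(size=(n, len(beta)))            # normal design matrix
    # Assumed seasonal shape; the paper's function is not reproduced here.
    f = np.sin(2 * np.pi * t) + 0.5 * np.cos(4 * np.pi * t)
    u = rng.normal(scale=0.5, size=n)              # i.i.d. innovations (assumed scale)
    e = np.empty(n)
    e[0] = u[0]
    for i in range(1, n):                          # AR(1) errors; rho is an assumption
        e[i] = rho * e[i - 1] + u[i]
    y = X @ np.asarray(beta) + f + e               # fully observed response
    delta = rng.binomial(1, 1.0 - cl, size=n)      # 1 = observed, 0 = censored
    # Assumed mechanism: a censored point is recorded below its true value.
    z = y - (1 - delta) * rng.exponential(1.0, size=n)
    return z, delta, y, X, t


def km_synthetic(z, delta):
    """Synthetic response: z / (1 - G_hat(z)) if observed, 0 if censored."""
    z = np.asarray(z, dtype=float)
    delta = np.asarray(delta)
    n = len(z)
    order = np.argsort(z, kind="stable")
    d = delta[order]
    j = np.arange(1, n + 1)
    # Kaplan-Meier survival of the censoring variable: only censored points
    # (d == 0) contribute a factor (n - j) / (n - j + 1).
    factors = np.where(d == 0, (n - j) / (n - j + 1), 1.0)
    surv = np.empty(n)
    surv[order] = np.cumprod(factors)
    y_syn = np.zeros(n)
    obs = delta == 1
    y_syn[obs] = z[obs] / surv[obs]                # censored points stay at zero
    return y_syn
```

Calling `simulate_censored_series(200, 0.2, np.random.default_rng(0))` and passing the resulting `z` and `delta` to `km_synthetic` gives a response vector in which every censored point equals zero, which is the source of the zero values in the synthetic response discussed later in the section.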

6.1. Assessing the Parametric Component

In this section, the performances of the two methods were compared in terms of the parametric components of the right-censored semiparametric linear models generated by the simulation. It should also be noted that 54 different configurations were analyzed in this simulation study to provide a broad perspective on the adequacy of each method. The results for the parametric components are displayed in Table 1 and Figure 2. Note that the bolded scores indicate the best (minimum) values.
Table 1

Estimated regression coefficients from the AS and the B-spline (BS) with values of variance and bias.

True values: β1 = 3, β2 = 0.5, β3 = −1
n | CL (%) | Bias(β^1) AS | Bias(β^1) BS | Var(β^1) AS | Var(β^1) BS | Bias(β^2) AS | Bias(β^2) BS | Var(β^2) AS | Var(β^2) BS | Bias(β^3) AS | Bias(β^3) BS | Var(β^3) AS | Var(β^3) BS
50 | 5 | 0.887 | 0.870 | 0.936 | 0.842 | 0.809 | 0.786 | 0.922 | 0.845 | 0.867 | 0.837 | 0.884 | 0.804
50 | 20 | 0.852 | 0.895 | 1.180 | 1.290 | 0.888 | 0.892 | 1.210 | 1.358 | 0.963 | 0.949 | 1.191 | 1.336
50 | 40 | 0.999 | 1.172 | 1.455 | 1.641 | 0.916 | 1.108 | 1.431 | 1.657 | 0.946 | 1.145 | 1.453 | 1.674
100 | 5 | 0.510 | 0.470 | 0.440 | 0.425 | 0.539 | 0.434 | 0.433 | 0.422 | 0.515 | 0.467 | 0.439 | 0.431
100 | 20 | 0.514 | 0.610 | 0.583 | 0.609 | 0.538 | 0.579 | 0.583 | 0.609 | 0.527 | 0.599 | 0.590 | 0.618
100 | 40 | 0.535 | 0.433 | 0.619 | 0.689 | 0.525 | 0.622 | 0.619 | 0.689 | 0.535 | 0.610 | 0.629 | 0.692
200 | 5 | 0.285 | 0.271 | 0.260 | 0.253 | 0.290 | 0.272 | 0.255 | 0.255 | 0.294 | 0.271 | 0.252 | 0.254
200 | 20 | 0.310 | 0.324 | 0.333 | 0.355 | 0.311 | 0.300 | 0.325 | 0.351 | 0.304 | 0.296 | 0.328 | 0.353
200 | 40 | 0.314 | 0.333 | 0.338 | 0.352 | 0.321 | 0.337 | 0.332 | 0.356 | 0.307 | 0.336 | 0.332 | 0.363

The bolded values indicate the best scores.

Figure 2

Boxplots of bias values for both the AS and BS methods for all configurations. On the x-axis, b1, b2, and b3 denote β1, β2, and β3; A1, A2, and A3 denote biases obtained from the AS method for CLs of 5%, 20%, and 40%. Similarly, B1, B2, and B3 denote biases for the BS method when the CLs are 5%, 20%, and 40%.

A careful inspection of Table 1 shows that the behaviors of the BS and AS change noticeably across scenarios. Consider the low and medium CLs for ; under these conditions, the BS has a clear advantage over the AS. This can be interpreted as the BS fitting the data better when the data’s structure is less distorted by censorship. However, for CL = 40%, which means the data are heavily censored, the AS method gives better scores. As the sample size increases, the bias and variance values of the two methods become closer, but the AS still estimates the parametric component more efficiently. Regarding the parametric component, it should be emphasized that the AS behaves as expected and gives the best scores under heavy censorship. In general, the best scores for each method can be evaluated in terms of bias and variance. For the biases of the regression coefficients, the AS method gives the best score in only 12 of 27 configurations, while the BS method gives the best score in 15. For the variances, however, the AS gives the best score in 18 of 27 configurations, while the BS is superior in only 9. In Figure 2, Panels (a)–(c) show the biases calculated in each simulation repetition for the small, medium, and large sample sizes, respectively.
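As an illustration of how bias and variance scores of this kind can be computed from Monte Carlo output, the following sketch takes a matrix of coefficient estimates with one row per replication. The function name and the use of the absolute bias are assumptions made here for illustration; the exact definitions used for Table 1 are those given in Section 4.

```python
import numpy as np


def bias_and_variance(estimates, beta_true):
    """Absolute Monte Carlo bias and sample variance of coefficient estimates.

    estimates : array of shape (replications, p), one row per simulated dataset
    beta_true : array of shape (p,), the coefficients used to generate the data
    """
    estimates = np.asarray(estimates, dtype=float)
    beta_true = np.asarray(beta_true, dtype=float)
    bias = np.abs(estimates.mean(axis=0) - beta_true)   # |mean estimate - truth|
    var = estimates.var(axis=0, ddof=1)                  # unbiased sample variance
    return bias, var
```

With 1000 replications and three coefficients, `estimates` would have shape (1000, 3), and the two returned vectors correspond to one row of bias and variance entries for a single method, sample size, and censoring level.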

6.2. Evaluating the Nonparametric Component

As in the case of the parametric components, we constructed 1000 estimates of the regression function f, which is the nonparametric component of model (1). For each method, 1000 replications were carried out, and the estimated bias, variance, and RMSE values were computed for each estimator. This section presents the simulation results for the nonparametric component. The results in Table 2 show that the AS method is efficient in estimating the nonparametric component when the time series data are moderately to heavily censored. On the other hand, for CL = 5%, the BS method gives better results for all sample sizes according to our evaluation metrics. One of the main reasons for this is that the BS adapts to the knots more than the AS does. Consequently, when the data points are manipulated by censorship, these knots force the BS to make inefficient estimates. At this point, the knot determination of the AS, based on the weights given in Equation (24), diminishes the effect of the censorship. That is why the AS method performs better for moderately and heavily censored time series data.
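The idea of iteratively calculated weights that switch a difference penalty on or off can be illustrated with a generic adaptive ridge iteration. The sketch below is not the authors' estimator and does not reproduce Equation (24): it smooths the data directly (an identity basis instead of a B-spline basis), penalizes first differences, and uses the weight update w_j = 1 / ((Δa_j)^2 + ε^2) that is common in adaptive ridge procedures. The function name and the default values of `lam`, `eps`, and `n_iter` are hypothetical.

```python
import numpy as np


def adaptive_ridge_smooth(y, lam=5.0, eps=1e-3, n_iter=30):
    """Iteratively reweighted difference penalty on a signal (illustrative).

    Minimizes ||y - a||^2 + lam * sum_j w_j * (a_{j+1} - a_j)^2, where the
    weights are refreshed from the current fit at every iteration.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    D = np.diff(np.eye(n), axis=0)            # first-difference matrix, (n - 1, n)
    w = np.ones(n - 1)                        # start from an ordinary ridge penalty
    for _ in range(n_iter):
        A = np.eye(n) + lam * D.T @ (w[:, None] * D)
        a = np.linalg.solve(A, y)             # weighted penalized least squares
        w = 1.0 / ((D @ a) ** 2 + eps ** 2)   # small jumps get a heavy penalty
    return a, w
```

Positions whose weight stays small after the iterations are those where the fit is allowed to change, which plays the role of a selected knot in this simplified setting; positions with large weights are effectively fused with their neighbors.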
Table 2

Outcomes from the fitted nonparametric components.

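n | CL (%) | Bias(α^) AS | Bias(α^) BS | Var(α^) AS | Var(α^) BS | RMSE(f,f^) AS | RMSE(f,f^) BS
50 | 5 | 1.085 | 0.629 | 0.048 | 0.022 | 1.135 | 0.883
50 | 20 | 1.128 | 1.498 | 0.066 | 0.075 | 1.099 | 2.061
50 | 40 | 1.287 | 2.510 | 0.079 | 0.095 | 2.511 | 3.127
100 | 5 | 0.961 | 0.851 | 0.022 | 0.025 | 0.824 | 0.664
100 | 20 | 1.040 | 1.217 | 0.030 | 0.041 | 1.255 | 1.779
100 | 40 | 1.070 | 1.302 | 0.037 | 0.070 | 1.815 | 2.331
200 | 5 | 0.891 | 0.813 | 0.009 | 0.008 | 0.670 | 0.435
200 | 20 | 0.928 | 0.959 | 0.013 | 0.021 | 1.547 | 1.871
200 | 40 | 0.995 | 1.070 | 0.017 | 0.028 | 2.397 | 2.882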

The bolded values indicate the best scores.

Figure 3, consisting of four panels (a), (b), (c), and (d), illustrates the performance of the AS and BS methods in nonparametric curve estimation under different simulation configurations. Panel (a) shows the estimated curves for a small sample size and a medium censoring level. Similarly, Panel (b) shows the case of a medium sample size and a high censoring level. Panel (c) shows the estimated curves for a small sample size and a low censoring level. Finally, Panel (d) shows the estimated curves when the sample size is large and the censoring level is medium. When panels (a) and (c) are compared, the effect of the censoring level can be seen. At first glance, the distortion of both curves is noticeable. However, the BS method represents the censored time series less adequately than the AS method. In addition, panel (b) shows that when the data are heavily censored, the BS curve is drawn towards the x = 0 line, due to the presence of zero values in the synthetic response variable. Finally, panel (d) indicates that, although the time series contains censored data points, the quality of the estimates from both the AS and BS methods improves as the sample size increases.
Figure 3

Data points, real regression functions, and curves fitted by two methods. In the legend of the plots, f(A) and f(B) represent function estimates obtained from the AS and BS methods, respectively.

6.3. Assessing the Performances of Methods

This section presents the results for the overall model estimates obtained from the AS and BS methods. Although results were given for the parametric and nonparametric components in the previous sections, a separate review of the whole model estimation is required for a sound comparison. Accordingly, the performance scores for MAPE, MedAE, and GMSE are given in Table 3, and Figure 4 is drawn to illustrate the values.
Table 3

The values of performances from the AS and BS methods.

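n | CL (%) | MAPE AS | MAPE BS | MAPE AR(1) | MedAE AS | MedAE BS | MedAE AR(1) | GMSE AS | GMSE BS | GMSE AR(1)
50 | 5 | 0.166 | 0.157 | 0.322 | 0.419 | 0.383 | 0.999 | 3.119 | 3.510 | 4.915
50 | 20 | 0.358 | 0.348 | 0.388 | 0.737 | 0.896 | 1.052 | 4.468 | 4.920 | 5.142
50 | 40 | 0.584 | 0.688 | 1.980 | 1.030 | 1.519 | 1.971 | 7.762 | 9.542 | 10.751
100 | 5 | 0.154 | 0.186 | 0.303 | 0.323 | 0.320 | 0.860 | 1.001 | 0.928 | 3.614
100 | 20 | 0.333 | 0.336 | 0.365 | 0.668 | 0.750 | 0.914 | 1.870 | 1.988 | 4.147
100 | 40 | 0.514 | 0.528 | 1.476 | 1.025 | 1.831 | 1.891 | 3.663 | 4.182 | 6.798
200 | 5 | 0.111 | 0.096 | 0.283 | 0.264 | 0.251 | 0.717 | 0.983 | 0.761 | 1.935
200 | 20 | 0.312 | 0.332 | 0.364 | 0.552 | 0.606 | 0.847 | 2.065 | 2.497 | 3.411
200 | 40 | 0.499 | 0.508 | 0.654 | 1.008 | 1.086 | 1.501 | 2.759 | 2.816 | 3.131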

The bolded values indicate the best scores.

Figure 4

bar chart for the s of all simulation combinations.

When Table 3 is examined, it can be seen that the results obtained for the model estimates differ slightly from the previous results, as expected. One reason for this discrepancy is that the total error combines the errors from estimating the parametric and nonparametric components. The difference is easier to understand in the situations where the two methods produce extremely similar scores. Note that the AR(1) model performs poorly because of its parametric and linear structure. For the large sample size (n = 200), the scores of the models are close to each other. Nevertheless, the AS and BS methods are clearly much better at estimating right-censored time series. As can be seen from the bolded scores, the AS method generally performs better. From Table 3, it can be seen that the values obtained by BS are better for . However, as mentioned earlier, in this study, the criterion, which is not frequently used for time series data, is used to measure the durability of the predictions. The scores of this criterion show that, as stated from the beginning of the study, the BS method produces more successful estimates under low censorship levels, but the AS method is superior for medium and high censorship levels. Figure 4 includes the scores for both the AS and BS methods that are formed by the ratio of the values of each method. In Figure 4, the difference between the qualities of the estimates is clearly very small for . However, the difference becomes more pronounced for and . Note that for , the BS method gives smaller ratio values, which confirms the results given in Table 3. As stated before, the AS method is demonstrably superior at higher censorship levels, which can be seen in Figure 4 for all sample sizes.
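Two of the criteria reported in Table 3 have standard textbook forms that can be sketched as follows, together with a simple ratio of two scores. The function names are hypothetical, the standard definitions are assumed to match those in Section 4, and the direction of the ratio plotted in Figure 4 is not stated in this section, so `score_ratio` is only one possible convention. GMSE is left out because its definition is not reproduced here.

```python
import numpy as np


def mape(y, y_hat):
    """Mean absolute percentage error (as a proportion); y must be nonzero."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.mean(np.abs((y - y_hat) / y))


def medae(y, y_hat):
    """Median absolute error."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.median(np.abs(y - y_hat))


def score_ratio(score_a, score_b):
    """Ratio of two error scores; a value below 1 favors the first method."""
    return score_a / score_b
```

For example, the GMSE entries of Table 3 for n = 50 and a 40% censoring level give `score_ratio(7.762, 9.542)`, which is below 1 and therefore favors the AS method under this convention.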

7. Real-World Data

This section shows how the newly introduced semiparametric AS estimator and the benchmark BS method behave with a real right-censored time series dataset. For this purpose, we consider unemployment duration data consisting of the monthly unemployment duration rates for Turkey between 2004 and 2019; this dataset is available at https://ec.europa.eu/eurostat/databrowser/view/UNE_RT_M__custom_1635127/default/table?lang=en. In the dataset, the last three months of 2004 and the last three months of 2019 cannot be observed correctly. Therefore, these data points can be censored from the right by the detection limit zero, because none of the data points are negative values. Accordingly, the introduced semiparametric methods, AS and BS, can be used for this time series analysis. In addition, as in the simulation study, the results of the AR model are given in the following tables. However, unlike in the simulation study, an AR(2) model was used for the real data study, because the optimal lag is determined as 2 from Table 4. Before the modelling procedure, the stationarity of the time series data was tested with the augmented Dickey–Fuller (ADF) test, and the suitable lag was determined under the null hypothesis . The test results are given in Table 4 below:
Table 4

Augmented Dickey–Fuller (ADF) test results for the stationarity of time series data and the determination of the appropriate lag.

No. Lag | ADF Test Results | p-Value
0 | −2.61 | 0.318
1 | −3.27 | 0.077
2 | −3.52 | 0.041
3 | −3.33 | 0.066
4 | −3.30 | 0.072

Bold scores are significant at the 95% confidence level; here, this is the lag-2 row (p-value 0.041).

Table 4 shows that the second lag is suitable for modelling this time series. Based on this information, the semiparametric time series model can be given by: where s represent the dependent time series of the unemployment duration ratio, and denote the first and second lags of the dependent series that are used as covariates, respectively, denotes the seasonality, and finally, ’s are the stationary autoregressive error terms as given in Equation (2). Model (6.1) is estimated by both the AS and BS methods, and the results are presented in Table 5, Table 6, and Figure 5.
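The structure of this model, with two lags of the response as parametric covariates and a seasonal index as the argument of the nonparametric component, can be sketched by building the corresponding arrays from a monthly series. The function name, the `period=12` default, and the coding of the seasonal index as the month position 1 to 12 are assumptions made for illustration; the paper's exact definition of the seasonality variable is not reproduced in this section.

```python
import numpy as np


def lagged_design(y, n_lags=2, period=12):
    """Build the response, lagged covariates, and seasonal index for a series.

    Returns
    -------
    response : y_t for t = n_lags, ..., n - 1
    lags     : matrix whose k-th column holds y_{t-k}, k = 1, ..., n_lags
    season   : position within the seasonal cycle (1, ..., period)
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    response = y[n_lags:]
    lags = np.column_stack(
        [y[n_lags - k : n - k] for k in range(1, n_lags + 1)]
    )
    season = (np.arange(n_lags, n) % period) + 1
    return response, lags, season
```

The first `n_lags` observations are dropped because their lags are not available, so a series of length n yields n − 2 usable rows when two lags are used.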
Table 5

The performances of the BS and AS methods for the estimation of both parametric and nonparametric components.

Measurement | Bias AS | Bias BS | Variance AS | Variance BS
β^1 | 1.941 | 2.682 | 1.272 | 1.703
β^2 | 0.915 | 1.139 | 1.562 | 1.624
α^ | 3.628 | 4.566 | 0.067 | 0.058

The bolded values indicate the best scores.

Table 6

Scores of performance measures for the AS and BS methods obtained from the whole model estimation.

Method | MAPE | MedAE | GMSE | RGMSE | RMSE(f,f^)
AS | 0.623 | 0.510 | 1.275 | 0.824 | 1.154
BS | 1.315 | 1.166 | 1.546 | 1.212 | 1.385
AR(2) | 1.856 | 4.506 | 3.702 | 2.775 | -

The bolded values indicate the best scores.

Figure 5

Estimated curves for the seasonality obtained from the AS and BS methods.

Table 5 reports the bias and variance values for the estimated regression coefficients (β^1, β^2, and α^). Accordingly, the AS method gives smaller bias and variance values than the BS method for β^1 and β^2. Moreover, the AS method has a better bias value for α^, but the BS method gives a smaller variance value for α^ than the AS method. Overall, the AS and BS methods give similar values, because the data properties are and . Thus, the results for the unemployment duration data confirm the simulation outputs. In addition, it should be noted that the outcomes obtained from estimated model (7.1) are given in Table 6 with scores for the estimated nonparametric function . Upon close inspection, the results show that the AS method produces the best scores. It should be emphasized that the largest difference between the methods regarding performance criteria is in , which indicates the strength of the AS method under censorship. Table 6 also shows that the results of the AR(2) model are worse than those of the other two methods, as in the simulation study. Note that because the sample size of the real data, , is close to the simulation configurations with , the scores are relatively close to each other. Figure 5 compares the AS and BS methods in representing data under censorship. As can be seen in Figure 5, the estimated curves are quite similar due to the large sample size and low CL of the data. The effect of the synthetic data manipulation is visible in the figure as zero values. As in the simulation study, the BS method is affected by these zero values more than the AS method. The reason for this is that the knots of the AS method are determined by iteratively calculated weights. Therefore, the optimal knot sequence diminishes the effect of censorship.

8. Concluding Remarks

This paper demonstrated the estimation of right-censored time series data using a newly introduced semiparametric AS estimator and compared it with the BS method as a benchmark. The results obtained from both a simulation study and a real data example indicated that the introduced method (AS) achieves superior modelling of right-censored time series data in a semiparametric context. The comparative outcomes also show that the AS method provides better performance scores than the BS method in most simulation configurations and in the real data example. The most important factor in the success of the AS method is its adaptive nature, which is based on iteratively calculated weights. In the AS method, the weights determine and control the penalty term and, in turn, the optimal knot points. Accordingly, our findings showed that the proposed method has an advantage over the benchmark in modelling right-censored time series. The simulation study examined the performance of the methods in three parts: the outcomes for the estimated parametric component (Table 1 and Figure 2), the nonparametric component (Table 2 and Figure 3), and the whole semiparametric model (Table 3 and Figure 4). The unemployment data estimation was evaluated for bias and variance (Table 5) and using the criteria of MAPE, MedAE, GMSE, and RGMSE (Table 6). Given the outcomes of the simulation study and the real data example, our general and detailed conclusions are as follows: As expected, the estimation quality of both the AS and BS methods changes with the CL and the sample size. The performances of the methods are affected negatively by increasing CLs, and they give better results for larger sample sizes. This can be seen clearly from Table 1, Table 2 and Table 3. When the unemployment duration data were analyzed, the results agreed with the corresponding configuration () of the simulation study.
One of the striking results of this paper is that, as Table 1, Table 2 and Table 3 demonstrate, while the AS method gives worse results than the BS method at low censorship levels, it provides considerably better results at medium and high censorship levels. This conclusion supports the claim of the paper, which is that using the AS method reduces the effect of the data manipulation caused by the synthetic data transformation. When all the results obtained from the simulation and real data studies are inspected, the AS method gives better results than the BS method, except in the configurations with low CLs, which supports the targeted conclusion. The unemployment duration data were modelled by the BS and AS methods using two lagged parametric components and the seasonal effect as a nonparametric component. Table 5 and Table 6 show each method’s scores using four evaluation metrics, which indicate the superiority of the AS method. Figure 5 shows the estimated curves for both methods, which are similar. However, the estimated curves show that the AS method is less affected by the zero values of the synthetic data and thus gives more satisfactory estimates for the right-censored time series model than the BS method. Finally, as can be understood from the whole paper, the AS method is superior to the BS method for estimating right-censored time series in both theory and practice.