
Identification and forecasting in mortality models.

Bent Nielsen, Jens P. Nielsen.

Abstract

Mortality models often have inbuilt identification issues challenging the statistician. The statistician can choose to work with well-defined freely varying parameters, derived as maximal invariants in this paper, or with ad hoc identified parameters which at first glance seem more intuitive, but which can introduce a number of unnecessary challenges. In this paper we describe the methodological advantages from using the maximal invariant parameterisation and we go through the extra methodological challenges a statistician has to deal with when insisting on working with ad hoc identifications. These challenges are broadly similar in frequentist and in Bayesian setups. We also go through a number of examples from the literature where ad hoc identifications have been preferred in the statistical analyses.


Year:  2014        PMID: 24987729      PMCID: PMC4060603          DOI: 10.1155/2014/347043

Source DB:  PubMed          Journal:  ScientificWorldJournal        ISSN: 1537-744X


1. Introduction

Mortality models are commonly used in a wide range of fields such as actuarial sciences, epidemiology, and sociology. They are often used in important decisions such as how to deal with unisex legislation in the pension industry; see Ornelas et al. [1] and Jarner and Kryger [2]. However, such models often have inbuilt identification issues stemming from overparametrisation. While identification issues are omnipresent in statistical modelling, this paper focuses on mortality modelling, where estimated parameters are treated as time series and extrapolated to give forecasts of future mortality. The underlying theme of this paper is to provide strategies for avoiding arbitrariness resulting from the identification process. We suggest two ways forward. First, we can reparametrise the model in terms of a freely varying parameter, which therefore has to be of lower dimension than the original parameter. Secondly, we can work with an identified version of the original parameter as long as we keep track of the consequences of the identification choice. That way we ensure that two researchers making different identification choices get the same statistical inferences and forecasts. A simple example is the age-period model for an age-period array of mortality rates. It is well-known that the levels of the age- and period-effects cannot be determined from the likelihood; this indeterminacy is the overparametrisation of the model. When the estimated age- and period-effects are treated as time series and subjected to plotting and extrapolation, our approach ensures that the statistical analysis is the same for two researchers identifying the above model in two different ways. Whereas this issue is relatively simple for the age-period model, identification becomes trickier for complicated models such as the age-period-cohort model and the model of Lee and Carter [3], let alone two-sample situations.
Mortality models are built as a combination of age, period, and cohort effects, but the likelihood only varies with a surjective function of these time effects. The time effects can be divided into two parts: one part that moves the likelihood function and another part that does not induce variation in the likelihood function. We will argue that all inferences and forecasts should be concerned primarily with the part of the parameter that moves the likelihood function. This does not preclude the researcher from working with the time effects, but it gives some limitations on what can be done. This is important because the motivation and the intuition of mortality models typically originate in the time effects. For instance, in the context of an age-period-cohort model linear trends cannot be identified, so time series plots of the time effects need to be invariant to linear trends, and extrapolations of time effects must preserve the arbitrary linear trend in the time effects. This applies regardless of whether the identification issue is dealt with in a frequentist manner or by Bayesian methods. To formalise the discussion slightly, return to the age-period example. Denote the predictor for the age-period data array by μ. The age-period model then determines how the predictor μ varies with a vector θ summarising age and period effects. That vector is split into two components ξ and λ so that the predictor only depends on θ through ξ but not on λ, which cannot be identified by statistical analysis. In the age-period example ξ could reflect the contrasts and the overall level of the predictor μ, whereas λ reflects the level of the age effect. The more principled solution is then to work exclusively with ξ and simply consider θ as a motivation rather than the objective of the analysis. Another solution is to ad hoc identify λ based on a notion of mathematical convenience or based on a particular purpose given the substantive context.
Once an ad hoc identification of λ is chosen the identification problem appears to go away, because the likelihood analysis can now go through. The reason is that the variation of θ is now reduced to the variation of ξ precisely because λ is fixed. Suppose two researchers choose the same likelihood and the same parametrisation of ξ but different ad hoc identifications λ† and λ‡. Which of their conclusions will be the same and which will be different? As the likelihood only depends on ξ the fits of the two researchers will be identical. But differences might arise if the statistical inference or forecasting or any other statistical analysis involves λ in some way. Indeed, with many extrapolation methods forecasts will be invariant to the choice of λ. But there will also be extrapolation methods where this is not the case. Examples arise in the age-period-cohort model, where linear trends have to be handled with care. We will start by analysing linearly parametrised models at a rather general level. We do this with two aspects in mind. First, we need to step back to a point in the analysis before ad hoc identification is made. Secondly, we also want to avoid the discussion of how to choose ξ and λ, which tends to be specific to the mortality model in question. Working at the general level we can focus on the mappings between different parametrisations and the invariance properties coming from these mappings. It is then seen that the parameter ξ arises as a maximal invariant. The general setting also allows the formulation of a series of results discussing different types of ad hoc identification, first in a frequentist fashion and then in a Bayesian fashion. Subsequently, we will consider the age-period-cohort model in detail, both for one- and two-sample situations. Using the general results it becomes easier to see that a number of popular methods inadvertently include features that are not invariant to ad hoc identification.
These include the “intrinsic estimator” advocated by Yang et al. [4], the “mixed model approach” by Yang and Land [5], the Bayesian approach by Berzuini and Clayton [6], and the two-sample analysis by Riebler and Held [7]. Finally, we consider the nonlinearly parametrised model of Lee and Carter [3]. The nonlinearity gives a further complication since the mapping from the time effects to the mortality predictor is nondifferentiable. As it turns out, the mortality predictor varies in a smooth space, so the nondifferentiability is avoided by working directly with the mortality predictor instead of the original time effects. In turn, a Lee-Carter application should consider whether a certain matrix has rank one or zero. Apart from that, the analysis is similar to that of linearly parametrised models. Likewise a theory is given for two-sample situations. Throughout the paper our concern rests exclusively with the identification problem and the consequences of ad hoc identification for estimation, plots, inference, and forecasting. In practice, important additional concerns are how to choose appropriate models and forecasting methods. We refer to Girosi and King [8] and Pitacco et al. [9] for general discussions of these issues, and also to Kuang et al. [10] and Coelho and Nunes [11] for discussions of forecast methods in the light of structural breaks. Instead, the aim of the paper is to present an overall framework that can help streamline the identification discussion that has appeared in so many papers in so many fields over so many years. Section 2 of this paper considers standard linear statistical models, which lend themselves to a relatively straightforward analysis based on linear algebra. Any ad hoc identification splits the time effect into two components. The first component is an arbitrary component, which is not needed for the identification of the likelihood.
The other component is necessary and sufficient to identify the model and hence sufficient for statistical analysis. In Section 3 it is outlined how to analyse the statistical model when the latter component is ad hoc identified. It is argued that this can cause difficulties for estimation, interpretation, and forecasting. In Section 4 it is shown that Bayesian analysis shares the same challenges as the frequentist approach. In Sections 5 and 6 we study two particular examples: the omnipresent age-period-cohort and Lee-Carter mortality models. All proofs are collected in the Appendix.

2. Statistical Models with Linear Parametrisations

In this section we present the identification problem in a linear framework. The problem is solved by analysing the mapping from the original time effect to the predictor which, in turn, leads to standard statistical analysis. In Section 6 we show how these ideas transfer to a nonlinear context. This contrasts with Section 3 in which we illustrate the analytical challenges and inconveniences arising from ad hoc identification. In Section 2.1 we present the overparametrised linear model for the mortality predictor. The identification problem is defined in Section 2.2 via the likelihood. In an overparametrised linear model two different parameters might produce the same likelihood. In Section 2.3 we analyse the mapping from the overparametrised parameter to the predictor. This mapping enables us to split the overparametrised parameter into two: one arbitrary parameter and one parameter that identifies the model without being overparametrised. This latter parameter is shown to be a maximal invariant parameter. In Section 2.4 it is demonstrated how any statistical analysis can be based on this maximal invariant parameter alone. In particular we comment that visual data representations, hypothesis testing, and forecasting are simple and well defined. The analysis of the linearly parametrised model involves projections on linear or affine spaces and on their orthogonal complements. It is therefore convenient to introduce the following notation. A matrix m has full column rank if m′m is invertible. In this case the orthogonal complement m⊥ is a matrix so m⊥′m = 0 and (m, m⊥) is invertible. Thus, when m itself is invertible then m⊥ is the empty matrix. It is not difficult to calculate m⊥ in practice; an explicit construction of m⊥ follows from a singular value decomposition of mm′, choosing m⊥ as the eigenvectors associated with the zero eigenvalues. Moreover, let m̄ = m(m′m)^{−1} so that m̄′m is the identity matrix, while mm̄′ + m⊥m̄⊥′ = I is the orthogonal projection identity.
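As a concrete numerical sketch of the construction just described (the matrix m below is purely illustrative), the orthogonal complement m⊥ can be computed from the eigenvectors of mm′ associated with zero eigenvalues:

```python
import numpy as np

# Sketch: given a matrix m with full column rank, build an orthogonal
# complement m_perp from the eigenvectors of m m' with zero eigenvalues.
def orthogonal_complement(m):
    m = np.asarray(m, dtype=float)
    # Eigen-decomposition of the symmetric matrix m m'
    eigval, eigvec = np.linalg.eigh(m @ m.T)
    # Eigenvectors with (numerically) zero eigenvalues span the complement
    tol = 1e-10 * max(1.0, eigval.max())
    return eigvec[:, eigval < tol]          # q x (q - p)

m = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # 3 x 2, full column rank
m_perp = orthogonal_complement(m)

assert np.allclose(m_perp.T @ m, 0)                  # m_perp' m = 0
assert abs(np.linalg.det(np.hstack([m, m_perp]))) > 1e-10   # (m, m_perp) invertible
```

The tolerance cut-off is a numerical convenience; any routine that returns a basis for the null space of m′ serves the same purpose.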

2.1. The Model

Think of the time effect θ as our preferred intuitive, but unidentified parameter, and think of the predictor μ as some function of θ specifying the model at hand. In a Poisson type model, where the mean specifies the distribution, μ could be the log of that mean. Such Poisson models are omnipresent in mortality models. We will often think of θ as containing some time effects. Often forecasting is carried out simply by isolating and extrapolating such a time effect. Consider a data vector Y of dimension n. This could, for instance, be the vector consisting of the stacked mortality rates for a rectangular age-period array of dimension I × J in which case n = IJ. The statistical model for Y could be a generalized linear model. This involves an appropriately chosen distribution and a link function, which links the expected mortality rate to an n-dimensional predictor, which is denoted by μ. Taken together this defines a likelihood function L(μ; Y). The model for the predictor μ is constructed in terms of, for instance, age, period, and cohort time effects. These time effects are summarized in a vector θ, which is of dimension q < n. Therefore μ is a surjective function of θ. For the moment the specification of the predictor is assumed linear so that μ = Dθ (1) for some design matrix D ∈ R^{n×q}. We refer to this specification as the mortality model, while the space Θ is the time effect space. The time effect space is chosen as an unrestricted real space in accordance with the starting point of most mortality analyses. The parameter space for the likelihood function and therefore for the statistical model is given by the range of variation for the predictor μ; that is, M = {μ ∈ R^n : μ = Dθ for some θ ∈ Θ}. (2) The likelihood function is assumed uniquely identified on this space in the sense that for all pairs of predictors with μ† ≠ μ‡ the likelihoods of μ†, μ‡ differ; that is, L(μ†; Y) ≠ L(μ‡; Y) for Y in a set with positive probability.

2.2. The Identification Problem

The identification problem of mortality models arises when the mapping from the time effect space Θ to the parameter space M is surjective but not injective. With a linear parametrisation this arises when the design matrix D has reduced column rank p < q so D′D is singular. In this situation there exist time effects θ† ≠ θ‡ with the same likelihood: L(Dθ†; Y) = L(Dθ‡; Y) for all data Y. Then the time effect space Θ is not useful as a parameter space for the statistical model.
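The age-period example can make this concrete. The sketch below (with illustrative dimensions I, J) stacks the design for μ_ij = α_i + β_j and exhibits the rank deficiency: shifting a level from the age effect to the period effect leaves the predictor, and hence the likelihood, unchanged.

```python
import numpy as np

# Illustrative age-period design: mu_ij = alpha_i + beta_j on an I x J array.
# Stacking gives a design D of dimension n x q with n = I*J and q = I + J,
# but column rank p = I + J - 1: the levels of alpha and beta are confounded.
I, J = 3, 4
n, q = I * J, I + J
D = np.zeros((n, q))
for i in range(I):
    for j in range(J):
        row = i * J + j
        D[row, i] = 1.0          # age effect alpha_i
        D[row, I + j] = 1.0      # period effect beta_j

p = np.linalg.matrix_rank(D)
assert p == I + J - 1 and p < q              # D'D is singular

# Two different time effects with the same predictor, hence same likelihood:
theta1 = np.arange(q, dtype=float)
theta2 = theta1 + np.concatenate([np.ones(I), -np.ones(J)])  # shift the level
assert not np.allclose(theta1, theta2)
assert np.allclose(D @ theta1, D @ theta2)
```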

2.3. Analysing the Mapping θ ↦μ

When analysing the mapping from our intuitively preferred parametrisation θ into the linear predictor μ, we will be able to rewrite θ as a sum of two components: one is a function of the predictor and the other is the arbitrary part varying with θ, but not with the predictor. We provide two methods for analysis. The first method is to find a basis X ∈ R^{n×p} with full column rank p for the design D. The design matrix of the mortality model can then be expressed as D = XA′ for some matrix A ∈ R^{q×p} with full column rank p. Introduce a new p-dimensional parameter ξ = A′θ. The parameter space M can then be written more parsimoniously as M = {μ = Xξ : ξ ∈ R^p}. (6) The mapping from ξ to μ is bijective, so the statistical model can just as well be parametrised in terms of ξ ∈ Ξ = R^p. Alternatively, the identification problem can be expressed through an invariance argument. This argument relates to the parameterization but resembles the classical invariance argument for reduction of data; see Cox and Hinkley [12, page 157]. With a linear parametrisation the argument involves the orthogonal complement to the matrix A. That is a matrix A⊥ ∈ R^{q×(q−p)} which has the properties that A⊥′A = 0 and that (A, A⊥) is invertible. The mortality model (1) is defined by the mapping from Θ = R^q to M. This mapping is surjective in that two different values of θ may result in the same μ and therefore the same likelihood. These equivalence classes in the time effect space can be described by the group of transformations g(θ) = θ + A⊥(A⊥′A⊥)^{−1}ζ (7) acting on Θ for arbitrary ζ ∈ R^{q−p}. Indeed, it holds that θ and g(θ) will result in the same μ. The mapping θ ↦ μ is therefore invariant to the group g. We will argue that the parameter ξ = A′θ is a maximal invariant to the group g acting on Θ, which provides a link with (6). It has to be argued that for any θ†, θ‡ so that ξ† = A′θ† equals ξ‡ = A′θ‡ then θ‡ = g(θ†); see Cox and Hinkley [12, page 159]. For this argument use the orthogonal projection identity to write θ = A(A′A)^{−1}ξ + A⊥(A⊥′A⊥)^{−1}φ for unique ξ = A′θ and φ = A⊥′θ.
Thus, if A′θ‡ = A′θ† then θ‡ = g(θ†) with ζ = φ‡ − φ† = A⊥′(θ‡ − θ†). In applications it can be difficult to find a basis X for the design D. It can be easier to find a group g and hence A⊥ and then use this information to construct A and a candidate basis X = DA(A′A)^{−1}, noting that D = XA′. This argument leaves it to be proven that X is a basis, or equivalently, that the suggested group g actually describes the equivalence classes of the mapping from θ to μ. It is useful to note that in the choices of X, A only the spaces spanned by them are unique since XA′ = (Xm)(m^{−1}A′) for any invertible m ∈ R^{p×p}. Likewise, the maximal invariant ξ is only unique up to bijective transformations. This lack of uniqueness has no impact on the analysis of the likelihood although it influences interpretations.
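The orthogonal projection identity and the invariance of ξ = A′θ under the group g can be checked numerically. The matrices A and X below are arbitrary illustrations, not taken from a particular mortality model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy check of theta = Abar xi + Abar_perp phi with xi = A' theta,
# phi = A_perp' theta, and of the invariance of mu = X A' theta under
# g(theta) = theta + A_perp (A_perp' A_perp)^{-1} zeta.
q, p, n = 5, 3, 8
A = rng.standard_normal((q, p))               # full column rank (a.s.)
X = rng.standard_normal((n, p))               # basis, so the design is D = X A'
U, s, _ = np.linalg.svd(A, full_matrices=True)
A_perp = U[:, p:]                             # q x (q - p), A_perp' A = 0
assert np.allclose(A_perp.T @ A, 0)

Abar = A @ np.linalg.inv(A.T @ A)
Abar_perp = A_perp @ np.linalg.inv(A_perp.T @ A_perp)

theta = rng.standard_normal(q)
xi, phi = A.T @ theta, A_perp.T @ theta
assert np.allclose(theta, Abar @ xi + Abar_perp @ phi)  # projection identity

# The group moves theta but not the maximal invariant, nor the predictor
zeta = rng.standard_normal(q - p)
g_theta = theta + Abar_perp @ zeta
assert np.allclose(A.T @ g_theta, xi)                    # xi is invariant
assert np.allclose(X @ A.T @ g_theta, X @ A.T @ theta)   # same mu
```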

2.4. Statistical Analysis Using the Maximal Invariant Parameter

The statistical model parametrised with the maximal invariant parameter ξ can be analysed by standard statistical techniques. This contrasts with the range of problems that arise when working with an ad hoc identified time effect θ. In the following the relatively simple standard statistical analysis of the model parametrised by ξ is discussed with respect to likelihood theory, interpretation, plots, hypothesis testing, forecasting, and Bayesian analysis. In Sections 3 and 4 we give an overview of the much more complicated theory underpinning models parametrised by the ad hoc identified time effect θ. Age-period-cohort examples follow in Section 5.

2.4.1. Exponential Family Theory

Suppose the likelihood is drawn from a generalized linear model based on an exponential family. Then the model is actually a regular exponential family where the maximal invariant parameter ξ is the canonical parameter since it is freely varying in a real space; see Barndorff-Nielsen [13, page 116]. This opens up a wealth of convenient statistical properties such as a likelihood equation with a simple expression and explicit conditions for a unique solution. In contrast, ad hoc identified parameters are based on an injective mapping of the canonical parameter ξ into θ; see Sections 3.1 and 3.2. It is then more difficult to fully exploit the exponential family theory.

2.4.2. Interpretation and Plots

The maximal invariant parameter ξ varies freely in R^p. It can therefore be interpreted as the parameter of any standard statistical model. Since ξ is freely varying the coordinates of ξ can be interpreted independently. When θ is a collection of time effects then ξ can be organised as a collection of time series. Since the coordinates of ξ are freely varying the time series plots of the components of ξ have the usual interpretation of time series. In contrast, ad hoc identified estimators are constrained to a p-dimensional subspace Θ_λ of Θ = R^q, which is often affine but can be more complicated. A consequence is that plots are complicated to evaluate; see Section 3.4.1.

2.4.3. Hypothesis Testing

Hypotheses are easily formulated and analysed when using the maximal invariant parametrisation. An affine hypothesis that restricts ξ to vary in a p_H-dimensional affine subspace can be formulated as H′ξ = η for known matrices H ∈ R^{p×(p−p_H)}, η ∈ R^{p−p_H}. This implies a restriction on the predictor μ = Xξ of (6). Form the orthogonal complement H⊥ and recall the orthogonal projection identity so that ξ = H(H′H)^{−1}η + H⊥(H⊥′H⊥)^{−1}ξ_H with ξ_H = H⊥′ξ. Introduce the p_H-dimensional parameter ξ_H, a design matrix X_H = XH⊥(H⊥′H⊥)^{−1}, and an offset μ_H = XH(H′H)^{−1}η. The restricted parameter space is M_H = {μ = μ_H + X_Hξ_H : ξ_H ∈ R^{p_H}}. In an exponential family context both the unrestricted model and the restricted model form regular exponential families. A variety of nice properties then follow for the estimators and the test statistics from the exponential family theory. Examples are given in Sections 5.3 and 5.5.3. In contrast, the hypothesis derived from restrictions on ad hoc identified parameters and the resulting degrees of freedom are complicated to analyse; see Section 3.4.2.
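The reduction of an affine hypothesis to an offset plus a smaller design can be checked numerically. The matrices X and H and the value η below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Under the constraint H' xi = eta the predictor X xi can be written as an
# offset plus a restricted design times the free parameter xi_H = H_perp' xi.
n, p, r = 10, 4, 1                     # r = number of constraints
X = rng.standard_normal((n, p))
H = rng.standard_normal((p, r))
eta = rng.standard_normal(r)

U, s, _ = np.linalg.svd(H, full_matrices=True)
H_perp = U[:, r:]                                     # p x (p - r)
Hbar = H @ np.linalg.inv(H.T @ H)
Hbar_perp = H_perp @ np.linalg.inv(H_perp.T @ H_perp)

# Construct some xi satisfying the constraint...
xi_free = rng.standard_normal(p)
xi = xi_free - Hbar @ (H.T @ xi_free - eta)           # project onto H' xi = eta
assert np.allclose(H.T @ xi, eta)

# ...and check: predictor = offset + restricted design times xi_H
xi_H = H_perp.T @ xi
offset = X @ Hbar @ eta
X_H = X @ Hbar_perp
assert np.allclose(X @ xi, offset + X_H @ xi_H)
```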

2.4.4. Forecasting

Most often the objective of a mortality study is to forecast future mortality. In the linear context, μ = Xξ, this is done by extending the design X and by extrapolating ξ. It is usually easy to extend the design X into the forecast horizon. This involves the construction of a triangular block matrix X̃ with an appropriate number of extra rows corresponding to the data over the forecast horizon as well as extra columns representing the extra parameters that would be needed. Extrapolating ξ into a vector ξ̃ then gives the forecast μ̃ = X̃ξ̃. The extrapolation of the parameter ξ can be done as follows. The estimated parameter, or part of it, can be thought of as a time series. Any forecast technique from the time series literature applied directly to ξ can be used, subject to the usual contextual considerations. Ad hoc identified time effects can be extrapolated in a similar way; see Section 3.4.3. This may, however, result in avoidable arbitrary effects in the forecast. Necessary and sufficient conditions for this eventuality are given for age-period-cohort models in Section 5.4.3. The practical examples are mainly Bayesian in nature and are discussed next.
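A minimal forecasting sketch, assuming (purely for illustration) that one block of ξ is a period effect extrapolated by a random walk with drift; the numbers, the single-age toy design, and the drift model are hypothetical choices, not a recommendation of the paper:

```python
import numpy as np

# Hypothetical estimated period effect kappa_1..kappa_J, treated as a
# random walk with drift for extrapolation over horizon h.
kappa = np.array([10.0, 10.5, 10.9, 11.6, 12.0])
J, h = len(kappa), 3

drift = (kappa[-1] - kappa[0]) / (J - 1)              # mean one-step change
kappa_tilde = kappa[-1] + drift * np.arange(1, h + 1) # extrapolated values

# Extended design for a single-age toy model mu_j = kappa_j: extra rows for
# the forecast horizon and extra columns for the extra parameters.
X = np.eye(J)
X_tilde = np.eye(J + h)
xi_tilde = np.concatenate([kappa, kappa_tilde])
mu_tilde = X_tilde @ xi_tilde                         # forecast predictor

assert np.allclose(mu_tilde[:J], X @ kappa)           # in-sample fit unchanged
assert np.allclose(np.diff(mu_tilde[J - 1:]), drift)  # linear extrapolation
```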

2.4.5. Bayesian Analysis

The introduction of the canonical parameter shows that the likelihood, in Bayesian notation, is of the form p(y | θ) = p(y | ξ) where ξ is freely varying. A purist Bayesian analysis can simply introduce a prior on the canonical parameter, p(ξ). This is updated in a straightforward way, resulting in the posterior p(ξ | y) = p(y | ξ)p(ξ)/p(y). In contrast, introducing a prior on ad hoc identified parameters gives rise to various difficulties. Only parts of the prior are updated by the likelihood, so that it becomes unclear which information arises from the data and which information arises from the ad hoc identification. Moreover, avoidable arbitrariness is introduced in the forecast; see Section 4. Introduction of hyperparameters exacerbates the issue. Examples are given in Sections 5.4.4, 5.5.2, and 6.1.6.

3. Working with the Time Effects

In Section 2 we considered situations where estimation, hypothesis testing, or forecasting is carried out using the canonical parameter. However, there might be situations where the original time effect parametrisation is preferred, perhaps because it is felt that this parametrisation is particularly helpful in guiding the intuition. This requires ad hoc identification of the time effect. In this section we go through the considerations a statistician has to make when insisting on an analysis based on some nonunique parametrisation. As in Section 2 we focus on linearly parametrised models. Specific examples follow in Sections 5 and 6. In Section 3.1 ad hoc identification is defined. As an example we consider a least squares estimation problem with collinear regressors in Section 3.2. For the age-period-cohort model reviewed in Section 5 it is common to ad hoc identify in two steps: first identifying levels, then the linear trends. We consider such two-step ad hoc identification in Section 3.3. The consequences of ad hoc identification are considered in Section 3.4. Indeed, when forecasting the time effect, we do not want the forecast to depend on the identification scheme. The same applies to graphical visualisation of our data, where the eye may extract patterns that depend on the identification scheme. Likewise, confusion may arise when formulating a hypothesis directly on the time effect parameters.

3.1. Ad Hoc Identification

In this section the time effect parametrisation is considered. An identification scheme has to be introduced when working with the time effects. This may rest on mathematical convenience or it may be chosen for a particular purpose given the substantive context. We therefore call it ad hoc identification. Here we consider a simple identification scheme but turn to a more common two-step identification scheme in Section 3.3. Once the canonical parameter ξ has been estimated there is often a wish to return to the original time effect θ. The two are linked through the surjective mapping from Θ = R^q to Ξ = R^p. Indeed, since ξ is constructed as a function of θ, the notation for ξ is often chosen to reflect θ. The canonical parameter ξ does, however, only give partial information about θ. The remaining part, say λ, of θ will have to be chosen by the researcher and combined with ξ. A linear ad hoc identification of θ comes about by the researcher choosing a constraint L′θ = λ (14) for some known λ ∈ R^{q−p} and some matrix L ∈ R^{q×(q−p)} chosen so the square matrix (A, L) is invertible. The time effect space Θ is now reduced to an affine subspace Θ_λ = {θ ∈ Θ : L′θ = λ}. Given θ we can find ξ, λ through (13) and (14) as (ξ′, λ′)′ = (A, L)′θ. At the same time, given values of ξ, λ and the invertibility of (A, L), the ad hoc identified time effect is found through θ_λ = L⊥(A′L⊥)^{−1}ξ + A⊥(L′A⊥)^{−1}λ. (16) In this notation a subindex λ is introduced to avoid confusion with the time effect θ in the original mortality model. Indeed, there are now four different parameters in play, namely, the original time effect θ ∈ Θ, the predictor μ ∈ M, the maximal invariant parameter ξ ∈ Ξ, and the ad hoc identified time effect θ_λ ∈ Θ_λ, each of which has a different interpretation. The mapping from θ to each of μ, ξ, and θ_λ is surjective, while there are bijective mappings between the latter three. The interpretations of the time effect θ and the canonical parameter ξ will inevitably be different. For a start they have different dimensions.
Endowing the spaces with Euclidean norms shows that distances in the two spaces Θ and Ξ will be judged differently. The time effect θ and the ad hoc identified time effect θ_λ will similarly have different interpretations. Although they have the same dimensions, the Euclidean norms on Θ and Θ_λ will be rather different. Confusion may arise in the interpretation of a mortality analysis if there is no clear distinction between θ and θ_λ. In addition an unnecessary arbitrariness may arise when making inference on θ_λ or extrapolating θ_λ ∈ Θ_λ. We will return to these issues in Section 3.4. It is perhaps interesting to note that despite the linear parametrisation the ad hoc identification need not be done in a linear fashion as in (14). Indeed it is common for Poisson models with a log link to ad hoc identify θ through the original multiplicative scale, so that the ad hoc identification is done nonlinearly. The fit of the model is unaffected by the ad hoc identification. Indeed the fit is measured in terms of the estimate of the predictor μ = Dθ where D = XA′. Since the identification is made so ξ = A′θ_λ, the estimated predictor reduces to μ̂ = Dθ̂_λ = Xξ̂ (18) regardless of the choice of ad hoc identification.
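The point that different ad hoc identifications change the identified time effect but not the fit can be verified numerically. The matrices and the two identification choices below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two researchers using different ad hoc identifications (L, lambda) recover
# different identified time effects, yet the fitted predictor X A' theta is
# identical, since both reproduce the same maximal invariant xi.
q, p, n = 5, 3, 8
A = rng.standard_normal((q, p))
X = rng.standard_normal((n, p))
xi = rng.standard_normal(p)                    # the maximal invariant (fixed)

def ad_hoc_theta(L, lam):
    # Solve (A, L)' theta = (xi', lam')' for the identified time effect
    AL = np.hstack([A, L]).T
    return np.linalg.solve(AL, np.concatenate([xi, lam]))

L1 = rng.standard_normal((q, q - p)); lam1 = np.zeros(q - p)
L2 = rng.standard_normal((q, q - p)); lam2 = np.ones(q - p)

theta1, theta2 = ad_hoc_theta(L1, lam1), ad_hoc_theta(L2, lam2)
assert not np.allclose(theta1, theta2)               # different identifications
assert np.allclose(A.T @ theta1, xi)                 # both reproduce xi ...
assert np.allclose(A.T @ theta2, xi)
assert np.allclose(X @ A.T @ theta1, X @ A.T @ theta2)  # ... and the same fit
```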

3.2. A Least Squares Example

As an illustration of estimation in the presence of ad hoc identification consider a normal likelihood. Different, but equivalent, expressions can be found depending on the parametrisation. The likelihood of the predictor μ is L(μ; Y) ∝ exp{−‖Y − μ‖²/(2σ²)}. Rewriting it in terms of the canonical parameter gives L(ξ; Y) ∝ exp{−‖Y − Xξ‖²/(2σ²)}, (20) while introducing the time effect parameter gives L(θ; Y) ∝ exp{−‖Y − XA′θ‖²/(2σ²)}. (21) The likelihood (20) of the canonical parameter ξ can be analysed by the least squares method since the design X has full column rank. The maximum likelihood estimator for ξ and the predictor for the data are ξ̂ = (X′X)^{−1}X′Y and Ŷ = Xξ̂. Along with the residual variance this is all the information that is given by the likelihood. The likelihood (21) of the time effect θ only depends on θ through ξ = A′θ. The lack of identification means that the maximum likelihood estimator for θ has an arbitrary element, so that it is a set-valued estimator. Based on (16) this can be expressed by θ̂_λ = L⊥(A′L⊥)^{−1}ξ̂ + A⊥(L′A⊥)^{−1}λ for any L so (A, L) is invertible and for any λ ∈ R^{q−p}. The fit, however, remains the same and (18) becomes μ̂ = Dθ̂_λ = Xξ̂. In order to compute actual estimates L, λ have to be chosen, which amounts to ad hoc identification. For instance, with the ad hoc identifying restrictions L = A⊥ and λ = 0 the estimator θ̂ = A(A′A)^{−1}(X′X)^{−1}X′Y can be thought of as the least squares estimator of Y on D using the Moore-Penrose generalised inverse for the singular matrix D′D; see Searle [14, page 212]. See Section 5.4.1 for an example.
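The set-valued nature of the least squares estimator can be illustrated with a deliberately collinear design; the design below is an arbitrary illustration, not a mortality design:

```python
import numpy as np

rng = np.random.default_rng(3)

# Rank-deficient design: the third column equals the sum of the first two,
# so theta is estimated only up to shifts along the null space of D, while
# every minimiser gives the same fitted values. The Moore-Penrose solution
# is one particular choice.
n = 12
D = np.column_stack([np.ones(n), rng.standard_normal(n)])
D = np.column_stack([D, D.sum(axis=1)])        # col3 = col1 + col2
assert np.linalg.matrix_rank(D) == 2           # q = 3, p = 2, D'D singular

Y = rng.standard_normal(n)

theta_mp = np.linalg.pinv(D) @ Y               # Moore-Penrose solution
theta_ls = np.linalg.lstsq(D, Y, rcond=None)[0]  # another minimiser

# Other estimates are obtained by shifting along the null space of D:
null = np.array([1.0, 1.0, -1.0])
assert np.allclose(D @ null, 0)
theta_alt = theta_mp + 2.5 * null

assert np.allclose(D @ theta_mp, D @ theta_alt)   # identical fit
assert np.allclose(D @ theta_mp, D @ theta_ls)
```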

3.3. Step-Wise Identification

It is common to ad hoc identify the parameter in a step-wise fashion. In the first step the time effect parameter is only partially constrained. The full identification then follows in a second step. An example is given in Section 5.4.1 for an age-period-cohort model in which the levels of the time effects are constrained in the first step, leaving the ad hoc identification of the linear trends to the second step. The first step constraints are affine of the type C′θ = ψ (25) for known matrices C ∈ R^{q×(q−q_C)}, ψ ∈ R^{q−q_C}. The constrained time effect space is then Θ_ψ = {θ ∈ Θ : C′θ = ψ}. Thereby the q-dimensional time effect space Θ is reduced to a q_C-dimensional variation. The properties of this partially ad hoc identified parameter space depend on the rank of the matrix (A, C). If the number of constraints, q − q_C, is at most equal to the number of unidentified components q − p, it is possible that (A, C) has full column rank. In that case the constraint implies a partial ad hoc identification without constraining the parameter space M of the statistical model. This is shown in Theorem 1; see also Section 5.4.1 for an example, while the proof is given in the Appendix. When (A, C) has reduced rank the parameter space M is also constrained; see Section 3.4.2 for a discussion.

Theorem 1

Suppose (A, C) has full column rank. Then the matrix m = A⊥′C ∈ R^{(q−p)×(q−q_C)} has full column rank and the constraint (25) does not constrain the canonical parameter ξ and the predictor μ. Hence, the predictor space remains of the form (2). The equivalence classes in Θ_ψ under the mapping θ ↦ μ = XA′θ are given by the group g(θ) = θ + A⊥m⊥ζ for arbitrary ζ ∈ R^{q_C−p}, where m⊥ ∈ R^{(q−p)×(q_C−p)} is the orthogonal complement of m. The maximal invariant remains ξ = A′θ. The partial ad hoc identification by (25) implies that any time series analysis of the time effects has to happen relative to the constrained space Θ_ψ rather than the space Θ. This is awkward as discussed in Section 3.4 below. It is also considerably more complicated than working with the freely varying canonical parameter ξ; see Section 2.4.2.

3.4. Consequences of Ad Hoc Identification

In the following we will look closer at the consequences of working with the ad hoc identified time effect parameter θ_λ in the context of a linear mortality model of the form μ = Dθ. We consider the consequences for plotting, hypothesis testing, and forecasting.

3.4.1. Plots of Time Effects

In the mortality model (1) the time effect θ is the concatenation of age, period, and cohort effects. It seems natural to think of these individual time effects as time series and to plot them against time. As the time effect θ varies in the unrestricted space Θ = R^q this maps the q-vector into unrestricted time series. Estimates of the time effects are constructed by combining an estimate of ξ with an ad hoc chosen value for λ = L′θ; see (14). The resulting estimate θ̂_λ is therefore constrained to the space Θ_λ ⊂ Θ. The interpretation of the estimate is therefore different from the interpretation of the original time effect θ. Distances on the spaces Θ and Θ_λ are judged differently and the variability of θ̂_λ is deduced exclusively from ξ̂ through (16). The time series components of θ̂_λ are now restricted through L′θ̂_λ = λ. Plots of the θ̂_λ-time series are therefore interpreted differently from the imagined plots of the original θ-time series and from the plots of the maximal invariant parameter ξ discussed in Section 2.4.2. Indeed, if one were to analyse the estimated θ̂_λ-time series statistically the linear constraint should be taken into account. This is a bit complicated as illustrated below, but it is the consequence of working with the ad hoc identified parameter θ_λ rather than the canonical parameter ξ. Attempts to give intrinsic meaning to λ will be specific to the index set for the data set at hand. For instance, the requirement that the age effect should be zero on average does not carry over when looking at a subsample or when forecasting. It is not obvious that such an ad hoc identification is any more or less arbitrary than saying that, for instance, the first or the last age effect should have a particular value. Adding confidence bands to a plot of θ̂_λ is in itself not difficult. If ξ̂ is asymptotically normal with mean ξ and variance Σ, then θ̂_λ is asymptotically normal with mean θ_λ and variance L⊥(A′L⊥)^{−1}Σ(L⊥′A)^{−1}L⊥′. This is a normal distribution on the space Θ_λ.
The interpretation of these standard errors will therefore be similar to that of θ̂_L itself. Finally, it may be of interest to analyse the estimated θ̂_L-time series statistically. Denote this time series by x_L. Its sample space is now Θ_L. A statistical model on Θ_L can be built as follows. The starting point could be a time series model for unrestricted variables x on the sample space Θ. This gives a joint density for x ∈ Θ, which can be reduced by marginalisation to a density for x_L ∈ Θ_L. Whether one is working with the unrestricted model for x ∈ Θ or the restricted model for x_L ∈ Θ_L, inferences that are invariant to g must be based on those statistics of x or x_L that are invariant to g. Thus, inferences must be based on the maximal invariant under g. For a general overview of invariant reduction see Cox and Hinkley [12, page 175f], whereas Nielsen [15] gives the argument in some detail for an autoregression with a linear trend.
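To illustrate how the plotted time effects depend on the identification choice while invariant quantities do not, consider a minimal numerical sketch (ours, not from the paper) of the simple age-period model from the introduction; all numbers and names are illustrative:

```python
# Sketch (illustrative, not the paper's code): two ad hoc identifications
# of the age-period model mu_ij = alpha_i + beta_j.  The plotted "time
# effects" differ, but the predictor, which carries all invariant
# information, is identical.

def predictor(alpha, beta):
    """Stack mu_ij = alpha_i + beta_j over the age-period rectangle."""
    return [[a + b for b in beta] for a in alpha]

# One arbitrary representative of the time effect
alpha = [0.5, 0.9, 1.6]          # age effects,    i = 1..3
beta  = [2.0, 2.2, 2.1, 2.5]     # period effects, j = 1..4

# Identification 1: fix the level by alpha_1 = 0
a1 = [a - alpha[0] for a in alpha]
b1 = [b + alpha[0] for b in beta]

# Identification 2: fix the level by sum_i alpha_i = 0
abar = sum(alpha) / len(alpha)
a2 = [a - abar for a in alpha]
b2 = [b + abar for b in beta]

# The two identified age-effect "time series" differ ...
assert max(abs(x - y) for x, y in zip(a1, a2)) > 0.4
# ... yet the predictors agree exactly.
mu1, mu2 = predictor(a1, b1), predictor(a2, b2)
assert all(abs(x - y) < 1e-12
           for r1, r2 in zip(mu1, mu2) for x, y in zip(r1, r2))
```

Any statistic that is a function of the predictor alone agrees across the two identifications; the plotted effects do not.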

3.4.2. Hypothesis Testing

Having formulated the model in terms of time effects, it may be of interest to test the hypothesis that one of these time effects is absent. No identification issues arise when the hypothesis is formulated as a restriction on the canonical parameter ξ, as discussed in Section 2.4.3. But one has to be careful when formulating hypotheses in terms of the original time effect. See Sections 5.4.5, 5.5.3, and 5.5.4 for examples. Affine hypotheses on the time effect are of the form (28): R′θ = ρ for a known matrix R and a known vector ρ. The constrained time effect space is then given in (29). To see how the restriction (28) restricts the predictor space M ⊂ R^n, recall that the predictor μ only depends on θ through ξ = A′θ. Thus, the analysis of the restriction (28) depends on the interplay between the matrices A and R. Theorem A.3 in Appendix A.3 gives a general result to that effect. It shows that the hypothesis (28) restricts the predictor space M to a lower-dimensional affine subspace in so far as it restricts the canonical parameter ξ. In particular, the degrees of freedom of the hypothesis may in general be different from the dimension reduction of the time effect parameter. When this is the case the restriction (28) has an element of ad hoc identification of the time effect.

3.4.3. Forecasts

Forecasts can be made by extrapolating the ad hoc identified time effects θ̂_L. Two researchers choosing different ad hoc identification schemes, but otherwise making the same analysis, may make different forecasts. This can be avoided if the extrapolation method is chosen with some care. Following the linear approach outlined in Section 2.4.4, the predictor μ = Dθ = XA′θ is forecast by extending the design D with rows for the forecast period. Extrapolating the ad hoc identified θ̂_L into a vector of future time effects then gives the forecast. Often both the component D₁θ̂_L from the extended design and the extrapolated component depend on the ad hoc identification. Nonetheless, these dependencies on the ad hoc identification may cancel each other so that the overall forecast is invariant to the ad hoc identification. Such invariance would seem desirable in most applications unless there is a strong substantive reason for the ad hoc identification scheme. Necessary and sufficient conditions for invariance are presented for the age-period-cohort model in Section 5.4.3 and for a nonlinear model in Section 6.1.5. In contrast, these considerations are redundant when working with the canonical parameter ξ; see Section 2.4.4.

4. Bayesian Models and Random Effects Models

Mortality analysis is often carried out using either Bayesian methods or random effects methods. The mortality model is then altered through the introduction of a prior distribution on the parameters. One might think that the identification problems become less of an issue or even disappear. This is not the case, since the Bayesian method and the random effects method are based on the mortality likelihood, which only depends on the time effect θ through the maximal invariant parameter ξ. Thus, the identification challenges remain. The issue is that a prior on the unidentified part, say λ, of the time effect amounts to an ad hoc identification. Indeed, the conditional prior of λ given ξ is not updated by the mortality likelihood. A main difference is that a maximum likelihood analysis of the original mortality likelihood usually prompts the researcher when there is an identification issue, whereas both Bayesian methods and random effects methods allow computations to go through despite an identification issue. In Section 4.1 it is seen that the introduction of a conditional prior on λ given ξ is the Bayesian analogue of ad hoc identification. This leads to the same type of forecasting challenges as in the frequentist setting, as is seen in Section 4.2. In Section 4.3 we show how the Bayesian identification issues transfer to random effects models.

4.1. Bayesian Estimation

For Bayesian and random effects models we formulate a likelihood and a prior. Thus, consider a likelihood p(y | θ) = L(θ; y). Replacing θ by (ξ, λ), the identification problem implies that the likelihood satisfies p(y | ξ, λ) = p(y | ξ); see (32). The prior on θ is factorised as p(θ) = p(ξ, λ) = p(ξ)p(λ | ξ). In the case of Bayesian estimation the following result emerges.

Theorem 2

Suppose the likelihood satisfies (32). Then (i) the predictive distribution p(y) = ∫p(y | ξ)p(ξ)dξ does not depend on the conditional prior for λ; (ii) the posterior satisfies p(ξ, λ | y) = p(ξ | y)p(λ | ξ); (iii) the posterior mean of θ combines the posterior mean of ξ with the conditional prior mean of λ given ξ. Theorem 2 shows that it suffices to give a prior to ξ and ignore λ, as advocated in Section 2.4.5. Indeed, the conditional prior for λ given ξ is not updated. Theorem 2 appears to be well-known; see Poirier [16, Proposition 2] or Smith [17, Section B]. Due to Theorem 2 the Bayesian analyst faces the complications outlined in Section 3.4. Indeed, suppose that two Bayesian researchers choose the same likelihood p(y | ξ, λ) = p(y | ξ) and the same prior p(ξ) for ξ, but different conditional priors for λ given ξ. Their marginal distributions for the data are identical, but any inferences regarding interpretation or forecasting will differ in so far as they involve the unidentified parameter λ. A Bayesian researcher should therefore be cautious with inference related to λ. There will of course be situations where the prior knowledge of λ given ξ is found to be of substantive relevance. In such situations it seems more fruitful to change the likelihood to include that information.
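Theorem 2 can be checked on a toy discrete model (our sketch, not from the paper), where ξ and λ each take two values: the marginal data density and the posterior of ξ are the same for two different conditional priors for λ given ξ.

```python
# Toy discrete check of Theorem 2 (illustrative): when the likelihood
# depends on theta = (xi, lam) only through xi, the marginal data density
# and the posterior of xi are unaffected by the conditional prior of lam.

def posterior_xi(y, prior_xi, prior_lam_given_xi, lik):
    """Return (p(y), p(xi | y)) on a two-point grid for xi and lam."""
    # joint p(xi, lam, y) = p(xi) p(lam | xi) p(y | xi)
    joint = {(xi, lam): prior_xi[xi] * prior_lam_given_xi[xi][lam] * lik[xi][y]
             for xi in (0, 1) for lam in (0, 1)}
    p_y = sum(joint.values())
    post = [sum(joint[(xi, lam)] for lam in (0, 1)) / p_y for xi in (0, 1)]
    return p_y, post

lik = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}   # p(y | xi), free of lam
prior_xi = {0: 0.6, 1: 0.4}
cp_A = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}  # researcher A's p(lam | xi)
cp_B = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # researcher B's p(lam | xi)

pA, postA = posterior_xi(1, prior_xi, cp_A, lik)
pB, postB = posterior_xi(1, prior_xi, cp_B, lik)
assert abs(pA - pB) < 1e-12
assert all(abs(a - b) < 1e-12 for a, b in zip(postA, postB))
```

Inferences about λ itself, by contrast, would differ between the two researchers, since p(λ | ξ) enters the posterior unrevised.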

4.2. Forecasting

Bayesian forecasts involve integrating an extrapolative distribution. This can be done in two ways, either working exclusively with the identified, maximal invariant parameter ξ as in Section 2.4.4, or working with the time effect θ = (ξ, λ) as in Section 3.4.3.

4.2.1. Forecasting Using the Maximal Invariant Parameter

Consider first the case where only the maximal invariant parameter ξ is used. In that case the forecast is computed by sampling from the posterior p(ξ | y) and then extrapolating using the sampled value of ξ by some extrapolative method, say p(y_h | ξ) for the h-step-ahead data y_h. In combination this gives the forecast (36): p(y_h | y) = ∫p(y_h | ξ)p(ξ | y)dξ.

4.2.2. Forecasting Using the Ad Hoc Identified Time Effect

Consider now forecasts involving the full time effect θ = (ξ, λ). Theorem 2(ii) shows that the posterior satisfies p(θ | y) = p(ξ | y)p(λ | ξ). The distribution forecast with extrapolation is then (37): p(y_h | y) = ∬p(y_h | ξ, λ)p(ξ | y)p(λ | ξ)dλ dξ. The concern is now as follows. Suppose a second researcher chooses the same extrapolative method, likelihood, and prior for ξ, but a different conditional prior p†(λ | ξ). In general, this will result in a different distribution forecast p†(y_h | y); see (38). The question is then under which conditions p(y_h | y) = p†(y_h | y), so that the distribution forecasts are invariant to the choice of conditional prior for λ given ξ. A sufficient condition is that the extrapolation method does not depend on λ, so that (39): p(y_h | ξ, λ) = p(y_h | ξ). Condition (39) could alternatively be expressed as requiring that the forecast is invariant to the group g acting on the time effect space Θ.

Theorem 3

Suppose that the likelihood satisfies (32) and the priors are probabilities. If the extrapolative distribution does not depend on λ, so that (39) holds, then the forecast distribution computed in (37) is invariant to the choice of conditional prior for λ given ξ. The forecast then reduces to (36). To summarise, the identification issues surrounding Bayesian analysis are similar to those outlined in the previous sections. Examples of the problems that can arise are discussed in Sections 5.4.4, 5.5.2, and 6.1.6. There are two solutions to the identification problem. The first is only to formulate a prior on ξ; see Section 2.4.5. Incidentally, this is what Bernardo and Smith [18, page 218] do in their discussion of the two-way analysis of variance, albeit without linking it to the considerations of Smith [17]. The prior p(ξ) can of course be constructed by formulating a prior on θ and then reducing it to a prior on ξ by marginalisation, so that p(ξ) = ∫p(ξ, λ)dλ. The other solution is to work with a prior on θ but avoid those parts of the posterior that depend on λ.

4.3. Random Effects Models

It is common to combine mortality models with a random effects approach, which effectively forms a new model. An example is given in Section 5.4.6. We consider the consequences of the lack of identification. Random effects models are typically constructed as follows. Suppose the density of the data y given the time effects θ = (ξ, λ) is of the form p(y | ξ, λ) = p(y | ξ) as before; see (32). A prior p(θ | ψ) is chosen that now depends on a parameter ψ. The prior can be decomposed as p(θ | ψ) = p(ξ | ψ)p(λ | ξ, ψ). Theorem 2 implies that the density of the data y given ψ is (41): p(y | ψ) = ∫p(y | ξ)p(ξ | ψ)dξ. This in turn is used to form the random effects likelihood of ψ as L(ψ; y) = p(y | ψ). This, effectively, defines a new model. The random effects likelihood only depends on the prior p(θ | ψ) through p(ξ | ψ). Two researchers choosing the same prior p(ξ | ψ) but different conditional priors p(λ | ξ, ψ) will then get the same random effects likelihood and the same maximum likelihood estimator ψ̂. In mortality modelling it is common to go one step further and estimate the time effects θ through the mean of the posterior p(θ | y, ψ) evaluated at ψ̂. Then the identification problem may show up. Theorem 2 shows that p(θ | y, ψ) = p(ξ | y, ψ)p(λ | ξ, ψ), so that the prior for ξ is updated, while the conditional posterior for λ given ξ is not updated by the data. Thus, in general the estimate for θ is based, in part, on a prior which is not updated by the data.

5. Age-Period-Cohort Models

We will now apply the theoretical considerations to analyse the age-period-cohort model. The methodological literature on this model is large and the consequences of the above theory are wide ranging. In Section 5.1 we present the age-period-cohort model along with the maximal invariant parameter. This maximal invariant parameter is also called the canonical parameter because the age-period-cohort model is usually implemented as an exponential family; see Section 2.4.1. When formulating the model we choose a notation matching the age-period-cohort literature rather than the reserving literature. At the same time the exposition takes its starting point in Kuang et al. [19], but the notation deviates. The implementation of the canonical parameter depends on the type of data array. In Section 5.2 design matrices are given for age-cohort, age-period, and period-cohort data arrays. While they illustrate interesting differences in the structure of these data arrays, they also provide the basis for an immediate implementation via any generalised linear model software. The age-cohort model is expressed as a hypothesis of the age-period-cohort model in Section 5.3. Time effects and forecasting are considered in Section 5.4, while the two-sample age-period-cohort model is discussed in Section 5.5.

5.1. The Model and the Canonical Parameter

Here the age-period-cohort model is set up and a quite general identification result is reported. Consider data Y_ij indexed by (i, j) ∈ I where i is the age and j is the period. The index set may be a rectangle given by i = 1,…,I and j = 1,…,J so that the cohort k = I − i + j runs from 1 to K = I + J − 1. More generally, the index set could be a generalized trapezoid where two corners are cut off the rectangle so that the cohort k runs from 1 + h₁ to I + J − 1 − h₂ for some h₁, h₂ ≥ 0. The class of generalized trapezoids includes the three types of Lexis diagrams discussed by Keiding [20]. We will return to those special cases below. The statistical model is defined by the assumption that the variables Y_ij are independent with an exponential family distribution with predictor μ_ij given by (43): μ_ij = α_i + β_j + γ_k + δ, with k = I − i + j. The time effect θ = (α_1,…,α_I, β_1,…,β_J, γ_{1+h₁},…,γ_{K−h₂}, δ)′ now varies in a time effect space Θ = R^q where q = I + J + K + 1 − h₁ − h₂. The model (43) is of the form (1) discussed in Section 2. Specifically, the predictors μ_ij can be stacked in a vector μ of dimension n = dim I and written as μ = Dθ. Thus, the parameter space for the model is of the form M = {μ ∈ R^n : μ = Dθ for θ ∈ Θ} as outlined in (2). The mapping θ ↦ μ from Θ to M is surjective, and the equivalence classes in the time effect space can be described by a group of transformations as discussed in (8). This group can be represented as (44): α_i → α_i + a + d·i, β_j → β_j + b − d·j, γ_k → γ_k + c + d·k, δ → δ − a − b − c − d·I, for any a, b, c, and d. This is of the form (8) with ζ = (a, b, c, d)′, although the definition of the matrix A depends on the structure of the index set I. A first clue for the canonical parametrisation is given by Fienberg and Mason [21] and Clayton and Schifflers [22], who pointed out that, on the multiplicative scale, ratios of relative risks are invariant. On the additive scale this amounts to looking at second differences, such as Δ²α_i = α_i − 2α_{i−1} + α_{i−2}. 
A graphical illustration of the double differences is given in Figure 1 (graphics were done using R 3.0.2; see R Development Core Team [23]), which is taken from Miranda et al. [24]. Panel (a) illustrates the interpretation of the formula for Δ²α_41 as follows. Consider the 1970 and 1971 cohorts. In 2010 these have ages 40 and 39, while in 2011 they have ages 41 and 40. Thus, Δ²α_41 represents the increase in mortality from age 40 to 41 in 2011 relative to the increase from age 39 to 40 in 2010. An equivalent interpretation is that Δ²α_41 represents the increase in mortality from age 40 to 41 for the 1970 cohort relative to the increase from age 39 to 40 for the 1971 cohort. In a similar way panels (b) and (c) illustrate the formulas for Δ²β_2012 and Δ²γ_1972.
Figure 1: Illustration of the interpretation of Δ²α_41, Δ²β_2012, and Δ²γ_1972.
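The invariance of the double differences to the transformations in (44) is easy to verify numerically; the following sketch (ours, with illustrative numbers) adds an arbitrary level and linear trend to an age effect and recovers the same Δ²α:

```python
# Numerical check (illustrative): double differences of a time effect are
# invariant to adding a level and a linear trend, which is exactly the
# arbitrariness described by the group (44).

def double_diff(x):
    """Delta^2 x_i = x_i - 2*x_{i-1} + x_{i-2} for i = 3, ..., len(x)."""
    return [x[i] - 2 * x[i - 1] + x[i - 2] for i in range(2, len(x))]

alpha = [0.10, 0.25, 0.55, 1.00, 1.20]          # some age effects
a, d = 3.7, -0.9                                # arbitrary level and slope
alpha_tilde = [al + a + d * (i + 1) for i, al in enumerate(alpha)]

dd = double_diff(alpha)
dd_tilde = double_diff(alpha_tilde)
assert all(abs(x - y) < 1e-9 for x, y in zip(dd, dd_tilde))
```

The same computation applies to the period and cohort effects, which is why ξ built from double differences is invariant to g.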

Kuang et al. [19] introduce a parameter formed by these second differences as well as three entries of the predictor; see (45). The parameter ξ varies in the space Ξ = R^p where p = q − 4. If the three predictor points are chosen not to be linearly related, then they define the level and the linear trends in the predictor. The formal condition is that a certain determinant defined from the indices is nonzero; see (46).

Theorem 4 (see [19], [25, Corollary 2])

Let μ satisfy (43). If the condition (46) is satisfied, then the parameter ξ of (45) satisfies the following: ξ is a function of θ which is invariant to the group g in (44); μ is a function of ξ; the parametrisation of μ by ξ is exactly identified in that ξ† ≠ ξ‡ ⇒ μ(ξ†) ≠ μ(ξ‡). Theorem 4 therefore shows that ξ varies freely in Ξ = R^p. Moreover, ξ is a maximal invariant of the mapping m from θ to μ under the transformations g. It should be noted that the choice of maximal invariant is not unique. Indeed, any bijective mapping of ξ can serve as a maximal invariant. The choice of ξ is convenient since it becomes the canonical parameter in generalized linear models of the exponential family type. In itself this theorem does not tell how to express the predictor μ in terms of the canonical parameter ξ. The link depends on the structure of the index set I. The above-mentioned paper gives implicit expressions for generalized trapezoid index sets. In the following we report explicit expressions for the 3 principal Lexis diagrams.

5.2. Design Matrices for Lexis Diagrams

The link between the canonical parameter ξ and the predictor μ is analysed for the 3 principal Lexis diagrams. We start with age-cohort data arrays, which were the focus of attention in Kuang et al. [19]. Such arrays are easiest to analyse because all three time scales increase from the point where i = j = k = 1. As a consequence the results are simplest for these arrays.

5.2.1. Age-Cohort Data Arrays

Age-cohort data arrays are rectangular in the age and cohort indices and given by (47): I_ac = {(i, k): 1 ≤ i ≤ I, 1 ≤ k ≤ K}. Consequently, the period index j = i + k − 1 varies over j = 1,…,J = I + K − 1. Keiding [20] refers to this Lexis diagram as the first principal set of death. Age-cohort arrays are in particular used for reserving in general insurance. In that situation, only the triangle 1 ≤ i, j, k ≤ I is observed. The issue is to forecast the other triangle in the square 1 ≤ i, k ≤ I. In the reserving literature these triangles are referred to as the upper and lower triangles, since the cohort axis has reverse order. The two-factor age-cohort model for triangular age-cohort arrays is known as the chain-ladder model; see England and Verrall [26] for an overview. Zehnwirth [27] introduced an age-period-cohort model for such triangular arrays. The identification issue is analysed in detail in Kuang et al. [19, 25]. Subsequently, Kuang et al. [28] analysed the Poisson likelihood, while Kuang et al. [10] give an empirical analysis focusing on forecasting. The age-period-cohort model for age-cohort arrays is parametrised by (48): μ_ik = α_i + β_{i+k−1} + γ_k + δ. The time effect θ = (α_1,…,α_I, β_1,…,β_J, γ_1,…,γ_K, δ)′ now varies in Θ = R^{2(I+K)}. The design matrix linking the canonical parameter ξ in (45) and the predictor μ essentially provides an identity linking the two parameters. A natural choice of the three level points is the predictors μ_11, μ_21, and μ_12. We then get the representation (49): μ_ik = μ_11 + (i − 1)(μ_21 − μ_11) + (k − 1)(μ_12 − μ_11) + Σ_{t=3}^{i} Σ_{s=3}^{t} Δ²α_s + Σ_{t=3}^{j} Σ_{s=3}^{t} Δ²β_s + Σ_{t=3}^{k} Σ_{s=3}^{t} Δ²γ_s, with j = i + k − 1, with the convention that empty sums are zero, and recalling that second differences are defined as Δ²α_i = α_i − 2α_{i−1} + α_{i−2} so that Σ_{t=3}^{i} Δ²α_t = Δα_i − Δα_2 and Σ_{t=3}^{i} Σ_{s=3}^{t} Δ²α_s = α_i − α_1 − (i − 1)Δα_2. The identity (49) is crucial to the understanding of the age-period-cohort model. It shows that the predictor has a single level expressed as μ_11, which in turn satisfies μ_11 = α_1 + β_1 + γ_1 + δ. The level μ_11 is therefore estimable, but the individual levels α_1, β_1, γ_1, and δ are not identifiable from the model. 
Further, the model has two linear trends, here expressed with slopes μ_21 − μ_11 and μ_12 − μ_11 in terms of the age and cohort indices. These slopes can be expressed as μ_21 − μ_11 = Δα_2 + Δβ_2 and μ_12 − μ_11 = Δβ_2 + Δγ_2. They are estimable, but the individual slopes Δα_2, Δβ_2, and Δγ_2 are not identifiable. The design matrix now follows from the identity (49), so that the predictor satisfies μ = Xξ; see (50) and (51). With h(t, s) = max(t − s + 1, 0), the row of X corresponding to (i, k) has entries 1, i − 1, and k − 1 multiplying the level and slope elements of ξ, and entries h(i, s), h(j, s), and h(k, s) multiplying the double differences Δ²α_s, Δ²β_s, and Δ²γ_s, where ξ ∈ R^p with p = 2(I + K − 2). The identification relies on Theorem 4, which can be specialised to age-cohort arrays as follows.
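The identity (49) can be verified numerically. The following sketch (ours, with illustrative numbers) computes the predictor directly from arbitrary time effects and rebuilds it from the three levels and the double differences using the weights h(t, s):

```python
# Sketch verifying the representation (49) on a small age-cohort array.
# Starting from arbitrary time effects, the predictor rebuilt from the
# canonical parameter (three levels plus double differences) matches the
# directly computed predictor.  h(t, s) = max(t - s + 1, 0) gives the
# double-cumulation weights.  All numbers are illustrative.

I, K = 4, 5
alpha = [0.2, 0.5, 1.1, 1.4]
beta  = [0.0, 0.3, 0.4, 0.8, 0.9, 1.5, 1.6, 1.9]   # periods j = 1..I+K-1
gamma = [2.0, 1.7, 1.3, 1.2, 0.6]
delta = 0.25

def mu_direct(i, k):                     # model (48), 1-based indices
    j = i + k - 1
    return alpha[i - 1] + beta[j - 1] + gamma[k - 1] + delta

def dd(x, s):                            # Delta^2 x_s for 1-based s >= 3
    return x[s - 1] - 2 * x[s - 2] + x[s - 3]

def h(t, s):
    return max(t - s + 1, 0)

m11, m21, m12 = mu_direct(1, 1), mu_direct(2, 1), mu_direct(1, 2)

def mu_repr(i, k):                       # representation (49)
    j = i + k - 1
    out = m11 + (i - 1) * (m21 - m11) + (k - 1) * (m12 - m11)
    out += sum(h(i, s) * dd(alpha, s) for s in range(3, i + 1))
    out += sum(h(j, s) * dd(beta,  s) for s in range(3, j + 1))
    out += sum(h(k, s) * dd(gamma, s) for s in range(3, k + 1))
    return out

assert all(abs(mu_direct(i, k) - mu_repr(i, k)) < 1e-9
           for i in range(1, I + 1) for k in range(1, K + 1))
```

Since the identity is exact, the check holds for any choice of time effects, not just the ones above.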

Theorem 5 (see [19, Theorem 1])

Let μ satisfy (48). The parameter ξ of (50) satisfies the following: ξ is a function of θ which is invariant to the group g in (44); μ is a function of ξ, because of (49); the parametrisation of μ by ξ is exactly identified in that ξ† ≠ ξ‡ ⇒ μ(ξ†) ≠ μ(ξ‡). Theorem 5 in turn implies that the parameter ξ varies freely in Ξ = R^p, while the design matrix X given by (51) has full column rank. Originally, the more general Theorem 4 was proved as a corollary to Theorem 5.

5.2.2. Age-Period Arrays

An age-period data array is rectangular in the age and period indices and given by (52): I_ap = {(i, j): 1 ≤ i ≤ I, 1 ≤ j ≤ J}. Consequently, the cohort index k = j − i + I varies over k = 1,…,K = I + J − 1. Keiding [20] refers to this Lexis diagram as the third principal set of death. Age-period arrays are commonly used in epidemiology, in mortality analysis, and in sociology. The analysis of the identification issue is largely similar to that of age-cohort arrays. However, the representation of the predictor μ in terms of ξ differs in an intriguing way, because the third time index, the cohort k, is the difference of the other two indices. The age-period-cohort model for age-period arrays is parametrised by (53): μ_ij = α_i + β_j + γ_{j−i+I} + δ. The time effect θ = (α_1,…,α_I, β_1,…,β_J, γ_1,…,γ_K, δ)′ now varies in Θ = R^{2(I+J)}. A representation of the predictor μ in terms of the canonical parameter ξ is given in (54). The representation (54) differs from that of (49) in a subtle way. The three reference points for the levels of the predictor are chosen in the corner i = I, j = 1. From this corner the period and cohort indices increase, while age decreases. Hence, the age double differences Δ²α_i are now cumulated backwards. This phenomenon arises because the cohort index is the difference of the principal indices of age and period, whereas for the age-cohort array the period index is the sum of the principal indices of age and cohort. The predictor is now μ = Xξ, where the canonical parameter ξ and the design matrix X, built from h(t, s) = max(t − s + 1, 0), are given in (55). The identification relies on Theorem 4. It is specialised to age-period arrays as follows.

Theorem 6 (see [24, Theorem 4.1])

Let μ satisfy (53). The parameter ξ of (55) satisfies the following: ξ is a function of θ which is invariant to the group g in (44); μ is a function of ξ, because of (54); the parametrisation of μ by ξ is exactly identified in that ξ† ≠ ξ‡ ⇒ μ(ξ†) ≠ μ(ξ‡). The group of transformations in (44) can be specialised to age-period arrays as (57); see, for instance, Carstensen [29]. This is of the form (8) with ζ = (a, b, c, d)′ and a matrix A given in (58).

5.2.3. Period-Cohort Arrays

A period-cohort data array is rectangular in the period and cohort indices and given by (59): I_pc = {(j, k): 1 ≤ j ≤ J, 1 ≤ k ≤ K}. Consequently, the age index i = j − k + K varies over i = 1,…,I = J + K − 1. Keiding [20] refers to this Lexis diagram as the second principal set of death. Period-cohort arrays are commonly used in prospective cohort studies in epidemiology and in sociology. The analysis is similar to that of age-period arrays when swapping the roles of age and cohort. The age-period-cohort model for period-cohort arrays is parametrised by (60): μ_jk = α_{j−k+K} + β_j + γ_k + δ. The time effect θ = (α_1,…,α_I, β_1,…,β_J, γ_1,…,γ_K, δ)′ now varies in Θ = R^{2(J+K)}. A representation of the predictor μ in terms of the canonical parameter ξ is given in (61), and the canonical parameter and the design matrix are given in (62). In parallel with Theorem 6 we then have the following identification result.

Theorem 7

Let μ satisfy (60). The parameter ξ of (62) satisfies the following: ξ is a function of θ which is invariant to the group g in (44); μ is a function of ξ, because of (61); the parametrisation of μ by ξ is exactly identified in that ξ† ≠ ξ‡ ⇒ μ(ξ†) ≠ μ(ξ‡).

5.3. Expressing the Age-Cohort Model as a Hypothesis

It is often of interest to test for the absence of the period effect. An application to analysing asbestos-related mortality can be found in Miranda et al. [24]. The hypothesis is that β_1 = ⋯ = β_J when expressed in terms of the time effect parameters. The restricted model is given by, with k = j − i + I, (63): μ_ij = α_i + γ_k + δ. The identification problem simplifies to a question of determining the levels of α_i and γ_k. Therefore the (log) relative risk parameters Δα_i are identified, as pointed out by Clayton and Schifflers [30]. In this model the cohort index is present and remains the difference of the principal age and period indices. Therefore the representation of the predictor involves backward cumulated age differences as before, but with a subtle change of sign, so that (54) reduces to (64). As a consequence the canonical parameter and the design reduce to μ_ac = X_ac ξ_ac. Miranda et al. [24, Theorem 4.2] establish an identification result similar to Theorem 6. The age-cohort model can also be formulated as a hypothesis on the maximal invariant ξ in the age-period-cohort model following Section 2.4.3. The period effects Δ²β_j are set to zero through H′ξ = 0, where H′ = (0, I_{J−2}, 0). Applying this to the expression for ξ in (55) gives a reduced parameter, since in the absence of period effects μ_{I1} − μ_{I−1,1} = Δα_I − Δγ_2 and μ_{I2} − μ_{I1} = Δγ_2. The double differences cumulate to first differences through Σ_{s=3}^{i} Δ²α_s = Δα_i − Δα_2, so the above expression for ξ is seen to be a linear transformation of ξ_ac in (67). In other words, the age-cohort model arises from the age-period-cohort model by restricting the maximal invariant parameter.

5.4. Working with the Time Effect

There is a large literature seeking to identify the original time effects α_i, β_j, and γ_k of the age-period-cohort model from the predictor. Here we look closer at some of those ad hoc identification proposals.

5.4.1. Ad Hoc Identification of Levels

For the age-period-cohort model it is popular to impose ad hoc identifications in two steps of the type discussed in Section 3.3. Here the first step is concerned with the levels of the time effects and the second step is concerned with the linear trend. Examples are given in Sections 5.4.2 and 5.5.4. A common first-step ad hoc identification is to require that (69): Σ_i α_i = Σ_j β_j = Σ_k γ_k = 0. This ad hoc identification is specific to the chosen data range. For instance, the constraint Σ_i α_i = 0 is not easily transferable to a different data set drawn from the same population but with a different set of age groups. This aspect would have to be kept in mind if a substantive motivation were to be found for this constraint. Other ad hoc identification schemes, such as setting the first (or last) age, period, and cohort effects to zero, have similar problems. The constraint (69) is a special case of affine constraints of the form C′θ = ψ discussed in Section 3.3. The involved dimensions are q = 2(I + J) and p = q − 4, while the number of constraints is three. The matrix C′ is given by the top left {3 × (q − 1)}-block of A_⊥′ in (58) padded with a column of zeros, while ψ = 0 ∈ R³. Theorem 1 shows that m = A_⊥′C has full rank. Indeed, m and its orthogonal complement are given in (70). Thus, the constrained group of equivalence classes (27) is given in (71).
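That the level constraints (69) pin down the levels but not the linear trend can be seen numerically. In the sketch below (ours, with illustrative numbers, for an age-period array with k = j − i + I) two different values of the trend parameter d both yield time effects satisfying the sum-to-zero constraints and the same predictor, yet different age-effect plots:

```python
# Illustration (not the paper's code): the level constraints fix the level
# parameters but leave the trend parameter d of the group free.  Both
# identified time effects below satisfy the sum-to-zero constraints, yet
# they differ by a linear trend; the predictor is unchanged.

I, J = 3, 4

def center(x):
    """Demean a time effect; return (centered effect, removed level)."""
    m = sum(x) / len(x)
    return [v - m for v in x], m

def identify(alpha, beta, gamma, delta, d):
    """Apply the trend transformation, then impose sum-zero levels."""
    a = [al + d * (i + 1) for i, al in enumerate(alpha)]
    b = [be - d * (j + 1) for j, be in enumerate(beta)]
    g = [ga + d * (k + 1) for k, ga in enumerate(gamma)]
    dl = delta - d * I
    a, ma = center(a)
    b, mb = center(b)
    g, mg = center(g)
    return a, b, g, dl + ma + mb + mg

def mu(a, b, g, dl):
    # k - 1 = j - i + I - 1 in 0-based indices
    return [[a[i] + b[j] + g[j - i + I - 1] + dl for j in range(J)]
            for i in range(I)]

theta = ([0.3, 0.7, 1.0], [0.1, 0.4, 0.5, 0.9], [0.2] * (I + J - 1), 1.5)

id0 = identify(*theta, d=0.0)
id1 = identify(*theta, d=0.8)
# Both satisfy the sum-to-zero level constraints ...
assert abs(sum(id0[0])) < 1e-9 and abs(sum(id1[0])) < 1e-9
# ... and the predictors agree ...
assert all(abs(x - y) < 1e-9 for r0, r1 in zip(mu(*id0), mu(*id1))
           for x, y in zip(r0, r1))
# ... but the identified age effects differ by a linear trend in d.
assert abs((id1[0][0] - id0[0][0]) + 0.8) < 1e-9
```

This is why a second, trend-related identification step is needed on top of (69), as in the "intrinsic" estimator discussed next.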

5.4.2. Ad Hoc Identification of Slopes: The “Intrinsic” Estimator

The “intrinsic” estimator is a popular estimator in the sociology literature; see Yang et al. [4] and see also O'Brien [31, 32] and Fu et al. [33] for a recent discussion of its merits. It has its roots in a suggestion by Kupper et al. [34], with an early critique given by Holford [35]. The “intrinsic” estimator is defined in two steps. In the first step, the levels are identified by the ad hoc constraint (69), and three of the θ-coordinates, one age, one period, and one cohort effect, are then dropped. In the second step the linear trend is ad hoc identified using a Moore-Penrose inverse as in (23). We can analyse these steps using the developed framework. The first step identifies the levels by the ad hoc constraint (69), which is a constraint of the form C′θ = 0 for the C discussed in Section 5.4.1. This θ is defined on Θ_C, a linear subspace of Θ with a dimension deficiency of 3. Introduce a selection matrix S_⊥ ∈ R^{q×(q−3)} that selects all coordinates of θ except the three dropped ones. Thus S_⊥ arises as a q-dimensional identity matrix with the 3 columns corresponding to the dropped coordinates deleted. This is chosen so that (C, S_⊥) is invertible. Then S_⊥′θ is freely varying in that S_⊥′Θ_C = R^{q−3}. The skew projection identity I = S(C′S)⁻¹C′ + C_⊥(S_⊥′C_⊥)⁻¹S_⊥′ and the constraint C′θ = 0 then imply that θ = C̄ϑ, where C̄ = C_⊥(S_⊥′C_⊥)⁻¹ and ϑ = S_⊥′θ ∈ R^{q−3}. Note that while C̄ depends on S_⊥ and C_⊥, it does not depend on the normalisation of C_⊥, since we can replace C_⊥ by C_⊥m for an arbitrary invertible matrix m without changing C̄. This implies that C̄ is a function of S_⊥ and C. The predictor μ is now parametrised by μ = XA′θ = XÃ′ϑ with Ã′ = A′C̄. This corresponds to equation 5 of Yang et al. [4], who use the notation X and b for XÃ′ and ϑ, respectively. In the second step the linear trend is ad hoc identified through a time effect parameter of the form (23) with A, θ replaced by Ã, ϑ so that θ_ad.hoc = C̄ϑ_ad.hoc, where ϑ_ad.hoc = L_⊥(Ã′L_⊥)⁻¹ξ + Ã_⊥{L′Ã_⊥}⁻¹λ for some scalar λ and some matrix L_⊥. 
The “intrinsic” estimator is ad hoc identified through the choices λ = 0 and L_⊥ = Ã, while C is chosen by (69). It therefore estimates an “intrinsic” parameter θ_intrinsic = C̄Ã(Ã′Ã)⁻¹ξ, which depends on the choices of S_⊥, C, and A_⊥. However, since we can replace C_⊥ by C_⊥m for an arbitrary invertible matrix m without changing θ_intrinsic, the expression for θ_intrinsic does not depend on the normalisation of C_⊥. The “intrinsic” parameter satisfies the following result.

Theorem 8

The “intrinsic” parameter is an injective mapping of the canonical parameter ξ ∈ R^p into a p = q − 4 dimensional linear subspace Θ_intrinsic of Θ = R^q. The “intrinsic” time effect space Θ_intrinsic is a p-dimensional linear subspace of R^q characterised in terms of a vector w ∈ R^{q−3}, which is uniquely defined up to scale by w′C_⊥′A = 0. Theorem 8 implies that the “intrinsic” parameter should be interpreted as an object varying in the linear subspace Θ_intrinsic rather than in the unrestricted time effect space Θ = R^q. As outlined in Section 3.4 this has consequences for the interpretation of plots of the time effects, for hypothesis testing, and for forecasts. A consequence of this argument is that different choices of C, S_⊥, L, and λ would lead to other ad hoc identified parameters varying in other affine subspaces of Θ. In other words, the “intrinsic” estimator carries the cost of working with the somewhat complicated linear subspace Θ_intrinsic. This effort may be worthwhile if the particular choice of C, S_⊥, L, and λ can be made on substantive grounds.

5.4.3. Forecasting

Forecasting of future mortality rates involves an extrapolation of the time parameters. In Section 2.4.4 it was argued that ad hoc identification may introduce an undesirable arbitrariness in the forecast. When working exclusively with the canonical parameter ξ this arbitrariness is avoided. It is, however, also possible to work with ad hoc identified time effects under specific circumstances, which we characterise here for age-period arrays. This builds on the theory developed in Kuang et al. [25] for age-cohort data arrays. In the context of an age-period data array I_ap it is often of interest to forecast h periods ahead. Suppose it is of interest to forecast the mortality at age i in period J + h, so that the cohort is k = I + J + h − i. This requires an extrapolation of the period effect. If the cohort index is sufficiently large, that is, k > K, then the cohort effect needs to be extrapolated too. Thus, there are two forecast index arrays of interest, J_ap,1 and J_ap,2, defined in (74). Figure 2 illustrates these forecast index arrays.
Figure 2

I_ap is the data array. J_ap,1 is the forecast array where only period parameters need to be extrapolated. J_ap,2 is the forecast array where both period and cohort parameters need to be extrapolated. Cohorts are indicated by dashed lines.

Identification plays a role when extrapolating the estimates obtained on the data array I_ap. The identification issues can be ignored if the investigator simply extrapolates the double differences Δ²β_j and Δ²γ_k. In the context of ad hoc identified time effects, arbitrary linear trends are introduced in the model. The forecast of the predictor μ is invariant to these if and only if the chosen extrapolation method for β_j, γ_k preserves linear trends, so that they can cancel with the arbitrary linear trend in α_i. The next result gives a precise formulation of this statement. It applies both to point forecasts and to distribution forecasts.

Theorem 9

Consider the predictor μ_ij for (i, j) ∈ I_ap as given in (53). Suppose the time effects α_i, β_j, and γ_k are ad hoc identified. Consider the class of h-periods-ahead forecasts over J_ap constructed by combining the ad hoc identified estimates with extrapolated period and cohort effects that are functions of those estimates. Let g be the group (57). Invariance of the forecast with respect to the ad hoc identification is equivalent to either of the following: (i) the forecast of the predictor is invariant to g; (ii) the extrapolation method for the period and cohort effects is linear trend-preserving, in the sense that adding a level and a linear trend to the estimated effects changes the extrapolated effects by the same level and linear trend. To illustrate the use of Theorem 9 consider the extrapolation methods β̂_{J+h} = β̂_J and β̂_{J+h} = β̂_J + hΔβ̂_J. The first forecast is a random walk forecast and is seen to violate (ii). The second forecast is a cumulated random walk and satisfies (ii). The reason is that β̂_{J+h} = β̂_J + Σ_{s=J+1}^{J+h} Δβ̂_s, where each future difference Δβ̂_s is forecast by the last estimated difference Δβ̂_J, so that an arbitrary linear trend in the identified period effect carries over to the forecast. Further examples of forecasts that are linear trend-preserving, as well as some which are not, are given in Kuang et al. [25, Table 1]. Kuang, Nielsen, and Nielsen [10] apply this to reserving data organised in an age-cohort array I_ac and discuss the issue of robustification of forecasts with respect to structural breaks at the forecast origin. Miranda et al. [24] give an application to asbestos-related mortality using an age-period array I_ap.
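The contrast between the two extrapolation methods can be checked directly; the following sketch (ours, with illustrative numbers) verifies that the cumulated random walk is linear trend-preserving while the plain random walk is not:

```python
# Illustration of the trend-preservation condition (sketch, not the
# paper's code): forecasting a period effect after adding an arbitrary
# level b and trend d should equal transforming the original forecast.

def rw(beta, h):
    """Random walk: beta_{J+h|J} = beta_J."""
    return beta[-1]

def crw(beta, h):
    """Cumulated random walk: beta_{J+h|J} = beta_J + h * Dbeta_J."""
    return beta[-1] + h * (beta[-1] - beta[-2])

beta = [0.2, 0.3, 0.7, 0.8]       # some identified period effects, J = 4
b, d, h = 5.0, 1.3, 2             # arbitrary level/trend; forecast horizon
J = len(beta)
beta_t = [v + b + d * (j + 1) for j, v in enumerate(beta)]  # transformed

# Trend preservation: f(beta_t) should equal f(beta) + b + d*(J + h).
assert abs(crw(beta_t, h) - (crw(beta, h) + b + d * (J + h))) < 1e-9
# The plain random walk misses the trend contribution h*d and fails.
assert abs(rw(beta_t, h) - (rw(beta, h) + b + d * (J + h))) > 0.1
```

Only forecasts built with trend-preserving extrapolations, such as the cumulated random walk, are invariant to the arbitrary linear trends introduced by the ad hoc identification.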

5.4.4. Bayesian Ad Hoc Identification Using a Dynamic Prior

A Bayesian ad hoc identification using a dynamic prior does not solve the identification problem discussed in Section 4, and the same care has to be exercised to avoid the problems outlined in Section 3.4. Berzuini and Clayton [6] suggest such an ad hoc identification approach. On page 831 they write "Identifiability problems may be solved by imposing an arbitrary linear constraint on the log-linear trend components of age, period and cohort effects. Happily, such an arbitrary constraint has no effect on the predictions of the model." The previous analysis suggests that this is far from innocent. The Berzuini-Clayton suggestion is to ad hoc identify the model (53) through the constraint (77). A dynamic prior is chosen so that the double differences Δ²α, Δ²β, and Δ²γ are independent zero mean normal with variances ϕ = (σ α², σ β², σ γ²) that have χ²-type priors. The purpose of this is in part to facilitate extrapolations of Δ²α, Δ²β, and Δ²γ for i > I, j > J, and k > K, which is done through further draws from normal distributions. The level/trend effects θ level = (α 1, α 2, β 1, β 2, γ 1, γ 2)′ have independent uniform priors on some large intervals. We will analyse the Berzuini-Clayton model as applied to an age-period data array I ap. Decompose the canonical parameter ξ from (54) into two parts: the slope and level parameters, say ξ level, and the collection of double differences, say ξ Δ. The assumed prior for ξ Δ is a simple collection of independent normal distributions with variances ϕ. The assumed prior for ξ level is a linear combination not only of the independent uniform variables θ level, but also of ξ Δ, since the age double differences Δ²α are cumulated backwards in (54) but forwards in (77). Thus, the prior for ξ = (ξ level′, ξ Δ′)′ depends on the θ level construction. We get a hyperparameter λ hyper = (λ, ϕ), where λ is some three-dimensional ad hoc identified level/trend effect depending on θ level and ξ Δ.
We will argue that the ad hoc identified level/trend effect λ washes out in the Berzuini-Clayton model. However, the level/trend parameter ξ level is a function of the θ level construction, which is tailored to the ad hoc identification, and that construction remains in the analysis. In the presentation of the posterior, Berzuini and Clayton are careful only to consider the double differences ξ Δ and to stay clear of the ad hoc identified level/trend effect θ level. Theorem 2 yields the posterior p(ξ | y) = p(y | ξ)p(ξ)/p(y). Thus, the marginal posterior for the double differences is p(ξ Δ | y) = ∫p(y | ξ Δ, ξ level)p(ξ Δ, ξ level)dξ level/p(y). This links ξ Δ to ξ level and in turn to the θ level construction. The extrapolative method is based on double differences, so by Theorem 9 and the subsequent discussion it only depends on λ hyper through ϕ. By construction it does not reduce to the form required, so condition (39) for Theorem 3 is not satisfied. The distribution forecast then depends not only on the θ level construction, but also on the conditional prior p(ϕ | ξ), which is not updated by the likelihood. In summary, it appears that the Berzuini-Clayton analysis depends on the θ level construction as well as on the conditional prior p(ϕ | ξ). The dependence on the θ level construction could of course be addressed by introducing priors directly on ξ level, which would in turn be updated by the likelihood. Since the conditional prior p(ϕ | ξ) cannot be updated by the likelihood, its sole justification rests on the substantive context.

5.4.5. A Functional Form Hypothesis

It is instructive to consider functional form restrictions on the time effects. Such hypotheses can be analysed using the results outlined in Section 3.4.2. As an example, restrict the age effect to be quadratic in a similar way to Yang and Land [5], as in (79). This restriction on the time effect can be analysed by writing it in the form R′θ = ρ, see (28), and then applying Theorem A.3. Alternatively, in this particular case, we can show that the restriction actually only affects the ad hoc identified time effect through the canonical parameter, so a simpler analysis can be made. A quadratic polynomial has a constant second order derivative. Therefore the restriction (79) implies constant double differences, as in (80), an expression with one free parameter. Thus, it is useful to consider the third order differences, which give I − 3 linear restrictions on the canonical parameter. The age time effect α then has three remaining parameters, say α 1, α 2, and α 3. These are freely varying since the parameters σ 0, σ 1, and σ 2 are freely varying. If the constraint is imposed directly on the canonical parameter, the restricted model is a regular exponential family with the advantages outlined in Section 2.4. However, if the analysis is done with the time effect, the levels and trends will have to be ad hoc identified while bearing in mind the issues discussed above.
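As a quick numerical illustration of this argument (with hypothetical coefficient values s0, s1, s2 standing in for σ 0, σ 1, σ 2), a quadratic age effect has constant second differences and vanishing third differences:

```python
import numpy as np

# Sketch: alpha_i = s0 + s1*i + s2*i**2 has Delta^2 alpha_i = 2*s2 for all i,
# so every third difference vanishes, giving I - 3 linear restrictions.
i = np.arange(1, 11)                 # ages 1, ..., I with I = 10
s0, s1, s2 = 0.3, -0.5, 0.2          # illustrative coefficients
alpha = s0 + s1 * i + s2 * i**2

assert np.allclose(np.diff(alpha, n=2), 2 * s2)   # constant double differences
assert np.allclose(np.diff(alpha, n=3), 0.0)      # I - 3 restrictions
```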

5.4.6. The “Hierarchical Age-Period Cohort Regression Model”

In some cases a random effects approach can be used to get an overview of the many parameters of the age-period-cohort model. When applied to the time effects this implies an ad hoc identification. An example is the "hierarchical age-period cohort regression model" by Yang and Land [5]. In that paper the age effect is given a quadratic structure, but that does not have to be the case. Since random effects are only introduced for some of the time effects, the analysis of Section 4.3 has to be modified in a similar way to the analysis in Section 5.4.4. From (80) it is seen that the model restricts Δ²α = 2σ 2. Thus, divide the canonical parameter ξ into three elements: the slope and level parameters, say ξ level; the age double differences, say ξ α = (Δ²α 3,…, Δ²α I); and the remaining double differences, say ξ rest. Here ξ α is restricted by the hypothesis to equal 2σ 2, ξ rest is a linear function of the normal random effects, while ξ level is a three-dimensional linear function of σ 2 and of the six-dimensional object ν = (σ 0, σ 1, β 1, β 2, γ 1, γ 2)′. This leaves a three-dimensional ad hoc identified level/slope parameter λ which is also a function of ν but does not enter the likelihood. Let ψ collect σ 0, σ 1, σ 2 and the random effects variances. The random effects likelihood is constructed in three steps. First, we have the usual age-period-cohort likelihood p(y | ξ). Secondly, the random effects distribution for ξ level, ξ rest, and λ is multivariate normal, while ξ α is a deterministic function of ψ. Thus, decompose the prior as p(ξ level, ξ rest, λ | ψ) = p(ξ level, ξ rest | ψ)p(λ | ξ level, ξ rest, ψ). Thirdly, following Section 4.3, the random effects likelihood will not depend on p(λ | ξ level, ξ rest, ψ). The prior p(λ | ξ level, ξ rest, ψ) is not updated by the data. Plots and inferences based on the posterior p(θ | ψ, y) will then suffer from the ad hoc identification issues outlined in Section 3.4.

5.5. A Two-Sample Age-Period-Cohort Model

When confronted with two samples, for women and for men say, it may be of interest to apply the age-period-cohort model (43) to each of the samples and to impose that some of the time effects are the same across samples. The models for samples r = 1, 2 are two copies of (43). The time effect θ = (…, α, β, γ, δ,…)′ now varies in Θ = R^q where q = 4(I + J).

5.5.1. Analysis of the Unrestricted Two-Sample Model

The unrestricted two-sample model is simply analysed as two copies of the one-sample model of Section 5.1. The time effects of each copy are only defined up to linear trends. The group of transformations characterizing the identification problem combines two copies of the one-sample group (44). The maximal invariant parameter is ξ = (ξ 1′, ξ 2′)′ ∈ R^p where p = 4(I + J − 2) and each ξ r is of the form (45). The benefits of Section 2 hold when working with that parameter.

5.5.2. Bayesian Ad Hoc Identification Using a Dynamic Model

An application of the unrestricted two-sample model can be found in Cairns et al. [36]. The two samples are the population of England and Wales and the subpopulation of assured lives, so the substantive question is whether there is a selection effect for the assured lives. A Bayesian model with a dynamic prior is used. It shares some features with the Berzuini and Clayton [6] model discussed in Section 5.4.4, although the details of the ad hoc identification of the levels and slopes are slightly different. When it comes to forecasting, the extrapolative method appears to depend on the ad hoc identified parameter as well as on the hyperparameters. This complicates the analysis of the forecast relative to the discussion in Section 5.4.4.

5.5.3. The Hypothesis of Common Period Parameters

The two-sample model allows the possibility of adding cross-sample restrictions on the parameters. As an example we consider the hypothesis of common period parameters. Working with the canonical parameter, the hypothesis is the simple linear restriction (85), of the type discussed in Section 2.4.3. It is readily seen that the degrees of freedom of the hypothesis are J − 2, so the dimension of the restricted model is 4I + 3J − 6. The canonical parameter under the hypothesis follows directly. The same result arises when writing the hypothesis in terms of the time effects, as in (87). Such hypotheses on the time effect were discussed in Section 3.4.2. They can be analysed using the general result in Theorem A.3. However, we will take the simpler route of arguing that (87) only restricts the canonical parameter. The argument relies on noting that analysing the restriction for the two predictors is equivalent to analysing it for the first-sample predictor and for the cross-sample difference of the predictors. Now, the restricted model for the cross-sample differenced predictor is an age-cohort model, as in (89). Following the analysis of Section 5.3, the restriction (87) therefore implies the J − 2 linear restrictions given by (85). At the same time, the predictor for the first sample is left unrestricted by (87). In summary, the restrictions (85) and (87) are equivalent. The restriction has an interesting implication for the interpretation of the involved double differences. For the unrestricted model it was found that only plain double differences, such as Δ²α, are identified. Under the restriction the cross-sample differenced predictor is of the age-cohort form (89), so the differences of the cross-sample contrasts of the age and cohort effects are also identified.

5.5.4. Step-Wise Ad Hoc Identification under the Hypothesis

The analysis of Riebler and Held [7] finds that the cross-sample difference of the age effects is identified under the hypothesis (85). This is not consistent with the above analysis, which shows that under the hypothesis the cross-sample differenced predictor follows an age-cohort model, in which only differences, and not levels, of the cross-sample contrast of the age effects are identified. The apparent difference comes about because Riebler and Held follow a step-wise identification approach along the lines of Sections 3.3 and 5.4.1. In a first step the time effects α, β, and γ are constrained to have zero sums as in (69). In a second step the slopes are ad hoc identified using a Bayesian approach similar to that of Berzuini and Clayton [6]; see Sections 4 and 5.4.4 for a discussion of the consequences. The identification in the first step implies that the cross-sample difference of the age effects has a zero sum. Under the hypothesis (85) this is exactly what is needed to ad hoc identify the levels in the age-cohort model (89). In other words, a different level identification in the first step leads to a different level for the cross-sample difference of the age effects.

6. Models with Nonlinear Parametrisations

Some additional issues arise when looking at models with nonlinear parametrisations. A prominent example is the mortality model proposed by Lee and Carter [3], which is the current benchmark in mortality studies done by government agencies and pension funds. For this model the time effect space Θ has a nondifferentiability which can be avoided by working directly with the parameter space M. We analyse the Lee-Carter model in Section 6.1. In Section 6.2 we turn to a two-sample problem where some additional difficulties can arise when forecasting.

6.1. The Lee-Carter Model

The mortality model proposed by Lee and Carter [3] has a predictor of the form μ ij = α i + β i κ j; see (90). The time effects θ = (α 1,…, α I, β 1,…, β I, κ 1,…, κ J)′ vary in Θ = R^{2I+J}. Lee and Carter pointed towards two identification issues of the model. If α, β, and κ are one solution to (90), then α − βc, β, κ + c is also a solution for any scalar c, just as α, β/d, and κd are a solution for any d ≠ 0. Consequently, they proposed the ad hoc identification (91), with β′ι = 1 and κ′ι = 0. This is, however, not the full story about the identification issues. To get at this we follow the outline from the linearly parametrised models and start by finding the parameter space for the predictor μ.

6.1.1. The Parameter Space

We start by finding the predictor space M. Write the model in matrix form: let μ denote the I × J-matrix of the μ ij, let α, β, and κ be the vectors concatenating the α i, β i, and κ j, and let ι = (1,…, 1)′ be a vector of ones of conformable dimension. Then μ = αι′ + βκ′, as in (92). Postmultiply by the projection identity on ι to get the decomposition (93), where the orthogonal complement ι⊥ can be chosen so that ι⊥′κ = (Δκ 2,…, Δκ J)′ but could also be chosen otherwise. Equation (93) shows that the model is composed of two matrices of rank one. Thus, the parameter space M is given by (94). Note that M does not depend on the normalisation of ι⊥ since δ is freely varying. The space M is a manifold since the space of matrices δ with an upper bound on the rank is a manifold, as opposed to the space where δ has rank of unity. This space can be parametrised parsimoniously by a parameter ξ varying in the manifold Ξ given in (95). The parameter ξ is the candidate for the maximal invariant describing the equivalence classes of the mapping from the time effect θ to the predictor μ. The next step is to analyse the time effect space Θ. It is convenient to decompose M into two disjoint sets, M 1 and M 0, depending on the rank of δ. Correspondingly, the time effect space Θ can be decomposed into two disjoint sets, Θ1 and Θ0. Note that δ = 0 if and only if θ ∈ Θ0. Consider first the time effect space Θ1, which is implicitly what Lee and Carter had in mind. The mapping θ ↦ μ on Θ1 to M is invariant to the group of transformations g 1 in (99), acting on Θ1 for all c ∈ R and all d ≠ 0. The parameter ξ = (γ′, δ′)′ is invariant under g 1 acting on Θ1. Now, consider the space Θ0 with deficient rank. Then α i, β i, and κ j map into a predictor of the form μ ij = α i + φ i, where φ i = β i κ is constant in j. This mapping is invariant to the group of transformations g 0 in (100), acting on Θ0 for all (a 1,…, a I)′ ∈ R^I.
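The decomposition can be illustrated numerically. The sketch below (illustrative values only) forms μ = αι′ + βκ′, projects on ι, and checks that the level part γι′ plus a rank-one remainder with zero row sums reproduces μ, consistent with one possible choice of ι⊥:

```python
import numpy as np

rng = np.random.default_rng(6)
I, J = 4, 6
alpha, beta = rng.normal(size=I), rng.normal(size=I)
kappa = rng.normal(size=J)
iota = np.ones(J)

mu = np.outer(alpha, iota) + np.outer(beta, kappa)    # mu = alpha iota' + beta kappa'

gamma = alpha + beta * kappa.mean()                   # level part absorbed into gamma iota'
delta = np.outer(beta, kappa - kappa.mean())          # rank-one remainder

assert np.allclose(mu, np.outer(gamma, iota) + delta) # the decomposition reproduces mu
assert np.linalg.matrix_rank(delta) <= 1              # two matrices of rank (at most) one
assert np.allclose(delta @ iota, 0.0)                 # delta is orthogonal to iota row-wise
```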

Theorem 10

The parameter ξ ∈ Ξ of (95) satisfies the following: ξ is a function of θ ∈ Θ which is invariant to the groups g 0, g 1 in (99) and (100); the predictor μ is a function of ξ; and the parametrisation of M by ξ is exactly identified. Theorem 10 thus shows that ξ varies freely on the space Ξ and gives a unique parametrisation of μ. As a function of θ it is invariant to g 0, g 1; hence it is a maximal invariant. It is interesting to compare the properties of the spaces M, Ξ, and Θ. The spaces M and Ξ are spaces of matrices with deficient rank. These are smooth spaces, but they are not vector spaces since the sum of matrices with rank one may have rank larger than one. In contrast, Θ is a vector space. The mapping from Θ to M will inevitably be nondifferentiable. This nondifferentiability is avoided by working directly with M. Likewise, in a Bayesian setting it would seem more difficult to introduce a meaningful prior on Θ, with its nondifferentiability, than on M.

6.1.2. Maximum Likelihood Estimation

The maximum likelihood estimator for ξ can be derived analytically in the normal case. Consider a situation where the data array is of age-period form, so that Y ij is observed for (i, j) ∈ I ap. Suppose the Y ij are independent normal with mean μ ij and variance σ². Organise the data in an I × J-matrix Y. For fixed σ² the log likelihood is then a least squares criterion in μ. The maximum likelihood estimator has the following form; subsequently, it is related to the estimator suggested by Lee and Carter.

Theorem 11

For a normal age-period array parametrised by (94) the maximum likelihood estimators are γ̂ = Yι/J and δ̂ = svd1(Y − γ̂ι′), where svd1(·) is the singular value decomposition truncated to one factor. Thus, γ is estimated by the row averages of the data matrix, while δ is estimated by the singular value decomposition of the row-wise demeaned data matrix.
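A numerical sketch of Theorem 11 (illustrative data; numpy's SVD plays the role of svd1): γ is estimated by the row averages and δ by the rank-one truncated SVD of the demeaned matrix, which is its least squares rank-one approximation:

```python
import numpy as np

rng = np.random.default_rng(2)
I, J = 6, 8
Y = rng.normal(size=(I, J))                      # an illustrative age-period data matrix

gamma_hat = Y.mean(axis=1)                       # row averages: gamma_hat = Y iota / J
Z = Y - gamma_hat[:, None]                       # row-wise demeaned data matrix
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
delta_hat = s[0] * np.outer(U[:, 0], Vt[0])      # svd1(Z): truncate to one factor

# delta_hat has rank one, and its residual is the sum of the discarded
# squared singular values (the least squares rank-one property).
assert np.linalg.matrix_rank(delta_hat) == 1
assert np.isclose(np.linalg.norm(Z - delta_hat)**2, np.sum(s[1:]**2))
```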

6.1.3. Estimation of Ad Hoc Identified Time Effects

The ad hoc identification (91) gives a time effect θ varying in a (2I + J − 2)-dimensional affine subspace of Θ = R^{2I+J}. The ad hoc identified θ can now be expressed in terms of the maximal invariant parameter ξ using (95). In the case where δ ≠ 0 it has singular value decomposition δ = s uv′ for two vectors u ∈ R^I and v ∈ R^J with u′u = 1 and v′v = 1, while s > 0 is a positive scale. The ad hoc identification of Lee and Carter then pins down β and κ by rescaling u, v, and s so that β′ι = 1 and κ′ι = 0. Inserting the maximum likelihood estimators from Theorem 11 yields the estimators proposed by Lee and Carter. However, the disentangling of the singular values and singular vectors of δ̂ is done through the ad hoc identification β′ι = 1 and κ′ι = 0. These estimators are therefore specific to the considered data array and data set, in parallel with the discussion in Sections 3.2 and 5.4.1.
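The disentangling step can be sketched as follows (an illustration with generic data; rescaling the singular vectors is one way to impose β′ι = 1, and κ′ι = 0 then comes for free, since the demeaned matrix has zero row sums, forcing right singular vectors with positive singular values to be orthogonal to ι):

```python
import numpy as np

rng = np.random.default_rng(3)
Y = rng.normal(size=(5, 7))
alpha_hat = Y.mean(axis=1)
Z = Y - alpha_hat[:, None]                        # zero row sums by construction
U, s, Vt = np.linalg.svd(Z, full_matrices=False)

u, v = U[:, 0], Vt[0]
beta_hat = u / u.sum()                            # rescale so beta'iota = 1
kappa_hat = s[0] * v * u.sum()                    # compensate so the product is unchanged

assert np.isclose(beta_hat.sum(), 1.0)
assert abs(kappa_hat.sum()) < 1e-8                # kappa'iota = 0 holds automatically
# The identified pair reproduces the rank-one factor s * u v'.
assert np.allclose(np.outer(beta_hat, kappa_hat), s[0] * np.outer(u, v))
```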

6.1.4. Consequences of the Possible Rank Deficiency

The parameter space M was split into spaces M 1 and M 0 depending on the rank of δ. The space M 0 is a Lebesgue null set relative to M. Broadly speaking, there are two consequences of the possible rank deficiency. The first consequence is an estimation problem arising in the vicinity of M 0. The second consequence is that the usual normal asymptotic distribution theory for M 1 does not apply in the vicinity of M 0. Whether this becomes a problem in practice depends on the data. One solution is to ensure that the time effect really is present when using the Lee-Carter model. Investigating whether the time effects are present amounts to estimating the rank of δ. For a given data set two Lee-Carter models can be estimated. The first model, with predictor space M, is the unrestricted model in which rank(δ) ≤ 1. The second model has predictor space M 0, so δ = 0. Twice the difference of the likelihood values gives a likelihood ratio test statistic which is asymptotically χ². If the smaller model, M 0, is accepted, it is used in the subsequent analysis. However, if the smaller model M 0 is rejected, then it is likely that the predictor is not located in the vicinity of M 0 and it is then safe to work with the predictor space M 1. The consistency of this step-wise procedure is discussed in a cointegration context by Johansen [37, Section 12]. Even when this procedure points towards working with the parameter space M 1, the rank deficiency may still affect inference under M 1. Analysis of simple canonical correlation models suggests that inference under M 1 will be nearly similar if the distance to M 0 is sufficiently large. A problem is that the distribution of the test statistic will have poor finite sample properties when the parameter value is close to M 0. A simple way around this problem is to test for M 0 using a test with a lower level than the conventional one. A more complicated way to address it is to employ a finite sample correction when testing for M 0.
See Nielsen [38, 39] for further discussion of this issue in the context of simple canonical correlation models. The rank deficiency issue is typically not encountered in a standard Lee-Carter analysis. The reason is that the analysis is typically applied to data with a marked improvement in mortality rates over time. A Lee-Carter analysis could, however, run into trouble if it were applied to data without a strong calendar effect. The issue becomes more pertinent when extending the Lee-Carter model with a cohort component, as in Renshaw and Haberman [40]. If the cohort effect is modest, the corresponding matrix is nearly rank deficient and the likelihood will be nearly flat in certain directions. This is presumably the reason for the estimation problems noted by Cairns et al. [41].

6.1.5. Forecasting

The purpose of the Lee-Carter model is usually to forecast future mortality. This issue is considered for the model with parameter space M 1. The standard approach is to extrapolate κ, ad hoc identified through, for instance, κ′ι = 0. The h-step-ahead extrapolation of κ based on some forecast method is combined with the estimates of α and β to give the overall forecast of the predictor. The identification question is then for which extrapolation methods this forecast is invariant to the choice of ad hoc identification. The condition for avoiding an adverse impact of the ad hoc identification is as follows.

Theorem 12

The forecast in (105) is invariant to the ad hoc identification if and only if the extrapolation method for the period effect is location-scale preserving. The default forecast method in the literature is a random walk with a drift, which was the preferred forecast of Lee and Carter [3]. This is a random walk with an estimated drift and normal errors ε with mean zero and estimated variance. This extrapolation method is location-scale preserving as required in Theorem 12; it is even linear trend preserving. Other valid forecasts are a random walk without intercept or an autoregression. An alternative approach to forecasting would consider the predictor of the model for a particular age group, say i. That predictor is e i′μ, where e i is the ith unit vector. From this we can generate forecasts using any time series method. The resulting forecast will in general depend on the estimated age effects as well as on the extrapolated period effect, and it is therefore more general than the forecasts discussed in Theorem 12. The forecast for another age group, say i†, should be the same up to a linear transformation dictated by the Lee-Carter structure, so the h-step-ahead forecasts for the entire array follow from the forecast for the chosen age group.
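The location-scale preservation of the random walk with drift is easily checked numerically. The sketch below (an illustrative point forecast only, ignoring the error term) verifies that re-identifying κ as c + dκ simply maps the forecast through the same transformation:

```python
import numpy as np

def rw_drift_forecast(k, h):
    """Point forecast of a random walk with drift: kappa_J + h * mean(Delta kappa)."""
    drift = np.mean(np.diff(k))
    return k[-1] + np.arange(1, h + 1) * drift

rng = np.random.default_rng(4)
kappa = np.cumsum(rng.normal(-1.0, 0.5, size=20))  # a drifting period effect
c, d = 2.5, -3.0                                   # an alternative ad hoc identification
kappa_tilde = c + d * kappa

# Location-scale preserving: the forecast maps through the same (c, d).
assert np.allclose(rw_drift_forecast(kappa_tilde, 6),
                   c + d * rw_drift_forecast(kappa, 6))
```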

6.1.6. Bayesian Ad Hoc Identification Using a Dynamic Model

A Bayesian model with a dynamic specification of the prior has been suggested by Pedroza [42]. Dynamic priors are presented for the time effects θ = (ξ, λ) involving a hyperparameter ϕ. The ad hoc identification (91) is imposed, so that the analysis is made for an ad hoc identified time effect, for which Pedroza presents posteriors. When evaluating these posteriors one should bear in mind that the conditional prior p(λ | ξ) is not updated by the data; see Theorem 2. The presented extrapolative method does not depend on λ. Even so, the forecast will depend on the conditional prior p(ϕ | ξ), which is not updated by the data; see Theorem 3.

6.2. The Two-Sample Lee-Carter Model

We now turn to applications of the Lee-Carter model in two-sample problems. Suppose the two samples are for women and for men. One approach would be to fit separate Lee-Carter models, of the form (90), to the two data sets. The objective is now to extrapolate the two period effects. Extrapolating the two models separately using separate random walks is often found to be volatile, so methods that seek to combine information from both estimated series are sought. The next result describes the invariance problem in forecasting.

Theorem 13

The forecast for sample r = 1 is invariant to the ad hoc identification if the extrapolation method preserves location and scale for sample 1 but is invariant to the location and scale of sample 2, for all c 1, c 2 ∈ R and all d 1, d 2 ≠ 0. For one sample the standard forecasting technique appears to be the random walk with a drift as in (108). For the two-sample problem a suggestion could be that women and men should share a common random walk with a drift but deviate from it by a stationary process. In econometrics this idea is referred to as cointegration, as proposed by Engle and Granger [43]; see also Johansen [37] for a likelihood based vector autoregressive approach. It is tempting to require that the calendar effects should cointegrate with coefficients of unity, so that κ 1 − κ 2 should be stationary. However, that apparently intuitive choice violates Theorem 13 because the locations and scales of κ 1 and κ 2 are different and arbitrary. There are two fixes to this problem. The first solution is to work directly with the mortality predictors for an arbitrary age group i, as outlined for the one-sample case in connection with (109). Since no identification is involved, it is permitted to impose that the two predictors cointegrate with coefficients of unity. The forecast for age group i is then carried over to other age groups. The second solution is to work with the estimated series but estimate the cointegrating coefficients from the data. In other words, a cointegrating relation of the form κ 1 − φ − ψκ 2 should be zero mean and stationary, with the coefficients φ and ψ estimated from the data. This can, for instance, be done by Johansen's approach for a bivariate vector autoregression; see Hendry and Nielsen [44, Section 17].
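The point can be illustrated algebraically (a sketch with simulated series; φ, ψ, c, d are arbitrary illustrative values): a unit-coefficient combination changes under re-identification of the second period effect, whereas freely estimated coefficients absorb the change:

```python
import numpy as np

rng = np.random.default_rng(5)
kappa1 = np.cumsum(rng.normal(size=30))
phi, psi = 0.6, 1.4
kappa2 = (kappa1 - phi + rng.normal(scale=0.1, size=30)) / psi  # a cointegrated pair

z = kappa1 - phi - psi * kappa2        # a zero-mean stationary combination
c, d = -2.0, 0.5                       # alternative ad hoc identification of kappa2
kappa2_tilde = c + d * kappa2

# With coefficients fixed at unity the combination changes under re-identification ...
assert not np.allclose(kappa1 - kappa2, kappa1 - kappa2_tilde)
# ... but re-estimated coefficients recover the same stationary relation exactly.
assert np.allclose(z, kappa1 - (phi - psi * c / d) - (psi / d) * kappa2_tilde)
```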

7. Conclusion

Ad hoc identification is intimately linked to interpretation, inference, numerical analysis, and forecasting. The ad hoc identification will often introduce an arbitrary element into the statistical analysis, whether it is based on frequentist or Bayesian methods. This arbitrary element is entirely avoidable and is, in our view, best avoided unless there is a clear substantive motivation for the ad hoc identification. For decades there has been a debate over how best to ad hoc identify mortality models. Our proposal is to bypass this discussion by analysing the surjective mapping between the unidentified time effect parameter and the predictor of the model and then deducing a maximal invariant parametrisation. In our experience there are typically two substantive benefits. First, it simplifies estimation and other statistical computations, which is what we have focused on here. Secondly, and perhaps more importantly, it helps to focus the substantive question that gives rise to the analysis in the first place. The issue of dealing with two time scales also occurs in other statistical models, such as the Cox regression model; see Cabrera et al. [45] for a recent application. In future research it would be interesting to consider whether the analysis presented here has any bearing on that problem.