
On testing dependence between time to failure and cause of failure when causes of failure are missing.

Isha Dewan, Sangita Kulathinal.

Abstract

The hypothesis of independence between the failure time and the cause of failure is studied by using the conditional probabilities of failure due to a specific cause, given that there is no failure up to a certain fixed time. In practice, there are situations when the failure times are available for all units but the causes of failure might be missing for some units. We propose tests based on U-statistics for independence of the failure time and the cause of failure in the competing risks model when not all causes of failure can be observed. Each test statistic is asymptotically normal. Simulation studies compare the power of the proposed tests for two families of distributions. The one-sided and two-sided tests based on the Kendall-type statistic perform exceedingly well in detecting departures from independence.


Year:  2007        PMID: 18060052      PMCID: PMC2092381          DOI: 10.1371/journal.pone.0001255

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

We consider a competing risks set-up where a unit is subject to two disjoint risks of failure and each unit ultimately fails due to one of the two risks. We do not allow simultaneous failures due to both risks. The observations are the time to failure T and an identifier of the risk, δ = j if the failure is due to risk j, j = 0, 1. Let the joint distribution of (T,δ) be specified by the subsurvival functions Sj(t) = P(T≥t, δ = j), or by the subdistribution functions Fj(t) = P(T≤t, δ = j), j = 0, 1, and let S(t) = S0(t)+S1(t) and F(t) = F0(t)+F1(t) denote the survival and distribution functions of T. Let the conditional probability of failure due to the first risk, given that there is no failure up to time t, be given as

Φ1(t) = P(δ = 1 | T≥t) = S1(t)/S(t), whenever S(t)>0.

These probabilities were introduced while studying failure and preventive maintenance in a censoring setting where the interest is in the distribution of the failure time which would have been observed in the absence of preventive maintenance [1]. Another conditional probability of interest is

Φ*0(t) = P(δ = 0 | T≤t) = F0(t)/F(t), whenever F(t)>0.

Under independence Sj(t) = S(t)P(δ = j), j = 0, 1, and hence T and δ can be studied separately. Thus, the hypothesis of equality of the subsurvival functions reduces to testing whether P(δ = 1) = P(δ = 0) = 1/2, a Bernoulli trial with success probability half; a two-dimensional problem reduces to a one-dimensional one. The dependence between the failure time T and the cause of failure δ in terms of the above two conditional probability functions was studied in [2]. Below we give formal proofs of two results, stated in [2], on the independence and positive quadrant dependence (PQD) structure of (T,δ) in terms of these conditional probabilities.

Lemma 1: T and δ are independent if and only if Φ1(t) = Φ1(0) = φ for all t, where φ = P(δ = 1).

Proof: When T and δ are independent, their joint distribution is the product of the marginal distributions. Hence

Φ1(t) = S1(t)/S(t) = φS(t)/S(t) = φ, where φ = P(δ = 1).

Conversely, when Φ1(t) does not depend on t, then Φ1(t) = Φ1(0), and Φ1(0) = S1(0)/S(0) = P(δ = 1) = φ.
This in turn implies that S1(t) = φS(t), which is the product of the marginal distributions of δ and T. Also, S0(t) = S(t)−S1(t) = (1−φ)S(t). Hence the result. The independence of T and δ is also equivalent to Φ*0(t) = Φ*0(0) for all t. A simple and easily checked dependence structure is positive quadrant dependence (PQD), indicating positive association between two random variables.

Definition 1: Random variables X and Y are positive quadrant dependent (PQD) if the following inequality holds:

P(X≤x, Y≤y) ≥ P(X≤x)P(Y≤y), for all x and y.

In our case, because δ takes only the two values 0 and 1, T and δ are PQD if the following inequality holds:

P(T≤t, δ = 0) ≥ P(T≤t)P(δ = 0), for all t.

This is because P(T≤t, δ≤0) = P(T≤t, δ = 0), P(T≤t, δ≤1) = P(T≤t), and P(δ≤1) = 1; hence the required inequality always holds for δ = 1.

Lemma 2: T and δ are PQD if and only if Φ1(t)≥Φ1(0) = φ for all t.

Proof: When T and δ are PQD then

P(T≤t, δ = 0) ≥ P(T≤t)P(δ = 0), for all t.

Note that P(T≤t, δ = 0) = (1−φ)−S0(t) and P(T≤t)P(δ = 0) = (1−S(t))(1−φ). Substituting these identities in the above inequality, we get S0(t) ≤ (1−φ)S(t), that is, Φ1(t) = 1−S0(t)/S(t) ≥ φ. Hence the result. Note that T and δ being PQD is also equivalent to Φ*0(t)≥Φ*0(0) for all t>0.

Various hypothesis testing problems of checking independence of T and δ against alternatives specifying dependence structures have been considered, and U-statistics derived, when complete data on all n units are available [2]. However, in many practical situations the experimenter may have information on the failure times of all the individuals but on the causes of failure of only some of them. In a mortality follow-up study, the causes of death are obtained from death certificates, and the problem of causes of death missing from death certificates is well known. This may occur for various reasons: a doctors' strike, an autopsy not being performed so that the definite underlying cause of death is unknown, or the absence of any legal requirement to record an underlying cause of death on the death certificate.
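For intuition, the conditional probability Φ1(t) appearing in Lemmas 1 and 2 can be estimated directly from a sample as the fraction of cause-1 failures among units still at risk at time t. A minimal sketch with simulated (hypothetical) data, assuming NumPy:

```python
import numpy as np

def phi1_hat(t, times, delta):
    """Empirical Phi_1(t) = P(delta = 1 | T >= t) = S_1(t)/S(t)."""
    at_risk = times >= t            # units with T >= t
    if not at_risk.any():
        return float("nan")
    return float(delta[at_risk].mean())

# Under independence, Phi_1(t) should stay near phi = P(delta = 1) for all t.
rng = np.random.default_rng(0)
times = rng.exponential(1.0, size=5000)
delta = rng.binomial(1, 0.3, size=5000)   # cause drawn independently of time
print([round(phi1_hat(t, times, delta), 2) for t in (0.0, 0.5, 1.0, 2.0)])
```

A plot of phi1_hat against t that rises above φ would suggest the PQD alternative of Lemma 2.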
The present work is motivated by a follow-up study on mortality in which the underlying causes of death were missing for nearly 20% of the patients but the times of death were known for all. A similar situation arises in engineering, where series systems are tested for failure due to various components, possibly under accelerated life testing. In this case, a thorough autopsy of a failed system is required to identify the failed component(s) that led to the system failure, and such information may not be available for all failed systems for financial or logistic reasons. In a study of motorcycle fatalities, 40% of the death certificates had either partial or no information on the underlying causes of death [3]. An example from animal bioanalysis where all causes were not available was considered in [4]. Likelihood-based estimation in the case of missing causes of failure has been studied [5], [6], [7], [8], [9]. A modified log-rank test for competing risks with missing failure type was also considered [10]. The maximum likelihood estimators and minimum variance unbiased estimators of the parameters of the exponential distribution in the missing-cause case were obtained [11], and their approximate and asymptotic properties were discussed and confidence intervals derived [12]. In this paper, we consider the problem of testing

H0 : Φ1(t) = φ, for all t,

against various alternative hypotheses characterising the dependence structure of T and δ, namely

H1 : Φ1(t) is not a constant, and
H2 : Φ1(t) ≥ φ for all t, with strict inequality for some t (T and δ are PQD),

when causes are missing for some units. Let (Ti,δi), i = 1,…,n, be the competing-risks data available on n individuals. Here we consider a situation where δ may not always be observed, i.e., it may be missing for some units. Let Oi be an indicator variable taking value one if δi is observed and zero if δi is missing, and let p = P(Oi = 1) be the probability that δi is observed. We assume that the δi are missing completely at random and hence Oi is independent of (Ti,δi): whether or not the cause of failure is observed has no bearing on the actual cause.
Similar assumptions are made in [9], [13]. We extend some of the tests based on U-statistics proposed in [2] to the case where the δ's are not observed for all units. We carry out simulation studies comparing the power of the tests for two families of distributions. We also apply the proposed tests to the data on the failure of switches given in [14] by artificially creating missing causes of failure. The proposed tests perform satisfactorily, and the use of the data on failure times even when the corresponding causes are missing is recommended.

Results

We apply the proposed tests Ukm, UPQDm and Ukm1 (the one-sided test based on Ukm) to simulated data from two parametric families of distributions and evaluate empirical powers. We also apply the tests to a real data set. The computations were done using SAS [15], and the source codes and a brief guide on how to use them are provided in the supplementary material (Text S1, Text S2, Text S3 and Text S4).

Example 1: Parametric family of distributions [2]

Consider the parametric family of distributions proposed in [2] with subdistribution function F1(t) = φ[F(t)]^a, where 1≤a≤2, 0≤φ≤0.5 and F(t) is a proper distribution function. Note that P(δ = 1) = φ and

Φ1(t) = φ(1−[F(t)]^a)/(1−F(t)),

which is an increasing function of t. For a = 1, Φ1(t) = φ, that is, T and δ are independent; for 1<a≤2, Φ1(t)>φ, that is, T and δ are PQD, and hence H2 holds. Let F(t) = 1−exp{−λt} be the overall distribution function. We simulated random samples by varying n, p, and (λ,a,φ) from the above distribution. The empirical level of significance and power were calculated using 1000 replications for each combination of n, p, and (λ,a,φ). Table 1 gives the empirical level of significance and Table 2 the empirical power of the three tests based on Ukm, Ukm1 and UPQDm.
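A sketch of how such samples can be generated: draw T from F and then δ given T. The conditional probability used below, P(δ = 1 | T = t) = φ a [F(t)]^(a−1), is an assumed parameterisation chosen so that P(δ = 1) = φ and Φ1(t) increases for a > 1; it is offered as an illustration, not necessarily the exact family of [2]. The MCAR indicator O is generated with success probability p:

```python
import numpy as np

def simulate_family(n, lam=1.0, a=1.5, phi=0.5, p=1.0, rng=None):
    """Simulate (T, delta, O): T ~ Exp(lam); delta | T = t is Bernoulli with
    an assumed probability phi * a * F(t)**(a - 1); O ~ Bernoulli(p) is the
    missing-completely-at-random observation indicator for delta."""
    if rng is None:
        rng = np.random.default_rng()
    t = rng.exponential(1.0 / lam, size=n)
    F = 1.0 - np.exp(-lam * t)                 # overall distribution function
    delta = rng.binomial(1, phi * a * F ** (a - 1.0))
    obs = rng.binomial(1, p, size=n)
    return t, delta, obs

t, d, o = simulate_family(10000, a=1.5, phi=0.5, p=0.8,
                          rng=np.random.default_rng(1))
print(round(float(d.mean()), 2))   # marginal P(delta = 1) should be near phi
```

Setting a = 1 makes δ independent of T, which is the configuration used for the empirical levels in Table 1.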
Table 1

Empirical level of significance of the three U-tests for a family of distributions of Dewan et al. (2004) (λ, a) = (1,1)

φ     p      n = 25                    n = 50                    n = 100
             Ukm    Ukm1   UPQDm      Ukm    Ukm1   UPQDm      Ukm    Ukm1   UPQDm
0.5   1      0.057  0.054  0.000      0.061  0.049  0.001      0.052  0.047  0.002
0.5   0.9    0.055  0.048  0.002      0.062  0.052  0.001      0.047  0.053  0.004
0.5   0.8    0.054  0.047  0.007      0.049  0.061  0.008      0.050  0.049  0.004
0.2   1      0.069  0.061  0.001      0.052  0.049  0.001      0.051  0.060  0.010
0.2   0.9    0.078  0.059  0.001      0.061  0.052  0.002      0.052  0.059  0.005
0.2   0.8    0.092  0.060  0.005      0.057  0.054  0.004      0.047  0.064  0.002
Table 2

Empirical power of the three U-tests for a family of distributions of Dewan et al. (2004) λ = 1 and φ = 0.5

a     p      n = 25                    n = 50                    n = 100
             Ukm    Ukm1   UPQDm      Ukm    Ukm1   UPQDm      Ukm    Ukm1   UPQDm
1.5   1      0.421  0.546  0.049      0.690  0.812  0.201      0.942  0.974  0.575
1.5   0.9    0.394  0.505  0.066      0.652  0.767  0.201      0.916  0.955  0.504
1.5   0.8    0.365  0.489  0.079      0.609  0.711  0.208      0.894  0.938  0.453
1.8   1      0.765  0.863  0.183      0.967  0.988  0.646      0.999  0.999  0.974
1.8   0.9    0.713  0.802  0.181      0.947  0.975  0.545      0.998  0.999  0.917
1.8   0.8    0.653  0.768  0.191      0.921  0.965  0.485      0.997  0.999  0.836
From the two tables it is clear that the modified test statistic Ukm attains its level when roughly half of the failures are likely to be due to the first cause. The conclusions remain valid even when 20% of the failure causes are not available. The power increases with the value of a and with the sample size, and the test has very good power even when a = 1.5. One should keep in mind that the alternative of dependence is extremely general. However, the test based on the one-sided version of the Kendall-type statistic, Ukm1, performs much better than the test based on Ukm for testing H0 against H2. It was observed in [2] that the test UPQD with p = 1 is extremely conservative and also inefficient; the entries for UPQDm in the two tables confirm this observation. But given that the level of significance attained is very low, it is able to detect alternatives reasonably well.

Example 2: Random sign censoring model [1]

Random sign censoring (RSC), also known as age-dependent censoring, is a model in which the lifetime of a unit (X) is censored by Z = X−Wη, where W, 0<W<X, is a random variable and η takes the values +1 and −1. Here P(η = −1) = 1−P(η = 1) = P(δ = 1) = φ. We take X exponentially distributed with parameter λ and W = aX, 0<a<1. A value of a close to zero corresponds to independence of T and δ, while a>0 gives Φ1(t) as an increasing function of t, implying that T and δ are PQD. For the simulations, we consider the two values a = 0.00001 and a = 0.5. The test based on Ukm almost attains its level even for sample sizes as small as n = 25, as can be seen from Table 3, and has good power for n = 100. The test based on Ukm1 has a slightly higher level as well as higher power. The test based on UPQDm is very conservative and has low empirical power even for n = 100. The one-sided test based on Ukm1 is definitely the better choice for detecting PQD alternatives.
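The random sign censoring construction can be sketched as follows; the mechanical details (W = aX, failure observed when η = −1, so that T = min(X, Z)) are filled in here as assumptions consistent with the description above:

```python
import numpy as np

def simulate_rsc(n, lam=1.0, a=0.5, phi=0.5, rng=None):
    """Random sign censoring: X ~ Exp(lam), W = a*X with 0 < a < 1, and
    Z = X - W*eta with P(eta = -1) = phi.  When eta = -1, Z > X and the
    failure is observed (delta = 1); when eta = +1, Z = (1 - a)*X < X and
    the censoring time is observed (delta = 0)."""
    if rng is None:
        rng = np.random.default_rng()
    x = rng.exponential(1.0 / lam, size=n)
    eta = np.where(rng.random(n) < phi, -1, 1)
    z = x - a * x * eta
    t = np.minimum(x, z)
    delta = (eta == -1).astype(int)
    return t, delta

t, d = simulate_rsc(20000, a=0.5, phi=0.5, rng=np.random.default_rng(3))
print(round(float(d.mean()), 2))   # P(delta = 1) = phi by construction
```

For a close to zero the two subpopulations of times coincide, so T and δ are nearly independent; for a = 0.5 the δ = 0 times are systematically shorter, which is the PQD situation simulated in Table 3.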
Table 3

Empirical power of the three U-tests for random sign censoring model (λ = 1)

(a, φ)         p      n = 25                    n = 50                    n = 100
                      Ukm    Ukm1   UPQDm      Ukm    Ukm1   UPQDm      Ukm    Ukm1   UPQDm
(10⁻⁵, 0.5)    1      0.057  0.067  0.002      0.061  0.063  0.000      0.052  0.065  0.000
(10⁻⁵, 0.5)    0.9    0.055  0.065  0.001      0.062  0.059  0.001      0.047  0.053  0.000
(10⁻⁵, 0.5)    0.8    0.054  0.059  0.005      0.049  0.059  0.008      0.050  0.048  0.006
(10⁻⁵, 0.7)    1      0.056  0.057  0.000      0.067  0.063  0.000      0.056  0.057  0.000
(10⁻⁵, 0.7)    0.9    0.063  0.069  0.005      0.059  0.064  0.003      0.055  0.060  0.003
(10⁻⁵, 0.7)    0.8    0.062  0.067  0.009      0.057  0.063  0.006      0.052  0.056  0.000
(0.5, 0.5)     1      0.352  0.474  0.029      0.577  0.702  0.110      0.850  0.923  0.381
(0.5, 0.5)     0.9    0.321  0.442  0.050      0.513  0.647  0.135      0.793  0.884  0.358
(0.5, 0.5)     0.8    0.286  0.410  0.066      0.473  0.609  0.121      0.751  0.848  0.306
(0.5, 0.7)     1      0.320  0.423  0.029      0.489  0.610  0.091      0.759  0.847  0.303
(0.5, 0.7)     0.9    0.266  0.378  0.045      0.449  0.564  0.105      0.705  0.811  0.249
(0.5, 0.7)     0.8    0.229  0.348  0.056      0.396  0.516  0.112      0.655  0.755  0.233
Example 3: Nair's data revisited [14]

Here we consider data on the failure of 37 switches due to one of two possible causes of failure, published in [14]. These data were analysed in [2], where it was shown that the failure time (T) and the cause of failure (δ) of the switches were not independent; moreover, the conditional probability of failure due to cause A, Φ1(t), was shown to be larger than φ, so that T and δ are PQD. We calculate the three test statistics for the entire data on 37 switches as before. We also artificially create missing data on the cause of failure for varying values of p, repeating this 1000 times to evaluate the empirical powers of the test statistics. The hypothesis of independence of T and δ, H0, is rejected against H1 at the α = 5% level of significance using Ukm (the value of the test statistic is 2.70, which is larger than 1.96), and the one-sided test Ukm1 rejects H0 against H2 at the α = 5% level (the value of the test statistic is 2.70, which is larger than 1.64). However, the PQD-based test UPQDm (the value of the test statistic is 1.35, which is smaller than 1.64) does not reject H0 against H2 at the α = 5% level of significance. Table 4 shows the empirical powers of the tests for various values of p.
Table 4

Empirical power of the three U-tests for Nair's data (1993)

p      Ukm    Ukm1   UPQDm
0.9    0.962  0.993  0.127
0.8    0.841  0.945  0.167
0.7    0.705  0.865  0.180
0.6    0.578  0.758  0.187
0.5    0.454  0.638  0.172
As with the simulated data, the test Ukm1 performs well even when half of the causes are missing. The power of the UPQDm test is unsatisfactory.
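The artificial-missingness experiment of Example 3 amounts to repeatedly masking each cause with probability 1 − p and recording how often a given test rejects. A small sketch of that loop, with the test supplied as a user function (the names below are illustrative, not taken from the paper's SAS code):

```python
import numpy as np

def mask_causes(delta, p, rng):
    """MCAR masking: return an indicator O, where O = 0 (cause missing)
    with probability 1 - p, independently of (T, delta)."""
    return rng.binomial(1, p, size=len(delta))

def empirical_power(times, delta, p, test_rejects, n_rep=1000, seed=0):
    """Fraction of replications in which `test_rejects(times, delta, obs)`
    (a user-supplied boolean decision rule) rejects H0."""
    rng = np.random.default_rng(seed)
    hits = sum(bool(test_rejects(times, delta, mask_causes(delta, p, rng)))
               for _ in range(n_rep))
    return hits / n_rep
```

Plugging in the rejection rules for Ukm, Ukm1 or UPQDm as test_rejects reproduces the structure of Table 4.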

Discussion

Testing independence between the failure time T and the cause of failure δ is often important because of the reduction in dimensionality and the possibility of studying T and δ separately. The available tests use only completely observed data on T and δ. Missing data cannot be avoided in practice, and hence the effect of missing observations on the existing tests needs to be addressed. From the simulation studies it is clear that the two-sided test Ukm performs well for both families of distributions, for sample sizes as small as 25 and when 20% of the causes of failure are unknown; this can be seen from the attained levels of significance and the high empirical power in Tables 1, 2 and 3. The empirical powers of all three tests are higher for the parametric family of distributions of Example 1 than for the RSC model of Example 2, for all sample sizes. The performance of the one-sided test Ukm1, based on Kendall's τ, is clearly superior to that of UPQDm, as demonstrated by Tables 1, 2 and 3. Even when all causes are known, the test based on Kendall's τ was observed to be four times more efficient than the test based on UPQD [2]. In the case of missing causes too, we therefore recommend the use of Ukm1 for testing independence against PQD. One obvious reason is that Kendall's τ uses the information on (T,δ) from each member of a pair of observations. Similar observations emerge from the real-data analysis of Example 3 (Table 4). Failure times with missing information on the causes of failure also provide useful information about departures from independence of T and δ, and hence omitting such observations from the analysis may result in a loss of efficiency. For this reason, the analysis should not be based only on the complete data on both times and causes of failure (with a reduced sample size, which is random).
This article is the first attempt of its kind to carry out tests for independence under the assumption that causes are missing completely at random. How the tests perform under the assumption of missing at random, or even informative missingness, remains an open research problem.

Materials and Methods

General dependence between T and δ

First we consider the problem of testing H0 : Φ1(t) = φ for all t, against H1 : Φ1(t) is not a constant, where Φ1(t) = P(δ = 1|T≥t) = S1(t)/S(t), and φ = Φ1(0) = P(δ = 1). Recall that a pair (Ti,δi) and (Tj,δj) is concordant if Ti>Tj, δi = 1, δj = 0 or Ti<Tj, δi = 0, δj = 1, and discordant if Ti>Tj, δi = 0, δj = 1 or Ti<Tj, δi = 1, δj = 0. The kernel ψ(Ti, δi, Tj, δj) takes the value 1 for a concordant pair, −1 for a discordant pair, and 0 otherwise [2]. If δ is missing for some units then ψ(Ti, δi, Tj, δj) cannot be defined for all pairs. In Table 5, m indicates that δ is not observed and ? indicates the cases where ψ is not defined.
Table 5

Values taken by the kernel ψ for various combinations of the pairs (Ti,δi) and (Tj,δj)

(δi, δj)    (1,1)  (1,0)  (1,m)  (0,1)  (0,0)  (0,m)  (m,1)  (m,0)  (m,m)
Ti > Tj       0      1      ?     −1      0      ?      ?      ?      ?
Ti < Tj       0     −1      ?      1      0      ?      ?      ?      ?
Note that when Ti>Tj and δi = 1 but δj is missing, ψ(Ti, δi, Tj, δj) would take the value 1 if δj = 0 and the value 0 if δj = 1. Hence, in order to retrieve the best possible information, we assign the weight (1+0)/2 = 1/2 in this case. Similarly, when Ti>Tj and δi = 0 but δj is missing, ψ would take the value −1 if δj = 1 and the value 0 if δj = 0; hence we assign the value −1/2 to the kernel in this case. We now redefine the kernel for the missing-data situation as ψm, obtained by replacing each undefined entry of Table 5 by the average of the values the complete-data kernel could take: ±1/2 when one cause is missing and 0 when both causes are missing; the subscript m indicates the missing-data situation. Define Ukm as the corresponding U-statistic, the average of ψm over all pairs of observations. Then E(Ukm) = 0 under H0 and E(Ukm)≠0 under H1, and the asymptotic variance of √n Ukm under H0, denoted Var(Ukm), reduces for p = 1 to (4/3)φ(1−φ), which is the value given in [2]. From the central limit theorem for U-statistics [16] (see Text S5), it follows that Ukm has an asymptotically normal distribution for large n.

Theorem 1: Under H0, √n Ukm converges in distribution to N(0, σ²) as n→∞, where σ² = (4/3)p²φ(1−φ)+(1/3)p(1−p). We refer to the supplementary material (Text S6) for the explicit derivation of E(Ukm), Var(Ukm) and the proof of Theorem 1.

In practice, p and φ are generally unknown and can be replaced by their consistent estimators p̂ and φ̂. A test procedure for testing H0 against H1 is then: reject H0 at the 100α% level of significance if √n|Ukm|/σ̂ exceeds the corresponding cut-off point of the standard normal distribution, where σ̂² is a consistent estimator of σ² obtained by replacing p and φ with p̂ and φ̂.

For computational purposes, it is necessary to express Ukm as a function of ranks. Let n1, n2 and n3 represent the numbers of observations in three groups: causes observed to be 1, causes observed to be 0, and causes not observed, respectively. Let the corresponding ordered times in each group be X(1), X(2),…, X(n1), Y(1), Y(2),…, Y(n2), and Z(1), Z(2),…, Z(n3), respectively.
Let Ri denote the combined rank of X(i) in the ordered arrangement of the (n1+n2) observations of types X and Y, Si the combined rank of X(i) in the ordered arrangement of the (n1+n3) observations of types X and Z, and Qi the combined rank of Y(i) in the ordered arrangement of the (n2+n3) observations of types Y and Z. The numbers of pairs on which ψm(·) takes each of its values can be counted from these ranks, and Ukm can then be written in terms of the Ri, Si and Qi.

Consider next testing H0 against H2. The U-statistic proposed in [2] for testing H0 against H2 is UPQD, based on the kernel

ψ*(Ti, δi, Tj, δj) = I(δi = 1) if Ti≤Tj, and I(δj = 1) if Ti>Tj,

that is, the indicator that the unit with the smaller failure time failed due to cause 1. If δ is missing for some units then ψ*(Ti, δi, Tj, δj) cannot be defined for all pairs. Table 6 shows the pairs for which the kernel is defined completely and the cases where it is not defined.
Table 6

Values taken by the kernel ψ* for various combinations of the pairs (Ti,δi) and (Tj,δj)

(δi, δj)    (1,1)  (1,0)  (1,m)  (0,1)  (0,0)  (0,m)  (m,1)  (m,0)  (m,m)
Ti ≤ Tj       1      1      1      0      0      0      ?      ?      ?
Ti > Tj       1      0      ?      1      0      ?      1      0      ?
As in the earlier subsection, we define a modified kernel ψ*m to take the missing causes into account, replacing each undefined entry of Table 6 by the average (1+0)/2 = 1/2 of the values the complete-data kernel could take. Let UPQDm be the corresponding U-statistic, where the subscript m indicates the missing-data situation. Under H0, E(UPQDm) = pφ+(1−p)/2, and E(UPQDm) differs from this value under H2. The asymptotic variance of √n(UPQDm−E(UPQDm)) under H0, denoted Var(UPQDm), reduces for p = 1 to (4/3)φ(1−φ), which is the value given in [2]. From the central limit theorem for U-statistics [16] (see Text S5), it follows that UPQDm has an asymptotically normal distribution for large n.

Theorem 2: Under H0, √n(UPQDm−E(UPQDm)) converges in distribution to N(0, σ²) as n→∞, where σ² = (4/3)p²φ(1−φ)+(1/3)p(1−p). We refer to the supplementary material (Text S7) for the explicit derivation of E(UPQDm), Var(UPQDm) and the proof of Theorem 2. We reject the null hypothesis for large values of √n(UPQDm−Ê(UPQDm))/σ̂, where Ê(UPQDm) and σ̂ are obtained by replacing φ and p with their consistent empirical estimators φ̂ and p̂. Let Ri* denote the rank of Ti in the ordered observations (T1, T2, …, Tn); UPQDm can then be expressed in terms of these ranks.

Note that a one-sided test based on Ukm, in which H0 is rejected for large values of Ukm, can also be used for testing H0 against H2, since E(Ukm)≥0 under H2. In fact, the one-sided test uses the data on both T and δ in each pairwise comparison, while UPQDm uses the information on (T, δ) from one member of the pair and only T from the other. We refer to the one-sided test based on Ukm as Ukm1.

Supporting Information

Text S1: SAS source code for Example 1 (0.06 MB DOC)
Text S2: SAS source code for Example 2 (0.06 MB DOC)
Text S3: SAS source code for Example 3 (0.03 MB DOC)
Text S4: A short guide on the use of the SAS codes (0.02 MB DOC)
Text S5: Central limit theorem for U-statistics (0.07 MB DOC)
Text S6: Derivation of E(Ukm), Var(Ukm) and proof of Theorem 1 (0.14 MB DOC)
Text S7: Derivation of E(UPQDm), Var(UPQDm) and proof of Theorem 2 (0.07 MB DOC)
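The modified Kendall-type kernel of Table 5 and the test of Theorem 1 can be sketched in code as follows. This is a naive O(n²) implementation rather than the rank-based formula, and the averaged kernel values (±1/2 when one cause is missing, 0 when both are missing) follow the weighting argument described above; treat it as an illustrative sketch under those assumptions, not the authors' SAS implementation:

```python
import math
import numpy as np

def psi_m(ti, di, oi, tj, dj, oj):
    """Modified kernel psi_m: +1/-1 for concordant/discordant pairs with both
    causes observed, +-1/2 when exactly one cause is missing, 0 otherwise."""
    if ti == tj:
        return 0.0
    # orient the pair so that "hi" is the unit with the larger failure time
    (dhi, ohi), (dlo, olo) = ((di, oi), (dj, oj)) if ti > tj else ((dj, oj), (di, oi))
    if ohi and olo:
        return float(dhi - dlo)              # (1,0) -> +1, (0,1) -> -1, else 0
    if ohi:                                  # cause of the smaller time missing
        return 0.5 if dhi == 1 else -0.5
    if olo:                                  # cause of the larger time missing
        return 0.5 if dlo == 0 else -0.5
    return 0.0                               # both causes missing

def u_km_test(times, delta, obs):
    """Return (U_km, standardised statistic), using the Theorem 1 variance
    sigma^2 = (4/3) p^2 phi (1 - phi) + (1/3) p (1 - p) with p, phi replaced
    by their empirical estimates."""
    times, delta, obs = map(np.asarray, (times, delta, obs))
    n = len(times)
    s = sum(psi_m(times[i], delta[i], obs[i], times[j], delta[j], obs[j])
            for i in range(n) for j in range(i + 1, n))
    u = s / (n * (n - 1) / 2)
    p_hat = obs.mean()
    phi_hat = delta[obs == 1].mean()
    var = (4/3) * p_hat**2 * phi_hat * (1 - phi_hat) + (1/3) * p_hat * (1 - p_hat)
    return u, math.sqrt(n) * u / math.sqrt(var)

# Example under H0: cause independent of time, roughly 20% of causes missing.
rng = np.random.default_rng(4)
tt = rng.exponential(size=60)
dd = rng.binomial(1, 0.5, size=60)
oo = rng.binomial(1, 0.8, size=60)
u_stat, z_stat = u_km_test(tt, dd, oo)
```

Under H0 the standardised statistic should behave approximately like a standard normal variate, which is the basis of the rejection rule described above.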

1. Dewanji A, Sengupta D. Estimation of competing risks with general missing pattern in failure types. Biometrics, December 2003.

2. Lu K, Tsiatis AA. Comparison between two partial likelihood approaches for the competing risks model with missing cause of failure. Lifetime Data Anal, March 2005.

3. Lapidus G, Braddock M, Schwartz R, Banco L, Jacobs L. Accuracy of fatal motorcycle-injury reporting on death certificates. Accid Anal Prev, August 1994.
