
Efficient Noninferiority Testing Procedures for Simultaneously Assessing Sensitivity and Specificity of Two Diagnostic Tests.

Guogen Shan¹, Amei Amei², Daniel Young³.

Abstract

Sensitivity and specificity are often used to assess the performance of a diagnostic test with binary outcomes. Wald-type test statistics have been proposed for testing sensitivity and specificity individually. In the presence of a gold standard, simultaneous comparison between two diagnostic tests for noninferiority of sensitivity and specificity based on an asymptotic approach has been studied by Chen et al. (2003). However, the asymptotic approach may suffer from unsatisfactory type I error control as observed from many studies, especially in small to medium sample settings. In this paper, we compare three unconditional approaches for simultaneously testing sensitivity and specificity. They are approaches based on estimation, maximization, and a combination of estimation and maximization. Although the estimation approach does not guarantee type I error, it has satisfactory performance with regard to type I error control. The other two unconditional approaches are exact. The approach based on estimation and maximization is generally more powerful than the approach based on maximization.


Year: 2015 PMID: 26366190 PMCID: PMC4558434 DOI: 10.1155/2015/128930

Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238

1. Introduction

Sensitivity and specificity are often used to summarize the performance of a diagnostic or screening procedure. Sensitivity is the probability of a positive diagnostic result given that the subject has the disease, and specificity is the probability of a negative diagnostic result in the nondiseased group. Diagnostic tests with high values of sensitivity and specificity are often preferred, and both quantities can be estimated in the presence of a gold standard. For example, two diagnostic tests, the technetium-99m methoxyisobutylisonitrile single photon emission computed tomography (Tc-MIBI SPECT) and the computed tomography (CT), were compared for diagnosing recurrent or residual nasopharyngeal carcinoma (NPC) from benign lesions after radiotherapy in the study by Kao et al. [1]. The gold standard in their study is the biopsy method. The sensitivity and specificity are 73% and 88% for the CT test and 73% and 96% for the Tc-MIBI SPECT test. Traditionally, noninferiority of sensitivity and specificity between two diagnostic procedures is tested individually using the McNemar test [2-6]. Recently, Tange et al. [7] developed an approach to simultaneously test sensitivity and specificity in noninferiority studies. Lu and Bean [2] were among the first researchers to propose a Wald-type test statistic for testing a nonzero difference in sensitivity or specificity between two diagnostic tests for paired data. Later, it was pointed out by Nam [3] that the test statistic by Lu and Bean [2] has unsatisfactory type I error control. A new test statistic based on a restricted maximum likelihood method was then proposed by Nam [3] and was shown to have good performance with actual type I error rates closer to the desired rates. This test statistic was used by Chen et al. [8] to compare sensitivity and specificity simultaneously in the presence of a gold standard. Actual type I error rates for a compound asymptotic test were evaluated on some specific points in the sample space.
It is well known that the asymptotic method behaves poorly when the sample size is small. Therefore, it is not necessary to comprehensively evaluate type I error rate [9-14]. An alternative to an asymptotic approach is an exact approach conducted by enumerating all the possible tables for given total sample sizes of diseased and nondiseased subjects. The first commonly used unconditional approach is a method based on maximization [15]. In the unconditional approach, only the number of subjects in the diseased and nondiseased group is fixed, not the total number of responses from both groups. The latter is considered as the usual conditional approach by treating both margins of the table as fixed. The p value of the unconditional approach based on maximization is calculated as the maximum of the tail probability over the range of a nuisance parameter [15]. This approach has been studied for many years and it can be conservative due to a smaller actual type I error rate as compared to the test size in small sample settings. One possible reason leading to the conservativeness of this approach is the spikes in the tail probability curve. Storer and Kim [16] proposed another unconditional approach based on estimation which is also known as the parametric bootstrap approach. The maximum likelihood estimate (MLE) is plugged into the null likelihood for the nuisance parameter. Other estimates may be considered if the MLE is not available [7]. Although this estimation based approach is often shown to have type I error rates being closer to the desired size than asymptotic approaches, it still does not respect test size. A combination of the two approaches based on estimation and maximization has been proposed by Lloyd [4, 17] for the testing of noninferiority with binary matched-pairs data, which can be obtained from a case-control study and a twin study. The p value of the approach based on estimation is used as a test statistic in the following maximization step. 
It should be noted that there could be multiple estimation steps before the final maximization step. The final step must be a maximization step in order to make the test exact. This approach has been successfully extended to testing for trend with binary endpoints [5, 18]. The rest of this paper is organized as follows. Section 2 presents relevant notation and testing procedures for simultaneously testing sensitivity and specificity. In Section 3, we extensively compare the performance of the competing tests. A real example is presented in Section 4 to illustrate the application of the asymptotic and exact procedures. Section 5 is devoted to discussion.

2. Testing Approaches

Each subject in a study is evaluated by two dichotomous diagnostic tests, $T_1$ and $T_2$, in the presence of a gold standard. Suppose the status of each subject, either diseased or nondiseased, was already determined by the gold standard before performing the two diagnostic tests. Within the diseased group, $n_{ij}$ ($i = 0, 1$; $j = 0, 1$) is the number of subjects with diagnostic results $T_1 = i$ and $T_2 = j$, where $T_k = 0$ and $T_k = 1$ represent negative and positive diagnostic results from the $k$th test ($k = 1, 2$), respectively, with $p_{ij}$ being the associated probability. The total number of diseased subjects is $n = n_{00} + n_{10} + n_{01} + n_{11}$. Similarly, $m_{ij}$ ($i = 0, 1$; $j = 0, 1$) is the number of subjects with diagnostic results $T_1 = i$ and $T_2 = j$ in the nondiseased group, $q_{ij}$ is the associated probability, and $m = m_{00} + m_{10} + m_{01} + m_{11}$ is the total number of nondiseased subjects. Such data can be organized in a 2 × 2 × 2 contingency table (Table 1), where $N = (n_{00}, n_{10}, n_{01}, n_{11})$ and $M = (m_{00}, m_{10}, m_{01}, m_{11})$. It is reasonable to assume that the diseased group is independent of the nondiseased group.

Table 1

Test results from two diagnostic tests when a gold standard exists.

Diagnostic result	Diseased group		Nondiseased group
	T₂ = 1	T₂ = 0	T₂ = 1	T₂ = 0
T₁ = 1	n₁₁ (p₁₁)	n₁₀ (p₁₀)	m₁₁ (q₁₁)	m₁₀ (q₁₀)
T₁ = 0	n₀₁ (p₀₁)	n₀₀ (p₀₀)	m₀₁ (q₀₁)	m₀₀ (q₀₀)

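In a study with given total sample sizes $n$ and $m$ in the diseased and the nondiseased groups, respectively, the sensitivities of diagnostic tests $T_1$ and $T_2$ are estimated as $(n_{11} + n_{10})/n$ and $(n_{11} + n_{01})/n$. Similarly, $(m_{00} + m_{01})/m$ and $(m_{00} + m_{10})/m$ are the estimated specificities for $T_1$ and $T_2$, respectively. Writing $\theta_{\mathrm{sen}} = p_{10} - p_{01}$ and $\theta_{\mathrm{spe}} = q_{01} - q_{10}$ for the true differences, the estimated difference between their sensitivities is

$$\hat{\theta}_{\mathrm{sen}} = \frac{n_{10} - n_{01}}{n}, \tag{1}$$

and the estimated difference between their specificities is

$$\hat{\theta}_{\mathrm{spe}} = \frac{m_{01} - m_{10}}{m}. \tag{2}$$

The hypotheses for noninferiority of sensitivity and specificity between $T_1$ and $T_2$ are given in the format of compound hypotheses as

$$H_0\colon\ \theta_{\mathrm{sen}} \le -\delta_{\mathrm{sen}} \ \text{ or } \ \theta_{\mathrm{spe}} \le -\delta_{\mathrm{spe}} \tag{3}$$

against

$$H_a\colon\ \theta_{\mathrm{sen}} > -\delta_{\mathrm{sen}} \ \text{ and } \ \theta_{\mathrm{spe}} > -\delta_{\mathrm{spe}}, \tag{4}$$

where $\delta_{\mathrm{sen}}$ and $\delta_{\mathrm{spe}}$ are the clinically meaningful differences between $T_1$ and $T_2$ in sensitivity and specificity, $\delta_{\mathrm{sen}} > 0$ and $\delta_{\mathrm{spe}} > 0$. For example, investigators may consider a difference in sensitivity of less than 0.2 not clinically important ($\delta_{\mathrm{sen}} = 0.2$). A test statistic for the hypotheses $H_0\colon \theta_{\mathrm{sen}} \le -\delta_{\mathrm{sen}}$ versus $H_a\colon \theta_{\mathrm{sen}} > -\delta_{\mathrm{sen}}$ is

$$Z_{\mathrm{sen}} = \frac{\hat{\theta}_{\mathrm{sen}} + \delta_{\mathrm{sen}}}{\hat{\sigma}_{\mathrm{sen}}}, \tag{5}$$

where $\hat{\theta}_{\mathrm{sen}}$ is the estimated difference in sensitivities and $\hat{\sigma}_{\mathrm{sen}}$ is the estimated standard error of $\hat{\theta}_{\mathrm{sen}}$. The estimate of $\sigma_{\mathrm{sen}}$ based on a restricted maximum likelihood estimation approach [3, 19, 20] is used, and the associated form is $\hat{\sigma}_{\mathrm{sen}} = \sqrt{\left(2\tilde{p}_{01} - \delta_{\mathrm{sen}}(\delta_{\mathrm{sen}} + 1)\right)/n}$, where

$$\tilde{p}_{01} = \frac{-B + \sqrt{B^2 - 8A}}{4}, \qquad A = \delta_{\mathrm{sen}}(\delta_{\mathrm{sen}} + 1)\frac{n_{01}}{n}, \qquad B = -\hat{\theta}_{\mathrm{sen}}(1 - \delta_{\mathrm{sen}}) - 2\left(\frac{n_{01}}{n} + \delta_{\mathrm{sen}}\right). \tag{6}$$

There are two reasons for using this estimate instead of some other estimates [2]. First, it has been shown to perform well [8, 20]. Second, it is applicable to a 2 × 2 contingency table with off-diagonal zero cells. We are going to consider the exact approaches by enumerating all possible tables, some of which have zero cells in the off-diagonals. The traditional estimate for $\sigma_{\mathrm{sen}}$ does not provide a reasonable estimate of variance for such tables. The test statistic for sensitivity in (5) follows a normal distribution asymptotically. The null hypothesis $H_0\colon \theta_{\mathrm{sen}} \le -\delta_{\mathrm{sen}}$ would be rejected if the test statistic $Z_{\mathrm{sen}}$ in (5) is greater than or equal to $z_\alpha$, where $z_\alpha$ is the upper $\alpha$ percentile of the standard normal distribution. As mentioned by many researchers, the asymptotic approach has unsatisfactory type I error control, especially in small or medium sample settings.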
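The sensitivity statistic described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' code: the function name `z_noninferiority` is hypothetical, and the closed form for the restricted estimate of p01 is taken to be the larger root of the score equation on the null boundary p10 = p01 − δ, which is our reading of the restricted maximum likelihood approach cited as [3, 19, 20].

```python
import math


def z_noninferiority(n10, n01, n, delta):
    """Wald-type statistic for H0: theta <= -delta, where theta = p10 - p01.

    The standard error uses a restricted MLE of p01 computed on the null
    boundary p10 = p01 - delta (assumed form, see the lead-in).
    """
    theta_hat = (n10 - n01) / n
    p01_hat = n01 / n
    # The restricted MLE solves 2*p^2 + B*p + A = 0; the larger root is used.
    A = delta * (delta + 1) * p01_hat
    B = -theta_hat * (1 - delta) - 2 * (p01_hat + delta)
    disc = max(B * B - 8 * A, 0.0)  # guard against tiny negative rounding error
    p01_tilde = (-B + math.sqrt(disc)) / 4
    # Null variance of theta_hat: (p10 + p01 - delta^2) / n with p10 = p01 - delta.
    variance = (2 * p01_tilde - delta * (delta + 1)) / n
    return (theta_hat + delta) / math.sqrt(variance)


if __name__ == "__main__":
    # Diseased-group counts from the CT versus Tc-MIBI SPECT example: n10 = n01 = 3, n = 11.
    print(z_noninferiority(3, 3, 11, 0.01))
```

Because the restricted estimate never falls below δ, the variance stays strictly positive even when an off-diagonal cell is zero, which is the property the text relies on for enumerating all tables.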
An alternative is an exact approach by enumerating all possible tables for the given sample sizes. The first exact unconditional approach considered here is a method based on maximization (referred to as the M approach) [15]. The p value of this approach is calculated as the maximum of the tail probability; that is, the worst possible value for the nuisance parameter is found in order to calculate the p value. The tail set based on the test statistic $Z_{\mathrm{sen}}$ for this approach is

$$R_Z(N_{\mathrm{obs}}) = \{N : Z_{\mathrm{sen}}(N) \ge Z_{\mathrm{sen}}(N_{\mathrm{obs}})\},$$

where $N_{\mathrm{obs}}$ is the observed data of $N$. It is easy to show that $(n_{10}, n_{01} \mid n)$ follows a trinomial distribution with parameters $(n; p_{10}, p_{01})$. Then, the M p value is expressed as

$$P_M(N_{\mathrm{obs}}) = \max_{p_{01} \in \Theta} \sum_{N \in R_Z(N_{\mathrm{obs}})} \Pr(n_{10}, n_{01}; p_{01}),$$

where $\Theta = (\delta_{\mathrm{sen}}, \min(1, (1 + \delta_{\mathrm{sen}})/2))$ is the search range for the nuisance parameter $p_{01}$ and

$$\Pr(n_{10}, n_{01}; p_{01}) = \frac{n!}{n_{10}!\, n_{01}!\, (n - n_{10} - n_{01})!}\, (p_{01} - \delta_{\mathrm{sen}})^{n_{10}}\, p_{01}^{n_{01}}\, (1 - 2p_{01} + \delta_{\mathrm{sen}})^{n - n_{10} - n_{01}}$$

is the probability mass function of the trinomial distribution. The M approach could be conservative when the actual type I error is much less than the test size [5, 9]. To overcome this disadvantage of exact unconditional approaches, Lloyd [21] proposed a new exact unconditional approach based on estimation and maximization (referred to as the E + M approach). The first step in this approach is to compute the p value for each table based on the estimation approach [16], also known as parametric bootstrap. We refer to this approach as the E approach. The nuisance parameter in the null likelihood is replaced by the maximum likelihood estimate $\hat{p}_{01}$, and the E p value is calculated as

$$P_E(N_{\mathrm{obs}}) = \sum_{N \in R_Z(N_{\mathrm{obs}})} \Pr(n_{10}, n_{01}; \hat{p}_{01}).$$

It should be noted that the E approach does not guarantee type I error rate. Once the E p values are calculated for each table, they will be used as a test statistic in the next M step for the p value calculation. The E + M p value is then given by

$$P_{E+M}(N_{\mathrm{obs}}) = \max_{p_{01} \in \Theta} \sum_{N \in R_E(N_{\mathrm{obs}})} \Pr(n_{10}, n_{01}; p_{01}),$$

where $R_E(N_{\mathrm{obs}}) = \{N : P_E(N) \le P_E(N_{\mathrm{obs}})\}$ is the tail set.
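The three p values for the sensitivity hypothesis can be sketched by brute-force enumeration. The code below is an illustrative sketch, not the authors' implementation. All function names are hypothetical. The maximization over the nuisance parameter is approximated by a simple grid search over the closed search range, and the E step plugs in the restricted estimate of p01 on the null boundary; both of these choices are assumptions.

```python
import math


def z_stat(n10, n01, n, delta):
    """Return (Z, restricted estimate of p01) for H0: p10 - p01 <= -delta."""
    theta_hat, p01_hat = (n10 - n01) / n, n01 / n
    A = delta * (delta + 1) * p01_hat
    B = -theta_hat * (1 - delta) - 2 * (p01_hat + delta)
    p01_tilde = (-B + math.sqrt(max(B * B - 8 * A, 0.0))) / 4
    se = math.sqrt((2 * p01_tilde - delta * (delta + 1)) / n)
    return (theta_hat + delta) / se, p01_tilde


def trinomial_pmf(n10, n01, n, p01, delta):
    """Pr(n10, n01; p01) on the null boundary p10 = p01 - delta."""
    rest = n - n10 - n01
    coef = math.factorial(n) // (
        math.factorial(n10) * math.factorial(n01) * math.factorial(rest))
    p10 = max(p01 - delta, 0.0)              # clamp rounding noise
    p_rest = max(1 - 2 * p01 + delta, 0.0)
    return coef * p10 ** n10 * p01 ** n01 * p_rest ** rest


def all_tables(n):
    """All (n10, n01) pairs with n10 + n01 <= n."""
    return [(a, b) for a in range(n + 1) for b in range(n + 1 - a)]


def z_tail(obs, n, delta):
    """Tables whose Z statistic is at least as large as the observed one."""
    z_obs = z_stat(obs[0], obs[1], n, delta)[0]
    return [t for t in all_tables(n)
            if z_stat(t[0], t[1], n, delta)[0] >= z_obs - 1e-12]


def tail_prob(tail, n, p01, delta):
    return sum(trinomial_pmf(a, b, n, p01, delta) for a, b in tail)


def max_tail_prob(tail, n, delta, grid_size=200):
    """Grid approximation of the maximum tail probability over p01."""
    lo, hi = delta, min(1.0, (1 + delta) / 2)
    return max(tail_prob(tail, n, lo + (hi - lo) * k / grid_size, delta)
               for k in range(grid_size + 1))


def m_p_value(obs, n, delta):
    return max_tail_prob(z_tail(obs, n, delta), n, delta)


def e_p_value(obs, n, delta):
    p01_tilde = z_stat(obs[0], obs[1], n, delta)[1]
    return tail_prob(z_tail(obs, n, delta), n, p01_tilde, delta)


def em_p_value(obs, n, delta):
    # E p values of all tables act as the test statistic in the M step.
    p_e = {t: e_p_value(t, n, delta) for t in all_tables(n)}
    tail = [t for t, p in p_e.items() if p <= p_e[obs] + 1e-12]
    return max_tail_prob(tail, n, delta)
```

For instance, `m_p_value((3, 3), 11, 0.01)` evaluates the sensitivity part of a small example with n10 = n01 = 3 and n = 11. These are single-hypothesis p values; a finer grid or a numerical optimizer could replace the grid search if more precision is needed.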
The refinement from the E step in the E + M approach could possibly increase the actual type I error rate of the testing procedure, which may lead to a power increase for exact tests. Monotonicity is an important property in exact testing procedures: it reduces the computation time and guarantees that the maximum of the tail probability is attained at the boundary for noninferiority hypotheses. Berger and Sidik [22] showed that monotonicity is satisfied for paired data for testing one-sided hypotheses based on the McNemar test. Most importantly, the dimension of nuisance parameters is reduced from two to one [17]. We provide the following theorem to show the monotonicity of the test statistic $Z_{\mathrm{sen}}$.

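Theorem 1.

The monotonicity property is satisfied for $Z_{\mathrm{sen}}$ under the null hypothesis: $Z_{\mathrm{sen}}(n_{10}, n_{01} + 1) \le Z_{\mathrm{sen}}(n_{10}, n_{01})$ and $Z_{\mathrm{sen}}(n_{10}, n_{01}) \le Z_{\mathrm{sen}}(n_{10} + 1, n_{01})$.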

Proof

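Let $x_1 = n_{10}$ and $x_2 = n_{10} + 1$. For a given $n_{01}$,

$$Z_{\mathrm{sen}}(x_i) = \frac{(x_i - n_{01})/n + \delta_{\mathrm{sen}}}{\hat{\sigma}_{\mathrm{sen}}(x_i)}, \qquad i = 1, 2.$$

Under the null hypothesis, $\theta_{\mathrm{sen}} + \delta_{\mathrm{sen}} \le 0$. In order to show $Z_{\mathrm{sen}}(x_2) \ge Z_{\mathrm{sen}}(x_1)$, we only need to prove that $\hat{\sigma}_{\mathrm{sen}}(x_2) \ge \hat{\sigma}_{\mathrm{sen}}(x_1)$. From (6), we know that

$$\hat{\sigma}_{\mathrm{sen}} = \sqrt{\frac{2\tilde{p}_{01} - \delta_{\mathrm{sen}}(\delta_{\mathrm{sen}} + 1)}{n}}, \qquad \tilde{p}_{01} = \frac{-B + \sqrt{B^2 - 8A}}{4},$$

where $A$ and $B$ are given in (6). It is obvious that $B$ is a decreasing function of $n_{10}$ and $A$ is a nonnegative constant when $n_{01}$ is fixed, so $\tilde{p}_{01}$ is an increasing function of $n_{10}$, which leads to $\hat{\sigma}_{\mathrm{sen}}(x_2) \ge \hat{\sigma}_{\mathrm{sen}}(x_1)$. It follows that $Z_{\mathrm{sen}}(x_2) \ge Z_{\mathrm{sen}}(x_1)$. For a given $n_{10}$, a similar proof leads to $Z_{\mathrm{sen}}(n_{10}, n_{01} + 1) \le Z_{\mathrm{sen}}(n_{10}, n_{01})$.

The probability of the tail set for either the M approach or the E + M approach has two nuisance parameters, $p_{01}$ and $p_{10}$. Applying the theorem for the monotonicity property, the type I error of the test occurs on the boundary of the two-dimensional nuisance parameter space, $p_{10} = p_{01} - \delta_{\mathrm{sen}}$. Therefore, there is only one nuisance parameter, $p_{01}$, in the definition of the two exact p values.

For testing the specificity, the asymptotic approach, the M approach, the E approach, and the E + M approach can be similarly applied to test the hypotheses $H_0\colon \theta_{\mathrm{spe}} \le -\delta_{\mathrm{spe}}$ against $H_a\colon \theta_{\mathrm{spe}} > -\delta_{\mathrm{spe}}$. The test statistic [3, 19, 20] would be

$$Z_{\mathrm{spe}} = \frac{\hat{\theta}_{\mathrm{spe}} + \delta_{\mathrm{spe}}}{\hat{\sigma}_{\mathrm{spe}}},$$

where $\hat{\sigma}_{\mathrm{spe}} = \sqrt{\left(2\tilde{q}_{10} - \delta_{\mathrm{spe}}(\delta_{\mathrm{spe}} + 1)\right)/m}$ is the estimated standard error of $\hat{\theta}_{\mathrm{spe}}$, $\tilde{q}_{10} = \left(-D + \sqrt{D^2 - 8C}\right)/4$, $C = \delta_{\mathrm{spe}}(\delta_{\mathrm{spe}} + 1)m_{10}/m$, and $D = -\hat{\theta}_{\mathrm{spe}}(1 - \delta_{\mathrm{spe}}) - 2(m_{10}/m + \delta_{\mathrm{spe}})$. Under the null hypothesis, one can show the monotonicity of $Z_{\mathrm{spe}}$ in a similar way to $Z_{\mathrm{sen}}$.

When there are two diagnostic tests available, we may want to simultaneously confirm the noninferiority of sensitivity and specificity for the two tests. The populations of the diseased group and the nondiseased group can be reasonably assumed to be independent of each other. Then, the joint probability is a product of two probabilities:

$$\Pr\big((N, M) \in R\big) = \Pr(N \in R_{\mathrm{sen}})\,\Pr(M \in R_{\mathrm{spe}}),$$

where $R$ is the rejection region, with components $R_{\mathrm{sen}}$ and $R_{\mathrm{spe}}$ for the diseased and nondiseased groups. Let $\alpha_{\mathrm{sen}}$ and $\alpha_{\mathrm{spe}}$ be the significance levels for testing sensitivity and specificity separately.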
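We can reject the compound null hypothesis $H_0\colon \theta_{\mathrm{sen}} \le -\delta_{\mathrm{sen}}$ or $\theta_{\mathrm{spe}} \le -\delta_{\mathrm{spe}}$ at the significance level of $\alpha$ when the sensitivity null hypothesis is rejected at the level of $\alpha_{\mathrm{sen}}$ and the specificity null hypothesis is rejected at the level of $\alpha_{\mathrm{spe}}$, where $\alpha_{\mathrm{sen}}\alpha_{\mathrm{spe}} = \alpha$. For simplicity, we assume $\alpha_{\mathrm{sen}} = \alpha_{\mathrm{spe}} = \sqrt{\alpha}$.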
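The decision rule for the compound hypothesis can be written in a few lines. The sketch below assumes the equal split of the significance level between the two component tests, so that the product of the two component levels equals the overall level; the function names `component_level` and `compound_reject` are hypothetical, and the component p values are assumed to come from any one of the approaches described above.

```python
import math


def component_level(alpha=0.05):
    """Level used for each component test so that the product equals alpha."""
    return math.sqrt(alpha)


def compound_reject(p_sen, p_spe, alpha=0.05):
    """Reject the compound null only if both component nulls are rejected."""
    level = component_level(alpha)
    return p_sen <= level and p_spe <= level


if __name__ == "__main__":
    # With alpha = 0.05 each component test is run at about 0.2236.
    print(round(component_level(0.05), 4))
    print(compound_reject(0.10, 0.20), compound_reject(0.10, 0.30))
```

With an overall level of 0.05, each component test is therefore carried out at a level of roughly 0.2236 rather than 0.05.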

3. Numerical Study

We already know that neither the asymptotic approach nor the E approach guarantees the type I error rate; however, it is still interesting to compare type I error control for the following four approaches: (1) the asymptotic approach, (2) the E approach, (3) the M approach, and (4) the E + M approach. We select three commonly used values of $\delta_{\mathrm{sen}}$ and $\delta_{\mathrm{spe}}$: 0.05, 0.1, and 0.2. For each configuration of $\delta_{\mathrm{sen}}$ and $\delta_{\mathrm{spe}}$, actual type I error rates are presented in Table 2 for sample size $n = m = 20$ and in Table 3 for sample size $n = m = 50$ at the significance level of $\alpha = 0.05$. It can be seen from both tables that the asymptotic approach generally has inflated type I error rates. Both the M approach and the E + M approach are exact tests and respect the test size as expected. Although the E approach does not guarantee type I error rate, the performance of the E approach is much better than that of the asymptotic approach regarding type I error control. Even for the larger sample size, the M approach is still conservative. The E + M approach has an actual type I error rate which is very close to the nominal level when $n = m = 50$.

Table 2

Actual type I error rates n = m = 20.

δ_sen	δ_spe	Asymptotic approach	M approach	E approach	E + M approach
0.05	0.05	0.1285	0.0343	0.0499	0.0489
	0.1	0.0894	0.0380	0.0489	0.0490
	0.2	0.0877	0.0401	0.0479	0.0480

0.1	0.05	0.0894	0.0380	0.0489	0.0490
	0.1	0.0621	0.0421	0.0481	0.0492
	0.2	0.0610	0.0444	0.0470	0.0481

0.2	0.05	0.0877	0.0401	0.0479	0.0480
	0.1	0.0610	0.0444	0.0470	0.0481
	0.2	0.0599	0.0468	0.0460	0.0471

Table 3

Actual type I error rates n = m = 50.

δ_sen	δ_spe	Asymptotic approach	M approach	E approach	E + M approach
0.05	0.05	0.0821	0.0300	0.0492	0.0498
	0.1	0.0731	0.0341	0.0489	0.0493
	0.2	0.0677	0.0356	0.0486	0.0498

0.1	0.05	0.0731	0.0341	0.0489	0.0493
	0.1	0.0650	0.0387	0.0486	0.0489
	0.2	0.0603	0.0404	0.0482	0.0494

0.2	0.05	0.0677	0.0356	0.0486	0.0498
	0.1	0.0603	0.0404	0.0482	0.0494
	0.2	0.0559	0.0422	0.0479	0.0499

The asymptotic approach will not be included in the power comparison due to inflated type I error rates. We include the E approach in the power comparison with the M approach and the E + M approach due to the good performance of type I error control in the E approach. The power is a function of four parameters, $p_{01}$, $\theta_{\mathrm{sen}}$, $q_{10}$, and $\theta_{\mathrm{spe}}$:

$$\mathrm{Power}_{\phi}(p_{01}, \theta_{\mathrm{sen}}, q_{10}, \theta_{\mathrm{spe}}) = \sum_{N \in R^{\phi}_{\mathrm{sen}}} \Pr(N; p_{01}, \theta_{\mathrm{sen}}) \times \sum_{M \in R^{\phi}_{\mathrm{spe}}} \Pr(M; q_{10}, \theta_{\mathrm{spe}}),$$

where $\phi$ denotes the E, M, or E + M approach and $R^{\phi}_{\mathrm{sen}}$ and $R^{\phi}_{\mathrm{spe}}$ are the rejection regions for the diseased group and the nondiseased group at a significance level of $\sqrt{\alpha}$ based on the $\phi$ approach. Given the two parameters $q_{10}$ and $\theta_{\mathrm{spe}}$ in the nondiseased group, the power is a function of $\theta_{\mathrm{sen}}$ for a given $p_{01}$. We compared multiple configurations of the parameters. Typical comparison results for balanced data are presented in Figure 1. The power difference between the E approach and the E + M approach is often negligible and both are generally more powerful than the M approach. We also compared the power for unbalanced data with sample size ratios of 1/2, 1/3, 2, and 3. Results similar to those for the balanced data are observed; see Figure 2. We also observe similar results when comparing the power as a function of $\theta_{\mathrm{spe}}$ for given $\theta_{\mathrm{sen}}$, $p_{01}$, and $q_{10}$.
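The product form of the power can be sketched as follows, assuming the rejection regions are already available as lists of tables. The function names and the toy rejection regions are hypothetical placeholders and are not derived from the E, M, or E + M tests; in practice each region would contain the tables whose p value under the chosen approach does not exceed the component level.

```python
import math


def trinomial(x, y, size, px, py):
    """Pr(X = x, Y = y) for a trinomial with cell probabilities px, py, 1 - px - py."""
    rest = size - x - y
    coef = math.factorial(size) // (
        math.factorial(x) * math.factorial(y) * math.factorial(rest))
    return coef * px ** x * py ** y * max(1 - px - py, 0.0) ** rest


def region_prob(region, size, px, py):
    """Probability that the observed table falls in the given region."""
    return sum(trinomial(x, y, size, px, py) for x, y in region)


def power(region_sen, region_spe, n, m, p01, theta_sen, q10, theta_spe):
    """Product of the two rejection probabilities (independent groups).

    Diseased tables are (n10, n01) with p10 = p01 + theta_sen.
    Nondiseased tables are (m01, m10) with q01 = q10 + theta_spe.
    """
    return (region_prob(region_sen, n, p01 + theta_sen, p01)
            * region_prob(region_spe, m, q10 + theta_spe, q10))


if __name__ == "__main__":
    n = m = 20
    # Toy regions for illustration only: reject when the first count is at
    # least as large as the second.
    toy_sen = [(a, b) for a in range(n + 1) for b in range(n + 1 - a) if a >= b]
    toy_spe = [(a, b) for a in range(m + 1) for b in range(m + 1 - a) if a >= b]
    print(power(toy_sen, toy_spe, n, m,
                p01=0.3, theta_sen=0.0, q10=0.2, theta_spe=0.0))
```

Keeping the region construction separate from the power calculation means the same `power` function can be reused for each approach by swapping in its rejection regions.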

Figure 1

Power curves for the E approach, the M approach, and the E + M approach for balanced data with $\theta_{\mathrm{spe}} = 0$, $q_{10} = 0.2$, $p_{01} = 0.3$, $\delta_{\mathrm{sen}} = 0.2$, and $\delta_{\mathrm{spe}} = 0.2$ for the first row and $\theta_{\mathrm{spe}} = 0$, $q_{10} = 0.2$, $p_{01} = 0.4$, $\delta_{\mathrm{sen}} = 0.4$, and $\delta_{\mathrm{spe}} = 0.2$ for the second row.

Figure 2

Power curves for the E approach, the M approach, and the E + M approach for unbalanced data with $\theta_{\mathrm{spe}} = 0$, $q_{10} = 0.3$, $p_{01} = 0.2$, $\delta_{\mathrm{sen}} = 0.1$, and $\delta_{\mathrm{spe}} = 0.1$.

4. An Example

Kao et al. [1] compared diagnostic tests to detect recurrent or residual NPC in the presence of a gold standard test. Simultaneous comparison of sensitivity and specificity is conducted between the CT test ($T_1$) and the Tc-MIBI SPECT test ($T_2$), with $n = 11$ and $m = 25$. The diagnostic results using these two tests are presented in Table 4. The sensitivity and specificity are 73% and 88% for the CT test and 73% and 96% for the Tc-MIBI SPECT test. The clinically meaningful differences in sensitivity and specificity are assumed to be $\delta_{\mathrm{sen}} = 0.01$ and $\delta_{\mathrm{spe}} = 0.01$, respectively. Four testing procedures are used to calculate the p value: (1) the asymptotic approach; (2) the E approach; (3) the M approach; and (4) the E + M approach. The p values based on the asymptotic, E, M, and E + M approaches are 0.0677, 0.0317, 0.0764, and 0.0418, respectively. Both the E approach and the E + M approach reject the null hypothesis at a 5% significance level, while the asymptotic approach and the M approach do not. It should be noted that the two tests have the same sensitivity, which may contribute to the significant result even with a small difference between the two tests.

Table 4

Results of CT and Tc-MIBI SPECT diagnoses of NPC in the presence of a gold standard.

Diagnostic result	Diseased group (NPC: +)		Nondiseased group (NPC: −)
Diagnostic result	CT: +	CT: −	CT: +	CT: −
Tc-MIBI SPECT: +	5	3	1	0
Tc-MIBI SPECT: −	3	0	2	22
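The sensitivities, specificities, and their estimated differences can be recomputed directly from the counts in Table 4. The short sketch below is only an arithmetic check; the dictionary layout and variable names are illustrative choices, with each key giving the (CT result, Tc-MIBI SPECT result) pair and 1 denoting a positive result.

```python
# Counts from Table 4, keyed by (CT result, Tc-MIBI SPECT result); 1 = positive.
diseased = {(1, 1): 5, (0, 1): 3, (1, 0): 3, (0, 0): 0}
nondiseased = {(1, 1): 1, (0, 1): 0, (1, 0): 2, (0, 0): 22}

n = sum(diseased.values())      # diseased subjects
m = sum(nondiseased.values())   # nondiseased subjects

# Sensitivity: proportion of diseased subjects with a positive result.
sens_ct = (diseased[(1, 1)] + diseased[(1, 0)]) / n
sens_spect = (diseased[(1, 1)] + diseased[(0, 1)]) / n

# Specificity: proportion of nondiseased subjects with a negative result.
spec_ct = (nondiseased[(0, 0)] + nondiseased[(0, 1)]) / m
spec_spect = (nondiseased[(0, 0)] + nondiseased[(1, 0)]) / m

# Estimated differences, CT (T1) minus Tc-MIBI SPECT (T2).
theta_sen = sens_ct - sens_spect
theta_spe = spec_ct - spec_spect

print(n, m)
print(f"sensitivity: CT {sens_ct:.2f}, SPECT {sens_spect:.2f}")
print(f"specificity: CT {spec_ct:.2f}, SPECT {spec_spect:.2f}")
print(f"differences: sen {theta_sen:.2f}, spe {theta_spe:.2f}")
```

The computed values reproduce the 73%, 88%, and 96% quoted in the text, with n = 11 and m = 25; the two tests differ only in specificity, where CT is lower by 0.08.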

5. Discussion

In this paper, the asymptotic approach, the E approach, the M approach, and the E + M approach are considered for testing sensitivity and specificity simultaneously in the presence of a gold standard. Although the E approach does not guarantee type I error rate, it has good performance regarding type I error rate control, and the difference between the E approach and the E + M approach is negligible. Since the computational time is not an issue for this problem and the E + M approach is an exact method, the E + M approach is recommended for use in practice due to the power gain as compared to the M approach. Tang [9] has studied the E approach and the M approach for comparing sensitivity and specificity when combining two diagnostic tests. The E approach has been shown to be a reliable testing procedure. We would consider comparing the E + M approach with the E approach in this context as future work. The intersection-union method may also be considered for testing sensitivity and specificity [8].

16 in total

1. Tests for equivalence or non-inferiority for paired binary data.

Authors: Jen-pei Liu; Huey-miin Hsueh; Eric Hsieh; James J Chen
Journal: Stat Med Date: 2002-01-30 Impact factor: 2.373

2. Exact unconditional tests for a 2 x 2 matched-pairs design.

Authors: R L Berger; K Sidik
Journal: Stat Methods Med Res Date: 2003-03 Impact factor: 3.021

3. Statistical analysis of noninferiority trials with a rate ratio in small-sample matched-pair designs.

Authors: Ivan S F Chan; Nian-Sheng Tang; Man-Lai Tang; Ping-Shing Chan
Journal: Biometrics Date: 2003-12 Impact factor: 2.571

4. Sample size determination for matched-pair equivalence trials using rate ratio.

Authors: Nian-Sheng Tang; Man-Lai Tang; Shun-Fang Wang
Journal: Biostatistics Date: 2006-10-27 Impact factor: 5.899

5. A new exact and more powerful unconditional test of no treatment effect from binary matched pairs.

Authors: Chris J Lloyd
Journal: Biometrics Date: 2007-11-19 Impact factor: 2.571

6. Establishing equivalence of two treatments and sample size requirements in matched-pairs design.

Authors: J M Nam
Journal: Biometrics Date: 1997-12 Impact factor: 2.571

7. Equivalence test and confidence interval for the difference in proportions for the paired-sample design.

Authors: T Tango
Journal: Stat Med Date: 1998-04-30 Impact factor: 2.373

8. On the sample size for one-sided equivalence of sensitivities based upon McNemar's test.

Authors: Y Lu; J A Bean
Journal: Stat Med Date: 1995-08-30 Impact factor: 2.373

9. Detection of recurrent or persistent nasopharyngeal carcinomas after radiotherapy with technetium-99m methoxyisobutylisonitrile single photon emission computed tomography and computed tomography: comparison with 18-fluoro-2-deoxyglucose positron emission tomography.

Authors: Chia-Hung Kao; Yu-Chien Shiau; Yeh-You Shen; Ruoh-Fang Yen
Journal: Cancer Date: 2002-04-01 Impact factor: 6.860

10. Some tests for detecting trends based on the modified Baumgartner-Weiβ-Schindler statistics.

Authors: Guogen Shan; Changxing Ma; Alan D Hutson; Gregory E Wilding
Journal: Comput Stat Data Anal Date: 2013-01 Impact factor: 1.681
