Optimal decision theory for diagnostic testing: Minimizing indeterminate classes with applications to saliva-based SARS-CoV-2 antibody assays.

Paul N Patrone, Prajakta Bedekar, Nora Pisanic, Yukari C Manabe, David L Thomas, Christopher D Heaney, Anthony J Kearsley.

Abstract

In diagnostic testing, establishing an indeterminate class is an effective way to identify samples that cannot be accurately classified. However, such approaches also make testing less efficient and must be balanced against overall assay performance. We address this problem by reformulating data classification in terms of a constrained optimization problem that (i) minimizes the probability of labeling samples as indeterminate while (ii) ensuring that the remaining ones are classified with an average target accuracy X. We show that the solution to this problem is expressed in terms of a bathtub-type principle that holds out those samples with the lowest local accuracy up to an X-dependent threshold. To illustrate the usefulness of this analysis, we apply it to a multiplex, saliva-based SARS-CoV-2 antibody assay and demonstrate up to a 30% reduction in the number of indeterminate samples relative to more traditional approaches. Published by Elsevier Inc.

Keywords:  Antibody; Bathtub-type principle; Classification; Inconclusive; SARS-CoV-2; Saliva


Year:  2022        PMID: 35714754      PMCID: PMC9195412          DOI: 10.1016/j.mbs.2022.108858

Source DB:  PubMed          Journal:  Math Biosci        ISSN: 0025-5564            Impact factor:   3.935


Introduction

The SARS-CoV-2 pandemic has highlighted the importance of antibody testing as a means to monitor the spread of diseases such as COVID-19 [1], [2]. But the widespread deployment of new assays has also revealed fundamental problems in the ability to reliably analyze the corresponding measurements. Early on, this shortcoming was attributed to low prevalence, which made it difficult to distinguish true and false positives [3]. However, it soon became clear that there were deeper issues related to the statistical interpretation of raw data, suggesting the need to revisit the underlying theory of diagnostic classification [4], [5], [6]. In this context, a fundamental problem arises when many measurements fall near a cutoff used to distinguish positive and negative samples. The probability of correctly classifying these borderline cases hovers near 50%, so that even a small fraction thereof can significantly decrease overall accuracy. A common solution is to define a third, indeterminate class for which one cannot draw meaningful conclusions, although this class is not always chosen to be near a cutoff [7], [8], [9], [10], [11], [12], [13]. While this approach increases the average accuracy for those samples that are classified, it also decreases testing efficiency. Thus, there is a need to develop strategies that balance the construction of indeterminate classes against overall assay performance. The present manuscript addresses this problem by answering the question: what classification scheme (I) minimizes the fraction of indeterminate samples while (II) correctly identifying the remaining ones with a minimum average accuracy X? When an indeterminate class is not permitted, common practice categorizes a sample as positive or negative if its measurement value falls in a corresponding classification domain; see Fig. 1.
Moreover, it was recently shown that these domains can be optimized by solving an unconstrained optimization problem that maximizes the accuracy associated with "binary" classification [5]. In contrast, the present work views (I) and (II) as a constrained optimization problem, with the size of the indeterminate class being the objective and the desired accuracy recast as a constraint. We show that the solution to this problem extends the binary classification result by constructing the smallest indeterminate class via a bathtub-type principle applied to the binary positive and negative domains: one removes from them the measurements with the lowest probability of being correctly classified, up to an X-dependent threshold. As a practical matter, this "waterline" bounding the indeterminate domain can be efficiently and accurately estimated via numerical techniques such as bisection, making our result computationally tractable. We provide examples and numerical validation using a saliva-based, multiplex SARS-CoV-2 antibody test; the Appendix provides mathematical proofs of our main results.
Fig. 1

Training data associated with the Saliva assay described in Refs. [11], [12]. Red x denote known positives (confirmed via polymerase chain-reaction measurements), and blue o denote pre-pandemic samples, which are assumed to be negative for SARS-CoV-2 antibodies. The bold, horizontal and vertical black lines are cutoffs used to classify samples. Data falling above the horizontal line (red shaded domain) are classified positive; data in the lower right box (shaded blue) are negative, and data in the lower left box (shaded yellow) are indeterminates. The SARS-CoV-2 IgG measurements (vertical axis) are a sum of seven antibody levels measured by the assay, whereas the total IgG measurement (horizontal axis) is the total immunoglobulin-G (IgG) measurement as determined by an enzyme-linked immunosorbent assay (ELISA).

At the outset and in contrast with traditional methods, it is important to note that concepts such as specificity and sensitivity per se are not fundamental quantities of interest in our analysis. As discussed in Section 6, they describe the accuracy of a fixed and subjective classification scheme in two degenerate cases: 0% and 100% prevalence. As such, it is trivial (but useless) to optimize either quantity by assigning all samples to a single class. Rather, we treat accuracy – defined as a prevalence-weighted, convex combination of sensitivity and specificity – as fundamental, since this naturally interpolates between the aforementioned degenerate cases. This choice also highlights an important (but often-ignored) fact: the numbers of false positives and false negatives change with prevalence. Thus, sensitivity and specificity may not be useful metrics of assay performance in a setting where a disease is actively spreading. The bathtub-type principle also reveals that these quantities are not mathematically fundamental, since they arise from more granular notions of conditional accuracy. Ultimately, this analysis leads to the realization that classification accuracy has both a local and a global interpretation, and the interplay between these interpretations is fundamental both to the problem considered herein and to the general theory of classification. In particular, one can construct conditional probability density functions (PDFs) of a measurement outcome – i.e. a local property – for (known) positive and negative samples. As shown in Ref. [5], these PDFs are necessary to maximize the global accuracy, since setting the prevalence-weighted positive and negative PDFs equal to one another – Eq. (1) – defines the boundary between the binary positive and negative classification domains. In the present work, we show that these PDFs also directly define a local accuracy, and that its global counterpart is the average value of this quantity. We next observe that the boundary given by Eq. (1) is the set for which the local accuracy equals 50%, its lowest possible value.
The corresponding points are the first to be held out, since they contribute most to the average error. Moreover, one sees that systematically removing the least accurate points yields the fastest increase in the global accuracy for the remaining ones. Our bathtub-type principle formalizes this idea. This intuition also distinguishes our approach from Ref. [5], which considered uncertainty in classification due to effects that are external to the assay. In that work, the authors demonstrated that the optimal binary domains become ambiguous when the prevalence is only given to within confidence intervals. They showed that this problem can be addressed by holding out samples whose classes were likewise ambiguous. In contrast, our approach defines the indeterminate class as those measurements with the highest inherent uncertainty, as quantified in terms of local accuracy. In practice, we find that such effects are often more important, especially as several references have addressed issues pertaining to accurate, unbiased prevalence estimation [5], [6], [14]. From a practical standpoint, the main inputs to our analysis are training data associated with positive and negative samples; thus our approach is compatible with virtually any antibody assay. These data are used to construct the conditional PDFs for positive and negative samples, so that the classification and holdout problems are reduced to mathematical modeling. This is also the key limitation of our approach insofar as such models are necessarily subjective. However, this problem is not unique to our method. Where possible, we incorporate objective information about the measurement process. See Section 4 and Ref. [5] for a deeper discussion of such issues and other limitations. The remainder of this manuscript is organized as follows. Section 2 reviews key notation and terminology. Section 3 presents the general theory for defining optimal indeterminate domains.
Section 4 illustrates this analysis in the context of a saliva-based, multiplex SARS-CoV-2 antibody assay. Section 5 considers numerical validation of our analysis, and Section 6 concludes with a discussion and comparison with past works. The Appendix provides a proof of our main result and other supporting information.
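As a concrete illustration of the local/global distinction, the local accuracy introduced above can be sketched numerically. The following is a minimal sketch under the assumption, consistent with but not quoted from the text, that the local accuracy at a measurement value is the larger of the prevalence-weighted positive and negative PDF values divided by their sum; the function name is ours.

```python
import numpy as np

def local_accuracy(P_r, N_r, p):
    """Sketch of a local accuracy: the probability that the optimal
    binary rule correctly classifies a sample whose conditional densities
    are P_r (positive) and N_r (negative), at prevalence p.
    Formula assumed from context, not reproduced from the paper."""
    pos = p * P_r          # probability of being positive AND yielding this value
    neg = (1.0 - p) * N_r  # probability of being negative AND yielding this value
    return np.maximum(pos, neg) / (pos + neg)
```

On the binary decision boundary the two weighted densities balance, so the local accuracy attains its minimum of 1/2; averaging the local accuracy over measurement space recovers the global accuracy.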

Notation and terminology

One of our primary goals is to provide practical, accessible tools for constructing indeterminate classes. In doing so, we must combine abstract ideas from measure theory with concepts in applied diagnostics; it is not reasonable to assume that any reader will have a background in both. In this section, we provide necessary information so that: (I) an expert in diagnostics can understand and implement our main results (i.e. construct holdout domains) without needing to derive the associated proofs; and (II) a mathematician can verify our work and understand the applied context. Readers with background in either the mathematical notation (Section 2.1) or diagnostic terminology (Section 2.2) may skip the corresponding sections. [We also advise mathematicians to see our definition of a bathtub-type principle in Section 2.1, as it differs from Theorem 1.14 in [15].] We also refer readers to Refs. [16], [17] for deeper background on measure theory.

Mathematical notation and concepts

By a set, we mean a collection of objects, e.g. measurements or measurement values. By a domain, we typically mean a set in some continuous measurement space; see, e.g., Fig. 1. The symbol ∈ indicates set inclusion; that is, x ∈ A means that x is in set A. The symbol ∅ denotes the empty set, which has no elements. The operator ∪ denotes the union of two sets; that is, A ∪ B is the set containing all elements that appear in either A or B. The operator ∩ denotes the intersection of two sets; that is, A ∩ B is the set of elements shared by both A and B. The operator ∖ denotes the set difference. We write A ∖ B to mean the set of all objects in A that are not also in B. Note that in general, A ∖ B ≠ B ∖ A. Equivalently, A ∖ B can be interpreted as the "subtraction" or removal from A of the elements it shares in common with B. The notation {x : C(x)} defines the set as the collection of x satisfying condition C(x). By a bathtub-type principle, we mean the solution to a constrained optimization problem that determines an optimal set via a nonlinear inequality of the form f(x) ≤ C for some constant C and function f. Unless otherwise specified, the "size" or measure of a set refers to the probability of a sample falling within that set, i.e. its probability mass. By the same token, we generally avoid using size to describe the actual dimensions (in measurement space) of a domain.
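Since these set operations drive everything that follows, a small Python sketch (our own, using built-in sets as stand-ins for measurement domains) may help readers more comfortable with code than notation:

```python
# Set operations used throughout the text, illustrated with Python's
# built-in set type (a stand-in for domains in measurement space).
A = {1, 2, 3}
B = {3, 4}

assert 2 in A                     # set inclusion: 2 is an element of A
assert A | B == {1, 2, 3, 4}      # union
assert A & B == {3}               # intersection
assert A - B == {1, 2}            # set difference A \ B
assert A - B != B - A             # the difference is not symmetric in general
S = {x for x in A if x % 2 == 1}  # set-builder notation {x : C(x)}
assert S == {1, 3}
```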

Notation and concepts from applied diagnostics

Training data correspond to samples whose true classes are known. In general, training data are used to construct probability models and/or validate an analysis. Test data correspond to samples whose true classes are unknown, or are treated as such for purposes of validation. Test data are the objects to which a classification analysis is applied. Prevalence is the fraction of samples in a population that are positive. As such, it is the probability that a person picked at random is positive, given no other information. Sensitivity (specificity) is the fraction of positive (negative) samples that are classified correctly. We take the common convention that these quantities refer to expectation values or averages.
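The prevalence-weighted notion of accuracy used throughout can be written as a one-line convex combination. This sketch follows the definition given in the text; the function name is ours:

```python
def average_accuracy(sensitivity, specificity, prevalence):
    """Accuracy as the prevalence-weighted, convex combination of
    sensitivity and specificity described in the text."""
    return prevalence * sensitivity + (1.0 - prevalence) * specificity

# The two degenerate cases mentioned in the Introduction:
assert average_accuracy(0.9, 0.99, 0.0) == 0.99  # 0% prevalence: specificity
assert average_accuracy(0.9, 0.99, 1.0) == 0.90  # 100% prevalence: sensitivity
```

Intermediate prevalences interpolate between the two degenerate cases, which is why the text treats this quantity, rather than sensitivity or specificity alone, as fundamental.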

Notation specific to the present work

The non-calligraphic sets denote the positive and negative classification domains in the binary (no-holdout) problem. Calligraphic sets denote the corresponding domains in the classification problem with indeterminate samples, together with the indeterminate set itself. A shorthand is used throughout for the union of the positive and negative domains, i.e. the set of all samples that are classified as positive or negative. The use of a star superscript denotes an optimal quantity; for example, it marks an optimal positive classification domain.

Minimum probability indeterminate class

We begin with the mathematical setting underlying classification. Consider an antibody measurement value, which can be a vector associated with multiple distinct antibody targets. We take the set of all admissible measurements to be a fixed measurement space. Our goal is to define three domains associated with positive, negative, and indeterminate (or "holdout") samples. In particular, we say that a test sample is positive if it falls inside the positive domain, and likewise for the other domains. We require that these domains have several basic properties to ensure that they define a valid classification scheme. Recalling that positive and negative samples are described by conditional PDFs, define the measures of a set with respect to each of these PDFs; that is, the positive measure of a set is the probability of a positive sample falling in it, and similarly for the negative measure. We then require the domains to satisfy Eqs. (3) and (4). Eq. (3) states that the probability of any event falling in the positive, negative, or indeterminate domains is one; i.e. any sample can be classified. Eq. (4) states that the probability of a sample falling in more than one domain is zero; i.e. a sample has a single classification. Given the conditional PDFs, the law of total probability [18] yields the PDF that a test sample takes a given measurement value, with the prevalence acting as the mixing weight. The prevalence-weighted positive PDF is the probability that a sample is both positive and yields a given value, with a corresponding interpretation for the negative term. This motivates us to define the total error rate, Eq. (6), whose terms on the right-hand side (RHS) are the rates of false positives and false negatives (normalized by the number of tests). Eq. (6) treats any misclassification as equally undesirable, but importantly, indeterminates are not considered errors in Eq. (6). Thus, the error rate so defined is not the error rate of the assay restricted to samples that fall only within the positive and negative domains. The latter is defined by Eq. (7), in which the normalization is over the set of all samples not in the indeterminate region. Eq. (7) is a conditional expectation; i.e. it is the average error conditioned on the set of samples that can be classified. In Ref. [5] we showed that when the boundary set between the classes has measure zero and the indeterminate set is empty, this error is minimized by the binary classification scheme of Eqs. (8a), (8b) for a given prevalence. The interpretation of Eqs. (8a), (8b) is straightforward: classify a sample as positive (negative) if the probability of being both positive (negative) and having a given measurement value is greater than the corresponding probability of being negative (positive) and having that value. [See Chapter 3 of [19] for related ideas.] While these binary domains are not the optimal sets for the problem at hand, they play a fundamental role in the analysis that follows. We also note an important corollary: when the boundary set has non-zero measure, Eqs. (8a), (8b) are generalized so that this set may be split into an arbitrary partition between the two classes. The physical interpretation of this generalization is that any point having equal probability of being negative or positive can be assigned to either class without changing the error. In practice, however, classification often reverts to Eqs. (8a), (8b), as the boundary set has zero measure for many practical PDFs.

In the present work, we assume that there is a desired average accuracy X and that this accuracy is not attained when all samples are classified. Our goal is to define a minimum probability indeterminate class and classification domains for which the restricted accuracy equals X; that is, we wish to hold out the fewest samples so that those remaining are classified with the desired accuracy. Mathematically, we seek to minimize the probability mass of the indeterminate set, Eq. (10), subject to the accuracy constraint, Eq. (11). In light of Eq. (7), this constraint fixes the conditional expectation [20] of the assay accuracy restricted to the classified samples; i.e. the accuracy of the assay excluding the holdout domain must be X. To solve this problem, it is useful to introduce several auxiliary concepts. In particular, define the local accuracy of the unconstrained (i.e. no-indeterminate) binary classification, Eq. (12), where the positive and negative domains cover the whole measurement space up to sets of measure zero; moreover, consider in particular the local accuracy of the optimal solution to the binary problem. Then the solution to the constrained problem given by Eqs. (10), (11) is given by Eqs. (13a)–(13c), where the threshold appearing there solves Eq. (14). Proof of this result, as well as its strict interpretation, requires significant analysis of Eq. (11) and is reserved for the Appendix. Here we provide an intuitive interpretation and describe a straightforward algorithm for computing Eqs. (13a)–(13c). Eq. (13a) states that the points to label indeterminate are those with the lowest local accuracy, up to some threshold value that depends on X. Eqs. (13b), (13c) then amount to the observations that the positive and negative domains are the same as in the unconstrained binary problem, except that we remove the corresponding points with low enough local accuracy. Eq. (14) requires that the average local accuracy over the classified sets be X. This fixes the boundary of the indeterminate set: the upper bound on the indeterminate local accuracy is the lower bound on the accuracy for sets that can be classified. An auxiliary set appearing in Eq. (14) is a bookkeeping artifact accounting for the situation in which the set of points whose local accuracy exactly equals the threshold has non-zero probability mass. In this case, not all of these points need to be held out if doing so would make the restricted accuracy greater than X. The choice of which points to make indeterminate is then subjective, as they all have the same local accuracy. In practice (e.g. for smooth PDFs), this level set has measure zero, so that we can ignore it in Eq. (13a).

From Eqs. (13a)–(14) it is clear that determining the threshold is the key step in defining the optimal classification domains. Fortunately, the interpretation of Eq. (14) leads to a straightforward bisection method. First note that the local accuracy, and hence the threshold, lies between 1/2 and 1. Starting from an initial bracket and candidate threshold, iterate as follows. For each candidate, compute the corresponding positive, negative, and indeterminate domains, as well as the left-hand side (LHS) of Eq. (14), i.e. the average local accuracy of the classified samples. If the LHS is less than X, move the lower end of the bracket up to the candidate; if it exceeds X, move the upper end down. Stop the algorithm if the bracket width falls below some user-defined tolerance, or if the iteration count reaches some maximum. In the former case, the classified samples will have an average accuracy within the tolerance of X. In the latter case, the error in the estimated threshold is set by the final bracket width; for context, 20 iterations of this algorithm shrink the bracket by a factor of roughly one million. In the second case, the existence of a non-trivial level set can be deduced from the observation that the LHS does not converge, but rather cycles between two well-separated values, depending on whether the candidate threshold lies above or below the level set. In this case, the partition of the level set can be defined arbitrarily but consistent with Eq. (14) once the threshold is identified to sufficient accuracy. (In practice and given the speed of convergence, we find that there is little value in considering starting points other than the midpoint of the initial bracket.)
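The bisection just described can be sketched on a discretized problem. This is our own illustrative implementation, not the authors' code: it assumes local accuracies Z and probability masses w have already been tabulated on a grid, and it searches for the waterline below which points are held out.

```python
import numpy as np

def find_waterline(Z, w, X, tol=1e-6, max_iter=100):
    """Bisection sketch for the waterline of Eq. (14): hold out points
    with local accuracy <= the threshold so that the weighted average
    accuracy of the remaining points is the target X.
    Z: local accuracies on a grid; w: probability masses of the cells
    (hypothetical discretization, not the paper's)."""
    lo, hi = 0.5, 1.0              # local accuracy always lies in [1/2, 1]
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        keep = Z > mid             # candidate classified set
        if not keep.any():
            hi = mid               # waterline too high: nothing classified
            continue
        avg = np.sum(w[keep] * Z[keep]) / np.sum(w[keep])  # LHS of Eq. (14)
        if avg < X:
            lo = mid               # not accurate enough: raise the waterline
        else:
            hi = mid               # accurate enough: try holding out less
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

For example, with three equally likely accuracy levels 0.6, 0.9, and 0.99 and a 95% target, the waterline converges to the middle level, so that only the most accurate points remain classified.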

Example applied to a salivary SARS-CoV-2 IgG Assay

To illustrate the analysis of Section 3, we consider a saliva-based assay described in Refs. [11], [12]. We refer the reader to those manuscripts for details of assay design, sample preparation, and measurement processes. For each sample, two measurement values are output: a total immunoglobulin G (IgG) enzyme-linked immunosorbent assay (ELISA) value; and a sum of seven SARS-CoV-2 IgG measurements associated with distinct antigen targets. As a preliminary remark, we observe that the numerical range of the data spans several decades of median fluorescence intensity (MFI), which is difficult to model directly. We also note that the measurements are bounded from below by zero and have a finite upper bound. This motivates us to transform each numerical value via a base-2 logarithm, which puts the data on the scale of bits. Empirically we also find that this transformation better separates the positive and negative populations. Total IgG values are then rescaled to the unit interval by dividing each measurement by the maximum. SARS-CoV-2 measurements are similarly rescaled, although we divide the log-transformed data by the transformed instrument upper bound, since there were no samples with saturated values. After transformation, each sample is represented by a two-dimensional vector whose first component is the normalized total IgG value and whose second component is the normalized SARS-CoV-2 counterpart. The results of this transformation are shown in Fig. 1, along with the classification domains currently used with this assay. The goal of the analysis is to maintain accuracy while decreasing the number of indeterminate samples by finding the indeterminate domain with the smallest probability mass.
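The preprocessing can be sketched as follows; the exact offsets and reference constants used by the authors are not reproduced in the text, so the values below are illustrative only.

```python
import math

def to_bits(mfi, reference_bits=16.0):
    """Illustrative transform (constants hypothetical): a base-2 logarithm
    maps multi-decade MFI data onto the scale of bits, and dividing by a
    fixed reference rescales the result toward [0, 1]."""
    return math.log2(1.0 + mfi) / reference_bits
```

Because the transform is monotone, it preserves the ordering of samples while compressing the dynamic range, which is what makes the populations easier to model.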
We remind the reader that size does not refer to the (generalized) volume in measurement space. Rather, it refers to the fraction of samples expected to fall within the domain, since this is what controls the number of indeterminate samples. Thus, it is possible that the indeterminate domain can be quite large when expressed in terms of antibody levels and yet contain very few samples. To motivate our probability models, we consider the phenomena that could affect measurements. In particular, we anticipate that for positive samples, there should be a degree of correlation between total IgG and SARS-CoV-2 specific antibodies. However, at extreme total IgG values, the SARS-CoV-2 levels may become independent, as (i) all measurements will revert to noise when the total IgG signal is sufficiently low, or (ii) SARS-CoV-2 antibody levels will decouple from total antibody levels when the latter is excessively high, e.g. if an individual has been exposed to a large number of different pathogens. We also recognize that the ELISA instrument only reports numerical values on a bounded domain. Thus, fluorescence levels above the upper bound are rounded down to it, and levels below the lower bound are rounded up to it. As shown in Fig. 1, this has the effect of accumulating data (and thus probability mass) on the boundary lines, which is a manifestation of data censoring [21], [22]. While details are reserved for the Appendix, this observation leads us to model positive and negative samples via a PDF of the form of Eq. (15), which augments a continuous density – assumed bounded and continuous on the whole domain – with Dirac delta functions at the left and right instrument bounds. The functions multiplying the delta functions characterize the probability of SARS-CoV-2 antibody levels for measurement values saturated at the left and right bounds. We emphasize that the use of delta functions in Eq. (15) is formal and should be treated with care. A more rigorous interpretation of what is meant by Eq. (15) is discussed in the Appendix.
To model the continuous part of the PDF, we treat the total IgG measurements as independent normal random variables with an unknown mean and variance. Within the interior of the domain (note the strict inequalities excluding the censored boundaries), we assume that the SARS-CoV-2 measurements are well described by a Gamma distribution with a fixed (but unknown) scale factor and a shape parameter with a sigmoidal dependence on the total IgG value. This dependence is motivated by the correlation described previously. Taken together, this yields the PDF of Eq. (16), whose free parameters are to be determined. The boundary functions, Eq. (17), describe the probability that a total IgG value below (above) the lower (upper) instrument bound will be mapped back to that bound. The free parameters are determined via maximum likelihood estimation using a censoring-based technique; see the Appendix and Refs. [21], [22]. As an approximation, we truncate the domain and renormalize the resulting PDF on it. For the negative PDF, we anticipate that non-specific binding of the total IgG antibodies to the SARS-CoV-2 antigens will lead to a degree of correlation, albeit to a lesser extent than for positives. Thus, we use the same functional form, but refit the parameters using the negative training data. Fig. 2 shows the outcome of this exercise for the two training sets. Because the boundary functions are continuous with respect to the Gamma portion of the positive and negative PDFs, the former can be inferred from the contour lines in the figure (up to a normalization factor) and are thus not shown.
Fig. 2

Probability models associated with the training data. See main text for a description of the probability density functions and the considerations behind their construction. Top: Raw data and probability model for positive training samples. Bottom: Negative training data and probability model.
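The continuous part of the model just described can be sketched as follows. The functional forms (normal total IgG, Gamma SARS-CoV-2 signal with a sigmoidal shape parameter) follow the text, but every parameter value below is a placeholder, not a fitted value from the paper.

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal density for the total IgG coordinate."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def gamma_pdf(y, k, theta):
    """Gamma density with shape k and scale theta for the SARS-CoV-2 coordinate."""
    return y ** (k - 1.0) * math.exp(-y / theta) / (math.gamma(k) * theta ** k)

def continuous_pdf(x, y, mu=0.6, sigma=0.1, theta=0.05, a=1.0, b=8.0, c=10.0, d=0.5):
    """Continuous part of the (positive) model: the Gamma shape parameter
    rises sigmoidally with the total IgG value x, encoding the correlation
    discussed in the text.  All parameter values are illustrative."""
    k = a + b / (1.0 + math.exp(-c * (x - d)))  # sigmoidal shape parameter
    return normal_pdf(x, mu, sigma) * gamma_pdf(y, k, theta)
```

Refitting the same functional form on the negative training data, as the text describes, would give the corresponding negative model.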

Fig. 3 and Fig. 4 show the local accuracy and the waterlines necessary to achieve different average accuracies. The bathtub-type principle is shown in the latter; see also Ref. [15] for related ideas. To meet the target accuracy, we only hold out samples up to the corresponding waterline. Note that indeterminates are concentrated in regions where there is significant overlap between positive and negative samples. Fig. 5 shows the corresponding classification domains computed according to the bathtub-type principle for a target accuracy of 99.6%; see also Table 1. Relative to the original classification domains, the analysis reduces the empirical rate of indeterminate samples by more than 12% while increasing both the accuracy and sensitivity of the assay (with empirical specificity remaining constant). See also Fig. 6 and Section 6 for additional examples of holdout domains.
Fig. 3

Local accuracy of the assay according to the probability models shown in Fig. 2. Note that the local accuracy approaches 100% in regions where the positive and negative PDFs do not overlap. Conversely, in regions where the PDFs overlap, it is more challenging to correctly identify samples. Thus the local accuracy decreases towards its minimal value of 1/2 in such regions. Note that it is never less than 1/2 (50/50 odds of correct classification).

Fig. 4

Illustration of the bathtub-type principle used to compute the minimal probability indeterminate domain. The contour lines are different “waterlines” up to which we can hold out samples. The label on each contour is the local accuracy of the assay. In order to define the indeterminate region, we use the target global accuracy to define a maximum local accuracy up to which we hold out samples. Increasing the global accuracy of the restricted classification increases the waterline, thereby holding out more samples.

Fig. 5

Positive (yellow-green), negative (dark blue), and indeterminate (light-blue) classification domains defined for a theoretical target accuracy of 99.6% for the training data in the previous figures. Symbols have the same meaning as in previous figures. The empirical accuracy is 98.8%, with a specificity of 100% and sensitivity of 96.7%. The total accuracy is the prevalence-weighted combination of these latter quantities. Note the prevalence is associated with the restricted set of samples that are actually classified; see Section 6. Discrepancy between the theoretical and empirical accuracies is due to idealization of the modeling and stochasticity in the data. For comparison, the horizontal and vertical black lines are the same as in Fig. 1 and denote the corresponding classification domains originally used for this assay. The indeterminate region based on the bathtub-type principle reduces the number of unclassified samples by more than 12% relative to the original domains while maintaining specificity and improving sensitivity for the training data. See also Table 1 and Section 6 for other examples of indeterminate domains.

Table 1

Summary of fraction of holdouts, sensitivity, and specificity for the data in Fig. 5, Fig. 6. The rectilinear classification method is described in Fig. 1, while the optimal method is given by Eqs. (13a)–(14). For sensitivity, specificity, and accuracy calculations, the numbers in brackets are empirical 95% confidence intervals.

Data & Method                 | COVID samples                 | Pre-COVID samples            | All samples
Training samples              | 147                           | 283                          | 430
 Holdouts (Rectilinear)       | 32/147, 21.8%                 | 64/283, 22.6%                | 96/430, 22.3%
 Holdouts (Optimal)           | 28/147, 19.1%                 | 56/283, 19.8%                | 84/430, 19.5%
                              | Sensitivity                   | Specificity                  | Accuracy
 Classification (Rectilinear) | 111/115, 96.5% [92.0%, 98.9%] | 219/219, 100% [98.6%, 100%]  | 330/334, 98.8% [97.2%, 99.6%]
 Classification (Optimal)     | 115/119, 96.6% [92.3%, 99.0%] | 227/227, 100% [98.7%, 100%]  | 342/346, 98.8% [97.3%, 99.6%]

Validation samples            | 87                            | 192                          | 279
 Holdouts (Rectilinear)       | 6/87, 6.9%                    | 66/192, 34.4%                | 72/279, 25.8%
 Holdouts (Optimal)           | 5/87, 5.8%                    | 34/192, 17.7%                | 39/279, 14.0%
                              | Sensitivity                   | Specificity                  | Accuracy
 Classification (Rectilinear) | 81/81, 100% [96.3%, 100%]     | 125/126, 99.2% [96.3%, 100%] | 206/207, 99.5% [97.7%, 100%]
 Classification (Optimal)     | 81/82, 98.8% [94.4%, 100.0%]  | 157/158, 99.4% [97.0%, 100%] | 238/240, 99.2% [97.3%, 99.9%]
Fig. 6

Positive (yellow-green), negative (dark blue), and indeterminate (light-blue) classification domains for validation data and defined for a theoretical target accuracy of 99.6%. The validation data was not used for training the probability models. Symbols have the same meaning as in previous figures. The empirical accuracy is 99.2%, with a specificity of 99.4% and sensitivity of 98.8%. The indeterminate region based on the bathtub-type principle reduces the number of unclassified samples by almost 40%. See Section 6 for other examples of indeterminate domains.

Local accuracy of the assay according to the probability models shown in Fig. 2. Note that approaches 100% in regions where and do not overlap. Conversely, in regions where the PDFs overlap, it is more challenging to correctly identify samples. Thus decreases towards its minimal value of 1/2 in such regions. Note that is never less than 1/2 (50/50 odds of correct classification). Illustration of the bathtub-type principle used to compute the minimal probability indeterminate domain. The contour lines are different “waterlines” up to which we can hold out samples. The label on each contour is the local accuracy of the assay. In order to define the indeterminate region, we use the target global accuracy to define a maximum local accuracy up to which we hold out samples. Increasing the global accuracy of the restricted classification increases the waterline, thereby holding out more samples. Positive (yellow-green), negative (dark blue), and indeterminate (light-blue) classification domains defined for a theoretical target accuracy of 99.6% for the training data in the previous figures. Symbols have the same meaning as in previous figures. The empirical accuracy is 98.8%, with a specificity of 100% and sensitivity of 96.7%. The total accuracy is the prevalence-weighted combination of these latter quantities. Note the prevalence is associated with the restricted set of samples that are actually classified; see Section 6. Discrepancy between the theoretical and empirical accuracies is due to idealization of the modeling and stochasticity in the data. For comparison, the horizontal and vertical black lines are the same as in Fig. 1 and denote the corresponding classification domains originally used for this assay. The indeterminate region based on the bathtub-type principle reduces the number of unclassified samples by more than 12% relative to the original domains while maintaining specificity and improving sensitivity for the training data. 
See also Table 1 and Section 6 for other examples of indeterminate domains.

Summary of fraction of holdouts, sensitivity, and specificity for the data in Fig. 5, Fig. 6. The rectilinear classification method is described in Fig. 1, while the optimal method is given by Eqs. (13a)–(14). For sensitivity, specificity, and accuracy calculations, the numbers in brackets are empirical 95% confidence intervals.

Numerical validation

To validate that the sets , , and obtained in Section 3 are optimal, we consider a numerical experiment wherein we perturb as a function of these domains. For point and , we formally define a “point-swap derivative” via Eq. (19). In principle, can be an arbitrary definition of local accuracy, although in practice we take in this section. The interpretation of Eq. (19) is as follows. In taking point from and adding it to , and vice-versa for , we must ensure that the constraint Eq. (11) remains satisfied. The ratio provides the “rate-of-exchange” of probability. For example, if , then adding to will infinitesimally decrease the global accuracy, so that we must hold out a larger, yet still infinitesimal, fraction of in the vicinity of . It is clear that Eq. (19) goes through a singularity when and becomes negative for and . The interpretation of this is straightforward: we should always reverse any swap for which a point with local accuracy greater than the average is put in the indeterminate class. Such points are not considered in the analysis below. More rigorous interpretations of Eq. (19) are considered in the Appendix, especially in the context of the singular PDF given by Eq. (15).

The benefit of Eq. (19) is that it allows us to estimate a “set-partial derivative” by computing the relative probability exchange for any point in the indeterminate domain. In particular, we compute Eq. (20) for the optimal domains and . Fig. 7 shows the logarithm of Eq. (20) for a mesh of points in the indeterminate region, taking . Note that swapping any point in the indeterminate region with one in the positive or negative classification domains increases the size of the indeterminate domain, as expected.
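The bathtub construction and the swap test above can be sketched numerically. The following is a minimal illustration, not the paper's implementation: the 1-D Gaussian conditional models, prevalence q = 0.3, target accuracy X = 0.99, and the uniform grid are all assumptions made here for concreteness (the saliva assay is multidimensional).

```python
import numpy as np

# Hypothetical 1-D conditional models: negatives ~ N(0,1), positives ~ N(3,1).
pdf = lambda x, m: np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2.0 * np.pi)
q, X = 0.3, 0.99                                   # assumed prevalence and target accuracy
r = np.linspace(-4.0, 7.0, 20001)                  # measurement grid
P, N = pdf(r, 3.0), pdf(r, 0.0)                    # positive / negative conditional PDFs
w = q * P + (1.0 - q) * N                          # population density
w /= w.sum()                                       # probability mass per grid point
Z = np.maximum(q * P, (1.0 - q) * N) / (q * P + (1.0 - q) * N)  # local accuracy, >= 1/2

# Bathtub-type principle: classify points in order of decreasing local accuracy,
# stopping just before the average accuracy of the classified set drops below X.
desc = np.argsort(-Z)
avg_acc = (Z * w)[desc].cumsum() / w[desc].cumsum()   # nonincreasing running average
n_keep = np.where(avg_acc >= X)[0][-1] + 1
keep = np.zeros(r.size, dtype=bool)
keep[desc[:n_keep]] = True
waterline = Z[desc[n_keep - 1]]                    # local-accuracy threshold ("waterline")
holdout = w[~keep].sum()                           # probability of an indeterminate result
print(f"waterline = {waterline:.4f}, holdout probability = {holdout:.4f}")

# Discrete analogue of the point-swap test: no held-out point has higher local
# accuracy than any classified point, so a swap that preserves the accuracy
# constraint can only increase the indeterminate probability mass.
assert Z[~keep].max() <= Z[keep].min() + 1e-12
```

By construction the held-out set sits below the waterline, so the final assertion is the discrete counterpart of the positivity shown in Fig. 7.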
Fig. 7

The logarithm of the swap derivative given by Eq. (19) computed for the optimal domains , , and . The 0-level line is the boundary of the indeterminate region. Note that the logarithm is everywhere positive. Thus, swapping any infinitesimal regions between and will increase the probability mass in the indeterminate, provided constraint Eq. (11) is satisfied.

To validate that swapping points between and does not increase the accuracy of the assay or decrease the size of the indeterminate domain, we examine the quantity directly. In particular, the Appendix shows that for all guarantees that and are optimal for the indeterminate region . Fig. 7 demonstrates that this inequality holds for the solution given by Eqs. (13a)–(14). Thus, no rearrangement of points decreases the size of the indeterminate domain.

Discussion: Historical context, open direction, and limitations

The role of prevalence

Examination of Eq. (11) reveals that the terms of the LHS are proportional to prevalence-weighted estimates of sensitivity and specificity. In particular, recognize that are the sensitivity and specificity restricted to the domain . When there is no indeterminate domain, the normalization factors , so that Eqs. (21a), (21b) revert to the standard definitions of these quantities. In this case, we see that Eq. (11), which no longer acts as a constraint, amounts to the statement that the prevalence-weighted sum of sensitivity and specificity is equal to .

When we permit an indeterminate class, however, the interpretation is not as straightforward. In particular, the presence of the term on the right-hand side (RHS) appears problematic, for note that it implies . The normalization factor differs from its counterparts in Eqs. (21a), (21b). Thus, it is not obvious what our constraint enforces about the sensitivity and specificity of the assay restricted to .

The resolution to this conundrum is to recognize that the prevalence of the population also changes when we restrict classification to . This is not to say that the value of itself (i.e. associated with the total population) changes, but rather that the relative fraction of positives and negatives differs on . This is not unexpected, since the shape of the indeterminate region is a function of the local accuracy , which depends on the specifics of the probability models. Mathematically, we understand these observations by rewriting Eq. (23) in the form , where and are the required normalization constants. Eq. (24) becomes an analogue to Eq. (22) of the form , where is the prevalence restricted to the domain . Note that has the properties necessary to be a prevalence; this is a consequence of the definition of . Thus, we see that the constraint corresponds to a domain-restricted-prevalence weighted sum of sensitivity and specificity.

From a theoretical standpoint, Eq. (26) is extremely serendipitous. The constraint as defined by Eq. (11) refers only to the prevalence of the full population. It is not obvious that this equation should remain a prevalence-weighted sum when holding out samples, especially as the restricted prevalence does not in general equal . Further implications of this observation are explored in the next section. However, an immediate practical consequence of Eq. (26) is that the relative fraction of positives from an assay using indeterminates is not a reliable estimator of the total prevalence. In order for the restricted prevalence to equal , one requires , which implies . That is, equality only occurs when the holdout domain removes equal probability mass from the two probability models, which is extremely restrictive.

To overcome this problem, we recall that Ref. [5] demonstrated how an unbiased estimate of the total prevalence can be constructed, without classifying samples, via a simple counting exercise on subdomains of . The validity of that method is independent of the assay accuracy, so that it can be used to estimate in the present work. Indeed, such techniques are necessary to construct the optimal classification domains, given the fundamental role of in their definitions. We refer the reader to Ref. [5] for a deeper discussion of such issues.
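The restricted-prevalence effect can be seen with a small numerical example. The setup below is entirely illustrative: 1-D Gaussian conditional models and an ad hoc holdout interval are assumptions (the paper's indeterminate domains are level sets of the local accuracy, not hand-picked intervals).

```python
import numpy as np

# Hypothetical 1-D models: negatives ~ N(0,1), positives ~ N(3,1), prevalence q.
pdf = lambda x, m: np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2.0 * np.pi)
q = 0.3                                            # prevalence of the full population
r = np.linspace(-4.0, 7.0, 20001)
dr = r[1] - r[0]
P, N = pdf(r, 3.0), pdf(r, 0.0)                    # positive / negative conditional PDFs

held = (r > 1.0) & (r < 2.2)                       # hypothetical indeterminate interval
mass_P = (P[~held] * dr).sum()                     # chance a positive sample is classified
mass_N = (N[~held] * dr).sum()                     # chance a negative sample is classified
q_restricted = q * mass_P / (q * mass_P + (1.0 - q) * mass_N)
print(f"q = {q}, restricted prevalence = {q_restricted:.4f}")
```

Because the holdout interval removes unequal mass from the two models (mass_P != mass_N), the prevalence on the classified domain differs from q, which is exactly why the raw fraction of positive calls is a biased estimator of the total prevalence.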

Other notions of optimality

A common practice in the testing community is to preferentially optimize an assay so that either the specificity or sensitivity reaches a desired target, rather than a linear combination of the two. Eq. (25) and the bathtub-type principle suggest a route by which our method can solve an analogue of this problem. However, a deeper investigation of sensitivity and specificity is first necessary to motivate this generalization and understand how such methods differ from Eqs. (13a)–(14). [See also Ref. [23] for additional notions of optimality, as well as Refs. [24], [25], [26] for other approaches to defining classification domains.]

Examination of the binary problem reveals that when , the domains and equally weight sensitivity and specificity; that is, errors in either are treated as equally undesirable. It is straightforward to show that increasing will increase sensitivity at the expense of specificity, and vice versa. The interpretation of this observation is that as the number of positive samples increases, we should increase the size of the positive classification domain so as to capture their increasing share of the population. It is therefore possible, and even likely, that when the prevalence approaches 0 or 100%, either sensitivity or specificity may be unacceptably low, since the corresponding contribution to the total accuracy becomes negligible.

A possible solution to this problem is to recast Eq. (11) as an inequality constraint of the form , together with the additional constraints , where and are user-defined lower bounds. While an optimal solution to this problem is beyond the scope of the current manuscript, the bathtub-type principle suggests a construction akin to active-set methods [27]. First, solve the optimization problem associated with Eqs. (10)–(11) and check the resulting values of sensitivity and specificity. If these quantities are deemed too small, remove samples up to user-defined waterlines and (which may be different), where and apply only to samples in the negative and positive classification domains. Fig. 8 shows an example of this approach applied to the data in previous figures. We originally set but required that the empirical specificity be 100% for the training set. To accomplish this, we set , which augments the size of the indeterminate domain (teal strip added to the light blue domain) without decreasing the number of true negatives.
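The two-step, active-set-style construction above can be sketched as follows. Everything here is an assumption chosen for illustration: 1-D Gaussian conditional models, q = 0.3, X = 0.99, and a second waterline z_P = 0.999 applied only on the positive side; the paper's domains are multidimensional.

```python
import numpy as np

# Hypothetical 1-D models: negatives ~ N(0,1), positives ~ N(3,1).
pdf = lambda x, m: np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2.0 * np.pi)
q, X = 0.3, 0.99
r = np.linspace(-4.0, 7.0, 20001)
P, N = pdf(r, 3.0), pdf(r, 0.0)
w = q * P + (1.0 - q) * N
w /= w.sum()                                       # probability mass per grid point
Z = np.maximum(q * P, (1.0 - q) * N) / (q * P + (1.0 - q) * N)  # local accuracy
pos = q * P > (1.0 - q) * N                        # positive classification side

# Step 1: bathtub waterline for the accuracy target X alone.
desc = np.argsort(-Z)
avg_acc = (Z * w)[desc].cumsum() / w[desc].cumsum()
keep = np.zeros(r.size, dtype=bool)
keep[desc[: np.where(avg_acc >= X)[0][-1] + 1]] = True

def specificity(kept):
    # Restricted specificity: negatives called negative among classified negatives.
    return N[kept & ~pos].sum() / N[kept].sum()

# Step 2: if specificity is too low, raise a second waterline z_P on the positive
# side only, moving borderline positive-domain points into the indeterminate class.
z_P = 0.999
keep2 = keep & ~(pos & (Z < z_P))
print(f"specificity: {specificity(keep):.4f} -> {specificity(keep2):.4f}")
print(f"holdout mass: {w[~keep].sum():.4f} -> {w[~keep2].sum():.4f}")
```

As in Fig. 8, the indeterminate domain grows only into the positive side, so specificity rises while no true negatives are lost.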
Fig. 8

Holdout domain computed with a target accuracy of 99% and according to the constraints given by inequalities (29), (30). For the latter, we set indirectly by holding out samples up to in the positive classification domain. This yields an empirical specificity of 100% for the training data while keeping the empirical sensitivity above 94%. Note that the indeterminate domain (light-blue) is increased only into the positive classification domain (yellow-green) in attempting to satisfy inequality (30). The teal strip adjacent to the light blue and yellow-green is the modified indeterminate domain. After increasing the empirical specificity to 100%, the optimized domains hold out 15.1% of samples, as opposed to 22.3% for the rectilinear method; see Table 1.


Relationship between prevalence, sensitivity, and specificity

Eq. (25) and the examples of Secs. 6.1, 6.2 raise the question: to what extent is prevalence-weighted accuracy a preferred or natural framework for diagnostic classification, as opposed to methods based on explicit reference to sensitivity and specificity? To unravel this, consider that the latter two are purely theoretical properties of a specific choice of classification domain and are only loosely connected to the reality of testing. This is evident from the definitions given by Eqs. (21a), (21b). The concept of prevalence, i.e. implying existence of a population, does not enter; rather, all that is needed is a choice of the classification domains. Thus, an assay can have exceptional sensitivity and yet still be wrong half the time if the prevalence is low. In a related vein, it is clear that specificity and sensitivity only characterize assay accuracy in the limits and , respectively.

Here we encourage a new perspective. As a baseline strategy, the most important task is to correctly classify samples; at least, this is of the utmost importance to patients. Moreover, computing accurate prevalence estimates is critical for epidemiologists (although we have shown previously that this problem can be solved accurately without recourse to classification). With this goal in mind, the sensitivity and specificity are subservient to accuracy via Eq. (11), and it is not unreasonable to let them change with prevalence if doing so increases overall testing accuracy. We highlight this because under such a paradigm, and lose their status as the key performance metrics that define the “quality” of an assay, and they cannot be viewed as static properties. Such observations are not to say that and are useless, however. Clearly there are times when it is more important to correctly identify samples from one class, and this motivates the generalization of Section 6.2. But these observations clarify our perspective on why the prevalence sets a natural scale for classification.

In particular, Eq. (11) has two equivalent interpretations: (i) the accuracy of the assay must be ; and (ii) the prevalence-weighted sensitivity and specificity must be . The equivalence of these interpretations arises from the fact that notions of accuracy assume the existence of a population to which the test is applied. Thus, Eq. (25) is perhaps unsurprising in light of Eq. (11), because both are self-consistent statements about the properties of a population. The benefit of treating prevalence weighting as a natural framework for diagnostic classification is that one can easily identify when subjective elements (i.e. those not intrinsic to the population) have been added to the analysis. For example, the indeterminate domain in Fig. 8 associated with inequalities (28)–(30) is not optimal insofar as there is a smaller counterpart that yields the same average accuracy for the classified data. However, it is clear by construction how we have modified the latter, i.e. by adding a user-defined constraint on the specificity. Likewise, even Eq. (11) should be viewed as a subjective modification of the unconstrained, prevalence-weighted classification problem. Ultimately the choice of classification method is best determined by assay developers, and there may be situations in which prevalence weighting is inappropriate. Nonetheless, we feel that the analysis herein highlights the assumptions behind our work and attempts to ground it in objective elements inherent to the population of interest.
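The remark that an assay with exceptional sensitivity can still be "wrong half the time" at low prevalence is easy to check numerically. The 99%/99%/1% figures below are hypothetical, chosen only to make the arithmetic vivid; they are not the assay's performance numbers.

```python
# Hypothetical performance at low prevalence: even with 99% sensitivity and 99%
# specificity, at 1% prevalence half of all positive calls are wrong, because
# predictive value is a property of the population, not the classification domain.
q, Se, Sp = 0.01, 0.99, 0.99
ppv = q * Se / (q * Se + (1.0 - q) * (1.0 - Sp))   # fraction of positive calls that are correct
accuracy = q * Se + (1.0 - q) * Sp                 # prevalence-weighted accuracy
print(round(ppv, 3), round(accuracy, 3))           # -> 0.5 0.99
```

Note that the prevalence-weighted accuracy still looks excellent (99%), which is precisely why sensitivity and specificity alone cannot characterize an assay without reference to the population.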

Implications of an indeterminate class

The use of an indeterminate class in diagnostics can have consequences, especially for individual patients. We briefly consider such issues here. We first note what does not change: it is still possible to estimate prevalence of an entire population, even though some samples are held out. This is a consequence of the methods in Ref. [5], [14], which yield unbiased predictions of that converge in mean-square without ever classifying data. These results are understood physically by recognizing that prevalence estimation entails determining the number of positive samples, not identifying their classes. The latter task is more specific and amounts to a choice about how to interpret the data, which has no bearing on its underlying statistical properties (such as prevalence). This observation highlights the subjective nature of classification insofar as Eqs. (6)–(11) are choices of the types of errors we wish to minimize. These choices are informed by the measurement setting and dictate what role an indeterminate class plays in reporting results to individuals. In mass surveillance studies, for example, the primary goal may be to deduce prevalence of various populations. In such cases, a large fraction of inconclusive results could reduce individual confidence in testing, although it should not affect the overall aims of the study. In other settings, such as antibody testing to assess immunity, the specific test results matter more. In some cases an indeterminate class as constructed herein (in terms of local accuracy) could in fact increase confidence and/or usefulness of the diagnostic. For example, in testing an immunocompromised individual, the potential loss associated with an incorrect result suggests the need for a more stringent criterion (e.g. using local accuracy) to determine if an individual has seroconverted.

Limitations and open directions

A fundamental limitation of our analysis is the assumption that the probabilistic models describing positive and negative samples can be used outside the scope of the training data. This problem is common to virtually any classification scheme and is primarily an issue of modeling. Such issues have been explored in a previous manuscript, to which we refer the reader [5]. We note here, however, that model-form errors may introduce uncertainty on the order of a few percent in the conditional probability densities. Thus, it is likely that modeled estimates of accuracy will be incorrect by a proportional amount. This is seen, for example, in the holdout domain computed in Fig. 5. However, Section 6.2 provides a means of ensuring that the indeterminate domains are recomputed to satisfy any constraints on empirical estimates of accuracy. We also note that approaches that do not explicitly account for prevalence and/or conditional probabilities are likely to suffer significantly greater model-form errors than estimates based on our approach. Regarding the indeterminate analysis, Eqs. (13a)–(14) and the generalization considered in Section 6.2 may pose challenging optimization problems, although their solutions could be extremely useful for satisfying regulatory and/or public health requirements. Moreover, formalizing the algorithm described in that section and studying its properties relative to the optimal solution may be useful. A practical limitation of our analysis is the definition of assay performance when we allow for variable, prevalence-dependent classification domains. Current standards advocate using sensitivity and specificity estimated for a single validation population having a fixed prevalence.
To realize the full potential of our analysis, it is necessary to (i) estimate assay accuracy and the uncertainty therein, (ii) characterize the admissible classification domains, and (iii) compute sensitivities and specificities, all as a function of the variable prevalence. While such issues have been partly considered in [5], deeper investigation of this uncertainty quantification is necessary for widespread adoption of these techniques.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

1.  Sensitivity, specificity, receiver-operating characteristic (ROC) curves and likelihood ratios: communicating the performance of diagnostic tests.

Authors:  Christopher M Florkowski
Journal:  Clin Biochem Rev       Date:  2008-08

2.  ORF8 and ORF3b antibodies are accurate serological markers of early and late SARS-CoV-2 infection.

Authors:  Asmaa Hachim; Niloufar Kavian; Carolyn A Cohen; Alex W H Chin; Daniel K W Chu; Chris K P Mok; Owen T Y Tsang; Yiu Cheong Yeung; Ranawaka A P M Perera; Leo L M Poon; J S Malik Peiris; Sophie A Valkenburg
Journal:  Nat Immunol       Date:  2020-08-17       Impact factor: 25.606

3.  COVID-19 Serology at Population Scale: SARS-CoV-2-Specific Antibody Responses in Saliva.

Authors:  Nora Pisanic; Pranay R Randad; Kate Kruczynski; Yukari C Manabe; David L Thomas; Andrew Pekosz; Sabra L Klein; Michael J Betenbaugh; William A Clarke; Oliver Laeyendecker; Patrizio P Caturegli; H Benjamin Larman; Barbara Detrick; Jessica K Fairley; Amy C Sherman; Nadine Rouphael; Srilatha Edupuganti; Douglas A Granger; Steve W Granger; Matthew H Collins; Christopher D Heaney
Journal:  J Clin Microbiol       Date:  2020-12-17       Impact factor: 5.948

4.  Differences in Performance Characteristics Among Four High-Throughput Assays for the Detection of Antibodies Against SARS-CoV-2 Using a Common Set of Patient Samples.

Authors:  David M Manthei; Jason F Whalen; Lee F Schroeder; Anthony M Sinay; Shih-Hon Li; Riccardo Valdez; Donald A Giacherio; Carmen Gherasim
Journal:  Am J Clin Pathol       Date:  2021-02-04       Impact factor: 2.493

5.  A statistical model of COVID-19 testing in populations: effects of sampling bias and testing errors.

Authors:  Lucas Böttcher; Maria R D'Orsogna; Tom Chou
Journal:  Philos Trans A Math Phys Eng Sci       Date:  2021-11-22       Impact factor: 4.226

6.  Performance Characteristics of Four High-Throughput Immunoassays for Detection of IgG Antibodies against SARS-CoV-2.

Authors:  Elitza S Theel; Julie Harring; Heather Hilgart; Dane Granger
Journal:  J Clin Microbiol       Date:  2020-07-23       Impact factor: 5.948

7.  A comparison of four serological assays for detecting anti-SARS-CoV-2 antibodies in human serum samples from different populations.

Authors:  Ludivine Grzelak; Sarah Temmam; Cyril Planchais; Caroline Demeret; Laura Tondeur; Christèle Huon; Florence Guivel-Benhassine; Isabelle Staropoli; Maxime Chazal; Jeremy Dufloo; Delphine Planas; Julian Buchrieser; Maaran Michael Rajah; Remy Robinot; Françoise Porrot; Mélanie Albert; Kuang-Yu Chen; Bernadette Crescenzo-Chaigne; Flora Donati; François Anna; Philippe Souque; Marion Gransagne; Jacques Bellalou; Mireille Nowakowski; Marija Backovic; Lila Bouadma; Lucie Le Fevre; Quentin Le Hingrat; Diane Descamps; Annabelle Pourbaix; Cédric Laouénan; Jade Ghosn; Yazdan Yazdanpanah; Camille Besombes; Nathalie Jolly; Sandrine Pellerin-Fernandes; Olivia Cheny; Marie-Noëlle Ungeheuer; Guillaume Mellon; Pascal Morel; Simon Rolland; Felix A Rey; Sylvie Behillil; Vincent Enouf; Audrey Lemaitre; Marie-Aude Créach; Stephane Petres; Nicolas Escriou; Pierre Charneau; Arnaud Fontanet; Bruno Hoen; Timothée Bruel; Marc Eloit; Hugo Mouquet; Olivier Schwartz; Sylvie van der Werf
Journal:  Sci Transl Med       Date:  2020-08-17       Impact factor: 17.956

8.  Evaluation of Serological Tests for SARS-CoV-2: Implications for Serology Testing in a Low-Prevalence Setting.

Authors:  Katherine Bond; Suellen Nicholson; Seok Ming Lim; Theo Karapanagiotidis; Eloise Williams; Douglas Johnson; Tuyet Hoang; Cheryll Sia; Damian Purcell; Francesca Mordant; Sharon R Lewin; Mike Catton; Kanta Subbarao; Benjamin P Howden; Deborah A Williamson
Journal:  J Infect Dis       Date:  2020-09-14       Impact factor: 5.226

9.  Comparative performance of multiplex salivary and commercially available serologic assays to detect SARS-CoV-2 IgG and neutralization titers.

Authors:  Christopher D Heaney; Nora Pisanic; Pranay R Randad; Kate Kruczynski; Tyrone Howard; Xianming Zhu; Kirsten Littlefield; Eshan U Patel; Ruchee Shrestha; Oliver Laeyendecker; Shmuel Shoham; David Sullivan; Kelly Gebo; Daniel Hanley; Andrew D Redd; Thomas C Quinn; Arturo Casadevall; Jonathan M Zenilman; Andrew Pekosz; Evan M Bloch; Aaron A R Tobian
Journal:  J Clin Virol       Date:  2021-10-09       Impact factor: 3.168

