
Biological variation: Understanding why it is so important?

Tony Badrick

Abstract

This Review will describe the increasing importance of the concepts of biological variation (BV) to clinical chemists. The idea of comparison to a 'reference' is fundamental in measurement. For biological measurands, that reference is the relevant patient population, a clinical decision point based on a trial, or an individual patient's previous results. The idea of using biological variation to set quality goals was then realised for setting Quality Control (QC) and External Quality Assurance (EQA) limits. The current phase of BV integration into practice is using Patient-Based Real-Time Quality Control (PBRTQC) and Patient-Based Quality Assurance (PBQA) to detect a change in assay performance. The challenge of personalised medicine is to determine an individual reference interval. The Athlete Biological Passport may provide the solution.
© 2020 The Author(s).


Keywords:  Analytical goals; Biological passport; Patient based quality assurance

Year:  2021        PMID: 33490349      PMCID: PMC7809190          DOI: 10.1016/j.plabm.2020.e00199

Source DB:  PubMed          Journal:  Pract Lab Med        ISSN: 2352-5517


Introduction

Measurement is defined by the International Bureau of Weights and Measures [1] as the process of experimentally obtaining one or more quantity values that can reasonably be attributed to a quantity. The measurement process commonly used in laboratory medicine uses a ratio scale, where proportions are constant and there is a true zero point [2]. Other ratios, particularly the 'signal to noise' ratio, will play a part in the discussion. To interpret the result of a clinical laboratory test requires comparison to a reference interval, a clinical decision point, or previous results. A reference interval is derived from a reference population, whereas detecting a significant change in consecutive results requires an understanding of analytical and biological variation (BV). Even a clinical decision point requires the ability to separate two patient populations. The common theme running through this article is biological and analytical variation. The reason for using BV as a component of performance goals is that, to detect or rule out disease, an assay must be able to detect a change from normal. Thus BV, or specifically being able to detect a difference in a measurand's 'signal to noise' ratio of BV to analytical precision, is the basis for many goals, including: how quality specifications are set, often linked to the sigma metric [3,4]; method selection and assessment [5]; the definition of common reference intervals in a geographic area [6,7]; the analysis of the consequences of poor calibration [8]; evaluation of laboratory results in external quality assurance (EQA) programs [9]; the quality of evidence-based clinical guidelines [10]; and goals for PoCT performance [11]. The principles behind setting performance goals look at the impact of bias and precision on the misclassification of patients. Analytical bias or imprecision will move a patient result relative to a reference interval or clinical decision point. Petersen has reviewed these situations [2,12]. 
There are two general models for classifying a patient's result based on a reference interval: unimodal or bimodal. In the unimodal model, there is an assumed increased risk as the measured concentration approaches a decision point or reference limit; the decision point/reference limit separates the otherwise homogeneous distribution into two groups with high and low risk of some disease, respectively. This approach contrasts with the bimodal model, in which there is a well-defined (but often unknown) prevalence for the condition under consideration, and the decision point is based on that prevalence. Typically, the assay's performance in identifying the diseased group is described using the fractions of true positives, true negatives, false positives, and false negatives. In the bimodal model, the impact of bias and precision depends on the prevalence. Petersen et al. have also investigated the effect of bias and precision on ordinal measurands such as dipstick tests [13]. However, there is often an interplay between data-based and theory-based models as a deeper understanding of these problems' complex nature develops. A model is posed and used until it fails some significant observation; a data-based approach is then used until a newer model can be developed from the fresh insights provided by more data mining. There have been some recent reviews of Biological Variation [14,15], as well as an excellent reference text [16] and chapter [17]. This Review aims to track the history of how biological variation has influenced the Total Error concept, analytical goal setting, Quality Control (QC), and EQA practice, through patient-based real-time quality control to the future of EQA using patient population parameters.

Biological Variation

BV for a biomarker entails, in each individual, a "subject mean" or central tendency, control level, or "setpoint" concentration of homeostatic regulation arising from such factors as genetic characteristics, diet, physical activity, and age [18,19]. The formal study of BV began in 1835, though understanding differences between individual humans or animals has always been a fascination and a necessary skill. This formal study was driven by the statistical tools created by the scientists of the Age of Enlightenment (Fermat, Pascal, Newton, Brahe, Galileo, Daniel Bernoulli, Laplace, Gauss). The equation describing the Gaussian distribution came from Gauss's investigation of errors in astronomical measurements, published in the early nineteenth century [20]. The first significant application of the Gaussian distribution to the measurement of humans was by an astronomer turned social scientist, Adolphe Quetelet [21,22]. His most influential book is known in English as Treatise on Man, though a literal translation of the French title would be "On Man and the Development of his Faculties, or Essays on Social Physics." Quetelet described the concept of the average man, whose variables, encompassing social and physical characteristics, follow a normal distribution. Quetelet believed that understanding the statistical description of these variables would reveal God's work as well as being a force for good administration. Using this approach, he gained insight into the relationships between crime and other social factors, including age, gender, climate, poverty, education, and alcohol consumption. Quetelet postulated a theory of human variance around the average, with human traits distributed according to a 'normal' curve. He proposed that normal variation provided a basis for the idea that populations produce sufficient variation for artificial or natural selection to operate. 
This concept of a 'normal range' influenced laboratory medicine, as it made it possible to classify 'normal' versus 'abnormal or diseased.' However, the question of what is 'normal' or 'average' arises. Consider a disease such as sickle cell anaemia. It is a disease, yet it is protective against malaria, so what is abnormal in one population becomes normal in another: the diseased in one group represent the non-diseased in another, and what is normal or average in one population is not necessarily what is healthy. What, then, is the 'normal' population? Note that we use the words 'normal' and 'Gaussian' almost interchangeably. Poincaré pointed out that mathematicians believe the normal distribution is an empirical fact, while to scientists it is a mathematical law [23]. This view has not changed to this day; it is an example of a theoretical model we sometimes use despite data that indicate the contrary. To explore this further, some interpretations of the concept 'normal' given by Murphy are shown in Table 1.
Table 1

Interpretations of ‘normal’ (modified from Murphy [24]).

   Conception of normal                                   Suggested alternative
1  Determined statistically                               Gaussian
2  Most representative of its class                       Average, median, mode
3  Most commonly encountered                              Habitual
4  Wild type: most suited to survival & reproduction      Fittest
5  Harmless, 'carrying no penalty'                        Innocuous/harmless
6  Most often aspired to                                  Conventional
7  The most perfect of its class                          Ideal

Reference intervals

Laboratory medicine started developing a broader range of tests during the mid-twentieth century. These tests were developed in-house and provided to a local medical community, usually in isolation from other sites. At this time, the clinical practice was to compare a patient's results with an ill-defined, or at least inconsistently defined, range of values or 'normal limits,' called the 'normal range.' This was derived from a population of supposedly 'normal' (meaning healthy) individuals [25]. It soon became apparent that multiple 'normal ranges' were required for different patient populations and individual laboratories, and that these differed because of methodological variation. It was the pioneering work of Sunderman, using the first EQA surveys, that revealed the extent of the variation in results caused by these different in-house methods [26]. Grasbeck and Saris in 1969 [27] challenged this concept of a normal range, arguing that normality is a relative and situational term. They suggested a new name, the 'Reference Interval,' which would describe fluctuations of analyte concentrations in well-characterised groups of individuals. The fundamental idea was to have a point of reference against which to interpret an individual's results, rather than to define normality [28]. The population reference interval may not account for factors such as age, ethnicity, or gender unless they have a significant impact. Therefore, the reference interval approximates what can be expected in the population to which the patient belongs.

Components of BV

Many of the analytes we measure in laboratory medicine change over time, varying in predictable rhythms; indeed, sometimes the loss of these rhythms represents disease. However, most quantities do not have these rhythms, and for each individual the quantity varies around a homeostatic set point in a truly random manner [29]. This is the true meaning of BV, though the variation we see will be due to different setpoints and variation for an individual, often lost in the total variation. Added variation, of course, is caused by the pre-analytical processes of collection, transport, and preservation, and by the measurement system's variation. The underlying reasons for using biological variation in the many situations described in this article are compelling; however, their value relies upon accurate estimates of that variation. Many investigators have produced estimates of the intra-individual (CVI) and group (CVG) variation, but often the needed measurement rigour and supporting information were not available [15,30,31,32,33,34]. For example, studies have been conducted with small sample sizes and poorly defined groups, using inappropriate statistical methods. There is now a better understanding of the need to produce and promulgate accurate estimates generated from significant sample sizes using the best statistical tools available [30]. These biological variation estimates are available at the curated EFLM site [30]. Important statistical considerations include determining the BV parameters, outlier removal, and their confidence intervals. The method of determining CVI and CVG involves selecting a group of reference individuals in a steady state and measuring the analyte of interest at regular intervals over a period that will reflect any 'normal' biological variation. The observed variation will, of course, also include any additional variation due to the collection and measurement process. 
The analytical variation is estimated by measuring replicates of the individual samples. The statistical analysis used is ANOVA, which can separate the variation due to intra-individual, between-individual (group), and analytical causes. The specific ANOVA model is a nested random design in two levels, assuming that the three terms are independent and normally distributed with constant variances [14,35]. Historically, the sample sizes have been small, of the order of 10–20 individuals. However, Roraas et al. [36] have produced power tables with numbers of individuals, samples, and replicate measurements that provide guidance. They also offered a 'width' of the confidence interval based on these numbers, to convey the uncertainty of the limits. There is skewness around the confidence intervals depending on the ratio of analytical to intra-individual biological variation, another example of the impact of noise on measurement.
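
As a sketch of this nested design, the variance components of a balanced BV study can be recovered from the ANOVA mean squares. The function below is a minimal illustration only (the function name and array layout are ours, not from the paper), assuming a fully balanced design with subjects, samples per subject, and replicates per sample:

```python
import numpy as np

def bv_components(x):
    """Estimate CV components from a balanced two-level nested design.
    x has shape (subjects, samples_per_subject, replicates), the classic
    BV study layout. Returns (CVA, CVI, CVG) as percent of the grand mean."""
    I, S, R = x.shape
    grand = x.mean()
    subj_means = x.mean(axis=(1, 2))            # per-subject means, shape (I,)
    samp_means = x.mean(axis=2)                 # per-sample means, shape (I, S)

    # Sums of squares at each nesting level
    ss_subj = S * R * ((subj_means - grand) ** 2).sum()
    ss_samp = R * ((samp_means - subj_means[:, None]) ** 2).sum()
    ss_repl = ((x - samp_means[..., None]) ** 2).sum()

    # Mean squares
    ms_subj = ss_subj / (I - 1)
    ms_samp = ss_samp / (I * (S - 1))
    ms_repl = ss_repl / (I * S * (R - 1))

    # Expected mean squares -> variance components (clipped at zero)
    var_a = ms_repl                                   # analytical
    var_i = max((ms_samp - ms_repl) / R, 0.0)         # within-subject
    var_g = max((ms_subj - ms_samp) / (S * R), 0.0)   # between-subject
    return tuple(100 * np.sqrt(v) / grand for v in (var_a, var_i, var_g))
```

Run on simulated data with known CVA = 2%, CVI = 5%, CVG = 10% around a mean of 100, the three estimates land close to the inputs, illustrating why small studies (10–20 individuals) give wide confidence intervals on CVG in particular.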

Index of individuality

As mentioned earlier, it was only in 1956 that Williams first established intra-individual biochemical variation based on published studies of various biomarkers [29]. He hypothesized that the 'normal' man is an illusion and that every individual varies in some respect. Further, he postulated the 'principle of genetic gradients,' which stated that, 'Whenever an extreme genetic character appears in an individual organism, it should be taken as an indication (unless there is proof to the contrary) that less extreme and graduated genetic characters of the same sort exist in other individual organisms.' Williams termed this 'biochemical individuality.' Some analytes are tightly controlled within an individual around a setpoint, but the setpoint may vary between individuals; the classic example is creatinine. The significance of this for diagnosis is that an individual may be within a reference interval for the population yet have significant loss of renal function. With these analytes there is a need for more granularity in the reference interval, by partitioning by age, sex, and body mass; however, this may still not be sufficient, and a comparison with previous results may be necessary. Until recently, though, there have not been many databases containing histories of patient results that could be used for this purpose in practice. Recognising that with some analytes there is significant intra-individual variation, the index of individuality was introduced by Harris [37] as the ratio √(CVI² + CVA²)/CVG (the combined biological within-subject variation and analytical precision divided by the biological between-subject variation), which is now often simplified to CVI/CVG. The use of the index allowed the distinction between situations where population-based reference intervals are relevant (index > 1.4) and those (index < 0.6) for which the use of cumulative (with respect to time) systems for reporting results for individuals would be favoured. 
In those situations, the population should be stratified into more homogeneous subgroups (each of which, of course, would have a higher index); still, ideally, use should be made of individual reference intervals. Note that the index of individuality is another example of using the signal-to-noise ratio to determine when a measure is useful. It also demonstrates that more data challenge model assumptions, leading to improved understanding and new models.
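
Harris's decision rule, with the thresholds quoted above, can be sketched in a few lines (the function names are ours):

```python
import math

def index_of_individuality(cvi, cva, cvg):
    """Harris's index: sqrt(CVI^2 + CVA^2) / CVG.
    All arguments are percent CVs."""
    return math.sqrt(cvi**2 + cva**2) / cvg

def interval_strategy(ii):
    # Thresholds from Harris [37], as quoted in the text
    if ii > 1.4:
        return "population reference interval useful"
    if ii < 0.6:
        return "use individual's previous results (cumulative reporting)"
    return "intermediate: consider stratified subgroups"
```

For a tightly individually controlled analyte (e.g., CVI = 4%, CVA = 2%, CVG = 25%, illustrative values only) the index is about 0.18, well below 0.6, so comparison with the individual's own previous results is preferred over the population interval.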

Biological Passport

Personalised medicine (PM) has the potential to tailor therapy for the best response and highest safety margin, to ensure better patient care [38]. In laboratory medicine, the concept would involve personalised reference intervals [39,40]. These have been considered for over a decade but are technically challenging to generate in practice [41,42]. However, the sports anti-doping agencies developed the Athlete Biological Passport (ABP) to identify variations in an individual's biomarker profile, which promises to provide a model suitable for a personalised reference interval. The ABP uses Bayesian statistics as its underlying model and is based on previous success with clinical trials [43]. Bayesian analysis uses prior knowledge of the inter-individual variation of a biomarker to progressively integrate previous results and represent the levels more accurately for an individual. Each new measurement of the biomarker is compared to a critical range, which is initially derived from the population reference interval; this range then progressively adapts with each new measurement point. Bayes' theorem relates the probability of a cause given an observed effect to the prior probability of the cause and the likelihood of the effect [44,45,46]. Three distinct modules can be distinguished in the Biological Passport: the hematological, steroidal, and endocrinological modules. The hematological module of the passport aims to detect any form of blood doping. In 2008, the Union Cycliste Internationale was the first sports organization to implement the hematological module of the passport to deter blood doping in elite cycling. The steroidal module aims to detect direct and indirect forms of doping with anabolic agents [47,48,49,50]. The endocrinological module measures biomarkers such as IGF-1 and growth hormone. Further modules, such as an 'omics' module, are being investigated [51,52]. 
The model can be complicated and considers genetic variation specific to the metabolism of an analyte. Thus, over a reasonably short period, an individualised reference interval is created. Any significant variation is then identified, which may represent doping or, in an individual patient's case, a variation towards disease [44]. It is possible that this type of logic will lead to a workable model of individual reference intervals, with various routine measurements feeding into the model to detect incipient disease. The inputs will eventually come from wearable devices [53,54].
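
The adaptive-range idea can be sketched with a simple normal-normal (conjugate) Bayesian update: the prior is the population distribution (spread CVG), each result carries within-person plus analytical noise, and the predictive range shrinks from the population interval toward an individual one as results accrue. This is a toy illustration under Gaussian assumptions, not the actual WADA adaptive model:

```python
import math

def adaptive_limits(results, pop_mean, between_sd, within_sd, z=2.576):
    """Normal-normal Bayesian update of an individual's setpoint.
    Starts from the population prior (pop_mean, between_sd); each result is
    assumed to have combined within-person + analytical spread within_sd.
    Returns the (lower, upper) predictive range for the next result."""
    post_mean, post_var = pop_mean, between_sd**2    # prior = population
    for x in results:
        w = post_var / (post_var + within_sd**2)     # shrinkage weight
        post_mean = post_mean + w * (x - post_mean)  # pull toward the result
        post_var = w * within_sd**2                  # posterior variance shrinks
    pred_sd = math.sqrt(post_var + within_sd**2)     # predictive spread
    return post_mean - z * pred_sd, post_mean + z * pred_sd
```

With no results, the range is simply the population interval; after a handful of results near, say, 140, the range recentres on the individual's setpoint and narrows, which is the behaviour the passport exploits to flag an abnormal excursion.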

Analytical goals

Any measurement needs to be fit for the purpose for which it is used. In laboratory medicine, that purpose is to differentiate between 'healthy' and diseased. In the previous section, we discussed the problems associated with defining health. Next, we will investigate how this can be achieved with our measurement processes. We will limit ourselves to measurement procedures that give numerical results on a ratio scale, so the measurement parameters include accuracy, precision, reproducibility, and repeatability. The required accuracy and precision of a measurement are termed the quality goal for that analysis. Defining quality goals has been a continuing global effort, and in 2014 in Milan there was an update to the Stockholm 'Consensus Agreement' [55,56]. Revising the original goals involved modification and explanatory additions to simplify the hierarchy and improve its application by various stakeholders. This simplified scheme resulted in three models, as shown below.

Model 1: based on the effect of analytical performance on clinical outcomes. This can, in principle, be done using different types of studies.

Direct outcome studies investigate the impact of the analytical performance of the test on clinical outcomes. This approach requires a Randomised Clinical Trial, which is impractical because the biomarker/disease relationship is not unique, except in a few conditions where there is a well-defined clinical pathway with close links to outcomes, e.g., HbA1c, troponin, cholesterol [57,58]. These tests have decision limits (established based on existing analytical performance) and are used as part of a disease definition, or as proxy outcome measures to assess patient well-being or response to treatment. 
Indirect outcome studies investigate the impact of the test's analytical performance on clinical classifications or decisions, and thereby on the probability of patient outcomes, e.g., by simulation or decision analysis. This model is suitable when the clinical findings associated with the test results are well defined, there is evidence about the diagnostic accuracy of the test that is generalizable to the patient population setting, and the consequences of correct/incorrect classification have already been established. Examples of goals based on clinical outcomes using modelling or simulation include: fasting plasma glucose (FPG) measurement and the misclassification of Diabetes Mellitus (DM) [59] and IFG [60]; FPG, OGTT, and HbA1c measurement in the classification of DM [61]; point-of-care glucose meter results and potential insulin dosing errors [62]; direct measurements of HDL and LDL cholesterol affected by bias due to hypertriglyceridaemia in cardiovascular risk prediction [63]; and cardiac troponin methods and the misclassification of patients presenting with Acute Coronary Syndrome to Emergency Departments [64,65]. An example of setting performance goals based on a survey of clinicians was described by Panteghini [66].

Model 2: based on components of biological variation of the measurand. This model has the advantage that it can be applied to most measurands for which group or intra-individual biological variation data can be established.

Model 3: based on state-of-the-art. State-of-the-art represents the highest analytical performance level achievable, or that achieved by a certain percentage of laboratories in an EQA scheme.

The highest goals are based on model 1, clinical outcome, which should yield specifications based on decision levels and evaluation of tolerable false-positive (and false-negative) results. The problems with this approach include that few tests have a definitive role in managing a specific condition's clinical pathway, and most laboratory tests are used for both diagnosing and monitoring disease, often in combination with other laboratory or medical tests; thus the link between testing and health outcomes is indirect [67]. It is therefore unlikely that clinical outcomes will be suitable to define performance goals for every measurand in every clinical situation. The next best approach is model 2, quality specifications derived from biological variation data. The principle behind this approach is to minimise the analytical variation ('noise') relative to the natural biological variation ('signal'). Thus, analytical performance specifications based on biological variation are now broadly used in clinical chemistry, whereas in clinical guidelines performance based on clinical criteria is, in most cases, preferred [68,69]. Setting performance goals to improve analytical performance may not improve patient outcomes [70]: the best performing assay does not necessarily lead to clinical action or affect patient compliance with treatment [71]. These specifications give the maximal total allowable variation in bias and precision, or the Total Allowable Error (TE), that will be tolerated before the assay is no longer fit for the clinical purpose of identifying the abnormal. We will discuss this concept next.

Total (allowable) error

Total Allowable Error began in 1974, when Westgard et al. [72] described a model for formulating criteria to judge whether an analytical method has acceptable precision and accuracy. The 'total analytic error' was defined as the sum of random analytic error (RE) and systematic analytic error (SE), the latter being, in turn, the sum of proportional (PE) and constant error (CE). For each of these errors (RE, PE, and CE), and for Total Error (TE = RE + CE + PE = RE + SE), these authors devised an acceptable level of error (Table 2). The paper also describes how each of these components is calculated. The criterion was that the analytic error should remain within the acceptable medical error (EA) 95% of the time. The acceptable medical error itself was not determined but was treated as the limiting value in the TE equation.
Table 2

Performance criteria for analytic errors.

Analytical Error   Experiment                   Acceptable Error
RE                 Replicates                   2·SD_TU < EA
PE                 Recovery                     |%R(U or L) − 100|·XC/100 < EA
CE                 Interference                 |Bias| + t·(SD_d/√N) < EA
SE                 Patient comparison           |(a + b·XC ± W) − XC| < EA
TE                 Replicates and comparison    |(a + b·XC) − XC| + √(SD_TU² + W²) < EA

Abbreviations used in the Table: EA, acceptable medical error; XC, concentration at the critical decision level; SD_TU, analytical SD at the upper 95% level; %R(U or L), % recovery in the upper or lower recovery experiment; t, t-test value at p = 0.05; W, width of the confidence band; a, b, the intercept and slope from the regression at the critical test concentration.

The model for the equation describing TE, however, became the linear sum of bias and stable precision (s), TE = B + z∗s, z being the z-value corresponding to a defined one-sided confidence probability. This assumes that TE is composed of the maximum values of its components B and s, and that there is a linear relationship between B and s [73,74]. However, this is not always the case. The significance of this equation is that it combined RE and SE for the first time, hence 'total.' The components SE and RE are determined from linear regression equations, a concept that flows into the structure of the TE equation, which is the sum of two parts. One assumption is that the bias is unchanged over time; 'systematic' implies a specific point in time. The bias here is estimated from the linear regression based on recovery experiments. The concept of bias was a relative one: bias from an existing method that may or may not be a reference method. As Klee [75] pointed out, bias depends on the time interval considered. Although bias should be removed when possible, it is inevitably encountered in some circumstances, such as systematic differences between analyzers measuring the same analyte. Related to TE is the concept of an Error Budget [73,76], in which the systematic and random error components share the analytical goal, i.e., the specified tolerance between the true concentration XC and the reported result XR (Fig. 1) [77]. The sum of the errors must be less than the TE limit.
Fig. 1

Representation of standard error budget components within an Analytical Goal.

Krouwer [78] showed that the original approach of estimating TE using just precision and bias can give a misleading impression of assay performance, because estimates of error from various individual sources (e.g., drift, bias) are generally not combined to give a picture of total error (i.e., the error that might be observed by a clinician). He suggested various models to estimate total analytical error, but these are impractical for most laboratories because of the resources needed to identify all his variables. An additional complication is that Krouwer's total error models require quantification of nonspecific interferences. Next, we will analyse the assumptions underlying the two major components of the TE equation: bias and precision.

Setting quality specifications for laboratory medicine based on biological variation

Quality specifications based on the BV components of within- and between-subject variation have been proposed by various professional groups [79,80,81,82]. Specifically, BV's appearance in many areas of laboratory medicine was driven by the Stockholm Consensus Conference and the Milan meetings, as described earlier. The Stockholm meeting produced a unified approach to a hierarchy of performance specifications, which was generally accepted by the laboratory medicine community, even though these different approaches had been discussed for many years [83,84,85]. These meetings proposed that BV be placed above 'state of the art' and professional opinion when setting performance specifications; only data from clinical outcome studies were superior. Note that the 'usual' source of 'state of the art' information was EQA programs, so EQA data were used to set performance goals, yet for most EQA programs the 'target' values were not traceable. The realisation then followed that performance specifications can be unified using BV across many different areas of measurement, including reference intervals and clinical decision points, method evaluation, QC, and EQA performance limits [56,86,87,88]. There were always simplifications, however: even though most biological distributions are log-Gaussian or otherwise non-Gaussian, a simple Gaussian distribution was used. This assumption has been challenged; the effect of different distributions on the calculation of intra-individual BV (CVI) has been investigated and shown to have a significant impact [36]. We will now deal with the derivation of the BV components of the TE equation.

Quality specification for analytical precision

The idea of using biological variation to set a limit on acceptable analytical precision was due in no small part to Tonks [89]. Using observations of the differences between 170 participating laboratories in a Canadian EQA scheme, he formulated their performance goals empirically. Tonks stated that the allowable error, which was taken to be twice the CVA, should be one-quarter of the normal range, expressed as a percentage of the mean of that range, to a maximum of 10%. Barnett [90] had suggested that desirable precision should be such that any analytical error did not significantly widen the normal range. This will be achieved, he demonstrated, if the SA (analytical SD) does not exceed one-twelfth to one-twentieth of the normal range. If the SA is one-twelfth of the normal range, it will widen that range by 5.4%, whereas if it is one-twentieth of the normal range, it will widen that range by 2%. These numbers were arbitrary but were based on the impact of misclassification of a patient's result relative to a reference interval [8,91]. Cotlove et al. [18] described quality specifications for precision based on the idea that analytical precision should not affect the interpretation of results used to monitor patients, rather than for diagnosis as used by Barnett. By looking at the impact of an analytical error on the interpretation of serial results in a patient, this group set performance goals using the intra-individual variance (SI²), with the assumption that the analytical error would not change between measurements. They proposed that, to minimize any artifactual increase in the total variation (ST² = SA² + SI²), SA should be ≤ 0.5∗SI; at this level ST increases by only about 12%, whereas it increases by 40% or more if SA ≥ SI. Cotlove et al. thus utilised what became the Reference Change Value (RCV) or Critical Difference concept [92]. 
The RCV is defined as the change between two results in a patient that would be significant, assuming no change in analytical precision or bias:

RCV = √2 ∗ z ∗ √(SA² + SI²) (1)

If there is a change in the analytical system between the two measurements, the two analytical errors can be added to this equation: adding the imprecisions of the two occasions (SA1, SA2) and a bias B to (1) yields RCV = B + z ∗ √(SA1² + SA2² + 2∗SI²). Usually, the assumption is that there is no significant drift between measurements of the same analyte on the same patient, but this may not be valid, particularly if the laboratory uses multiple instruments to measure the same measurand. Some patient-based quality assurance techniques, such as delta checks, seek to identify misidentification by comparing previous with the most recent results, where a 'significant' difference is one greater than the RCV [93]. There have been many variations on the concept, but its value has been challenged because of the low detection sensitivity and the widespread use of barcodes [94,95,96,97]. Harris [37] investigated the impact of analytical precision on the misclassification of a patient using a reference interval, a clinical decision point, or in monitoring. He found that a decrease in the ratio of analytical precision to biological variation (the inverse of the usual signal-to-noise ratio) of up to 50% leads to only a four percent improvement in overall variation (ST). This change in analytical precision produced only minor changes in clinical sensitivity and specificity. His investigation of the impact of analytical precision on monitoring a single patient also showed a limited influence when CVA/CVB < 0.5. These findings provided theoretical support for the Barnett criterion [37,85]. Thus, the goal for analytical precision was set and validated: 'A striking feature is the fact that all of the individual approaches recommend numbers for analytical standard deviation near or equal to 0.5 times the biological standard deviation' [98].
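
The RCV calculation, in CV terms with z = 1.96 for 95% two-sided significance, can be sketched as follows; the changed-system variant reflects our reading of the modified equation above rather than a standard published form:

```python
import math

def rcv(cva, cvi, z=1.96):
    """Reference Change Value: smallest % difference between two consecutive
    results that is significant, assuming unchanged analytical performance
    (equation (1) in the text). cva, cvi are percent CVs."""
    return math.sqrt(2) * z * math.sqrt(cva**2 + cvi**2)

def rcv_with_change(cva1, cva2, bias, cvi, z=1.96):
    """Variant for a change in the analytical system between the two
    measurements: different imprecisions plus a bias term."""
    return bias + z * math.sqrt(cva1**2 + cva2**2 + 2 * cvi**2)
```

With CVA = 2% and CVI = 5% (illustrative values), the RCV is about 14.9%, i.e., two results must differ by roughly 15% before the change is significant; when both occasions share the same imprecision and there is no bias, the variant reduces to equation (1), a useful consistency check.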

Allowable bias

The introduction of analytical bias can lead to an entire community’s misclassification as the population of results moves relative to a fixed reference interval or decision point [99]. Harris [85] expanded his earlier work to include the impact of bias by simply including a bias (B) in the goal: SA ​≤ ​0.5∗SI became {SA2 ​+ ​B2}1/2 ​≤ ​0.5∗SI. The approach to dealing with bias involves quantifying patients’ misclassification caused by the shift in patient results relative to the reference limits. This may be an unrecognised problem: when a clinical decision point is used, the laboratory must be using a method/calibration that is traceable to the laboratory where the decision point was determined, which is not often the case [100]. Gowans et al. [101] used the reference interval (Gaussian) model generated using the guidelines of the Expert Panel on Theory of Reference Values of the International Federation for Clinical Chemistry [102] as a starting point. Then, for known sample sizes, they calculated the effect of different confidence intervals (z ​= ​0.8, 0.9, 0.95 or 0.99) and sample sizes (10–10,000) on the percentage of the population outside mean ​± ​1.96∗SD, i.e. the reference limits. Incidentally, they found that above a sample size of 800, the confidence interval for the reference limits becomes negligible. For 120 patients, this confidence interval around each reference limit (upper and lower) becomes ​± ​0.25∗SB, where SB is the total biological standard deviation (SB ​= ​(SI2 ​+ ​SG2)1/2, with SG the between-individual SD), which can then be converted into a maximum allowable bias with a 90% confidence interval. The maximum acceptable percentage of the population outside the limit for the 0.90 confidence interval of each of the mean ​± ​1.96∗SD reference limits is 4.6% for a population sample size of 120. Based on this, the maximum acceptable imprecision, with no bias, is 0.6∗SB, and the maximum acceptable bias, with no imprecision, is 0.25∗SB.
This leads to 1.3% and 4.4%, respectively, of the population outside the lower and upper reference limits. This was defined as an acceptable misclassification error, and hence any laboratory with a method bias less than this value could use the reference interval from the reference laboratory/method. Stamm [103], using a similar approach to Gowans, proposed that the ratio of BV to analytical between-day precision should be ​≥ ​2. If the BV is unknown, then the quotient Reference Interval/Analytical between-day precision should be ​≥ ​8.
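Gowans’ tail percentages can be reproduced from the Gaussian model; a sketch (the function also accepts an SA/SB ratio, so the separate effect of imprecision can be explored):

```python
import math

def tail_fractions(bias_in_sd: float, sa_over_sb: float = 0.0):
    """Fractions of a Gaussian population outside fixed reference limits
    (mean +/- 1.96 SD) when results are shifted by `bias_in_sd` biological
    SDs and widened by analytical imprecision `sa_over_sb` = SA/SB."""
    phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # normal CDF
    s = math.sqrt(1.0 + sa_over_sb ** 2)         # widened SD of observed results
    upper = 1.0 - phi((1.96 - bias_in_sd) / s)   # fraction above the upper limit
    lower = phi((-1.96 - bias_in_sd) / s)        # fraction below the lower limit
    return lower, upper

lo, hi = tail_fractions(0.25)
# ~1.4% and ~4.4%, close to the 1.3%/4.4% figures quoted from Gowans.
print(round(lo * 100, 1), round(hi * 100, 1))
```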

Combining bias and precision

The assumption so far has been that the two components of error are not related. Still, precision will also lead to patients’ misclassification, as the imprecision ‘adds’ to the total variance and widens the ‘apparent’ reference interval. We assume that the same percentage of misclassified patients is acceptable and a linear relationship between the two errors. We can calculate the maximum bias and precision that will produce 4.4% of the population lying outside the reference limits. These values are 0.275∗CVB for the bias (when the imprecision is zero) and 0.597∗CVB for the imprecision (when the bias is zero). Thus TE, in this model, is not a constant but varies from 0.275∗CVB to z∗0.597∗CVB. With 4.4% the new upper limit of the distribution, the equivalent z-value becomes 1.68. The EGE Lab Working Group combined these two errors, Cotlove’s requirement for maximum allowable precision and Gowans’ maximum acceptable bias, into a two-component linear model [104]. This was the worst case. In a broad review of analytical goals, Petersen and Horder [105] reviewed the influence of changing bias and precision on reference intervals and decision limits. The signal-to-noise concept underpinned their approach, where the noise could be either controllable (diet, posture, biological cycles) or non-controllable (intra-individual variation), and they also combined the maximum bias and precision into a single goal of the form bias plus z times imprecision. The next step was to consider the inter-relationships between these two terms. With the same assumptions as Gowans et al., we have that [106]: TE ​≤ ​1.65∗(0.5∗CVI) ​+ ​0.25∗(CVI2 ​+ ​CVG2)1/2. The second component represents the dispersion of the population-based reference interval. Petersen et al. [107] compared the different approaches to combining the bias and precision concepts described above. Using a transformation to the axes B/CVB versus CVA/CVB revealed the relationship between bias, precision, and BV more closely.
Gowans’ specifications for sharing common reference intervals become a cusp with a maximum at CVA/CVB of 0.5. The EGE Lab working group’s model becomes a rectangular function when plotted on B/CVB versus CVA/CVB. This plot did emphasize the impact of the ratio CVI/CVB, which is the index of individuality described earlier. If CVI ​= ​0.7∗CVB, then, as an example, with a bias of zero, the desirable level for CVA/CVB is twice the level of when CVI ​= ​0.3∗CVB. Note the use of both CV and SD in these calculations. CV allows the move to dimensionless formulae but does assume that the SD increases linearly with concentration to produce a constant CV [108]. Skendzel et al. [109] described how medically useful analytical coefficients of variation (CVA) could be calculated for monitoring. Also, Fraser et al. [79] described a general theory for the allowable analytical precision when patient monitoring is involved. They found that CVA ​≤ ​(Δ2/(2∗z2) ​− ​CVI2)1/2, where Δ is the significant clinical change as a percent.
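The monitoring goal follows from inverting the RCV inequality √2∗z∗(CVA2 ​+ ​CVI2)1/2 ​≤ ​Δ; a minimal sketch with illustrative numbers (the 20% and 4% values are hypothetical, not taken from the cited papers):

```python
import math

def max_cva_for_monitoring(delta_pct: float, cv_i: float, z: float = 1.96) -> float:
    """Largest analytical CV that still allows a clinical change of delta_pct
    to be detected as significant, from inverting the RCV expression."""
    inner = (delta_pct ** 2) / (2.0 * z ** 2) - cv_i ** 2
    if inner <= 0:
        raise ValueError("change not detectable at any analytical precision")
    return math.sqrt(inner)

# Hypothetical example: to call a 20% change significant when CVI = 4%:
print(round(max_cva_for_monitoring(20.0, 4.0), 1))  # 6.0 -> CVA must be <= ~6%
```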

Quality specification for external quality assurance

The Stockholm consensus’s impact was that the EQA Organizers Working Group [98] looked at applying the TE concepts to EQA performance specifications. Their goal was to ensure that analytical performance did not influence clinical strategies, so suggestions for acceptable limits involved understanding the purpose for which the measurand was to be used. In the monitoring situation, SA ​≤ ​0.5∗SI (in the absence of unidirectional systematic changes), or ΔSE ​≤ ​0.33∗SI (when imprecision is negligible). For analytes that would be used for diagnostic testing, B ​≤ ​0.25∗SB (when imprecision is insignificant) and SA ​≤ ​0.58∗SB (when bias is minor). However, as we saw earlier, the same analyte can be used for different purposes in different clinical situations. Fraser and Petersen [106] further proposed that if there was only a single determination of each EQA sample, the 95% acceptance range for each laboratory from the target value should be the sum of both specifications, the maximum allowable bias plus z times the maximum allowable imprecision: Target ​± ​(0.25∗SB ​+ ​z∗0.5∗SI) (5). Although the purpose of expression (5) was the application in EQA, it is used in other situations, such as selecting appropriate limits for conventional QC [110,111].
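Expression (5) is often implemented as the familiar biological-variation allowable total error; a sketch assuming the desirable specifications (z∗0.5∗CVI for imprecision, 0.25∗(CVI2 ​+ ​CVG2)1/2 for bias) and illustrative BV data:

```python
import math

def tea_desirable(cv_i: float, cv_g: float, z: float = 1.65) -> float:
    """Allowable total error from biological variation: the precision goal
    z*(0.5*CVI) plus the bias goal 0.25*sqrt(CVI^2 + CVG^2)."""
    return z * 0.5 * cv_i + 0.25 * math.sqrt(cv_i ** 2 + cv_g ** 2)

# Hypothetical measurand with CVI = 5% and CVG = 10%:
print(round(tea_desirable(5.0, 10.0), 2))  # 6.92 -> TEa of ~6.9%
```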

The three-level model

A problem with the general application of these analytical goals is that the available methods may not achieve the required precision and accuracy. For example, those measurands subject to tight homeostatic control (e.g. electrolytes) have a TE that produces minimal performance specifications. To allow for some flexibility, three acceptable precision and bias levels have been adopted [16,81]. With the ratio of B to SB at 0.25, the percentage of the population outside the upper and lower reference limits moves from a symmetric 2.5% at both limits to 1.3% (at the lower end) and 4.4% (at the upper end), leading to an increase in the population outside the reference interval of 0.8% of the group. This equates to a rise of 0.8/5.0, or 16% misclassified. This bias level was called ‘desirable’ by Fraser et al. [81]. Using similar calculations, if B/SB becomes 0.375, then the percentages outside the lower and upper reference limits become 1.0% and 5.7%, leading to 1.7/5.0, or 34%, more misclassified. This level of bias was termed minimum performance. Lastly, if B/SB becomes 0.125, then the percentages outside the lower and upper reference limits become 1.8% and 3.3%, which causes a misclassification error of 2%. This was termed optimal performance. The impact of different precision factors has also been considered. If the ratio SA/SB is 0.5, then an additional 12% of variability is added to the TE function. This is termed desirable precision. With a ratio SA/SB of 0.75, there is an increase in variability of 25% (minimal), whereas if SA/SB is 0.25, then only 3% (optimal) is added. Typically a z-value for a 95% confidence interval is used. However, it can be increased to a 99% confidence interval in some instances, illustrating that all the components of this equation are variables [16].
The Index of Individuality demonstrated that it is best to use intra-individual reference intervals for some measurands when setting performance goals for these same assays. Thus different specifications are suggested for screening, diagnosis, and monitoring. However, in practice, some of these specifications may be unachievable by current state-of-the-art procedures.

Oosterhuis and Sandberg model

Work continues on the development of the TE model (equation (5)) to overcome some of the issues identified by Oosterhuis and Sandberg [112], including: the bias term is applied in monitoring, although this expression came from the diagnosis-based reference value model; and both the maxima of allowable bias and precision are used simultaneously. Their approach is based on the same principles but with an accurate calculation of reference limits or reference change values, which leads to more realistic goals for analytes with a high analytical variation relative to the biological variation. Mathematically, this model is an adaptation of Gowans’ model and the reference change value model, with two underlying principles: the model describes the maximum bias and precision allowable that still maintain the validity of reference values (or reference change values in the case of monitoring); and the reference limits are defined by both biological and analytical variation. A key difference is the introduction of a temporal aspect of the analytical variation, CVA0, at the time (t ​= ​0) the reference limits were determined or validated. In the situation of diagnosis, the CV of the reference values is defined as: CVRef ​= ​(CVB2 ​+ ​CVA02)1/2. The actual (total) variation of test results in a reference population is based on biological variation and the actual analytical CV (CVA): CVTotal ​= ​(CVB2 ​+ ​CVA2)1/2. Using Gowans’ criteria, a maximum of 4.6% of test results outside a reference limit is considered acceptable. By including CVA0, the model focuses on any increase in CVA relative to CVA0: the rise in CVA, rather than its absolute value, determines compliance with the quality goals. This model also starts with a Gaussian distribution with CV ​= ​CVRef, to which any additional analytical variation and bias are then added. This contrasts with Gowans’ model, where a Gaussian distribution with CV ​= ​CVB is assumed, with reference limits at the point where 2.5% of the results are outside the boundaries.
Analytical variation (or analytical variation in combination with bias) is then added to the model with a limit of 4.6% of test results outside the reference limits; hence in this model there is a 4.4% misclassification. Thus, for diagnosis, the model gives a maximum allowable bias of B ​≤ ​1.96∗(CVA02 ​+ ​CVB2)1/2 ​− ​1.68∗(CVA2 ​+ ​CVB2)1/2, which reduces to B ​≤ ​0.28∗(CVA02 ​+ ​CVB2)1/2 where CVA ​= ​CVA0. The two z-values in the bias terms come from the 2.5% and 4.4% limits of the Gaussian distribution.
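One reading of the diagnosis model (a reconstruction from the description above; the published coefficients in [112] should take precedence) is that the current distribution, shifted by the bias, may leave at most ~4.4% beyond the original limit set at 1.96∗CVRef:

```python
import math

def max_bias_diagnosis(cv_b: float, cv_a0: float, cv_a: float) -> float:
    """Reconstruction of the Oosterhuis-Sandberg diagnosis model: the
    reference limit sits at 1.96*sqrt(CVB^2 + CVA0^2); the current,
    possibly wider, distribution shifted by B may leave at most ~4.4%
    beyond it (z ~ 1.68)."""
    cv_ref = math.sqrt(cv_b ** 2 + cv_a0 ** 2)  # dispersion when limits were set
    cv_act = math.sqrt(cv_b ** 2 + cv_a ** 2)   # current total dispersion
    return 1.96 * cv_ref - 1.68 * cv_act

# When CVA == CVA0 the allowable bias collapses to (1.96 - 1.68)*CVRef,
# i.e. 0.28*CVRef, close to the 0.275 factor quoted earlier in the review.
print(round(max_bias_diagnosis(5.0, 2.0, 2.0), 2))
```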

The variance model

The variance model uses squared components because this weighs outliers more heavily than data closer to the mean, and it prevents differences above the mean from canceling out those below. There are two main models for the combination of random and systematic error based on variances. The classical variance model suggested by Harris [85,113] is sT2 ​= ​s2 ​+ ​sL2, where s is the within-lab variation and the laboratory-bias term sL is considered a random variable (between-laboratory variance), thereby combining the two concepts in a variance model. This model has mainly been used in relation to the description of within- and between-individual variation, whether for biological or analytical variation. The measurement uncertainty model, introduced as an accreditation model (ISO 17025) and described by EURACHEM [114], uses the variance model above to encompass the concepts of GUM (Guide to the expression of Uncertainty in Measurement) [115] by combining known and assumed uncertainties (type A and type B uncertainties). After the correction of all biases, and with the uncertainties of these corrections introduced, the uncertainties – described as standard deviations or coefficients of variation – are combined as variances.
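The variance combination at the heart of the GUM approach can be sketched as follows (the component values are hypothetical):

```python
import math

def combined_standard_uncertainty(*components: float) -> float:
    """GUM-style combination: independent uncertainty components
    (type A or type B, expressed as standard deviations) add as variances."""
    return math.sqrt(sum(u ** 2 for u in components))

# e.g. within-lab imprecision 2.0, uncertainty of the bias correction 1.5,
# calibrator uncertainty 1.0 (all hypothetical, in the same unit):
print(round(combined_standard_uncertainty(2.0, 1.5, 1.0), 2))  # 2.69
```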

Problems with linear TE models

A fundamental problem in the estimation of analytical quality specifications is the combination of random and systematic errors. The nature of random errors is different from that of systematic errors: the former are described by standard deviations (or coefficients of variation), whereas the latter are differences (from defined target values). ‘It would be logical to keep these two incommensurable concepts separated, but there is a considerable desire and pressure for creating such models’ [2]. Bias is a problem with these models. The concept of the ‘true value’ has been abandoned in metrology, and if a true value cannot be known, TE cannot be estimated. We must use ‘surrogate true values’ to estimate TE, but the TE models do not account for the uncertainty of bias estimation/correction. This has been addressed by Frenkel et al. [116].

Quality control and external quality assurance

Sigma metrics

The sigma metric (σ ​= ​{TE – B}/SDA) has been used extensively in Industry and lately in laboratory medicine for comparing methods [[117], [118], [119], [120], [121], [122], [123]]. Capability, Cpa ​= ​TE/CVA (equivalently TE/SDA when TE is in absolute units), is an objective measure of an assay’s ability to meet pre-defined (user) requirements. Clearly, Cpa is the sigma metric when no bias is assumed. Note that the sigma metric is derived from the TE, so the sigma metric becomes, in a sense, a TE metric [124]. In Industry, the sigma metric is based on tolerance limits derived from customer expectations, not from a TE model. In the application of sigma to laboratory medicine, the concept has been further modified to include a bias term in the numerator, which can be problematic [[125], [126], [127], [128]]. The assumption of zero bias is valid in many routine clinical chemistry tests; it is also valid in an EQA program if the result is compared to the method or group performance. The sigma or Capability metric indicates the number of standard deviations of the process inside the analytical goals, so the higher the value, the better. Conventional QC relies on using manufactured control material and statistically derived control rules to identify assay failure caused by a change in reagents, instrument settings, or calibrators. The laboratory should select an appropriate QC algorithm to ensure good error detection of critical-sized shifts in performance. Where analytical precision is low relative to performance goals, good error detection is hard to achieve regardless of the QC algorithm used [129]. There are algorithms available for selecting the ‘best’ QC rules and frequency of QC samples based on sigma metrics [117,130]. Any QC strategy aims to have well-defined QC rules with at least a probability of error detection (Ped) of 0.9.
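A minimal sketch of the sigma-metric calculation (the TEa, bias, and CV figures below are hypothetical):

```python
def sigma_metric(tea_pct: float, bias_pct: float, cv_pct: float) -> float:
    """Sigma metric = (TEa - |bias|) / CV; with zero bias this equals the
    capability Cpa = TEa / CV."""
    return (tea_pct - abs(bias_pct)) / cv_pct

# Hypothetical assay: TEa 10%, bias 1%, CV 2% -> sigma of 4.5
print(sigma_metric(10.0, 1.0, 2.0))  # 4.5
print(sigma_metric(10.0, 0.0, 2.0))  # 5.0 (Cpa, zero-bias case)
```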
Westgard and Westgard [118] have constructed sigma quality control assessment tools giving QC rules and QC sample frequencies for assays with Cpa of 4, 5, and 6 with a Ped of 0.9. Less capable assays require more complex sets of QC rules and a higher frequency of QC samples. The following are examples of rules for different capabilities [130]:
Cpa ​< ​4: multi-rule or 1:2.5s with n ​= ​4 QC samples per run (2, twice per shift);
Cpa 4–6: multi-rule or 1:2.5s with n ​= ​2 QC samples once per shift;
Cpa ​> ​6: 1:3s rule with n ​= ​2 QC samples once per shift.
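The single-value rules above can be sketched as simple checks on QC results expressed as z-scores (a simplification: full Westgard multirules also use across-run and across-level counting rules):

```python
def qc_flags(z_scores):
    """Evaluate two simple Westgard-style single-value rules on QC results
    expressed as z-scores (deviations from target in SDs)."""
    return {
        "1:3s": any(abs(z) > 3.0 for z in z_scores),
        "1:2.5s": any(abs(z) > 2.5 for z in z_scores),
    }

flags = qc_flags([0.4, -1.2, 2.7])
print(flags)  # {'1:3s': False, '1:2.5s': True}
```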

Patient based quality control (PBRTQC) and quality assurance

We will now investigate how QC and EQA are evolving to monitor changes in patient population parameters directly, starting with PBRTQC techniques. As stated above, QC in clinical chemistry evolved using proxy materials to identify out-of-control situations. It is noteworthy that, as with the transfer of sigma metrics to laboratory medicine, this was not the usual situation in Industry, where the QC process would be to test the product itself to ensure it meets customer expectations. As analysers and methods became more reliable, laboratories tended to run fewer QC samples, sometimes as few as one per day (Rosenbaum et al., 2018). This increases the number of patient samples released before it becomes evident that a systematic error has occurred. With the demands on laboratories for fast turnaround and large patient volumes, how labs may deal with failure, and when they could or would repeat samples during the failure period, is concerning [131]. It is now more apparent that there are problems with conventional QC, including non-commutability of the sample, the non-Gaussian distribution of QC values, cost, and insensitivity of the QC rules, particularly for assays with low sigma (≤4) [[132], [133], [134], [135], [136]]. The model that we have used for many years in QC is still based on the premise that there are three types of error: systematic, random, and proportional, although the last is usually subsumed into the first [72]. However, it is now apparent that this was an oversimplification: errors can be hybrids of these types, or may not follow this model at all [137]. There are other problems with conventional quality control that go to the core of the approach, including its statistical basis [132]. PBRTQC methods have been used since the seventies in hematology analysers. Still, there has been little uptake in clinical chemistry because of a lack of computing capacity or of understanding of the patient populations being served.
As the capabilities of widely used middleware and analyser-resident software have advanced, patient-based QC techniques are being adopted. The timing of these programs’ availability is fortuitous, as the consolidation of laboratories and the large number of screening tests performed on community patients lead to higher testing volumes that presumably represent a more stable population [133]. PBRTQC methods use the population (or subpopulation, after removal/exclusion of outliers) reference interval as the control limits [138,139]. Other population-based parameters, such as the standard deviation and the percentage of abnormal results, can also be used to monitor assay performance [140]. A form of PBRTQC using an average of delta has been proposed, which uses the population delta-check average; but this requires a patient base with multiple retesting to work, and hence is not suitable for a community setting [141]. Given the relative infrequency of conventional QC events, an attractive feature of using patient results as part of a real-time QC strategy is the possibility of detecting out-of-control error conditions earlier than the next scheduled quality control specimen. While the amount of QC information in a single patient result may be relatively small, the amount of information continually grows with each additional patient sample tested. This is the interpretation of ‘real-time’: each result adds to the information. There are some assays where PBRTQC will never be appropriate, such as those with small volumes, volatile populations (pediatric), or a high number of abnormal results (some hormones, tumour markers) [138]. The other advantage of PBRTQC is that it is patient-centric: it shows changes in the clinical impact of erroneous patient results. Having a 3SD error in IQC does not tell us the clinical effect of the error until we retrospectively retest the samples in question or examine the data in some other way. PBRTQC allows us to see shifts in patient results.
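A minimal PBRTQC sketch, assuming a moving average with truncation limits and empirically set control limits (every numeric setting here is hypothetical, chosen only to illustrate the mechanics):

```python
from collections import deque
from statistics import mean

def pbrtqc_moving_average(results, window=20, lower=3.5, upper=5.2,
                          trunc_low=2.0, trunc_high=7.0):
    """Minimal PBRTQC sketch: exclude gross outliers, keep a sliding window
    of accepted patient results, and flag the indices at which the moving
    average drifts outside the control limits."""
    buf = deque(maxlen=window)
    alarms = []
    for i, x in enumerate(results):
        if trunc_low <= x <= trunc_high:      # truncation/exclusion of outliers
            buf.append(x)
        if len(buf) == window and not lower <= mean(buf) <= upper:
            alarms.append(i)                  # index where the error surfaces
    return alarms

stable = [4.3, 4.5, 4.1, 4.6] * 10            # in-control patient population
shifted = [x + 1.5 for x in stable]           # a simulated calibration shift
alarms = pbrtqc_moving_average(stable + shifted)
print(alarms[0])  # first flagged index, shortly after the shift at index 40
```

In practice the window length, truncation limits, and control limits are tuned per analyte against the laboratory's own patient mix, which is exactly why a stable, high-volume population matters.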
For any QC strategy to be functional, there must be clear procedures for the laboratory technologist to follow in rerunning samples and detecting the point in the analytical run where the error becomes apparent. There is now growing awareness that the future of both conventional QC and PBRTQC approaches will involve greater use of Artificial Intelligence to monitor the process and manage the analyser response to out-of-control situations, including troubleshooting, recalibration, and rerunning samples [138,142].

Patient-based external QA

Patient population-based parameters have also been used in external quality assurance. Thienpont et al. [[143], [144], [145]] developed two online tools, ‘Percentiler’ and ‘Flagger,’ based on moving medians and population percentiles. ‘Percentiler’ monitors daily outpatient means/medians to reflect the stability/variation of performance at the individual laboratory level and its peer group. ‘Flagger’ monitors the flagging of results against reference intervals or decision limits used in the individual laboratory, as well as at the peer-group level. It is complementary to ‘Percentiler’ in that it directly translates the effect of analytical quality/(in)stability on flagging as a surrogate for medical decision making. One of the EQA programs’ key roles is to monitor IVD test stability across laboratories and peers/manufacturers, providing post-market surveillance [146]. Many of the problems of EQA are the same as those of QC: non-commutability of the sample, inconsistency of acceptability criteria, and frequency of challenge [147]. Using a patient-based sample approach will solve some of these problems.

Conclusion

This Review has described the increasing importance of the concepts of biological variation to clinical chemists. The idea of comparison to a ‘reference’ is fundamental in measurement. For biological measurands, that reference is the relevant patient population, a clinical decision point based on a trial, or an individual patient’s previous results. The idea of using biological variation to set quality goals was then realised for setting QC and EQA limits. The current phase of BV integration into practice is using PBRTQC and PBQA to detect a change in assay performance. The progress from the idea of a ‘normal range’ and ‘average man’ to a biological passport providing personalised reference intervals demonstrates the greater awareness of individual biological variation. Diagnosis and classification based on clinical decision points will always be an essential purpose of laboratory results, but in the future, monitoring of individuals during the ‘well’ phase of their lives will be the norm. Throughout the study and application of biological variation and its use in defining acceptability, there has been a tension between empirical and model-based approaches: from Gaussian to nonparametric statistical approaches, from using very well defined ‘healthy’ populations to large patient populations, and from linear models of bias and precision to uncertainty (variance) models. We see QC and EQA schemes moving away from a model of elegant statistical process control towards using patient data to detect error, from model-based to big data. These changes occur because we have the volumes of patient data available, the computing capability to perform the necessary calculations, and a growing realisation that the models being used are too crude.
It may be that, just as the data-driven approach of using ANOVA on large patient data sets can identify intra-individual, inter-individual, and analytical variation for estimates of Biological Variation, the same data-based approach could be used for EQA and even QC. We need data to develop better models, but all models will eventually fail: new data analysed at a deeper level cannot be understood using the existing model, as has always been the way [148].

Funding sources

There was no funding of this work.

Declaration of interest

There were no financial or personal interests that could affect this review.

Permissions

There are no permissions required for the content of this review.

Author agreement

I was the sole author of this review.

Declaration of competing interest

There are no conflicts of Interest.